CS 678 Course Project

Advanced Natural Language Processing


Students will form groups of 2-3 people for the course project. There will be two submissions in total: (1) the project proposal and baseline, and (2) the final project report & presentation. Requirements are stated on this page. We will also provide a LaTeX template for the proposal and report writing (which can also be downloaded from Blackboard).

Acknowledgement: Instructions and the template are adapted from UMass COMPSCI692A by Andrew McCallum, CMU CS711 by Graham Neubig, and from the ML Reproducibility Challenge 2022.

Checkpoint 1: Project Proposal and Baseline

Due: 3/31

In this project, you will choose a paper published at either ACL 2022 or NAACL 2022 and attempt to reproduce its main result. The objective is to assess whether the experiments are reproducible and to determine whether the conclusions of the paper are supported by your findings. Your results can be either positive (i.e. confirm reproducibility) or negative (i.e. explain what you were unable to reproduce, and potentially explain why).

Essentially, think of your role as an inspector verifying the validity of the experimental results and conclusions of the paper.

We suggest that you first attempt to reimplement the experiments of the paper from scratch. Using any published code is allowed, as long as this is made clear in your report. In the ACL Anthology, some papers are marked as having accompanying software (with a small green icon in the paper's metadata), but many papers simply provide a GitHub link on the first (or last) page. We suggest you pick a paper that you find interesting, e.g., because of the task or language(s) involved; do not just pick a paper because it has code available! Note that not all papers are eligible for our project, because not all papers introduce a new NLP method/task/implementation: some introduce datasets, some are opinion/position pieces, etc.

Scope We recommend you focus on the central claim of the paper. For example, if a paper introduces a new reinforcement learning algorithm that performs better in sparse-reward environments, verify that you can re-implement the algorithm, run it on the same benchmarks, and get results that are close to those in the original paper (exact reproducibility is in most cases very difficult due to minor implementation details). You do not need to reproduce all experiments in your selected paper, only those that you feel are sufficient to verify the validity of the central claim.

Just re-running code is not a reproducibility study; you need to approach any code with critical thinking, verifying that it does what is described in the paper and that the experiments are sufficient to support the conclusions of the paper. Consider designing and running unit tests on the code to verify that it works as described (see the sketch below). Alternatively, the methods presented can be fully re-implemented according to the description in the paper. This sets a higher bar for reproducibility and can take much more time, but it may help in detecting anomalies in the code, or in shedding light on aspects of the implementation that affect results.
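
For instance, a couple of sanity-check unit tests might look like the following. This is a minimal sketch: `my_reproduction.model`, `preprocess`, and `predict` are hypothetical placeholders for the corresponding module and functions in the repository you are studying.

```python
# Minimal pytest-style sanity checks for a released codebase.
# `my_reproduction.model`, `preprocess`, and `predict` are hypothetical
# placeholders for the actual module and functions in the paper's repo.
from my_reproduction.model import preprocess, predict

def test_preprocess_matches_paper_description():
    # Suppose the paper states that inputs are lowercased and stripped
    # of surrounding whitespace; verify the released code agrees.
    assert preprocess("  Hello World ") == "hello world"

def test_predict_returns_valid_distribution():
    # Class probabilities should be non-negative and sum to one.
    probs = predict("a perfectly ordinary sentence")
    assert all(p >= 0 for p in probs)
    assert abs(sum(probs) - 1.0) < 1e-6
```

Run these with `pytest` from the repository root; failures of tests like these are exactly the kind of anomaly worth documenting in your report.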

Some questions to consider:

  • Were you able to reproduce the exact experiments described in the paper?
  • If not, why not?
  • How sensitive is the implementation/model to hyperparameter choices?
  • What about running with different random seeds? Or with different data splits? (See the seed-setting sketch after this list.)
  • What about hardware limitations or other issues?
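
As a concrete starting point for the seed question above, a common pattern is to fix every source of randomness and repeat training across several seeds, reporting the mean and standard deviation rather than a single number. A minimal sketch, assuming a PyTorch-based implementation; `train_and_evaluate` is a hypothetical stand-in for your training entry point:

```python
# Repeat training over several seeds and report mean +/- std.
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    # Fix all common sources of randomness (Python, NumPy, PyTorch).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

scores = []
for seed in (13, 42, 1234):
    set_seed(seed)
    scores.append(train_and_evaluate())  # hypothetical training entry point
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```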

We suggest you take a look at the ML Reproducibility Challenge website, which provides guidelines and resources for good reproducibility research. The instructors will encourage/help the teams with the best reports to submit their reproduction as a paper to the next iteration of the MLRC in 2023.

Report Specifics Generally, a report should include any information future researchers or practitioners would find useful for reproducing or building upon the chosen paper. The results of any experiments should be included; a “negative result” which doesn’t support the main claims of the original paper is still valuable.
Your project proposal and reimplementation report should be around 4 pages.
Make sure to include the following in your project proposal:

  • Title for your project (this can also change later on). Something descriptive but also memorable.
  • Introduction section, which should explain the context of the paper. It should contain the following subsections:
    • Task / Research Question Description: What is the task the paper is trying to solve or what is the research question they are trying to answer?
    • Motivation & Limitations of existing work: Have others tried to solve the same task or answer a similar research question? What are they trying to do differently and why? What were the limitations or shortcomings of prior work?
    • Proposed Approach: Briefly describe the core contribution of the paper's proposed approach.
    • Likely challenges and mitigations: What is hard about this task / research question? What are your contingency plans if the reproduction turns out to be harder than expected or experiments do not go as planned?
  • Related Work Section: Include 3-4 sentence descriptions of at least 4 papers directly relevant to the proposed research. Also mention how the paper you are working on differs from these.
  • Experiments section, which must contain the following subsections:
    • Datasets - Please list which datasets you are using, whether or not you have access to them, and whether or not they are publicly available with the same preprocessing and train/dev/test splits as the work you will be reproducing. (A split-verification sketch follows this list.)
    • Implementation - Please provide a link to a repo of your reimplementation (if applicable) and appropriately cite any resources you have used.
    • Results - Provide a table comparing your results to the published results.
    • Discussion - Discuss any issues you faced. Do your results differ from the published ones? If yes, why do you think that is? Did you do a sensitivity analysis (e.g. multiple runs with different random seeds)?
    • Resources - Discuss the cost of your reproduction in terms of resources: computation, time, people, development effort, communication with the authors (if applicable).
    • Error Analysis - Perform an error analysis on the model. Include at least 2-3 instances where the model fails. Discuss the error analysis in the paper -- what other analyses could the authors have run? If you were able to perform additional error analyses, report them here.
  • Conclusion - Is the paper reproducible?
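
For the Datasets subsection, one quick way to confirm you are working with the same splits as the original paper is to load the public version and compare split sizes against those reported. A sketch, assuming the dataset is on the Hugging Face Hub; "glue"/"sst2" is only an illustrative example, substitute your paper's dataset:

```python
# Verify that public splits match the sizes reported in the paper.
# "glue"/"sst2" is illustrative; substitute the dataset you are reproducing.
from datasets import load_dataset

dataset = load_dataset("glue", "sst2")
for split_name, split in dataset.items():
    print(split_name, len(split))  # compare with the counts in the paper
```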

Submission: A single PDF should be turned in through Blackboard by the due date (only one person from each team needs to do this). The PDF should follow the proposed template and include a link to a GitHub repository with your reproduction code.

Grading: This component is worth 15% of your overall grade. The grading rubric for the report and reproduction component is as follows.

Each component is graded on a 10-point scale; the weight of each component is shown in parentheses.

Project (2)
  • <=5 (fail): Poor choice of paper. No effort at reproduction.
  • 7 (satisfactory): Adequate use of research and design methodologies. OK literature review. Minimum requirements satisfied.
  • 9 (good): Uses the correct research and design methodologies. Good literature review. At least one original contribution to reproduce the work and/or go beyond the original results or error analysis of the paper.
  • 10 (excellent): Excellent demonstration of research and design methodologies. Very good discussion of the literature. The project team goes above and beyond in analysis ideas.

Code (5)
  • <=5 (fail): Insufficient effort. Simple copy of existing code.
  • 7 (satisfactory): The TAs could use the code to reproduce the results.
  • 9 (good): The TAs can use the provided repository to easily reproduce the results.
  • 10 (excellent): Anyone could easily reproduce everything. Code is well-annotated, with README files and bash scripts to aid reproduction, including comments on best practices and findings.

Report (8)
  • <=5 (fail): No coherence, content missing, or serious grammar/spelling errors.
  • 7 (satisfactory): Report fulfils the minimum requirements. Structure is acceptable. Some arguments and discussion.
  • 9 (good): Good report in terms of content, well structured and formulated. Meaningful discussion and argumentation.
  • 10 (excellent): Excellent report in terms of content, well structured, with effective use of tables and figures.

Checkpoint 2: Final Project Report

Due: 5/9

Your final project should culminate in a novel research contribution, building on top of your baseline implementation. In this checkpoint we will focus primarily on robustness and multilinguality.

Your final project should explore these two dimensions:

  • Robustness In the previous checkpoint you (hopefully) performed a sensitivity analysis with regard to hyperparameters and other modeling choices. Now, we turn our attention to sensitivity/robustness to data perturbations. Imagine you deployed your model and it suddenly had to deal with real-world data: data with noise, spelling errors, typos, grammar mistakes, ambiguity, etc.
    Study the (award-winning) "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" paper and its accompanying code and Python package. How does the model fare with regard to this checklist? What type of robustness exploration can/should you perform on your model? (A minimal CheckList sketch follows this list.)
  • Multilinguality Most likely, the model/paper you reproduced has only been tested on one language (or just a handful of them). What if we attempted to perform the same task on other languages or domains? Make the appropriate modifications to the code (e.g. if you have an English-only BERT-based model, substitute the English BERT with, say, CamemBERT or FlauBERT to perform experiments in French, or substitute the multilingual mBERT or XLM-R models to perform experiments in many languages; see the second sketch after this list). You can try to find datasets in other languages (beyond those the paper explores) or you can simply translate the existing datasets with the methods you used in Homework 2 (or both!). You may also explore the usage of Multilingual CheckList to perform behavioral testing on all languages you work with.
  • The above points describe the minimum requirements -- we highly encourage you to get creative and take inspiration from your life experiences!
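
To make the robustness point above concrete, here is a minimal sketch of a CheckList invariance test, following the style of the package's tutorials (double-check the exact API against the CheckList repository); `predict_proba` is a hypothetical stand-in for your model's prediction function:

```python
# Invariance test: small typos should not flip the model's predictions.
# `predict_proba` is a hypothetical stand-in for your model.
from checklist.perturb import Perturb
from checklist.pred_wrapper import PredictorWrapper
from checklist.test_types import INV

sentences = ["The movie was great.", "Service was slow but friendly."]

# Build typo-perturbed variants of each sentence.
ret = Perturb.perturb(sentences, Perturb.add_typos)
test = INV(**ret)

# Wrap the model and check how often predictions change under typos.
test.run(PredictorWrapper.wrap_softmax(predict_proba))
test.summary()  # prints example failures and the overall failure rate
```

For the multilinguality point, swapping an English-only encoder for a multilingual one is often a small change with the Hugging Face transformers library. A sketch (the checkpoint names are real; the two-label task setup is illustrative):

```python
# Swap an English-only encoder for a multilingual one; the rest of the
# fine-tuning pipeline can usually stay unchanged.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "bert-base-multilingual-cased"  # or "xlm-roberta-base", "camembert-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
```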

Some teams (especially those led by PhD students) might wish to explore an even more research-oriented direction for their final project. If you have a concrete idea about improving the state of the art on an NLP task, or about proposing a new interesting NLP task, this is allowed and even encouraged. But you should make sure to discuss this with the instructors in advance, ideally before your Project Checkpoint 1 submission.

Report Specifics The final project report should be a minimum of 8 pages (excluding references). It's OK to reuse parts of your initial baseline reproduction report. Make sure to include the following content:

  • Introduction. This should be about 1 page (figures not counted) covering the following:
    • What is the problem you are trying to solve / task you are studying?
    • Why is the problem you are solving / studying important?
    • What is motivating your proposed study?
    • Give a brief description of your proposed approach
    • Give a brief summary of the results you achieved experimentally -- basically, what takeaway messages would your work deliver?
  • Approach. This should be around 1-2 pages (figures not counted) covering the following:
    • Describe any sufficient background on important or non-standard concepts
    • Describe your approach in detail and articulate the motivations/intuitions
    • Typically you would have a figure illustrating your proposed approach, and optionally another illustrating the dataset/task setting, or have the two combined.
  • Experiments. This should be around 2-5 pages covering the following:
    • Describe the datasets you are using in detail
    • Describe the baseline methods you are comparing to
    • Describe the metrics on which you are evaluating
    • Present the results you have in tables and/or figures
    • Describe why you believe the results are what they are.
    • Typically you would have a table/figure demonstrating the output from your proposed approach and contrasting it with the outputs from baselines.
  • Related work. This should be around 0.5-1 page.
  • Conclusions & Future work. This should be around 0.5 page, including
    • A short summary of your work and the results
    • Concrete next steps for your project that you did not have time to do in the semester.

Submission: A single PDF should be turned in through Blackboard by the due date (only one person from each team needs to submit this). Make sure the PDF includes a link to a GitHub repository with your code that would allow us to reproduce your experiments. The PDF should follow the proposed template.

Grading: This component is worth 30% of your overall grade. The grading rubric for the final report component is as follows.

Each component is graded on a 10-point scale; the weight of each component is shown in parentheses.

General Comments (5)
  • <=5 (fail): Poor choice of paper/task. Flawed research methodologies. No effort at reproduction or analyses along any interesting dimensions.
  • 7 (satisfactory): Adequate use of research and design methodologies. OK literature review. Minimum requirements satisfied.
  • 9 (good): Uses the correct research and design methodologies. Good literature review. At least one original contribution along each of the robustness/multilinguality dimensions beyond the original results or error analysis of the paper.
  • 10 (excellent): Excellent demonstration of research and design methodologies. Very good discussion of the literature. The project team goes above and beyond in analysis ideas.

Robustness (5)
  • <=5 (fail): No effort along the robustness dimension.
  • 7 (satisfactory): Fulfils the minimum requirements. Uses at least one dimension of the NLP Checklist or performs similar analysis.
  • 9 (good): Good analysis on the robustness dimension. For example, it studies multiple robustness dimensions from the NLP Checklist (or similar).
  • 10 (excellent): Excellent analysis; goes above and beyond the NLP Checklist, e.g. explores new robustness dimensions.

Multilinguality (5)
  • <=5 (fail): No effort along the multilinguality dimension.
  • 7 (satisfactory): Fulfils the minimum requirements. Evaluates the model in at least one language beyond the original paper.
  • 9 (good): Good analysis on the multilingual dimension. Studies multiple languages/domains and performs good error analysis.
  • 10 (excellent): Excellent analysis; goes above and beyond to study multiple languages and perform error analysis (e.g. by introducing new multilingual Checklist templates).

Code (5)
  • <=5 (fail): Insufficient effort. Simple copy of existing code.
  • 7 (satisfactory): The TAs could use the code to reproduce the results.
  • 9 (good): The TAs can use the provided repository to easily reproduce the results.
  • 10 (excellent): Anyone could easily reproduce everything. Code is well-annotated, with README files and bash scripts to aid reproduction, including comments on best practices and findings.

Report (10)
  • <=5 (fail): No coherence, content missing, or serious grammar/spelling errors.
  • 7 (satisfactory): Report fulfils the minimum requirements. Structure is acceptable. Some arguments and discussion.
  • 9 (good): Good report in terms of content, well structured and formulated. Meaningful discussion and argumentation.
  • 10 (excellent): Excellent report in terms of content, well structured, with effective use of tables and figures.