Project

CS 678

Advanced Natural Language Processing

Project Options for CS 678

1. SIGMORPHON UniMorph Shared Task on Typologically Diverse and Acquisition-Inspired Morphological Inflection Generation (Part 1)
Goal: design a model that learns to generate morphological inflections from a lemma and a set of morphosyntactic features of the target form.
URL: https://github.com/sigmorphon/2022InflectionST
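To make the input and output of option 1 concrete, here is a minimal sketch of one way to hold inflection examples and score a system by exact match. The class name, field names, toy examples, and the copy-the-lemma baseline are assumptions made for illustration; the actual data format and the official evaluation are defined in the shared task repository.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class InflectionExample:
    """One training or test item (field names are illustrative)."""
    lemma: str      # citation form, e.g. "walk"
    features: str   # morphosyntactic features as a UniMorph-style tag string
    form: str       # gold inflected form


def copy_lemma_baseline(lemma: str, features: str) -> str:
    """Trivial baseline: ignore the features and return the lemma unchanged."""
    return lemma


def exact_match_accuracy(
    predict: Callable[[str, str], str],
    examples: List[InflectionExample],
) -> float:
    """Fraction of examples whose predicted form equals the gold form."""
    if not examples:
        return 0.0
    correct = sum(predict(ex.lemma, ex.features) == ex.form for ex in examples)
    return correct / len(examples)


# Toy English examples, written for this sketch (not taken from the task data).
toy_data = [
    InflectionExample("walk", "V;PST", "walked"),
    InflectionExample("walk", "V;NFIN", "walk"),
]

print(exact_match_accuracy(copy_lemma_baseline, toy_data))  # 0.5
```

A learned model would replace `copy_lemma_baseline` with any function that has the same `(lemma, features) -> form` signature.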
2. SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Goal: build a system for morpheme segmentation, i.e. one that converts a text into a sequence of morphemes.
URL: https://github.com/sigmorphon/2022SegmentationST
3. SIGMORPHON 2022 Shared Task on Grapheme-to-Phoneme Conversion (Low-Resource and Cross-Lingual)
Goal: create computational models that map a sequence of "graphemes" (characters) representing a word to a transcription of that word's pronunciation. This task is an important part of speech technologies, including recognition and synthesis.
URL: https://github.com/sigmorphon/sigmorphon.github.io/blob/master/sharedtasks/2022/G2P.md
4. Identification of Languages and Dialects of Italy (ITDI)
The task is classification: the model must discriminate between different language varieties. The training data covers 11 languages and dialects of Italy (Piedmontese, Venetian, Sicilian, Neapolitan, Emilian-Romagnol, Tarantino, Sardinian, Ligurian, Friulian, Ladin, Lombard).
URL: https://sites.google.com/view/vardial-2022/shared-tasks
5. Dialectal Extractive Question Answering (DialQA)
The Dialectal Extractive Question Answering Shared Task invites participants to build QA systems that are robust to dialectal variation.
URL: webpage
6. MLM Performance Prediction
The task of performance prediction is to accurately predict the performance of a model on a set of target languages. Can we also do this at a more granular level, e.g. at the dataset or instance level?
URL: https://www.microsoft.com/en-us/research/event/sumeval-2022/shared-task/
7. Active Learning for Low-Resource Dependency Parsing
The task is to perform morphosyntactic analysis for low-resource languages, simulating a scenario with a linguist in the loop that provides trees and tags at a per-sentence level.
URL: https://universaldependencies.org/
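One common way to decide which sentences the linguist should annotate next in option 7 is uncertainty sampling. The sketch below ranks unlabeled sentences by mean per-token entropy of the model's predicted label distributions. This selection strategy, the function names, and the toy probabilities are all assumptions chosen for illustration; the project does not prescribe a particular strategy.

```python
import math
from typing import Dict, List


def token_entropy(dist: List[float]) -> float:
    """Shannon entropy (in nats) of one predicted label distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def sentence_uncertainty(token_dists: List[List[float]]) -> float:
    """Mean per-token entropy; an empty sentence scores 0."""
    if not token_dists:
        return 0.0
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def select_for_annotation(
    pool: Dict[str, List[List[float]]], budget: int
) -> List[str]:
    """Return the ids of the `budget` most uncertain sentences in the pool."""
    ranked = sorted(
        pool, key=lambda sid: sentence_uncertainty(pool[sid]), reverse=True
    )
    return ranked[:budget]


# Hypothetical model outputs: sentence id -> one distribution per token.
pool = {
    "s1": [[0.9, 0.1], [0.8, 0.2]],
    "s2": [[0.5, 0.5], [0.6, 0.4]],
    "s3": [[1.0, 0.0]],
}

print(select_for_annotation(pool, budget=2))  # ['s2', 's1']
```

In a full loop, the selected sentences would be annotated, added to the training set, and the model retrained before the next round of selection.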
8. Multilingual Idiomaticity Detection through Parallel Data
By and large, the use of compositionality of word representations has been successful in capturing the meaning of sentences. However, there is an important set of phrases, those which are idiomatic, that are inherently not compositional. The goal is to detect and represent multiword expressions (MWEs) that are potentially idiomatic phrases.
URL: https://sites.google.com/view/semeval2022task2-idiomaticity
9. Prediction of Cognate Reflexes
In historical-comparative linguistics, scholars typically assemble words from related languages into cognate sets. In contrast to the notion of cognates in didactics and synchronic NLP applications, cognate words (the members of a cognate set) are commonly assumed to share a common origin regardless of their meaning. In addition, cognate sets should not contain borrowed words. Cognate words typically show so-called regular sound correspondences.
URL: https://sigtyp.github.io/st2022.html
10. Multilingual Clause-level Morphology
This task aims to investigate methods for morphological analysis or generation of different forms in languages with varying typology, where the modeling and alignment of morphosyntactic structure is accomplished at the level of clauses. The input is a verbal lemma (the form given as a lexicon entry) and a specific set of inflectional features. The task requires generating the desired output clause manifesting the features.
URL: https://sigtyp.github.io/st2022-mrl.html
11. Text-based NP Enrichment
Understanding the relations between entities denoted by NPs in a text is a critical part of human-like natural language understanding. The goal is to enrich each NP in a text with all the preposition-mediated relations — either explicit or implicit — that hold between it and other NPs in the text. The relations are represented as triplets, each denoted by two NPs related via a preposition.
URL: https://yanaiela.github.io/TNE/
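The triplet representation described in option 11 can be pictured with a small data structure. This is a minimal sketch: the class name, the field names, the `explicit` flag, and the two-sentence example text were written for illustration and are not taken from the TNE dataset or its file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NPRelation:
    """One preposition-mediated relation between two NPs (names are illustrative)."""
    anchor: str        # the NP being enriched
    preposition: str   # the preposition linking the two NPs
    complement: str    # the NP that the anchor is related to
    explicit: bool     # True if the relation is stated in the text itself

    def as_phrase(self) -> str:
        return f"{self.anchor} {self.preposition} {self.complement}"


# Illustrative text: "The roof of the house was damaged. The windows were intact."
relations: List[NPRelation] = [
    NPRelation("the roof", "of", "the house", explicit=True),
    # Never stated outright, but a reader infers it from the context.
    NPRelation("the windows", "of", "the house", explicit=False),
]


def implicit_relations(rels: List[NPRelation]) -> List[NPRelation]:
    """Keep only the relations a reader has to infer."""
    return [r for r in rels if not r.explicit]


for rel in implicit_relations(relations):
    print(rel.as_phrase())  # the windows of the house
```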
12. Study the influence of tokenization in cross-lingual transfer
Run models like CANINE, ByT5, PIXEL, and Perceiver IO. Study cross-lingual transfer (train on one language, test on another) on dependency parsing, POS tagging, and NER.
URL: none (PoC: Fahim Faisal)
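One reason tokenization may matter for transfer in option 12 is that the same word can turn into very different sequence lengths depending on the input unit. The sketch below compares UTF-8 bytes (the unit ByT5 reads) with Unicode code points (the unit CANINE reads). It is only an illustration: the function names are made up here, and the real tokenizers additionally handle special tokens and id offsets, which are omitted.

```python
from typing import List


def utf8_byte_ids(text: str) -> List[int]:
    """Byte-level view: one id (0-255) per UTF-8 byte."""
    return list(text.encode("utf-8"))


def code_point_ids(text: str) -> List[int]:
    """Character-level view: one id per Unicode code point."""
    return [ord(ch) for ch in text]


# The same concept ("cat") in Latin, Cyrillic, and Han script.
for word in ["cat", "кот", "猫"]:
    print(word, len(utf8_byte_ids(word)), len(code_point_ids(word)))
# cat 3 3
# кот 6 3
# 猫 3 1
```

The Cyrillic word takes twice as many byte-level positions as its Latin counterpart, which is the kind of script-dependent difference this project could measure against downstream transfer performance.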
13. SIGTYP 2020 Shared Task: Prediction of Typological Features
Typological knowledge bases (KBs) such as WALS (Dryer and Haspelmath, 2013) contain information about linguistic properties of the world’s languages. They have been shown to be useful for downstream applications, including cross-lingual transfer learning and linguistic probing. A major drawback hampering broader adoption of typological KBs is that they are sparsely populated, in the sense that most languages only have annotations for some features, and skewed, in that few features have wide coverage. As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs.
URL: paper, data
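The idea in option 13 that correlated features can be used to fill gaps can be sketched with a very simple predictor: for each value of an observed feature, predict the most frequent co-occurring value of the missing feature. The language ids, feature names, and rows below are hypothetical toy data, not entries from WALS, and this conditional-majority rule is an assumed baseline rather than the shared task's method.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional

# Hypothetical toy knowledge base: language id -> feature -> value (None = missing).
KB = Dict[str, Dict[str, Optional[str]]]

toy_kb: KB = {
    "lang_a": {"word_order": "SOV", "adposition": "postpositions"},
    "lang_b": {"word_order": "SOV", "adposition": "postpositions"},
    "lang_c": {"word_order": "SVO", "adposition": "prepositions"},
    "lang_d": {"word_order": "SOV", "adposition": None},
}


def fit_conditional_majority(kb: KB, source: str, target: str) -> Dict[str, str]:
    """For each source value, find the most frequent co-occurring target value."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for feats in kb.values():
        s, t = feats.get(source), feats.get(target)
        if s is not None and t is not None:
            counts[s][t] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def fill_missing(kb: KB, source: str, target: str) -> Dict[str, str]:
    """Predict the target feature for languages where it is missing."""
    table = fit_conditional_majority(kb, source, target)
    return {
        lang: table[feats[source]]
        for lang, feats in kb.items()
        if feats.get(target) is None and feats.get(source) in table
    }


print(fill_missing(toy_kb, "word_order", "adposition"))
# {'lang_d': 'postpositions'}
```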

Last updated on Nov 1, 2022