State-of-the-art machine translation (MT) systems are typically trained to generate "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from …
Question answering (QA) in English has been widely explored, but multilingual datasets are relatively new, with several methods attempting to bridge the gap between high- and low-resourced languages using data augmentation through translation and …
Models pre-trained on multiple languages have shown significant promise for improving speech recognition, particularly for low-resource languages. In this work, we focus on phoneme recognition using Allosaurus, a method for multilingual recognition …
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-theart performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not …
Active learning (AL) uses a data selection algorithm to select useful training samples to minimize annotation cost. This is now an essential tool for building low-resource syntactic analyzers such as part-of-speech (POS) taggers. Existing AL …
Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by …
As machine translation (MT) systems progress at a rapid pace, questions of their adequacy linger. In this study we focus on negation, a universal, core property of human language that significantly affects the semantics of an utterance. We …
Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as *Punta Cana is located in _blank_.* However, while knowledge is both written and queried in many …
An exciting research direction that we pursue at GMU NLP is building multi-lingual and polyglot systems. The languages of the world often share similar characteristics, and training systems cross-lingually allows us to leverage these similarities and overcome data scarcity issues.