multilingual NLP | George Mason NLP

An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and …

LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Knowing the language of an input text/audio is a necessary first step for using almost every NLP tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, …

To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having diverse …

Towards a Universal Python: Translating the Natural Modality of Python into Other Human Languages

The Python programming language plays a large role in computer science today, both in industry and education. While the pseudo-code nature of its keywords and built-in functions/modules makes programming easy to learn for English speakers, …

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, …

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case of the Southern varieties of the Kurdish and Laki languages for which …

PALI: A Language Identification Benchmark for Perso-Arabic Scripts

The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource …

GMNLP at SemEval-2023 Task 12: Sentiment Analysis with Phylogeny-Based Adapters

This report describes GMU’s sentiment analysis system for the SemEval-2023 shared task AfriSenti-SemEval. We participated in all three sub-tasks: Monolingual, Multilingual, and Zero-Shot. Our approach uses models initialized with AfroXLMR-large, a …

Geographic and Geopolitical Biases of Language Models

Pretrained language models (PLMs) often fail to fairly represent target users from certain world regions because of the under-representation of those regions in training datasets. With recent PLMs trained on enormous data sources, quantifying their …