George Mason NLP

George Mason University

George Mason Natural Language Processing Group

Natural language processing (NLP) aims to enable computers to use human languages – so that people can, for example, interact with computers naturally; or communicate with people who don’t speak a common language; or manipulate speech or text data at scales not otherwise possible. The NLP group at George Mason Computer Science is interested in all aspects of NLP, with a focus on building tools for under-served languages, and constructing natural language interfaces that can reliably assist humans in knowledge acquisition and task completion.

We are currently working on multilingual models, on building Machine Translation robust to L2-language variations, on NLP for documentation of endangered languages, on exploring the interplay between language and code, on constructing interactive natural language interfaces, and on improving the efficiency of NLP models.

We have open PhD positions and we are looking for students to start in Fall 2024. Do reach out if you have a passion for language and NLP!
Application Link

News

  • June 2024 - GMNLP will be at NAACL with 5 papers (NAACL main, Findings, and various workshops).
  • May 2024 - GMNLP will be at LREC-COLING with 1 paper on NLP for Kurdish Varieties.
  • March 2024 - GMNLP will be at EACL with 1 paper on Dialectal Translation.
  • August 2023 - GMNLP will be at EMNLP with 5 papers (EMNLP main, Findings, and various workshops).
  • August 2023 - Congratulations to Prabin Bhandari, whose paper was accepted at INLG!
  • July 2023 - Join us at ACL, where GMNLP will present 5 papers (ACL main, IWSLT; Twitter thread summary). Congratulations to Daking Rai, Hao Yan, Saurabh Srivastava, Jonathan Mbuya, and collaborator Claytone Sikasote! Papers, code, and data are online!
  • George Mason organized MASC-SLL 2023, an annual NLP event for students and researchers in the Mid-Atlantic area, on April 22!
  • January 2023 - Congratulations to Roy Xie, whose paper was accepted at EACL (Findings)! GMNLP also had another paper at the EACL main conference. Preprints are online.
  • November 2022 - GMNLP has 1 paper accepted at AACL (main) and 2 papers to be presented at WMT. Preprints are online.
  • February 2022 - GMNLP has 3 papers accepted at ACL (2 main, 1 Findings)! Preprints are online!

Projects

Our research is/has been supported by the following organizations/companies:

NSF, NEH, Google, Amazon, VRIF, Meta AI, CCI, ARLIS

Dataset Maps - MasakhaNER (all)

amharic

Relevant Statistics
  • Percentage in-country: 25.07%
  • Missing countries: 171 of 243 (70.37%)
  • Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - MLQA (English)

Relevant Statistics
  • Percentage in-country: 53.63%
  • Missing countries: 80 of 243 (32.92%)
  • Total Variation Distance between observed and population-proportional distribution: 913.093

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - Natural Questions (English)

Relevant Statistics
  • Percentage in-country: 80.07%
  • Missing countries: 49 of 243 (20.16%)
  • Total Variation Distance between observed and population-proportional distribution: 11907.219

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - SQuAD (English)

Relevant Statistics
  • Percentage in-country: 62.73%
  • Missing countries: 93 of 243 (38.27%)
  • Total Variation Distance between observed and population-proportional distribution: 5004.444

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - TyDi-QA (all)

arabic

Relevant Statistics
  • Percentage in-country: 20.39%
  • Missing countries: 69 of 243 (28.40%)
  • Total Variation Distance between observed and population-proportional distribution: 745.564

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - WikiANN (all)

afrikaans

Relevant Statistics
  • Percentage in-country: 34.71%
  • Missing countries: 85 of 243 (34.98%)
  • Total Variation Distance between observed and population-proportional distribution: 34.966

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - X-FACTR (12 languages)

greek

Relevant Statistics
  • Percentage in-country: 2.94%
  • Total Variation Distance between observed and population-proportional distribution: 1.623

We also trained a linear model to find socioeconomic correlates of the datasets.
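
For readers unfamiliar with the Total Variation Distance reported in the dataset maps above, here is a minimal Python sketch of the textbook definition: half the sum of absolute differences between two probability distributions over countries. The function name, the dictionary inputs, and the toy numbers are illustrative assumptions, not the code behind the maps. The textbook quantity always lies between 0 and 1, whereas several figures above are far larger, so the maps may use a different scaling.

```python
def total_variation_distance(observed_counts, population):
    """Textbook TVD between two distributions over countries.

    observed_counts: hypothetical dict, country -> number of dataset entities
    population:      hypothetical dict, country -> population size
    Both are normalized to probabilities before comparison.
    """
    obs_total = sum(observed_counts.values())
    pop_total = sum(population.values())
    if obs_total <= 0 or pop_total <= 0:
        raise ValueError("both inputs need a positive total")

    # Countries missing from one side count as probability 0 on that side.
    countries = sorted(set(observed_counts) | set(population))
    return 0.5 * sum(
        abs(observed_counts.get(c, 0) / obs_total - population.get(c, 0) / pop_total)
        for c in countries
    )


# Toy example: a dataset concentrated in country A, with country C absent.
observed = {"A": 8, "B": 2}
population = {"A": 50, "B": 30, "C": 20}
print(total_variation_distance(observed, population))  # roughly 0.3
```

A value of 0 means the dataset's geographic distribution matches the population-proportional one exactly; a value of 1 means the two do not overlap at all.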

Efficient NLP/AI

We study building NLP/AI models with limited supervision, especially for low-resource domains (e.g., healthcare).

Human-AI Interaction

We explore how machine learning systems can interact with humans effectively. This includes being able to converse with humans through dialogues, as well as proactively collaborate with and learn from humans during decision making.

Information Intelligence

We explore computational approaches for information intelligence tasks such as Question Answering, Information Extraction, etc.

Language and Code

We seek to build natural language interfaces that allow humans to communicate with computers/machines easily. This requires modeling natural language, programming language, and their interplay. Applications of this research include semantic parsing and general-purpose code generation.

Language Models

We conduct research on language models, including prompt engineering, LLM reasoning, LLM interpretability, and applications of LLMs to other disciplines.

OCR

This NEH-funded project focuses on the development of modern Optical Character Recognition (OCR) and post-correction tools tailored for Indigenous Latin American Languages.

Fairness

Advances in natural language processing (NLP) technology now make it possible to perform many tasks through natural language or over natural language data – automatic systems can answer questions, perform web search, or command our computers to perform specific tasks.

Speech

Most languages of the world are “oral”: they are not traditionally written and even if an alphabet exists, the community doesn’t usually use it. Hence, building NLP systems that can directly operate on speech input is paramount.

Morphology

Human language is marked by considerable diversity around the world, and the surface form of languages varies substantially. Morphology describes the way through which different word forms arise from lexemes. Computational morphology attempts to reproduce this process across languages, or uses machine learning models to model/discover the morphophonological processes that exist in a language.

Robustness

NLP systems are typically trained and evaluated in “clean” settings, over data without significant noise. However, systems deployed in the real world need to deal with vast amounts of noise. At GMU NLP we work towards making NLP systems more robust to several types of noise (adversarial or naturally occurring).

Language Documentation

Language Documentation aims at producing a permanent record that describes a language as used by its language community by producing a formal grammatical description along with a lexicon. Our group works on integrating NLP systems into the documentation workflow, aiming to speed up the process and help the work of field linguists and language communities.

Machine Translation

Machine Translation is the task of translating between human languages using computers. Starting from simple word-for-word rule-based systems in the 1950s, we now have large multilingual neural models that can learn to translate between dozens of languages.

Multilingual NLP

An exciting research direction that we pursue at GMU NLP is building multi-lingual and polyglot systems. The languages of the world often share similar characteristics, and training systems cross-lingually allows us to leverage these similarities and overcome data scarcity issues.

Members

Faculty

Antonios Anastasopoulos

Assistant Professor

Computational Linguistics, Machine Translation, Speech Recognition, NLP for Endangered Languages

Ziyu Yao

Assistant Professor

Question Answering, Semantic Parsing, Human-AI Interaction, Language and Code, Efficient NLP/AI

Students

Fahim Faisal

PhD Student

Computational linguistics, Natural language processing, Machine learning

Md Mahfuz Ibn Alam

PhD Student

Natural Language Processing, Machine Learning, Computer Vision, Common Sense Reasoning

Murong Yue

PhD Student

Natural Language Processing, Deep Learning

Saurabh Srivastava

PhD Student

Natural Language Processing, Machine Learning, Deep Learning

Anjishnu Mukherjee

PhD Student

Large Language Models, Fairness and Biases

Milind Agarwal

PhD Student

Natural language processing, Machine learning, Endangered Languages

Prabin Bhandari

PhD Student

Natural language processing, Machine learning, Endangered Languages

Joshua Otten

PhD Student

Natural language processing, Ancient Languages, Artificial Intelligence, Linguistics, Embedded Systems

Daking Rai

PhD Student

Natural Language Processing, Machine Learning, Deep Learning

Hao Yan

PhD Student

Natural Language Processing, Machine Learning, Deep Learning

Jonathan Mbuya Kabala

PhD Student

Natural language processing

Syeda Sabrina Akter

CS PhD Student

Multilingual NLP, Speech Translation, NLP for Low-Resource and Endangered Languages

Nathaniel (Nate) Krasner

PhD Student

Natural language processing

Kourosh T Baghaei

PhD Student

Natural Language Processing, Machine Learning, Reinforcement Learning, Game Development

Mohamed Aghzal

PhD Student

Natural Language Processing

Sam Blouir

CS PhD Student

NLP, ML

Wijdane Mifdal

Undergraduate

Natural Language Processing

Alumni

Ruoyu (Roy) Xie

Undergraduate → PhD@Duke CS

Natural Language Processing, Machine Learning, Computer Vision

Sina Ahmadi

Postdoctoral Researcher → Postdoc@U.Zurich

Computational Linguistics, NLP for Endangered Languages, Language Resources

Yinkai Wang

Undergraduate → PhD@Tufts

Natural Language Processing

Sharlina Keshava

CS MSc → Amazon

Natural Language Processing, Fairness in AI, Multilingual NLP, Machine Learning, Deep Learning

Collaborators

Claytone Sikasote

MS@African Masters of Machine Intelligence and Lecturer@University of Zambia

Language Processing for Bemba

Recent Publications

Browse all publications.

DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages. Proceedings of the North American Chapter of the Association for Computational Linguistics: NAACL 2024, 2024.

PDF Code Dataset

Extracting Lexical Features from Dialects via Interpretable Dialect Classifiers. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), 2024.

PDF Code

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing. Proceedings of ICASSP 2024, 2024.

PDF

Enhancing End-to-End Conversational Speech Translation Through Target Language Context Utilization. Proceedings of ICASSP 2024, 2024.

PDF

Clinical Risk Prediction Using Language Models: Benefits And Considerations. Journal of the American Medical Informatics Association, 2024.

PDF