language documentation

Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case of the Southern varieties of the Kurdish and Laki languages for which …

PALI: A Language Identification Benchmark for Perso-Arabic Scripts

The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource …

Noisy Parallel Data Alignment

An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources …

Educational Tools for Mapuzugun

Mapuzugun is the language of the Mapuche people. Due to political and historical reasons, its number of speakers has decreased and the language has been excluded from the educational system in Chile and Argentina. For this reason, it is very …

Lexically-Aware Semi-Supervised Learning for OCR Post-Correction

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of …

Automatic Interlinear Glossing for Under-Resourced Languages Leveraging Translations

Interlinear Glossed Text (IGT) is a widely used format for encoding linguistic information in language documentation projects and scholarly papers. Manual production of IGT takes time and requires linguistic expertise. We attempt to address this …

OCR Post-Correction for Endangered Language Texts

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In …

Language Documentation aims at producing a permanent record that describes a language as used by its language community by producing a formal grammatical description along with a lexicon. Our group works on integrating NLP systems into the documentation workflow, aiming to speed-up the process and help the work of field linguists and language communities.

A Resource for Computational Experiments on Mapudungun

We present a resource for computational experiments on Mapudungun, a polysynthetic indigenous language spoken in Chile with upwards of 200 thousand speakers. We provide 142 hours of culturally significant conversations in the domain of medical …

A Resource for Studying Chatino Verbal Morphology

We present the first resource focusing on the verbal inflectional morphology of San Juan Quiahije Chatino, a tonal mesoamerican language spoken in Mexico. We provide a collection of complete inflection tables of 198 lemmata, with morphological tags …