As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems.
Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language speakers. To do so, we use entity recognition and linking systems to identify which countries are represented (through named entities) in the dataset.
We then explore some geographical and economic factors that may explain the observed dataset distributions.
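As a rough illustration of the counting step, the sketch below assumes that entity recognition and linking has already resolved each mention to a country code. The mention list, the ISO 3166-1 alpha-3 codes, and the helper name `count_countries` are hypothetical and only show the shape of the computation; they are not the project's actual pipeline.

```python
from collections import Counter

# Hypothetical output of an entity-linking step: each mention is paired with
# an ISO 3166-1 alpha-3 country code, or None when no country can be assigned.
linked_mentions = [
    ("Kano", "NGA"),
    ("Abuja", "NGA"),
    ("Niamey", "NER"),
    ("Washington", "USA"),
    ("Paris", "FRA"),
    ("Tuesday", None),  # not linkable to any country, so it is skipped
]


def count_countries(mentions):
    """Count how many linked mentions map to each country code."""
    return Counter(code for _, code in mentions if code is not None)


country_counts = count_countries(linked_mentions)

# The ten most frequent countries, as in a top-10 bar graph.
print(country_counts.most_common(10))
```

A per-country count of this kind is the input to every statistic discussed on this page.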
Let's look at the following example, from the Hausa portion of the MasakhaNER dataset, as a guide on how to interpret our graphs:
- In the map below, you can hover over every country to see exactly how many relevant entities we found. The scatter marker size is proportional to the log of this count.
- The first important statistic is what portion of the dataset can be associated with countries where the language is spoken. We only consider countries where the language has official status, or where a significant portion of the language community resides. In the case of Hausa, we only have Nigeria (NGA) and Niger (NER), with about 21% of the dataset associated with these two. Across all datasets we studied, the minimum value for this metric is around 16%, with the maximum being around 85%.
The bar graph in the bottom right shows the counts for these countries in crimson and, in green, a counterfactual distribution in which the entities are proportional to the language-speaking population residing in each of these countries.
This reveals that Nigeria is perhaps under-represented in the dataset (since the crimson bar is lower than the green) while Niger is over-represented.
- The bar graph at the bottom left shows the top-10 countries with the most entities in the dataset.
Relevant Statistics
Percentage in-country: 21.47%
Total Variation Distance between observed and population-proportional distribution: 82.819
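The two statistics above can be sketched as follows. This is a minimal example under stated assumptions: the entity counts and speaker populations are made up, the distance is computed only over the countries where the language is spoken, and it uses the standard definition of total variation distance (half the sum of absolute differences), which lies between 0 and 1. How the value reported on this page is scaled is not stated here, so the numbers are not directly comparable.

```python
def in_country_share(counts, language_countries):
    """Fraction of all country-linked entities that fall in the given countries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    inside = sum(counts.get(code, 0) for code in language_countries)
    return inside / total


def normalize(weights):
    """Turn non-negative weights into a probability distribution."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def total_variation_distance(p, q):
    """Standard TVD between two discrete distributions: 0.5 * sum |p - q|."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up numbers for illustration only.
entity_counts = {"NGA": 150, "NER": 50, "USA": 500, "FRA": 300}
speakers = {"NGA": 9.0, "NER": 1.0}  # hypothetical speaker populations, arbitrary units

share = in_country_share(entity_counts, speakers)

# Observed distribution restricted to the language's countries (an assumption),
# compared against the population-proportional counterfactual.
observed = normalize({code: entity_counts.get(code, 0) for code in speakers})
expected = normalize(speakers)
tvd = total_variation_distance(observed, expected)

print(f"in-country share: {share:.2%}, TVD: {tvd:.3f}")
```

In this toy example the observed share for NGA is below its population-proportional share, which is the pattern the bar graph describes as under-representation.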
- Additionally, we trained simple linear models to find socioeconomic correlates of the entity distributions we observe in the datasets, testing a country's population, a country's GDP, and a country's geographical distance from a population-weighted mixture of the countries where the language is spoken. In almost all cases, GDP is the best single predictor (Hausa is an exception: population explains slightly more variance than GDP), and a combination of GDP and geographical distance typically explains most of the observed variance.
- Below are the results of this analysis:
Variance explained by population: 0.271
Variance explained by GDP: 0.246
Variance explained by geographic distance: 0.056
Variance explained by all 3 factors: 0.331
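One common way to obtain "variance explained" figures like these is the coefficient of determination (R²) of an ordinary least-squares fit. The sketch below assumes that reading; the exact model, feature transforms, and data used by the project are not specified here. The feature rows (log population, log GDP, distance) and the target (log entity count) are invented, and `solve` and `r_squared` are illustrative helpers written with the standard library only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(features, target):
    """Fit least squares with an intercept and return the fraction of variance explained."""
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X^T X) beta = X^T y
    predictions = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - p) ** 2 for y, p in zip(target, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Invented per-country rows: (log population, log GDP, distance in 1000 km).
features = [
    (8.3, 11.6, 0.0),
    (7.4, 10.1, 0.9),
    (8.5, 13.3, 9.5),
    (7.8, 12.4, 4.1),
    (7.8, 12.5, 4.7),
    (9.1, 13.2, 11.0),
]
log_entity_counts = [5.0, 3.9, 6.2, 5.7, 5.1, 4.8]  # invented target values

names = ["population", "GDP", "geographic distance"]
single = [r_squared([(f[i],) for f in features], log_entity_counts) for i in range(3)]
combined = r_squared(features, log_entity_counts)

for name, value in zip(names, single):
    print(f"Variance explained by {name}: {value:.3f}")
print(f"Variance explained by all 3 factors: {combined:.3f}")
```

Because the single-factor models are nested inside the three-factor model, the combined R² can never be lower than any individual one, which matches the ordering of the figures listed above.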
Analyzed Datasets
See below for a list, categorized by task:
Named Entity Recognition
Question Answering
Probing LM knowledge
The remaining pages of this project are listed below.