Dataset Maps | George Mason NLP

As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems.

Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language speakers. To do so, we use entity recognition and linking systems to identify which countries are represented (through named entities) in each dataset.
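The counting step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it assumes the entity linker has already produced, for each mention, an ISO 3166-1 alpha-3 country code (or None when the entity cannot be tied to a country). The function name `count_countries` and the sample mentions are hypothetical.

```python
from collections import Counter


def count_countries(linked_mentions):
    """Count how many linked entities are associated with each country.

    `linked_mentions` is an iterable of (mention_text, country_code) pairs,
    where country_code is an ISO 3166-1 alpha-3 string or None.
    """
    counts = Counter()
    for _mention, country in linked_mentions:
        if country is not None:  # skip entities with no country association
            counts[country] += 1
    return counts


# Hypothetical linker output, for illustration only.
mentions = [
    ("Kano", "NGA"),
    ("Abuja", "NGA"),
    ("Niamey", "NER"),
    ("Paris", "FRA"),
    ("Monday", None),
]
country_counts = count_countries(mentions)
print(country_counts.most_common())
```

The resulting per-country counts are the quantity that the maps and bar graphs visualize.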

We then explore some geographical and economic factors that may explain the observed dataset distributions.

Let's look at the following example, from the Hausa portion of the MasakhaNER dataset, as a guide on how to interpret our graphs:

In the map below, you can hover over every country to see exactly how many relevant entities we found. The scatter marker size is proportional to the log of this count.
The first important statistic is what portion of the dataset can be associated with countries where the language is spoken. We only consider countries where the language has official status, or where a significant portion of the language community resides. In the case of Hausa, we only have Nigeria (NGA) and Niger (NER), with about 21% of the dataset associated with these two. Across all datasets we studied, the minimum value for this metric is around 16%, with the maximum being around 85%.
The bar graph at the bottom right shows the observed counts for these countries in crimson and, in green, a counterfactual distribution in which the entities are proportional to the language-speaking population residing in each of these countries.
This reveals that Nigeria is perhaps under-represented in the dataset (since the crimson bar is lower than the green) while Niger is over-represented.
The bar graph at the bottom left shows the top-10 countries with the most entities in the dataset.
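The in-country percentage and the population-proportional counterfactual described above can be sketched like this. The sketch assumes the counterfactual redistributes the same in-country entity total across the language's countries according to speaker population; the page does not spell out that detail. All function names and numbers are hypothetical and are not taken from MasakhaNER.

```python
def in_country_share(counts, language_countries):
    """Fraction of all country-linked entities that fall in the language's countries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    in_country = sum(counts.get(c, 0) for c in language_countries)
    return in_country / total


def population_proportional(counts, speakers):
    """Counterfactual counts: the observed in-country total, split by speaker population.

    Assumption: the counterfactual keeps the observed in-country total fixed.
    """
    observed_total = sum(counts.get(c, 0) for c in speakers)
    total_speakers = sum(speakers.values())
    return {c: observed_total * n / total_speakers for c, n in speakers.items()}


# Hypothetical entity counts and speaker populations (arbitrary units).
counts = {"NGA": 60, "NER": 40, "USA": 300, "FRA": 100}
speakers = {"NGA": 90, "NER": 10}

share = in_country_share(counts, speakers.keys())
expected = population_proportional(counts, speakers)
print(f"in-country share: {share:.2%}")
for code in speakers:
    print(code, "observed:", counts[code], "population-proportional:", expected[code])
```

With these made-up numbers, NGA has fewer observed entities than its population-proportional count and NER has more, which is the same pattern the crimson and green bars show for Hausa.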

Relevant Statistics

Percentage in-country: 21.47%
Total Variation Distance between observed and population-proportional distribution: 82.819
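For reference, the textbook total variation distance between two discrete probability distributions is half the sum of the absolute differences of their probabilities, which always lies between 0 and 1. The values reported on these pages exceed 1, so they were presumably computed on a different scale (for example, on raw counts rather than normalized distributions); the exact convention is not stated here. The sketch below shows the standard definition only, with hypothetical counts and hypothetical function names.

```python
def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return {k: v / total for k, v in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical observed and population-proportional counts.
observed = {"NGA": 60, "NER": 40}
expected = {"NGA": 90, "NER": 10}

tvd = total_variation_distance(normalize(observed), normalize(expected))
tvd_raw = total_variation_distance(observed, expected)  # same formula on raw counts
print(f"normalized: {tvd:.3f}, raw counts: {tvd_raw:.3f}")
```

Applying the same formula to raw counts yields a number that grows with dataset size, which is one possible explanation for the larger values reported for the bigger datasets.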

Additionally, we trained simple linear models to find socioeconomic correlates of the entity distributions observed in the datasets. We tested three predictors: a country's population, its GDP, and its geographical distance from a population-weighted mixture of the countries where the language is spoken. In almost all cases, GDP is the best single predictor (Hausa is an exception, with population explaining slightly more variance than GDP), and a combination of GDP and geographical distance typically explains most of the observed variance.
Below are the results of this analysis:

Variance explained by population: 0.271
Variance explained by GDP: 0.246
Variance explained by geographic distance: 0.056
Variance explained by all 3 factors: 0.331
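A "variance explained" figure of this kind is commonly the coefficient of determination (R²) of a fitted linear model; that reading is an assumption here, as are any transformations of the inputs, which the page does not describe. The sketch below fits a one-predictor least-squares line in plain Python and reports R². It does not cover the three-factor model, which would need multiple regression. The data and the function name `r_squared` are hypothetical.

```python
def r_squared(x, y):
    """R^2 of the ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical per-country predictor (e.g. a GDP-like value) and entity counts.
predictor = [1.0, 2.0, 3.0, 4.0]
entity_counts = [1.0, 3.0, 2.0, 4.0]

r2 = r_squared(predictor, entity_counts)
print(f"variance explained: {r2:.3f}")
```

Running the same function once per predictor gives one value per factor, comparable in form to the per-factor lines listed above.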

Analyzed Datasets

See below for a list, categorized by task:

Named Entity Recognition

WikiANN (26 languages): maps+details
MasakhaNER (11 languages): maps+details

Question Answering

Natural Questions (NQ): map+details
SQuAD: map+details
MLQA: map+details
TyDi-QA: map+details

Probing LM knowledge

X-FACTR (mTRex): map+details

The remaining pages of this project are listed below:

Dataset Maps - MasakhaNER (all)

Language: Amharic

Relevant Statistics

Percentage in-country: 25.07%
Missing countries: 171 of 243 (70.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - MLQA (English)

Relevant Statistics

Percentage in-country: 53.63%
Missing countries: 80 of 243 (32.92%)
Total Variation Distance between observed and population-proportional distribution: 913.093

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - Natural Questions (English)

Relevant Statistics

Percentage in-country: 80.07%
Missing countries: 49 of 243 (20.16%)
Total Variation Distance between observed and population-proportional distribution: 11907.219

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - SQuAD (English)

Relevant Statistics

Percentage in-country: 62.73%
Missing countries: 93 of 243 (38.27%)
Total Variation Distance between observed and population-proportional distribution: 5004.444

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - TyDi-QA (all)

Language: Arabic

Relevant Statistics

Percentage in-country: 20.39%
Missing countries: 69 of 243 (28.40%)
Total Variation Distance between observed and population-proportional distribution: 745.564

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - WikiANN (all)

Language: Afrikaans

Relevant Statistics

Percentage in-country: 34.71%
Missing countries: 85 of 243 (34.98%)
Total Variation Distance between observed and population-proportional distribution: 34.966

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - X-FACTR (12 languages)

Language: Greek

Relevant Statistics

Percentage in-country: 2.94%
Total Variation Distance between observed and population-proportional distribution: 1.623

We also trained a linear model to find socioeconomic correlates of the datasets.