Dataset Maps

As language technologies become ubiquitous, efforts to expand the language diversity and coverage of natural language processing (NLP) systems are increasing.

Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language's speakers. To do so, we use entity recognition and linking systems to identify which countries are represented (through named entities) in each dataset.
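The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: it assumes the entity recognition and linking stage has already produced a list of linked entity names, and the lookup table `ENTITY_TO_COUNTRY` and the functions `count_countries` and `in_country_percentage` are hypothetical names introduced here. It also assumes that "in-country" refers to entities located in countries where the dataset's language is spoken.

```python
from collections import Counter

# Hypothetical lookup from a linked entity to an ISO country code.
# A real system would query a knowledge base; these entries are illustrative.
ENTITY_TO_COUNTRY = {
    "Kano": "NG",
    "Abuja": "NG",
    "Accra": "GH",
    "Paris": "FR",
}


def count_countries(linked_entities, entity_to_country=ENTITY_TO_COUNTRY):
    """Count how often each country is referenced by the linked entities.

    Entities with no known country are skipped.
    """
    counts = Counter()
    for entity in linked_entities:
        country = entity_to_country.get(entity)
        if country is not None:
            counts[country] += 1
    return counts


def in_country_percentage(counts, home_countries):
    """Percentage of country references that fall inside `home_countries`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    in_country = sum(n for c, n in counts.items() if c in home_countries)
    return 100.0 * in_country / total


# Toy input standing in for the output of an entity linker.
entities = ["Kano", "Abuja", "Paris", "Accra", "Unknown", "Kano"]
counts = count_countries(entities)
share = in_country_percentage(counts, home_countries={"NG"})
```

Here `counts` holds one tally per country, and `share` is the fraction of those tallies that land in the assumed home country, expressed as a percentage.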

We then explore some geographical and economic factors that may explain the observed dataset distributions.
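One way to measure how much of a dataset's country distribution such factors explain is the R^2 of a linear fit. The sketch below is an assumption about the general approach, not the project's actual model: it fits ordinary least squares with an intercept in plain Python, and the function names and the toy numbers are invented for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides at once.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: factors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def variance_explained(rows, target):
    """Return R^2 of an ordinary least-squares fit with an intercept.

    rows: one list of factor values per country; target: observed values.
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    width = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(width)]
           for i in range(width)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(width)]
    beta = solve_linear_system(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(target, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Made-up values: columns stand for population, GDP, and geographic distance.
rows = [
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 3.0],
    [5.0, 6.0, 2.0],
    [6.0, 5.0, 2.5],
]
target = [1.2, 1.9, 3.4, 3.8, 5.5, 5.9]

r2_population = variance_explained([[r[0]] for r in rows], target)
r2_all = variance_explained(rows, target)
```

Because the single-factor model is nested inside the three-factor model, `r2_all` can never be lower than `r2_population`, which mirrors how a combined figure sits above each individual factor's figure.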

Let's look at the following example, from the Hausa portion of the MasakhaNER dataset, as a guide on how to interpret our graphs:

Relevant Statistics

Percentage in-country: 21.47%
Total Variation Distance between observed and population-proportional distribution: 82.819

    Variance explained by population: 0.271
    Variance explained by GDP: 0.246
    Variance explained by geographic distance: 0.056
    Variance explained by all 3 factors: 0.331
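For readers unfamiliar with total variation distance, the sketch below shows the textbook definition: half the sum of absolute differences between two probability distributions, which always lies between 0 and 1. The values reported on these pages exceed 1, so they evidently use a different scaling that is not documented here; the code is an illustration of the standard quantity only. The function names and the toy counts are hypothetical.

```python
def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return {key: value / total for key, value in counts.items()}


def total_variation_distance(observed_counts, reference_counts):
    """Textbook TVD: half the L1 distance between the two distributions."""
    p = normalize(observed_counts)
    q = normalize(reference_counts)
    countries = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in countries)


# Toy numbers, not real statistics: entity counts per country in a dataset,
# and the number of speakers per country used as the reference distribution.
observed = {"NG": 3, "GH": 1, "FR": 1}
speakers = {"NG": 200, "GH": 30, "FR": 70}

tvd = total_variation_distance(observed, speakers)
```

A value of 0 means the dataset mentions countries exactly in proportion to the reference distribution, and a value of 1 means the two distributions share no countries at all.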

Analyzed Datasets

See below for a list, categorized by task:

Named Entity Recognition

Question Answering

Probing LM knowledge

The remaining pages of this project are listed below.

Dataset Maps - MasakhaNER (all)

amharic

Relevant Statistics

Percentage in-country: 25.07%
Missing countries: 171 of 243 (70.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - MLQA (English)

Relevant Statistics

Percentage in-country: 53.63%
Missing countries: 80 of 243 (32.92%)
Total Variation Distance between observed and population-proportional distribution: 913.093

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - Natural Questions (English)

Relevant Statistics

Percentage in-country: 80.07%
Missing countries: 49 of 243 (20.16%)
Total Variation Distance between observed and population-proportional distribution: 11907.219

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - SQuAD (English)

Relevant Statistics

Percentage in-country: 62.73%
Missing countries: 93 of 243 (38.27%)
Total Variation Distance between observed and population-proportional distribution: 5004.444

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - TyDi-QA (all)

arabic

Relevant Statistics

Percentage in-country: 20.39%
Missing countries: 69 of 243 (28.40%)
Total Variation Distance between observed and population-proportional distribution: 745.564

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - WikiANN (all)

afrikaans

Relevant Statistics

Percentage in-country: 34.71%
Missing countries: 85 of 243 (34.98%)
Total Variation Distance between observed and population-proportional distribution: 34.966

We also trained a linear model to find socioeconomic correlates of the datasets.

Dataset Maps - X-FACTR (12 languages)

greek

Relevant Statistics

Percentage in-country: 2.94%
Total Variation Distance between observed and population-proportional distribution: 1.623

We also trained a linear model to find socioeconomic correlates of the datasets.