Dataset Maps - MasakhaNER (all)
Amharic
Relevant Statistics
Percentage in-country: 25.07%
Missing countries: 171 of 243 (70.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.027
Variance explained by GDP: 0.031
Variance explained by geographic distance: 0.035
Variance explained by all 3 factors: 0.114
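The "variance explained" figures can be read as the R² of a least-squares fit of per-country counts on each factor. The sketch below is one way such a number could be computed, not the method confirmed by the source: the function name `variance_explained` and all toy arrays are hypothetical, and the source does not say whether its values are in-sample, adjusted, or held-out R². A plain in-sample fit with an intercept cannot go below zero, so the slightly negative values reported for some languages suggest an adjusted or held-out variant.

```python
import numpy as np


def variance_explained(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept.

    `features` is a 1-D array (one factor) or a 2-D array with one
    column per factor; `target` is a 1-D array of per-country values.
    """
    target = np.asarray(target, dtype=float)
    # Prepend a column of ones so the model includes an intercept term.
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residual = target - X @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data for five countries (NOT taken from the dataset).
population = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
gdp = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
distance = np.array([5.0, 3.0, 4.0, 1.0, 2.0])
counts = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

r2_population = variance_explained(population, counts)
r2_all = variance_explained(np.column_stack([population, gdp, distance]), counts)
```

In this in-sample setting the combined fit can never explain less variance than a single-factor fit, which is why the "all 3 factors" row is usually the largest.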
Yoruba
Relevant Statistics
Percentage in-country: 27.14%
Missing countries: 154 of 243 (63.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.283
Variance explained by GDP: 0.308
Variance explained by geographic distance: 0.018
Variance explained by all 3 factors: 0.349
Hausa
Relevant Statistics
Percentage in-country: 21.47%
Missing countries: 143 of 243 (58.85%)
Total Variation Distance between observed and population-proportional distribution: 82.819
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.271
Variance explained by GDP: 0.246
Variance explained by geographic distance: 0.056
Variance explained by all 3 factors: 0.331
Igbo
Relevant Statistics
Percentage in-country: 57.05%
Missing countries: 167 of 243 (68.72%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.145
Variance explained by GDP: 0.253
Variance explained by geographic distance: 0.049
Variance explained by all 3 factors: 0.259
Kinyarwanda
Relevant Statistics
Percentage in-country: 49.74%
Missing countries: 171 of 243 (70.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.045
Variance explained by GDP: 0.101
Variance explained by geographic distance: 0.062
Variance explained by all 3 factors: 0.141
Luganda
Relevant Statistics
Percentage in-country: 44.07%
Missing countries: 159 of 243 (65.43%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.116
Variance explained by GDP: 0.279
Variance explained by geographic distance: 0.008
Variance explained by all 3 factors: 0.284
Dholuo
Relevant Statistics
Percentage in-country: 61.68%
Missing countries: 193 of 243 (79.42%)
Total Variation Distance between observed and population-proportional distribution: 95.700
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.141
Variance explained by GDP: 0.023
Variance explained by geographic distance: 0.039
Variance explained by all 3 factors: -0.023
Nigerian Pidgin
Relevant Statistics
Percentage in-country: 16.49%
Missing countries: 161 of 243 (66.26%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.258
Variance explained by GDP: 0.473
Variance explained by geographic distance: -0.002
Variance explained by all 3 factors: 0.486
Wolof
Relevant Statistics
Percentage in-country: 24.20%
Missing countries: 177 of 243 (72.84%)
Total Variation Distance between observed and population-proportional distribution: 9.610
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.243
Variance explained by GDP: 0.350
Variance explained by geographic distance: -0.004
Variance explained by all 3 factors: 0.368
Swahili
Relevant Statistics
Percentage in-country: 16.97%
Missing countries: 156 of 243 (64.20%)
Total Variation Distance between observed and population-proportional distribution: 40.366
We also trained a linear model to find socioeconomic correlates of the datasets:
Variance explained by population: 0.300
Variance explained by GDP: 0.341
Variance explained by geographic distance: 0.100
Variance explained by all 3 factors: 0.441
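The Total Variation Distance reported for each language compares the observed per-country distribution with one proportional to population. A minimal sketch of the standard definition follows, assuming both inputs are per-country non-negative values in the same country order. The function name `total_variation_distance` and the toy numbers are hypothetical. Under this definition the result always lies between 0 and 1, so reported values such as 82.819 or 95.700 presumably use a different scale or normalization that the source does not specify.

```python
import numpy as np


def total_variation_distance(observed_counts, population):
    """Standard TVD between two distributions over the same countries.

    Both inputs are normalized to sum to 1, so the result is 0 when the
    observed shares match population shares exactly and 1 when the two
    distributions have no overlap.
    """
    p = np.asarray(observed_counts, dtype=float)
    q = np.asarray(population, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * float(np.abs(p - q).sum())


# Hypothetical two-country example (NOT taken from the dataset):
# observed shares are 0.75 / 0.25, population shares are 0.5 / 0.5.
example_tvd = total_variation_distance([30, 10], [1, 1])
```

For the toy example, the two shares each differ by 0.25, and half the sum of absolute differences gives 0.25.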