Dataset Maps - TyDi-QA (all)
arabic
Relevant Statistics
Percentage in-country: 20.39%
Missing countries: 69 of 243 (28.40%)
Total Variation Distance between observed and population-proportional distribution: 745.564
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.235
Variance explained by GDP: 0.351
Variance explained by geographic distance: 0.165
Variance explained by all 3 factors: 0.475
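
The "variance explained" figures can be read as R^2 values from linear fits of per-country counts on each factor. The page does not describe the fitting procedure, so the following is only a minimal sketch under stated assumptions: an ordinary least-squares fit with an intercept, scored in-sample, on synthetic data. The helper r_squared and the arrays population, gdp, distance and counts are all hypothetical and do not come from the dataset.

```python
import numpy as np

def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    # Prepend a column of ones so the model can learn an intercept.
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Synthetic per-country data (assumption: one row per country).
rng = np.random.default_rng(0)
n = 50
population = rng.lognormal(size=n)
gdp = rng.lognormal(size=n)
distance = rng.uniform(0.0, 1.0, size=n)
counts = 2.0 * gdp + 0.5 * population + rng.normal(scale=0.5, size=n)

# One fit per factor, then one fit on all three together.
r2_population = r_squared(population, counts)
r2_gdp = r_squared(gdp, counts)
r2_distance = r_squared(distance, counts)
r2_all = r_squared(np.column_stack([population, gdp, distance]), counts)
```

With an in-sample least-squares fit like this one, the combined R^2 can never fall below any single-factor R^2. Some entries on this page report a combined value lower than the GDP value (for example english and indonesian), so the reported figures presumably come from a different procedure, which the page does not specify.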
bengali
Relevant Statistics
Percentage in-country: 51.35%
Missing countries: 127 of 243 (52.26%)
Total Variation Distance between observed and population-proportional distribution: 893.526
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.385
Variance explained by GDP: 0.440
Variance explained by geographic distance: 0.092
Variance explained by all 3 factors: 0.510
english
Relevant Statistics
Percentage in-country: 57.33%
Missing countries: 113 of 243 (46.50%)
Total Variation Distance between observed and population-proportional distribution: 1089.409
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.303
Variance explained by GDP: 0.563
Variance explained by geographic distance: 0.005
Variance explained by all 3 factors: 0.543
finnish
Relevant Statistics
Percentage in-country: 20.04%
Missing countries: 94 of 243 (38.68%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.204
Variance explained by GDP: 0.478
Variance explained by geographic distance: 0.120
Variance explained by all 3 factors: 0.547
indonesian
Relevant Statistics
Percentage in-country: 31.29%
Missing countries: 99 of 243 (40.74%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.331
Variance explained by GDP: 0.500
Variance explained by geographic distance: 0.023
Variance explained by all 3 factors: 0.491
japanese
Relevant Statistics
Percentage in-country: 23.65%
Missing countries: 92 of 243 (37.86%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.293
Variance explained by GDP: 0.548
Variance explained by geographic distance: 0.052
Variance explained by all 3 factors: 0.523
korean
Relevant Statistics
Percentage in-country: 19.49%
Missing countries: 124 of 243 (51.03%)
Total Variation Distance between observed and population-proportional distribution: 106.953
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.300
Variance explained by GDP: 0.535
Variance explained by geographic distance: 0.043
Variance explained by all 3 factors: 0.517
russian
Relevant Statistics
Percentage in-country: 42.09%
Missing countries: 110 of 243 (45.27%)
Total Variation Distance between observed and population-proportional distribution: 202.497
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.177
Variance explained by GDP: 0.405
Variance explained by geographic distance: 0.175
Variance explained by all 3 factors: 0.500
swahili
Relevant Statistics
Percentage in-country: 26.99%
Missing countries: 99 of 243 (40.74%)
Total Variation Distance between observed and population-proportional distribution: 836.845
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.245
Variance explained by GDP: 0.313
Variance explained by geographic distance: 0.099
Variance explained by all 3 factors: 0.416
telugu
Relevant Statistics
Percentage in-country: 85.54%
Missing countries: 98 of 243 (40.33%)
Total Variation Distance between observed and population-proportional distribution: 0.000
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.384
Variance explained by GDP: 0.495
Variance explained by geographic distance: 0.053
Variance explained by all 3 factors: 0.500
thai
Relevant Statistics
Percentage in-country: 46.26%
Missing countries: 144 of 243 (59.26%)
Total Variation Distance between observed and population-proportional distribution: 123.140
We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.272
Variance explained by GDP: 0.507
Variance explained by geographic distance: 0.075
Variance explained by all 3 factors: 0.532
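
The coverage statistics reported for each language (missing countries and the distance to a population-proportional distribution) can be sketched as follows. This is an illustration only, using the textbook definition of total variation distance (half the L1 distance between two normalized distributions) and toy numbers. The function names, country codes and counts are hypothetical, not values from the dataset.

```python
def total_variation_distance(observed, reference):
    """Half the L1 distance between two count dictionaries, each normalized to sum to 1."""
    keys = set(observed) | set(reference)
    obs_total = sum(observed.values())
    ref_total = sum(reference.values())
    return 0.5 * sum(
        abs(observed.get(k, 0) / obs_total - reference.get(k, 0) / ref_total)
        for k in keys
    )

def missing_share(observed, all_countries):
    """Number and percentage of countries with no observed entities."""
    all_countries = list(all_countries)
    missing = [c for c in all_countries if observed.get(c, 0) == 0]
    return len(missing), 100.0 * len(missing) / len(all_countries)

# Toy data: entity counts per country and population per country (made up).
entity_counts = {"EG": 30, "SA": 10, "US": 60}
population = {"EG": 100, "SA": 35, "US": 330, "IN": 1400}

tvd = total_variation_distance(entity_counts, population)
n_missing, pct_missing = missing_share(entity_counts, population.keys())

# The same percentage arithmetic reproduces the arabic entry: 69 of 243 countries.
arabic_missing_pct = round(100.0 * 69 / 243, 2)
```

Under this normalized definition the distance always lies between 0 and 1. Several values reported above are far larger, so they are presumably on a different scale that the page does not state.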