Dataset Maps - TyDi-QA (all)

arabic

Relevant Statistics

Percentage in-country: 20.39%
Missing countries: 69 of 243 (28.40%)
Total Variation Distance between observed and population-proportional distribution: 745.564
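
As a rough illustration of how the three statistics reported for each language could be computed, here is a minimal sketch. It assumes that "in-country" means entities located in countries where the language is spoken, and that the observed data is a per-country count of entities; neither is spelled out on this page. The function names and toy numbers are hypothetical. The sketch uses the standard definition of total variation distance, which is bounded by 1, so the values reported on this page appear to use a different, unspecified scaling.

```python
from typing import Dict, Set


def in_country_percentage(counts: Dict[str, float], home_countries: Set[str]) -> float:
    """Percentage of observed entities located in the language's home countries."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    inside = sum(n for country, n in counts.items() if country in home_countries)
    return 100.0 * inside / total


def missing_countries(counts: Dict[str, float], all_countries: Set[str]) -> Set[str]:
    """Countries that never appear in the dataset (zero observed entities)."""
    return {c for c in all_countries if counts.get(c, 0) == 0}


def total_variation_distance(counts: Dict[str, float], population: Dict[str, float]) -> float:
    """Standard TVD: half the L1 distance between the two normalized distributions."""
    countries = set(counts) | set(population)
    n_obs = sum(counts.values())
    n_pop = sum(population.values())
    return 0.5 * sum(
        abs(counts.get(c, 0) / n_obs - population.get(c, 0) / n_pop)
        for c in countries
    )


# Toy data (hypothetical): four countries with equal population,
# a dataset concentrated in country "A", and nothing observed in "D".
toy_counts = {"A": 6, "B": 3, "C": 1, "D": 0}
toy_population = {"A": 10, "B": 10, "C": 10, "D": 10}

pct_in_country = in_country_percentage(toy_counts, {"A"})
missing = missing_countries(toy_counts, set(toy_population))
tvd = total_variation_distance(toy_counts, toy_population)
```

A dataset whose per-country counts exactly follow population would give a distance of 0 under this definition.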

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.235
    Variance explained by GDP: 0.351
    Variance explained by geographic distance: 0.165
    Variance explained by all 3 factors: 0.475
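
One way to read the "variance explained" figures is as the R² of a least-squares fit of the per-country dataset share on each factor. The sketch below assumes ordinary least squares with an intercept and in-sample R²; the page does not state the exact model, features, or evaluation protocol, and the helper names and toy values are hypothetical. In-sample R² can never decrease when predictors are added, whereas some of the combined figures on this page (for example for english and indonesian) are slightly lower than the GDP-only figure, so the reported numbers were presumably obtained with a different setup.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def variance_explained(features: Sequence[Sequence[float]], target: Sequence[float]) -> float:
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    rows = [[1.0] + list(f) for f in features]  # prepend the intercept column
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    preds = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(target, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Toy per-country rows (hypothetical): population, GDP, distance, dataset share.
toy = [
    (1.0, 2.0, 0.5, 3.1),
    (2.0, 1.0, 1.5, 2.9),
    (3.0, 4.0, 1.0, 6.2),
    (4.0, 3.0, 2.5, 5.8),
    (5.0, 6.0, 2.0, 9.1),
    (6.0, 5.0, 3.5, 8.7),
]
y = [row[3] for row in toy]
r2_population = variance_explained([[row[0]] for row in toy], y)
r2_gdp = variance_explained([[row[1]] for row in toy], y)
r2_distance = variance_explained([[row[2]] for row in toy], y)
r2_all = variance_explained([list(row[:3]) for row in toy], y)
```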


bengali

Relevant Statistics

Percentage in-country: 51.35%
Missing countries: 127 of 243 (52.26%)
Total Variation Distance between observed and population-proportional distribution: 893.526

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.385
    Variance explained by GDP: 0.440
    Variance explained by geographic distance: 0.092
    Variance explained by all 3 factors: 0.510


english

Relevant Statistics

Percentage in-country: 57.33%
Missing countries: 113 of 243 (46.50%)
Total Variation Distance between observed and population-proportional distribution: 1089.409

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.303
    Variance explained by GDP: 0.563
    Variance explained by geographic distance: 0.005
    Variance explained by all 3 factors: 0.543


finnish

Relevant Statistics

Percentage in-country: 20.04%
Missing countries: 94 of 243 (38.68%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.204
    Variance explained by GDP: 0.478
    Variance explained by geographic distance: 0.120
    Variance explained by all 3 factors: 0.547


indonesian

Relevant Statistics

Percentage in-country: 31.29%
Missing countries: 99 of 243 (40.74%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.331
    Variance explained by GDP: 0.500
    Variance explained by geographic distance: 0.023
    Variance explained by all 3 factors: 0.491


japanese

Relevant Statistics

Percentage in-country: 23.65%
Missing countries: 92 of 243 (37.86%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.293
    Variance explained by GDP: 0.548
    Variance explained by geographic distance: 0.052
    Variance explained by all 3 factors: 0.523


korean

Relevant Statistics

Percentage in-country: 19.49%
Missing countries: 124 of 243 (51.03%)
Total Variation Distance between observed and population-proportional distribution: 106.953

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.300
    Variance explained by GDP: 0.535
    Variance explained by geographic distance: 0.043
    Variance explained by all 3 factors: 0.517


russian

Relevant Statistics

Percentage in-country: 42.09%
Missing countries: 110 of 243 (45.27%)
Total Variation Distance between observed and population-proportional distribution: 202.497

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.177
    Variance explained by GDP: 0.405
    Variance explained by geographic distance: 0.175
    Variance explained by all 3 factors: 0.500


swahili

Relevant Statistics

Percentage in-country: 26.99%
Missing countries: 99 of 243 (40.74%)
Total Variation Distance between observed and population-proportional distribution: 836.845

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.245
    Variance explained by GDP: 0.313
    Variance explained by geographic distance: 0.099
    Variance explained by all 3 factors: 0.416


telugu

Relevant Statistics

Percentage in-country: 85.54%
Missing countries: 98 of 243 (40.33%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.384
    Variance explained by GDP: 0.495
    Variance explained by geographic distance: 0.053
    Variance explained by all 3 factors: 0.500


thai

Relevant Statistics

Percentage in-country: 46.26%
Missing countries: 144 of 243 (59.26%)
Total Variation Distance between observed and population-proportional distribution: 123.140

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.272
    Variance explained by GDP: 0.507
    Variance explained by geographic distance: 0.075
    Variance explained by all 3 factors: 0.532


Antonios Anastasopoulos
Assistant Professor

I work on multilingual models, machine translation, speech recognition, and NLP for under-served languages.

Fahim Faisal
PhD Student

My name is Fahim Faisal. My academic interests span different aspects of computational linguistics and natural language processing (e.g., machine translation). Currently, I am working on a project on semi-supervised learning of the morphological processes of language.
