Dataset Maps - WikiANN (all)

afrikaans

Relevant Statistics

Percentage in-country: 34.71%
Missing countries: 85 of 243 (34.98%)
Total Variation Distance between observed and population-proportional distribution: 34.966
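
As a rough illustration of how the last two statistics could be computed, here is a minimal Python sketch. It assumes the standard definition of total variation distance (half the sum of absolute differences between two normalized distributions, which lies between 0 and 1). The scaling behind the figures reported on this page is not stated, so the numbers are not directly comparable. The function names and toy inputs are hypothetical.

```python
def missing_country_share(observed_counts, all_countries):
    """Return (number, percentage) of countries with no observed entities."""
    missing = [c for c in all_countries if observed_counts.get(c, 0) == 0]
    return len(missing), 100.0 * len(missing) / len(all_countries)


def total_variation_distance(observed_counts, population):
    """Standard TVD between the observed per-country distribution and a
    population-proportional one. Both inputs map country -> non-negative number."""
    countries = set(observed_counts) | set(population)
    obs_total = sum(observed_counts.values())
    pop_total = sum(population.values())
    return 0.5 * sum(
        abs(observed_counts.get(c, 0) / obs_total - population.get(c, 0) / pop_total)
        for c in countries
    )


# Hypothetical toy data: three countries, one of them never mentioned.
toy_counts = {"A": 30, "B": 10}
toy_population = {"A": 100, "B": 100, "C": 200}
print(missing_country_share(toy_counts, ["A", "B", "C"]))
print(total_variation_distance(toy_counts, toy_population))
```

A distance of 0 under this definition would mean the dataset's entities are spread across countries exactly in proportion to population.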

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.155
    Variance explained by GDP: 0.460
    Variance explained by geographic distance: 0.002
    Variance explained by all 3 factors: 0.502
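
The "variance explained" figures can be read as R^2 values of a linear fit. The sketch below computes in-sample R^2 for an ordinary least-squares model with an intercept, in plain Python. This is an assumption about the setup, not the authors' code: in-sample R^2 can never be negative or decrease when factors are added, so the figures on this page (which sometimes do both) were presumably computed differently, for example on held-out data. All names and toy numbers are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def variance_explained(rows, target):
    """In-sample R^2 of a least-squares fit of target on rows (plus intercept)."""
    design = [[1.0] + [float(v) for v in row] for row in rows]
    width = len(design[0])
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(width)]
           for i in range(width)]
    xty = [sum(d[i] * y for d, y in zip(design, target)) for i in range(width)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, d)) for d in design]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical per-country rows: (population, GDP, geographic distance),
# with a made-up entity count per country as the target.
toy_rows = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0],
            [4.0, 3.0, 2.5], [5.0, 6.0, 2.0], [6.0, 5.0, 3.0]]
toy_target = [2.0, 1.5, 4.5, 3.0, 6.5, 5.0]

print("GDP only:", variance_explained([[r[1]] for r in toy_rows], toy_target))
print("All 3 factors:", variance_explained(toy_rows, toy_target))
```

Fitting each factor on its own and then all three together mirrors the four lines reported for each language.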


arabic

Relevant Statistics

Percentage in-country: 29.68%
Missing countries: 51 of 243 (20.99%)
Total Variation Distance between observed and population-proportional distribution: 1587.836

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.276
    Variance explained by GDP: 0.402
    Variance explained by geographic distance: 0.201
    Variance explained by all 3 factors: 0.555


azerbaijani

Relevant Statistics

Percentage in-country: 33.94%
Missing countries: 92 of 243 (37.86%)
Total Variation Distance between observed and population-proportional distribution: 1268.119

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.220
    Variance explained by GDP: 0.407
    Variance explained by geographic distance: 0.224
    Variance explained by all 3 factors: 0.547


bulgarian

Relevant Statistics

Percentage in-country: 14.46%
Missing countries: 46 of 243 (18.93%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.239
    Variance explained by GDP: 0.396
    Variance explained by geographic distance: 0.161
    Variance explained by all 3 factors: 0.501


bengali

Relevant Statistics

Percentage in-country: 26.91%
Missing countries: 82 of 243 (33.74%)
Total Variation Distance between observed and population-proportional distribution: 662.371

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.325
    Variance explained by GDP: 0.433
    Variance explained by geographic distance: 0.027
    Variance explained by all 3 factors: 0.445


german

Relevant Statistics

Percentage in-country: 27.03%
Missing countries: 41 of 243 (16.87%)
Total Variation Distance between observed and population-proportional distribution: 430.165

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.244
    Variance explained by GDP: 0.484
    Variance explained by geographic distance: 0.181
    Variance explained by all 3 factors: 0.589


greek

Relevant Statistics

Percentage in-country: 27.68%
Missing countries: 42 of 243 (17.28%)
Total Variation Distance between observed and population-proportional distribution: 5.759

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.112
    Variance explained by GDP: 0.337
    Variance explained by geographic distance: 0.176
    Variance explained by all 3 factors: 0.472


spanish

Relevant Statistics

Percentage in-country: 54.58%
Missing countries: 44 of 243 (18.11%)
Total Variation Distance between observed and population-proportional distribution: 7674.883

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.240
    Variance explained by GDP: 0.435
    Variance explained by geographic distance: 0.090
    Variance explained by all 3 factors: 0.475


estonian

Relevant Statistics

Percentage in-country: 21.08%
Missing countries: 50 of 243 (20.58%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.200
    Variance explained by GDP: 0.405
    Variance explained by geographic distance: 0.210
    Variance explained by all 3 factors: 0.546


basque

Relevant Statistics

Percentage in-country: 23.53%
Missing countries: 97 of 243 (39.92%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.212
    Variance explained by GDP: 0.489
    Variance explained by geographic distance: 0.096
    Variance explained by all 3 factors: 0.550


finnish

Relevant Statistics

Percentage in-country: 17.22%
Missing countries: 46 of 243 (18.93%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.248
    Variance explained by GDP: 0.478
    Variance explained by geographic distance: 0.172
    Variance explained by all 3 factors: 0.572


french

Relevant Statistics

Percentage in-country: 33.50%
Missing countries: 40 of 243 (16.46%)
Total Variation Distance between observed and population-proportional distribution: 5967.332

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.254
    Variance explained by GDP: 0.487
    Variance explained by geographic distance: 0.169
    Variance explained by all 3 factors: 0.574


hebrew

Relevant Statistics

Percentage in-country: 17.49%
Missing countries: 53 of 243 (21.81%)
Total Variation Distance between observed and population-proportional distribution: 23.197

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.189
    Variance explained by GDP: 0.412
    Variance explained by geographic distance: 0.170
    Variance explained by all 3 factors: 0.529


hungarian

Relevant Statistics

Percentage in-country: 20.35%
Missing countries: 52 of 243 (21.40%)
Total Variation Distance between observed and population-proportional distribution: 449.676

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.186
    Variance explained by GDP: 0.421
    Variance explained by geographic distance: 0.185
    Variance explained by all 3 factors: 0.532


indonesian

Relevant Statistics

Percentage in-country: 11.77%
Missing countries: 42 of 243 (17.28%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.357
    Variance explained by GDP: 0.490
    Variance explained by geographic distance: 0.006
    Variance explained by all 3 factors: 0.482


japanese

Relevant Statistics

Percentage in-country: 66.02%
Missing countries: 69 of 243 (28.40%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.357
    Variance explained by GDP: 0.582
    Variance explained by geographic distance: 0.022
    Variance explained by all 3 factors: 0.555


korean

Relevant Statistics

Percentage in-country: 25.06%
Missing countries: 45 of 243 (18.52%)
Total Variation Distance between observed and population-proportional distribution: 1378.067

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.333
    Variance explained by GDP: 0.524
    Variance explained by geographic distance: 0.022
    Variance explained by all 3 factors: 0.483


marathi

Relevant Statistics

Percentage in-country: 52.83%
Missing countries: 99 of 243 (40.74%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.359
    Variance explained by GDP: 0.510
    Variance explained by geographic distance: 0.033
    Variance explained by all 3 factors: 0.482


russian

Relevant Statistics

Percentage in-country: 24.63%
Missing countries: 46 of 243 (18.93%)
Total Variation Distance between observed and population-proportional distribution: 843.995

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.234
    Variance explained by GDP: 0.424
    Variance explained by geographic distance: 0.223
    Variance explained by all 3 factors: 0.557


swahili

Relevant Statistics

Percentage in-country: 11.33%
Missing countries: 134 of 243 (55.14%)
Total Variation Distance between observed and population-proportional distribution: 35.505

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.152
    Variance explained by GDP: 0.364
    Variance explained by geographic distance: 0.032
    Variance explained by all 3 factors: 0.395


telugu

Relevant Statistics

Percentage in-country: 81.58%
Missing countries: 182 of 243 (74.90%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.192
    Variance explained by GDP: 0.345
    Variance explained by geographic distance: 0.039
    Variance explained by all 3 factors: 0.155


thai

Relevant Statistics

Percentage in-country: 38.26%
Missing countries: 161 of 243 (66.26%)
Total Variation Distance between observed and population-proportional distribution: 21.969

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.276
    Variance explained by GDP: 0.517
    Variance explained by geographic distance: 0.090
    Variance explained by all 3 factors: 0.548


turkish

Relevant Statistics

Percentage in-country: 28.79%
Missing countries: 52 of 243 (21.40%)
Total Variation Distance between observed and population-proportional distribution: 1071.450

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.222
    Variance explained by GDP: 0.447
    Variance explained by geographic distance: 0.227
    Variance explained by all 3 factors: 0.597


vietnamese

Relevant Statistics

Percentage in-country: 33.29%
Missing countries: 57 of 243 (23.46%)
Total Variation Distance between observed and population-proportional distribution: 3702.792

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.317
    Variance explained by GDP: 0.540
    Variance explained by geographic distance: 0.045
    Variance explained by all 3 factors: 0.533


yoruba

Relevant Statistics

Percentage in-country: 20.77%
Missing countries: 215 of 243 (88.48%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: -0.049
    Variance explained by GDP: 0.029
    Variance explained by geographic distance: 0.016
    Variance explained by all 3 factors: 0.122


chinese

Relevant Statistics

Percentage in-country: 49.11%
Missing countries: 49 of 243 (20.16%)
Total Variation Distance between observed and population-proportional distribution: 4044.378

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.387
    Variance explained by GDP: 0.575
    Variance explained by geographic distance: 0.070
    Variance explained by all 3 factors: 0.573







Antonios Anastasopoulos
Assistant Professor

I work on multilingual models, machine translation, speech recognition, and NLP for under-served languages.

Fahim Faisal
PhD Student

My name is Fahim Faisal. My academic interests include computational linguistics and natural language processing (e.g., machine translation). I am currently working on a project on semi-supervised learning of the morphological processes of language.
