Dataset Maps - WikiANN (all)

Oct 2, 2021

afrikaans

Relevant Statistics

Percentage in-country: 34.71%
Missing countries: 85 of 243 (34.98%)
Total Variation Distance between observed and population-proportional distribution: 34.966
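
The page does not say how these three statistics were computed. The sketch below shows one plausible way, assuming the data is a mapping from country codes to entity counts. The function names, the country codes, and all counts are hypothetical and chosen only for illustration. The total variation distance here follows the standard normalised definition, which lies between 0 and 1; the figures reported on this page are on a different scale, which the page does not specify.

```python
from typing import Dict, Iterable, List


def in_country_percentage(observed: Dict[str, float], home: Iterable[str]) -> float:
    """Share (in percent) of observed counts that fall in the language's home countries."""
    total = sum(observed.values())
    return 100.0 * sum(observed.get(c, 0.0) for c in home) / total


def missing_countries(observed: Dict[str, float], all_countries: Iterable[str]) -> List[str]:
    """Countries with no observed counts at all."""
    return sorted(c for c in all_countries if observed.get(c, 0.0) == 0)


def total_variation_distance(observed: Dict[str, float], population: Dict[str, float]) -> float:
    """Standard TVD: half the L1 distance between the two normalised distributions."""
    countries = set(observed) | set(population)
    obs_total = sum(observed.values())
    pop_total = sum(population.values())
    return 0.5 * sum(
        abs(observed.get(c, 0.0) / obs_total - population.get(c, 0.0) / pop_total)
        for c in countries
    )


# Hypothetical toy data (not taken from WikiANN): entity counts and populations.
observed = {"ZA": 6, "NA": 2, "US": 2}
population = {"ZA": 50, "NA": 10, "US": 30, "BW": 10}

pct_home = in_country_percentage(observed, home=["ZA", "NA"])
missing = missing_countries(observed, population.keys())
tvd = total_variation_distance(observed, population)
print(pct_home, missing, tvd)
```

In this toy case 80% of the counts are in-country, one of four countries is missing, and the distance is approximately 0.2.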

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.155
Variance explained by GDP: 0.460
Variance explained by geographic distance: 0.002
Variance explained by all 3 factors: 0.502
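
As a rough illustration of what "variance explained" can mean here, the sketch below fits an ordinary least-squares model with NumPy and reports the in-sample R². This is an assumption, not the page's documented procedure: the feature choice, the helper name `variance_explained`, and the synthetic data are all hypothetical. Some values reported on this page (a negative score, or a combined score below a single-factor score) cannot arise from an in-sample fit like this one, so the original computation presumably differs, for example by scoring on held-out data.

```python
import numpy as np


def variance_explained(features: np.ndarray, target: np.ndarray) -> float:
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    # Prepend a column of ones so the model includes an intercept term.
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


# Synthetic per-country data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n_countries = 200
log_population = rng.normal(size=n_countries)
log_gdp = 0.5 * log_population + rng.normal(size=n_countries)
distance = rng.normal(size=n_countries)
# Target: (log) entity count per country, driven mostly by GDP in this toy setup.
log_count = 0.3 * log_population + 1.0 * log_gdp + rng.normal(size=n_countries)

r2_population = variance_explained(log_population, log_count)
r2_gdp = variance_explained(log_gdp, log_count)
r2_distance = variance_explained(distance, log_count)
r2_all = variance_explained(
    np.column_stack([log_population, log_gdp, distance]), log_count
)
print(r2_population, r2_gdp, r2_distance, r2_all)
```

With an in-sample fit, the combined model's R² is never lower than that of any single factor, because the single-factor models are special cases of it.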

arabic

Relevant Statistics

Percentage in-country: 29.68%
Missing countries: 51 of 243 (20.99%)
Total Variation Distance between observed and population-proportional distribution: 1587.836

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.276
Variance explained by GDP: 0.402
Variance explained by geographic distance: 0.201
Variance explained by all 3 factors: 0.555

azerbaijani

Relevant Statistics

Percentage in-country: 33.94%
Missing countries: 92 of 243 (37.86%)
Total Variation Distance between observed and population-proportional distribution: 1268.119

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.220
Variance explained by GDP: 0.407
Variance explained by geographic distance: 0.224
Variance explained by all 3 factors: 0.547

bulgarian

Relevant Statistics

Percentage in-country: 14.46%
Missing countries: 46 of 243 (18.93%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.239
Variance explained by GDP: 0.396
Variance explained by geographic distance: 0.161
Variance explained by all 3 factors: 0.501

bengali

Relevant Statistics

Percentage in-country: 26.91%
Missing countries: 82 of 243 (33.74%)
Total Variation Distance between observed and population-proportional distribution: 662.371

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.325
Variance explained by GDP: 0.433
Variance explained by geographic distance: 0.027
Variance explained by all 3 factors: 0.445

german

Relevant Statistics

Percentage in-country: 27.03%
Missing countries: 41 of 243 (16.87%)
Total Variation Distance between observed and population-proportional distribution: 430.165

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.244
Variance explained by GDP: 0.484
Variance explained by geographic distance: 0.181
Variance explained by all 3 factors: 0.589

greek

Relevant Statistics

Percentage in-country: 27.68%
Missing countries: 42 of 243 (17.28%)
Total Variation Distance between observed and population-proportional distribution: 5.759

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.112
Variance explained by GDP: 0.337
Variance explained by geographic distance: 0.176
Variance explained by all 3 factors: 0.472

spanish

Relevant Statistics

Percentage in-country: 54.58%
Missing countries: 44 of 243 (18.11%)
Total Variation Distance between observed and population-proportional distribution: 7674.883

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.240
Variance explained by GDP: 0.435
Variance explained by geographic distance: 0.090
Variance explained by all 3 factors: 0.475

estonian

Relevant Statistics

Percentage in-country: 21.08%
Missing countries: 50 of 243 (20.58%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.200
Variance explained by GDP: 0.405
Variance explained by geographic distance: 0.210
Variance explained by all 3 factors: 0.546

basque

Relevant Statistics

Percentage in-country: 23.53%
Missing countries: 97 of 243 (39.92%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.212
Variance explained by GDP: 0.489
Variance explained by geographic distance: 0.096
Variance explained by all 3 factors: 0.550

finnish

Relevant Statistics

Percentage in-country: 17.22%
Missing countries: 46 of 243 (18.93%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.248
Variance explained by GDP: 0.478
Variance explained by geographic distance: 0.172
Variance explained by all 3 factors: 0.572

french

Relevant Statistics

Percentage in-country: 33.50%
Missing countries: 40 of 243 (16.46%)
Total Variation Distance between observed and population-proportional distribution: 5967.332

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.254
Variance explained by GDP: 0.487
Variance explained by geographic distance: 0.169
Variance explained by all 3 factors: 0.574

hebrew

Relevant Statistics

Percentage in-country: 17.49%
Missing countries: 53 of 243 (21.81%)
Total Variation Distance between observed and population-proportional distribution: 23.197

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.189
Variance explained by GDP: 0.412
Variance explained by geographic distance: 0.170
Variance explained by all 3 factors: 0.529

hungarian

Relevant Statistics

Percentage in-country: 20.35%
Missing countries: 52 of 243 (21.40%)
Total Variation Distance between observed and population-proportional distribution: 449.676

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.186
Variance explained by GDP: 0.421
Variance explained by geographic distance: 0.185
Variance explained by all 3 factors: 0.532

indonesian

Relevant Statistics

Percentage in-country: 11.77%
Missing countries: 42 of 243 (17.28%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.357
Variance explained by GDP: 0.490
Variance explained by geographic distance: 0.006
Variance explained by all 3 factors: 0.482

japanese

Relevant Statistics

Percentage in-country: 66.02%
Missing countries: 69 of 243 (28.40%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.357
Variance explained by GDP: 0.582
Variance explained by geographic distance: 0.022
Variance explained by all 3 factors: 0.555

korean

Relevant Statistics

Percentage in-country: 25.06%
Missing countries: 45 of 243 (18.52%)
Total Variation Distance between observed and population-proportional distribution: 1378.067

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.333
Variance explained by GDP: 0.524
Variance explained by geographic distance: 0.022
Variance explained by all 3 factors: 0.483

marathi

Relevant Statistics

Percentage in-country: 52.83%
Missing countries: 99 of 243 (40.74%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.359
Variance explained by GDP: 0.510
Variance explained by geographic distance: 0.033
Variance explained by all 3 factors: 0.482

russian

Relevant Statistics

Percentage in-country: 24.63%
Missing countries: 46 of 243 (18.93%)
Total Variation Distance between observed and population-proportional distribution: 843.995

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.234
Variance explained by GDP: 0.424
Variance explained by geographic distance: 0.223
Variance explained by all 3 factors: 0.557

swahili

Relevant Statistics

Percentage in-country: 11.33%
Missing countries: 134 of 243 (55.14%)
Total Variation Distance between observed and population-proportional distribution: 35.505

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.152
Variance explained by GDP: 0.364
Variance explained by geographic distance: 0.032
Variance explained by all 3 factors: 0.395

telugu

Relevant Statistics

Percentage in-country: 81.58%
Missing countries: 182 of 243 (74.90%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.192
Variance explained by GDP: 0.345
Variance explained by geographic distance: 0.039
Variance explained by all 3 factors: 0.155

thai

Relevant Statistics

Percentage in-country: 38.26%
Missing countries: 161 of 243 (66.26%)
Total Variation Distance between observed and population-proportional distribution: 21.969

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.276
Variance explained by GDP: 0.517
Variance explained by geographic distance: 0.090
Variance explained by all 3 factors: 0.548

turkish

Relevant Statistics

Percentage in-country: 28.79%
Missing countries: 52 of 243 (21.40%)
Total Variation Distance between observed and population-proportional distribution: 1071.450

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.222
Variance explained by GDP: 0.447
Variance explained by geographic distance: 0.227
Variance explained by all 3 factors: 0.597

vietnamese

Relevant Statistics

Percentage in-country: 33.29%
Missing countries: 57 of 243 (23.46%)
Total Variation Distance between observed and population-proportional distribution: 3702.792

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.317
Variance explained by GDP: 0.540
Variance explained by geographic distance: 0.045
Variance explained by all 3 factors: 0.533

yoruba

Relevant Statistics

Percentage in-country: 20.77%
Missing countries: 215 of 243 (88.48%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: -0.049
Variance explained by GDP: 0.029
Variance explained by geographic distance: 0.016
Variance explained by all 3 factors: 0.122

chinese

Relevant Statistics

Percentage in-country: 49.11%
Missing countries: 49 of 243 (20.16%)
Total Variation Distance between observed and population-proportional distribution: 4044.378

We also trained a linear model to find socioeconomic correlates of the datasets
Variance explained by population: 0.387
Variance explained by GDP: 0.575
Variance explained by geographic distance: 0.070
Variance explained by all 3 factors: 0.573

Antonios Anastasopoulos

Assistant Professor

I work on multilingual models, machine translation, speech recognition, and NLP for under-served languages.

Fahim Faisal

PhD Student

My name is Fahim Faisal. My academic interests span different aspects of computational linguistics and natural language processing (e.g., machine translation). Currently, I am working on a project on semi-supervised learning of the morphological processes of language.
