Dataset Maps - MasakhaNER (all)


amharic

Relevant Statistics

Percentage in-country: 25.07%
Missing countries: 171 of 243 (70.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000
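
For readers who want to see how statistics of this kind can be derived, here is a minimal sketch. It is not the code behind this page. It assumes each dataset entity has already been linked to a country code, and the function name `distribution_stats` and its parameters are hypothetical. Total variation distance is computed here in its standard form (half the L1 distance between two distributions), which lies between 0 and 1. Values above 1 reported on this page suggest a different scale, possibly a percentage, but that is an assumption.

```python
from collections import Counter


def distribution_stats(entity_countries, home_country, population, all_countries):
    """Summarise where a dataset's entities are located.

    entity_countries: one country code per entity mention (assumed input format)
    home_country:     code of the country where the language is mainly spoken
    population:       dict mapping country code -> population
    all_countries:    every country code considered (243 on this page)
    """
    counts = Counter(entity_countries)
    total = sum(counts.values())

    # Share of entities located in the language's home country, as a percentage.
    in_country_pct = 100.0 * counts[home_country] / total

    # Countries with no entity at all.
    missing = sum(1 for c in all_countries if counts[c] == 0)

    # Total variation distance between the observed distribution and a
    # distribution proportional to population: half the L1 distance.
    total_pop = sum(population[c] for c in all_countries)
    tvd = 0.5 * sum(
        abs(counts[c] / total - population[c] / total_pop) for c in all_countries
    )
    return in_country_pct, missing, tvd


# Toy example with made-up numbers (not taken from MasakhaNER).
pct, missing, tvd = distribution_stats(
    ["NG", "NG", "GH", "NG"],
    "NG",
    {"NG": 200, "GH": 30, "KE": 50},
    ["NG", "GH", "KE"],
)
print(pct, missing, round(tvd, 3))
```

Dividing `missing` by `len(all_countries)` gives the percentage shown in parentheses next to the missing-country count.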

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.027
    Variance explained by GDP: 0.031
    Variance explained by geographic distance: 0.035
    Variance explained by all 3 factors: 0.114
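
As an illustration of what "variance explained" can mean here, the sketch below fits an ordinary least squares model with an intercept and reports the in-sample R². This is an assumption about the general technique, not the page's actual procedure: the page does not say how features were transformed or whether scores were computed on held-out data. In-sample R² with an intercept cannot be negative, so the negative values reported for some languages suggest a held-out or adjusted variant that this sketch does not reproduce. The helper names `solve_linear_system` and `r_squared` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(features, target):
    """In-sample R^2 of a least squares fit with an intercept.

    features: one row per country, e.g. [population, gdp, distance]
    target:   one value per country, e.g. the entity count
    """
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    preds = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(target) / len(target)
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(target, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Toy single-factor example with made-up numbers.
print(round(r_squared([[1], [2], [3], [4]], [1, 3, 2, 4]), 2))
```

Passing one column at a time gives a per-factor score, and passing all three columns together gives the combined score. The combined score is not the sum of the individual ones, because the factors overlap in what they explain.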


yoruba

Relevant Statistics

Percentage in-country: 27.14%
Missing countries: 154 of 243 (63.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.283
    Variance explained by GDP: 0.308
    Variance explained by geographic distance: 0.018
    Variance explained by all 3 factors: 0.349


hausa

Relevant Statistics

Percentage in-country: 21.47%
Missing countries: 143 of 243 (58.85%)
Total Variation Distance between observed and population-proportional distribution: 82.819

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.271
    Variance explained by GDP: 0.246
    Variance explained by geographic distance: 0.056
    Variance explained by all 3 factors: 0.331


igbo

Relevant Statistics

Percentage in-country: 57.05%
Missing countries: 167 of 243 (68.72%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.145
    Variance explained by GDP: 0.253
    Variance explained by geographic distance: 0.049
    Variance explained by all 3 factors: 0.259


kinyarwanda

Relevant Statistics

Percentage in-country: 49.74%
Missing countries: 171 of 243 (70.37%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.045
    Variance explained by GDP: 0.101
    Variance explained by geographic distance: 0.062
    Variance explained by all 3 factors: 0.141


luganda

Relevant Statistics

Percentage in-country: 44.07%
Missing countries: 159 of 243 (65.43%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.116
    Variance explained by GDP: 0.279
    Variance explained by geographic distance: 0.008
    Variance explained by all 3 factors: 0.284


dholuo

Relevant Statistics

Percentage in-country: 61.68%
Missing countries: 193 of 243 (79.42%)
Total Variation Distance between observed and population-proportional distribution: 95.700

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.141
    Variance explained by GDP: 0.023
    Variance explained by geographic distance: 0.039
    Variance explained by all 3 factors: -0.023


nigerian pidgin

Relevant Statistics

Percentage in-country: 16.49%
Missing countries: 161 of 243 (66.26%)
Total Variation Distance between observed and population-proportional distribution: 0.000

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.258
    Variance explained by GDP: 0.473
    Variance explained by geographic distance: -0.002
    Variance explained by all 3 factors: 0.486


wolof

Relevant Statistics

Percentage in-country: 24.20%
Missing countries: 177 of 243 (72.84%)
Total Variation Distance between observed and population-proportional distribution: 9.610

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.243
    Variance explained by GDP: 0.350
    Variance explained by geographic distance: -0.004
    Variance explained by all 3 factors: 0.368


swahili

Relevant Statistics

Percentage in-country: 16.97%
Missing countries: 156 of 243 (64.20%)
Total Variation Distance between observed and population-proportional distribution: 40.366

We also trained a linear model to find socioeconomic correlates of the datasets
    Variance explained by population: 0.300
    Variance explained by GDP: 0.341
    Variance explained by geographic distance: 0.100
    Variance explained by all 3 factors: 0.441





Antonios Anastasopoulos
Assistant Professor

I work on multilingual models, machine translation, speech recognition, and NLP for under-served languages.

Fahim Faisal
PhD Student

My name is Fahim Faisal. My academic interests include various aspects of computational linguistics and natural language processing (e.g., machine translation). Currently, I am working on a project on semi-supervised learning of morphological processes in language.
