A note on evaluating multilingual benchmarks

Antonis Anastasopoulos, December 2019.

tl;dr: Be careful when reporting averages over multilingual benchmarks, especially if making claims about multilinguality. Averaging by language family can also provide additional insights.

Even if at a slow pace, NLP is moving away from focusing solely on high-resource languages like English, German, and Chinese, and is expanding to building and testing systems for many more languages. Collections like Universal Dependencies [1] (which now includes 90 languages) and UniMorph [2] (110 languages) allow for training and evaluation on several languages for NLP tasks such as POS tagging, dependency parsing, and morphological inflection. Another example of a multilingual benchmark is XNLI (15 languages) [3], used for language understanding.

There have been, in fact, several recent shared tasks that included many languages in their evaluation. For instance, the SIGMORPHON 2019 shared tasks [4] examined cross-lingual transfer for morphological inflection over 100 language pairs (covering 79 unique languages) as well as contextual lemmatization and morphosyntactic description in 66 languages (107 monolingual treebanks).

A crucial detail in both cases is that the evaluation sets contain duplicates:

  1. In the case of the cross-lingual inflection task (SIGMORPHON task 1), languages appear disproportionately often as test languages. For instance, Quechua is a test language once (where participants had to transfer from Swahili), Classical Syriac is a test language twice (transferring from Arabic and Hebrew), Crimean Tatar is the test language three times, and Azeri, Bengali, and Turkmen each appear four times.

  2. In the case of SIGMORPHON task 2 (lemmatization and morphological analysis), the test data sometimes include more than one treebank per language. For example, there are two Ancient Greek treebanks, three Japanese, four Italian, five English, and five Czech treebanks.

Now, if one wants to evaluate methods/systems on their performance on some NLP task or machine learning approach (e.g. cross-lingual transfer in general), then this imbalance probably does not matter much, and averaging over datasets might be fine. However, if one wants to make claims about multilingual, universal, or language-independent systems, then the story changes; I'll outline some examples of that in a bit. Note 1: the SIGMORPHON shared task papers do *not* make such claims!

Emily Bender wrote an excellent article on this matter a couple of years ago. The issue there was initially raised because a paper had used (different) treebanks from the same language for training and testing, while making claims of cross-linguistic applicability. Note 2: the issue was addressed by the authors.

Treebank System 1: UDify System 2: UDPipe
UD_Czech-CAC 92.4 90.7
UD_Czech-CLTT 89.9 84.3
UD_Czech-FicTree 92.8 89.8
UD_Czech-PDT 92.9 91.3
UD_Czech-PUD 88.0 85.0*
UD_North_Sami-Giella 67.1 74.5
UD_Polish-LFG 94.6 94.2
UD_Polish-SZ 89.2 91.2
Average LAS: 88.4 87.6
LAS as reported in [5].
*: this number does not appear in the paper, but UDPipe is, on average, worse by 3 points in Czech than UDify, so we just use this value as an estimate for showcasing purposes.

The issue I am pointing out here does not have to do with using the same or similar languages for both training and testing (we'll come back to that when analysing the results, though), but rather has its root in the imbalance of the evaluation sets, which could lead to mistaken takeaways if our goal is to properly evaluate how good our systems are *across languages*.

In order to pinpoint the issue with averaging over evaluation sets that over-represent some languages, let's consider an artificial (and exaggerated) scenario where we have results for two dependency parsing systems (UDPipe and UDify, with results taken from this cool paper [5]) only over the North Sami, the two Polish, and the five Czech treebanks in the table above. Based on a simple macro-average over these 8 treebanks, UDify outperforms UDPipe.

Language UDify UDPipe
Czech (avg) 91.2 88.2
North Sami 67.1 74.5
Polish (avg) 91.9 92.7
Average LAS 83.4 85.1
But is the first system really better than the second? Or is the first system only better at parsing Czech specifically, so that the Czech over-representation leads to misleading takeaways? (Again, this is an artificial scenario! UDify is, I think, a better alternative to UDPipe.) As the very interesting paper of Balduzzi et al. [6] on "Re-evaluating evaluation" points out, such an unbalanced inclusion of related/similar test datasets/tasks can indeed lead to mistaken rankings of systems/agents. In fact, if we care about comparing how systems fare across languages, then I suggest we first average the performance per language and then report a macro-average over languages, as in the table above. Now, across languages, System 2 seems to be better on average!
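
To make the two aggregation schemes concrete, here is a minimal Python sketch (my own illustration, not the evaluation code of the shared tasks or of [5]) that computes both averages from the LAS numbers in the tables above; the small `language_of` helper simply extracts the language name from the UD treebank identifier:

```python
from collections import defaultdict
from statistics import mean

# LAS per treebank, copied from the table above (UDPipe's UD_Czech-PUD score is
# the estimated value marked with * there).
las = {
    "UDify": {
        "UD_Czech-CAC": 92.4, "UD_Czech-CLTT": 89.9, "UD_Czech-FicTree": 92.8,
        "UD_Czech-PDT": 92.9, "UD_Czech-PUD": 88.0, "UD_North_Sami-Giella": 67.1,
        "UD_Polish-LFG": 94.6, "UD_Polish-SZ": 89.2,
    },
    "UDPipe": {
        "UD_Czech-CAC": 90.7, "UD_Czech-CLTT": 84.3, "UD_Czech-FicTree": 89.8,
        "UD_Czech-PDT": 91.3, "UD_Czech-PUD": 85.0, "UD_North_Sami-Giella": 74.5,
        "UD_Polish-LFG": 94.2, "UD_Polish-SZ": 91.2,
    },
}

def language_of(treebank):
    # "UD_Czech-CAC" -> "Czech", "UD_North_Sami-Giella" -> "North_Sami"
    return treebank.split("_", 1)[1].split("-")[0]

for system, scores in las.items():
    # Simple macro-average over treebanks.
    over_treebanks = mean(scores.values())
    # Average per language first, then macro-average over languages.
    per_language = defaultdict(list)
    for treebank, score in scores.items():
        per_language[language_of(treebank)].append(score)
    over_languages = mean(mean(v) for v in per_language.values())
    print(f"{system}: {over_treebanks:.1f} over treebanks, {over_languages:.1f} over languages")
```

Running it reproduces the numbers above: 88.4 vs. 87.6 when averaging over treebanks, but 83.4 vs. 85.1 when averaging over languages.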

Language Family UDify UDPipe
Slavic 91.6 90.5
Uralic 67.1 74.5
Average LAS 79.3 82.5

We could go further and average the performance per language family (Czech and Polish are both Slavic languages, while North Sami is Uralic). If we do that, we end up with the table above, where the difference between the two systems has increased slightly once again.

I hope that this example has convinced you that, if we want to compare two (or several) methods across languages (and not just across datasets), then we should take this into account.

Does this, though, lead to different rankings of systems for some of these shared tasks? The answer is yes.

We obtained the systems' outputs for SIGMORPHON tasks 1 and 2, as well as the results from the CoNLL 2018 Shared Task on dependency parsing [7], and compared the simple-average rankings with the rankings obtained using our per-language averaging. In every case the final system ranking is somewhat different, even if the winning system doesn't always change.
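
The re-ranking itself is mechanical; below is a small sketch (again my own, assuming an input format of per-system, per-dataset scores rather than the shared tasks' actual file formats) of how both rankings can be derived and then compared:

```python
from statistics import mean

def rankings(per_dataset, language_of):
    """per_dataset: {system: {dataset: score}}; language_of: {dataset: language}.

    Returns (ranking under simple averaging, ranking under per-language averaging),
    each as a list of systems sorted best-first."""
    simple, by_language = {}, {}
    for system, scores in per_dataset.items():
        # Simple average over all datasets.
        simple[system] = mean(scores.values())
        # Average per language first, then macro-average over languages.
        per_lang = {}
        for dataset, score in scores.items():
            per_lang.setdefault(language_of[dataset], []).append(score)
        by_language[system] = mean(mean(v) for v in per_lang.values())
    return (sorted(simple, key=simple.get, reverse=True),
            sorted(by_language, key=by_language.get, reverse=True))
```

Systems whose positions differ between the two returned lists are the ones whose ranking changes in the tables that follow.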

Here are the details per shared task, with the potential changes in the ranking highlighted. Note 3: I am involved in organizing this year's SIGMORPHON shared task, and I will try to implement the additions/changes in analysis that I mention here.

SIGMORPHON 2019 task 1 (inflection)

      Simple Average Accuracy            Average of max accuracy per test language
Rank  System              Score          System              Score
1     CMU-03-1            58.79          CMU-03-1            62.48
2     BASE: transformer   54.25          BASE: transformer   55.96
3     IT-IST-02-1         50.14          IT-IST-02-1         52.42
4     IT-IST-01-1         48.97          IT-IST-01-1         50.23
5     Tuebingen-01-1      34.49          Tuebingen-01-1      35.65
6     AX-02-1             22.00          AX-02-1             30.16
7     Tuebingen-02-1      20.86          AX-01-1             24.75
8     AX-01-1             18.35          Tuebingen-02-1      22.21
In this case, because the repeated entries for each language actually share the exact same test set, we keep the best score for each system, so the aggregating operation for each language is a max() instead of an average().
The ranking of the last two systems does change: on the left is the official ranking, and on the right the ranking using our suggested way of aggregating scores.
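
As a sketch of what this max() aggregation looks like in code (the language pairs and accuracies below are made up purely for illustration; the real results cover 100 language pairs):

```python
from statistics import mean

# Hypothetical accuracies for one system, as (source language, test language, accuracy)
# triples; the names and numbers here are invented for illustration only.
results = [
    ("swahili", "quechua", 70.0),
    ("arabic", "classical-syriac", 88.0),
    ("hebrew", "classical-syriac", 92.0),  # same Classical Syriac test set, different source
]

# Keep the best score per test language, since repeated entries share the same test set.
best = {}
for _source, test_language, accuracy in results:
    best[test_language] = max(best.get(test_language, float("-inf")), accuracy)

# Macro-average of the per-language maxima (the right-hand column in the table above).
print(round(mean(best.values()), 2))  # -> 81.0
```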

SIGMORPHON 2019 task 2 (part 1: morphological analysis)

      Simple Average Accuracy              Avg. of avg. accuracy per test language
Rank  System                  Score        System                  Score
1     CHARLES-SAARLAND-02-2   93.23        UFALPRAGUE-01*          92.05
2     UFALPRAGUE-01*          93.19        CHARLES-SAARLAND-02-2   91.94
3     EDINBURGH-01-2          88.93        EDINBURGH-01-2          87.71
4     OHIOSTATE-01-2          87.42        OHIOSTATE-01-2          86.14
5     CMU-01-2-DataAug        86.53        CMU-01-2-DataAug        85.49
6     CARNEGIEMELLON-02-2     85.06        CARNEGIEMELLON-02-2     83.44
*: this submission used external data, so it should not really be compared to the others, but we include it here to show that even the best system could potentially change.
In this case study (as in the two following ones) the repeated entries for each language come from different treebanks, so the aggregating operation per language is an average().

Now the ranking of the two best systems changes! (see the note above)

SIGMORPHON 2019 task 2 (part 2: lemmatization)

      Simple Average Accuracy              Avg. of avg. accuracy per test language
Rank  System                  Score        System                  Score
1     UFALPRAGUE-01-2         95.78        UFALPRAGUE-01-2         94.92
2     CHARLES-SAARLAND-02-2   95.00        CHARLES-SAARLAND-02-2   93.99
3     CHARLES-SAARLAND-01-2   94.20        CHARLES-SAARLAND-01-2   93.56
4     EDINBURGH-01-2          94.02        OHIOSTATE-01-2          93.17
5     OHIOSTATE-01-2          93.43        EDINBURGH-01-2          93.05
6     CBNU-01-2               93.30        CBNU-01-2               92.99
7     RUG-02-2                93.06        ITU-01-2                92.92
8     ITU-01-2                92.72        CMU-01-2-DataAug        92.32
9     CMU-01-2-DataAug        92.51        RUG-02-2                91.97
10    CARNEGIEMELLON-02-2     92.20        CARNEGIEMELLON-02-2     91.21
11    NLPCUBE-01-2            91.43        NLPCUBE-01-2            90.43
Same as above, except now we test the systems on their lemmatization accuracy. The ranking of several systems changes.

CoNLL 2018 Shared Task on UD parsing

      Simple Macro-Averaged LAS-F1       Avg. of macro-averaged LAS-F1 per test language
Rank  System              Score          System              Score
1     HIT-SCIR            75.84          HIT-SCIR            74.23
2     TurkuNLP            73.28          TurkuNLP            72.24
3     UDPipe              73.11          UDPipe              72.04
4     LATTICE             73.02          CEA                 71.83
5     ICS                 73.02          Stanford            71.50
6     CEA                 72.56          LATTICE             71.28
7     Uppsala             72.37          Uppsala             71.21
8     Stanford            72.29          ICS                 71.14
9     AntNLP              70.90          NLP-Cube            70.07
10    NLP-Cube            70.82          ParisNLP            69.83
11    ParisNLP            70.64          AntNLP              69.27
12    SLT-Interactions    69.98          SLT-Interactions    68.51
13    IBM                 69.11          IBM                 68.32
14    UniMelb             68.66          LeisureX            68.10
15    LeisureX            68.31          UniMelb             67.86
16    KParse              66.58          Fudan               65.79
17    Fudan               66.34          KParse              65.50
18    BASELINE            65.80          BASELINE            65.07
...
Performance on parsing Universal Dependencies treebanks, measured with macro-averaged LAS-F1.

Averaging per language family

The rankings for all three of the above shared tasks would vary even more if we computed averages per language family; you can find them in the appendix. The cool thing about computing averages per language family is that it can reveal differences that shed light on the properties of the methods being compared.

I'll show an example from the SIGMORPHON morphological inflection task, comparing the performance of the top three systems. The first was our submission, which mostly relied on data augmentation along with a novel architecture [8]. The second was the IT-IST submission, which used sparse attention but without any data augmentation [9], and which is the only referenced work that actually presents averages per language family. The last system is the very competitive transformer-based baseline [10].

Note 4: Most of the data come from Indo-European languages, so we split this language family into its major subgroups for our analysis. The question of whether granularity matters (it could!) and how to actually split such large language groups is beyond the scope of this note, but we should keep it in mind!
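
In code, the family-level aggregation only needs one extra grouping step over a language-to-family mapping; the sketch below uses a tiny illustrative mapping and made-up per-language accuracies rather than the actual shared task data:

```python
from statistics import mean

# Illustrative language -> family mapping (my own, just for this sketch); note that
# Indo-European is split into its major subgroups (Slavic, Romance, ...), as per Note 4.
family = {
    "russian": "Slavic", "czech": "Slavic",
    "spanish": "Romance", "italian": "Romance",
    "turkish": "Turkic", "azeri": "Turkic",
    "quechua": "Quechuan",
}

# Hypothetical per-language accuracies for one system.
accuracy = {
    "russian": 75.0, "czech": 70.0, "spanish": 55.0, "italian": 52.0,
    "turkish": 85.0, "azeri": 80.0, "quechua": 92.0,
}

# Group the per-language scores by family and average within each group.
per_family = {}
for language, score in accuracy.items():
    per_family.setdefault(family[language], []).append(score)
family_averages = {fam: mean(scores) for fam, scores in per_family.items()}

# The final number is the macro-average over families, not over languages.
print(family_averages)
print(round(mean(family_averages.values()), 2))
```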

An overview of the results shows that our data augmentation method really works for some language families but fails for others. For instance, our result on Greek is more than 10 times better than the IT-IST system's (perhaps me being Greek helped a lot here!), and we are better by about 20 accuracy points in Indo-Aryan, more than 25 points in Indo-Iranian and Slavic languages, and about 40 points in Quechua. At the same time, though, our system is 8 times worse than the baseline (cf. accuracies of 6 and 48) on Murrinh-patha (a Southern Daly language) and 30 points worse than the baseline on the highly suffixing Uralic languages.

Language Family   #languages   CMU     IT-IST   Baseline transformer
Caucasian         1            93      94       96
Celtic            4            49.66   35.44    41.28
Semitic           2            71      57.5     62.5
Turkic            12           82.17   89.5     90.83
Romance           6            73.48   53.92    55.64
Slavic            5            72.42   40.75    46.77
Greek             1            74.6    6.9      1.9
Germanic          5            50.81   48.04    50.96
Southern Daly     1            6       42       48
Uralic            7            32      50.5     61.25
Indo-Aryan        2            68      49       48
Dravidian         1            94      82       82
Indo-Iranian      3            58.5    27.45    31.45
Baltic            1            48.4    15.3     22.5
Quechuan          1            92.1    48.2     36.7
Niger–Congo       1            81      66       70
Average:                       65.45   50.41    52.86

Acknowledgements

Many thanks to Arya McCarthy for his comments on preliminary drafts of this note!

References

[1] Nivre, Joakim, et al. "Universal dependencies v1: A multilingual treebank collection." Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016.
[2] Kirov, Christo, et al. "UniMorph 2.0: Universal Morphology." Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018). 2018.
[3] Conneau, Alexis, et al. "XNLI: Evaluating Cross-lingual Sentence Representations." Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018.
[4] McCarthy, Arya D., et al. "The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection." Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology. 2019.
[5] Kondratyuk, Dan and Straka, Milan. "75 Languages, 1 Model: Parsing Universal Dependencies Universally". Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
[6] Balduzzi, David, et al. "Re-evaluating evaluation." Advances in Neural Information Processing Systems. 2018.
[7] Zeman, Daniel, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. "CoNLL 2018 shared task: multilingual parsing from raw text to universal dependencies." Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 1-21. 2018.
[8] Anastasopoulos, Antonios, and Graham Neubig. "Pushing the Limits of Low-Resource Morphological Inflection." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
[9] Peters, Ben, and André F. T. Martins. "IT–IST at the SIGMORPHON 2019 Shared Task: Sparse Two-headed Models for Inflection." Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology. 2019.
[10] Wu, Shijie, and Ryan Cotterell. "Exact Hard Monotonic Attention for Character-Level Transduction." arXiv preprint arXiv:1905.06319 (2019).

Appendix


Ranking Results with averaging per language family

SIGMORPHON 2019 task 1 (inflection)

      Simple Average Accuracy            Avg. of max accuracy per language family
Rank  System              Score          System              Score
1     CMU-03-1            58.79          CMU-03-1            72.60
2     BASE: transformer   54.25          BASE: transformer   58.89
3     IT-IST-02-1         50.14          IT-IST-02-1         58.28
4     IT-IST-01-1         48.97          IT-IST-01-1         53.51
5     Tuebingen-01-1      34.49          Tuebingen-01-1      40.84
6     AX-02-1             22.00          AX-02-1             28.25
7     Tuebingen-02-1      20.86          AX-01-1             22.84
8     AX-01-1             18.35          Tuebingen-02-1      22.40

SIGMORPHON 2019 task 2 (part 1: morphological analysis)

      Simple Average Accuracy              Avg. of avg. accuracy per language family
Rank  System                  Score        System                  Score
1     CHARLES-SAARLAND-02-2   93.23        UFALPRAGUE-01           91.96
2     UFALPRAGUE-01           93.19        CHARLES-SAARLAND-02-2   91.71
3     EDINBURGH-01-2          88.93        EDINBURGH-01-2          88.49
4     OHIOSTATE-01-2          87.42        OHIOSTATE-01-2          86.47
5     CMU-01-2-DataAug        86.53        CMU-01-2-DataAug        85.74
6     CARNEGIEMELLON-02-2     85.06        CARNEGIEMELLON-02-2     83.87

SIGMORPHON 2019 task 2 (part 2: lemmatization)

      Simple Average Accuracy              Avg. of avg. accuracy per language family
Rank  System                  Score        System                  Score
1     UFALPRAGUE-01-2         95.78        UFALPRAGUE-01-2         95.08
2     CHARLES-SAARLAND-02-2   95.00        CHARLES-SAARLAND-02-2   93.74
3     CHARLES-SAARLAND-01-2   94.20        CHARLES-SAARLAND-01-2   93.29
4     EDINBURGH-01-2          94.02        OHIOSTATE-01-2          93.19
5     OHIOSTATE-01-2          93.43        ITU-01-2                93.18
6     CBNU-01-2               93.30        CBNU-01-2               93.13
7     RUG-02-2                93.06        EDINBURGH-01-2          92.36
8     ITU-01-2                92.72        CMU-01-2-DataAug        92.32
9     CMU-01-2-DataAug        92.51        RUG-02-2                92.22
10    CARNEGIEMELLON-02-2     92.20        CARNEGIEMELLON-02-2     91.10
11    NLPCUBE-01-2            91.43        NLPCUBE-01-2            89.01

CoNLL 2018 Shared Task on UD parsing

      Simple Macro-Averaged LAS-F1       Avg. of macro-averaged LAS-F1 per language family
Rank  System              Score          System              Score
1     HIT-SCIR            75.84          HIT-SCIR            67.58
2     TurkuNLP            73.28          TurkuNLP            65.43
3     UDPipe              73.11          CEA                 64.89
4     LATTICE             73.02          Uppsala             64.81
5     ICS                 73.02          UDPipe              64.80
6     CEA                 72.56          Stanford            64.64
7     Uppsala             72.37          LATTICE             64.41
8     Stanford            72.29          ICS                 64.27
9     AntNLP              70.90          NLP-Cube            63.09
10    NLP-Cube            70.82          AntNLP              62.78
11    ParisNLP            70.64          ParisNLP            62.47
12    SLT-Interactions    69.98          SLT-Interactions    62.23
13    IBM                 69.11          UniMelb             61.35
14    UniMelb             68.66          LeisureX            61.29
15    LeisureX            68.31          IBM                 61.16
16    KParse              66.58          KParse              59.71
17    Fudan               66.34          Fudan               59.03
18    BASELINE            65.80          CUNI                58.76
19    Phoenix             65.61          BASELINE            58.20
20    CUNI                64.87          Phoenix             58.11
...