Generalized Data Augmentation for Low-Resource Translation

Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, Graham Neubig

July 2019

Abstract

Low-resource language pairs with a paucity of parallel data pose challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing a large amount of monolingual data is regarded as an effective way to alleviate the problem. In this paper, we propose a general framework of data augmentation for low-resource machine translation not only using target-side monolingual data, but also by pivoting through a related high-resource language. Specifically, we experiment with a two-step pivoting method to convert high-resource data to the low-resource language, making best use of available resources to better approximate the true distribution of the low-resource language. First, we inject low-resource words into high-resource sentences through an induced bilingual dictionary. Second, we further edit the high-resource data injected with low-resource words using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by up to 1.5 to 8 BLEU points compared to supervised back-translation baselines.
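The first pivoting step described in the abstract, injecting low-resource words into high-resource sentences through a dictionary, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `inject_lrl_words`, the whitespace tokenization, and the toy dictionary with placeholder tokens are all assumptions made for illustration, and the paper's induced bilingual dictionary is replaced here by a hand-written mapping.

```python
from typing import Dict


def inject_lrl_words(hrl_sentence: str, hrl_to_lrl: Dict[str, str]) -> str:
    """Replace each high-resource (HRL) token that has a dictionary entry
    with its low-resource (LRL) counterpart; leave other tokens unchanged.

    Assumes simple whitespace tokenization (an illustrative simplification).
    """
    tokens = hrl_sentence.split()
    return " ".join(hrl_to_lrl.get(token, token) for token in tokens)


# Hypothetical toy dictionary with placeholder tokens, standing in for an
# induced HRL-to-LRL bilingual dictionary.
toy_dictionary = {"hrl_cat": "lrl_cat", "hrl_sleeps": "lrl_sleeps"}

pseudo_lrl_sentence = inject_lrl_words("the hrl_cat hrl_sleeps", toy_dictionary)
print(pseudo_lrl_sentence)
```

Tokens without a dictionary entry pass through unchanged, so the output is a mixed sentence. The abstract's second step, further editing such sentences with a modified unsupervised machine translation framework, is not covered by this sketch.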

Type

Conference paper

Publication

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

MT multilingual NLP

Antonios Anastasopoulos

Assistant Professor

I work on multilingual models, machine translation, speech recognition, and NLP for under-served languages.
