Download | - View accepted manuscript: Improving parallel data identification using iteratively refined sentence alignments and bilingual mappings of pre-trained language models (PDF, 382 KiB)
|
---|
Author | Lo, Chi-Kiu1; Joanis, Eric1 |
---|
Affiliation | - National Research Council of Canada. Digital Technologies
|
---|
Format | Text, Article |
---|
Conference | The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 19-20, 2020 [Held Online] |
---|
Abstract | The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively refined statistical sentence alignments for extracting sentence pairs from document pairs and (2) a crosslingual semantic textual similarity metric based on a pretrained multilingual language model, XLM-RoBERTa, with bilingual mappings learnt from a minimal amount of clean parallel data for scoring the parallelism of the extracted sentence pairs. The translation quality of the neural machine translation systems trained and fine-tuned on the parallel data extracted by our submissions improved significantly when compared to the organizers’ LASER-based baseline, a sentence-embedding method that worked well last year. For re-aligning the sentences in the document pairs (component 1), our statistical approach has outperformed the current state-of-the-art neural approach in this low-resource context. |
---|
Publication date | 2020-11-19 |
---|
Date created | 2020-11-30 |
---|
Publisher | Association for Computational Linguistics |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
Record identifier | 520de843-ed04-446c-83cc-d121adf6aa90 |
---|
Record created | 2020-11-30 |
---|
Record modified | 2020-11-30 |
---|