The impact of sentence alignment errors on phrase-based machine translation performance

By National Research Council Canada

Download	View accepted manuscript: The impact of sentence alignment errors on phrase-based machine translation performance (PDF, 564 KiB)
Author	Goutte, Cyril; Carpuat, Marine; Foster, George
Affiliation	National Research Council Canada. Information and Communication Technologies
Format	Text, Article
Conference	The Tenth Biennial Conference of the Association for Machine Translation in the Americas, 28 October-1 November 2012, San Diego, California, USA
Abstract	When parallel or comparable corpora are harvested from the web, there is typically a tradeoff between the size and quality of the data. In order to improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. But, at the same time, Statistical Machine Translation (SMT) systems are widely assumed to be relatively robust to sentence alignment errors. However, there is little empirical evidence to support and characterize this robustness. This contribution investigates the impact of sentence alignment errors on a typical phrase-based SMT system. We confirm that SMT systems are highly tolerant to noise, and that performance only degrades seriously at very high noise levels. Our findings suggest that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. Although fixing errors, when applicable, is a preferable strategy to removal, its benefits only become apparent for fairly high misalignment rates. We provide several explanations to support these findings.
Publication date	2012-11
In	Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (November 2012).
Language	English
Peer reviewed	Yes
NPARC number	21268097
Record identifier	6aeda7ee-6f72-466f-9c7a-56c71e481d52
Record created	2013-04-09
Record modified	2020-06-04

Date modified: 2024-12-24