The impact of sentence alignment errors on phrase-based machine translation performance

From National Research Council Canada

Author	Goutte, Cyril; Carpuat, Marine; Foster, George
Affiliation	National Research Council of Canada. Information and Communication Technologies
Format	Text, Article
Conference	The Tenth Biennial Conference of the Association for Machine Translation in the Americas, 28 October-1 November 2012, San Diego, California, USA
Abstract	When parallel or comparable corpora are harvested from the web, there is typically a tradeoff between the size and quality of the data. In order to improve quality, corpus collection efforts often attempt to fix or remove misaligned sentence pairs. But, at the same time, Statistical Machine Translation (SMT) systems are widely assumed to be relatively robust to sentence alignment errors. However, there is little empirical evidence to support and characterize this robustness. This contribution investigates the impact of sentence alignment errors on a typical phrase-based SMT system. We confirm that SMT systems are highly tolerant to noise, and that performance only degrades seriously at very high noise levels. Our findings suggest that when collecting larger, noisy parallel data for training phrase-based SMT, cleaning up by trying to detect and remove incorrect alignments can actually degrade performance. Although fixing errors, when applicable, is a preferable strategy to removal, its benefits only become apparent for fairly high misalignment rates. We provide several explanations to support these findings.
Publication date	2012-11
In	Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (November 2012).
Language	English
Peer reviewed	Yes
NPARC number	21268097
Record identifier	6aeda7ee-6f72-466f-9c7a-56c71e481d52
Record created	2013-04-09
Record modified	2020-06-04

Date modified: 2024-12-24