Coarse “split and lump” bilingual language models for richer source information in SMT

By National Research Council Canada

Download	View accepted manuscript: Coarse “split and lump” bilingual language models for richer source information in SMT (PDF, 938 KiB)
Author	Stewart, Darlene¹; Kuhn, Roland¹; Joanis, Eric¹; Forster, George¹
Affiliation	National Research Council Canada. Information and Communication Technologies
Format	Text, Article
Conference	11th Conference of the Association for Machine Translation in the Americas (AMTA), October 22-26, 2014, Vancouver, British Columbia, Canada
Subject	aluminum; computational linguistics; computer aided language translation; speech transmission; syntactics; automatically generated; bilingual language model; coarse models; contextual information; language pairs; parts of speech; statistical machine translation; word clustering
Abstract	Recently, there has been interest in automatically generated word classes for improving statistical machine translation (SMT) quality: e.g., (Wuebker et al., 2013). We create new models by replacing words with word classes in features applied during decoding; we call these “coarse models”. We find that coarse versions of the bilingual language models (biLMs) of (Niehues et al., 2011) yield larger BLEU gains than the original biLMs. BiLMs provide phrase-based systems with rich contextual information from the source sentence; because they have a large number of types, they suffer from data sparsity. Niehues et al. (2011) mitigated this problem by replacing source or target words with parts of speech (POSs). We vary their approach in two ways: by clustering words on the source or target side over a range of granularities (word clustering), and by clustering the bilingual units that make up biLMs (bitoken clustering). We find that loglinear combinations of the resulting coarse biLMs with each other and with coarse LMs (LMs based on word classes) yield even higher scores than single coarse models. When we add an appealing “generic” coarse configuration chosen on English > French devtest data to four language pairs (keeping the structure fixed, but providing language-pair-specific models for each pair), BLEU gains on blind test data against strong baselines averaged over 5 runs are +0.80 for English > French, +0.35 for French > English, +1.0 for Arabic > English, and +0.6 for Chinese > English.
Publication date	2014-10
Publisher	Association for Machine Translation in the Americas
In	Proceedings of the Eleventh Conference of the Association for Machine Translation in the Americas (AMTA), 2014 1: 28–41.
Language	English
Peer reviewed	Yes
NPARC number	23001442
Record identifier	b9054aec-0086-41ab-b1b8-11dddc2cca51
Record created	2017-02-08
Record modified	2023-04-24

Date modified: 2024-12-22