| Download | View accepted manuscript: Phrase clustering for smoothing TM probabilities – or, how to extract paraphrases from phrase tables (PDF, 687 KiB) |
|---|
| Author | Kuhn, Roland¹; Chen, Boxing¹; Foster, George¹; Stratford, Evan |
|---|
| Affiliation | National Research Council Canada. NRC Institute for Information Technology |
|---|
| Format | Text, Article |
|---|
| Conference | The 23rd International Conference on Computational Linguistics (COLING 2010), August 23-27, 2010, Beijing, China |
|---|
| Subject | Information and Communication Technologies |
|---|
| Abstract | This paper describes how to cluster together the phrases of a phrase-based statistical machine translation (SMT) system, using information in the phrase table itself. The clustering is symmetric and recursive: it is applied both to source-language and target-language phrases, and the clustering in one language helps determine the clustering in the other. The phrase clusters have many possible uses. This paper looks at one of these uses: smoothing the conditional translation model (TM) probabilities employed by the SMT system. We incorporated phrase-cluster-derived probability estimates into a baseline loglinear feature combination that included relative frequency and lexically-weighted conditional probability estimates. In Chinese-English (C-E) and French-English (F-E) learning curve experiments, we obtained a gain over the baseline in 29 of 30 tests, with a maximum gain of 0.55 BLEU points (though most gains were fairly small). The largest gains came with medium (200-400K sentence pairs) rather than with small (less than 100K sentence pairs) amounts of training data, contrary to what one would expect from the paraphrasing literature. We have only begun to explore the original smoothing approach described here. |
|---|
| Publication date | 2010-08-27 |
|---|
| Language | English |
|---|
| Peer reviewed | Yes |
|---|
| NPARC number | 15736686 |
|---|
| Record identifier | 68e35bd5-b0b9-4e25-8be2-36e382b8aa1b |
|---|
| Record created | 2010-07-05 |
|---|
| Record modified | 2020-04-17 |
|---|