Download | View final version: Beyond correlation: making sense of the score differences of new MT evaluation metrics (PDF, 1.6 MiB) |
---|
Author | Lo, Chi-kiu; Knowles, Rebecca; Goutte, Cyril |
---|
Affiliation | National Research Council of Canada, Digital Technologies |
---|
Format | Text, Article |
---|
Conference | Machine Translation Summit (MTS) 2023, September 4–8, 2023, Macau SAR, China |
---|
Abstract | While many new automatic metrics for machine translation evaluation have been proposed in recent years, BLEU scores are still used as the primary metric in the vast majority of MT research papers. There are many reasons that researchers may be reluctant to switch to new metrics, from external pressures (reviewers, prior work) to the ease of use of metric toolkits. Another reason is a lack of intuition about the meaning of novel metric scores. In this work, we examine “rules of thumb” about metric score differences and how they do (and do not) correspond to human judgments of statistically significant differences between systems. In particular, we show that common rules of thumb about BLEU score differences do not in fact guarantee that human annotators will find significant differences between systems. We also show ways in which these rules of thumb fail to generalize across translation directions or domains. |
---|
Publication date | 2023-09-08 |
---|
Publisher | Asia-Pacific Association for Machine Translation |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
Record identifier | 250b820e-150c-477e-8a6c-da1bb4ee65ea |
---|
Record created | 2023-09-18 |
---|
Record modified | 2023-09-20 |
---|