| Download | View accepted manuscript: Improving parallel data identification using iteratively refined sentence alignments and bilingual mappings of pre-trained language models (PDF, 382 KiB) |
|---|
| Author | Lo, Chi-Kiu; Joanis, Eric |
|---|
| Affiliation | National Research Council Canada. Digital Technologies |
|---|
| Format | Text, Article |
|---|
| Conference | The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), November 19-20, 2020 [Held Online] |
|---|
| Abstract | The National Research Council of Canada’s team submissions to the parallel corpus filtering task at the Fifth Conference on Machine Translation are based on two key components: (1) iteratively refined statistical sentence alignments for extracting sentence pairs from document pairs and (2) a crosslingual semantic textual similarity metric based on a pretrained multilingual language model, XLM-RoBERTa, with bilingual mappings learnt from a minimal amount of clean parallel data for scoring the parallelism of the extracted sentence pairs. The translation quality of the neural machine translation systems trained and fine-tuned on the parallel data extracted by our submissions improved significantly when compared to the organizers’ LASER-based baseline, a sentence-embedding method that worked well last year. For re-aligning the sentences in the document pairs (component 1), our statistical approach has outperformed the current state-of-the-art neural approach in this low-resource context. |
|---|
| Publication date | 2020-11-19 |
|---|
| Date created | 2020-11-30 |
|---|
| Publisher | Association for Computational Linguistics |
|---|
| Language | English |
|---|
| Peer reviewed | Yes |
|---|
| Record identifier | 520de843-ed04-446c-83cc-d121adf6aa90 |
|---|
| Record created | 2020-11-30 |
|---|
| Record modified | 2020-11-30 |
|---|