| Download | - View the final version: UniversalCEFR: enabling open multilingual research on language proficiency assessment (PDF, 1.7 MiB)
|
|---|
| DOI | https://doi.org/10.18653/v1/2025.emnlp-main.491 |
|---|
| Authors | Imperial, Joseph Marvin; Barayan, Abdullah; Stodden, Regina; Wilkens, Rodrigo; Muñoz Sánchez, Ricardo; Gao, Lingyun; Torgbi, Melissa; Knight, Dawn; Forey, Gail; Jablonkai, Reka R.; Kochmar, Ekaterina; Reynolds, Robert Joshua; Ribeiro, Eugénio; Saggion, Horacio; Volodina, Elena; Vajjala, Sowmya [1] (ORCID: https://orcid.org/0000-0002-4033-9936); François, Thomas; Alva-Manchego, Fernando; Tayyar Madabushi, Harish |
|---|
| Affiliation | - National Research Council Canada. Digital Technologies
|
|---|
| Format | Text, Article |
|---|
| Conference | 2025 Conference on Empirical Methods in Natural Language Processing, November 4-9, 2025, Suzhou, China |
|---|
| Abstract | We introduce UniversalCEFR, a large-scale multilingual multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference) scale in 13 languages. To enable open research in both automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modeling across tasks and languages. To demonstrate its utility, we conduct benchmark experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results further support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution in language proficiency research by standardising dataset formats and promoting their accessibility to the global research community. |
|---|
| Publication date | 2025-11 |
|---|
| Publisher | Association for Computational Linguistics |
|---|
| Location | Stroudsburg, Pennsylvania, United States |
|---|
| Language | English |
|---|
| Peer-reviewed publication | Yes |
|---|
| Record identifier | 6f760d65-0398-413b-96b0-c1e3bf3539f2 |
|---|
| Record created | 2025-11-28 |
|---|
| Record modified | 2026-02-19 |
|---|