Abstract | This document reports on project "Document Workflow (ACA)", part of the BtB-NRC Collaboration Agreement 2021-22, titled "Artificial Intelligence for Translation Quality" (AI4TQ).
This project is about building computer support tools for the Translation Bureau's client advisors, specifically tools that identify the specialty domain of texts submitted for translation. In a previous project, we used Bureau data to create classifiers that identify the domain to which a given document belongs, with an accuracy close to 80%. We delivered a system to the Bureau, called the "Assistant Client advisor" (ACA), which provides document classification as a web service, accessible through both a web-based user interface (UI) and an Application Programming Interface (API).
In this project, we have greatly expanded this system by developing a set of functionalities for creating, updating, evaluating and deploying domain predictors. The API and UI of this new system will allow the Bureau to create and maintain domain predictors themselves.
In addition, we have experimented with approaches to improve prediction accuracy, most notably through neural networks. The new API allows creating and using predictors based on the FastText neural network technology, in addition to the algorithms previously available, SVM and ProbCat.
In a series of experiments on Confidence Estimation, we have analyzed the performance of the classifiers and the relationship between classification accuracy and some numerical indicators produced by the classifiers. The aim is to distinguish documents that can be handled automatically from documents that should be verified by a client advisor, so as to minimize both domain prediction errors and human workload.
Finally, we have added functionalities to segment large documents into smaller pieces, based on the predicted domain of individual segments of text. |
---|