Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, Sriram Venkatapathy
The 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Republic of Korea, July 8-14, 2012.
Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific business purpose. It is well known that the larger the parallel corpus, the better the expected accuracy of the resulting system. However, creating parallel data is costly and time-intensive, so a prior assessment of how much human translation must be produced to reach a satisfactory accuracy level would be very useful. Predicting the required size of the parallel corpus is our primary goal here. In this work, we predict a learning curve that relates the size of the parallel corpus to the expected accuracy of a machine translation system. We consider two scenarios: (1) only a monolingual corpus sample in the source language is available, and (2) a small parallel corpus is available. We propose methods for predicting learning curves in both scenarios, as well as for combining the two in order to obtain a more accurate learning curve.
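
As a rough illustration of the second scenario, the sketch below fits a curve to a handful of (corpus size, score) points and inverts it to estimate how much parallel data a target score would need. The three-parameter power law y = c - a * x^(-alpha), the grid-search fitting procedure, and all names (`fit_power_law`, `predict_score`, `required_size`) are assumptions made for this example, not details taken from the abstract. The data points are synthetic, generated from known parameters so the fit can be checked; they are not measured results.

```python
import math

def fit_power_law(sizes, scores, alphas=None):
    """Fit y = c - a * x**(-alpha) (an assumed curve family).

    For a fixed alpha the model is linear in (c, a), so we grid-search
    alpha and solve ordinary least squares in closed form at each step.
    """
    if alphas is None:
        alphas = [k / 100 for k in range(1, 201)]  # hypothetical grid: 0.01 .. 2.00
    n = len(sizes)
    y_mean = sum(scores) / n
    best = None
    for alpha in alphas:
        # Rewrite the model as y = c + a * z with z = -(x ** -alpha).
        z = [-(x ** -alpha) for x in sizes]
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue  # degenerate: all sizes identical
        a = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores)) / szz
        c = y_mean - a * z_mean
        sse = sum((c + a * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, c, a, alpha)
    _, c, a, alpha = best
    return c, a, alpha

def predict_score(size, c, a, alpha):
    """Expected score at a given corpus size under the fitted curve."""
    return c - a * size ** -alpha

def required_size(target, c, a, alpha):
    """Corpus size at which the fitted curve reaches `target`.

    Returns infinity when the target is at or above the asymptote c.
    """
    if target >= c:
        return math.inf
    return (a / (c - target)) ** (1 / alpha)

# Synthetic observations generated from known parameters (not real data).
true_c, true_a, true_alpha = 0.40, 2.0, 0.30
sizes = [1000, 2000, 5000, 10000, 20000]
scores = [predict_score(x, true_c, true_a, true_alpha) for x in sizes]

c, a, alpha = fit_power_law(sizes, scores)
print("fitted parameters:", c, a, alpha)
print("sentence pairs needed for a score of 0.35:", required_size(0.35, c, a, alpha))
```

With real measurements the fit would be noisy, and a small number of points at small corpus sizes constrains the asymptote `c` only weakly, which is one reason combining several sources of evidence about the curve can help.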