Loïc Lecerf, Boris Chidlovskii
CORIA, 4ème Conférence en Recherche d'Information et Applications, Saint-Etienne, 28-30 March 2007.
In the framework of the LegDoc project at Xerox Research Centre Europe, we are developing components for the semantic annotation of semi-structured documents. While certain semantic entities have regular forms and can be extracted easily, more complex and heterogeneous collections call for machine learning methods. Moreover, real-world cases pose a technical challenge: training sets for specific annotation tasks are unavailable. As manual annotation is costly and error-prone, our approach applies active learning methods in order to considerably reduce the size of the annotated corpus required to learn accurate models. In this paper, we explain how active learning principles are adapted to the interactive semantic annotation of layout-oriented documents. We use a maximum entropy classifier for probabilistic reasoning and three uncertainty metrics for the efficient application of active learning to large collections. We present the Active Learning Document Annotation Interface (ALDAI) prototype and describe its functionality and implementation choices. The prototype offers a WYSIWYG interface and a high-level language for feature definitions, and integrates an active learning component that helps users during the annotation process. We also report evaluation results from testing the active learning techniques on one public (UCI) and one internal document collection.
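The abstract does not say which three uncertainty metrics are used. As a minimal sketch only, the example below assumes three measures that are common in active learning (least confidence, margin, and entropy) and applies them to class probabilities such as a maximum entropy classifier might output. The function names, element identifiers, and probability values are hypothetical and are not taken from ALDAI.

```python
import math
from typing import Callable, Dict, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty as 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_uncertainty(probs: Sequence[float]) -> float:
    """Uncertainty as 1 minus the gap between the two most likely classes.

    Requires at least two classes.
    """
    top = sorted(probs, reverse=True)
    return 1.0 - (top[0] - top[1])


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_uncertain(
    predictions: Dict[str, Sequence[float]],
    metric: Callable[[Sequence[float]], float],
) -> str:
    """Return the id of the element the annotator should label next."""
    return max(predictions, key=lambda item_id: metric(predictions[item_id]))


# Hypothetical class probabilities for three unlabeled document elements.
predictions = {
    "elem_1": [0.90, 0.05, 0.05],
    "elem_2": [0.40, 0.35, 0.25],
    "elem_3": [0.50, 0.49, 0.01],
}

for metric in (least_confidence, margin_uncertainty, entropy):
    print(metric.__name__, most_uncertain(predictions, metric))
```

In an interactive loop, the selected element would be shown to the user for annotation, added to the training set, and the classifier retrained before the next selection. The metrics can disagree: here the margin measure picks a different element than the other two, because it looks only at the two leading classes.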