2005/016 - From Legacy Documents to XML: A Conversion Framework
- Jean-Pierre Chanod, Boris Chidlovskii, Hervé Dejean, Olivier Fambon, Jérôme Fuselier, Thierry Jacquin, Jean-Luc Meunier
9th European Conference on Research and Advanced Technology for Digital Libraries, Vienna, Austria, September 18-23, 2005.
We present an integrated framework for converting documents from legacy formats to XML. We describe the LegDoC project, which aims to automate the conversion of layout annotations in layout-oriented formats such as PDF, PS and HTML into semantic-oriented annotations. A toolkit of components covers complementary techniques, including logical document analysis and semantic annotation with machine learning methods. We use a real conversion project as a running example to illustrate the techniques implemented in the project.