2005/025 - HTML-to-XML Migration by means of sequential learning and grammatical inference
Boris Chidlovskii, Jérôme Fuselier
IJCAI 05 Workshop on Grammatical Inference Applications, Edinburgh, Scotland, 30 July, 2005
We consider the problem of converting documents from layout-oriented HTML into semantic-oriented XML annotation. An important fragment of the conversion problem can be reduced to the sequential learning framework, where the leaves of the source tree are labeled with XML tags. We review sequential learning methods developed for NLP applications, including Naive Bayes and Maximum Entropy. We then extend these methods with a hidden Markov model (HMM) that injects transition probabilities into the leaf classification function. Finally, we address the issue of HMM topology: we adopt grammatical inference methods to induce the topology and show how to extend the sequential learning methods accordingly. We test all methods on a particular conversion case and report the evaluation results.
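
To illustrate how transition probabilities can be injected into leaf classification, the following is a minimal sketch of Viterbi decoding over a sequence of leaves. It is not the authors' implementation. The function name `viterbi`, the label set, and all probability values are hypothetical, and using per-leaf classifier probabilities (such as Naive Bayes or Maximum Entropy outputs) in place of HMM emission scores is a simplifying assumption made here.

```python
import math


def viterbi(leaf_probs, trans, init, labels):
    """Return the highest-scoring label sequence for a sequence of leaves.

    leaf_probs: one dict per leaf, mapping label -> classifier probability
                (assumed stand-in for emission scores)
    trans:      dict mapping (previous_label, label) -> transition probability
    init:       dict mapping label -> probability of starting the sequence
    """
    if not leaf_probs:
        return []

    def log(p):
        # Work in log space; a zero probability becomes -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][y]: best log-score of any labeling of leaves 0..i that ends in y
    best = [{y: log(init.get(y, 0.0)) + log(leaf_probs[0].get(y, 0.0))
             for y in labels}]
    back = []  # back[i][y]: best predecessor of label y at leaf i + 1
    for probs in leaf_probs[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + log(trans.get((p, y), 0.0)))
            scores[y] = (best[-1][prev] + log(trans.get((prev, y), 0.0))
                         + log(probs.get(y, 0.0)))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical example: three text leaves, two candidate XML tags.
labels = ["title", "author"]
init = {"title": 0.9, "author": 0.1}
trans = {("title", "title"): 0.1, ("title", "author"): 0.9,
         ("author", "author"): 0.8, ("author", "title"): 0.2}
leaf_probs = [{"title": 0.8, "author": 0.2},
              {"title": 0.55, "author": 0.45},
              {"title": 0.3, "author": 0.7}]

print(viterbi(leaf_probs, trans, init, labels))
# prints ['title', 'author', 'author']
```

Classifying each leaf independently would tag the second leaf as `title` (0.55 against 0.45). The transition probabilities make a `title` followed by another `title` unlikely, so the decoded sequence assigns `author` instead.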