Boris Chidlovskii, Jérôme Fuselier
Conférence Francophone sur Apprentissage automatique 2005, Maison du Séminaire, May 30-June 3, 2005
We consider the problem of semantically annotating semi-structured documents according to a target XML schema. The task is to annotate a document in a tree-like manner, where the annotation tree is an instance of a tree class defined by a DTD or a W3C XML Schema description. In the probabilistic setting, we cast tree annotation as generalized probabilistic context-free parsing of an observation sequence, in which each observation comes with a probability distribution over terminals supplied by a probabilistic classifier applied to the document content. We determine the most probable tree annotation by maximizing the joint probability of selecting a terminal sequence for the observation sequence and of the most probable parse for the selected terminal sequence. We extend the inside-outside algorithm for probabilistic context-free grammars and establish a Naive Bayes-like requirement that the content classifier should satisfy when estimating the terminal probabilities.
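The joint maximization described above can be illustrated with a small sketch. It is not the authors' algorithm: it is a plain Viterbi-style CYK parser over a grammar assumed to be in Chomsky normal form, in which each lexical cell is initialized with the best product of classifier probability and lexical rule probability. The function name `best_annotation`, the toy grammar (`Doc`, `Title`, `Body`), and all probability values are hypothetical and chosen only for illustration.

```python
def best_annotation(obs_dists, lexical, binary, start):
    """Most probable (terminal sequence, parse) pair for an observation sequence.

    obs_dists: list of dicts, terminal -> p(terminal | observation i)
               (assumed to come from a content classifier)
    lexical:   dict (A, t) -> p(A -> t)
    binary:    dict (A, B, C) -> p(A -> B C)
    Returns (probability, tree), or (0.0, None) if no parse exists.
    """
    n = len(obs_dists)
    # chart[i][j][A] = (best probability, backpointer) for the span i..j
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Width-1 spans: choose the terminal jointly with the lexical rule.
    for i, dist in enumerate(obs_dists):
        cell = chart[i][i + 1]
        for (A, t), p_rule in lexical.items():
            p = dist.get(t, 0.0) * p_rule
            if p > cell.get(A, (0.0, None))[0]:
                cell[A] = (p, t)

    # Wider spans: standard Viterbi CYK recursion over split points.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = chart[i][j]
            for k in range(i + 1, j):
                left, right = chart[i][k], chart[k][j]
                for (A, B, C), p_rule in binary.items():
                    if B in left and C in right:
                        p = p_rule * left[B][0] * right[C][0]
                        if p > cell.get(A, (0.0, None))[0]:
                            cell[A] = (p, (k, B, C))

    if start not in chart[0][n]:
        return 0.0, None

    def build(i, j, A):
        # Follow backpointers to rebuild the annotation tree.
        _, bp = chart[i][j][A]
        if j == i + 1:
            return (A, bp)
        k, B, C = bp
        return (A, build(i, k, B), build(k, j, C))

    return chart[0][n][start][0], build(0, n, start)


# Hypothetical toy schema: a Doc is a Title followed by a Body.
binary = {("Doc", "Title", "Body"): 1.0}
lexical = {("Title", "title"): 1.0, ("Body", "para"): 1.0}
# The classifier prefers "para" for both observations.
obs = [{"title": 0.4, "para": 0.6}, {"title": 0.1, "para": 0.9}]

prob, tree = best_annotation(obs, lexical, binary, "Doc")
print(tree)
```

In this toy case, labelling each observation independently would choose `para` twice, which the schema cannot parse. The joint search instead selects `title` followed by `para`, with probability of about 0.36, because it only considers terminal sequences that admit a valid tree.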