Boris Chidlovskii
IJCAI-03 Workshop on Information Integration on the Web
Information extraction from HTML pages has conventionally treated the pages as plain text documents extended
with HTML tags. However, the growing maturity and correct usage of HTML/XHTML formats open an
opportunity to treat Web pages as trees, to mine the rich structural context in the trees and to learn accurate
extraction rules. In this paper, we generalize the notion of a delimiter, developed for information
extraction from strings, to tree documents. Similar to delimiters in strings, we define delimiters in tree documents as
subtrees surrounding the text leaves. We formalize wrapper induction for tree documents as learning the
classification rules based on the subtree delimiters. We analyze a restricted case of subtree delimiters in the
form of simple paths. We design an efficient data structure for storing candidate delimiters and an incremental
algorithm for finding the most discriminative subtree delimiters for the wrapper.
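
To make the restricted case of simple-path delimiters concrete, the following is a minimal sketch and not the paper's algorithm or data structure. It assumes well-formed XHTML, treats the root-to-leaf tag path as the only delimiter, and scores each candidate path by covered positives minus covered negatives. The page content, the function names `text_leaves` and `best_path`, and the scoring rule are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample page; the prices are the values we want to extract.
PAGE = """
<html><body>
  <table>
    <tr><td><b>Widget</b></td><td><i>9.99</i></td></tr>
    <tr><td><b>Gadget</b></td><td><i>19.50</i></td></tr>
  </table>
  <p><i>Prices include tax</i></p>
</body></html>
"""

def text_leaves(node, path=()):
    """Yield (simple_path, text) for every non-empty text leaf under node."""
    path = path + (node.tag,)
    if node.text and node.text.strip():
        yield "/".join(path), node.text.strip()
    for child in node:
        yield from text_leaves(child, path)
        # Text following a child element still belongs to the current node.
        if child.tail and child.tail.strip():
            yield "/".join(path), child.tail.strip()

def best_path(leaves, positives):
    """Pick the path covering the most labeled leaves and the fewest others."""
    pos = Counter(p for p, t in leaves if t in positives)
    neg = Counter(p for p, t in leaves if t not in positives)
    # Assumed score: positives covered minus negatives covered.
    return max(pos, key=lambda p: pos[p] - neg[p])

leaves = list(text_leaves(ET.fromstring(PAGE)))
labeled_prices = {"9.99", "19.50"}          # user-provided training labels
rule = best_path(leaves, labeled_prices)
extracted = [t for p, t in leaves if p == rule]

print(rule)
print(extracted)
```

In this toy page, a rule based on the enclosing tag `i` alone would also match the footnote, whereas the full path from the root separates the price cells from it. This is the kind of structural context that a plain-text view of the page does not expose directly.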