Boris Chidlovskii
European Conference on Machine Learning, Freiburg, Germany, September 3-7, 2001
Modern agent and mediator systems communicate with a multitude of Web information providers to better satisfy
user requests. They use wrappers to extract relevant information from HTML responses and to annotate it with
user-defined labels. A number of approaches use machine learning methods to induce instances of
certain wrapper classes, by assuming a tabular structure in HTML responses and by observing the regularity
of extracted fragments in the HTML structure. In this work, we propose a more general approach that treats the
information extraction performed by wrappers as a special form of transduction. We make no assumption
about the HTML response structure and draw on advanced methods of transducer induction to
develop two powerful wrapper classes, for samples with and without ambiguous translations. We test the
proposed induction methods on a set of general-purpose and bibliographic data providers and report the
experimental results.