Boris Chidlovskii
Proc. of 14th IEEE International Conference On Tools with Artificial Intelligence, Washington DC, USA, Nov. 4-6, 2002.
We address the problem of automatic maintenance of Web wrappers used in data integration systems to
encapsulate access to Web information providers. The maintenance of Web wrappers is critical, as providers
often change the page format and/or structure, making wrappers inoperable. The solution we propose extends
the conventional wrapper architecture with a novel component of automatic maintenance and recovery. We
treat automatic recovery as a special type of classification problem and use ensemble methods of
machine learning to build alternative views of provider pages. We combine extraction rules of conventional
wrappers with content features of extracted information to accurately recover from three types of format
changes, namely content, context, and structural changes. We report results of the recovery performance for
format changes at widely used Web providers.
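
The idea of treating recovery as classification over several alternative views can be illustrated with a minimal sketch. This is not the paper's algorithm: the three toy "views" below (content, context, structure), the field labels, the regular expressions, and the function names are all hypothetical and chosen only to show how a majority vote can still label an extracted string correctly when one view is broken by a format change.

```python
import re
from collections import Counter

# All names, labels, and rules below are illustrative assumptions.

def content_view(text):
    """Guess a field label from features of the extracted string itself."""
    if re.fullmatch(r"\$\d+(\.\d{2})?", text):
        return "price"
    if re.fullmatch(r"(19|20)\d{2}", text):
        return "year"
    return "title"

def context_view(prefix):
    """Guess a field label from the markup just before the string.

    Returns None when the prefix is unknown, e.g. after a context change.
    """
    rules = {"<b>": "title", "<i>": "year", "<span>": "price"}
    return rules.get(prefix)

def structure_view(position):
    """Guess a field label from the string's position within a record."""
    order = ["title", "year", "price"]
    return order[position] if position < len(order) else None

def recover_label(text, prefix, position):
    """Combine the views by majority vote, ignoring views that abstain."""
    votes = [
        vote
        for vote in (
            content_view(text),
            context_view(prefix),
            structure_view(position),
        )
        if vote is not None
    ]
    return Counter(votes).most_common(1)[0][0]

# Context change: the provider now wraps prices in <em> instead of <span>.
print(recover_label("$12.50", "<em>", 2))
# Structural change: the price moved to the first position in the record.
print(recover_label("$9.99", "<span>", 0))
```

In both calls one view fails or disagrees, yet the remaining two views agree on the label, which is the intuition behind using an ensemble for recovery.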