2010/002 - A Bag-of-Pages Approach to Unordered Multi-Page Document Classification
Albert Gordo, Florent Perronnin
ICPR (International Conference on Pattern Recognition) - Istanbul, Turkey, 23-26 August 2010.
We are interested in the problem of classifying documents containing multiple unordered pages. For this purpose, we propose a novel bag-of-pages document representation. Offline, we learn a set of page clusters on a training set. To represent a new document, one assigns every page to a cluster and counts the proportion of pages assigned to each cluster. This leads to a histogram representation which can then be fed to any discriminative classifier. We consider several refinements of this initial approach: 1/ learning the clusters in a supervised manner; 2/ soft assignment of pages to clusters; and 3/ going beyond simple counting. We show on two challenging datasets that the proposed approach outperforms a simple baseline system.
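
The histogram construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each page is already described by a fixed-length feature vector, that the clusters are summarized by centroids learned offline (for instance with k-means), and that pages are assigned by squared Euclidean distance. The function name `bag_of_pages`, the `soft` flag, and the `temperature` parameter are hypothetical names chosen for illustration, and the softmax weighting is only one possible reading of "soft assignment".

```python
import numpy as np


def bag_of_pages(page_features, centroids, soft=False, temperature=1.0):
    """Return the per-cluster proportion of pages for one document.

    page_features: (n_pages, dim) array, one row per page (assumed given).
    centroids:     (n_clusters, dim) array learned offline (assumed given).
    soft:          if True, spread each page over clusters with a softmax
                   of negative squared distances (an illustrative choice).
    """
    pages = np.asarray(page_features, dtype=float)
    cents = np.asarray(centroids, dtype=float)
    if pages.ndim != 2 or pages.shape[0] == 0:
        raise ValueError("a document needs at least one page feature vector")

    # Squared Euclidean distance from every page to every centroid,
    # shape (n_pages, n_clusters).
    d2 = ((pages[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)

    if soft:
        # Softmax over clusters; subtract the row maximum for stability.
        logits = -d2 / temperature
        logits -= logits.max(axis=1, keepdims=True)
        weights = np.exp(logits)
        weights /= weights.sum(axis=1, keepdims=True)
    else:
        # Hard assignment: each page votes for its nearest centroid only.
        weights = np.zeros_like(d2)
        weights[np.arange(pages.shape[0]), d2.argmin(axis=1)] = 1.0

    # Averaging over pages ignores page order and yields proportions
    # that sum to one.
    return weights.mean(axis=0)


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [10.0, 10.0]]
    document = [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [0.0, 0.0]]
    print(bag_of_pages(document, centroids))
    print(bag_of_pages(document, centroids, soft=True))
```

Because the histogram averages over pages, shuffling the pages of a document leaves its representation unchanged, and the resulting vector has a fixed length regardless of page count, so it can be passed to any standard discriminative classifier.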