Albert Gordo
CVPR 2015 : 28th Conference on Computer Vision and Pattern Recognition, Boston, USA, June 7-12, 2015.
This paper addresses the problem of learning word image
representations: given the cropped image of a word, we
are interested in finding a descriptive, robust, and compact
fixed-length representation. Machine learning techniques
can then be supplied with these representations to produce
models useful for word retrieval or recognition tasks. Although
many works have focused on the machine learning
aspect once a global representation has been produced, little
work has been devoted to the construction of those base
image representations: most works use standard coding and
aggregation techniques directly on top of standard computer
vision features such as SIFT or HOG.
We propose to learn local mid-level features suitable for
building word image representations. These features are
learnt by leveraging character bounding box annotations
on a small set of training images. However, contrary to
other approaches that use character bounding box information,
our approach does not rely on detecting the individual
characters explicitly at testing time. Our local mid-level
features can then be aggregated to produce a global
word image signature. When pairing these features with
the recent word attributes framework of [4], we obtain results
comparable with or better than the state-of-the-art on
matching and recognition tasks using global descriptors of only 96 dimensions.