José A. Rodriguez, Florent Perronnin
24th British Machine Vision Conference, University of Bristol, UK, 9-13 September 2013. Full paper available on the BMVC website.
The standard approach to recognizing text in images consists of first classifying local
image regions into candidate characters and then combining them with high-level word
models such as conditional random fields (CRF). This paper explores a new paradigm
that departs from this bottom-up view. We propose to embed word labels and word
images into a common Euclidean space. Given a word image to be recognized, the
text recognition problem is cast as one of retrieval: find the closest word label in this
space. This common space is learned using the Structured SVM (SSVM) framework by
enforcing matching label-image pairs to be closer than non-matching pairs. This method
presents the following advantages: it does not require costly pre- or post-processing
operations, it allows for the recognition of never-seen-before words, and the recognition
process is efficient. Experiments are performed on two challenging datasets (one of
license plates and one of scene text) and show that the proposed method is competitive
with standard bottom-up approaches to text recognition.
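
The retrieval view described above can be illustrated with a minimal sketch. It is not the implementation from the paper, and everything in it is an assumption made for illustration: the alphabet, the character-histogram label embedding in `embed_label`, the linear projection `W` of image features, and the single subgradient update in `ranking_step`, which stands in for full SSVM training. It only shows the two ideas named in the abstract: recognition as nearest-label search in a common space, and a margin constraint that pushes a matching label-image pair to score higher than a non-matching one.

```python
import numpy as np

# Assumption: a lowercase alphanumeric alphabet, chosen for illustration only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def embed_label(word):
    """Hypothetical label embedding: unit-norm character histogram."""
    v = np.zeros(len(ALPHABET))
    for ch in word.lower():
        idx = ALPHABET.find(ch)
        if idx >= 0:
            v[idx] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def score(image_feat, W, word):
    """Compatibility of an image and a label: phi(y)^T W theta(x)."""
    return float(embed_label(word) @ W @ image_feat)


def recognize(image_feat, W, lexicon):
    """Retrieve the lexicon word whose embedding is closest to the projected image.

    For labels with nonzero (unit-norm) embeddings, the smallest Euclidean
    distance corresponds to the largest score.
    """
    z = W @ image_feat  # image mapped into the label space
    labels = np.stack([embed_label(w) for w in lexicon])
    dists = np.linalg.norm(labels - z, axis=1)
    return lexicon[int(np.argmin(dists))]


def ranking_step(W, image_feat, true_word, wrong_word, lr=0.1, margin=1.0):
    """One subgradient step on a margin ranking hinge loss (simplified stand-in
    for SSVM training). Returns a new matrix; W is not modified in place."""
    loss = margin + score(image_feat, W, wrong_word) - score(image_feat, W, true_word)
    if loss <= 0:
        return W  # the margin constraint is already satisfied
    grad = np.outer(embed_label(wrong_word) - embed_label(true_word), image_feat)
    return W - lr * grad


# Toy usage: the "image feature" is a slightly perturbed label embedding.
lexicon = ["bristol", "plate", "text"]
x = embed_label("plate") + 0.01
print(recognize(x, np.eye(len(ALPHABET)), lexicon))  # prints "plate"
```

Because the label embedding is computed directly from the character string, any word can be added to `lexicon` at test time without retraining, which is the property that allows never-seen-before words to be recognized. A plain character histogram cannot distinguish anagrams, so a practical embedding would also need to encode character positions.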