Albert Gordo, Jon Almazan, Naila Murray, Florent Perronnin
ICCV, Santiago de Chile, Chile, 11-18 December, 2015
The goal of this work is to bring semantics into the tasks
of text recognition and retrieval in natural images. Although
text recognition and retrieval have received a lot of
attention in recent years, previous works have focused on
recognizing or retrieving exactly the same word used as a
query, without taking the semantics into consideration.
In this paper, we ask the following question: can we predict
semantic concepts directly from a word image, without
explicitly trying to transcribe the word image or its
characters at any point? For this goal we propose a convolutional
neural network (CNN) with a weighted ranking
loss objective that ensures that the concepts relevant to the
query image are ranked ahead of those that are not relevant.
This can also be interpreted as learning a Euclidean
space where word images and concepts are jointly embedded.
This model is learned in an end-to-end manner, from
image pixels to semantic concepts, using a dataset of synthetically
generated word images and concepts mined from
a lexical database (WordNet). Our results show that, despite
the complexity of the task, word images and concepts
can indeed be associated with a high degree of accuracy.
Report number: