Gabriela Csurka, Stéphane Clinchant, Adrian Popescu
Conference on Multilingual and Multimodal Information Access Evaluation, Amsterdam, Netherlands, 19-22 September 2011.
In this document, we first briefly recall our baseline methods for both text and image
retrieval and describe our information fusion strategy, before giving specific details
concerning our submitted runs.
For text retrieval, XRCE used either an Information-Based IR model [4] or a Lexical
Entailment IR model based on a statistical translation IR model [5]. Alternatively, we
also used an approach from CEA List that models the queries using, on the one hand,
socially related Flickr tags and, on the other hand, Wikipedia concepts introduced
in [13]. The combination of these runs has shown that the approaches were rather
complementary.
For image representation, we used a spatial pyramid of Fisher Vectors built on local
orientation histograms and local RGB statistics. The dot product was used to define
the similarity between two images, and the color-based and texture-based rankings
were combined by simple score averaging.
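The visual scoring described above can be illustrated with a minimal sketch. It assumes each image is already represented by one feature vector per channel (texture and color); the function names, the dictionary layout, and the toy two-dimensional vectors are hypothetical, and real Fisher Vectors are much higher-dimensional. Whether scores were normalized before averaging is not stated here, so the sketch averages raw dot products.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def visual_scores(query, collection):
    """Score every image in `collection` against `query`.

    query:      dict mapping channel name -> feature vector
    collection: dict mapping image id -> dict of channel name -> feature vector
    Returns a dict mapping image id -> average of the per-channel dot products.
    """
    scores = {}
    for image_id, feats in collection.items():
        per_channel = [dot(query[c], feats[c]) for c in query]
        scores[image_id] = sum(per_channel) / len(per_channel)
    return scores


# Toy data (assumed for illustration only).
query = {"texture": [1.0, 0.0], "color": [0.0, 1.0]}
collection = {
    "a": {"texture": [1.0, 0.0], "color": [0.0, 1.0]},
    "b": {"texture": [0.0, 1.0], "color": [0.0, 1.0]},
}
scores = visual_scores(query, collection)
print(scores)  # {'a': 1.0, 'b': 0.5}
```

Image "a" matches the query in both channels, while image "b" matches only in color, so its averaged score is halved.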
Finally, to combine visual and textual information, we used the so-called Late
Semantic Combination (LSC) method [3], in which the text expert is first used to
retrieve semantically relevant documents, and then the visual and textual scores are
averaged to rank these documents. This strategy allowed us to significantly improve
over mono-modal retrieval performance. The late fusion of the best text expert
from XRCE and from CEA, combined through LSC with our Fisher Vector based image run,
led to a MAP of 37% (best score obtained in the Challenge).
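The two-step LSC idea can be sketched as follows, as an illustration rather than the submitted implementation. The sketch assumes that text and visual scores are already scaled to comparable ranges; the function name, the `top_k` cutoff, and the toy scores are hypothetical.

```python
def late_semantic_combination(text_scores, visual_scores, top_k):
    """Rank documents in the spirit of Late Semantic Combination.

    Step 1: the text expert selects the top_k documents by text score.
    Step 2: only those documents are re-ranked by the average of their
            text and visual scores (a missing visual score counts as 0.0).
    Returns a list of (doc_id, combined_score), best first.
    """
    candidates = sorted(text_scores, key=text_scores.get, reverse=True)[:top_k]
    combined = {
        d: 0.5 * (text_scores[d] + visual_scores.get(d, 0.0)) for d in candidates
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores (assumed for illustration only).
text = {"d1": 1.0, "d2": 0.75, "d3": 0.25}
visual = {"d1": 0.25, "d2": 0.75, "d3": 1.0}
ranking = late_semantic_combination(text, visual, top_k=2)
print(ranking)  # [('d2', 0.75), ('d1', 0.625)]
```

Document "d3" has the highest visual score but never enters the ranking, because the text expert filters it out in the first step. This is the point of the semantic filtering: visually similar but textually irrelevant documents cannot be promoted.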