Florent Perronnin, Jorge Sanchez, Zeynep Akata
CVPR, Colorado Springs, June 20, 24, 25, 2011.
The bag-of-visual-words (BOW) is certainly the most
popular image representation to date and it has been shown
to yield good results in various problems including Fine-
Grained Visual Categorization (FGVC) [3, 4]. Our contribution
is to show that the Fisher Vector (FV) - which describes
an image by its deviation from an average model
- is an alternative which performs much better than the
BOW for the FGVC problem. In this extended abstract we
first provide a brief introduction to the FV. We then present
theoretical as well as practical motivations for using the
FV for FGVC. We finally provide experimental results on
four ImageNet subsets: fungus, ungulate, vehicle and ImageNet10K.
Compared to [4], which uses spatial pyramid
(SP) BOW representations, we report significantly higher
classification accuracies. For instance, on ImageNet10K
we report 16.7% vs. 6.4% top-1 accuracy, which represents
a 160% relative improvement.
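
To make the phrase "deviation from an average model" concrete, here is a minimal sketch of a Fisher Vector, assuming the average model is a Gaussian mixture with diagonal covariances and keeping only the gradient with respect to the component means. The function name `fisher_vector_means` and its parameter names are illustrative choices, not taken from the paper, and the sketch omits the gradients with respect to the variances as well as any normalization the authors may apply.

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Mean-gradient part of a Fisher Vector (illustrative sketch).

    X       : (T, D) local descriptors of one image
    weights : (K,)   mixture weights of the GMM (assumed to sum to 1)
    means   : (K, D) component means
    sigmas  : (K, D) component standard deviations (diagonal covariances)
    Returns a vector of length K * D.
    """
    T, D = X.shape
    # Whitened deviation of every descriptor from every component mean: (T, K, D)
    diff = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    # Log-density of each descriptor under each Gaussian: (T, K)
    log_prob = (-0.5 * np.sum(diff ** 2, axis=2)
                - np.sum(np.log(sigmas), axis=1)[None, :]
                - 0.5 * D * np.log(2.0 * np.pi))
    # Soft assignments (posteriors), computed stably in the log domain
    log_post = np.log(weights)[None, :] + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Posterior-weighted average deviation, scaled by 1 / sqrt(weight)
    G = np.einsum("tk,tkd->kd", gamma, diff) / (T * np.sqrt(weights)[:, None])
    return G.ravel()
```

With a single Gaussian, descriptors whose average equals the model mean yield an all-zero vector, while descriptors shifted away from the mean yield a non-zero one. This is the sense in which the representation encodes a deviation rather than a count of visual words, as the BOW does.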