Learning Visual Representations
Behind the unique visual understanding capabilities of humans is the ability of our brain to transform a complex scene, as perceived by our visual system, into a meaningful representation. Similarly, if we want computers to understand visual content, they should be able to learn such internal visual representations. Given the wide variety of visual data (e.g. photographs, medical images, document scans, video streams) and application scenarios (e.g. transportation, retail, healthcare, business process outsourcing), one should avoid hand-crafting a representation for each new task. Our goal is therefore to learn visual representations that adapt to different scenarios and requirements with minimal custom design. However, learning visual representations is difficult because several conflicting factors must be taken into account. Visual representations should be descriptive, meaning that they should faithfully describe the rich visual content of images and videos; robust, in the sense that they should be invariant to confounding factors such as the lighting or the viewpoint of the scene; and efficient, so that they can handle vast quantities of visual data. Finding a good compromise between descriptiveness, robustness and efficiency is a great scientific challenge.
Bags-of-visual-words and Fisher Vectors
Our group has made key contributions to this research topic over the past decade. First, in 2004, we were among the first to show that an image can be represented by extracting a set of small image patches, quantizing them into a finite number of prototypes (referred to as visual words) and building a histogram of visual word occurrences. This framework is known as the bag-of-visual-words and has been one of the most successful image representations so far. Then, in 2007, we showed that these representations could be greatly improved by going beyond quantization and including higher-order statistics. This framework is known as the Fisher Vector and has become one of the de facto standards for representing visual content. Below we list our major contributions related to these representations.
- “Generalized Max Pooling”, Naila Murray and Florent Perronnin, CVPR, 2014.
- “Image classification with the Fisher vector: Theory and practice”, Jorge Sánchez, Florent Perronnin, Thomas Mensink, Jakob Verbeek, IJCV, 2013.
- “Images as sets of locally weighted features”, Teófilo Emídio de Campos, Gabriela Csurka, Florent Perronnin, Computer Vision and Image Understanding, 2012.
- “Modeling the spatial layout of images beyond spatial pyramids”, Jorge Sánchez, Florent Perronnin and Teófilo de Campos, Pattern Recognition Letters, 2012.
- “Improving the Fisher kernel for large-scale image classification”, Florent Perronnin, Jorge Sánchez, Thomas Mensink, ECCV, 2010.
- “Fisher Kernels on Visual Vocabularies for Image Categorization”, Florent Perronnin and Christopher Dance, CVPR, 2007.
- “Visual Categorization with Bags of Keypoints”, Gabriela Csurka, Christopher Dance, Lixin Fan, Jutta Willamowski, Cédric Bray, ECCV SLCV workshop, 2004.
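As an illustration, the bag-of-visual-words pipeline described above can be sketched in a few lines of Python with NumPy. This is a minimal sketch and not the implementation used in the papers listed here: the function name bovw_histogram, the toy 2-D descriptors and the three-word codebook are hypothetical. A real system would typically learn the codebook (for example with k-means) from high-dimensional local descriptors extracted from image patches.

```python
import numpy as np


def bovw_histogram(descriptors, codebook):
    """Build a normalized bag-of-visual-words histogram.

    descriptors: (N, D) array, one local patch descriptor per row.
    codebook:    (K, D) array, one visual word (prototype) per row.
    Returns a (K,) array of visual word frequencies that sums to 1.
    """
    # Squared Euclidean distance from every descriptor to every visual word: (N, K).
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)

    # Quantization: assign each descriptor to its nearest visual word.
    words = np.argmin(dists, axis=1)

    # Histogram of visual word occurrences, normalized to frequencies.
    counts = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return counts / counts.sum()


# Toy data (assumed for illustration only): 3 visual words, 4 patch descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, -0.2], [9.5, 10.3], [0.4, 0.1], [0.2, 9.0]])

hist = bovw_histogram(descriptors, codebook)
print(hist)  # [0.5  0.25 0.25]
```

The hard assignment in the argmin step discards how far each descriptor lies from its visual word. The Fisher Vector mentioned above goes beyond this quantization by also encoding higher-order statistics of the descriptors.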
More recently, we have considered deep learning models to build new representations. We proposed a hybrid architecture that combines Fisher Vectors with additional fully connected layers for image classification, instance retrieval, and activity recognition.
- “Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture”, Florent Perronnin, Diane Larlus, CVPR, 2015.
- “Sympathy for the details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition”, Cesar De Souza, Adrien Gaidon, Eleonora Vig and Antonio Lopez, ECCV, 2016.
We have also designed new deep architectures and training mechanisms for fundamental computer vision problems, such as predicting saliency, retrieving particular objects, or even retrieving a full semantic scene (see the pages dedicated to these projects: Deep Image Retrieval and Semantic Image Retrieval).
- “End-to-End Saliency Mapping via Probability Distribution Prediction”, Saumya Jetley, Naila Murray, and Eleonora Vig, CVPR, 2016.
- “Deep Image Retrieval: Learning Global Representations for Image Search”, Albert Gordo, Jon Almazan, Jerome Revaud, Diane Larlus, ECCV, 2016.
- “Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval”, Albert Gordo, Diane Larlus.