Behind the unique visual understanding capabilities of humans is the ability of our brain to transform a complex scene, as perceived by our visual system, into a meaningful representation. Similarly, if we wish computers to understand visual content, they should be able to learn such internal visual representations. Given the wide variety of visual data (e.g. photographs, medical images, document scans, video streams) and application scenarios (e.g. transportation, retail, healthcare, business process outsourcing), one should avoid hand-crafting a representation for each new task. Our goal is therefore to learn visual representations that adapt to different scenarios and requirements with minimal custom design. However, learning visual representations is difficult because several conflicting factors must be taken into account. Visual representations should be descriptive, meaning that they should faithfully describe the rich visual content of images and videos. They should be robust, in the sense that they should be invariant to confounding factors such as the lighting or the viewpoint of the scene. Finally, they should be efficient in handling vast quantities of visual data. Finding a good compromise between descriptiveness, robustness and efficiency is a great scientific challenge.

Bags-of-visual-words and Fisher Vectors

Our group has made key contributions to this research topic over the past decade. First, in 2004, we were among the first to show that an image can be represented by extracting a set of small image patches, quantizing them into a finite number of prototypes (referred to as visual words) and building a histogram of visual word occurrences. This framework is known as the bag-of-visual-words and has been one of the most successful so far. Then, in 2007, we showed that these representations could be greatly improved by going beyond quantization and including higher-order statistics. This framework is known as the Fisher Vector and has become one of the de facto standards for representing visual content. Below we list our major contributions related to these representations.
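The three bag-of-visual-words steps described above (patch extraction, quantization into visual words, histogram of occurrences) can be illustrated with a minimal sketch. This is not our original implementation: the function names are hypothetical, raw pixel patches stand in for the local descriptors that a practical system would compute, and learning the prototypes with plain k-means is an assumption made for the example.

```python
import numpy as np


def extract_patches(image, patch_size=8, stride=8):
    """Cut a 2-D grayscale image into flattened square patches on a regular grid."""
    height, width = image.shape
    patches = [
        image[y:y + patch_size, x:x + patch_size].ravel()
        for y in range(0, height - patch_size + 1, stride)
        for x in range(0, width - patch_size + 1, stride)
    ]
    return np.asarray(patches, dtype=np.float64)


def assign_words(patches, vocabulary):
    """Quantize each patch to the index of its nearest prototype (visual word)."""
    # Squared Euclidean distance between every patch and every prototype.
    sq_dists = ((patches[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)


def learn_vocabulary(patches, num_words=16, num_iters=10, seed=0):
    """Learn prototypes with plain k-means (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize the prototypes with distinct, randomly chosen patches.
    init = rng.choice(len(patches), size=num_words, replace=False)
    vocabulary = patches[init].copy()
    for _ in range(num_iters):
        labels = assign_words(patches, vocabulary)
        for k in range(num_words):
            members = patches[labels == k]
            if len(members) > 0:  # keep the old prototype if its cluster is empty
                vocabulary[k] = members.mean(axis=0)
    return vocabulary


def bovw_histogram(image, vocabulary, patch_size=8, stride=8):
    """Represent an image as a normalized histogram of visual word occurrences."""
    patches = extract_patches(image, patch_size, stride)
    words = assign_words(patches, vocabulary)
    counts = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return counts / counts.sum()


if __name__ == "__main__":
    # Toy usage on a random "image"; real vocabularies are learned from many images.
    rng = np.random.default_rng(0)
    image = rng.random((64, 64))
    vocabulary = learn_vocabulary(extract_patches(image), num_words=16)
    print(bovw_histogram(image, vocabulary))
```

The resulting histogram has one bin per visual word and only counts how often each word occurs. The Fisher Vector goes beyond this hard count by also encoding higher-order statistics of the patches, which this sketch does not attempt to reproduce.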

Hybrid architectures

More recently, we have considered deep learning models to build new representations. We proposed a hybrid architecture that combines Fisher Vectors with additional fully connected layers for image classification, instance retrieval, and activity recognition.

Deep learning

We also successfully designed new deep architectures or training mechanisms to be applied to fundamental computer vision problems, such as predicting saliency or retrieving particular objects (see the page dedicated to this project: Deep Image Retrieval).

Albert Gordo, Diane Larlus: Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval
CVPR 2017