Radu Horaud, Director of Research, Perception project, INRIA Rhône-Alpes, Montbonnot, France


Audio-visual integration has been an active research topic, in particular for disambiguating the audio modality using visual information, for example, lip reading to improve speech recognition performance. We address the more general problem of how to combine visual and auditory data for cognitive interaction between an artificial agent (a robot) and a group of people. The overall task is to retrieve a multi-party, multi-modal dialog and to allow a robot to interact purposively with people. One key ingredient is to properly align images with speech in an unconstrained setting. Modern computer vision and signal processing methods use high-dimensional descriptors to represent images and sounds. Therefore, one important task is to extract low-dimensional latent information from these high-dimensional observations, for example, to track face locations and face orientations over time, as well as the clean speech signals emitted by individual speakers. We propose a novel high-dimensional-to-low-dimensional mapping model and briefly describe the associated methodology for learning the model parameters, based on mixture and latent-variable models and on EM inference. We then illustrate an instance of this model that robustly and efficiently aligns speech utterances with images of faces.