Cesar De Souza, Adrien Gaidon, Eleonora Vig, Antonio Lopez
ECCV, Amsterdam, The Netherlands, October 11-14, 2016.
Action recognition in videos is a challenging task due to the complexity
of the spatio-temporal patterns to model and the difficulty of acquiring and
learning from large quantities of video data. Deep learning, although a
breakthrough for image classification and showing promise for videos, has still
not clearly superseded action recognition methods based on hand-crafted features,
even when trained on massive datasets. In this paper, we introduce hybrid video
classification architectures based on carefully designed unsupervised
representations of hand-crafted spatio-temporal features, classified by
supervised deep networks. As our experiments on five popular action recognition
benchmarks show, our hybrid model combines the best of both worlds: it is data
efficient (trained on 150 to 10,000 short clips) and yet improves significantly
on the state of the art, including recent deep models trained on millions of
manually labelled images and videos.