3rd September 2015, 11:00AM

Speaker: Mihir Jain, post-doctoral researcher at University of Amsterdam, Amsterdam, The Netherlands


We assess the benefits of having objects in the video representation for action recognition. Rather than considering a handful of carefully selected and localized objects, we use 15,000 object categories from ImageNet to understand object-action relationships. This talk presents two of our recent works relating objects to actions: (1) an empirical study of supervised action recognition and (2) zero-shot action recognition.

  1. Empirical study: We conduct an empirical study on the benefit of encoding these object categories for actions using 6 datasets totalling more than 200 hours of video and covering 180 action classes. We show that objects matter for actions, and are often semantically relevant as well. We establish that actions have object preferences, and that these object-action relations are generic, which allows transferring them from one domain to another. Objects, when combined with motion, improve the state-of-the-art for both action classification and localization.
  2. Zero-shot action recognition: The goal is to recognize actions without using any video example for learning. Different from traditional zero-shot approaches, we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of an unseen video based on a convex combination of action and object affinities. Experiments on four action datasets demonstrate the potential of our approach.
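The zero-shot idea above can be sketched in a few lines: embed action and object names in a shared word-vector space, score a video by its object-classifier responses, and rank actions by an affinity-weighted combination of those object scores. The sketch below is illustrative only; the object/action names are invented, and the random vectors stand in for the skip-gram embeddings actually used in objects2action.

```python
import numpy as np

# Toy stand-ins for skip-gram word embeddings (illustrative only; in the
# actual work these come from a skip-gram model trained on a text corpus).
rng = np.random.default_rng(0)
dim = 8
objects = ["ball", "racket", "horse", "guitar"]   # hypothetical object names
actions = ["tennis", "riding", "playing music"]   # hypothetical action names
obj_emb = {w: rng.normal(size=dim) for w in objects}
act_emb = {w: rng.normal(size=dim) for w in actions}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Action-object affinity matrix from embedding similarity.
affinity = np.array([[cosine(act_emb[a], obj_emb[o]) for o in objects]
                     for a in actions])

# A video is encoded as object-classifier scores (here a toy distribution).
video_obj_scores = np.array([0.6, 0.3, 0.05, 0.05])

# Zero-shot action scoring: combine object scores weighted by affinity,
# so no video example of any action is needed for learning.
action_scores = affinity @ video_obj_scores
predicted = actions[int(np.argmax(action_scores))]
```

This captures only the core transfer step; the published method additionally handles multi-word action names and sparsifies the object set per action.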