CVPR 2018 - Part 1 Pose estimation

This four-part blog series is co-authored by computer vision scientists who attended the Computer Vision and Pattern Recognition (CVPR) conference in Salt Lake City at the end of June this year.

Rather than edit the content down to one piece, we made the judgement call not to cut any of the work that stuck most in our minds throughout a jam-packed week of workshops, presentations and posters.

The focus of Part 1 is Pose Estimation by Philippe Weinzaepfel. Part 2 looks at Visual Localization and Embedded Vision with Claudine Combe, Martin Humenberger and Nicolas Monet. Naila Murray and Jérôme Revaud talk about their favourite papers in Image Retrieval in Part 3 and the series rounds up with Part 4 on 3D Scene Understanding by Claudine, Nicolas and Philippe.

We thought a good starting point for the series would be to call out the “Good Citizen of CVPR” panel. As the vision community has exploded in size, it has become really difficult to maintain community norms, and indeed some of them probably need to be revisited. The content here was about educating members of the CVPR community on best practices for reviewers, authors, attendees, area chairs and other participants. It covered diverse topics such as “How to Write a Good Research Paper” by Bill Freeman, “How to Avoid ‘Clique’ Culture” with Timnit Gebru and “Tips for Preparing a Clear Talk” by Kristen Grauman. The talks were uniformly informative and were given by a top-notch roster of junior and senior computer vision researchers. As the community continues to grow, we hope to see this panel become a fixture at CVPR conferences.

The NAVER LABS Panel! On the booth at CVPR – Korean and European team.

Pose estimation

There’s been a lot of impressive recent progress in human pose estimation, and quite a bit of it was shared in different forums at the conference. Beyond simply improving 2D pose estimation (finding the 2D pixel location of each joint for each person in an image) or 3D pose estimation (finding 3D joint positions relative to the center of the person, usually limited to constrained environments such as motion capture rooms), the community has started to shift towards more difficult and interesting problems such as finding 3D poses in real-world images or fitting human shapes (i.e. a mesh for each person). This is often done using the SMPL parametric human mesh model, which represents most human shapes using only 82 parameters that control 6890 vertices.
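To make the 82 parameters more concrete, here is a minimal sketch of how such a vector is commonly broken down: 72 pose values (24 joints, each with a 3-value axis-angle rotation) plus 10 shape coefficients. The ordering (pose first, then shape) and the name `split_smpl_params` are assumptions made for illustration; they are not taken from any particular SMPL implementation.

```python
# Sketch of a common layout for an 82-dimensional SMPL parameter vector.
# Assumption: pose values come first, followed by shape coefficients.
# All names here are illustrative, not part of an official SMPL API.

NUM_JOINTS = 24              # includes the root joint (global orientation)
POSE_DIM = NUM_JOINTS * 3    # 72 axis-angle values
SHAPE_DIM = 10               # shape coefficients
NUM_VERTICES = 6890          # size of the mesh these parameters control


def split_smpl_params(params):
    """Split a flat sequence of 82 numbers into per-joint rotations and shape."""
    expected = POSE_DIM + SHAPE_DIM
    if len(params) != expected:
        raise ValueError(f"expected {expected} values, got {len(params)}")
    # One 3-value axis-angle rotation per joint.
    pose = [list(params[3 * j:3 * j + 3]) for j in range(NUM_JOINTS)]
    shape = list(params[POSE_DIM:])
    return pose, shape


params = [0.0] * 82          # all zeros: rest pose with the mean body shape
pose, shape = split_smpl_params(params)
print(len(pose), len(pose[0]), len(shape))  # prints: 24 3 10
```

Turning these parameters into the 6890 vertices requires the learned SMPL model itself, which is not reproduced here.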


For instance, the best poster award of the 3D humans workshop (sponsored by yours truly) went to Pavlakos et al. for ‘Learning to Estimate 3D Human Pose and Shape from a Single Color Image’. Given a single image, the task consists of regressing the parameters of an SMPL mesh model for a human centered in the frame. The framework is shown in Figure 1: a first network is trained to predict joint heatmaps and silhouettes from images (a), while a second network is trained on human shape instances (b) to regress the mesh parameters from these heatmaps and silhouettes.

Figure 1: Illustration of the framework of Pavlakos et al.


When these two networks are applied consecutively, the loss checks that the projected silhouette and keypoints of the predicted mesh match the keypoint and silhouette annotations in the training data.

For the same task, Kanazawa et al. in ‘End-to-end Recovery of Human Shape and Pose’ propose an adversarial framework in which the parameters of the SMPL model are regressed. The loss minimizes the reprojection error of the joints in the image, while the SMPL parameters are also fed to a discriminator whose aim is to check whether the mesh is realistic.
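The reprojection term can be illustrated with a small sketch. It assumes a weak-perspective camera (drop the depth, then scale and translate) and an L1 distance over visible joints; both choices, and the function names, are assumptions made for this example rather than a description of the authors' code. The adversarial (discriminator) term is not shown.

```python
# Sketch of a joint reprojection loss under an assumed weak-perspective camera.
# Function names and the camera model are illustrative assumptions.

def project_weak_perspective(joints_3d, scale, trans):
    """Project (x, y, z) joints to 2D: ignore depth, then scale and translate."""
    tx, ty = trans
    return [(scale * x + tx, scale * y + ty) for x, y, _z in joints_3d]


def reprojection_loss(joints_3d, joints_2d_gt, visible, scale, trans):
    """Mean L1 distance between projected and annotated joints (visible only)."""
    projected = project_weak_perspective(joints_3d, scale, trans)
    total, count = 0.0, 0
    for (px, py), (gx, gy), is_visible in zip(projected, joints_2d_gt, visible):
        if is_visible:
            total += abs(px - gx) + abs(py - gy)
            count += 1
    return total / count if count else 0.0


# Two predicted 3D joints and their 2D annotations (toy values).
joints_3d = [(0.0, 0.0, 2.0), (1.0, 1.0, 3.0)]
joints_2d_gt = [(1.0, 1.0), (3.0, 4.0)]
loss = reprojection_loss(joints_3d, joints_2d_gt, [True, True],
                         scale=2.0, trans=(1.0, 1.0))
print(loss)  # prints: 0.5
```

Because the loss only needs 2D joint annotations, it can be computed on in-the-wild images that have no 3D ground truth.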

Zanfir et al. can even deal with multiple people [Figure 2]. Their pipeline, presented in ‘Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes’, has multiple steps: first, humans are detected and a CNN is used to predict 2D and 3D joint locations for each person as well as a pixel-level body part segmentation. This data is used to infer the SMPL parameters. These parameters are then optimized while respecting a number of constraints, for example that all people stand on a common ground plane and that each point in the 3D world is occupied by only one person. An extension to videos is also proposed, adding a temporal consistency constraint.

Figure 2: Illustration of the results from Zanfir et al.

For a similar task, Güler et al. propose DensePose [DensePose: Dense Human Pose Estimation In The Wild], in which they modify Mask R-CNN to also regress, for each pixel, the corresponding coordinates on the mesh surface. To this end, the ground truth includes sparse correspondence annotations between people in images and points on the mesh surface for 50k humans from the MS COCO dataset.

In ‘Video Based Reconstruction of 3D People Models’, Alldieck et al. regress not only a shape model but also clothing and hair. This is done from an input video of a cooperative person who turns 360° in the center of the frame.

Joo et al. received the best student paper award for ‘Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies’. They propose the first markerless system that captures not only a face and its expression but also hands, bodies and feet, by unifying different frameworks and stitching together their part-specific models.

Finally, some papers propose estimating 3D pose from in-the-wild images, either by requiring multiple views at training time, such as Rhodin et al. in ‘Learning Monocular 3D Human Pose Estimation from Multi-view Images’, or by annotating the ordinal depth relation between some pairs of joints, e.g. Pavlakos et al. in ‘Ordinal Depth Supervision for 3D Human Pose Estimation’ (see Figure 3 for both).

Figure 3. Left: illustration of ordinal depth [Pavlakos et al.]. Right: illustration of multi-view constraint [Rhodin et al.].
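Ordinal depth supervision can be sketched as a pairwise ranking loss: the annotation only says which of two joints is closer to the camera, and the loss penalizes predicted depths that violate that order. The exact form below (a softplus ranking term, and a squared difference when the two joints are annotated as roughly equally deep) is an assumption chosen for illustration, and `ordinal_depth_loss` is a hypothetical name.

```python
import math

# Sketch of a pairwise ranking loss for ordinal depth annotations.
# The specific loss form is an illustrative assumption.

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordinal_depth_loss(z_i, z_j, relation):
    """Loss for one annotated joint pair.

    relation = +1: joint i is annotated as closer to the camera than joint j
    relation = -1: joint i is annotated as farther than joint j
    relation =  0: the two joints are annotated as having similar depth
    """
    diff = z_i - z_j
    if relation == 0:
        return diff * diff           # pull the two depths together
    return softplus(relation * diff)  # small when the order is respected


# Joint i is annotated as closer (smaller depth) than joint j.
respected = ordinal_depth_loss(1.0, 3.0, +1)
violated = ordinal_depth_loss(3.0, 1.0, +1)
print(respected < violated)  # prints: True
```

Since only the ordering matters, annotators do not need to provide metric depth, which is what makes this kind of label feasible for in-the-wild images.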

In terms of applications, human pose estimation has been widely used to improve action recognition, e.g. “Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points” by Baradel et al., “PoseFlow: A Deep Motion Representation for Understanding Human Behaviors in Videos” by Zhang et al., “2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning” by Luvizon et al. and “Recognizing Human Actions as the Evolution of Pose Estimation Maps” by Liu et al. In our paper “PoTion: Pose MoTion Representation for Action Recognition”, resulting from our collaboration with Inria, we show that the motion of the human pose provides a compact representation [Figure 4] for action recognition that is complementary to standard appearance and flow features. We’d like to hear your feedback on it.

Figure 4. Illustration of the Pose MoTion (PoTion) representation proposed in our paper “PoTion: Pose MoTion Representation for Action Recognition”.
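One way to picture how pose motion can be packed into a compact, fixed-size representation is to aggregate the per-frame heatmaps of a joint over time, weighting each frame by its relative position in the clip. The sketch below uses two channels (one emphasizing early frames, one emphasizing late frames) and plain nested lists; the weighting scheme and the name `aggregate_heatmaps` are assumptions made for illustration and should not be read as the exact procedure of the paper.

```python
# Sketch: collapse T per-frame heatmaps of one joint into a 2-channel image
# whose channels encode *when* the joint was at each location.
# The linear time weighting is an illustrative assumption.

def aggregate_heatmaps(heatmaps):
    """heatmaps: list of T frames, each an H x W nested list of floats.

    Returns an H x W x 2 nested list. Channel 0 weights early frames more,
    channel 1 weights late frames more.
    """
    num_frames = len(heatmaps)
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    out = [[[0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for t, frame in enumerate(heatmaps):
        # Relative time: 0.0 at the first frame, 1.0 at the last.
        late = t / (num_frames - 1) if num_frames > 1 else 0.0
        early = 1.0 - late
        for y in range(height):
            for x in range(width):
                out[y][x][0] += early * frame[y][x]
                out[y][x][1] += late * frame[y][x]
    return out


# A joint that moves from the left pixel to the right pixel over two frames.
clip = [
    [[1.0, 0.0]],   # frame 0: joint on the left
    [[0.0, 1.0]],   # frame 1: joint on the right
]
print(aggregate_heatmaps(clip))  # prints: [[[1.0, 0.0], [0.0, 1.0]]]
```

The output has the same size whatever the clip length, so it can be fed to a standard image classifier.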

About the author:

Philippe Weinzaepfel is a researcher in the computer vision group at NAVER LABS Europe. He presented “PoTion: Pose MoTion Representation for Action Recognition” at CVPR 2018. This work was carried out in collaboration with Inria.


Part 2: CVPR, Embedded Vision and Visual Localization

Part 3: CVPR, Image Retrieval

Part 4: CVPR, 3D Scene Understanding