Figure 1. The manifold of our pose embedding visualized using t-SNE. Each point represents a human pose image. To better show the correlation between the pose embedding and annotated pose, we color-code pose similarities in annotation between an arbitrary target image (red box) and all the other images. Selected examples of color-coded images are shown on the right-hand side. Images whose annotated pose is similar to that of the target are colored yellow; the others are colored blue. As can be seen, yellow images generally lie closer to the target, which shows that a position in the embedding space implicitly represents a pose.


S. Kwak, M. Cho, and I. Laptev
Thin-Slicing for Pose: Learning to Understand Pose without Explicit Pose Estimation
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
PDF | Abstract | BibTeX


We address the problem of learning a pose-aware, compact embedding that projects images with similar human poses close to each other in the embedding space. The embedding function is built on a deep convolutional network, and trained with triplet-based rank constraints on real image data. This architecture allows us to learn a robust representation that captures differences in human poses by effectively factoring out variations in clothing, background, and imaging conditions in the wild. For a variety of pose-related tasks, the proposed pose embedding provides a cost-efficient and natural alternative to explicit pose estimation, circumventing the challenges of localizing body joints. We demonstrate the efficacy of the embedding on pose-based image retrieval and action recognition problems.
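The triplet-based rank constraint mentioned above pulls an anchor image toward a pose-similar example and pushes it away from a dissimilar one. The hinge-loss sketch below illustrates the general form of such a constraint; the function name, margin value, and use of squared Euclidean distance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def triplet_rank_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet ranking loss (illustrative sketch).

    anchor, positive, negative: embedding vectors of an anchor image,
    a pose-similar image, and a pose-dissimilar image, respectively.
    The loss is zero once the positive is closer to the anchor than
    the negative by at least `margin` (in squared Euclidean distance).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor-negative distance
    return np.maximum(0.0, margin + d_pos - d_neg)
```

In training, such a loss would be summed over many sampled triplets and backpropagated through the convolutional embedding network, so that pose-similar images end up nearby in the embedding space.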


@inproceedings{kwak2016thinslicing,
    author      = {Kwak, S. and Cho, M. and Laptev, I.},
    title       = {Thin-Slicing for Pose: Learning to Understand Pose without Explicit Pose Estimation},
    booktitle   = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
    year        = {2016},
}

Pose embedding network (317MB)
Pretrained network required for learning the pose embedding (365MB)


We thank Greg Mori and Vadim Kantorov for fruitful discussions. This work was supported by a Google Research Award and the ERC grants Activia and VideoWorld.