SHIFT:

Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation

University of California, Riverside

Problem Overview: In the top row, from left to right: keypoint predictions from a baseline adult human pose estimation model, predictions from FiDIP, a SOTA infant pose estimation model, and predictions from our method, SHIFT. Adult pose estimation models fail when applied directly to infant data; similarly, UniFrame struggles to overcome the domain shift between adults and infants. In contrast, SHIFT accounts for the highly self-occluded pose distribution of infants and thereby adapts effectively to the infant domain.

Abstract

Human pose estimation is a critical tool across a variety of healthcare applications. Despite significant progress in pose estimation algorithms targeting adults, such developments for infants remain limited. Existing algorithms for infant pose estimation, despite achieving commendable performance, depend on fully supervised approaches that require large amounts of labeled data. These algorithms also generalize poorly under distribution shifts. To address these challenges, we introduce SHIFT: Leveraging SyntHetic Adult Datasets for Unsupervised InFanT Pose Estimation, which leverages the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and addresses distribution shifts by enforcing consistency between the student predictions and the teacher pseudo-labels. Additionally, to penalize implausible predictions obtained from the Mean-Teacher framework, we incorporate an infant manifold pose prior. To enhance SHIFT’s self-occlusion perception ability, we propose a novel visibility consistency module for improved alignment of the predicted poses with the original image. Extensive experiments on multiple benchmarks show that SHIFT significantly outperforms existing state-of-the-art unsupervised domain adaptation (UDA) based pose estimation methods by ∼5% and supervised infant pose estimation methods by a margin of ∼16%. The project page is available at: sarosijbose.github.io/SHIFT.

Method

[Figure: overview of the SHIFT framework]

SHIFT uses the Mean-Teacher framework to adapt a model pre-trained on a labeled adult source dataset \((x_s, y_s)\) to unlabeled infant target images \(x_t\): the teacher model \(M_t\) is updated with an Exponential Moving Average (EMA) of the weights of the student model \(M_s\) (Section 3.2). To address anatomical variations in infants, SHIFT employs an infant pose prior \(\theta_p\), which assigns a plausibility score to each prediction of the student model \(M_s\) (Section 3.3). Further, to handle the large self-occlusions in the target domain, we employ an off-the-shelf model \(F_{seg}\) to produce pseudo segmentation masks \(p_t\), with which our Kp2Seg module \(G(\cdot)\) learns to perform pose-image visibility alignment (Section 3.4), thereby leveraging the context present in the visible portions of each image. All learnable components of the framework are shown in red and the rest in black.
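The teacher update and the student-teacher consistency term described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `ema_update` and `consistency_loss`, the decay value `alpha`, the dictionary-of-lists weight layout, and the use of a plain mean squared error on 2D keypoints are all hypothetical choices made here for clarity.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]
Keypoint = Tuple[float, float]


def ema_update(teacher: Weights, student: Weights, alpha: float = 0.999) -> Weights:
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student.

    `alpha` is an assumed decay rate; the source does not state its value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


def consistency_loss(student_kpts: Sequence[Keypoint],
                     teacher_kpts: Sequence[Keypoint]) -> float:
    """Mean squared distance between student keypoints and teacher pseudo-labels.

    A stand-in for the consistency objective; the actual loss may differ.
    """
    if len(student_kpts) != len(teacher_kpts) or not student_kpts:
        raise ValueError("keypoint lists must be non-empty and of equal length")
    total = sum((sx - tx) ** 2 + (sy - ty) ** 2
                for (sx, sy), (tx, ty) in zip(student_kpts, teacher_kpts))
    return total / len(student_kpts)


if __name__ == "__main__":
    teacher = {"head.weight": [0.0, 1.0]}
    student = {"head.weight": [1.0, 3.0]}
    # One EMA step moves the teacher slightly toward the student.
    teacher = ema_update(teacher, student, alpha=0.5)
    print(teacher)
    print(consistency_loss([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 3.0)]))
```

In this sketch only the student would receive gradients from the consistency term; the teacher changes solely through the EMA step, which is what makes its pseudo-labels more stable than the student's raw predictions.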

 

Results

[Figure: qualitative results]

Qualitative Results: Qualitative results on SURREAL → SyRIP (top 3 rows) and SURREAL → MINI-RGBD (bottom 2 rows). From left to right: source-only keypoints, keypoint predictions by UniFrame, predictions by FiDIP, predictions by SHIFT, and ground truth keypoints. As can be seen, the infant prior is essential for predicting plausible poses in cases where other methods fail (top row). Further, our method can use context from visible regions to predict keypoints in self-occluded areas (2nd and 3rd rows) while seamlessly adapting to different scenarios (4th and 5th rows). The self-occluded regions are marked in the images.

[Figure: pose estimation under self-occlusion]

Pose Estimation under Self-Occlusions: SURREAL → SyRIP. The UniFrame prediction (left panel) fails to correctly estimate significant portions of the infant's lower back and left hand, while SHIFT estimates them reasonably well. The extracted mask (second panel from the left) and the ground truth (rightmost panel) are also shown.

Benchmarking SHIFT

In the domain adaptation setting for infant pose estimation, SHIFT outperforms existing approaches. As shown below, SHIFT achieves superior performance compared to both unsupervised domain adaptation (UDA) methods and fully supervised infant pose estimation techniques. The best numbers are in bold, and the second best are underlined.

[Tables: quantitative UDA results on MINI-RGBD (left) and SyRIP (right)]

Table 1: Comparison with UDA methods on adult → infant adaptation: SURREAL → MINI-RGBD (left) and SURREAL → SyRIP (right).

[Table: quantitative results against supervised methods]

Table 2: Comparison with UDA methods on infant → infant adaptation: SyRIP → MINI-RGBD.

BibTeX

@InProceedings{Bose_2025_CVPR,
    author    = {Bose, Sarosij and Cruz, Hannah Dela and Dutta, Arindam and Kokkoni, Elena and Karydis, Konstantinos and Chowdhury, Amit Kumar Roy},
    title     = {Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {5562--5571}
}

Copyright: CC BY-NC-SA 4.0 © Sarosij Bose | Last updated: 15th July 2025 | Website credits to Nerfies