Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation

Abstract

Human pose estimation is a critical tool across a variety of healthcare applications. Despite significant progress in pose estimation algorithms targeting adults, such developments for infants remain limited. Existing algorithms for infant pose estimation, despite achieving commendable performance, depend on fully supervised approaches that require large amounts of labeled data. These algorithms also generalize poorly under distribution shifts. To address these challenges, we introduce SHIFT: Leveraging SyntHetic Adult Datasets for Unsupervised InFanT Pose Estimation, which leverages the pseudo-labeling-based Mean-Teacher framework to compensate for the lack of labeled data and addresses distribution shifts by enforcing consistency between the student and the teacher pseudo-labels. Additionally, to penalize implausible predictions obtained from the Mean-Teacher framework, we incorporate an infant manifold pose prior. To enhance SHIFT’s self-occlusion perception ability, we propose a novel visibility consistency module for improved alignment of the predicted poses with the original image. Extensive experiments on multiple benchmarks show that SHIFT significantly outperforms existing state-of-the-art unsupervised domain adaptation (UDA) based pose estimation methods by ∼5% and supervised infant pose estimation methods by a margin of ∼16%. The project page is available at: sarosijbose.github.io/SHIFT.

Method

SHIFT builds on the Mean-Teacher framework: the teacher model M_t is updated with an Exponential Moving Average (EMA) of the student model M_s's weights, which adapts a model pre-trained on a labeled adult source dataset (x_s, y_s) to unlabeled infant target images x_t (Section 3.2). To address anatomical variations in infants, SHIFT employs an infant pose prior \(\theta_p\) that assigns a plausibility score to each prediction of the student model M_s (Section 3.3). Further, to handle the large self-occlusions in the target domain, we use an off-the-shelf segmentation model F_seg to produce pseudo segmentation masks p_t, with which our Kp2Seg module \(G(\cdot)\) learns to perform pose-image visibility alignment (Section 3.4), thereby leveraging the context present in the visible portions of each image. All learnable components of the framework are shown in red and the rest in black.
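The Mean-Teacher update described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names `ema_update` and `consistency_loss`, the decay value of 0.999, the flat-list weight layout, and the use of a mean squared distance as the consistency term are all assumptions made for illustration. The text only states that the teacher is an EMA of the student and that consistency is enforced between student and teacher pseudo-labels.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical weight layout: parameter name -> flat list of weights.
Params = Dict[str, List[float]]
Keypoint = Tuple[float, float]


def ema_update(teacher: Params, student: Params, decay: float = 0.999) -> None:
    """Update teacher weights in place: t <- decay * t + (1 - decay) * s.

    The decay value is an assumed default, not taken from the paper.
    """
    for name, s_weights in student.items():
        t_weights = teacher[name]
        for i, s in enumerate(s_weights):
            t_weights[i] = decay * t_weights[i] + (1.0 - decay) * s


def consistency_loss(student_kps: Sequence[Keypoint],
                     teacher_kps: Sequence[Keypoint]) -> float:
    """Mean squared distance between student and teacher keypoints.

    Assumed stand-in for the student-teacher consistency term.
    """
    if len(student_kps) != len(teacher_kps):
        raise ValueError("student and teacher must predict the same keypoints")
    if not student_kps:
        return 0.0
    total = 0.0
    for (sx, sy), (tx, ty) in zip(student_kps, teacher_kps):
        total += (sx - tx) ** 2 + (sy - ty) ** 2
    return total / len(student_kps)
```

In a training loop of this kind, the student would be optimized by gradient descent on the loss, and `ema_update` would be called after each student step so that the teacher changes slowly and yields more stable pseudo-labels.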

Results

Qualitative Results: Qualitative results on SURREAL → SyRIP (top 3 rows) and SURREAL → MINI-RGBD (bottom 2 rows). From left to right: source-only keypoints, keypoint predictions by UniFrame, predictions by FiDIP, predictions by SHIFT, and ground truth keypoints. As can be seen above, the infant prior is essential for predicting plausible poses in cases where other methods fail (top row). Further, our method can use context from visible regions to predict keypoints in self-occluded areas (2nd and 3rd rows) while adapting seamlessly to different scenarios (4th and 5th rows). ○ denotes the self-occluded regions in the images.

Pose Estimation under Self-Occlusions: SURREAL → SyRIP. The UniFrame prediction (left panel) fails to correctly estimate significant portions of the lower back and left hand of the infant, while SHIFT estimates them reasonably well. The ground truth (rightmost panel) and the extracted mask (second panel from the left) are also shown.

Benchmarking SHIFT

In the domain adaptation setting for infant pose estimation, SHIFT outperforms both unsupervised domain adaptation (UDA) methods and fully supervised infant pose estimation techniques, as shown in the tables below. The best numbers are highlighted in bold; the second best are underlined.

Table 1: Comparison with UDA methods on adult → infant adaptation: SURREAL → MINI-RGBD (left) and SURREAL → SyRIP (right).

Table 2: Comparison with UDA methods on infant → infant adaptation: SyRIP → MINI-RGBD.

BibTeX

@InProceedings{Bose_2025_CVPR,
                author    = {Bose, Sarosij and Cruz, Hannah Dela and Dutta, Arindam and Kokkoni, Elena and Karydis, Konstantinos and Chowdhury, Amit Kumar Roy},
                title     = {Leveraging Synthetic Adult Datasets for Unsupervised Infant Pose Estimation},
                booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
                month     = {June},
                year      = {2025},
                pages     = {5562-5571}
            }