Less is More: Decoder-Free Masked Modeling
for Efficient Skeleton Representation Learning

Jeonghyeok Do     Yun Chen     Geunhyuk Youk     Munchurl Kim
Korea Advanced Institute of Science and Technology (KAIST), South Korea      

Rethinking Masked Modeling for Skeletons


Top: Standard Masked Auto-Encoder (MAE) methods suffer from a 14.38× computational surge during inference relative to pre-training due to asymmetric full-sequence processing.
Bottom: SLiM synergizes masked modeling with contrastive learning (CL) in a decoder-free framework. This symmetric design achieves a 7.89× reduction in inference cost compared to MAE baselines.

Abstract

We introduce SLiM, a decoder-free framework for skeleton representation learning. By eliminating the heavy reconstruction decoder and adopting a symmetric token-processing design, SLiM fundamentally resolves the computational asymmetry and inference overhead inherent in standard MAE architectures.

We propose a unified representation learning architecture that synergizes the global discriminative power of contrastive learning with the local context sensitivity of masked modeling through a single, shared encoder.

To mitigate shortcut reconstruction caused by the high correlation among skeletal joints, we adopt Semantic Tube Masking alongside refined Skeleton-Aware Augmentations, which force the encoder to capture deeper action semantics.

Extensive experiments demonstrate that our SLiM consistently achieves state-of-the-art performance across all downstream protocols, while simultaneously providing a 7.89× reduction in inference computational cost compared to existing MAE methods.

Overview of SLiM


Our decoder-free teacher-student architecture unifies Masked Feature Modeling and Global-Local Contrastive Learning. By simultaneously minimizing feature reconstruction error (ℒMFM) and contrastive loss (ℒGLCL) across diverse temporal views, SLiM effectively captures both fine-grained local patterns and global semantics.
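The two objectives can be sketched as follows. This is our own simplified NumPy illustration, not the authors' code: the exact feature dimensions, loss weighting, and temperature are assumptions. The student encodes masked tokens, the teacher (an EMA copy of the encoder) encodes the full sequence; ℒMFM regresses teacher features at masked positions, and ℒGLCL is an InfoNCE-style contrastive loss between a global query and candidate keys.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mfm_loss(student_feats, teacher_feats, mask):
    """Masked Feature Modeling: regress teacher features at masked tokens.
    student_feats, teacher_feats: (T, D); mask: (T,) bool, True = masked."""
    s = l2_normalize(student_feats[mask])
    t = l2_normalize(teacher_feats[mask])
    # Squared L2 on normalized features equals 2 - 2*cosine_similarity.
    return np.mean(np.sum((s - t) ** 2, axis=-1))

def info_nce(query, keys, pos_idx, temperature=0.1):
    """Contrastive loss: pull query toward keys[pos_idx], push from the rest."""
    logits = l2_normalize(query) @ l2_normalize(keys).T / temperature
    logits -= logits.max()  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return -log_prob[pos_idx]

# Toy example with hypothetical shapes (T=16 tokens, D=32 dims).
rng = np.random.default_rng(0)
student = rng.normal(size=(16, 32))
teacher = rng.normal(size=(16, 32))
mask = rng.random(16) < 0.5
lam = 1.0  # loss-balancing weight; an assumption, not the paper's value
keys = np.vstack([teacher.mean(0), rng.normal(size=(7, 32))])
total = mfm_loss(student, teacher, mask) + lam * info_nce(student.mean(0), keys, 0)
```

In a full training loop the teacher would be updated as an exponential moving average of the student, and the total loss backpropagated only through the student.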

Masking and Augmentations


Top: Previous masking (a) and augmentations (b-d) often result in trivial solutions or physically implausible poses.
Bottom: Our Semantic Tube Masking (e) and Skeleton-Aware Augmentations (f-h) ensure anatomical and physical consistency through skeleton-aware designs.
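A minimal sketch of the tube-masking idea (our own illustration under assumed joint groupings, not the authors' implementation): joints belonging to the same body part are hidden together across a contiguous temporal tube, so the masked content cannot be trivially recovered from spatially correlated joints or neighboring frames.

```python
import numpy as np

# Hypothetical body-part grouping for a 25-joint NTU-style skeleton;
# the grouping actually used by SLiM is an assumption here.
BODY_PARTS = {
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
    "torso":     [0, 1, 2, 3, 20],
}

def semantic_tube_mask(num_frames, num_joints=25, tube_len=8,
                       num_parts=2, rng=None):
    """Return a boolean (num_frames, num_joints) mask, True = masked.

    Randomly chosen body parts are masked over one contiguous temporal
    tube, blocking shortcut reconstruction from adjacent joints/frames.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    start = rng.integers(0, max(1, num_frames - tube_len + 1))
    parts = rng.choice(list(BODY_PARTS), size=num_parts, replace=False)
    for p in parts:
        mask[start:start + tube_len, BODY_PARTS[p]] = True
    return mask

mask = semantic_tube_mask(num_frames=64, rng=np.random.default_rng(0))
```

Each masked joint column is hidden for exactly `tube_len` consecutive frames, unlike frame-wise random masking, which leaves temporally adjacent copies of the same joint visible.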

Performance Evaluation


Linear evaluation results. Comparison with state-of-the-art methods on NTU-60, NTU-120, and PKU-MMD II datasets. Performance is reported as Top-1 accuracy (%). Bold indicates the best result, and underline indicates the second best.


Efficiency and performance comparison. SLiM achieves state-of-the-art accuracy while reducing inference computational cost by 7.89× compared to MAE methods. TGN: Target Generation Network. Bold indicates the best result.

Ablation Studies


Note that all ablation models are pre-trained for 100 epochs, whereas our final full model is trained for 150 epochs.
Left: Effect of objective functions. ℒMFM denotes the MFM loss, ℒCL represents a standard CL loss without temporal diversity, while ℒGLCL incorporates diverse temporal sampling.
Right: Ablation for the dual-role masking strategies within STM (Semantic Tube Masking). We evaluate the synergy between applying masked modeling to the Global View and utilizing augmented Local Views for contrastive learning (Limited: previous motion-aware masking).

BibTeX

@article{do2026less,
  title={Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning},
  author={Do, Jeonghyeok and Chen, Yun and Youk, Geunhyuk and Kim, Munchurl},
  journal={arXiv preprint arXiv:2603.10648},
  year={2026}
}