Top: Standard Masked Auto-Encoder (MAE) methods suffer from a 14.38× computational surge during inference relative to pre-training due to asymmetric full-sequence processing.
Bottom: SLiM synergizes masked modeling with contrastive learning (CL) in a decoder-free framework.
This symmetric design achieves a 7.89× reduction in inference cost compared to MAE baselines.
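To build intuition for why masked pre-training is so much cheaper than full-sequence inference, here is a toy per-layer FLOPs model for a transformer encoder. The constants (token count, width, 25% keep ratio) are illustrative assumptions, not the paper's settings, and the reported 14.38× figure additionally reflects the MAE decoder; this sketch only shows the encoder-side asymmetry.

```python
def layer_flops(n_tokens: int, d: int = 256) -> int:
    """Rough FLOPs for one transformer layer (hypothetical constants)."""
    attn = 2 * n_tokens**2 * d   # attention scores + weighted value sum
    mlp = 8 * n_tokens * d**2    # two projections with 4x expansion
    return attn + mlp

full = 200      # assumed frames x joints token count at inference
visible = 50    # 25% of tokens kept after masking during pre-training

ratio = layer_flops(full) / layer_flops(visible)
print(round(ratio, 2))  # encoder cost grows super-linearly with length
```

Because attention scales quadratically in sequence length while the MLP scales linearly, running the encoder on the full sequence costs several times more per layer than on the visible subset alone.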
We introduce SLiM, a novel decoder-free framework. By eliminating the heavy reconstruction decoder and adopting a symmetric token processing design, SLiM fundamentally resolves the computational asymmetry and inference overhead inherent in standard MAE architectures.
We propose a unified representation learning architecture that synergizes the global discriminative power of contrastive learning with the local context sensitivity of masked modeling through a single, shared encoder.
To mitigate shortcut reconstruction caused by the high correlation among skeletal joints, we adopt Semantic Tube Masking alongside refined Skeleton-Aware Augmentations, which force the encoder to capture deeper action semantics.
Extensive experiments demonstrate that our SLiM consistently achieves state-of-the-art performance across all downstream protocols, while simultaneously providing a 7.89× reduction in inference computational cost compared to existing MAE methods.
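The shared-encoder design described above can be sketched as follows. This is a minimal illustration under assumed shapes and a stand-in linear "encoder", not the authors' implementation: one set of weights serves both the masked global view (masked modeling loss on masked positions) and two augmented views (a simple alignment stand-in for the contrastive term).

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(tokens, W):
    # Shared encoder stand-in: one linear map + tanh.
    # SLiM uses a transformer; this only shows the shared data flow.
    return np.tanh(tokens @ W)

T, J, C, D = 8, 25, 16, 32           # frames, joints, channels, feat dim (assumed)
W = rng.normal(0, 0.1, (C, D))       # one weight set shared by ALL views
seq = rng.normal(size=(T * J, C))    # flattened skeleton tokens (hypothetical)

# --- Masked modeling on the global view ---------------------------------
mask = rng.random(T * J) < 0.5       # assumed mask ratio
masked = seq.copy()
masked[mask] = 0.0                   # zero out masked tokens
feats = encoder(masked, W)
target = encoder(seq, W)             # stand-in for the feature target
l_mfm = np.mean((feats[mask] - target[mask]) ** 2)

# --- Contrastive term on two augmented views ----------------------------
def augment(x):                      # toy augmentation: additive noise
    return x + rng.normal(0, 0.05, x.shape)

z1 = encoder(augment(seq), W).mean(axis=0)   # pooled embedding, view 1
z2 = encoder(augment(seq), W).mean(axis=0)   # pooled embedding, view 2
cos = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2) + 1e-8)
l_cl = 1.0 - cos                     # alignment stand-in for InfoNCE

loss = l_mfm + l_cl                  # joint objective (weighting assumed)
print(float(loss))
```

The point of the sketch is that no decoder appears anywhere: both objectives are computed directly on the outputs of the same encoder, which is exactly what keeps pre-training and inference symmetric.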
Top: Previous masking (a) and augmentations (b-d) often result in trivial solutions or physically implausible poses. Bottom: Our Semantic Tube Masking (e) and Skeleton-Aware Augmentations (f-h) ensure anatomical and physical consistency through skeleton-aware designs.
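The tube idea can be sketched in a few lines: mask entire semantic body parts for all frames at once, so a masked joint can never be recovered by copying its value from a neighboring frame. The joint grouping below is a hypothetical partition of NTU's 25 joints, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grouping of 25 skeleton joints into semantic parts;
# the paper's exact partition may differ.
PARTS = {
    "torso":     [0, 1, 2, 3, 20],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
    "hands":     [21, 22, 23, 24],
}

def semantic_tube_mask(T: int, J: int = 25, n_parts: int = 3):
    """Mask whole body parts across ALL frames ('tubes' in time), so
    highly correlated neighbors cannot provide a shortcut solution."""
    names = rng.choice(list(PARTS), size=n_parts, replace=False)
    mask = np.zeros((T, J), dtype=bool)
    for name in names:
        mask[:, PARTS[name]] = True
    return mask

m = semantic_tube_mask(T=8)
print(m.mean())  # fraction of masked tokens
```

Checking a column of `m` (one joint over time) shows it is either fully masked or fully visible, which is the property that blocks temporal interpolation shortcuts.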
Linear evaluation results. Comparison with state-of-the-art methods on NTU-60, NTU-120, and PKU-MMD II datasets. Performance is reported as Top-1 accuracy (%). Bold indicates the best result, and underline indicates the second best.
Efficiency and performance comparison. SLiM achieves state-of-the-art accuracy while reducing inference computational cost by 7.89× compared to MAE methods. TGN: Target Generation Network. Bold indicates the best result.
Note that all ablation models are pre-trained for 100 epochs, whereas our final full model is trained for 150 epochs.
Left: Effect of objective functions. ℒMFM denotes the MFM loss, ℒCL represents a standard CL loss without temporal diversity, while ℒGLCL incorporates diverse temporal sampling. Right: Ablation of the dual-role masking strategies within STM. We evaluate the synergy between applying masked modeling to the Global View and utilizing augmented Local Views for contrastive learning (Limited: previous motion-aware masking).
@article{do2026less,
  title={Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning},
  author={Do, Jeonghyeok and Chen, Yun and Youk, Geunhyuk and Kim, Munchurl},
  journal={arXiv preprint arXiv:2603.10648},
  year={2026}
}