Top: Standard Masked Auto-Encoder (MAE) methods suffer from a 14.38× computational surge during inference relative to pre-training due to asymmetric full-sequence processing.
Bottom: SLiM synergizes masked modeling with contrastive learning (CL) in a decoder-free framework.
This symmetric design achieves a 7.89× reduction in inference cost compared to MAE baselines.
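To build intuition for why masked pre-training is so much cheaper than full-sequence inference, here is a toy per-layer FLOPs model for a transformer encoder. The constants (token count, width, 25% keep ratio) are illustrative assumptions, not the paper's settings, and the reported 14.38× figure additionally reflects the MAE decoder; this sketch only shows the encoder-side asymmetry.

```python
def layer_flops(n_tokens: int, d: int = 256) -> int:
    """Rough FLOPs for one transformer layer (hypothetical constants)."""
    attn = 2 * n_tokens**2 * d   # attention scores + weighted value sum
    mlp = 8 * n_tokens * d**2    # two projections with 4x expansion
    return attn + mlp

full = 200      # assumed frames x joints token count at inference
visible = 50    # 25% of tokens kept after masking during pre-training

ratio = layer_flops(full) / layer_flops(visible)
print(round(ratio, 2))  # encoder cost grows super-linearly with length
```

Because attention scales quadratically in sequence length while the MLP scales linearly, running the encoder on the full sequence costs several times more per layer than on the visible subset alone.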
We introduce SLiM, a novel decoder-free framework. By eliminating the heavy reconstruction decoder and adopting a symmetric token processing design, SLiM fundamentally resolves the computational asymmetry and inference overhead inherent in standard MAE architectures.
We propose a unified representation learning architecture that synergizes the global discriminative power of contrastive learning with the local context sensitivity of masked modeling through a single, shared encoder.
To mitigate shortcut reconstruction caused by the high correlation among skeletal joints, we adopt Semantic Tube Masking alongside refined Skeleton-Aware Augmentations, which force the encoder to capture deeper action semantics.
Extensive experiments demonstrate that our SLiM consistently achieves state-of-the-art performance across all downstream protocols, while simultaneously providing a 7.89× reduction in inference computational cost compared to existing MAE methods.
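The shared-encoder design described above can be sketched as follows. This is a minimal illustration under assumed shapes and a stand-in linear "encoder", not the authors' implementation: one set of weights serves both the masked global view (masked modeling loss on masked positions) and two augmented views (a simple alignment stand-in for the contrastive term).

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(tokens, W):
    # Shared encoder stand-in: one linear map + tanh.
    # SLiM uses a transformer; this only shows the shared data flow.
    return np.tanh(tokens @ W)

T, J, C, D = 8, 25, 16, 32           # frames, joints, channels, feat dim (assumed)
W = rng.normal(0, 0.1, (C, D))       # one weight set shared by ALL views
seq = rng.normal(size=(T * J, C))    # flattened skeleton tokens (hypothetical)

# --- Masked modeling on the global view ---------------------------------
mask = rng.random(T * J) < 0.5       # assumed mask ratio
masked = seq.copy()
masked[mask] = 0.0                   # zero out masked tokens
feats = encoder(masked, W)
target = encoder(seq, W)             # stand-in for the feature target
l_mfm = np.mean((feats[mask] - target[mask]) ** 2)

# --- Contrastive term on two augmented views ----------------------------
def augment(x):                      # toy augmentation: additive noise
    return x + rng.normal(0, 0.05, x.shape)

z1 = encoder(augment(seq), W).mean(axis=0)   # pooled embedding, view 1
z2 = encoder(augment(seq), W).mean(axis=0)   # pooled embedding, view 2
cos = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2) + 1e-8)
l_cl = 1.0 - cos                     # alignment stand-in for InfoNCE

loss = l_mfm + l_cl                  # joint objective (weighting assumed)
print(float(loss))
```

The point of the sketch is that no decoder appears anywhere: both objectives are computed directly on the outputs of the same encoder, which is exactly what keeps pre-training and inference symmetric.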
Top: Previous masking (a) and augmentations (b-d) often result in trivial solutions or physically implausible poses. Bottom: Our Semantic Tube Masking (e) and Skeleton-Aware Augmentations (f-h) ensure anatomical and physical consistency through skeleton-aware designs.
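The tube idea can be sketched in a few lines: mask entire semantic body parts for all frames at once, so a masked joint can never be recovered by copying its value from a neighboring frame. The joint grouping below is a hypothetical partition of NTU's 25 joints, not necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grouping of 25 skeleton joints into semantic parts;
# the paper's exact partition may differ.
PARTS = {
    "torso":     [0, 1, 2, 3, 20],
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
    "hands":     [21, 22, 23, 24],
}

def semantic_tube_mask(T: int, J: int = 25, n_parts: int = 3):
    """Mask whole body parts across ALL frames ('tubes' in time), so
    highly correlated neighbors cannot provide a shortcut solution."""
    names = rng.choice(list(PARTS), size=n_parts, replace=False)
    mask = np.zeros((T, J), dtype=bool)
    for name in names:
        mask[:, PARTS[name]] = True
    return mask

m = semantic_tube_mask(T=8)
print(m.mean())  # fraction of masked tokens
```

Checking a column of `m` (one joint over time) shows it is either fully masked or fully visible, which is the property that blocks temporal interpolation shortcuts.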
Linear evaluation results. Comparison with state-of-the-art methods on NTU-60, NTU-120, and PKU-MMD II datasets. Performance is reported as Top-1 accuracy (%). Bold indicates the best result, and underline indicates the second best.
Efficiency and performance comparison. SLiM achieves state-of-the-art accuracy while reducing inference computational cost by 7.89× compared to MAE methods. TGN: Target Generation Network. Bold indicates the best result.
Note that all ablation models are pre-trained for 100 epochs, whereas our final full model is trained for 150 epochs.
Left: Effect of objective functions. ℒMFM denotes the MFM loss, ℒCL represents a standard CL loss without temporal diversity, while ℒGLCL incorporates diverse temporal sampling. Right: Ablation of the dual-role masking strategies within STM. We evaluate the synergy between applying masked modeling to the Global View and utilizing augmented Local Views for contrastive learning (Limited: previous motion-aware masking).
@article{do2026less,
  title={Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning},
  author={Do, Jeonghyeok and Chen, Yun and Youk, Geunhyuk and Kim, Munchurl},
  journal={arXiv preprint arXiv:2603.10648},
  year={2026}
}