Overview of SkateFormer's partition-specific attention strategy. SkateFormer partitions joints and frames according to four types of skeletal-temporal relation (Skate-Types) and performs skeletal-temporal self-attention (Skate-MSA) within each partition.
We propose a Skeletal-Temporal Transformer (SkateFormer) with a partition-specific attention strategy (Skate-MSA) for skeleton-based action recognition that captures skeletal-temporal relations while reducing computational complexity.
We introduce a range of augmentation techniques and an effective positional embedding method, named Skate-Embedding, which combines skeletal and temporal features. This method significantly enhances action recognition performance by forming an outer product between learnable skeletal features and fixed temporal index features.
SkateFormer sets a new state of the art in action recognition, both with multiple modalities (4-ensemble condition) and with single modalities (joint, bone, joint motion, bone motion), showing notable improvement over the most recent state-of-the-art methods. It also establishes a new state of the art in interaction recognition, a sub-field of action recognition.
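The Skate-Embedding described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sinusoidal form of the fixed temporal index features, the shared channel dimension, the per-channel reading of "outer product", and the names `temporal_index_features` and `skate_embedding` are all assumptions made for illustration. A random array stands in for the learnable skeletal features.

```python
import numpy as np

def temporal_index_features(frame_idx, dim):
    """Fixed (non-learnable) features of frame indices.

    Assumption: a sinusoidal encoding; the source only says the features are fixed.
    `dim` must be even.
    """
    idx = np.asarray(frame_idx, dtype=np.float64)[:, None]       # (T, 1)
    freqs = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))      # (dim/2,)
    angles = idx * freqs                                         # (T, dim/2)
    feats = np.empty((idx.shape[0], dim))
    feats[:, 0::2] = np.sin(angles)
    feats[:, 1::2] = np.cos(angles)
    return feats

def skate_embedding(skeletal_feats, temporal_feats):
    """Combine (V, C) skeletal and (T, C) temporal features into (T, V, C).

    Assumption: one outer product per channel,
    out[t, v, c] = temporal_feats[t, c] * skeletal_feats[v, c].
    """
    return np.einsum("tc,vc->tvc", temporal_feats, skeletal_feats)

# Hypothetical sizes: 16 sampled frames, 25 joints, 8 channels.
rng = np.random.default_rng(0)
skeletal = rng.normal(size=(25, 8))              # stand-in for learnable parameters
temporal = temporal_index_features(np.arange(16), dim=8)
embedding = skate_embedding(skeletal, temporal)  # shape (16, 25, 8)
```

Because the temporal features depend only on the frame indices, the same learnable skeletal features can be reused when frames are sampled at different positions.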
Top. The overall framework of SkateFormer.
Bottom left. The Skate-MSA of SkateFormer.
Bottom right. Skate-Type partition and reverse.
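The partition and reverse steps can be sketched with plain reshapes. This is a simplified illustration, not the released code. It assumes that the four Skate-Types are the combinations of neighboring or distant joints with local or global motion, that joints are ordered so contiguous indices are physical neighbors, and that "distant" and "global" can be modeled by strided sampling. The function names and arguments are hypothetical.

```python
import numpy as np

def skate_partition(x, n_seg, n_part, local_motion=True, neighbor_joints=True):
    """Split a (T, V, C) sequence into n_seg * n_part partitions of tokens.

    local_motion=True     -> each partition holds contiguous frames
    local_motion=False    -> each partition holds frames strided by n_seg
    neighbor_joints=True  -> each partition holds contiguous joints
    neighbor_joints=False -> each partition holds joints strided by n_part
    """
    T, V, C = x.shape
    M, L = T // n_seg, V // n_part                  # frames / joints per partition
    t_shape = (n_seg, M) if local_motion else (M, n_seg)
    v_shape = (n_part, L) if neighbor_joints else (L, n_part)
    x = x.reshape(*t_shape, *v_shape, C)            # axes: (t0, t1, v0, v1, C)
    t_group, t_in = (0, 1) if local_motion else (1, 0)
    v_group, v_in = (2, 3) if neighbor_joints else (3, 2)
    x = x.transpose(t_group, v_group, t_in, v_in, 4)
    return x.reshape(n_seg * n_part, M * L, C)      # (partitions, tokens, C)

def skate_reverse(p, T, V, n_seg, n_part, local_motion=True, neighbor_joints=True):
    """Undo skate_partition and restore the (T, V, C) layout."""
    C = p.shape[-1]
    M, L = T // n_seg, V // n_part
    x = p.reshape(n_seg, n_part, M, L, C)           # (t_group, v_group, t_in, v_in, C)
    t_axes = (0, 2) if local_motion else (2, 0)
    v_axes = (1, 3) if neighbor_joints else (3, 1)
    return x.transpose(*t_axes, *v_axes, 4).reshape(T, V, C)

# Hypothetical sizes: 8 frames, 6 joints, 2 channels, 2 segments, 3 body parts.
x = np.arange(8 * 6 * 2, dtype=float).reshape(8, 6, 2)
parts = skate_partition(x, n_seg=2, n_part=3, local_motion=False, neighbor_joints=True)
restored = skate_reverse(parts, 8, 6, 2, 3, local_motion=False, neighbor_joints=True)
```

Self-attention would then run independently over the token axis of each partition. With P partitions over N = T·V tokens, the number of attended token pairs drops from N² to N²/P, which is one way to read the reduced complexity claimed above.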
Comparison with existing transformer-based methods. S-Type, T-Type, S-Attn, T-Attn, T-Conv, and ST-Attn indicate 'physically neighboring joints', 'local motion', 'skeletal attention', 'temporal attention', 'temporal convolution', and 'skeletal-temporal attention', respectively.
Top-1 accuracy of skeleton-based action recognition methods. (i) E1 - joint modality only; (ii) E2 - joint + bone modalities; and (iii) E4 - joint + bone + joint motion + bone motion modalities.
Top-1 accuracy of skeleton-based interaction recognition methods. Human interaction recognition is a sub-field of skeleton-based action recognition that focuses on scenarios in which two or more individuals take part in a single action.
Comparative analysis of SkateFormer with other methods by parameters, FLOPs, inference time, and average top-1 accuracy for joint modality.
(i) J - joint modality; (ii) B - bone modality; (iii) JM - joint motion modality; and (iv) BM - bone motion modality.
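For readers unfamiliar with these four modalities, the sketch below shows how they are commonly derived from joint coordinates in skeleton-based recognition: a bone is the offset from a joint to its parent, and motion is the difference between consecutive frames. The page itself does not spell out these formulas, so treat them as the conventional definitions rather than SkateFormer's exact preprocessing. The function name `make_modalities` and the `parents` array are hypothetical.

```python
import numpy as np

def make_modalities(joints, parents):
    """Derive J, B, JM, BM from joint coordinates.

    joints:  (T, V, C) array of joint coordinates over T frames.
    parents: length-V integer array; parents[v] is the parent of joint v
             (assumption: the root joint lists itself, giving a zero bone).
    """
    joints = np.asarray(joints, dtype=np.float64)
    bone = joints - joints[:, parents, :]           # offset from parent joint

    def motion(x):
        m = np.zeros_like(x)
        m[:-1] = x[1:] - x[:-1]                     # frame-to-frame difference
        return m                                    # last frame stays zero

    return {"J": joints, "B": bone, "JM": motion(joints), "BM": motion(bone)}

# Hypothetical 3-joint chain (0 <- 1 <- 2) over 4 frames in 2-D.
parents = np.array([0, 0, 1])
joints = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
modalities = make_modalities(joints, parents)
```

An ensemble such as E2 or E4 is typically obtained by training one model per modality and combining their class scores.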
The activation level of Skate-Types according to action labels. For 'sitting down' (action label: 7) and 'standing up' (action label: 8), Skate-Type-3 or Skate-Type-4 exhibited pronounced activations, while 'reading' (action label: 10) and 'writing' (action label: 11) prominently activated Skate-Type-1.
@inproceedings{do2025skateformer,
title={Skateformer: skeletal-temporal transformer for human action recognition},
author={Do, Jeonghyeok and Kim, Munchurl},
booktitle={European Conference on Computer Vision},
pages={401--420},
year={2025},
organization={Springer}
}