Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition

Motivation of our TDSM pipeline. Previous methods rely on direct alignment between the skeleton and text latent spaces, but suffer from modality gaps that limit generalization. Our TDSM overcomes this challenge by leveraging a reverse diffusion process to embed text prompts into a unified skeleton-text latent space, ensuring more effective cross-modal alignment.

Abstract

We present the first diffusion-based framework for zero-shot skeleton-based action recognition, called Triplet Diffusion for Skeleton-Text Matching (TDSM). TDSM implicitly aligns skeleton features with text prompts by taking full advantage of the excellent text-image correspondence learning in generative diffusion processes, and is thus able to learn fused discriminative features in a unified latent space.

We propose a novel triplet diffusion (TD) loss to enhance the model's discriminative power by ensuring accurate denoising for correct skeleton-text pairs while suppressing incorrect pairs.

Our TDSM significantly outperforms very recent state-of-the-art (SOTA) methods by large margins of 2.36 to 13.05 percentage points across multiple benchmarks, demonstrating scalability and robustness under various seen-unseen split settings.

Overview of TDSM

Training framework of our TDSM for zero-shot skeleton-based action recognition. During training, a skeleton and its ground-truth (GT) text prompt are treated as a positive pair to encourage accurate denoising, while the skeleton and wrong text prompts are treated as negative pairs to suppress denoising. This process is driven by the proposed TD loss, which operates within the reverse diffusion process to maximize the noise prediction discrepancy between positive and negative pairs, thereby improving the model's discriminative alignment between skeleton features and text prompts.
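To make the positive/negative contrast concrete, the following is a minimal sketch of a triplet-style loss on noise predictions. It assumes a hinge (margin) form, which this page does not specify; the function names `mse` and `triplet_diffusion_loss` and the `margin` parameter are hypothetical and chosen only for illustration. The actual TD loss may differ in form and weighting.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def triplet_diffusion_loss(eps, eps_pos, eps_neg, margin=1.0):
    """Hinge-style sketch (an assumption, not the paper's exact loss).

    eps     : the Gaussian noise that was added to the skeleton latent
    eps_pos : noise predicted when conditioning on the GT text prompt
    eps_neg : noise predicted when conditioning on a wrong text prompt
    """
    pos_err = mse(eps, eps_pos)  # should become small (accurate denoising)
    neg_err = mse(eps, eps_neg)  # should stay large (suppressed denoising)
    # The loss is zero once the negative error exceeds the positive by `margin`.
    return max(pos_err - neg_err + margin, 0.0)


# Well-separated pair: the positive prediction matches the noise exactly.
loss_good = triplet_diffusion_loss([0.0, 0.0], [0.0, 0.0], [2.0, 2.0])
# Swapped roles: the wrong prompt denoises better, so the loss is positive.
loss_bad = triplet_diffusion_loss([0.0, 0.0], [2.0, 2.0], [0.0, 0.0])
```

In this sketch the loss is minimized only when the correct prompt denoises better than the wrong prompt by at least the margin, which is one simple way to enlarge the noise prediction discrepancy between the two pairs.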

Inference framework of our TDSM for zero-shot skeleton-based action recognition. During inference, the model evaluates all candidate text prompts for a given unseen skeleton sequence by comparing their predicted noises with a fixed Gaussian noise, and selects the candidate that best denoises as the predicted label.
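The selection rule described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: `predict_label`, `toy_denoiser`, `PROTOTYPES`, and the single `alpha_bar` noise-schedule value are all hypothetical stand-ins, and the real model conditions a learned denoiser on text features rather than on per-class prototypes.

```python
import math
import random


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def predict_label(x0, prompts, denoiser, alpha_bar, eps_test):
    """Pick the prompt whose predicted noise is closest to the fixed noise."""
    # Forward-diffuse the skeleton latent once with the fixed test noise.
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps_test)]
    # Score every candidate prompt by its noise prediction error.
    errors = {p: mse(eps_test, denoiser(x_t, p, alpha_bar)) for p in prompts}
    return min(errors, key=errors.get), errors


# Toy stand-in for a trained, text-conditioned denoiser (hypothetical):
# each prompt maps to a prototype latent, and the noise is recovered exactly
# only when the prototype matches the clean skeleton latent.
PROTOTYPES = {"wave": [1.0, 0.0, 0.0, 0.0], "kick": [0.0, 1.0, 0.0, 0.0]}


def toy_denoiser(x_t, prompt, alpha_bar):
    proto = PROTOTYPES[prompt]
    return [(xt - math.sqrt(alpha_bar) * p) / math.sqrt(1.0 - alpha_bar)
            for xt, p in zip(x_t, proto)]


rng = random.Random(0)
eps_test = [rng.gauss(0.0, 1.0) for _ in range(4)]  # fixed Gaussian noise
label, errors = predict_label(PROTOTYPES["wave"], list(PROTOTYPES),
                              toy_denoiser, 0.5, eps_test)
```

Here the skeleton latent equals the "wave" prototype, so the "wave" prompt yields the smallest noise prediction error and is returned as the predicted label.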

Performance Evaluation

Evaluation on SynSE and PURLS benchmarks. Top-1 accuracy of zero-shot skeleton-based action recognition methods. Each split is denoted as X/Y, where X represents the number of seen classes and Y the number of unseen classes.

Evaluation on SMIE benchmark. Top-1 accuracy of zero-shot skeleton-based action recognition methods. Each split is denoted as X/Y, where X represents the number of seen classes and Y the number of unseen classes.

Ablation Studies

Left. Effect of loss function design.
Right. Impact of total timesteps \( T \).

Impact of timestep \( t_{\text{test}} \) and noise \( \epsilon_{\text{test}} \) in inference. Our TDSM consistently outperforms the state-of-the-art methods regardless of noise levels.

BibTeX

@InProceedings{Do_2025_ICCV,
    author    = {Do, Jeonghyeok and Kim, Munchurl},
    title     = {Bridging the Skeleton-Text Modality Gap: Diffusion-Powered Modality Alignment for Zero-shot Skeleton-based Action Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {12757-12768}
}
