Motivation of our TDSM pipeline. Previous methods rely on direct alignment between skeleton and text latent spaces, but suffer from modality gaps that limit generalization. Our TDSM overcomes this challenge by leveraging a reverse diffusion process to embed text prompts into a unified skeleton-text latent space, ensuring more effective cross-modal alignment.
We present a diffusion-based framework for zero-shot skeleton-based action recognition, called Triplet Diffusion for Skeleton-Text Matching (TDSM). It is the first framework to apply diffusion models to this task; it implicitly aligns skeleton features with text prompts by exploiting the strong text-image correspondence learning of generative diffusion processes, and thereby learns fused discriminative features in a unified latent space.
We propose a novel triplet diffusion (TD) loss to enhance the model's discriminative power by ensuring accurate denoising for correct skeleton-text pairs while suppressing incorrect pairs.
Our TDSM significantly outperforms very recent state-of-the-art (SOTA) methods by large margins of 2.36 to 13.05 percentage points across multiple benchmarks, demonstrating scalability and robustness under various seen-unseen split settings.
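The triplet diffusion loss described in the contributions above can be illustrated with a minimal sketch. The text states only that the loss rewards accurate denoising for correct skeleton-text pairs and suppresses incorrect pairs; the hinge form, the mean-squared error measure, the `margin` parameter, and the function name `triplet_diffusion_loss` below are assumptions made for illustration, not the exact formulation of the paper.

```python
import numpy as np


def triplet_diffusion_loss(eps, eps_pos, eps_neg, margin=1.0):
    """Hinge-style triplet loss on denoising errors (assumed form).

    eps     : the Gaussian noise that was added to the skeleton latent
    eps_pos : noise predicted when conditioning on the correct text prompt
    eps_neg : noise predicted when conditioning on an incorrect text prompt
    margin  : hypothetical margin separating the two denoising errors
    """
    eps = np.asarray(eps, dtype=float)
    # Denoising error for the matching (positive) skeleton-text pair.
    pos_err = np.mean((np.asarray(eps_pos, dtype=float) - eps) ** 2)
    # Denoising error for the mismatched (negative) skeleton-text pair.
    neg_err = np.mean((np.asarray(eps_neg, dtype=float) - eps) ** 2)
    # Zero once the positive pair denoises better than the negative by `margin`.
    return max(0.0, pos_err - neg_err + margin)


eps = [0.0, 0.0]
good = triplet_diffusion_loss(eps, eps_pos=[0.0, 0.0], eps_neg=[2.0, 2.0])
bad = triplet_diffusion_loss(eps, eps_pos=[2.0, 2.0], eps_neg=[0.0, 0.0])
```

In this toy case the loss vanishes when the correct prompt denoises perfectly and the incorrect prompt does not, and it becomes large when the roles are swapped, which is the behavior the contribution describes.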
Inference framework of our TDSM for zero-shot skeleton-based action recognition. During inference, the model evaluates all candidate text prompts for a given unseen skeleton sequence by comparing their predicted noises with a fixed Gaussian noise, and selects the candidate whose predicted noise is closest to it (i.e., the one that denoises best) as the predicted label.
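The selection rule in this caption can be sketched as follows. The sketch assumes that the predicted noises for all candidate prompts have already been computed by the denoising network (not shown), and that closeness is measured by mean-squared error; the function name `select_label` and the example labels are hypothetical.

```python
import numpy as np


def select_label(eps_test, predicted_noises, labels):
    """Pick the candidate whose predicted noise is closest to eps_test.

    eps_test         : the fixed Gaussian noise used at inference
    predicted_noises : one predicted noise per candidate text prompt
    labels           : candidate class names, in the same order
    """
    eps_test = np.asarray(eps_test, dtype=float)
    preds = np.asarray(predicted_noises, dtype=float)
    # Mean-squared error per candidate (an assumed distance measure).
    errors = ((preds - eps_test) ** 2).reshape(len(preds), -1).mean(axis=1)
    best = int(np.argmin(errors))
    return labels[best], errors


labels = ["sit down", "wave", "jump"]  # hypothetical unseen classes
eps_test = [1.0, -1.0]
predicted = [[0.0, 0.0], [1.0, -0.5], [3.0, 3.0]]
label, errors = select_label(eps_test, predicted, labels)
```

Here the second candidate has the smallest error, so it is returned as the predicted label.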
Evaluation on SynSE and PURLS benchmarks. Top-1 accuracy of zero-shot skeleton-based action recognition methods. Each split is denoted as X/Y, where X represents the number of seen classes and Y the number of unseen classes.
Evaluation on SMIE benchmark. Top-1 accuracy of zero-shot skeleton-based action recognition methods. Each split is denoted as X/Y, where X represents the number of seen classes and Y the number of unseen classes.
Left. Effect of loss function design.
Right. Impact of total timesteps \( T \).
Impact of timestep \( t_{\text{test}} \) and noise \( \epsilon_{\text{test}} \) in inference. Our TDSM consistently outperforms the state-of-the-art methods regardless of noise levels.