🏃‍♂️
MotionGrounder:
Grounded Multi-Object Motion
Transfer via Diffusion Transformer

VICLab, KAIST, South Korea    Adobe Research, California    CMLab, Chung-Ang University, South Korea   
Co-Corresponding Authors

MotionGrounder is a Diffusion Transformer-based framework that transfers motion from reference videos to newly synthesized videos with explicit object grounding, enabling object-consistent motion transfer under structural and appearance changes in a training-free, zero-shot manner.

Abstract

Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)–based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, a DiT-based framework that is the first to handle motion transfer with multi-object controllability. Our Flow-based Motion Signal (FMS) in MotionGrounder provides a stable motion prior for target video generation, while our Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.

Motivation

Existing motion transfer methods struggle with multi-object dynamics, often relying on costly inversion processes or lacking explicit grounding between objects and their corresponding target captions. While recent DiT-based approaches offer zero-shot capabilities, current patch-trajectory-based methods extract noisy motion signals that hinder precise control over complex scenes. To address these gaps, MotionGrounder introduces a unified, inversion-free framework that enables stable motion transfer and grounded multi-object control across distinct spatial regions.


Method

Overall Framework

MotionGrounder transfers motion dynamics from a source video to a target video while grounding each object caption to its designated spatial region. By estimating optical flow, we construct a Flow-based Motion Signal (FMS) that serves as a representation of the source video's motion dynamics. During denoising, the FMS aligns the motion of the target video with the source. Simultaneously, our Object-Caption Alignment Loss (OCAL) ensures that individual object captions are grounded within their designated spatial regions across the entire video duration.
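The guided denoising described above can be sketched as a gradient step on the latents against a combined FMS + OCAL objective. This is a minimal illustrative sketch, not the authors' implementation: the step size, the `grad_fn` callable, and the update rule are all assumptions standing in for the actual DiT sampling loop.

```python
import torch

def guided_denoise_step(latents: torch.Tensor, grad_fn, step_size: float = 0.1) -> torch.Tensor:
    """One guidance step: nudge the video latents along the negative gradient
    of a combined guidance objective (e.g. FMS + OCAL).

    grad_fn: maps latents -> scalar loss (a stand-in for the real objective).
    """
    latents = latents.detach().requires_grad_(True)
    loss = grad_fn(latents)                       # evaluate the guidance objective
    (grad,) = torch.autograd.grad(loss, latents)  # gradient w.r.t. the latents
    return (latents - step_size * grad).detach()  # descend before the next step
```

In practice such a step would be interleaved with the diffusion model's own denoising updates at each timestep.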

MotionGrounder Method Overview

Flow-based Motion Signal (FMS)

Our Flow-based Motion Signal (FMS) provides a stable motion prior by constructing latent-space patch trajectories from dense optical flows. During denoising, motion transfer is supervised by minimizing the distance between the reference FMS and the displacements derived from cross-frame attention weights.
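The FMS construction above can be sketched as follows, under stated assumptions: optical flow is precomputed and downsampled to the latent grid, and `attn_disp` stands in for displacements derived from the model's cross-frame attention weights. Function names and tensor layouts are illustrative, not the authors' API.

```python
import torch

def patch_trajectories(flows: torch.Tensor) -> torch.Tensor:
    """Accumulate per-frame optical flow into latent-space patch trajectories.

    flows: (T-1, H, W, 2) flow between consecutive frames on the latent grid.
    Returns (T, H, W, 2) absolute patch positions over time.
    """
    T1, H, W, _ = flows.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pos = torch.stack([xs, ys], dim=-1).float()  # (H, W, 2) starting positions
    traj = [pos]
    for t in range(T1):
        pos = pos + flows[t]                     # follow the flow field forward
        traj.append(pos)
    return torch.stack(traj, dim=0)              # (T, H, W, 2)

def fms_loss(ref_traj: torch.Tensor, attn_disp: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between reference FMS displacements and
    displacements derived from cross-frame attention (both (T-1, H, W, 2))."""
    ref_disp = ref_traj[1:] - ref_traj[:-1]
    return torch.mean((ref_disp - attn_disp) ** 2)
```

Minimizing `fms_loss` during denoising pulls the generated motion toward the reference trajectories.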


Object-Caption Alignment Loss (OCAL)

Our Object-Caption Alignment Loss (OCAL) enforces spatial grounding by aligning text-to-video and video-to-text attention with specific object masks during the denoising process. By simultaneously maximizing attention within the target region and suppressing it elsewhere, the objective ensures proper multi-object placement.
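A minimal sketch of such an alignment objective is shown below, assuming the text-video attention has been gathered into a per-object matrix and each object has a binary region mask; the exact normalization and weighting in OCAL are not specified here, so this is illustrative only.

```python
import torch

def ocal_loss(attn: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Sketch of an object-caption alignment objective.

    attn:  (K, N) attention from K object-caption tokens to N video patches,
           each row normalized to sum to 1.
    masks: (K, N) binary spatial masks, one per object.
    """
    inside = (attn * masks).sum(dim=-1)         # attention mass inside each region
    outside = (attn * (1 - masks)).sum(dim=-1)  # attention leaked outside it
    return (outside - inside).mean()            # maximize inside, suppress outside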


Dataset Generation Pipeline

To support multi-object evaluation, we constructed a custom dataset of 52 videos, pre-processed into 24-frame sequences at 480 × 720 resolution. Our pipeline utilizes CogVLM2-Caption and GPT-5 to generate and structure captions into a <subject> <verb> <scene> format, where each object described in the <subject> field is manually mapped to its corresponding mask. To evaluate model robustness, we generate three prompt types: (i) a Caption prompt that directly describes the input video content; (ii) a Subject prompt that alters the main objects while preserving the original background; and (iii) a Scene prompt that specifies an entirely new scene distinct from the source video.
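One way to represent the structured captions above is a small record per prompt; this schema is a hypothetical sketch of the <subject> <verb> <scene> format, not the dataset's actual file layout.

```python
from dataclasses import dataclass

@dataclass
class PromptRecord:
    """One structured caption: <subject> <verb> <scene>, plus its prompt type."""
    subject: str  # described objects, each manually mapped to a mask
    verb: str
    scene: str
    kind: str     # "caption" | "subject" | "scene"

    def text(self) -> str:
        # Reassemble the flat caption string fed to the model.
        return f"{self.subject} {self.verb} {self.scene}"
```

For example, a Scene prompt record could hold `subject="A fox and a raccoon"`, `verb="are perched on"`, `scene="a fallen tree trunk."`.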


Results

Caption Prompt

Reference Video

MotionGrounder (Ours)

DMT

MotionClone

ConMo

DiTFlow

CAPTION PROMPT: "A man and a woman are walking on a rocky path at sunset."

Reference Video

MotionGrounder (Ours)

DMT

MotionClone

ConMo

DiTFlow

CAPTION PROMPT: "A snail with a brown shell is crawling across a rough gray surface."

Subject Prompt

Reference Video

MotionGrounder (Ours)

DMT

MotionClone

ConMo

DiTFlow

SUBJECT PROMPT: "A soldier is walking toward a tank on a helipad."

Reference Video

MotionGrounder (Ours)

DMT

MotionClone

ConMo

DiTFlow

SUBJECT PROMPT: "A child, a cat, and a rabbit are interacting on a green lawn."

Scene Prompt

Reference Video

MotionGrounder (Ours)

DMT

MotionClone

ConMo

DiTFlow

SCENE PROMPT: "A fox and a raccoon are perched on a fallen tree trunk."

Reference Video

MotionGrounder (Ours)

DMT

MotionClone

ConMo

DiTFlow

SCENE PROMPT: "A peacock and a turkey are standing in a garden courtyard."

Additional Applications

Interchangeable Generation

Reference Video

"A fox and a raccoon are perched on a fallen tree trunk."

"A raccoon and a fox are perched on a fallen tree trunk."

Object Removal

"A raccoon is perched on a fallen tree trunk."

"A fox is perched on a fallen tree trunk."

"A fallen tree trunk."

BibTeX

@misc{teodoro2026motiongrounder,
    title={MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer},
    author={Samuel Teodoro and Yun Chen and Agus Gunawan and Soo Ye Kim and Jihyong Oh and Munchurl Kim},
    year={2026},
    eprint={2604.00853},
    archivePrefix={arXiv}
}