Motion transfer enables controllable video generation by transferring temporal dynamics from a reference video to synthesize a new video conditioned on a target caption. However, existing Diffusion Transformer (DiT)–based methods are limited to single-object videos, restricting fine-grained control in real-world scenes with multiple objects. In this work, we introduce MotionGrounder, the first DiT-based framework to handle motion transfer with multi-object controllability. MotionGrounder's Flow-based Motion Signal (FMS) provides a stable motion prior for target video generation, while its Object-Caption Alignment Loss (OCAL) grounds object captions to their corresponding spatial regions. We further propose a new Object Grounding Score (OGS), which jointly evaluates (i) spatial alignment between source video objects and their generated counterparts and (ii) semantic consistency between each generated object and its target caption. Our experiments show that MotionGrounder consistently outperforms recent baselines across quantitative, qualitative, and human evaluations.
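The two-part evaluation behind OGS can be sketched in a few lines. This is an illustrative sketch only: the IoU-times-cosine aggregation, and the assumption that per-object masks and embeddings are provided as inputs, are our own simplifications, not the paper's exact formulation.

```python
import numpy as np

def iou(mask_a, mask_b):
    """Spatial alignment between two boolean object masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter / union) if union else 0.0

def cosine(u, v):
    """Semantic consistency between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def object_grounding_score(src_masks, gen_masks, obj_embs, cap_embs):
    """Hypothetical OGS: average over objects of spatial IoU between
    source and generated masks, weighted by the cosine similarity of
    each generated object's embedding to its target-caption embedding."""
    scores = [iou(sm, gm) * cosine(oe, ce)
              for sm, gm, oe, ce in zip(src_masks, gen_masks, obj_embs, cap_embs)]
    return float(np.mean(scores))
```

A perfectly grounded object (identical masks, matching embeddings) scores 1.0 under this sketch; any spatial drift or semantic mismatch lowers the score multiplicatively.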
Existing motion transfer methods struggle with multi-object dynamics, often relying on costly inversion processes or lacking explicit grounding between objects and their corresponding target captions. While recent DiT-based approaches offer zero-shot capabilities, current patch-trajectory-based methods extract noisy motion signals that hinder precise control over complex scenes. To address these gaps, MotionGrounder introduces a unified, inversion-free framework that enables stable motion transfer and grounded multi-object control across distinct spatial regions.
MotionGrounder transfers motion dynamics from a source video to a target video while grounding each object caption to its designated spatial region. By estimating optical flow, we construct a Flow-based Motion Signal (FMS) that serves as a representation of the source video's motion dynamics. During denoising, the FMS aligns the motion of the target video with the source. Simultaneously, our Object-Caption Alignment Loss (OCAL) ensures that individual object captions are grounded within their designated spatial regions across the entire video duration.
Our Flow-based Motion Signal (FMS) provides a stable motion prior by constructing latent-space patch trajectories from dense optical flows. During denoising, motion transfer is supervised by minimizing the distance between the reference FMS and the displacements derived from cross-frame attention weights.
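The FMS construction above can be sketched with numpy. This is a minimal illustration under stated assumptions: the patch size, the flow-averaging scheme, and the L1 objective are our own simplifications of what "latent-space patch trajectories" and "minimizing the distance" might look like, not the paper's exact implementation.

```python
import numpy as np

def flow_to_patch_trajectories(flows, patch=8):
    """Build patch trajectories from dense optical flow.

    flows: (T-1, H, W, 2) frame-to-frame optical flow.
    Averages the flow over non-overlapping patch-by-patch cells, then
    cumulatively sums the per-frame displacements so each patch carries
    its trajectory relative to the first frame."""
    t, h, w, _ = flows.shape
    gh, gw = h // patch, w // patch
    patch_flow = (flows[:, :gh * patch, :gw * patch]
                  .reshape(t, gh, patch, gw, patch, 2)
                  .mean(axis=(2, 4)))          # (T-1, gh, gw, 2)
    return np.cumsum(patch_flow, axis=0)       # cumulative displacement

def motion_transfer_loss(ref_traj, attn_disp):
    """L1 distance between the reference FMS trajectories and the
    displacements derived from cross-frame attention weights."""
    return float(np.abs(ref_traj - attn_disp).mean())
```

During denoising, `attn_disp` would come from the model's cross-frame attention; minimizing this loss pulls the generated motion toward the reference trajectories.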
Our Object-Caption Alignment Loss (OCAL) enforces spatial grounding by aligning text-to-video and video-to-text attention with specific object masks during the denoising process. By simultaneously maximizing attention within the target region and suppressing it elsewhere, the objective ensures proper multi-object placement.
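The OCAL objective can be sketched as follows. The exact attention layout and the inside-ratio formulation are illustrative assumptions; the idea is simply to reward attention mass inside each object's mask and penalize mass that leaks outside it.

```python
import numpy as np

def object_caption_alignment_loss(attn, masks, eps=1e-8):
    """Hypothetical OCAL sketch.

    attn:  (T, K, H, W) attention of K object-caption tokens over space.
    masks: (T, K, H, W) binary masks of each object's designated region.
    Maximizing the fraction of attention inside the mask is equivalent
    to suppressing attention everywhere else, since the two sum to the
    total attention mass."""
    total = attn.sum(axis=(2, 3)) + eps           # (T, K)
    inside = (attn * masks).sum(axis=(2, 3))      # (T, K)
    inside_ratio = inside / total
    return float((1.0 - inside_ratio).mean())
```

Applying this per frame grounds every object caption to its region across the entire video duration, as the text describes; the paper applies the constraint to both text-to-video and video-to-text attention.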
To support multi-object evaluation, we constructed a custom dataset of 52 videos, pre-processed into 24-frame sequences at 480 × 720 resolution. Our pipeline utilizes CogVLM2-Caption and GPT-5 to generate and structure captions into a <subject> <verb> <scene> format, where every object described in the <subject> field is manually mapped to its corresponding mask. To evaluate model robustness, we generate three prompt types: (i) a Caption prompt that directly describes the input video content; (ii) a Subject prompt that alters the main objects while preserving the original background; and (iii) a Scene prompt that specifies an entirely new scene distinct from the source video.
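One way the structured caption records might look in practice is sketched below. The field names, file paths, and the example entry are hypothetical illustrations of the <subject> <verb> <scene> format and the object-to-mask mapping, not the dataset's actual schema.

```python
# Hypothetical record for one video in the evaluation dataset.
record = {
    "video_id": "example_0001",              # hypothetical identifier
    "caption": {                             # <subject> <verb> <scene> parts
        "subject": "a fox and a raccoon",
        "verb": "are perched on",
        "scene": "a fallen tree trunk",
    },
    "objects": {                             # each subject object -> its mask
        "fox": "masks/example_0001/fox.png",
        "raccoon": "masks/example_0001/raccoon.png",
    },
}

def assemble_caption(rec):
    """Join the structured parts back into a full Caption prompt."""
    c = rec["caption"]
    return f'{c["subject"]} {c["verb"]} {c["scene"]}.'.capitalize()
```

Subject and Scene prompts would then be produced by editing only the `subject` or `scene` field while the rest of the record is held fixed.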
Qualitative comparisons. Each example shows the Reference Video alongside results from MotionGrounder (Ours), DMT, MotionClone, ConMo, and DiTFlow.

CAPTION PROMPT: "A man and a woman are walking on a rocky path at sunset."
CAPTION PROMPT: "A snail with a brown shell is crawling across a rough gray surface."
SUBJECT PROMPT: "A soldier is walking toward a tank on a helipad."
SUBJECT PROMPT: "A child, a cat, and a rabbit are interacting on a green lawn."
SCENE PROMPT: "A fox and a raccoon are perched on a fallen tree trunk."
SCENE PROMPT: "A peacock and a turkey are standing in a garden courtyard."
Reference Video with progressively varied target captions:

"A fox and a raccoon are perched on a fallen tree trunk."
"A raccoon and a fox are perched on a fallen tree trunk."
"A raccoon is perched on a fallen tree trunk."
"A fox is perched on a fallen tree trunk."
"A fallen tree trunk."
@misc{teodoro2026motiongrounder,
title={MotionGrounder: Grounded Multi-Object Motion Transfer via Diffusion Transformer},
author={Samuel Teodoro and Yun Chen and Agus Gunawan and Soo Ye Kim and Jihyong Oh and Munchurl Kim},
year={2026},
eprint={2604.00853},
archivePrefix={arXiv}
}