PropFly: Learning to Propagate via On-the-Fly Supervision
from Pre-trained Video Diffusion Models

1Korea Advanced Institute of Science and Technology (KAIST),
2Kyungpook National University, 3Adobe Research
*Co-first authors (Equal contribution), Co-corresponding authors

Video Propagation Demos

Click the tabs below to see how PropFly propagates edits.


Source Video

PropFly (Result 1)

Edited First Frame 1 / Reference Frame 1

PropFly (Result 2)

Edited First Frame 2 / Reference Frame 2

Abstract

Propagation-based video editing enables precise user control by propagating a single edited frame to the following frames while preserving the original context. However, training such models typically requires large-scale paired (source, edited) video datasets, which are costly and complex to acquire.

Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of off-the-shelf paired datasets. Specifically, PropFly leverages one-step clean latent estimates obtained with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents. This pipeline trains an adapter to learn propagation via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation.

Our on-the-fly supervision ensures the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality results.

Proposed Framework

The overall PropFly training pipeline.

(a) Data Sampling & RSPF: We sample a video-text pair and synthesize an augmented prompt by appending a random style description to enrich training diversity.
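The prompt augmentation in step (a) can be sketched as follows. This is a minimal illustration: the style pool, the function name `augment_prompt`, and the comma-joined format are assumptions, since the paper page does not specify the actual style set or templating.

```python
import random

# Hypothetical style pool; the actual style descriptions used for
# training are not specified on this page.
STYLE_POOL = [
    "in watercolor painting style",
    "as a charcoal sketch",
    "under neon cyberpunk lighting",
]

def augment_prompt(prompt: str, rng: random.Random = None) -> str:
    """Append a randomly sampled style description to the source prompt
    to enrich training diversity."""
    rng = rng or random.Random()
    return f"{prompt}, {rng.choice(STYLE_POOL)}"

print(augment_prompt("a dog running on the beach", random.Random(0)))
```

Sampling the style independently for each video-text pair yields many distinct (source, edited) supervision pairs from the same underlying clip.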

(b) On-the-fly Data Pair Generation: A frozen pre-trained VDM synthesizes structurally aligned yet semantically distinct source (low-CFG) and target (high-CFG) latent pairs on the fly.
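Step (b) can be sketched numerically. The sketch below assumes a rectified-flow parameterization (x_t = (1 - t)·x0 + t·eps, velocity v = eps - x0), under which a one-step clean estimate is x0 ≈ x_t - t·v; the tensor shapes, CFG scales (1.5 and 9.0), and the random stand-ins for the frozen VDM's two prediction branches are all illustrative assumptions.

```python
import torch

def cfg_velocity(v_uncond, v_cond, scale):
    """Classifier-free guided velocity: extrapolate from the unconditional
    branch toward the conditional branch by the guidance scale."""
    return v_uncond + scale * (v_cond - v_uncond)

def one_step_clean_estimate(x_t, v, t):
    """Rectified-flow convention x_t = (1 - t) * x0 + t * eps, v = eps - x0,
    so the one-step clean estimate is x0 ≈ x_t - t * v."""
    return x_t - t * v

# Toy stand-ins for a frozen VDM's prediction branches (B, C, T, H, W).
x_t = torch.randn(1, 4, 8, 16, 16)   # noisy video latent
v_uncond = torch.randn_like(x_t)     # unconditional velocity prediction
v_cond = torch.randn_like(x_t)       # text-conditional velocity prediction
t = 0.7

# Same noisy latent, two guidance scales: structurally aligned,
# semantically distinct latent pair.
source = one_step_clean_estimate(x_t, cfg_velocity(v_uncond, v_cond, 1.5), t)
target = one_step_clean_estimate(x_t, cfg_velocity(v_uncond, v_cond, 9.0), t)
```

Because both estimates share the same noisy latent x_t and differ only in guidance scale, the pair is structurally aligned while the high-CFG estimate follows the (style-augmented) prompt more strongly.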

(c) Guidance-Modulated Flow Matching: A trainable adapter learns to propagate edits by predicting the VDM's high-CFG velocity, conditioned on the source video structure and the style of the edited first frame, via the GMFM loss.
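Step (c) can be sketched as a regression onto the frozen VDM's high-CFG velocity. Everything below is a toy stand-in: the real adapter architecture, conditioning mechanism, and channel layout are not described on this page, so a single 3D convolution over concatenated latents serves only to make the loss concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Toy adapter: maps (noisy latent, source latent, edited first frame)
    to a velocity prediction. A stand-in for the actual trainable adapter."""
    def __init__(self, ch=4):
        super().__init__()
        self.net = nn.Conv3d(ch * 3, ch, kernel_size=3, padding=1)

    def forward(self, x_t, src, first_frame):
        # Broadcast the edited first frame across time as style conditioning.
        ff = first_frame.expand_as(src)
        return self.net(torch.cat([x_t, src, ff], dim=1))

def gmfm_loss(adapter, x_t, src, first_frame, v_high_cfg):
    """Guidance-Modulated Flow Matching sketch: regress the adapter's
    output onto the frozen VDM's high-CFG velocity."""
    return F.mse_loss(adapter(x_t, src, first_frame), v_high_cfg)
```

The frozen VDM supplies `v_high_cfg` on the fly (as in step (b)); only the adapter receives gradients, so no paired video dataset is ever needed.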


Figure 1. Overview of our PropFly training pipeline.

Quantitative Results

We quantitatively evaluate PropFly against state-of-the-art text-guided and propagation-based methods on the EditVerseBench and TGVE benchmarks. Extensive experiments demonstrate that PropFly significantly outperforms existing baselines, achieving superior scores in video quality, text alignment, and temporal consistency.


Table 1. Quantitative comparison on the subset of EditVerseBench.


Table 2. Quantitative comparison on the TGVE benchmark.

Qualitative Comparison

Comparison against state-of-the-art propagation-based methods.
PropFly (Ours) successfully propagates the edit while preserving the original motion and structure better than the baselines.