Recent AI-based video editing has enabled users to edit videos through simple text prompts, significantly simplifying the editing process. However, existing zero-shot video editing techniques primarily focus on global or single-object edits, which can lead to unintended changes in other parts of the video. When multiple objects require localized edits, these methods face challenges such as unfaithful editing, editing leakage, and a lack of suitable evaluation datasets and metrics. To overcome these limitations, we propose Probability Redistribution for Instance-aware Multi-object Video Editing (PRIMEdit). PRIMEdit is a zero-shot framework that introduces two key modules: (i) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing, and (ii) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage. Additionally, we present our new MIVE Dataset featuring diverse video scenarios for multi-instance video editing, and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user study evaluations demonstrate that PRIMEdit significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing.
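As a rough illustration of the idea behind a cross-instance accuracy metric of this kind (this is our paraphrase, not the paper's exact definition; all names below are assumptions), one can score each edited instance region against every target instance caption with CLIP and measure how often an instance matches its own caption best:

```python
import torch
import torch.nn.functional as F

def cia_score(crop_feats, caption_feats):
    """Sketch of a cross-instance accuracy style metric.

    crop_feats:    (N, D) CLIP image features of N edited instance regions.
    caption_feats: (N, D) CLIP text features of the N target instance captions.
    Returns the fraction of instances whose best-matching caption is their
    own; lower values suggest editing leakage across instances.
    """
    sim = F.normalize(crop_feats, dim=-1) @ F.normalize(caption_feats, dim=-1).T
    own_best = sim.argmax(dim=1) == torch.arange(sim.size(0), device=sim.device)
    return own_best.float().mean()
```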
In this work, we present PRIMEdit, a zero-shot multi-instance video editing framework that achieves faithful edits with reduced attention leakage. PRIMEdit disentangles the multi-instance editing process through (i) Instance-centric Probability Redistribution (IPR), which enhances editing localization and faithfulness, and (ii) Disentangled Multi-instance Sampling (DMS), which reduces editing leakage.
Since our sampling requires that the edited objects appear within their respective masks, we propose IPR (shown below, with a sketch after this paragraph) to ensure that this condition is consistently met.
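As a loose sketch of what redistributing cross-attention probabilities toward an instance mask might look like (tensor shapes, the `boost` target, and all names here are our assumptions, not the paper's actual formulation):

```python
import torch

def redistribute_probs(attn, mask, tok_idx, boost=0.8, eps=1e-6):
    """Hypothetical sketch of instance-centric probability redistribution.

    attn:    (H, P, T) softmax cross-attention maps (each row sums to 1).
    mask:    (P,) boolean instance mask over latent pixels.
    tok_idx: LongTensor of indices of the text tokens for this instance.
    boost:   assumed target attention mass for the instance tokens
             inside the mask.
    """
    attn = attn.clone()
    # Current attention mass on the instance tokens at every pixel.
    inst = attn[:, :, tok_idx].sum(-1, keepdim=True)  # (H, P, 1)

    # Target mass: `boost` inside the instance mask, near zero outside,
    # so the edit is encouraged to appear only within the mask.
    target = torch.where(mask.view(1, -1, 1),
                         torch.full_like(inst, boost),
                         torch.full_like(inst, eps))

    # Rescale the instance tokens toward the target mass, then rescale the
    # remaining tokens so each row still sums to 1.
    attn[:, :, tok_idx] = attn[:, :, tok_idx] * (target / inst.clamp_min(eps))
    other = torch.ones(attn.size(-1), dtype=torch.bool, device=attn.device)
    other[tok_idx] = False
    attn[:, :, other] = attn[:, :, other] * ((1.0 - target) / (1.0 - inst).clamp_min(eps))
    return attn
```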
In our DMS (shown below), we independently modify each instance using series noise sampling (yellow box), and the denoised instance latents are then harmonized through parallel noise sampling (blue box), preceded by latent fusion (green box) and re-inversion (purple box).
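A minimal sketch of one such sampling step, assuming generic `denoise` and `invert` callables (all names and interfaces here are illustrative, not the paper's API):

```python
import torch

def dms_step(z_t, t, masks, instance_prompts, bg_prompt, denoise, invert):
    """Hypothetical sketch of one Disentangled Multi-instance Sampling step.
    `denoise(z, t, prompt)` runs one denoising step; `invert(z, t)` maps a
    latent back to noise level t."""
    # (i) Series noise sampling: denoise each instance independently with
    #     its own prompt so the edits cannot interfere with one another.
    inst_latents = [denoise(z_t, t, p) for p in instance_prompts]
    bg_latent = denoise(z_t, t, bg_prompt)

    # (ii) Latent fusion: paste each instance latent into its mask region.
    fused = bg_latent.clone()
    for z_i, m in zip(inst_latents, masks):
        fused = torch.where(m, z_i, fused)

    # (iii) Re-inversion: lift the fused latent back to noise level t so
    #       the joint pass below starts from a consistent state.
    z_re = invert(fused, t)

    # (iv) Parallel noise sampling: one joint denoising pass harmonizes
    #      instances and background into a coherent latent.
    return denoise(z_re, t, bg_prompt)
```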
We also present our new MIVE Dataset, specifically designed for multi-instance video editing tasks. The MIVE Dataset features 200 diverse videos sourced from the VIPSeg dataset.
We generated detailed source captions using LLaVA and summarized them with Llama 3. We then manually inserted tags into the source captions to establish instance-to-mask correspondence. Finally, we generated the target edit captions using Llama 3. A sample input video with its source and target captions is shown below. The target instance captions are color-coded to match the colors of the masks.
Source Caption: In a domestic setting, a person in a gray hoodie stands in front of washing machine A and washing machine B against a blue wall, with a blue recycling trash can to the left.

Target Caption: In a domestic setting, an alien stands in front of an oven and a yellow washing machine against a blue wall, with a blue recycling trash can to the left.

[Source video and masked source video are shown side by side.]
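The captioning pipeline above can be outlined roughly as follows (the model wrappers, prompts, and the `insert_tags` step are placeholders for illustration, not real APIs):

```python
def build_captions(frames, llava, llama, insert_tags):
    """Illustrative outline of the caption-generation pipeline."""
    # 1) Caption the source video in detail with LLaVA.
    detailed = llava(frames, "Describe this video in detail.")
    # 2) Summarize into a concise source caption with Llama 3.
    source = llama("Summarize the following into one sentence: " + detailed)
    # 3) Manually tag instance phrases (a human-in-the-loop step) to
    #    establish instance-to-mask correspondence, e.g. "[person] stands
    #    in front of [washing machine A] and [washing machine B]".
    tagged = insert_tags(source)
    # 4) Generate the target edit caption with Llama 3 by rewriting each
    #    tagged instance into a new target object.
    target = llama("Rewrite each bracketed instance into a new object: " + tagged)
    return source, tagged, target
```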
[Video galleries: Single-Instance Editing (MIVE Dataset), Multi-Instance Editing (MIVE Dataset), Multi-Instance Editing (Video-in-the-Wild), and Partial Instance Editing (Video-in-the-Wild).]
Attention leakage examples are indicated by green arrows, while unfaithful editing examples are indicated by red arrows. The target instance captions are color-coded to match the colors of the masks.
[Each comparison grid shows, in order: Input, Masked Input, PRIMEdit (Ours), ControlVideo, FLATTEN, FreSCo, TokenFlow, RAVE, Ground-A-Video, and VideoGrain.]
@misc{teodoro2025primedit,
    title={PRIMEdit: Probability Redistribution for Instance-aware Multi-object Video Editing with Benchmark Dataset},
    author={Samuel Teodoro and Agus Gunawan and Soo Ye Kim and Jihyong Oh and Munchurl Kim},
    year={2025},
    eprint={2412.12877},
    archivePrefix={arXiv},
}