Recent AI-based video editing allows users to edit videos with simple text prompts, greatly simplifying the editing process. However, existing zero-shot video editing techniques focus primarily on global or single-object edits, which can lead to unintended changes in other parts of the video. When multiple objects require localized edits, existing methods face challenges such as unfaithful editing, editing leakage, and a lack of suitable evaluation datasets and metrics. To overcome these limitations, we propose MIVE, a zero-shot Multi-Instance Video Editing framework. MIVE is a general-purpose, mask-based framework that is not tailored to specific object categories (e.g., people). It introduces two key modules: (i) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and (ii) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing. Additionally, we present our new MIVE Dataset featuring diverse video scenarios and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user-study evaluations demonstrate that MIVE significantly outperforms recent state-of-the-art methods in editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing.
In this work, we present MIVE, a general-purpose, zero-shot Multi-Instance Video Editing framework that disentangles the multi-instance editing process to achieve faithful edits with reduced attention leakage. MIVE accomplishes this through two key modules: (i) Disentangled Multi-instance Sampling (DMS), which reduces editing leakage, and (ii) Instance-centric Probability Redistribution (IPR), which enhances editing localization and faithfulness.
In our DMS (shown below), each instance is edited independently via latent parallel sampling (blue box); the resulting denoised instance latents are then harmonized via noise parallel sampling (green box), preceded by latent fusion (yellow box) and re-inversion (orange box).
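To make the flow of one DMS denoising step concrete, here is a minimal PyTorch-style sketch. It is an illustration under stated assumptions, not the exact published algorithm: `eps_pred`, `ddim_step`, and `ddim_invert` are hypothetical callables standing in for a DDIM-style sampler, instance masks are assumed binary and non-overlapping, and the background is treated as an extra "instance" whose mask covers everything outside the object masks.

```python
import torch

def dms_step(z_t, masks, prompts, t, eps_pred, ddim_step, ddim_invert):
    """
    One DMS denoising timestep (schematic sketch, not the exact published algorithm).

    z_t:         fused latent at noise level t, shape (B, C, H, W) per frame
    masks:       list of binary masks, each (1, 1, H, W); masks[0] is the background
    prompts:     list of text prompts; prompts[0] is the background prompt (assumption)
    eps_pred:    eps_pred(z, prompt, t) -> predicted noise            (hypothetical)
    ddim_step:   ddim_step(z, eps, t)   -> latent at level t - 1      (hypothetical)
    ddim_invert: ddim_invert(z, eps, t) -> latent at level t + 1      (hypothetical)
    """
    # (i) Latent parallel sampling: denoise the latent separately per instance,
    #     each conditioned only on its own prompt, to avoid editing leakage.
    inst_latents = [ddim_step(z_t, eps_pred(z_t, p, t), t) for p in prompts]

    # (ii) Latent fusion: composite the per-instance results using their masks.
    z_fused = torch.zeros_like(z_t)
    for z_i, m_i in zip(inst_latents, masks):
        z_fused = z_fused + m_i * z_i

    # (iii) Re-inversion: bring the fused latent back up to noise level t so the
    #       composite can be re-sampled jointly.
    z_reinv = ddim_invert(z_fused, eps_pred(z_fused, "", t - 1), t - 1)

    # (iv) Noise parallel sampling: predict noise per instance on the shared
    #      re-inverted latent, fuse the noise maps with the masks, and take a
    #      single joint denoising step to harmonize instances and background.
    eps_fused = torch.zeros_like(z_t)
    for p, m_i in zip(prompts, masks):
        eps_fused = eps_fused + m_i * eps_pred(z_reinv, p, t)
    return ddim_step(z_reinv, eps_fused, t)
```

Repeating this step over all timesteps yields the final edited latent; the exact fusion and harmonization details of the actual method may differ from this simplified sketch.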
Since our DMS requires that edited objects appear within their masks, we propose IPR (shown below) to ensure that this condition is consistently met.
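The sketch below illustrates how such a redistribution of cross-attention probabilities could look. The function name, the `boost` hyperparameter, and the uniform redistribution rule are assumptions made for illustration; only the core idea of moving probability mass toward the instance's prompt tokens inside its mask comes from the description above.

```python
import torch

def instance_probability_redistribution(attn, mask, inst_token_idx, boost=0.4):
    """
    Sketch of instance-centric redistribution of cross-attention probabilities.

    attn:           (heads, HW, T) softmax'ed cross-attention map for one instance branch
    mask:           (HW,) binary instance mask over the spatial queries
    inst_token_idx: indices of the prompt tokens describing this instance
    boost:          fraction of non-instance probability mass handed to the
                    instance tokens (illustrative value, not from the paper)
    """
    heads, hw, num_tokens = attn.shape
    inst = torch.zeros(num_tokens, dtype=torch.bool, device=attn.device)
    inst[inst_token_idx] = True
    inside = mask.bool().view(1, hw, 1)  # broadcast over heads and tokens

    # Probability mass each query inside the mask currently spends on tokens
    # that do NOT describe this instance.
    non_inst_mass = (attn * (~inst)).sum(dim=-1, keepdim=True)   # (heads, HW, 1)
    num_inst_tokens = inst.sum().clamp(min=1)

    # Inside the mask: shrink non-instance tokens by `boost` and give the removed
    # mass to the instance tokens (spread uniformly here), so rows still sum to 1.
    redistributed = torch.where(
        inst,
        attn + boost * non_inst_mass / num_inst_tokens,
        attn * (1.0 - boost),
    )

    # Outside the mask, the attention probabilities are left untouched.
    return torch.where(inside, redistributed, attn)
```

In a full pipeline, a redistribution of this kind would be applied inside the cross-attention layers of each instance's sampling branch at every denoising step, steering the edit toward the masked region.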
Qualitative results (videos): Single-Instance Editing on the MIVE Dataset, Multi-Instance Editing on the MIVE Dataset, Multi-Instance Editing on Video-in-the-Wild, and Partial Instance Editing on Video-in-the-Wild.
Qualitative comparisons (videos): Input, Masked Input, and MIVE (Ours), compared against ControlVideo, FLATTEN, FreSCo, TokenFlow, RAVE, and Ground-A-Video.