InstructVEdit: A Holistic Approach for Instructional Video Editing

Chi Zhang¹, Chengjian Feng², Feng Yan², Qiming Zhang³, Mingjin Zhang¹, Yujie Zhong², Jing Zhang⁴, Lin Ma²
¹Xidian University, ²Meituan Inc., ³University of Sydney, ⁴Wuhan University

InstructVEdit edits realistic videos according to instructions.

Make it snow.
Make it sparkling crystal.
Turn the oranges into apples and the basket is made of metal.

Abstract

Video editing according to instructions is a highly challenging task because large-scale, high-quality paired data of source and edited videos is difficult to collect. This scarcity not only limits the availability of training data but also hinders the systematic exploration of model architectures and training strategies. While prior work has improved specific aspects of video editing (e.g., synthesizing a video dataset using image editing techniques or decomposed video editing training), a holistic framework addressing the above challenges remains underexplored.

In this study, we introduce InstructVEdit, a full-cycle instructional video editing approach that: (1) establishes a reliable dataset curation workflow to initialize training, (2) incorporates two model architectural improvements to enhance edit quality while preserving temporal consistency, and (3) proposes an iterative refinement strategy leveraging real-world data to enhance generalization and minimize train-test discrepancies.

Extensive experiments show that InstructVEdit achieves state-of-the-art performance in instruction-based video editing, demonstrating robust adaptability to diverse real-world scenarios. Code, models, and datasets will be released to facilitate further research.

Local Edit

Object Manipulation

Object manipulation example 1
Change the table to a floating cloud.
Object manipulation example 2
Change the kite into a hot air balloon.
Object manipulation example 3
Change the dog to a cat.
Object manipulation example 4
Change the samoyed to a golden retriever.
Object manipulation example 5
Change the dirt into a field of purple flowers.
Object manipulation example 6
Turn the background into a starry night.

Attribute Manipulation

Attribute manipulation example 1
Change the American flag to a rainbow flag.
Attribute manipulation example 2
Give the vase a metallic finish.
Attribute manipulation example 3
The waves are made of fire.
Attribute manipulation example 4
Adjust the sky to have a more vibrant orange hue.
Attribute manipulation example 5
Make roses rainbow colored.
Attribute manipulation example 6
For a more tranquil effect.

Global Edit

Global edit example 1
Transform the scene into a mystical forest setting.
Global edit example 2
Change the background to a futuristic cityscape with neon lights.
Global edit example 3
Turn it into a dreamy underwater world.
Global edit example 4
Make it cartoon.
Global edit example 5
Make it oil painting.
Global edit example 6
Transform the image into a hand-drawn pencil sketch.

Visual Comparison on TGVE dataset

Make the style Minecraft.

Panels: Original, Tune-a-Video, AnyV2V, TokenFlow, InsV2V, Ours

Make it snowy day.

Panels: Original, Tune-a-Video, AnyV2V, TokenFlow, InsV2V, Ours

Change the cat to be made of paper.

Panels: Original, Tune-a-Video, AnyV2V, TokenFlow, InsV2V, Ours