RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models

Supplementary Material


In this supplementary file, we provide the full videos of the results shown in the paper, as well as additional qualitative results. We also provide demo code for our method. Please refer to the corresponding sections below for more details.

Our Results

Here we include the complete videos of the images shown in Figure 1 (teaser) and Figure 6 (qualitative results), as well as additional video-text pairs. Our model is able to perform edits on a wide range of videos and text prompts:
- Input video (45 frames): "an ancient Egyptian pharaoh is typing", "a skeleton is typing", "a zombie", "a man wearing a glitter jacket is typing"
- Input video (27 frames): "a white cat", "a shiny silver robotic wolf, futuristic", "a dinosaur", "a bear"
- Input video (90 frames): "autumn background with maple leaves", "a penguin is swimming", "a stone is swimming", "a pokemon character is swimming"
- Input video (90 frames): "a rocket ship is preparing for the launch", "a crystal blue Swarovski tower", "a candle", "anime style"
- Input video (90 frames): "a jeep moving in the grassy field", "a spaceship is moving through the milky way", "Van Gogh style"
- Input video (8 frames): "a teddy bear is eating an apple", "a monkey is playing on the coast", "a golden retriever is eating a banana in the cornfield"
- Input video (99 frames): "a firefighter is stretching", "watercolor style", "a zombie is stretching"

Comparisons to Baselines - Figure 7

Here we put the complete videos of the images shown in Figure 7. We compare our method with previous approaches on videos they used from DAVIS (rows 1 and 2) and on a human video (row 3). Note that, for a fair comparison, we do not use any customized model (e.g., Realistic Vision v5.1) in the videos below; Stable Diffusion v1.5 is used for RAVE.
"Mysterious purple and blue hues dominate, with twinkling stars and a glowing moon in the backdrop": RAVE, RAVE w/o Shuffle, Tokenflow [1], FateZero [2], Rerender-A-Video [3], Text2Video-Zero [4], Pix2Video [5]

"a jeep moving at night": RAVE, RAVE w/o Shuffle, Tokenflow [1], FateZero [2], Rerender-A-Video [3], Text2Video-Zero [4], Pix2Video [5]

"a senior lady is running": RAVE, RAVE w/o Shuffle, Tokenflow [1], FateZero [2], Rerender-A-Video [3], Text2Video-Zero [4], Pix2Video [5]

Additional Comparisons to Baselines

Here we provide extra comparisons with additional baselines. Note that we conduct a direct qualitative comparison with previous approaches by acquiring videos directly from the corresponding project webpages. The first comparison involves the truck video from FLATTEN, and the second involves the stork video from Tokenflow.
"wooden trucks drive on a racetrack": RAVE, FLATTEN [6], Tokenflow [1], FateZero [2], Tune-A-Video [7], Text2Video-Zero [4], ControlVideo [8]

"an origami of stork": RAVE, Tokenflow [1], Rerender-A-Video [3], FateZero [2], Gen-1 [9], Text2Video-Zero [4], Tune-A-Video [7]

Extreme Shape Editing

Here, we provide examples of extreme shape editing on the car-turn video, transforming the car into various entities such as a train, a tractor, a black van, a firetruck, and a tank. These transformations require significant structural changes in the output, and our method handles such extreme shape edits adeptly.
- Input video (27 frames): "Switzerland SBB CFF FFS train", "a tractor", "a black van", "a firetruck", "a tank"

Additional Qualitative Results

Here we provide additional qualitative results with RAVE.
- Input video (144 frames): "whales are swimming", "banknotes are falling from the sky", "fire in the woods"
- Input video (36 frames): "Electric neon colors illuminate the scene, casting a futuristic, cyberpunk vibe", "Soft, blended colors and visible brushstrokes make the scene appear as if painted with watercolors", "The bear becomes a dark silhouette against a fiery sunset, with the horizon painted in oranges, reds, and purples"
- Input video (36 frames): "An intense, fiery sky with embers floating around, contrasting the cool water and highlighting the flamingos' grace amid nature's fury", "Mystical surroundings with magical creatures, sparkles on the water, and an aura of enchantment", "The flamingos in deep shadow, set against a radiant sunset with oranges, purples, and pinks"
- Input video (18 frames): "swarovski blue crystal swan", "crochet swan"
- Input video (8 frames): "swarovski blue crystal stones falling down sequentially", "crochet boxes, falling down sequentially"
- Input video (36 frames): "a black panther", "a pink dragon", "a lion"
- Input video (45 frames): "an astronaut is typing", "a medieval knight", "a man from avatar movie is typing", "a robot is typing"
- Input video (72 frames): "zombies are dancing", "a black person", "watercolor style"
- Input video (117 frames): "a red t-shirt", "a robot", "neon colors in cyberpunk style"

Comparison with Existing Attention Modules - Figure 2

Here we show the complete videos of the images shown in Figure 2. We compare our method with self attention only and with sparse-causal attention. While the frames generated with self attention only align with the text prompt in terms of motion and color style, they lack consistency because temporal context is neglected, as seen in the inconsistencies in the background and the car's bumper. Sparse-causal attention produces more consistent frames with reduced time complexity; however, its performance tends to decline in longer videos due to diminishing temporal awareness, as can be seen from the structural changes in the car. RAVE produces consistent frames with the correct motion and color style throughout the video.
"a red car, moving on the road, autumn, maple leaves": RAVE, Self Attention, Sparse-Causal Attention
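To make the contrast concrete, the attention patterns compared above can be sketched as follows. This is a minimal NumPy illustration of the sparse-causal pattern (function names and shapes are ours, not the paper's implementation): each frame's queries attend only to the keys and values of the first frame and the immediately preceding frame.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention; q: (b, tq, d), k/v: (b, tk, d)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def sparse_causal_attention(frames_q, frames_k, frames_v):
    """frames_*: (num_frames, tokens, dim). Frame i attends to frames 0 and i-1."""
    outputs = []
    for i, q in enumerate(frames_q):
        prev = max(i - 1, 0)
        # Keys/values are restricted to the first and the previous frame.
        k = np.concatenate([frames_k[0], frames_k[prev]], axis=0)
        v = np.concatenate([frames_v[0], frames_v[prev]], axis=0)
        outputs.append(attention(q[None], k[None], v[None])[0])
    return np.stack(outputs)
```

Self attention corresponds to each frame attending only to its own tokens; the sparse-causal variant keeps the key/value set at two frames regardless of video length, which is why its temporal awareness fades as videos grow.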

Consistency Across Grids - Figure 5

We present the editing results in three scenarios: grid only, grid with sparse-causal attention (Grid + SC), and RAVE. While the grid technique enables consistent editing within a grid, ensuring consistency across multiple grids remains a challenge. One could adapt well-known attention mechanisms, such as sparse-causal attention, to the grid structure: attention is shifted from the initial frame and the previous frame to the initial grid and the previous grid. However, this adaptation can still struggle to maintain consistency in longer videos. Our approach, RAVE, on the other hand, preserves consistency.
"a pink car in a snowy landscape, sunset lighting": RAVE, Grid, Grid + SC
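The grid mechanics described above can be sketched as follows. This is a minimal NumPy illustration under our own naming and shape assumptions (frame latents of shape (n, c, h, w), a 3x3 grid): frames are tiled into grids, the frame order is randomly shuffled before a denoising step, and the shuffle is undone afterwards, which is the core of the randomized noise shuffling in RAVE.

```python
import numpy as np

def frames_to_grids(latents, grid=3):
    """Tile per-frame latents (n, c, h, w) into n // grid**2 grid images."""
    n, c, h, w = latents.shape
    assert n % grid**2 == 0
    g = latents.reshape(n // grid**2, grid, grid, c, h, w)
    return g.transpose(0, 3, 1, 4, 2, 5).reshape(n // grid**2, c, grid * h, grid * w)

def grids_to_frames(grids, grid=3):
    """Inverse of frames_to_grids: split grid images back into frames."""
    m, c, H, W = grids.shape
    h, w = H // grid, W // grid
    g = grids.reshape(m, c, grid, h, grid, w)
    return g.transpose(0, 2, 4, 1, 3, 5).reshape(m * grid**2, c, h, w)

# One denoising step with random shuffling (illustrative):
rng = np.random.default_rng(0)
latents = rng.standard_normal((18, 4, 8, 8))  # 18 frames of 4-channel latents
perm = rng.permutation(len(latents))          # shuffle frames across grids
grids = frames_to_grids(latents[perm], grid=3)
# ... denoise `grids` with the image diffusion model here ...
frames = np.empty_like(latents)
frames[perm] = grids_to_frames(grids, grid=3)  # undo the shuffle
```

Because a fresh permutation is drawn at every denoising step, each frame shares a grid with different neighbors over time, which propagates style information across the whole video rather than only within one grid.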


We conduct an ablation study by separately ablating `shuffling', `DDIM inversion', and the ControlNet conditions (lineart, softedge, and depth, the last of which is used in RAVE) in our framework. Applying shuffling helps maintain global style consistency, and using DDIM inversion contributes to preserving a structure similar to that of the original video.
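For reference, a single deterministic DDIM inversion step (eta = 0) can be sketched as below; `ddim_invert_step` is an illustrative helper over the cumulative alpha-bar schedule, not our exact implementation, and `eps` stands for the diffusion model's noise prediction at the current timestep.

```python
import numpy as np

def ddim_invert_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step (eta = 0).

    Maps the latent at timestep t toward the next, noisier timestep using
    the noise prediction `eps` and the cumulative alpha-bar schedule.
    """
    # Clean latent implied by the current noise estimate.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise it to the next timestep.
    return np.sqrt(alpha_bar_next) * x0_pred + np.sqrt(1.0 - alpha_bar_next) * eps
```

Because the update is deterministic, applying it with the two alpha-bar values swapped (and the same noise prediction) exactly recovers the previous latent; this invertibility is what lets DDIM inversion preserve the structure of the original video.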

DDIM inversion and Shuffling Ablation - Figure 8

"dark chocolate cake": RAVE, w/o Shuffling, w/o DDIM Inversion

Condition Ablation - Figure 8

Furthermore, our approach proves adaptable to different controls, such as lineart and softedge, in place of the depth control used in RAVE. Even though there are style differences, these alternatives do not compromise the overall consistency.
"dark chocolate cake": RAVE (Depth Control), w/ Lineart Control, w/ Softedge Control

Realistic Vision vs Stable Diffusion

To demonstrate that the enhancement in video editing is not solely attributable to the use of a customized model, Realistic Vision v5.1, we further compare against the outcomes obtained with Stable Diffusion v1.5. Note that we employ Realistic Vision v5.1 to leverage its diverse editing capabilities.
"sandwiches are moving on the railroad": Stable Diffusion, Realistic Vision v5.1

"a white cat": Stable Diffusion, Realistic Vision v5.1

"watercolor style": Stable Diffusion, Realistic Vision v5.1

"a teddy bear is eating an apple": Stable Diffusion, Realistic Vision v5.1

Ebsynth [10]

Here we compare with Ebsynth, a keyframe propagation method, combined with the grid-without-shuffling approach. It is evident that significant changes occur in the structure of the car and the bear. In contrast, our approach demonstrates superior temporal structural consistency.
"Mysterious purple and blue hues dominate, with twinkling stars and a glowing moon in the backdrop": RAVE, Ebsynth
"a jeep moving at night": RAVE, Ebsynth

Examples of User Study

We formulate the metric as the frequency with which each method is chosen among the top two edits. Below, we provide two examples from our user study in response to Question 1, one selected as the best under this metric and the other not selected. We also provide the results of the user study (130 anonymous participants) for each question as histograms. Note that the colors of the titles correspond to the colors of the histograms.
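A minimal sketch of how such a top-two frequency can be computed (the helper name and data layout are illustrative, not our exact analysis script):

```python
from collections import Counter

def top2_frequency(responses, methods):
    """responses: one (first_choice, second_choice) pair per participant.

    Returns, for each method, the fraction of participants who placed it
    among their top two edits.
    """
    counts = Counter(choice for pair in responses for choice in pair)
    return {m: counts[m] / len(responses) for m in methods}

# Toy example with three participants:
responses = [("RAVE", "Tokenflow"), ("RAVE", "Rerender"), ("Tokenflow", "RAVE")]
freqs = top2_frequency(responses, ["RAVE", "Tokenflow", "Rerender"])
# RAVE appears in all three top-two picks, so its frequency is 1.0.
```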

Question 1 - General Editing: Regarding the input video, which specific edits would you consider to be among the top two most successful in general?

Question 2 - Temporal Consistency: Regarding the modified videos below, select the top 2 that have the smoothest motion.

Question 3 - Textual Alignment: Which video best aligns with the text below?

"a cheetah is moving" Q1 - General Editing Q2 - Temporal Consistency Q3 - Textual Alignment
Text2Video-Zero [4], RAVE, Tokenflow [1], Rerender [3]

"boats floating on the sea, villas on the coastal" Q1 - General Editing Q2 - Temporal Consistency Q3 - Textual Alignment
Text2Video-Zero [4], RAVE, Rerender [3], Tokenflow [1]


Extreme Shape Editing in Long Videos

While our method can handle extreme shape edits successfully, it encounters limitations as the video length increases. In particular, its ability to maintain the distinct shape of these extreme objects weakens, resulting in some flickering. It is noteworthy that in cases of extreme editing, such as the car-turn example, our method effectively manages shape transformations for up to 27 frames, beyond which the quality of the edit starts to degrade. This 27-frame threshold is significant, as it represents the upper limit of the editing capabilities of many competing methods, such as FLATTEN [6] (on an RTX 4090), for similar tasks.
"classic car": 27 Frames, 45 Frames, 45 Frames + Deflickering [11]

Fine-Detail Flickering

Certain extreme shape edits (e.g., transforming the wolf into 'a unicorn') require high-frequency details in the video (such as the unicorn's long, rich hair). In such cases, flickering may occur, since our model does not explicitly employ pixel-level video-deflickering methods. Furthermore, the unavoidable losses incurred during compression in the encoding/decoding steps of latent diffusion models, together with the choice of inversion method (DDIM inversion in our case), affect the quality of fine-detail reconstruction. Note that this is a common challenge in existing approaches as well.
Input Video "a unicorn"


[1] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.

[2] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15932–15942, 2023.

[3] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In ACM SIGGRAPH Asia Conference Proceedings, 2023.

[4] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15954–15964, 2023.

[5] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.

[6] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.

[7] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.

[8] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023.

[9] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.

[10] Ondrej Jamriska. Ebsynth: Fast example-based image synthesis and style transfer, 2018.

[11] Chenyang Lei, Xuanchi Ren, Zhaoxiang Zhang, and Qifeng Chen. Blind video deflickering by neural filtering with a flawed atlas. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10439–10448, 2023.