RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
Supplementary Material
In this supplementary file, we provide the full videos of the results shown in the paper, as well as additional qualitative results. We also provide demo code for our method. Please refer to the corresponding sections linked below for more details.
Our Results
Here we include the complete videos of the results shown in Figure 1 (teaser) and Figure 6 (qualitative results), as well as additional video-text pairs. Our method can perform editing across a wide range of
- video lengths, without requiring extra computational resources, e.g., 9 frames and 144 frames (Additional Qualitative Results),
- resolutions, e.g., 512x320 (row 1) and 512x512 (row 2),
- motions, e.g., exo-motion (row 2), ego-motion (row 4), ego-exo motion (Additional Comparisons to Baselines, row 1), occlusions (Examples from the User Study, first comparison, 'a cheetah is moving'), and multiple objects with appearances/disappearances (Additional Qualitative Results, row 1),
- objects, e.g., humans (rows 1 and 7), animals (rows 2 and 3), and vehicles (row 3),
- types of edits, e.g., local editing (row 1, column 5), visual style editing (row 5, column 4), background editing (monkey, row 6, column 3), shape/attribute editing (row 2), extreme shape editing (Extreme Shape Editing), and multiple editing types at once (background + shape editing, row 6, column 4),
- text detail levels, e.g., long and detailed descriptions (Additional Qualitative Results, rows 2 (bear) and 3 (flamingo)) and single-word prompts (row 1, column 4).
Input video - 45 Frames | "an ancient Egyptian pharaoh is typing" | "a skeleton is typing" | "a zombie" | "a man wearing a glitter jacket is typing"
Input video - 27 Frames | "a white cat" | "a shiny silver robotic wolf, futuristic" | "a dinosaur" | "a bear"
Input video - 90 Frames | "autumn background with maple leaves" | "a penguin is swimming" | "a stone is swimming" | "a pokemon character is swimming"
Input video - 90 Frames | "a rocket ship is preparing for the launch" | "a crystal blue Swarovski tower" | "a candle" | "anime style"
Input video - 90 Frames | "a jeep moving in the grassy field" | "a spaceship is moving throught the milky way" | "Van gogh style"
Input video - 8 Frames | "a teddy bear is eating an apple" | "a monkey is playing on the coast" | "a golden retriever is eating a banana in the cornfield"
Input video - 99 Frames | "a firefighter is stretching" | "watercolor style" | "a zombie is stretching"
Comparisons to Baselines - Figure 7
Here we provide the complete videos of the results shown in Figure 7. We compare our method on videos used by previous approaches from DAVIS (rows 1 and 2) and on a human video (row 3). We compare with:
- RAVE w/o Shuffle: our method without the shuffling step.
- Tokenflow ([1])
- FateZero ([2])
- Rerender-A-Video ([3])
- Text2Video-Zero ([4])
- Pix2Video ([5])
Note that, for a fair comparison, we do not use any customized model (Realistic Vision v5.1) in the videos below; Stable Diffusion v1.5 is used for RAVE.
"Mysterious purple and blue hues dominate, with twinkling stars and a glowing moon in the backdrop" |
RAVE |
RAVE w/o Shuffle |
Tokenflow ([1]) |
|
|
|
|
FateZero ([2]) |
Rerender-A-Video ([3]) |
Text2Video-Zero ([4]) |
Pix2Video ([5]) |
|
|
|
|
"a jeep moving at night" |
RAVE |
RAVE w/o Shuffle |
Tokenflow ([1]) |
|
|
|
|
FateZero ([2]) |
Rerender-A-Video ([3]) |
Text2Video-Zero ([4]) |
Pix2Video ([5]) |
|
|
|
|
"a senior lady is running" |
RAVE |
RAVE w/o Shuffle |
Tokenflow ([1]) |
|
|
|
|
FateZero ([2]) |
Rerender-A-Video ([3]) |
Text2Video-Zero ([4]) |
Pix2Video ([5]) |
|
|
|
|
Additional Comparisons to Baselines
Here we provide comparisons with additional baselines:
- FLATTEN ([6])
- Tune-A-Video ([7])
- ControlVideo ([8])
- Gen-1 ([9])
Note that we conduct a direct qualitative comparison with previous approaches by acquiring their result videos from the corresponding project webpages. The first comparison involves the trucks video from FLATTEN, and the second involves the stork video from Tokenflow.
"wooden trucks drive on a racetrack" |
RAVE |
FLATTEN ([6]) |
Tokenflow ([1]) |
|
|
|
|
FateZero ([2]) |
Tune-A-Video ([7]) |
Text2Video-Zero ([4]) |
ControlVideo ([8]) |
|
|
|
|
"an origami of stork" |
RAVE |
Tokenflow ([1]) |
Rerender-A-Video ([3]) |
|
|
|
|
FateZero ([2]) |
Gen-1 ([9]) |
Text2Video-Zero ([4]) |
Tune-A-Video ([7]) |
|
|
|
|
Extreme Shape Editing
Here, we provide examples of extreme shape editing on the car-turn videos, transforming the car into various entities such as
a train, a tractor, a black van, a firetruck, and a tank. These transformations require significant changes in the output.
Our method adeptly handles such extreme shape editing.
Input Video - 27 Frames | "Switzerland SBB CFF FFS train" | "a tractor"
"a black van" | "a firetruck" | "a tank"
Additional Qualitative Results
Here we provide additional qualitative results with RAVE.
Input Video - 144 Frames | "whales are swimming" | "banknotes are falling from the sky" | "fire in the woods"
Input Video - 36 Frames | "Electric neon colors illuminate the scene, casting a futuristic, cyberpunk vibe" | "Soft, blended colors and visible brushstrokes make the scene appear as if painted with watercolors" | "The bear becomes a dark silhouette against a fiery sunset, with the horizon painted in oranges, reds, and purples"
Input Video - 36 Frames | "An intense, fiery sky with embers floating around, contrasting the cool water and highlighting the flamingos' grace amid nature's fury" | "Mystical surroundings with magical creatures, sparkles on the water, and an aura of enchantment" | "The flamingos in deep shadow, set against a radiant sunset with oranges, purples, and pinks"
Input Video - 18 Frames | "swarovski blue crystal swan" | "crochet swan"
Input Video - 8 Frames | "swarovski blue crystal stones falling down sequentially" | "crochet boxes, falling down sequentially"
Input Video - 36 Frames | "a black panther" | "a pink dragon" | "a lion"
Input Video - 45 Frames | "an astronout is typing" | "a medieval knight" | "a man from avatar movie is typing" | "a robot is typing"
Input Video - 72 Frames | "zombies are dancing" | "a black person" | "watercolor style"
Input Video - 117 Frames | "a red tshirt" | "a robot" | "neon colors in cypberpunk style"
Comparison with Existing Attention Modules - Figure 2
Here we show the complete videos of the images shown in Figure 2. We compare our method with:
- Self Attention only
- Sparse-Causal Attention
With self attention only, the generated frames align with the text prompt in terms of motion and color style, but they lack consistency because temporal context is neglected, as seen in the inconsistencies in the background and in the car's bumper. Sparse-causal attention produces more consistent frames at reduced time complexity; however, its performance tends to decline for longer videos due to diminishing temporal awareness, as can be seen from the structural changes in the car. RAVE produces consistent frames with the correct motion and color style throughout the video.
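To make the distinction concrete, the sketch below contrasts per-frame self attention with sparse-causal attention, in which each frame's queries attend to keys and values gathered from the first and the previous frame. This is a minimal illustration under our own simplifying assumptions, not the attention code of any of the compared methods; tensor shapes and function names are purely illustrative.

```python
# Minimal sketch (illustrative only): per-frame self attention vs. sparse-causal
# attention. Sparse-causal attention injects temporal context by letting each frame
# attend to keys/values of the first and the previous frame.
import torch


def attention(q, k, v):
    # q: (tokens_q, dim), k/v: (tokens_kv, dim); scaled dot-product attention.
    scale = q.shape[-1] ** -0.5
    weights = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return weights @ v


def self_attention_only(qs, ks, vs):
    # Each frame attends only to its own tokens -> no temporal context.
    return [attention(q, k, v) for q, k, v in zip(qs, ks, vs)]


def sparse_causal_attention(qs, ks, vs):
    # Each frame attends to the keys/values of the first and the previous frame.
    outputs = []
    for i, q in enumerate(qs):
        prev = max(i - 1, 0)
        k = torch.cat([ks[0], ks[prev]], dim=0)
        v = torch.cat([vs[0], vs[prev]], dim=0)
        outputs.append(attention(q, k, v))
    return outputs


if __name__ == "__main__":
    n_frames, tokens, dim = 8, 64, 32
    qs = [torch.randn(tokens, dim) for _ in range(n_frames)]
    ks = [torch.randn(tokens, dim) for _ in range(n_frames)]
    vs = [torch.randn(tokens, dim) for _ in range(n_frames)]
    print(sparse_causal_attention(qs, ks, vs)[0].shape)  # torch.Size([64, 32])
```

In a real denoising UNet, the queries, keys, and values come from the projections inside the model's attention layers; random tensors stand in for them here.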
"a red car, moving on the road, autumn, maple leaves" |
RAVE |
Self Attention |
Sparse-Causal Attention |
|
|
|
|
|
Consistency Across Grids - Figure 5
We present the editing results in three scenarios:
- Grid: processing grids independently,
- Grid + SC: adapting sparse-causal (SC) attention to grids,
- RAVE: our full approach.
While the grid technique enables consistent editing within a grid, ensuring consistency across multiple grids remains a challenge. One could adapt well-known attention mechanisms, such as sparse-causal attention, to the grid structure; in this adaptation, attention is shifted from the initial frame and the previous frame to the initial grid and the previous grid. However, this approach still struggles to maintain consistency for longer videos. RAVE, on the other hand, preserves consistency across grids.
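For intuition, the sketch below illustrates the grid-and-shuffle idea in a simplified form: frames are tiled into grid images that are processed jointly, and the frame-to-grid assignment is randomly permuted so that each frame shares a grid with different frames across denoising steps. This is a minimal sketch under our own assumptions (grid size, helper names, and the identity denoising step are placeholders), not the released RAVE code.

```python
# Minimal sketch (illustrative only) of the grid-and-shuffle idea: frames are tiled
# into grid images, the grids are denoised jointly, and the frame-to-grid assignment
# is randomly permuted so that frames are mixed across grids.
import torch


def frames_to_grids(frames, grid=3):
    # frames: (N, C, H, W) with N divisible by grid*grid.
    n, c, h, w = frames.shape
    per = grid * grid
    grids = []
    for start in range(0, n, per):
        chunk = frames[start:start + per]
        rows = [torch.cat(list(chunk[r * grid:(r + 1) * grid]), dim=-1) for r in range(grid)]
        grids.append(torch.cat(rows, dim=-2))  # (C, grid*H, grid*W)
    return torch.stack(grids)


def grids_to_frames(grids, grid=3):
    # Inverse of frames_to_grids.
    frames = []
    for g in grids:
        for row in torch.chunk(g, grid, dim=-2):
            frames.extend(torch.chunk(row, grid, dim=-1))
    return torch.stack(frames)


def shuffled_grid_pass(latents, denoise_step, grid=3):
    # latents: per-frame latents (N, C, H, W); denoise_step operates on grid images.
    perm = torch.randperm(latents.shape[0])        # random shuffle across grids
    grids = frames_to_grids(latents[perm], grid)
    grids = denoise_step(grids)                    # one joint denoising step on the grids
    out = torch.empty_like(latents)
    out[perm] = grids_to_frames(grids, grid)       # undo the permutation
    return out


if __name__ == "__main__":
    latents = torch.randn(18, 4, 64, 64)           # 18 frames -> two 3x3 grids
    out = shuffled_grid_pass(latents, denoise_step=lambda g: g)
    print(out.shape)                               # torch.Size([18, 4, 64, 64])
```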
"a pink car in a snowy landscape, sunset lighting" |
RAVE |
Grid |
Grid + SC |
|
|
|
|
|
Ablations
We conduct an ablation study by separately ablating 'shuffling', 'DDIM inversion', and the ControlNet conditions (lineart, softedge, and depth (RAVE)) in our framework. Applying shuffling helps maintain global style consistency. Additionally, using DDIM inversion contributes to preserving a structure similar to that of the original video.
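For reference, the sketch below shows the standard deterministic DDIM inversion procedure that this ablation removes: the DDIM sampling update is run in reverse so that a clean latent is mapped to a noisy latent that approximately reconstructs the input when denoised with the same model. The noise predictor and schedule here are placeholders rather than our pipeline code.

```python
# Minimal sketch (illustrative only) of deterministic DDIM inversion.
import torch


@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    # alphas_cumprod: (T,) cumulative products of alphas; timesteps: increasing steps.
    x = x0
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)                                     # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps   # step toward more noise
    return x


if __name__ == "__main__":
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    timesteps = torch.arange(0, T, 20)                            # 50 inversion steps
    dummy_eps = lambda x, t: torch.zeros_like(x)                  # placeholder noise predictor
    print(ddim_invert(torch.randn(1, 4, 64, 64), dummy_eps, alphas_cumprod, timesteps).shape)
```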
DDIM Inversion and Shuffling Ablation - Figure 8
"dark chocolate cake" |
RAVE |
w/o Shuffling |
w/o DDIM Inversion |
|
|
|
|
|
Condition Ablation - Figure 8
Furthermore, our approach proves adaptable to different controls, such as lineart and softedge, in addition to the depth control used in RAVE. Even though there are style differences, these changes do not compromise the overall consistency.
"dark chocolate cake" |
RAVE (Depth) |
w/ Lineart |
w/ Softedge |
|
|
|
|
|
Depth Control |
Lineart Control |
Softedge Control |
|
|
|
|
|
Realistic Vision vs Stable Diffusion
To demonstrate that the improvement in video editing is not solely attributable to the use of a customized model, Realistic Vision V5.1, we further compare against the outcomes obtained with Stable Diffusion v1.5. Note that we employ Realistic Vision V5.1 to leverage its diverse editing capabilities.
"sandwiches are moving on the railroad" |
Stable Diffusion |
Realistic Vision v5.1 |
|
|
|
|
"a white cat" |
Stable Diffusion |
Realistic Vision v5.1 |
|
|
|
|
"watercolor style" |
Stable Diffusion |
Realistic Vision v5.1 |
|
|
|
|
"a teddy bear is eating an apple" |
Stable Diffusion |
Realistic Vision v5.1 |
|
|
|
|
Ebsynth ([10])
Here we compare with Ebsynth, a keyframe propagation method, combined with the grid-without-shuffling approach. Significant changes occur in the structure of the car and the bear; in contrast, our approach handles temporal structural consistency considerably better.
"Mysterious purple and blue hues dominate, with twinkling stars and a glowing moon in the backdrop" |
RAVE |
Ebsynth |
|
|
|
"a jeep moving at night" |
RAVE |
Ebsynth |
|
|
|
Examples from the User Study
Note that our metric is the frequency with which each method is chosen among the top two edits. Below, we provide two examples from our user study in response to Question 1 under this metric: one selected as the best and the other not. We also provide the results of the user study (130 anonymous participants) for each question as histograms; the colors of the titles correspond to the colors of the histograms.
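To make the top-two frequency metric concrete, the sketch below tallies how often each method appears among a participant's top-two picks; the method names and responses are hypothetical.

```python
# Minimal sketch (hypothetical data) of the top-two frequency metric: for each
# method, count how often it appears among a participant's top-two choices.
from collections import Counter


def top_two_frequency(responses):
    # responses: list of (first_choice, second_choice) per participant.
    counts = Counter()
    for first, second in responses:
        counts[first] += 1
        counts[second] += 1
    total = len(responses)
    return {method: count / total for method, count in counts.items()}


if __name__ == "__main__":
    responses = [("RAVE", "Tokenflow"), ("RAVE", "Rerender"), ("Text2Video-Zero", "RAVE")]
    print(top_two_frequency(responses))  # e.g. {'RAVE': 1.0, 'Tokenflow': 0.33, ...}
```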
Question 1 - General Editing: Regarding the input video, which specific edits would you consider to be among the top two most successful in general?
Question 2 - Temporal Consistency: Regarding the modified videos below, select the top 2 that have the smoothest motion.
Question 3 - Textual Alignment: Which video best aligns with the text below?
"a cheetah is moving" |
Q1 - General Editing |
Q2 - Temporal Consistency |
Q3 - Textual Alignment |
|
|
|
|
Text2Video-Zero ([4]) |
RAVE |
Tokenflow ([1]) |
Rerender ([3]) |
|
|
|
|
|
"boats floating on the sea, villas on the coastal" |
Q1 - General Editing |
Q2 - Temporal Consistency |
Q3 - Textual Alignment |
|
|
|
|
Text2Video-Zero ([4]) |
RAVE |
Rerender ([3]) |
Tokenflow ([1]) |
|
|
|
|
|
Limitations
Extreme shape editing in long videos
While our method can handle extreme shape edits successfully, it encounters limitations as the video length increases: its ability to maintain the distinct shape of the edited object weakens, resulting in some flickering. Notably, in cases of extreme editing such as the car-turn example, our method manages shape transformations effectively for up to 27 frames, beyond which the quality of the edit starts to degrade. This 27-frame threshold is significant, as it represents the upper limit of the editing capabilities of many competing methods, such as FLATTEN ([6]) (on an RTX 4090), for similar tasks.
"classic car" |
27 Frames |
45 Frames |
45 Frames + Deflickering ([11]) |
|
|
|
|
|
Fine details flickering
Certain extreme shape edits (e.g., transforming the wolf into 'a unicorn') require high-frequency detail in the video (such as the unicorn's long, rich hair). In such cases, flickering may occur, as our model does not explicitly employ pixel-level video deflickering methods. Furthermore, the unavoidable losses incurred by compression in the encoding/decoding steps of latent diffusion models, together with the choice of inversion method (DDIM inversion in our case), affect the quality of fine-detail reconstruction. Note that this is a common challenge for existing approaches as well.
[1] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[2] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15932–15942, 2023.
[3] Shuai Yang, Yifan Zhou, Ziwei Liu, and Chen Change Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In ACM SIGGRAPH Asia Conference Proceedings, 2023.
[4] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15954–15964, 2023.
[5] Duygu Ceylan, Chun-Hao P Huang, and Niloy J Mitra. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23206–23217, 2023.
[6] Yuren Cong, Mengmeng Xu, Christian Simon, Shoufa Chen, Jiawei Ren, Yanping Xie, Juan-Manuel Perez-Rua, Bodo Rosenhahn, Tao Xiang, and Sen He. FLATTEN: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922, 2023.
[7] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023.
[8] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077, 2023.
[9] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
[10] Ondrej Jamriska. Ebsynth: Fast example-based image synthesis and style transfer, 2018.
[11] Chenyang Lei, Xuanchi Ren, Zhaoxiang Zhang, and Qifeng Chen. Blind video deflickering by neural filtering with a flawed atlas. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10439–10448, 2023.