Rerender A Video

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu and Chen Change Loy
in SIGGRAPH Asia 2023 Conference Proceedings
Project Page | Paper | Supplementary Video | Input Data and Video Results

Abstract: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

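The two-part design described in the abstract can be summarized as a short schematic. The sketch below is illustrative only: the function and argument names (rerender_video, translate_keyframe, propagate, keyframe_interval) are hypothetical placeholders, not this repository's actual API.

```python
# Schematic of the two-stage pipeline described in the abstract.
# All names here are illustrative placeholders, not the repo's API.
from typing import Callable, List, Sequence

def rerender_video(
    frames: Sequence,              # input video frames (e.g., PIL images or arrays)
    prompt: str,                   # text prompt guiding the translation
    translate_keyframe: Callable,  # Stage 1: adapted diffusion model for one key frame
    propagate: Callable,           # Stage 2: temporal-aware patch matching + blending
    keyframe_interval: int = 10,   # sample every N-th frame as a key frame
) -> List:
    # Stage 1: key frame translation with hierarchical cross-frame constraints.
    # Each key frame is conditioned on the first (anchor) and the previously
    # translated key frame to keep shapes, textures and colors coherent.
    keyframes = list(frames[::keyframe_interval])
    translated: List = []
    for frame in keyframes:
        anchor = translated[0] if translated else None
        previous = translated[-1] if translated else None
        translated.append(translate_keyframe(frame, prompt, anchor=anchor, previous=previous))

    # Stage 2: full video translation. The translated key frames are propagated
    # to the in-between frames; no re-training or optimization is involved.
    return propagate(frames, translated, keyframe_interval)
```

The point of the skeleton is the data flow: key frames are translated first under cross-frame constraints, and a cheaper propagation step then fills in the remaining frames.
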
Features:

  • Temporal consistency: cross-frame constraints for low-level temporal consistency.

  • Zero-shot: no training or fine-tuning required.

  • Flexibility: compatible with off-the-shelf models (e.g., ControlNet, LoRA) for customized translation, as sketched below.
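
As an illustration of this flexibility, the following is a minimal sketch of how an off-the-shelf ControlNet and a LoRA plug into a standard Stable Diffusion pipeline (via the diffusers library) to translate a single key frame. It is not this repository's own entry point; the model IDs, file paths, and prompt are placeholders.

```python
# Minimal sketch using the standard diffusers API (not this repo's entry point):
# an off-the-shelf ControlNet provides spatial guidance and a LoRA customizes
# the subject/style for one key frame. Paths and prompt are placeholders.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler

# Spatial guidance: Canny edges extracted from one input video frame.
frame = np.array(Image.open("path/to/keyframe.png").convert("RGB"))  # placeholder path
edges = cv2.Canny(frame, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

# Optional: customize a specific subject or style with LoRA weights (placeholder file).
pipe.load_lora_weights("path/to/lora_weights.safetensors")

result = pipe(
    "a cartoon-style portrait, best quality",  # text prompt (placeholder)
    image=control_image,                       # ControlNet conditioning image
    num_inference_steps=20,
).images[0]
result.save("translated_keyframe.png")
```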
