Search: [video2video] - • Curated knowledge about art and AI •

RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Modelshttps://rave-video.github.io/

RAVE introduces a pioneering zero-shot video editing approach, leveraging pre-trained text-to-image diffusion models to achieve high-quality results with advanced user control. Addressing the lag in visual quality and user manipulation seen in video editing models, RAVE employs a novel noise shuffling strategy, exploiting spatio-temporal interactions between frames for faster and temporally consistent video production. Notably efficient in memory usage, RAVE handles longer videos while accommodating a broad spectrum of edits, from local attribute adjustments to shape transformations. The method's versatility is demonstrated through a comprehensive video evaluation dataset spanning diverse scenes, including object-focused, human activities, and dynamic scenarios. Rigorous quantitative and qualitative experiments underscore RAVE's superior performance in diverse video editing scenarios compared to existing methods, marking a substantial advancement in the realm of video editing with diffusion models.

rave

TokenFlow: Consistent Diffusion Features for Consistent Video Editinghttps://diffusion-tokenflow.github.io/

TokenFlow is a framework for text-driven video editing using a text-to-image diffusion model. The framework aims to generate high-quality videos that adhere to a target text while preserving the spatial layout and dynamics of the input video. The method leverages the observation that consistency in the edited video can be achieved by enforcing consistency in the diffusion feature space. This is done by propagating edited features across frames based on inter-frame correspondences. The framework does not require training or fine-tuning and can be used with any text-to-image editing method. Experimental results demonstrate state-of-the-art video editing on real-world videos. The method involves inverting input video frames, extracting tokens, and using nearest-neighbor search for feature correspondences. Denoising is performed by replacing generated tokens with propagated tokens from the original video.