AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data

Fu-Yun Wang¹ Zhaoyang Huang^{2 📮} Weikang Bian¹ Xiaoyu Shi¹ Keqiang Sun¹ Guanglu Song⁴ Yu Liu³ Hongsheng Li^1📮

{fywang@link, hsli.ee}@cuhk.edu.hk zhaoyanghuang@avolutionai.com
¹MMLab, CUHK ²Avolution AI ³Shanghai AI Lab ⁴Sensetime Research
arXiv Code Project Page Hugging Face Demo 🤗 Pretrained Models (47k downloads for a month)

Some of the best animations generated with AnimateLCM in 4 steps!

Abstract

The video diffusion model has been gaining increasing attention for its ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes it computationally intensive and time-consuming, thus limiting its applications. Inspired by the Consistency Model~(CM) that distills pretrained image diffusion models to accelerate the sampling with minimal steps and its successful extension Latent Consistency Model~(LCM) on conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves the training efficiency and enhances the generation visual quality. Additionally, to enable the combination of plug-and-play adapters in the stable diffusion community to achieve various functions~(e.g., ControlNet for controllable generation), we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results.

Side-By-Side Comparison

Animations generated by AnimateLCM and alternative methods in 4 steps. AnimateLCM is significantly better.

DDIM

DPM

DPM++

AnimateLCM

Prompt: "Photo of a serene Buddhist temple in autumn, golden leaves, peaceful and spiritual, architectural beauty."

Prompt: "Photo of a dramatic cliffside lighthouse in a storm, waves crashing, symbol of guidance and resilience."

Prompt: "Photo of a jazz musician playing the saxophone in a smoky club, intimate and moody, capturing the soul of jazz."

Prompt: "Photo of a colorful hot air balloon festival at sunset, vibrant against the sky, sense of adventure and freedom."

Prompt: "Close-up portrait of a wise old man with a white beard, deep wrinkles, eyes that tell a story, timeless character."

Prompt: "Artistic photo of ballet shoes on a wooden floor, spotlight and shadows, capturing the hard work behind the art of dance."

Prompt: "Portrait of a firefighter in action, intense gaze, amid smoke and flames, heroism and bravery."

Extension for efficient video-to-video stylization

The base length of AnimateLCM is 16 (2 seconds), which is in line with most mainstream video generation models. AnimateLCM can be applied for longer video generation or video-to-video stylization in a zero-shot manner . The quality degrades slightly since it is never really trained on such long videos.

Prompt: "Green alien, red eyes." You can find the source video from X.

Prompt: "Cyberpunk, neon lights, turtle." You can find the source video from X.

Extension for efficient longer video generation

The base length of AnimateLCM is 16 (2 seconds), which is in line with most mainstream video generation models. AnimateLCM can be applied for longer video generation or video-to-video stylization in a zero-shot manner . The quality degrades slightly since it is never really trained on such long videos.

Click to Play the Animations.

The following videos are 4 times longer than the base length.