Wink Pings

NVIDIA LongLive: Real-time Interactive Long Video Generation, 240 Seconds on a Single H100

NVIDIA and its partners have released LongLive, a text-to-video system that moves past the 5-10 second clip length typical of current models. It generates up to 240 seconds of smooth video on a single H100 GPU and supports switching prompts mid-generation while maintaining visual continuity. Key techniques include KV recaching, streaming long tuning, and short-window attention.

NVIDIA and its collaborators have just unveiled LongLive, a text-to-video system aimed at two long-standing problems: long-form generation and interactive control.

Current models can typically only output 5 to 10 second clips, but LongLive can handle videos up to 240 seconds long on a single H100 GPU, and it maintains smooth visuals and responsiveness even when you switch prompts mid-generation.

It combines several key technologies:

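- **KV Recaching**: Refreshes the cached attention states when the prompt changes, enabling seamless transitions between prompts

- **Streaming Long Tuning**: Trains the model on long sequences so that training matches long-video inference

- **Short-Window Attention + Frame Sink**: Balances speed and context by attending to a short recent window plus a few early frames that stay visible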
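To make the last item concrete, here is a minimal sketch of how a short attention window combined with a few always-visible early frames could decide which frames a new frame attends to. It is a toy illustration, not LongLive's implementation: the function names, the `window` and `num_sink` parameters, and the specific values are all assumptions chosen for the example.

```python
from typing import List


def visible_frames(query_frame: int, window: int, num_sink: int) -> List[int]:
    """Return the frame indices that `query_frame` may attend to.

    Hypothetical rule: a frame sees the first `num_sink` frames (kept
    permanently as long-range anchors) plus the most recent `window`
    frames, itself included. Attention is causal, so no future frames.
    """
    if query_frame < 0 or window < 1 or num_sink < 0:
        raise ValueError("query_frame >= 0, window >= 1, num_sink >= 0 required")
    # Anchor frames: never include anything later than the query frame.
    sink = range(min(num_sink, query_frame + 1))
    # Recent frames: a sliding window that ends at the query frame.
    recent = range(max(0, query_frame - window + 1), query_frame + 1)
    return sorted(set(sink) | set(recent))


def max_cache_frames(total_frames: int, window: int, num_sink: int) -> int:
    """Largest number of frames that ever needs to stay in the KV cache."""
    return max(
        len(visible_frames(f, window, num_sink)) for f in range(total_frames)
    )


if __name__ == "__main__":
    print(visible_frames(10, window=3, num_sink=2))
    print(max_cache_frames(1000, window=3, num_sink=2))
```

Under this rule the cache never holds more than `window + num_sink` frames, however long the video runs. That bounded cost is what lets per-frame generation time stay roughly constant, while the anchor frames keep some long-range context available.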

Benchmark tests show that while baseline models achieve less than 1 frame per second, LongLive can deliver over 20 frames per second while maintaining high-quality output.
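A quick back-of-the-envelope calculation shows why the gap between 1 and 20 frames per second matters for interactive use. The playback rate of 16 frames per second below is an assumption made for illustration only; the article does not state the output frame rate, and the helper name is hypothetical.

```python
def generation_seconds(
    video_seconds: float, playback_fps: float, generation_fps: float
) -> float:
    """Wall-clock seconds needed to generate a video of the given length."""
    if min(video_seconds, playback_fps, generation_fps) <= 0:
        raise ValueError("all arguments must be positive")
    total_frames = video_seconds * playback_fps  # frames to produce
    return total_frames / generation_fps


if __name__ == "__main__":
    ASSUMED_PLAYBACK_FPS = 16  # assumption, not taken from the article
    for gen_fps in (1, 20):
        secs = generation_seconds(240, ASSUMED_PLAYBACK_FPS, gen_fps)
        print(f"{gen_fps} fps generation: {secs:.0f} s for a 240 s video")
```

With these assumed numbers, a generator running at 20 frames per second finishes a 240-second video in 192 seconds, faster than the video plays back, while one running at 1 frame per second needs 3,840 seconds (over an hour). Generation is real-time whenever the generation rate is at least the playback rate.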

Paper link: https://arxiv.org/abs/2509.22622

HuggingFace model: https://huggingface.co/Efficient-Large-Model/LongLive-1.3B

Video demonstration: https://youtu.be/caDE6f54pvA

Published: 2025-09-29 22:03