Wink Pings

Meituan Open Sources LongCat-Video: One Model for Text-to-Video, Image-to-Video, and Video Continuation

Meituan's LongCat team has open-sourced LongCat-Video, a 13.6B-parameter model that reaches open-source SOTA levels across text-to-video, image-to-video, and video continuation, and can generate several minutes of high-quality video without noticeable color shift.

Today, I saw that the Meituan LongCat team open-sourced their video generation model, LongCat-Video.

This model is quite interesting: it handles three tasks, text-to-video, image-to-video, and video continuation, with a single model. It has 13.6 billion parameters, is built on the DiT architecture, and claims to reach SOTA levels among open-source video generation models.

![Model Evaluation Comparison](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Fmedia%2FG4Gt7gTWwAE3Ur1%3Fformat%3Djpg%26name%3Dlarge)

From the evaluation results, LongCat-Video can already compete with leading models like Wan 2.2 in some dimensions. It performs well on text alignment and visual quality, though there's still room for improvement in motion quality.

Technically, it uses a coarse-to-fine (C2F) generation pipeline together with block-sparse attention, and can produce 720p/30fps video in just a few minutes. Most importantly, they've addressed the color-shift problem that plagues long video generation, so output quality stays stable over several minutes.
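The release doesn't spell out LongCat-Video's exact attention pattern, but the general block-sparse idea is simple: tokens are grouped into blocks, and each token only attends to tokens in nearby blocks, cutting the quadratic attention cost. A minimal NumPy sketch of that idea (block sizes and window here are illustrative, not the model's actual settings):

```python
import numpy as np

def block_sparse_mask(seq_len, block_size, window):
    """Boolean mask: token i may attend to token j only if their
    blocks are at most `window` blocks apart."""
    blk = np.arange(seq_len) // block_size
    return np.abs(blk[:, None] - blk[None, :]) <= window

def block_sparse_attention(q, k, v, block_size=4, window=1):
    """Scaled dot-product attention restricted by the block-sparse mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = block_sparse_mask(len(q), block_size, window)
    scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `window=0`, each block attends only to itself, so for long sequences the cost scales with the block size rather than the full sequence length.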

Here are a few video examples:

![City Night Scene Video](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Famplify_video_thumb%2F1982076252665126912%2Fimg%2F4hbFcJRqr4aeVQ5J.jpg)

![Ballet Dancer Video](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Famplify_video_thumb%2F1982076636137697280%2Fimg%2FfnmKSMsBYm4gg2Rj.jpg)

![Little Girl Running on the Beach](https://wink.run/image?url=https%3A%2F%2Fpbs.twimg.com%2Famplify_video_thumb%2F1982083844586057728%2Fimg%2FhXVpBZSLmcTdhbJr.jpg)

The video continuation feature is quite practical. For example, if you have a 10-second video and want the AI to extend it, the model can understand the existing content and keep generating while maintaining continuity.
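Conceptually, long continuation like this is autoregressive: the model repeatedly conditions on the most recent frames to generate the next chunk. A minimal sketch of that loop, where `generate_chunk` is a hypothetical stand-in for the actual model call (the real LongCat-Video API will differ):

```python
def continue_video(frames, generate_chunk, context_len=16, chunk_len=8, target_len=48):
    """Autoregressively extend `frames` to `target_len` frames.

    `generate_chunk(context, n)` is a placeholder for the model:
    given the trailing `context` frames, it returns `n` new frames.
    """
    out = list(frames)
    while len(out) < target_len:
        context = out[-context_len:]       # condition on the most recent frames
        out.extend(generate_chunk(context, chunk_len))
    return out[:target_len]
```

Keeping the conditioning window fixed is what makes minute-scale generation feasible; the challenge the team highlights is preventing drift (like color shift) from accumulating across chunks.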

Someone commented hoping for a lightweight version that runs on consumer-grade GPUs, and that suggestion hits a real pain point for the open-source community: at 13.6B parameters, the hardware requirements are high.

Project links:

- GitHub: https://github.com/meituan-longcat/LongCat-Video

- Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Video

- Project Home: https://meituan-longcat.github.io/LongCat-Video/

Open-source video models have been arriving in a rush lately, from Sora alternatives to LongCat-Video, and progress in this field is faster than expected. It's interesting to see Meituan, a company best known for food delivery, venturing into this space.

Published: 2025-10-25 21:57