Long-Form Video Generation Models with Reference Image Support
Generating a 10-second clip is easy now. Every major model does it. The real question is: can you generate 5 or 10 minutes of coherent video where a character looks the same in minute one and minute eight? Where the scene holds together across thousands of frames?
That's the hard problem. And it's where things are shifting fast. This guide covers every model we've found that either generates long video natively or supports the workflows needed to build long-form content with consistent characters through reference images. We split them into three tiers: models that generate minutes-long video directly, models with strong reference image support that you extend through continuation, and open-source options you can run yourself.
Tier 1: Native Long-Form Generation (Minutes+)
These models generate video measured in minutes, not seconds. They're built from the ground up for temporal consistency over long sequences.
LongCat Video
Meituan released LongCat Video in late 2025: a 13.6-billion-parameter diffusion transformer and the first model that can reliably generate coherent video up to 15 minutes long.
The model supports text-to-video, image-to-video, and video continuation in a unified pipeline. In I2V mode, the input image becomes the literal first frame of the video. It's not a loose character reference you can place in any scene. The model animates forward from that starting frame while using "Cross-Chunk Latent Stitching" to keep referencing the original image throughout generation, preventing color drift and maintaining visual consistency over long sequences. An updated 2026 variant adds audio-driven avatar generation with lip-sync for 5+ minute talking head videos.
Under the hood, LongCat uses a coarse-to-fine generation approach with Block Sparse Attention to handle the massive sequence lengths. RLHF tuning improves motion quality. It currently ranks third globally behind Google Veo 3 and ShanghaiAI on video quality benchmarks.
Availability: Open source under MIT license. Available through fal.ai API at $0.04 per generated second ($36 for a 15-minute video at 720p). Also available through LongCat's own platform with credit-based pricing.
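If you're calling it through fal.ai, generation is a single I2V request. Here's a minimal sketch using the fal.ai Python client; the endpoint id and argument names are assumptions based on fal.ai's usual conventions, so check the model page for the exact schema.

```python
# Hypothetical LongCat Video call via fal.ai's Python client (pip install fal-client).
# Endpoint id and argument names are assumptions -- verify against the model page.
import fal_client

result = fal_client.subscribe(
    "fal-ai/longcat-video/image-to-video",  # assumed endpoint id
    arguments={
        "image_url": "https://example.com/opening-frame.png",  # becomes the literal first frame
        "prompt": "The hiker follows the ridge trail as the sun sets behind the peaks",
        "duration": 120,        # assumed: seconds of video to generate
        "resolution": "720p",   # assumed parameter name
    },
)
print(result["video"]["url"])   # assumed response shape
```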
| Spec | Value |
|---|---|
| Max duration | ~15 minutes |
| Resolution | 720p at 30fps |
| Parameters | 13.6B |
| Reference images | First-frame only (I2V mode, not character reference) |
| License | MIT |
| API cost | ~$0.04/second (fal.ai) |
Seaweed APT2
ByteDance's Seaweed APT2 takes a different approach. Instead of generating a complete video upfront, it produces frames autoregressively at 24fps with just 0.16 seconds of latency per frame on a single H100. The result is stable video up to 5 minutes with temporal consistency that holds.
The technical trick is Autoregressive Adversarial Post-Training (AAPT), which converts a pretrained bidirectional video diffusion model into a unidirectional autoregressive generator that needs only a single network forward evaluation per frame. That's what makes real-time generation possible.
What makes this model interesting beyond raw length is interactivity. You can control the camera, animate characters through pose detection, and manipulate scenes while the video renders. Think of it less as "generate a video" and more as "steer a video in real time."
Availability: Research phase only. Not publicly available yet. The 7B base model (Seaweed-7B) has a published paper but the APT2 weights haven't been released.
| Spec | Value |
|---|---|
| Max duration | ~5 minutes |
| Resolution | 736x416 (single GPU), up to 720p (8 GPUs) |
| Parameters | 8B |
| Reference images | Via I2V and interactive pose control |
| License | Not released |
| Status | Research preview |
Helios
Helios comes from Peking University, built on top of Wan 2.1. It's a 14B parameter model that generates minute-scale video at 19.5 FPS on a single H100. The key innovation is how it handles long-video drifting. Instead of using conventional anti-drifting techniques like self-forcing or keyframe sampling, Helios simulates drifting during training so the model learns to correct for it.
It natively supports text-to-video, image-to-video, and video-to-video tasks. The I2V mode accepts reference images to seed the generation.
Availability: Fully open source under Apache 2.0. Released March 2026. Code and weights on GitHub (PKU-YuanGroup/Helios). Integrated into Diffusers, SGLang, and vLLM-Omni. Gradio demo on HuggingFace Spaces.
| Spec | Value |
|---|---|
| Max duration | Minute-scale (no fixed cap) |
| Resolution | 720p |
| Parameters | 14B |
| Reference images | Yes (I2V mode) |
| License | Apache 2.0 |
| Hardware | Single H100 for real-time |
SkyReels V2 / V3
Skywork's SkyReels line aims for infinite-length video. V2 uses an AutoRegressive Diffusion-Forcing architecture that generates video without a fixed duration cap. V3, released January 2026, unifies reference image-to-video, video-to-video extension, and audio-guided avatar generation in a single model.
V3 accepts 1 to 4 reference images and preserves subject identity across the generated video. The video-to-video mode enables seamless single-shot continuation and multi-shot switching with cinematographic transitions.
Availability: Fully open source. Models from 1.3B to 14B parameters. Available at 540p and 720p. Code and weights on GitHub and HuggingFace.
| Spec | Value |
|---|---|
| Max duration | Unlimited (autoregressive) |
| Resolution | 540p, 720p |
| Parameters | 1.3B, 5B, 14B |
| Reference images | 1-4 images (V3) |
| License | Open source |
| Hardware | Minimum RTX 4090, recommended 4-8x A100 |
Tier 2: Short Clips with Strong Reference + Extension
These models generate clips measured in seconds rather than minutes, but they offer strong reference image support and video extension features. For long-form content, you chain clips together using the model's continuation or extension endpoints. Character consistency comes from reference images that persist across generations.
This is the practical workflow most creators use today for content longer than a minute. Per-clip quality is often higher than what the native long-form models produce.
Kling 3.0 Omni (Kuaishou)
Kling has the most complete reference image system of any video model. It separates reference inputs into three distinct categories, each serving a different purpose:
Reference Images (image_urls): Up to 4 images for style and appearance guidance. You tag them in your prompt as @Image1, @Image2, etc. These influence the overall look, scene style, and environment without being the first frame.
Elements (elements): Dedicated character/object inputs. Each element takes a frontal_image_url (clear front-facing photo) plus optional reference_image_urls (additional angles). You reference them as @Element1, @Element2 in your prompt. The model extracts the character's identity and places them in any scene you describe. This is the key feature for adventure-movie-style content: upload a character photo, then describe them walking through a forest, fighting a dragon, whatever you want.
Start/End Frames (start_image_url, end_image_url): Pin specific images as the first or last frame. These are literal frames, not style guides.
The total across all three categories is up to 7 reference inputs (drops to 4 when also using a reference video). A single prompt like "@Element1 and @Element2 are having dinner at the table from @Image1" can combine characters with scene references.
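Put together, a request combining characters and a scene reference looks roughly like the sketch below. The field names (image_urls, elements, frontal_image_url, reference_image_urls) follow the descriptions above; the overall payload shape is an assumption, so treat it as a sketch rather than the exact schema.

```python
# Illustrative payload combining Elements (character identity) with a style reference.
payload = {
    "prompt": "@Element1 and @Element2 are having dinner at the table from @Image1, "
              "warm candlelight, slow dolly-in",
    "image_urls": [                       # style/scene references, tagged @Image1 ...
        "https://example.com/dining-room.jpg",
    ],
    "elements": [                         # character references, tagged @Element1, @Element2
        {
            "frontal_image_url": "https://example.com/hero-front.jpg",
            "reference_image_urls": ["https://example.com/hero-side.jpg"],
        },
        {"frontal_image_url": "https://example.com/sidekick-front.jpg"},
    ],
    "duration": 10,                       # seconds, within the 3-15s per-shot range
}
```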
For long-form content, Kling offers two paths. Multi-shot mode generates up to 6 scenes in a single call, each with its own prompt and duration (3-15s each). Character elements persist across all shots automatically. The extend API continues from where a completed video left off, reaching roughly 3 minutes through chained extensions.
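Multi-shot mode reads like a storyboard: one call, several prompts, shared elements. The shots structure below is an assumed shape meant to show how the same @Element reference persists across every scene.

```python
# Hypothetical multi-shot storyboard: up to 6 shots per call, same character throughout.
storyboard = {
    "elements": [{"frontal_image_url": "https://example.com/hero-front.jpg"}],
    "shots": [  # assumed structure
        {"prompt": "@Element1 saddles a horse at dawn outside a stone keep", "duration": 10},
        {"prompt": "@Element1 rides across a misty moor, wide tracking shot", "duration": 12},
        {"prompt": "@Element1 dismounts at the mouth of a cave and lights a torch", "duration": 8},
    ],
}
```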
Kling 3.0 Omni unifies text-to-video, image-to-video, reference-to-video, and video editing in a single model with native audio generation and lip-sync.
Availability: Commercial API through Kuaishou, fal.ai ($0.084-0.112/sec), and Replicate. Web interface at klingai.com.
| Spec | Value |
|---|---|
| Native clip length | 3-15 seconds |
| Extended length | ~3 minutes (via chained extensions) |
| Resolution | 720p (standard), 1080p (pro) |
| Reference images | Up to 4 (@Image style refs) |
| Elements | Up to 4 (@Element character refs with frontal + angles) |
| Total references | Up to 7 combined (4 with video ref) |
| Multi-shot | Yes (up to 6 shots in storyboard) |
| Audio | Native synchronized audio + lip-sync |
| Video editing | Yes (text-guided editing of existing video) |
| API | Kuaishou, fal.ai, Replicate |
Grok Imagine (xAI)
xAI launched Grok Imagine's Reference-to-Video mode in early 2026 with support for 1-7 reference images. The documentation explicitly distinguishes this from image-to-video: "Unlike image-to-video where the source image becomes the starting frame, reference images influence what appears in the video without locking in the first frame."
You tag images in your prompt as <IMAGE_1>, <IMAGE_2>, etc. A prompt like "the model from <IMAGE_1> walks onto the runway wearing the shirt from <IMAGE_2>" combines a person reference with a clothing reference. The model handles virtual try-on, product placement, and character-consistent storytelling across scenes.
One constraint: you can't combine reference images with image-to-video in the same request. It's either first-frame mode or reference mode, not both.
Grok Imagine also has a video extension endpoint that adds new footage to the end of an existing video. The duration parameter controls only the new portion. You can chain extensions to build longer content.
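Here's what the reference-then-extend pattern looks like end to end. The endpoint paths, field names, and response shapes below are assumptions; only the <IMAGE_1>/<IMAGE_2> prompt-tag convention comes from the documentation.

```python
# Sketch of reference-to-video plus chained extensions against an assumed xAI-style API.
import os
import requests

API = "https://api.x.ai/v1"  # assumed base path
HEADERS = {"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"}

# 1. Reference mode: images guide the content without becoming the first frame.
seed = requests.post(f"{API}/video/generations", headers=HEADERS, json={
    "prompt": "the model from <IMAGE_1> walks onto the runway wearing the shirt from <IMAGE_2>",
    "reference_image_urls": [            # assumed field name, 1-7 images
        "https://example.com/model.jpg",
        "https://example.com/shirt.jpg",
    ],
    "duration": 15,
    "resolution": "720p",
}).json()

# 2. Chain extensions: each call's duration covers only the newly added footage.
video_id = seed["id"]
for shot in ["she pauses at the end of the runway", "she turns and walks back toward the camera"]:
    ext = requests.post(f"{API}/video/extensions", headers=HEADERS, json={
        "video_id": video_id,            # assumed: continue from the previous result
        "prompt": shot,
        "duration": 10,
    }).json()
    video_id = ext["id"]
```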
Availability: xAI API (launched January 2026), fal.ai, and Replicate. Python SDK, JavaScript/AI SDK, and REST API. $0.05/sec at 720p with audio. Also available to X Premium subscribers.
| Spec | Value |
|---|---|
| Native clip length | 1-15 seconds |
| Extended length | Chain-able via extension API |
| Resolution | 480p, 720p |
| Reference images | 1-7 (true reference, not first-frame) |
| Prompt tags | <IMAGE_1>, <IMAGE_2>, etc. |
| Audio | Yes (720p) |
| Video editing | Yes (text-guided) |
| API | xAI API, fal.ai, Replicate |
| API cost | $0.05/second (720p with audio) |
Seedance 2.0 (ByteDance)
ByteDance's Seedance 2.0 accepts the most reference inputs of any model: up to 12 files simultaneously, including up to 9 images, 3 videos, and 3 audio files. The model supports native audio-video generation with phoneme-level lip-sync in 8+ languages.
Individual images can be up to 30MB each. Reference videos must be 2-15 seconds. The model uses the references for character appearance, scene styling, and motion guidance.
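As a rough illustration of how those references mix in one request (field names are assumptions; the limits are the ones listed above):

```python
# Illustrative Seedance 2.0 payload mixing image, video, and audio references.
payload = {
    "prompt": "The chef from the photos plates the dish while the narration plays",
    "reference_images": ["chef_front.jpg", "chef_profile.jpg", "plating_style.jpg"],  # up to 9, <=30MB each
    "reference_videos": ["whisking_motion.mp4"],   # up to 3, each 2-15 seconds
    "reference_audio": ["narration.mp3"],          # up to 3, drives lip-sync
    "duration": 12,
    "resolution": "720p",
}
```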
Availability: ByteDance official API (via Volcengine, launched February 2026) and third-party API providers. Output at 480p-720p via API, up to 2K cinema resolution through the platform.
| Spec | Value |
|---|---|
| Native clip length | 4-15 seconds |
| Resolution | Up to 2K (cinema) |
| Reference images | Up to 9 images + 3 videos + 3 audio (12 total) |
| Audio | Native with lip-sync (8+ languages) |
| API | ByteDance/Volcengine, third-party providers |
Runway Gen-4.5
Runway Gen-4.5 ranks #1 on the Artificial Analysis Text-to-Video leaderboard with an Elo of 1,247, beating Veo 3 and Sora 2 Pro. The model generates 2-10 second clips for text-to-video and supports character-consistent long-form video up to one minute through multi-shot sequencing.
Image-to-video was added in January 2026 and supports reference images for all aspect ratios. The model integrates neural radiance fields and Gaussian splatting within the diffusion architecture, giving it 3D geometric understanding rather than pixel-level prediction alone. This means better object permanence and physically plausible motion.
Availability: Commercial API and web interface. SDKs for Node and Python. Also available on Replicate.
| Spec | Value |
|---|---|
| Native clip length | 2-10 seconds |
| Long-form mode | Up to ~1 minute |
| Resolution | Up to 1080p |
| Reference images | 0-1 per generation |
| Audio | Native audio generation |
| Multi-shot | Yes |
| API | Yes (Runway, Replicate) |
Google Veo 3.1
Google's Veo 3.1 generates 4, 6, or 8 second clips natively. The "Extend Video" feature (currently in preview) chains clips to reach approximately 1-2.5 minutes, though coherence can drift on longer sequences.
The "Ingredients to Video" feature accepts up to 3 reference images as input. You can provide characters to animate, backgrounds, and material textures. When you use reference images, the model sticks closer to your visual references and makes fewer random alterations. One limitation: reference image mode only works with the 8-second duration option.
As of January 2026, Veo 3.1 added vertical video (9:16) for reference-based generation and 4K upscaling on Vertex AI.
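Through the google-genai SDK, an Ingredients-style request looks roughly like the sketch below. generate_videos is the SDK's video entry point; the model id, the reference_images field, and the image wrapping are assumptions to verify against the current Veo 3.1 docs.

```python
# Rough sketch of an "Ingredients to Video" request via the google-genai Python SDK.
# Model id and reference-image fields are assumptions -- check the Veo 3.1 docs.
from google import genai
from google.genai import types

client = genai.Client()
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",   # assumed model id
    prompt="The knight from the first image rides through the castle gate "
           "from the second image at dusk",
    config=types.GenerateVideosConfig(
        duration_seconds=8,             # reference mode only supports the 8-second option
        reference_images=[              # assumed field name, up to 3 images
            types.Image(image_bytes=open("knight.png", "rb").read(), mime_type="image/png"),
            types.Image(image_bytes=open("castle.png", "rb").read(), mime_type="image/png"),
        ],
    ),
)
# generate_videos returns a long-running operation; poll it until the video is ready.
```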
Availability: Google Vertex AI API, Gemini API, and Google Flow. Requires Google Cloud account.
| Spec | Value |
|---|---|
| Native clip length | 4, 6, or 8 seconds |
| Extended length | ~1-2.5 minutes |
| Resolution | Up to 4K (with upscaling) |
| Reference images | Up to 3 ("Ingredients to Video") |
| Audio | Synchronized dialogue and music |
| API | Vertex AI, Gemini API |
OpenAI Sora 2 / Sora 2 Pro
Sora 2 Pro generates clips up to 20 seconds. The Characters API uses a different approach from Kling or Grok: instead of uploading static images, you create a character_id by pointing the API at a video clip (with a 1-3 second timestamp range). Sora analyzes the video frames to extract facial structure, body proportions, clothing style, and other identifying features. That character_id persists indefinitely and can be reused across unlimited future generations.
You can reference up to 2 uploaded characters per generation. As of March 2026, character references work for objects and animals too, not just people. Video extension uses the full initial clip as context for continuation.
The character system requires video input (not static images) to create characters. If you only have photos, you'd need to generate a short video first, then extract the character from that.
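In code, the flow is two calls: create the character once, then reference its id on every generation. Endpoint paths and field names below are assumptions; only the concept of extracting a character from a 1-3 second timestamp range comes from the API description above.

```python
# Hypothetical two-step character workflow against the OpenAI API.
import os
import requests

API = "https://api.openai.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

# 1. Create a persistent character from a short timestamp range of an existing video.
char = requests.post(f"{API}/videos/characters", headers=HEADERS, json={  # assumed path
    "video_id": "video_abc123",
    "start_time": 2.0,   # assumed: the 1-3 second window Sora analyzes
    "end_time": 4.0,
}).json()
character_id = char["id"]

# 2. Reuse the id across as many generations as you like (up to 2 characters per request).
clip = requests.post(f"{API}/videos", headers=HEADERS, json={
    "model": "sora-2-pro",
    "prompt": "The character climbs the rope bridge in a storm",
    "character_ids": [character_id],   # assumed field name
    "seconds": 12,                     # assumed field name; Pro clips run up to 20s
}).json()
```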
Availability: OpenAI API with Batch API support for production workflows.
| Spec | Value |
|---|---|
| Native clip length | Up to 20 seconds |
| Resolution | Up to 1920x1080 |
| Character references | Up to 2 per generation (persistent character_id) |
| Character input | Video clip (1-3s timestamp range), not static images |
| Audio | Synchronized |
| Extension | Yes (full clip as context) |
| API | OpenAI API + Batch API |
MiniMax Hailuo 02
Hailuo 02 ranks #2 globally on the Artificial Analysis benchmark, beating Veo 3. It generates 10-second clips at native 1080p with some of the best physics simulation in the field. The model handles extreme motion like gymnastics and acrobatics without breaking apart.
It supports image-to-video generation with strong character consistency through facial recognition and body tracking. The Noise-aware Compute Redistribution architecture dynamically allocates compute based on scene complexity.
Availability: Commercial API. Available through MiniMax platform, fal.ai, and Replicate. $0.28 per video.
| Spec | Value |
|---|---|
| Native clip length | Up to 10 seconds |
| Resolution | 1080p native |
| Reference images | Yes (I2V mode) |
| Audio | Not native |
| Physics | Best-in-class simulation |
| API | MiniMax, fal.ai, Replicate |
Luma Ray2
Ray2 generates 5-10 second clips at up to 1080p with 4K upscaling available. The Extend feature continues videos up to 30 seconds total. Image-to-video accepts reference images as start or end keyframes.
The model is trained on a multi-modal architecture with 10x the compute of Ray1. It handles photorealistic content well but the 30-second extension cap limits long-form use.
Availability: Luma API and web interface.
| Spec | Value |
|---|---|
| Native clip length | 5-10 seconds |
| Extended length | Up to 30 seconds |
| Resolution | Up to 4K (with upscaling) |
| Reference images | Yes (start/end keyframes) |
| API | Luma API |
Pika 2.5
Pika takes a keyframe-based approach with Pikaframes. Upload 2-5 keyframes (reference images at key moments) and the model generates smooth transitions between them. Total duration reaches 20-25 seconds.
Pikascenes accepts up to 10 reference images and combines them into a single video. The model uses image recognition to figure out each reference's role (character, background, prop) automatically.
Availability: Pika web platform and API. Subscription plans from free to Pro.
| Spec | Value |
|---|---|
| Native clip length | 5-10 seconds |
| Pikaframes length | 20-25 seconds |
| Resolution | Up to 1080p |
| Reference images | Up to 10 (Pikascenes), 2-5 keyframes (Pikaframes) |
| API | Yes |
Tier 3: Open-Source Models for Self-Hosted Workflows
These models generate shorter clips but they're fully open. You can run them on your own hardware, fine-tune them, and build custom extension pipelines without API dependencies.
Wan 2.1 (Alibaba)
Wan 2.1 is the foundation several other models build on (including Helios). The Wan-VAE architecture encodes and decodes 1080p video of any length while preserving temporal information. The model comes in I2V variants at 480p and 720p, plus a First-Last-Frame-to-Video model that generates video between two reference images.
Wan-Edit allows style and content transfer using reference images while maintaining specific structures or character poses.
| Spec | Value |
|---|---|
| Parameters | 1.3B, 5B, 14B |
| I2V modes | I2V-480P, I2V-720P, FLF2V-720P |
| License | Apache 2.0 |
| Hardware | 8GB+ VRAM (smaller variants) |
| Platforms | Diffusers, ComfyUI |
HunyuanVideo (Tencent)
Tencent's 13B parameter model was the open-source video generation leader through most of 2025. HunyuanVideo-I2V uses a token replace technique with a pre-trained MLLM to incorporate reference image information. HunyuanVideo-1.5, released November 2025, improved efficiency. HunyuanCustom enables multimodal-driven customized video generation.
| Spec | Value |
|---|---|
| Parameters | 13B |
| I2V | Yes (token replace technique) |
| License | Open source |
| Hardware | 60GB+ VRAM (720p) |
| Variants | Base, I2V, 1.5, Avatar, Custom |
CogVideoX (Tsinghua/Zhipu AI)
CogVideoX uses a 3D causal VAE that reduces sequence length and prevents flickering. The adaptive LayerNorm transformer improves text-video alignment. Available in 2B (Apache 2.0) and 5B (research license) variants with native Diffusers integration.
Clips are 6-10 seconds at 720x480. Short, but the quality-to-compute ratio is good and it runs on a 12GB GPU.
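Because it ships with native Diffusers integration, an I2V run is a few lines. A minimal sketch with the 5B I2V checkpoint (adjust the prompt and frame count to taste):

```python
# Minimal CogVideoX image-to-video run with Diffusers.
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM near the ~12GB budget mentioned above

image = load_image("hero.png")   # becomes the first frame (this is a first-frame model)
frames = pipe(
    prompt="the hero walks toward the camera through falling snow",
    image=image,
    num_frames=49,               # ~6 seconds at 8fps
).frames[0]
export_to_video(frames, "clip.mp4", fps=8)
```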
| Spec | Value |
|---|---|
| Parameters | 2B, 5B |
| I2V | Yes (CogVideoXImageToVideoPipeline) |
| Resolution | 720x480 at 8fps |
| License | Apache 2.0 (2B), Research (5B) |
| Hardware | 12GB VRAM |
First-Frame vs. True Reference: The Key Distinction
Not all "reference image" support is the same. Understanding the difference is critical for choosing the right model.
First-frame models (LongCat, Helios, Hailuo, Luma Ray2, HunyuanVideo) treat your image as the literal opening frame. The model animates forward from that exact visual. You can't upload a character headshot and describe them in a different scene. The image is the scene.
True reference models (Kling, Grok Imagine, Seedance, SkyReels V3) extract identity from your image and place that character/object into any scene you describe. Upload a photo of a person, then prompt "that person walks through a forest at sunset." The character appears in a completely new environment while maintaining their identity. This is what you need for multi-scene narrative content like an adventure movie.
Character ID models (Sora 2 Pro) extract identity from video clips rather than static images. You create a persistent character ID once and reuse it across unlimited future generations.
Style/ingredient models (Veo 3.1) use reference images to influence visual style, textures, and overall look rather than extracting specific character identities. Good for maintaining visual consistency across a project, less precise for individual character control.
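The difference is easiest to see side by side. Both requests below are purely illustrative (field names vary by model), but they show why the same character photo means two different things:

```python
# Same photo, two different meanings (field names are illustrative only).

first_frame_request = {     # first-frame models: the photo IS the scene
    "image_url": "https://example.com/hero.jpg",
    "prompt": "camera slowly pushes in",  # you can only animate forward from this exact frame
}

true_reference_request = {  # true-reference models: the photo is an identity
    "reference_image_urls": ["https://example.com/hero.jpg"],
    "prompt": "the person from the reference walks through a forest at sunset",
}
```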
The Real Workflow for 10-Minute Videos
Here's the honest take on where things stand in March 2026. No single model reliably generates 10 minutes of consistent, high-quality video in one shot. LongCat Video gets closest with claims of 15 minutes, but quality and coherence vary significantly at those lengths. Helios and SkyReels V2 generate "minute-scale" and "infinite-length" video respectively, but the outputs need careful prompting and often multiple attempts.
The workflow that actually works for most creators building 5-15 minute videos combines multiple approaches:
For talking head / avatar content: LongCat Video's 2026 audio-driven mode or SkyReels V3's avatar generation can produce 5+ minutes of a consistent talking character. This is the closest thing to "press a button, get long video."
For narrative content with multiple scenes (adventure movie style): Use Kling 3.0, Grok Imagine, or Seedance 2.0 with true character reference images. Generate individual shots of 10-15 seconds each. Use the same @Element or <IMAGE> references across every generation to maintain character identity. Chain shots together using multi-shot mode (Kling supports 6 shots per call) or the extend API. Kling is the most battle-tested for this workflow. Grok Imagine's explicit separation between "reference mode" and "first-frame mode" makes it a strong alternative. Seedance 2.0 accepts the most reference inputs (12 files) but is newer and less proven.
For character consistency across many clips: Sora 2 Pro's persistent character_id system is the cleanest approach for very long projects. Extract the character once from a short video, then generate dozens of clips referencing that ID. The character identity doesn't degrade over time because it's stored as a persistent embedding, not re-interpreted from an image each time.
For style-transferred content: Lucy Restyle on fal.ai processes existing video up to 30 minutes, applying AI style transformations while preserving motion. If you have source footage, this sidesteps the generation length problem entirely. $0.01 per second of source video.
For open-source pipelines: Build on Wan 2.1 or Helios with a video continuation loop. Generate a clip, use the last frame as the start frame for the next clip, repeat. ComfyUI workflows automate this. Consistency degrades over many iterations but it's free and controllable.
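A bare-bones version of that loop looks like the sketch below. generate_clip is a placeholder for whatever local I2V pipeline you run (Wan 2.1, Helios, a ComfyUI API call); the ffmpeg step pulls the last frame of each clip to seed the next one.

```python
# Last-frame continuation loop -- the pattern ComfyUI workflows automate.
import subprocess

def generate_clip(prompt: str, image: str, output: str) -> None:
    """Placeholder: swap in your local I2V pipeline (Wan 2.1, Helios, ComfyUI API, ...)."""
    raise NotImplementedError

def extract_last_frame(video_path: str, frame_path: str) -> str:
    # Seek into the final second and keep overwriting the image until only the last frame remains.
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-1", "-i", video_path, "-update", "1", frame_path],
        check=True,
    )
    return frame_path

shot_list = [
    "the explorer crosses the rope bridge in heavy wind",
    "she reaches the cliff edge and looks down at the valley",
    "she begins the climb down as storm clouds gather",
]

start_frame = "shot_000_start.png"   # your initial reference / first frame
clips = []
for i, shot_prompt in enumerate(shot_list):
    clip_path = f"shot_{i:03d}.mp4"
    generate_clip(prompt=shot_prompt, image=start_frame, output=clip_path)
    clips.append(clip_path)
    start_frame = extract_last_frame(clip_path, f"shot_{i:03d}_last.png")
# Concatenate the clips afterwards (e.g. ffmpeg's concat demuxer). Expect some drift.
```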
The core challenge remains: even with true reference image support, character drift compounds across dozens of clips. Facial features, hair, clothing, and skin tone gradually shift. The workarounds (high-quality reference photos, consistent prompting, shot batching) are necessary. But models like Kling and Grok Imagine that separate character identity from scene composition make this dramatically easier than the first-frame-only models.
Comparison Table
| Model | Max Native Duration | Extended Duration | Reference Type | Max Refs | Resolution | API Available | Open Source |
|---|---|---|---|---|---|---|---|
| LongCat Video | ~15 min | N/A | First-frame only | 1 | 720p/30fps | Yes (fal.ai) | Yes (MIT) |
| Seaweed APT2 | ~5 min | N/A | I2V + pose | 1 | 720p | No | No |
| Helios | Minute-scale | N/A | First-frame (I2V) | 1 | 720p | HF Spaces | Yes (Apache 2.0) |
| SkyReels V3 | Unlimited | N/A | True reference | 1-4 | 720p | No | Yes |
| Kling 3.0 | 15s | ~3 min | Elements + style refs | 7 | 1080p | Yes | No |
| Grok Imagine | 15s | Chain-able | True reference | 7 | 720p | Yes | No |
| Seedance 2.0 | 15s | N/A | Multi-modal refs | 12 | 2K | Yes | No |
| Runway Gen-4.5 | 10s | ~1 min | I2V (0-1) | 1 | 1080p | Yes | No |
| Veo 3.1 | 8s | ~2.5 min | Ingredients (style) | 3 | 4K | Yes | No |
| Sora 2 Pro | 20s | Chain-able | Character ID (video) | 2 | 1080p | Yes | No |
| Hailuo 02 | 10s | N/A | I2V (first-frame) | 1 | 1080p | Yes | No |
| Luma Ray2 | 10s | 30s | First-frame | 1 | 4K | Yes | No |
| Pika 2.5 | 10s | 25s | Pikascenes | 10 | 1080p | Yes | No |
| Wan 2.1 | Short clips | Via continuation | I2V / FLF2V | 1-2 | 720p | Via fal.ai | Yes (Apache 2.0) |
| HunyuanVideo | Short clips | Via continuation | I2V (first-frame) | 1 | 720p | Via fal.ai | Yes |
| CogVideoX | 6-10s | Via continuation | I2V (first-frame) | 1 | 720x480 | Via fal.ai | Yes |
What's Coming
The trajectory through 2026 is clear. LongCat Video proved that minute-scale generation with consistency is possible in an open model. Helios showed it can happen in real-time. Seaweed APT2 demonstrated interactive long-form generation. And the true-reference models (Kling, Grok, Seedance) proved that character identity can persist across arbitrary scenes.
The next step is combining these capabilities: native long-form generation with true character reference support. Right now you pick one or the other. When a model can generate 5 minutes of video while maintaining characters from reference images across dozens of scene changes, the chained-clips workflow becomes obsolete.
For now, the practical answer depends on your use case:
Best for multi-character reference: Kling 3.0 (up to 7 refs with separate element + style system) or Seedance 2.0 (up to 12 multimodal inputs).
Best API for reference-to-video: Grok Imagine (clean API, explicit reference mode, $0.05/sec) or Kling via fal.ai ($0.084-0.112/sec).
Best for persistent characters across many clips: Sora 2 Pro (character ID system, no drift over time).
Best open source: SkyReels V3 (1-4 true reference images, unlimited length) or Helios (real-time, Apache 2.0).
Best for raw duration: LongCat Video (~15 min, but first-frame only).
More Reading
- Frontier Open-Source Gen AI Models — practical guide to open-source generative AI for video, image, 3D, audio, and more
- Video Generator — our video generation tool powered by Kling 3.0 Pro
- How to go from sketch to animated 3D character — using image and video generation for character animation