Long-Form Video Generation Models with Reference Image Support

Generating a 10-second clip is easy now. Every major model does it. The real question is: can you generate 5 or 10 minutes of coherent video where a character looks the same in minute one and minute eight? Where the scene holds together across hundreds of frames?

That's the hard problem. And it's where things are shifting fast. This guide covers every model we've found that either generates long video natively or supports the workflows needed to build long-form content with consistent characters through reference images. We split them into three tiers: models that generate minutes-long video directly, models with strong reference image support that you extend through continuation, and open-source options you can run yourself.

Tier 1: Native Long-Form Generation (Minutes+)

These models generate video measured in minutes, not seconds. They're built from the ground up for temporal consistency over long sequences.

LongCat Video

LongCat Video generates minutes-long coherent video from a single prompt, with no color drift or temporal inconsistency across the full duration.

Meituan released LongCat Video in late 2025. It's a 13.6-billion-parameter diffusion transformer, and the first model that can reliably generate coherent video up to 15 minutes long.

The model supports text-to-video, image-to-video, and video continuation in a unified pipeline. In I2V mode, the input image becomes the literal first frame of the video. It's not a loose character reference you can place in any scene. The model animates forward from that starting frame while using "Cross-Chunk Latent Stitching" to keep referencing the original image throughout generation, preventing color drift and maintaining visual consistency over long sequences. An updated 2026 variant adds audio-driven avatar generation with lip-sync for 5+ minute talking head videos.

Under the hood, LongCat uses a coarse-to-fine generation approach with Block Sparse Attention to handle the massive sequence lengths. RLHF tuning improves motion quality. It currently ranks third globally behind Google Veo 3 and ShanghaiAI on video quality benchmarks.

Availability: Open source under MIT license. Available through fal.ai API at $0.04 per generated second ($36 for a 15-minute video at 720p). Also available through LongCat's own platform with credit-based pricing.
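The per-second pricing makes cost easy to estimate up front. A quick sanity check of the fal.ai rate quoted above (the helper function is ours, not part of any SDK):

```python
# Sanity-check the per-second pricing quoted above ($0.04/s on fal.ai).
PRICE_PER_SECOND = 0.04  # USD, the quoted fal.ai rate for LongCat Video

def longcat_cost(minutes: float, price_per_second: float = PRICE_PER_SECOND) -> float:
    """Cost in USD for a generation of the given length."""
    return round(minutes * 60 * price_per_second, 2)

print(longcat_cost(15))  # a full 15-minute generation -> 36.0
```

Which matches the $36-per-15-minutes figure above.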

Specs:
Max duration: ~15 minutes
Resolution: 720p at 30fps
Parameters: 13.6B
Reference images: First-frame only (I2V mode, not character reference)
License: MIT
API cost: ~$0.04/second (fal.ai)

Seaweed APT2

Seaweed APT2 generates video autoregressively at 24fps with interactive camera and pose control, closer to a game engine than a render queue.

ByteDance's Seaweed APT2 takes a different approach. Instead of generating a complete video upfront, it produces frames autoregressively at 24fps with just 0.16 seconds of latency per frame on a single H100. The result is stable video up to 5 minutes with temporal consistency that holds.

The technical trick is Autoregressive Adversarial Post-Training (AAPT), which converts a pretrained bidirectional video diffusion model into a unidirectional autoregressive generator. Single network forward evaluation per frame. That's what makes real-time generation possible.

What makes this model interesting beyond raw length is interactivity. You can control the camera, animate characters through pose detection, and manipulate scenes while the video renders. Think of it less as "generate a video" and more as "steer a video in real time."

Availability: Research phase only. Not publicly available yet. The 7B base model (Seaweed-7B) has a published paper but the APT2 weights haven't been released.

Specs:
Max duration: ~5 minutes
Resolution: 736x416 (single GPU), up to 720p (8 GPUs)
Parameters: 8B
Reference images: Via I2V and interactive pose control
License: Not released
Status: Research preview

Helios

Helios runs at 19.5 FPS on a single H100, generating minute-scale video while simulating and correcting for temporal drift during training.

Helios comes from Peking University, built on top of Wan 2.1. It's a 14B parameter model that generates minute-scale video at 19.5 FPS on a single H100. The key innovation is how it handles long-video drifting. Instead of using conventional anti-drifting techniques like self-forcing or keyframe sampling, Helios simulates drifting during training so the model learns to correct for it.

It natively supports text-to-video, image-to-video, and video-to-video tasks. The I2V mode accepts reference images to seed the generation.

Availability: Fully open source under Apache 2.0. Released March 2026. Code and weights on GitHub (PKU-YuanGroup/Helios). Integrated into Diffusers, SGLang, and vLLM-Omni. Gradio demo on HuggingFace Spaces.

Specs:
Max duration: Minute-scale (no fixed cap)
Resolution: 720p
Parameters: 14B
Reference images: Yes (I2V mode)
License: Apache 2.0
Hardware: Single H100 for real-time

SkyReels V2 / V3

SkyReels V3 accepts 1-4 reference images and generates unlimited-length video with multi-shot switching and audio-guided avatar synthesis.

Skywork's SkyReels line aims for infinite-length video. V2 uses an AutoRegressive Diffusion-Forcing architecture that generates video without a fixed duration cap. V3, released January 2026, unifies reference image-to-video, video-to-video extension, and audio-guided avatar generation in a single model.

V3 accepts 1 to 4 reference images and preserves subject identity across the generated video. The video-to-video mode enables seamless single-shot continuation and multi-shot switching with cinematographic transitions.

Availability: Fully open source. Models from 1.3B to 14B parameters. Available at 540p and 720p. Code and weights on GitHub and HuggingFace.

Specs:
Max duration: Unlimited (autoregressive)
Resolution: 540p, 720p
Parameters: 1.3B, 5B, 14B
Reference images: 1-4 images (V3)
License: Open source
Hardware: Minimum RTX 4090, recommended 4-8x A100

Tier 2: Short Clips with Strong Reference + Extension

These models generate 8-60 second clips but offer strong reference image support and video extension features. For long-form content, you chain clips together using the model's continuation or extension endpoints. Character consistency comes from reference images that persist across generations.

This is the practical workflow most creators use today for content longer than a minute. The per-clip quality is often higher than what the native long-form models produce.

Kling 3.0 Omni (Kuaishou)

Kling 3.0 Omni combines character elements, style references, and multi-shot storyboarding in a single call with native 4K 60fps output.

Kling has the most complete reference image system of any video model. It separates reference inputs into three distinct categories, each serving a different purpose:

Reference Images (image_urls): Up to 4 images for style and appearance guidance. You tag them in your prompt as @Image1, @Image2, etc. These influence the overall look, scene style, and environment without being the first frame.

Elements (elements): Dedicated character/object inputs. Each element takes a frontal_image_url (clear front-facing photo) plus optional reference_image_urls (additional angles). You reference them as @Element1, @Element2 in your prompt. The model extracts the character's identity and places them in any scene you describe. This is the key feature for adventure-movie-style content: upload a character photo, then describe them walking through a forest, fighting a dragon, whatever you want.

Start/End Frames (start_image_url, end_image_url): Pin specific images as the first or last frame. These are literal frames, not style guides.

The total across all three categories is up to 7 reference inputs (drops to 4 when also using a reference video). A single prompt like "@Element1 and @Element2 are having dinner at this table on @Image1" can combine characters with scene references.
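A hypothetical request payload illustrating the three reference categories together. The field names (image_urls, elements, frontal_image_url, reference_image_urls) come from the description above; the surrounding payload shape and URLs are illustrative, not Kuaishou's exact schema:

```python
# Hypothetical Kling reference-to-video payload. Field names are taken
# from the article; the overall shape and URLs are illustrative only.
payload = {
    "prompt": "@Element1 and @Element2 are having dinner at this table on @Image1",
    "image_urls": [  # style/scene references, tagged @Image1... in the prompt
        "https://example.com/dining-room.jpg",
    ],
    "elements": [  # character references, tagged @Element1... in the prompt
        {
            "frontal_image_url": "https://example.com/alice-front.jpg",
            "reference_image_urls": ["https://example.com/alice-side.jpg"],
        },
        {"frontal_image_url": "https://example.com/bob-front.jpg"},
    ],
    "duration": 10,  # seconds, within the 3-15s per-shot range
}

# The 7-reference budget counts style images plus every element image.
total_refs = len(payload["image_urls"]) + sum(
    1 + len(e.get("reference_image_urls", [])) for e in payload["elements"]
)
assert total_refs <= 7
```

Here the budget works out to 4 of the 7 allowed references: one style image, two images for the first element, one for the second.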

For long-form content, Kling offers two paths. Multi-shot mode generates up to 6 scenes in a single call, each with its own prompt and duration (3-15s each). Character elements persist across all shots automatically. The extend API continues from where a completed video left off, reaching roughly 3 minutes through chained extensions.

Kling 3.0 Omni unifies text-to-video, image-to-video, reference-to-video, and video editing in a single model with native audio generation and lip-sync.

Availability: Commercial API through Kuaishou, fal.ai ($0.084-0.112/sec), and Replicate. Web interface at klingai.com.

Specs:
Native clip length: 3-15 seconds
Extended length: ~3 minutes (via chained extensions)
Resolution: 720p (standard), 1080p (pro)
Reference images: Up to 4 (@Image style refs)
Elements: Up to 4 (@Element character refs with frontal + angles)
Total references: Up to 7 combined (4 with video ref)
Multi-shot: Yes (up to 6 shots in storyboard)
Audio: Native synchronized audio + lip-sync
Video editing: Yes (text-guided editing of existing video)
API: Kuaishou, fal.ai, Replicate

Grok Imagine (xAI)

Grok Imagine separates reference mode from first-frame mode, letting you tag up to 7 images as character or object references in your prompt.

xAI launched Grok Imagine's Reference-to-Video mode in early 2026 with support for 1-7 reference images. The documentation explicitly distinguishes this from image-to-video: "Unlike image-to-video where the source image becomes the starting frame, reference images influence what appears in the video without locking in the first frame."

You tag images in your prompt as <IMAGE_1>, <IMAGE_2>, etc. A prompt like "the model from <IMAGE_1> walks onto the runway wearing the shirt from <IMAGE_2>" combines a person reference with a clothing reference. The model handles virtual try-on, product placement, and character-consistent storytelling across scenes.

One constraint: you can't combine reference images with image-to-video in the same request. It's either first-frame mode or reference mode, not both.

Grok Imagine also has a video extension endpoint that adds new footage to the end of an existing video. The duration parameter controls only the new portion. You can chain extensions to build longer content.
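The chaining pattern can be sketched as a simple loop. extend_video() here is a stand-in for the real extension endpoint; its name and signature are assumptions, not xAI's actual API:

```python
# Sketch of chaining a video-extension endpoint to build longer footage.
# extend_video() is a placeholder for the real API call, which would
# upload/reference the current video and return the extended result.
def extend_video(video_url: str, prompt: str, duration: int) -> str:
    """Placeholder: returns the URL of the extended video.
    `duration` covers only the newly added portion, per the docs above."""
    return f"{video_url}+{duration}s"

def build_long_video(seed_url: str, shots: list) -> str:
    """Chain extensions: each call appends `duration` seconds of new
    footage, described by `prompt`, to the end of the current video."""
    current = seed_url
    for prompt, duration in shots:
        current = extend_video(current, prompt, duration)
    return current

final = build_long_video("clip0", [("walks into the cave", 10),
                                   ("lights a torch", 10)])
```

Each iteration only pays for (and waits on) the newly generated seconds, which is why the duration parameter covers just the new portion.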

Availability: xAI API (launched January 2026), fal.ai, and Replicate. Python SDK, JavaScript/AI SDK, and REST API. $0.05/sec at 720p with audio. Also available to X Premium subscribers.

Specs:
Native clip length: 1-15 seconds
Extended length: Chain-able via extension API
Resolution: 480p, 720p
Reference images: 1-7 (true reference, not first-frame)
Prompt tags: <IMAGE_1>, <IMAGE_2>, etc.
Audio: Yes (720p)
Video editing: Yes (text-guided)
API: xAI API, fal.ai, Replicate
API cost: $0.05/second (720p with audio)

Seedance 2.0 (ByteDance)

Seedance 2.0 accepts up to 12 multimodal inputs simultaneously and generates video with native audio sync and phoneme-level lip-sync in 8+ languages.

ByteDance's Seedance 2.0 accepts the most reference inputs of any model: up to 12 files simultaneously, including up to 9 images, 3 videos, and 3 audio files. The model supports native audio-video generation with phoneme-level lip-sync in 8+ languages.

Individual images can be up to 30MB each. Reference videos must be 2-15 seconds. The model uses the references for character appearance, scene styling, and motion guidance.

Availability: ByteDance official API (via Volcengine, launched February 2026) and third-party API providers. Output at 480p-720p via API, up to 2K cinema resolution through the platform.

Specs:
Native clip length: 4-15 seconds
Resolution: Up to 2K (cinema)
Reference images: Up to 9 images + 3 videos + 3 audio (12 total)
Audio: Native with lip-sync (8+ languages)
API: ByteDance/Volcengine, third-party providers

Runway Gen-4.5

Runway Gen-4.5 leads the Artificial Analysis leaderboard at 1,247 ELO, with 3D geometric understanding from neural radiance fields and Gaussian splatting.

Runway Gen-4.5 ranks #1 on the Artificial Analysis Text-to-Video leaderboard with 1,247 ELO, beating Veo 3 and Sora 2 Pro. The model generates 2-10 second clips for text-to-video and supports character-consistent long-form video up to one minute through multi-shot sequencing.

Image-to-video was added in January 2026 and supports reference images for all aspect ratios. The model integrates neural radiance fields and Gaussian splatting within the diffusion architecture, giving it 3D geometric understanding rather than pixel-level prediction alone. This means better object permanence and physically plausible motion.

Availability: Commercial API and web interface. SDKs for Node and Python. Also available on Replicate.

Specs:
Native clip length: 2-10 seconds
Long-form mode: Up to ~1 minute
Resolution: Up to 1080p
Reference images: 0-1 per generation
Audio: Native audio generation
Multi-shot: Yes
API: Yes (Runway, Replicate)

Google Veo 3.1

Veo 3.1's "Ingredients to Video" mode accepts up to 3 reference images for characters, backgrounds, and textures with native audio and 4K upscaling.

Google's Veo 3.1 generates 4, 6, or 8 second clips natively. The "Extend Video" feature (currently in preview) chains clips to reach approximately 1-2.5 minutes, though coherence can drift on longer sequences.

The "Ingredients to Video" feature accepts up to 3 reference images as input. You can provide characters to animate, backgrounds, and material textures. When you use reference images, the model sticks closer to your visual references and makes fewer random alterations. One limitation: reference image mode only works with the 8-second duration option.

As of January 2026, Veo 3.1 added vertical video (9:16) for reference-based generation and 4K upscaling on Vertex AI.

Availability: Google Vertex AI API, Gemini API, and Google Flow. Requires Google Cloud account.

Specs:
Native clip length: 4, 6, or 8 seconds
Extended length: ~1-2.5 minutes
Resolution: Up to 4K (with upscaling)
Reference images: Up to 3 ("Ingredients to Video")
Audio: Synchronized dialogue and music
API: Vertex AI, Gemini API

OpenAI Sora 2 / Sora 2 Pro

Sora 2 Pro creates persistent character IDs from video clips, reusable across unlimited generations with no identity drift over time.

Sora 2 Pro generates clips up to 20 seconds. The Characters API uses a different approach from Kling or Grok: instead of uploading static images, you create a character_id by pointing the API at a video clip (with a 1-3 second timestamp range). Sora analyzes the video frames to extract facial structure, body proportions, clothing style, and other identifying features. That character_id persists indefinitely and can be reused across unlimited future generations.

You can reference up to 2 uploaded characters per generation. As of March 2026, character references work for objects and animals too, not just people. Video extension uses the full initial clip as context for continuation.

The character system requires video input (not static images) to create characters. If you only have photos, you'd need to generate a short video first, then extract the character from that.
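That photo-only workaround looks roughly like this. All three function names are placeholders, not OpenAI's actual endpoints; the only detail taken from the description above is the 1-3 second timestamp range:

```python
# Sketch of the photos-only workaround: generate a short clip from the
# photo, then extract a persistent character from that clip. Both
# functions are placeholders standing in for real API calls.
def generate_clip_from_image(image_path: str) -> str:
    """Stand-in for a short image-seeded generation."""
    return f"video-of-{image_path}"

def create_character(video_id: str, start_s: float, end_s: float) -> str:
    """Stand-in for the Characters API: extract identity from a 1-3s
    timestamp range of an existing video, returning a persistent ID."""
    assert 1.0 <= end_s - start_s <= 3.0  # range constraint from above
    return f"char-{video_id}"

# Only have a headshot? Generate a clip first, then extract the character.
video_id = generate_clip_from_image("headshot.jpg")
character_id = create_character(video_id, start_s=0.0, end_s=2.0)
```

Once created, that character_id is the reusable handle: future generations reference it rather than re-interpreting a photo each time.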

Availability: OpenAI API with Batch API support for production workflows.

Specs:
Native clip length: Up to 20 seconds
Resolution: Up to 1920x1080
Character references: Up to 2 per generation (persistent character_id)
Character input: Video clip (1-3s timestamp range), not static images
Audio: Synchronized
Extension: Yes (full clip as context)
API: OpenAI API + Batch API

MiniMax Hailuo 02

Hailuo 02 generates native 1080p video with best-in-class physics simulation, handling extreme motion like gymnastics without breaking apart.

Hailuo 02 ranks #2 globally on the Artificial Analysis benchmark, beating Veo 3. It generates 10-second clips at native 1080p with some of the best physics simulation in the field. The model handles extreme motion like gymnastics and acrobatics without breaking apart.

It supports image-to-video generation with strong character consistency through facial recognition and body tracking. The Noise-aware Compute Redistribution architecture dynamically allocates compute based on scene complexity.

Availability: Commercial API. Available through MiniMax platform, fal.ai, and Replicate. $0.28 per video.

Specs:
Native clip length: Up to 10 seconds
Resolution: 1080p native
Reference images: Yes (I2V mode)
Audio: Not native
Physics: Best-in-class simulation
API: MiniMax, fal.ai, Replicate

Luma Ray2

Ray2 animates reference images into 5-10 second clips with photorealistic quality, trained on 10x the compute of its predecessor.

Ray2 generates 5-10 second clips at up to 1080p with 4K upscaling available. The Extend feature continues videos up to 30 seconds total. Image-to-video accepts reference images as start or end keyframes.

The model is built on a multi-modal architecture and trained with 10x the compute of Ray1. It handles photorealistic content well, but the 30-second extension cap limits long-form use.

Availability: Luma API and web interface.

Specs:
Native clip length: 5-10 seconds
Extended length: Up to 30 seconds
Resolution: Up to 4K (with upscaling)
Reference images: Yes (start/end keyframes)
API: Luma API

Pika 2.5

Pikaframes generates smooth transitions between 2-5 keyframe images, producing up to 25 seconds of coherent video from reference stills.

Pika takes a keyframe-based approach with Pikaframes. Upload 2-5 keyframes (reference images at key moments) and the model generates smooth transitions between them. Total duration reaches 20-25 seconds.

Pikascenes accepts up to 10 reference images and combines them into a single video. The model uses image recognition to figure out each reference's role (character, background, prop) automatically.

Availability: Pika web platform and API. Subscription plans from free to Pro.

Specs:
Native clip length: 5-10 seconds
Pikaframes length: 20-25 seconds
Resolution: Up to 1080p
Reference images: Up to 10 (Pikascenes), 2-5 keyframes (Pikaframes)
API: Yes

Tier 3: Open-Source Models for Self-Hosted Workflows

These models generate shorter clips but they're fully open. You can run them on your own hardware, fine-tune them, and build custom extension pipelines without API dependencies.

Wan 2.1 (Alibaba)

Wan 2.1 provides the foundation several other models build on, with I2V, First-Last-Frame, and video editing modes across 1.3B to 14B parameter variants.

Wan 2.1 is the foundation several other models build on (including Helios). The Wan-VAE architecture encodes and decodes 1080p video of any length while preserving temporal information. The model comes in I2V variants at 480p and 720p, plus a First-Last-Frame-to-Video model that generates video between two reference images.

Wan-Edit allows style and content transfer using reference images while maintaining specific structures or character poses.

Specs:
Parameters: 1.3B, 5B, 14B
I2V modes: I2V-480P, I2V-720P, FLF2V-720P
License: Apache 2.0
Hardware: 8GB+ VRAM (smaller variants)
Platforms: Diffusers, ComfyUI

HunyuanVideo (Tencent)

HunyuanVideo's 13B parameter model was the open-source leader through most of 2025, with variants for I2V, avatars, and customized generation.

Tencent's 13B parameter model was the open-source video generation leader through most of 2025. HunyuanVideo-I2V uses a token replace technique with a pre-trained MLLM to incorporate reference image information. HunyuanVideo-1.5, released November 2025, improved efficiency. HunyuanCustom enables multimodal-driven customized video generation.

Specs:
Parameters: 13B
I2V: Yes (token replace technique)
License: Open source
Hardware: 60GB+ VRAM (720p)
Variants: Base, I2V, 1.5, Avatar, Custom

CogVideoX (Tsinghua/Zhipu AI)

CogVideoX runs on a 12GB GPU, generating 6-10 second clips at 720x480 with text-to-video, image-to-video, and video-to-video modes.

CogVideoX uses a 3D causal VAE that reduces sequence length and prevents flickering. The adaptive LayerNorm transformer improves text-video alignment. Available in 2B (Apache 2.0) and 5B (research license) variants with native Diffusers integration.

Clips are 6-10 seconds at 720x480. Short, but the quality-to-compute ratio is good and it runs on a 12GB GPU.

Specs:
Parameters: 2B, 5B
I2V: Yes (CogVideoXImageToVideoPipeline)
Resolution: 720x480 at 8fps
License: Apache 2.0 (2B), Research (5B)
Hardware: 12GB VRAM

First-Frame vs. True Reference: The Key Distinction

Not all "reference image" support is the same. Understanding the difference is critical for choosing the right model.

First-frame models (LongCat, Helios, Hailuo, Luma Ray2, HunyuanVideo) treat your image as the literal opening frame. The model animates forward from that exact visual. You can't upload a character headshot and describe them in a different scene. The image is the scene.

True reference models (Kling, Grok Imagine, Seedance, SkyReels V3) extract identity from your image and place that character/object into any scene you describe. Upload a photo of a person, then prompt "that person walks through a forest at sunset." The character appears in a completely new environment while maintaining their identity. This is what you need for multi-scene narrative content like an adventure movie.

Character ID models (Sora 2 Pro) extract identity from video clips rather than static images. You create a persistent character ID once and reuse it across unlimited future generations.

Style/ingredient models (Veo 3.1) use reference images to influence visual style, textures, and overall look rather than extracting specific character identities. Good for maintaining visual consistency across a project, less precise for individual character control.

The Real Workflow for 10-Minute Videos

Here's the honest take on where things stand in March 2026. No single model reliably generates 10 minutes of consistent, high-quality video in one shot. LongCat Video gets closest with claims of 15 minutes, but quality and coherence vary significantly at those lengths. Helios and SkyReels V2 generate "minute-scale" and "infinite-length" video respectively, but the outputs need careful prompting and often multiple attempts.

The workflow that actually works for most creators building 5-15 minute videos combines multiple approaches:

For talking head / avatar content: LongCat Video's 2026 audio-driven mode or SkyReels V3's avatar generation can produce 5+ minutes of a consistent talking character. This is the closest thing to "press a button, get long video."

For narrative content with multiple scenes (adventure movie style): Use Kling 3.0, Grok Imagine, or Seedance 2.0 with true character reference images. Generate individual shots of 10-15 seconds each. Use the same @Element or <IMAGE> references across every generation to maintain character identity. Chain shots together using multi-shot mode (Kling supports 6 shots per call) or the extend API. Kling is the most battle-tested for this workflow. Grok Imagine's explicit separation between "reference mode" and "first-frame mode" makes it a strong alternative. Seedance 2.0 accepts the most reference inputs (12 files) but is newer and less proven.

For character consistency across many clips: Sora 2 Pro's persistent character_id system is the cleanest approach for very long projects. Extract the character once from a short video, then generate dozens of clips referencing that ID. The character identity doesn't degrade over time because it's stored as a persistent embedding, not re-interpreted from an image each time.

For style-transferred content: Lucy Restyle on fal.ai processes existing video up to 30 minutes, applying AI style transformations while preserving motion. If you have source footage, this sidesteps the generation length problem entirely. $0.01 per second of source video.

For open-source pipelines: Build on Wan 2.1 or Helios with a video continuation loop. Generate a clip, use the last frame as the start frame for the next clip, repeat. ComfyUI workflows automate this. Consistency degrades over many iterations but it's free and controllable.
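The continuation loop just described can be sketched as follows. generate_clip() and last_frame() are stand-ins for the real model call and a frame-extraction step (e.g. via ffmpeg); both are illustrative, not part of any library:

```python
# Minimal continuation loop for a self-hosted pipeline (Wan 2.1, Helios,
# etc.). Each clip is seeded by the previous clip's final frame, which
# is what a ComfyUI workflow automates. Both helpers are placeholders.
def generate_clip(prompt, start_frame=None):
    """Stand-in for an I2V generation seeded by start_frame (or T2V if None)."""
    return f"clip({prompt},{start_frame})"

def last_frame(clip):
    """Stand-in for extracting the final frame of a rendered clip."""
    return f"last({clip})"

def continuation_loop(prompts):
    """Generate each clip from the previous clip's final frame.
    Consistency drifts over many iterations, as noted above."""
    clips, frame = [], None
    for prompt in prompts:
        clip = generate_clip(prompt, frame)
        clips.append(clip)
        frame = last_frame(clip)  # becomes the next clip's start frame
    return clips

clips = continuation_loop(["shot 1", "shot 2", "shot 3"])
```

Because each iteration only sees the single last frame, errors accumulate: any drift in clip N is baked into the seed for clip N+1, which is why consistency degrades over long chains.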

The core challenge remains: even with true reference image support, character drift compounds across dozens of clips. Facial features, hair, clothing, and skin tone gradually shift. The workarounds (high-quality reference photos, consistent prompting, shot batching) are necessary. But models like Kling and Grok Imagine that separate character identity from scene composition make this dramatically easier than the first-frame-only models.

Comparison Table

Model | Max Native Duration | Extended Duration | Reference Type | Max Refs | Resolution | API Available | Open Source
LongCat Video | ~15 min | N/A | First-frame only | 1 | 720p/30fps | Yes (fal.ai) | Yes (MIT)
Seaweed APT2 | ~5 min | N/A | I2V + pose | 1 | 720p | No | No
Helios | Minute-scale | N/A | First-frame (I2V) | 1 | 720p | HF Spaces | Yes (Apache 2.0)
SkyReels V3 | Unlimited | N/A | True reference | 1-4 | 720p | No | Yes
Kling 3.0 | 15s | ~3 min | Elements + style refs | 7 | 1080p | Yes | No
Grok Imagine | 15s | Chain-able | True reference | 7 | 720p | Yes | No
Seedance 2.0 | 15s | N/A | Multi-modal refs | 12 | 2K | Yes | No
Runway Gen-4.5 | 10s | ~1 min | I2V (0-1) | 1 | 1080p | Yes | No
Veo 3.1 | 8s | ~2.5 min | Ingredients (style) | 3 | 4K | Yes | No
Sora 2 Pro | 20s | Chain-able | Character ID (video) | 2 | 1080p | Yes | No
Hailuo 02 | 10s | N/A | I2V (first-frame) | 1 | 1080p | Yes | No
Luma Ray2 | 10s | 30s | First-frame | 1 | 4K | Yes | No
Pika 2.5 | 10s | 25s | Pikascenes | 10 | 1080p | Yes | No
Wan 2.1 | Short clips | Via continuation | I2V / FLF2V | 1-2 | 720p | Via fal.ai | Yes (Apache 2.0)
HunyuanVideo | Short clips | Via continuation | I2V (first-frame) | 1 | 720p | Via fal.ai | Yes
CogVideoX | 6-10s | Via continuation | I2V (first-frame) | 1 | 720x480 | Via fal.ai | Yes

What's Coming

The trajectory through 2026 is clear. LongCat Video proved that minute-scale generation with consistency is possible in an open model. Helios showed it can happen in real-time. Seaweed APT2 demonstrated interactive long-form generation. And the true-reference models (Kling, Grok, Seedance) proved that character identity can persist across arbitrary scenes.

The next step is combining these capabilities: native long-form generation with true character reference support. Right now you pick one or the other. When a model can generate 5 minutes of video while maintaining characters from reference images across dozens of scene changes, the chained-clips workflow becomes obsolete.

For now, the practical answer depends on your use case:

Best for multi-character reference: Kling 3.0 (up to 7 refs with separate element + style system) or Seedance 2.0 (up to 12 multimodal inputs).

Best API for reference-to-video: Grok Imagine (clean API, explicit reference mode, $0.05/sec) or Kling via fal.ai ($0.084-0.112/sec).

Best for persistent characters across many clips: Sora 2 Pro (character ID system, no drift over time).

Best open source: SkyReels V3 (1-4 true reference images, unlimited length) or Helios (real-time, Apache 2.0).

Best for raw duration: LongCat Video (~15 min, but first-frame only).

