Multimodal model by ByteDance for cinematic video generation with native stereo audio. #1 on Artificial Analysis for I2V+Audio quality. Generates up to 15 seconds at 2K@60FPS, with multi-reference input from up to 9 photos + 3 videos + 3 audio clips.
Storyboard mode: you describe an idea, an AI director breaks it into shots, an image model (Nano Banana 2 / GPT Image 2) draws a single grid image with all the frames, and Seedance 2.0 animates it into a coherent multi-shot scene. You can upload subject references (e.g., photos of the planes) so the model knows what they look like.

Generation in up to 2K resolution at 60 frames per second — cinematic quality without upscaling
Duration from 4 to 15 seconds in 1-second increments. Multi-shot scene support
Native stereo audio with sound effects, background music, and synchronized dialogue
Up to 9 photos + 3 videos + 3 audio clips: the model extracts style, motion, and sound from all sources
Standard — maximum quality up to 1080p. Fast — faster and cheaper, up to 720p
Enhanced character preservation during full 360-degree camera rotation
Cost = rate per second × duration. The rate depends on mode (Standard/Fast), resolution, and whether a video reference is used.
Choose Standard for maximum quality (up to 1080p) or Fast for speed (up to 720p). Then pick a resolution and a duration from 4 to 15 seconds.
Upload up to 9 images for style/characters, up to 3 videos for motion, and up to 3 audio clips for sync. Or use text only.
Enter a description in any language. Use AI prompt enhancement to improve its structure, then translate it to English.
Press 'Generate' and get your result. Video with native stereo audio is generated in 2-5 minutes.
Seedance 2.0 is a multimodal model by ByteDance for generating cinematic video with native stereo audio. #1 on Artificial Analysis for I2V+Audio quality. Supports clips of up to 15 seconds at 2K resolution and 60 FPS.
Seedance 2.0 is built on a completely new architecture. Compared with its predecessor, it offers 2K@60FPS output (instead of 1080p), clips of up to 15 seconds (instead of 12), multi-reference input (up to 9 photos + 3 videos + 3 audio clips), enhanced identity preservation during 360° rotations, text overlays, and two modes: Standard and Fast.
Standard: 480p, 720p, 1080p. Fast: 480p, 720p. Duration: from 4 to 15 seconds in 1-second increments.
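The limits above can be expressed as a small pre-flight check. This is only an illustrative sketch: the function name `validate_settings` and the lowercase mode keys are assumptions made for the example and are not part of any Seedance API.

```python
# Resolutions per mode and the duration range, as listed above.
RESOLUTIONS = {
    "standard": ("480p", "720p", "1080p"),
    "fast": ("480p", "720p"),
}
MIN_SECONDS, MAX_SECONDS = 4, 15


def validate_settings(mode: str, resolution: str, seconds: int) -> None:
    """Raise ValueError if the combination falls outside the listed limits.

    Hypothetical helper for illustration only.
    """
    key = mode.lower()
    if key not in RESOLUTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    if resolution not in RESOLUTIONS[key]:
        raise ValueError(f"{mode} does not offer {resolution}")
    # Durations move in 1-second increments, so only whole numbers are valid.
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise ValueError("duration must be a whole number of seconds")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds"
        )


validate_settings("Standard", "1080p", 15)  # passes silently
```

Calling `validate_settings("Fast", "1080p", 10)` raises `ValueError`, because Fast is limited to 720p.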
Upload up to 9 images for style/characters, up to 3 videos for copying motion/camera, and up to 3 audio clips for synchronization. The model automatically extracts key features and combines them with the text prompt.
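As a rough sketch, the reference limits above could be checked before uploading. The helper `check_references` and its argument names are hypothetical and chosen for this example; only the limits (9 images, 3 videos, 3 audio clips) come from the text.

```python
# Per-type reference limits, as stated above.
MAX_REFERENCES = {"image": 9, "video": 3, "audio": 3}


def check_references(images=(), videos=(), audios=()) -> dict:
    """Return the reference types that exceed their limit, with their counts.

    An empty dict means the selection fits. Hypothetical helper, not a real API.
    """
    counts = {"image": len(images), "video": len(videos), "audio": len(audios)}
    return {kind: n for kind, n in counts.items() if n > MAX_REFERENCES[kind]}


# Text-only generation uses no references, which is always within limits.
assert check_references() == {}
```

For example, passing four video files returns `{"video": 4}`, which signals that one video has to be dropped.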
Standard generates at maximum quality 2K@60FPS with 1080p support. Fast is faster and cheaper but limited to 720p. Both support all input modes and native audio.
Price = rate per second × duration. The rate depends on mode (Standard/Fast), resolution, and whether a video reference is used. With a video reference the rate is lower, but calculated from the sum of input and output video durations.
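The pricing rule can be illustrated with a short calculation. The function `estimate_cost` is a hypothetical helper, and the rates used in the example (10 and 6 credits per second) are made-up placeholders, not real Seedance prices; actual rates depend on mode, resolution, and whether a video reference is used.

```python
def estimate_cost(rate_per_second, output_seconds, reference_video_seconds=0):
    """Estimate the price as rate per second x billed duration.

    Without a video reference, the billed duration is the output duration.
    With one, it is the sum of the input (reference) and output durations,
    and the caller is expected to pass the lower rate that applies.
    """
    billed_seconds = output_seconds + reference_video_seconds
    return rate_per_second * billed_seconds


# Placeholder rates, for illustration only.
no_reference = estimate_cost(rate_per_second=10, output_seconds=10)
with_reference = estimate_cost(
    rate_per_second=6, output_seconds=10, reference_video_seconds=5
)
print(no_reference, with_reference)  # 100 90
```

With these placeholder numbers, the lower rate outweighs the extra 5 billed seconds. A longer reference video could tip the balance the other way, so it helps to compare both totals.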
English, Chinese (Mandarin + dialects), Japanese, Korean, Spanish, Indonesian, Portuguese, and more. Multi-character dialogue with unique voices and accurate lip-sync.
Seedance 2.0 is the best choice for: professional ads, cinematic content, video with native audio and lip-sync, complex scenes with character preservation, multi-reference generation from multiple sources.
2K@60FPS, stereo audio, multi-reference, up to 15 seconds