Flagship model from Alibaba Taotian Lab with 4 generation modes (T2V/I2V/R2V/Edit), up to 9 character references, and native joint audio+video in a single pass.
Product shots, brand storytelling, consistent product series via R2V with up to 9 references.
Cinematic spots with native lip-sync, measured camera movement, and atmospheric audio in a single pass.
Cinematic quality with high visual fidelity and temporal stability. For short films and brand storytelling.
Generate from text prompt. All 5 aspect ratios (16:9, 9:16, 1:1, 4:3, 3:4).
Animate a single image with native audio. Aspect ratio is taken from the image.
Up to 9 character references. Use character1..N in the prompt for precise identification. Ideal for advertising series.
Edit existing video by prompt: style transfer, character swap. Original camera tracking is preserved.
Budget-friendly premium
Maximum quality for final production
Pricing is identical for all 4 modes. Range: from 75 cr (3s 720p) to 645 cr (15s 1080p).
HappyHorse 1.0 tolerates dense, detailed prompts, but the density must do real work. Clear, structured delivery works better than long flowing prose, and concrete details beat generic words.
The word "photorealistic" engages photo mode. The strongest lever is to NAME IMPERFECTIONS: pores, fine lines, fabric wear, available light, slight motion blur. Anti-cues against AI-look: "no glamorization", "no heavy retouching".
Specify MEASURED motion: "slight breathing", "no more than 3% push-in", "subtle eye blink only". This keeps the frame away from typical AI-drift and chaotic motion.
Concrete parameters: lens (50mm, 35mm), DOF (shallow / deep), motion blur, frame rate feel (24fps cinematic, 60fps sports). The model interprets lens specs loosely — they're a look-cue, not physical simulation.
Not just "background" — describe ambient, foley, musical mood, dialogue in quotes with tone. Lip-sync works in many languages — do NOT translate dialogue, keep it in the user's original language.
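The audio guidance above can be sketched as a small helper that assembles the audio part of a prompt. This is only an illustration: the `DialogueLine` class and `build_audio_block` function are hypothetical names, and the `[Speaker, tone]: "text"` layout is assumed from the example prompt on this page rather than from any documented syntax.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    speaker: str  # e.g. "Woman"
    tone: str     # e.g. "soft thoughtful voice"
    text: str     # kept in the speaker's original language, never translated


def build_audio_block(ambient: list[str], foley: list[str],
                      dialogue: list[DialogueLine], music: str = "") -> str:
    """Join ambient, foley, music, and dialogue cues into one prompt fragment."""
    cues = [*ambient, *foley]
    if music:
        cues.append(music)
    parts = []
    if cues:
        parts.append("Audio: " + ", ".join(cues) + ".")
    for line in dialogue:
        # Dialogue goes in quotes, prefixed by the speaker and the tone of voice.
        parts.append(f'[{line.speaker}, {line.tone}]: "{line.text}"')
    return " ".join(parts)


audio = build_audio_block(
    ambient=["distant street traffic", "faint cafe chatter"],
    foley=["ceramic mug placed gently"],
    dialogue=[DialogueLine("Woman", "soft thoughtful voice", "Я давно тут не была.")],
)
print(audio)
```

Keeping dialogue as structured data makes it harder to translate or paraphrase the spoken line by accident while the rest of the prompt is edited.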
A real T2V generation example using the 6-block structure. The prompt is shown first, followed by notes on the video HappyHorse 1.0 generated from it.
A sunlit Tokyo cafe in late morning, eight seconds. A young woman in her late twenties, wearing a beige linen shirt, sitting at a window seat, hands wrapped around a ceramic mug. She slowly turns her head toward the window — no more than 5% camera push-in. Low-angle slow orbit, 180-degree arc, eye height ~30 cm above the ground. Soft window light from camera-left, shallow depth of field, 50mm lens, photorealistic, natural skin texture with visible pores, no glamorization. Audio: distant street traffic, ceramic mug placed gently, faint cafe chatter. [Woman, soft thoughtful voice]: "Я давно тут не была."
Photorealistic mode, measured motion (5% push-in), low-angle slow orbit, native lip-sync in Russian.
Full freedom. Use all 6 blocks. Aspect ratio is chosen separately from 5 options (16:9, 9:16, 1:1, 4:3, 3:4).
Do NOT redescribe the subject's appearance from the uploaded image. Describe MOTION, ACTION, scene evolution. The camera can come alive (slight push-in, subtle parallax). Aspect ratio is taken from the image automatically.
In the prompt, use character1, character2, etc. to reference characters (order = upload order of references). Do NOT describe appearance in detail; the model takes it from the references.
Example: "character1 jogs through a sunlit forest. character2 floats playfully behind her like a small comet leaving a luminous trail."
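Because characterN tokens are tied to upload order, it can help to check a prompt against the number of uploaded references before submitting. The sketch below is a hypothetical pre-flight check, not part of any HappyHorse tooling; the function name and the wording of the messages are assumptions.

```python
import re

MAX_REFERENCES = 9  # upper limit stated in the text above


def check_character_refs(prompt: str, num_references: int) -> list[str]:
    """Return a list of problems; an empty list means the tokens match the uploads."""
    if not 1 <= num_references <= MAX_REFERENCES:
        return [f"expected 1-{MAX_REFERENCES} references, got {num_references}"]
    # Collect every N that appears as a whole-word 'characterN' token.
    used = {int(n) for n in re.findall(r"\bcharacter(\d+)\b", prompt)}
    problems = []
    for n in sorted(used):
        if n < 1 or n > num_references:
            problems.append(f"character{n} has no uploaded reference")
    for n in range(1, num_references + 1):
        if n not in used:
            problems.append(f"reference {n} is never mentioned as character{n}")
    return problems


prompt = ("character1 jogs through a sunlit forest. character2 floats "
          "playfully behind her like a small comet leaving a luminous trail.")
print(check_character_refs(prompt, num_references=2))  # → []
print(check_character_refs(prompt, num_references=1))
```

The second call reports that character2 points at a reference that was never uploaded.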
PRESERVE the source video's camera tracking and motion. Describe WHAT TO REPLACE (style transfer, character swap, environment change). The audio setting is either "auto" (generate new audio) or "origin" (keep the original audio).
Example: "Replace the teenager on the skateboard with SpongeBob SquarePants in a 3D realistic style. Keep the original camera tracking and park lighting exactly the same."
HappyHorse 1.0 is the 2026 flagship video generation model from Alibaba Taotian Future Life Lab. It is a 15-billion-parameter unified Transformer that generates video and audio jointly in a single forward pass. It offers 4 modes (T2V/I2V/R2V/Edit), up to 9 character references, and clips of 3-15 seconds at up to 1080p.
Key differentiators of HappyHorse 1.0: joint audio+video generation in a single pass (native audio-video synchronization); 4 generation modes, including R2V with up to 9 character references and Video-Edit; lip-sync support in multiple languages; and per-second pricing for clips of 3 to 15 seconds. A strong choice for advertising, e-commerce, and cinematic narratives.
Four generation modes: Text-to-Video (from text), Image-to-Video (animate an image), Reference-to-Video (1-9 character references via character1..N), and Video-Edit (edit existing video: style transfer, character swap). All modes support 3-15 second clips at 720p or 1080p, with native joint audio+video.
Upload 1-9 character reference images. In the prompt, use character1, character2, ... to reference them (order = upload order). The model preserves each character's identity across the scene. Ideal for multi-character commercial spots and consistent series.
Per-second pricing: 25 cr/sec for 720p, 43 cr/sec for 1080p. Identical pricing across all 4 modes.
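The per-second rates translate directly into a cost estimate. The sketch below assumes the price is simply duration times rate, with no rounding, discounts, or minimum charge; `estimate_cost` is an illustrative name, not an official API.

```python
# Rates quoted above, in credits per second of generated video.
RATES_CR_PER_SEC = {"720p": 25, "1080p": 43}


def estimate_cost(duration_s: int, resolution: str) -> int:
    """Estimate the credit cost of one generation (same for all 4 modes)."""
    if resolution not in RATES_CR_PER_SEC:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not isinstance(duration_s, int) or not 3 <= duration_s <= 15:
        raise ValueError("duration must be an integer from 3 to 15 seconds")
    return duration_s * RATES_CR_PER_SEC[resolution]


print(estimate_cost(3, "720p"))    # → 75
print(estimate_cost(8, "1080p"))   # → 344
print(estimate_cost(15, "1080p"))  # → 645
```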
Yes, HappyHorse 1.0 is well-suited for advertising, e-commerce (product shots, brand storytelling), and cinematic projects. Native joint audio+video generation keeps dialogue and foley in sync with the picture. Reference-to-Video with up to 9 references is ideal for consistent product series.
Use a 6-block structure: 1) Scene and time, 2) Subject, 3) Action and motion (with a measured "movement budget", e.g., "no more than 5% push-in"), 4) Camera language (lens, DOF, shot size), 5) Light and texture, 6) Audio (ambient, foley, dialogue in quotes). The model tolerates dense, detailed prompts, but the density must do real work. Use "photorealistic" plus named imperfections (pores, fabric wear) for the most convincing photoreal results.
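One way to keep the six blocks explicit is to hold them as separate fields and join them only at the end. This is a minimal sketch; `PromptBlocks` and `assemble_prompt` are hypothetical names, and the model itself receives only the final string.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptBlocks:
    scene: str    # 1) scene and time
    subject: str  # 2) subject
    action: str   # 3) action and motion, with a movement budget
    camera: str   # 4) camera language: lens, DOF, shot size
    light: str    # 5) light and texture
    audio: str    # 6) ambient, foley, dialogue in quotes


def assemble_prompt(blocks: PromptBlocks) -> str:
    """Join the six blocks in order, refusing to build a prompt with a gap."""
    parts = []
    for f in fields(blocks):  # fields() preserves declaration order
        text = getattr(blocks, f.name).strip()
        if not text:
            raise ValueError(f"block '{f.name}' is empty")
        if not text.endswith((".", "!", "?", '"')):
            text += "."
        parts.append(text)
    return " ".join(parts)


prompt = assemble_prompt(PromptBlocks(
    scene="A sunlit Tokyo cafe in late morning, eight seconds",
    subject="A young woman in a beige linen shirt at a window seat",
    action="She slowly turns her head toward the window, no more than 5% push-in",
    camera="50mm lens, shallow depth of field, medium close-up",
    light="Soft window light from camera-left, photorealistic, visible pores",
    audio='Audio: faint cafe chatter. [Woman, soft voice]: "Я давно тут не была."',
))
print(prompt)
```

For I2V the subject block would describe motion rather than appearance, since the look comes from the uploaded image.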
Duration 3-15 seconds (any integer). Resolution 720p or 1080p. Aspect ratios 16:9, 9:16, 1:1, 4:3, 3:4 (for T2V and R2V; in I2V it's taken from the image, in Edit from the source video). Lip-sync works in multiple languages.