Multi-mode model by Alibaba

HappyHorse 1.0 — AI video with 4 modes and native audio

Flagship model from Alibaba Taotian Lab with 4 generation modes (T2V/I2V/R2V/Edit), up to 9 character references, and native joint audio+video in a single pass.

up to 9 R2V references
4 generation modes
3-15 s, up to 1080p
Native joint audio+video

Ideal for premium tasks

E-commerce

Product shots, brand storytelling, consistent product series via R2V with up to 9 references.

Premium advertising

Cinematic spots with native lip-sync, measured camera movement, and atmospheric audio in a single pass.

Cinematic narrative

Cinematic quality with high visual fidelity and temporal stability. For short films and brand storytelling.

4 generation modes

Text-to-Video (T2V)

Generate from a text prompt. All 5 aspect ratios are available (16:9, 9:16, 1:1, 4:3, 3:4).

Image-to-Video (I2V)

Animate a single image with native audio. Aspect ratio is taken from the image.

Reference-to-Video (R2V)

Up to 9 character references. Use character1..N in the prompt for precise identification. Ideal for advertising series.

Video-Edit

Edit existing video by prompt: style transfer, character swap. Original camera tracking is preserved.

Pricing

720p

25 cr/sec

Budget-friendly premium

1080p

43 cr/sec

Maximum quality for final production

Pricing is identical for all 4 modes. Range: from — cr (3s 720p) to — cr (15s 1080p).
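The per-second rates above make the cost of a clip easy to estimate. The following is a minimal sketch, assuming the cost is simply rate times duration with no rounding or surcharges (the page does not say otherwise). The function name `clip_cost` is illustrative and not part of any HappyHorse API.

```python
# Per-second rates in credits, as listed in the Pricing section.
RATES_CR_PER_SEC = {"720p": 25, "1080p": 43}


def clip_cost(duration_sec: int, resolution: str) -> int:
    """Estimate the credit cost of one clip (assumes cost = rate * seconds)."""
    if resolution not in RATES_CR_PER_SEC:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not isinstance(duration_sec, int) or not 3 <= duration_sec <= 15:
        raise ValueError("duration must be an integer from 3 to 15 seconds")
    return RATES_CR_PER_SEC[resolution] * duration_sec


print(clip_cost(8, "1080p"))  # 8 seconds at 43 cr/sec
```

Under that assumption, an 8-second 1080p clip comes to 344 cr. Because pricing is the same in all 4 modes, the mode does not need to be a parameter.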

How to write prompts for HappyHorse 1.0

HappyHorse 1.0 tolerates dense, detailed prompts, but the density must do real work. Clear, structured delivery works better than long flowing prose, and concrete details beat generic words.

6-block prompt structure

  1. Scene and time: WHERE and WHEN. Example: "a sunlit Tokyo cafe in late morning, eight seconds".
  2. Subject: who or what is in the frame (scale, pose, gaze, action).
  3. Action and motion: what moves and by how much. Example: "slowly turns her head toward the window, no more than 5% camera push-in".
  4. Camera language: movement, angle, shot size. Example: "low-angle slow orbit, 180-degree arc, eye height ~30 cm above the ground".
  5. Light and texture: direction, quality, time of day. Example: "soft window light from camera-left, shallow depth of field, 50mm lens".
  6. Audio: voice-over (in quotes), dialogue, foley (named sounds), ambient bed.
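One way to keep the six blocks separate while drafting is to store each one in its own field and join them only at the end. The sketch below is an illustration, not an official tool: the `PromptBlocks` class and its field names are hypothetical, and joining the blocks as consecutive sentences is an assumption about formatting rather than a documented requirement.

```python
from dataclasses import dataclass


@dataclass
class PromptBlocks:
    """The six blocks, in the order the guide lists them."""
    scene: str
    subject: str
    action: str
    camera: str
    light: str
    audio: str

    def render(self) -> str:
        # Join the non-empty blocks as separate sentences, preserving block order.
        parts = [self.scene, self.subject, self.action,
                 self.camera, self.light, self.audio]
        sentences = [p.strip().rstrip(".") + "." for p in parts if p.strip()]
        return " ".join(sentences)


blocks = PromptBlocks(
    scene="A sunlit Tokyo cafe in late morning, eight seconds",
    subject="A young woman at a window seat, hands wrapped around a ceramic mug",
    action="She slowly turns her head toward the window, no more than 5% camera push-in",
    camera="Low-angle slow orbit, 180-degree arc",
    light="Soft window light from camera-left, shallow depth of field, 50mm lens",
    audio="Audio: distant street traffic, faint cafe chatter",
)
print(blocks.render())
```

Keeping the blocks as separate fields makes it easy to spot an empty one, such as a prompt with no audio description, before generating.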

Key techniques

Photorealistic mode

The word "photorealistic" engages photo mode. The strongest lever is to NAME IMPERFECTIONS: pores, fine lines, fabric wear, available light, slight motion blur. Anti-cues against AI-look: "no glamorization", "no heavy retouching".

Movement budget

Specify MEASURED motion: "slight breathing", "no more than 3% push-in", "subtle eye blink only". This keeps the frame away from typical AI-drift and chaotic motion.

Camera as a director's direction

Concrete parameters: lens (50mm, 35mm), DOF (shallow / deep), motion blur, frame rate feel (24fps cinematic, 60fps sports). The model interprets lens specs loosely — they're a look-cue, not physical simulation.

Audio as a full layer

Not just "background" — describe ambient, foley, musical mood, dialogue in quotes with tone. Lip-sync works in many languages — do NOT translate dialogue, keep it in the user's original language.

Example: prompt → result

A real T2V generation example using the 6-block structure: the prompt, followed by a summary of the video HappyHorse 1.0 generated from it.

Prompt (T2V, 1080p, 16:9, 8 sec)
A sunlit Tokyo cafe in late morning, eight seconds. A young woman in her late twenties, wearing a beige linen shirt, sitting at a window seat, hands wrapped around a ceramic mug. She slowly turns her head toward the window — no more than 5% camera push-in. Low-angle slow orbit, 180-degree arc, eye height ~30 cm above the ground. Soft window light from camera-left, shallow depth of field, 50mm lens, photorealistic, natural skin texture with visible pores, no glamorization. Audio: distant street traffic, ceramic mug placed gently, faint cafe chatter. [Woman, soft thoughtful voice]: "Я давно тут не была."

Tokyo cafe — a thoughtful moment

Photorealistic mode, measured motion (5% push-in), low-angle slow orbit, native lip-sync in Russian.

photorealistic, native audio, orbit shot

Tips per mode

Text-to-Video

Full freedom. Use all 6 blocks. Aspect ratio is chosen separately from 5 options (16:9, 9:16, 1:1, 4:3, 3:4).

Image-to-Video

Do NOT redescribe the subject's appearance from the uploaded image. Describe MOTION, ACTION, scene evolution. The camera can come alive (slight push-in, subtle parallax). Aspect ratio is taken from the image automatically.

Reference-to-Video

In the prompt, use character1, character2, etc. to reference characters (order = upload order of references). Do NOT describe appearance in detail, because the model takes it from the references.

Example: "character1 jogs through a sunlit forest. character2 floats playfully behind her like a small comet leaving a luminous trail."
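Because the characterN tokens are matched to references by upload order, a mismatch between the prompt and the number of uploaded images is easy to introduce. The following is a minimal sketch of a pre-flight check, assuming tokens are written exactly as `character` followed by a number. The function `check_character_tokens` is hypothetical, and treating an unmentioned reference as a problem is a choice made for this sketch, not documented model behavior.

```python
import re

MAX_REFERENCES = 9  # limit stated for R2V


def check_character_tokens(prompt: str, num_references: int) -> list[str]:
    """Return a list of problems; an empty list means the tokens look consistent."""
    if not 1 <= num_references <= MAX_REFERENCES:
        return [f"expected 1-{MAX_REFERENCES} references, got {num_references}"]
    problems = []
    # Collect every N that appears as a whole-word "characterN" token.
    used = {int(n) for n in re.findall(r"\bcharacter(\d+)\b", prompt)}
    for n in sorted(used):
        if not 1 <= n <= num_references:
            problems.append(f"character{n} has no matching reference")
    for n in range(1, num_references + 1):
        if n not in used:
            problems.append(f"reference {n} is never mentioned as character{n}")
    return problems


prompt = ("character1 jogs through a sunlit forest. character2 floats playfully "
          "behind her like a small comet leaving a luminous trail.")
print(check_character_tokens(prompt, 2))
```

With two uploaded references the example prompt passes with an empty list. With only one, the check reports that character2 has no matching reference.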

Video-Edit

PRESERVE the source video's camera tracking and motion. Describe WHAT TO REPLACE (style transfer, character swap, environment change). Set the audio option to "auto" to generate new audio or to "origin" to keep the original track.

Example: "Replace the teenager on the skateboard with SpongeBob SquarePants in a 3D realistic style. Keep the original camera tracking and park lighting exactly the same."

What NOT to do

  • Vague epithets ("beautiful scene", "cinematic look") without specifics — the model ignores them.
  • Describing 5+ actions in a short 3-5s clip — better to use one measured action.
  • Translating dialogue to English — keep it in the original language for proper lip-sync.
  • Describing character1..N appearance in detail in R2V — it's already set by the references.
  • Exceeding the 5000-character limit — the model will truncate the tail.
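Two of the items above can be checked mechanically before a prompt is submitted. The following is a rough sketch, assuming the 5000-character limit counts characters the way Python's `len` does (the page does not specify) and using only the two vague phrases quoted in the list. The function `lint_prompt` is hypothetical.

```python
MAX_PROMPT_CHARS = 5000  # limit named in the list above
VAGUE_PHRASES = ("beautiful scene", "cinematic look")  # the two examples from the list


def lint_prompt(prompt: str) -> list[str]:
    """Return human-readable warnings for a draft prompt."""
    warnings = []
    if len(prompt) > MAX_PROMPT_CHARS:
        over = len(prompt) - MAX_PROMPT_CHARS
        warnings.append(f"prompt is {over} characters over the {MAX_PROMPT_CHARS} limit")
    lowered = prompt.lower()
    for phrase in VAGUE_PHRASES:
        if phrase in lowered:
            warnings.append(f"vague phrase {phrase!r}: replace with concrete detail")
    return warnings


print(lint_prompt("A beautiful scene with a cinematic look."))
```

The phrase list is only a starting point. The other items, such as too many actions in a short clip or translated dialogue, need a human read.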

FAQ

What is HappyHorse 1.0?

HappyHorse 1.0 is the 2026 flagship video generation model from Alibaba Taotian Future Life Lab. It is a 15-billion-parameter unified Transformer that generates joint video and audio in a single forward pass. It offers 4 modes (T2V/I2V/R2V/Edit), up to 9 character references, and 3-15 second clips at up to 1080p.

What makes HappyHorse 1.0 different from other models?

Key differentiators of HappyHorse 1.0: joint audio+video in a single pass (native audio-video synchronization) and 4 generation modes, including R2V with up to 9 character references and Video-Edit. It also supports lip-sync in multiple languages and per-second pricing for clips from 3 to 15 seconds. It is a strong choice for advertising, E-commerce, and cinematic narratives.

Which modes does HappyHorse 1.0 support?

Four generation modes: Text-to-Video (from text), Image-to-Video (animate an image), Reference-to-Video (1-9 character references via character1..N), and Video-Edit (edit an existing video: style transfer, character swap). All modes produce 3-15 second clips at 720p or 1080p with native joint audio+video.

How does Reference-to-Video with 9 characters work?

Upload 1-9 character reference images. In the prompt, use character1, character2, ... to reference them (order = upload order). The model preserves each character's identity across the scene. Ideal for multi-character commercial spots and consistent series.

How much does generation cost?

Per-second pricing: 25 cr/sec for 720p, 43 cr/sec for 1080p. Identical pricing across all 4 modes.

Is HappyHorse 1.0 suitable for advertising and E-commerce?

Yes, HappyHorse 1.0 is well-suited for advertising, E-commerce (product shots, brand storytelling), and cinematic projects. Native joint audio+video keeps dialogue and foley in sync with the picture. Reference-to-Video with up to 9 references is ideal for consistent product series.

How to write a HappyHorse 1.0 prompt?

Use a 6-block structure: 1) Scene and time, 2) Subject, 3) Action and motion (with a measured "movement budget", e.g., "no more than 5% push-in"), 4) Camera language (lens, DOF, shot size), 5) Light and texture, 6) Audio (ambient, foley, dialogue in quotes). The model tolerates dense, detailed prompts — but the density must do real work. Use "photorealistic" + named imperfections (pores, fabric wear) for the best photoreal mode.

Which formats and durations are supported?

Duration 3-15 seconds (any integer). Resolution 720p or 1080p. Aspect ratios 16:9, 9:16, 1:1, 4:3, 3:4 (for T2V and R2V; in I2V it's taken from the image, in Edit from the source video). Lip-sync works in multiple languages.
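The constraints in this answer can be expressed as a small settings check. The following is a sketch under stated assumptions: the mode labels `t2v`, `i2v`, `r2v`, and `edit` and the function `validate_settings` are hypothetical names chosen for illustration, not identifiers from a HappyHorse API.

```python
from typing import Optional

RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
# Hypothetical mode labels; the aspect ratio is user-selectable only in these two.
ASPECT_CHOICE_MODES = {"t2v", "r2v"}
ALL_MODES = ASPECT_CHOICE_MODES | {"i2v", "edit"}


def validate_settings(mode: str, duration_sec: int, resolution: str,
                      aspect_ratio: Optional[str] = None) -> list[str]:
    """Return error messages; an empty list means the settings fit the FAQ."""
    if mode not in ALL_MODES:
        return [f"unknown mode: {mode!r}"]
    errors = []
    if not isinstance(duration_sec, int) or not 3 <= duration_sec <= 15:
        errors.append("duration must be an integer from 3 to 15 seconds")
    if resolution not in RESOLUTIONS:
        errors.append("resolution must be 720p or 1080p")
    if mode in ASPECT_CHOICE_MODES:
        if aspect_ratio not in ASPECT_RATIOS:
            errors.append("aspect ratio must be one of " + ", ".join(sorted(ASPECT_RATIOS)))
    elif aspect_ratio is not None:
        # I2V takes the ratio from the image, Edit from the source video.
        errors.append("aspect ratio comes from the input in this mode; leave it unset")
    return errors


print(validate_settings("t2v", 8, "1080p", "16:9"))
```

Returning a list instead of raising on the first problem lets a caller show every invalid setting at once.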
