
Finally! Someone’s Taking Charge of AI Video Voice & Performance: GAGA AI Test

Honghao Wang

10 Oct 2025 — 3 min read

Early Test of Sand.ai's New Dialogue Performance Model — GAGA‑1

I had an early opportunity to try the newly released GAGA‑1 model from Cao Yue at Sand.ai (gaga.art) — designed specifically for nuanced human dialogue performance.

It’s probably the strongest model so far when it comes to delivering detailed, subtle acting in conversations. In performance quality, it may even surpass Sora 2.

I went in with casual expectations but was surprised by how capable it is. Below is an introduction and a montage of my early tests.

---

Core Features

Voice and video are generated together: lip sync stays highly accurate even in side view, and sound effects are included.
Facial expressions tightly matched to vocal tone — nuanced, natural, and prompt-compliant.
Dual‑person scene acting — differentiates character voices faithfully.
Multilingual output — can mix languages in a single scene.
Free trial available: image‑to‑video support, clips up to 10 s at 720p, rich detail despite the time cap.

---

Test Scenarios

1. Self Introduction

Prompt:

> Calm smile: “Hi, I’m testing Gaga AI.”

> Then more serious: “What do you think of its performance?”

Observations:

Micro‑expressions (eyebrow raise, camera gaze, nod) are astonishingly real.
Such subtle details aren’t easily specified in prompts — they rely on the model’s own intelligence.

---

2. Lip Sync & Rhythm

Prompt:

> Clear mouth shapes, even rhythm: “Bā bǎi biāo bīng bēn běi pō...”

> (after speaking, a light breath)

Observations:

Excellent audio–facial movement sync.
Animated hand movement kept the scene lively.
Naturally inserted sigh — unprompted but contextually fitting.

---

3. Environment Sound + Speech Sync

Prompt:

> While speaking, slight hand lift: “Do you hear the light clink of the cup rim?”

> (lightly taps cup, subtle clink sound)

> (pause) “It’s just like being there.”

Observations:

Correct ordering of speech and sound effects.
Nuanced facial changes aligned with tone (“slight astonishment + pride”).
Tip: Avoid overly complex hand movements — some instability may occur.

---

4. Multilingual Capability

English Prompt:

Gentle: “At first, I was very optimistic.” (smile, pause) “But the data tells me we need to decide calmly.”

Japanese Prompt:

Polite: 「こんにちは。大事な発表です。落ち着いて、聞いてください。」 (slight nod)

More Examples:

Spanish (Warm & Confident):
“Gracias por venir. La verdad es clara: ahora reimaginamos el cine con IA.”
Chinese–English Mix (Calm):
“结论很简单——we’re ready for production.” (pause) “就现在。”

Observations:

All languages were precise in lip sync and facial cues.
No perceptible difference in performance quality across languages.

---

Emotional Performance Tests

Shame & Guilt

> Avoids eye contact, lowers voice: “I cheated.”

> Tightens jaw, trembling: “I’m sorry, I shouldn’t have done that.”

The initial evasive gaze and lowered voice read as authentic, human-like acting.
Voice shift from weak to strong felt natural.

Desperation & Pleading

> Rain sound. “Don’t go, at least let me finish my words.”

> Louder: “I’ll change, really.”

The model automatically added atmospheric piano background music.
Correct rain sound volume for inside-car perspective.
Trembling voice nuance matched emotional state.

---

Two-Person Performance

Simple Dialogue

A (smiling): “One sentence to summarize GAGA-1?”

B (steady): “Voice, lip-sync, facial expressions — perfectly in sync.”

A (nodding): “Film-grade, straight-to-use?”

B (confident): “Of course.”

Handled side-profile lip sync flawlessly.
Only minor stumble on the English word “GAGA” in a Chinese sentence.

Scene-Based Argument

Male (angry): “Who changed the budget?”

Female (guilty): “I… it was me, but I had no choice.”

Male: “You did have a choice.”

Added dramatic gestures and correct facing during confrontation.
Nuanced inhale before speech was accurate.

---

Conclusion

GAGA‑1 demonstrates:

A long list of strengths and few weaknesses.
Clear focus on voice & acting performance.

---

Key Usage Tips

a. Order prompts logically — describe emotion changes first, then tone/content. Explicitly note pauses if needed.

b. Best with two speakers — specify them as “left/right,” “male/female,” or “A/B.” Performance declines beyond two.

c. Image-to-video — limit visible limbs; avoid complex movement prompts.

d. Match dialogue length to clip length: for lines under 10 characters, use a 5 s clip; for longer dialogue, use 10 s.

e. Currently 16:9 only — vertical 9:16 support coming soon.
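Tips a, b, and d can be sketched as a small prompt helper. This is only an illustration under stated assumptions: `Line`, `build_prompt`, and `pick_clip_seconds` are hypothetical names, not part of any Sand.ai API. The output layout simply mirrors the A/B dialogue format used in the tests above, and treating everything at or above 10 characters as "long" is my reading of tip d (the count is per character, so it is much stricter for English than for Chinese).

```python
from dataclasses import dataclass

MAX_CLIP_SECONDS = 10   # clip cap mentioned for the free trial
SHORT_CLIP_SECONDS = 5  # clip length suggested for very short lines (tip d)
SHORT_TEXT_CHARS = 10   # "under 10 chars" threshold from tip d


@dataclass
class Line:
    """One spoken line in a scene (hypothetical structure for illustration)."""
    speaker: str               # e.g. "A", "B", "left", "right"
    emotion: str               # emotion or expression cue, written first (tip a)
    text: str                  # the spoken dialogue
    pause_after: bool = False  # explicitly mark a pause (tip a)


def build_prompt(lines):
    """Format lines as 'Speaker (emotion): "text"', one per row."""
    speakers = {line.speaker for line in lines}
    if len(speakers) > 2:
        # Tip b: performance reportedly declines beyond two speakers.
        raise ValueError("keep a scene to at most two speakers")
    rows = []
    for line in lines:
        row = f'{line.speaker} ({line.emotion}): "{line.text}"'
        if line.pause_after:
            row += " (pause)"
        rows.append(row)
    return "\n".join(rows)


def pick_clip_seconds(lines):
    """Choose 5 s for very short dialogue, otherwise the 10 s cap (tip d)."""
    total_chars = sum(len(line.text) for line in lines)
    if total_chars < SHORT_TEXT_CHARS:
        return SHORT_CLIP_SECONDS
    return MAX_CLIP_SECONDS


scene = [
    Line("A", "smiling", "One sentence to summarize GAGA-1?", pause_after=True),
    Line("B", "confident", "Of course."),
]
print(build_prompt(scene))
print(pick_clip_seconds(scene))  # 43 characters of dialogue, so 10
```

Keeping the emotion cue ahead of the quoted text follows tip a; the speaker check is just a guard for tip b and can be dropped if you want to experiment with larger casts.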

---

Broader Context in AI Video

Video models are evolving beyond:

Physics and prompt compliance — entering a stage with richer emotional expression and integrated audio–visual sync.
World knowledge integration — models can visually reason, script, and edit internally without external agents.

Models like GAGA‑1 indicate that innovation from China is globally competitive.

---

Bonus for Creators

Platforms like AiToEarn (official site) offer:

Open-source global AI content monetization
Integration of generation → publishing → analytics
Cross-platform publishing: Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, X/Twitter
AI model rankings

Paired with GAGA‑1, you can produce, publish, and monetize multilingual emotionally-rich videos at scale.

---


