dramabox • resemble tts research

DramaBox is the first directable speech engine. Open source and ready for its close-up.

Describe the speaker, the scene, and the delivery, and DramaBox generates the performance. Every expressive output is watermarked for provenance.

built for control

Closing the gap between synthesis and performance

Traditional TTS provides a consistent voice but offers no way to direct the performance. Paralinguistic cues (pauses, breath, emotional shifts) are stripped away, leaving the output robotic. DramaBox restores this control, using your prompt to define the speaker identity, emotional range, and paralinguistic texture of the final audio.

STEP 1 • WRITE

Describe the scene

Write a prompt describing the speaker and the delivery. Dialogue goes inside double quotes. Stage directions — sighs, pauses, a voice cracking — go outside them and are never spoken aloud.
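The quoting convention above can be sketched as a small helper. This is illustrative only — the function name and structure are not part of any DramaBox API, just one way to assemble a prompt string that follows the rules:

```python
def build_prompt(description, beats):
    """Assemble a DramaBox-style prompt string.

    beats is a list of (stage_direction, dialogue) pairs.
    Dialogue is wrapped in double quotes (spoken literally);
    stage directions stay outside the quotes (performed, never spoken).
    """
    parts = [description]
    for direction, dialogue in beats:
        if direction:
            parts.append(direction)
        if dialogue:
            parts.append(f'"{dialogue}"')
    return " ".join(parts)

prompt = build_prompt(
    "A tired detective speaks in a low, gravelly voice.",
    [
        ("He sighs deeply,", "It's always the quiet ones."),
        ("His voice cracks,", "Always."),
    ],
)
```

The resulting string keeps every performance cue outside the quotes, so nothing but the dialogue is voiced.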

STEP 2 • REFERENCE

Clone a target voice

Provide 10+ seconds of reference audio. DramaBox applies that timbre to your prompt’s performance. Skip the reference and the model invents a voice to match the description.
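The 10-second minimum can be checked locally before uploading a clip. A minimal sketch using Python's standard wave module — it assumes a PCM WAV file, and the threshold simply mirrors the guidance above:

```python
import wave

MIN_REFERENCE_SECONDS = 10.0  # per the guidance above

def reference_long_enough(path):
    """Return (duration_seconds, ok) for a PCM WAV reference clip."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration, duration >= MIN_REFERENCE_SECONDS
```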

STEP 3 • GENERATE

Get an expressive output

The model produces 48 kHz stereo audio in approximately 2.5 seconds on a warm H100 server. Every output is watermarked with PerTh at generation.

what dramabox does

Everything a director controls, now in a prompt.

Voice, emotion, delivery, texture. Direction, not dictation.
Speaker identity from description
Describe the speaker: age, affect, accent, register. DramaBox generates a voice to match or applies a reference voice to whatever the prompt asks for.
Directed emotional range
Write the direction. The model follows it. Cold fury, warm enthusiasm, sinister calm, grief.
Paralinguistic texture
Stage directions outside the quotes become performance cues. The model reads them, never speaks them.
PerTh watermarking by default
Every output watermarked at generation. Survives MP3/AAC encoding and common edits at ~100% detection accuracy.
100%
Detection accuracy after MP3/AAC
48 kHz
Stereo output
~2.5s
Per generation on H100
HOW THE MODEL WORKS

A DiT trained to read the room

DramaBox is an IC-LoRA fine-tune of LTX-2.3, the audio-only branch of Lightricks' open video DiT. The design decision that makes it different from standard TTS is the prompt structure. Most TTS models take text and produce speech. DramaBox treats the prompt as the director's script: it takes a scene description and produces a performance. The distinction lives in what the prompt can contain.
Dialogue goes inside double quotes and is spoken literally.
Stage directions go outside the quotes and function as performance cues, never spoken.
Phonetic vocalisations — laughter, gasps, sounds — go inside the quotes as single words, e.g. "Hahaha" not "Ha ha ha".
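Those three rules can be checked mechanically before generation. The linter below is a sketch, not part of DramaBox itself — the word list and warning messages are illustrative. It catches unbalanced quotes and non-phonetic action words like "Sigh" placed inside quotes, which the model would read aloud instead of performing:

```python
import re

# Action words the model would speak literally if quoted;
# per the rules above, these belong outside the quotes.
NON_PHONETIC = {"sigh", "sighs", "gasp", "gasps", "cough", "coughs"}

def lint_prompt(prompt):
    """Return a list of warnings for a DramaBox-style prompt."""
    warnings = []
    if prompt.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")
    for quoted in re.findall(r'"([^"]*)"', prompt):
        for word in quoted.split():
            if word.strip(".,!?").lower() in NON_PHONETIC:
                warnings.append(
                    f"'{word}' inside quotes will be spoken, not performed"
                )
    return warnings
```

For example, `lint_prompt('She sighs, "Sigh, not again."')` flags the quoted "Sigh," while the same direction outside the quotes passes cleanly.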
Listen

Cold fury to a venomous whisper

A single prompt directing a full emotional arc. No post-processing.

The script:
A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
REFERENCE
GENERATED
BUILT ON RESEMBLE AI

DramaBox generates the performance. The stack handles what comes after.

Every DramaBox output is watermarked at generation. Verify it anywhere it travels and detect synthetic audio from sources you didn’t generate.
Resemble Watermarker
Watermark generated audio and verify it later via API.
explore resemble watermarker
Resemble Voice Creation
Clone a voice from 10 seconds of audio or generate one from a text description.
explore voice creation
PerTh Watermarker
The neural watermarker embedded in every DramaBox output. Near 100% recovery across a standard attack suite.
explore the model
Frequently asked questions
What makes DramaBox different from standard TTS?
Standard TTS produces words in a voice. DramaBox produces a performance directed by the prompt. Identity, emotion, pacing, laughter, sighs — all written in, all performed out.
Do I need a voice reference?
No. Without one, DramaBox generates a voice to match the speaker description. With 10+ seconds of reference audio, the model applies that timbre to the performance. The reference is a casting choice, not a requirement.
How do I write a prompt?
Dialogue inside double quotes is spoken literally. Stage directions outside the quotes are never spoken. Phonetic vocalisations like “Hahaha” go inside the quotes. Avoid “Sigh”, “Gasp”, or “Cough” inside quotes: the model speaks the word rather than performing the action.
Is every output watermarked?
Yes. PerTh watermarking is on by default. Imperceptible to listeners. Survives MP3/AAC encoding at ~100% detection accuracy. Can be disabled for debugging only.
What are the hardware requirements?
A warm H100 server produces each request in approximately 2.5 seconds at a peak of approximately 24 GB VRAM. Cold inference, which loads the Gemma model per request, runs in approximately 30 seconds with a peak of approximately 8 GB VRAM.
Get complete generative AI security
Join thousands of developers and enterprises securing generative AI with Resemble AI