dramabox • resemble tts research

DramaBox is the first directable speech engine. Open source and ready for its close-up.

Describe the speaker, the scene, and the delivery, and DramaBox generates the performance. Every expressive output is watermarked for provenance.

built for control

Closing the gap between synthesis and performance

Traditional TTS provides a consistent voice but offers no way to direct the performance. Paralinguistic cues (pauses, breath, emotional shifts) are stripped away, leaving the output robotic. DramaBox restores this control, using your prompt to define the speaker identity, emotional range, and paralinguistic texture of the final audio.

STEP 1 • WRITE

Describe the scene

Write a prompt describing the speaker and the delivery. Dialogue goes inside double quotes. Stage directions — sighs, pauses, a voice cracking — go outside them and are never spoken aloud.
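The quoting convention above can be sketched as a small helper. This is illustrative only — the function name and structure are not part of any DramaBox API, just one way to assemble a prompt string that follows the rules:

```python
def build_prompt(description, beats):
    """Assemble a DramaBox-style prompt string.

    beats is a list of (stage_direction, dialogue) pairs.
    Dialogue is wrapped in double quotes (spoken literally);
    stage directions stay outside the quotes (performed, never spoken).
    """
    parts = [description]
    for direction, dialogue in beats:
        if direction:
            parts.append(direction)
        if dialogue:
            parts.append(f'"{dialogue}"')
    return " ".join(parts)

prompt = build_prompt(
    "A tired detective speaks in a low, gravelly voice.",
    [
        ("He sighs deeply,", "It's always the quiet ones."),
        ("His voice cracks,", "Always."),
    ],
)
```

The resulting string keeps every performance cue outside the quotes, so nothing but the dialogue is voiced.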

STEP 2 • REFERENCE

Clone a target voice

Provide 10+ seconds of reference audio. DramaBox applies that timbre to your prompt’s performance. Skip the reference and the model invents a voice to match the description.
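The 10-second minimum can be checked locally before uploading a clip. A minimal sketch using Python's standard wave module — it assumes a PCM WAV file, and the threshold simply mirrors the guidance above:

```python
import wave

MIN_REFERENCE_SECONDS = 10.0  # per the guidance above

def reference_long_enough(path):
    """Return (duration_seconds, ok) for a PCM WAV reference clip."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return duration, duration >= MIN_REFERENCE_SECONDS
```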

STEP 3 • GENERATE

Get an expressive output

The model produces 48 kHz stereo audio in approximately 2.5 seconds on a warm H100 server. Every output is watermarked with PerTh at generation.

what dramabox does

Everything a director controls, now in a prompt.

Voice, emotion, delivery, texture. Direction, not dictation.
Speaker identity from description
Describe the speaker: age, affect, accent, register. DramaBox generates a voice to match or applies a reference voice to whatever the prompt asks for.
Directed emotional range
Write the direction. The model follows it. Cold fury, warm enthusiasm, sinister calm, grief.
Paralinguistic texture
Stage directions outside the quotes become performance cues. The model reads them, never speaks them.
PerTh watermarking by default
Every output watermarked at generation. Survives MP3/AAC encoding and common edits at ~100% detection accuracy.
100%
Detection accuracy after MP3/AAC
48 kHz
Stereo output
~2.5s
Per generation on H100
HOW THE MODEL WORKS

A DiT trained to read the room

DramaBox is an IC-LoRA fine-tune of LTX-2.3, the audio-only branch of Lightricks' open video DiT. The design decision that makes it different from standard TTS is the prompt structure. Most TTS models take text and produce speech. DramaBox treats the prompt as the director's script: it takes a scene description and produces a performance. The distinction lives in what the prompt can contain.
Dialogue goes inside double quotes and is spoken literally.
Stage directions go outside the quotes and function as performance cues, never spoken.
Phonetic vocalisations — laughter, gasps, sounds — go inside the quotes as single words, e.g. "Hahaha" not "Ha ha ha".
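Those three rules can be checked mechanically before generation. The linter below is a sketch, not part of DramaBox itself — the word list and warning messages are illustrative. It catches unbalanced quotes and non-phonetic action words like "Sigh" placed inside quotes, which the model would read aloud instead of performing:

```python
import re

# Action words the model would speak literally if quoted;
# per the rules above, these belong outside the quotes.
NON_PHONETIC = {"sigh", "sighs", "gasp", "gasps", "cough", "coughs"}

def lint_prompt(prompt):
    """Return a list of warnings for a DramaBox-style prompt."""
    warnings = []
    if prompt.count('"') % 2 != 0:
        warnings.append("unbalanced double quotes")
    for quoted in re.findall(r'"([^"]*)"', prompt):
        for word in quoted.split():
            if word.strip(".,!?").lower() in NON_PHONETIC:
                warnings.append(
                    f"'{word}' inside quotes will be spoken, not performed"
                )
    return warnings
```

For example, `lint_prompt('She sighs, "Sigh, not again."')` flags the quoted "Sigh," while the same direction outside the quotes passes cleanly.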
Listen

Cold fury to a venomous whisper

A single prompt directing a full emotional arc. No post-processing.

The script:
A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?! Do you?!" She lets out a cold, mocking laugh, "Hahaha, how utterly pathetic you are." She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."
REFERENCE
GENERATED
BUILT ON RESEMBLE AI

DramaBox generates the performance. The stack handles what comes after.

Every DramaBox output is watermarked at generation. Verify it anywhere it travels and detect synthetic audio from sources you didn’t generate.
Resemble Watermarker
Watermark generated audio and verify it later via API.
explore resemble watermarker
Resemble Voice Creation
Clone a voice from 10 seconds of audio or generate one from a text description.
explore voice creation
PerTh Watermarker
The neural watermarker embedded in every DramaBox output. Near 100% recovery across a standard attack suite.
explore the model
Frequently asked questions
What makes DramaBox different from standard TTS?
Standard TTS produces words in a voice. DramaBox produces a performance directed by the prompt. Identity, emotion, pacing, laughter, sighs — all written in, all performed out.
Do I need a voice reference?
No. Without one, DramaBox generates a voice to match the speaker description. With 10+ seconds of reference audio, the model applies that timbre to the performance. The reference is a casting choice, not a requirement.
How do I write a prompt?
Dialogue inside double quotes is spoken literally. Stage directions outside the quotes are never spoken. Phonetic vocalisations like “Hahaha” go inside the quotes. Avoid “Sigh”, “Gasp”, or “Cough” inside quotes: the model speaks the word rather than performing the action.
Is every output watermarked?
Yes. PerTh watermarking is on by default. Imperceptible to listeners. Survives MP3/AAC encoding at ~100% detection accuracy. Can be disabled for debugging only.
What are the hardware requirements?
A warm H100 server produces each request in approximately 2.5 seconds at a peak of approximately 24 GB VRAM. Cold inference, which loads the Gemma model per request, runs in approximately 30 seconds with a peak of approximately 8 GB VRAM.
Get complete generative AI security
Join thousands of developers and enterprises securing generative AI with Resemble AI