May 13, 2026

DramaBox TTS: Saving drama for the performance, not the security review

CONTRIBUTORS
Obaid Ahmed
Head of Product

At some point, everyone who has built with TTS hits the same wall. The voice sounds fine, the words are correct, and it still sounds wrong in a way that is hard to explain until you have heard it enough times to name it: flat. Technically accurate, completely unconvincing.

The first problem was that we were using machine language to describe something fundamentally human. Tags, parameters, style tokens, all of it was a translation layer between what you actually wanted and what the model could understand. The translation is where the performance struggled to sound…human.

A real director does not say "increase emotional intensity by 20%." They set a scene, they describe a character, and the performance follows from that direction. DramaBox is the first TTS model that works the same way. You describe what you want in plain language, the way you would describe it to a person, and the model understands it that way.

The second problem was quieter but just as serious. Every piece of synthetic audio left your hands with no way to prove it was yours. No signal in the file, no chain of custody, nothing that holds up when legal gets involved.

Today we are releasing DramaBox, and it addresses both.

Huge congratulations to Resemble AI on the release of DramaBox, their latest expressive voice model. It's also amazing to see the Resemble TTS family surpass 10M downloads on Hugging Face, a testament to the strength of the open model community.

- Joshua Lochner, Open Source Machine Learning Engineer

Directable speech

DramaBox is a prompt-driven expressive TTS model. One prompt controls everything: speaker identity, emotion, delivery style, laughs, sighs, breaths, pauses, transitions. You are not adjusting parameters, you are writing a scene.

The format works like a screenplay. Speaker description and stage directions go outside the quotes. Dialogue goes inside. The model speaks the dialogue and interprets everything outside the quotes as performance direction, never as words to say.

Read this prompt:

A regal woman speaks with cold fury in a measured, low voice. She sighs deeply, "I have told you a thousand times, and yet here we are again." Her voice sharpens with rising anger, "Do you honestly think I enjoy repeating myself?!" She drops to a venomous whisper, leaning close, "Now get out of my sight before I do something we will both regret."

That is the exact prompt we passed to the model: a single generation, no post-processing, no stitching together of takes.

Sample: Regal Queen — Cold Fury to Venomous Whisper

How it works

The prompt format distinguishes between two zones. Inside double quotes, the model speaks literally. Outside double quotes, it performs. Write She sighs deeply outside the quotes and the model produces a sigh. Put "Sigh" inside the quotes and the model says the word out loud, which is not what you want. That distinction is what makes the whole thing work, and it is also what makes the prompting feel genuinely different from anything else in TTS.
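To make the two-zone convention concrete, here is a minimal Python sketch of how a prompt splits into direction and dialogue. This is illustrative only: the model's actual parsing is internal, and split_prompt is a toy helper written for this post, not part of any published API.

```python
import re

def split_prompt(prompt: str):
    """Split a DramaBox-style prompt into (zone, text) pairs.

    Text inside double quotes is spoken literally; everything
    outside the quotes is treated as performance direction.
    """
    segments = []
    # re.split with a capturing group alternates between unquoted
    # spans (even indices) and quoted spans (odd indices).
    for i, chunk in enumerate(re.split(r'"([^"]*)"', prompt)):
        if not chunk.strip():
            continue
        zone = "dialogue" if i % 2 else "direction"
        segments.append((zone, chunk.strip()))
    return segments

for zone, text in split_prompt(
    'A regal woman speaks with cold fury. She sighs deeply, '
    '"I have told you a thousand times, and yet here we are again."'
):
    print(f"{zone:>9}: {text}")
```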

Here is what that looks like when you give the model a character who cannot stop laughing:

A playful girl speaks in a bright, singsong voice, already mid-giggle, "Hehehe, oh my gosh you should see your face right now, it is priceless!" She gasps for air between giggles, "Oh my, I cannot stop laughing!" She tries to compose herself with a long sigh, "Ahhhhh okay okay, I will stop, I promise." She leans in and whispers, "But seriously though, between you and me," then immediately loses it again, "Haha, I just cannot! You are way too funny!"

Sample: Catgirl - Uncontrollable Giggling

The gasps, the failed attempts to compose herself, the snort at the end — all of it came from the prompt, not from post-processing.

What's under the hood

DramaBox is a 3.3B-parameter audio-only diffusion transformer, LoRA-merged, with a Gemma 3 12B text encoder running at 4-bit quantization. It outputs 48kHz stereo audio in AAC or WAV. On a warm H100, generation takes around 2.5 seconds.

Voice cloning is optional. Pass a 10-second reference clip and the model clones the target timbre while still following the prompt for everything else: emotion, delivery, the laugh in the middle of a sentence, the breath before the hard line. The reference sets the voice and the prompt directs the performance. Without a reference, the model invents a voice that fits the scene description.
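In code, a session might look something like the sketch below. Every name here is an assumption made for illustration, including the package, the repo id, and the generate signature; the real quick-start code lives on the Hugging Face model card.

```python
# Hypothetical usage sketch: package name, repo id, and method signatures
# are illustrative assumptions, not the published API.
from dramabox import DramaBox  # hypothetical package

model = DramaBox.from_pretrained("resemble-ai/dramabox")  # placeholder repo id

prompt = (
    'A regal woman speaks with cold fury in a measured, low voice. '
    'She sighs deeply, "I have told you a thousand times, '
    'and yet here we are again."'
)

# Prompt only: the model invents a voice that fits the scene description.
audio = model.generate(prompt)            # 48kHz stereo output
audio.save("queen.wav")                   # WAV or AAC

# With a ~10-second reference clip: the clip sets the timbre, while the
# prompt still directs emotion, pacing, sighs, and everything else.
cloned = model.generate(prompt, reference="speaker_ref.wav")
cloned.save("queen_cloned.wav")
```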

And it is a step toward compliant TTS under incoming regulations like the EU AI Act. Every output is watermarked with Resemble Watermarker. The watermark is embedded in the signal itself, not in file metadata, and constrained to speech-relevant frequencies so it is inaudible to listeners. It survives MP3 and AAC compression, re-encoding, and common edits at ~100% detection accuracy. Pass that output through our deepfake detection API and you get back a binary answer, not a probability score you have to interpret: a watermark-present-or-not verdict that is more defensible when legal gets involved. Generation and protection are not separate decisions here. They are the same decision, made at the moment of creation.
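A verification call could look like the following sketch. The endpoint URL, auth scheme, and response fields are placeholders for illustration, not the documented detection API; the shape of the answer is the point here, a boolean rather than a score.

```python
import requests

# Hypothetical sketch: the URL, auth header, and response fields below are
# placeholders, not the documented Resemble detection API.
with open("queen.wav", "rb") as f:
    resp = requests.post(
        "https://api.example.com/v1/detect",   # placeholder endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"audio": f},
    )

result = resp.json()
# A binary verdict, not a probability you have to interpret.
if result["watermark_present"]:
    print("Watermark found: provenance confirmed.")
else:
    print("No watermark detected.")
```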

The range

The model handles tonal range across the full dramatic spectrum. A villain whose menace never tips into parody:

Sample: Villain - Sinister Laugh

A talk show host who loses it completely and cannot find his way back:

Sample: Talk Show Host - Wheezing Laugh

Five voices building a late-90s pop harmony from soft synchronized layers to a full-group chorus:

Backstreet Boys, a polished late-90s boy band with five smooth, harmonizing male voices. "Step by step… out the door… new day… ready for more…" they sing in soft, synchronized harmony. One voice steps forward. "Keys in my hand… got my plan…" The others swell behind him. Their voices rise together. "Tell me why… every morning feels the same…" then "I'm ready to go…" The full group returns in a bright, unified chorus. "We'll make it our way… through the rush, through the noise, we keep moving strong, yeah!"

Sample: Boy Band - Pop Harmony

And a football commentator calling a fridge opening with the full gravitas of a Champions League final:

Sample: Football Commentator - Boring Activities

That last one is worth listening to carefully because it demonstrates something more specific than expressiveness: precise tonal control across a long, slow build with crowd audio layered underneath, which is a harder thing to get right than a single emotional peak. The model is not just generating speech. It is generating a scene with pacing.

What and who this is for

DramaBox is an English-only release, and that is intentional. Getting directable speech right in one language is harder than getting flat TTS right in thirty, and we were not willing to trade quality for coverage on this one.

The use cases are anywhere flat TTS has always been the bottleneck: game dialogue that players do not skip, audiobook narration with real character differentiation, voice agents that do not sound like they are reading from a script, dubbing work where the performance has to match the scene and not just the words. Every file that comes out carries a provenance signal embedded at the moment of creation, so wherever it ends up, you can prove it was yours.

Run it self-hosted. The model card and quick-start code are on Hugging Face. If you’re interested in running DramaBox at scale, we suggest you reach out to Cerebrium or GMI.
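If you want to pull the weights yourself, a minimal sketch using huggingface_hub follows. snapshot_download is the real library call; the repo id is a placeholder, so take the correct one, and the full quick-start code, from the model card.

```python
from huggingface_hub import snapshot_download

# snapshot_download is the real huggingface_hub call; the repo id here is
# a placeholder. Check the DramaBox model card for the actual id.
local_dir = snapshot_download(repo_id="resemble-ai/dramabox")
print(f"Model weights downloaded to {local_dir}")
```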

What's next

DramaBox is the first of four open-source TTS models we are shipping this month, and each one is a different answer to a different problem. 

  • Chatterbox Nano: 110M parameters. 10x realtime on GPU, 3x realtime on CPU. Runs at the edge. Full paralinguistic tags, voice cloning from 5 seconds. The smallest serious TTS model we've ever shipped, and arguably the fastest open-source TTS.
  • Chatterbox Flash: Chatterbox, rebuilt on a diffusion-LLM architecture. 2x faster than our AR baseline on vLLM. Ships with a novel prior-subtraction technique we believe generalizes to any dLLM TTS, one of the first production TTS models on this architecture.
  • Chatterbox Multilingual V3: Better speaker similarity, fewer hallucinations, more natural delivery across languages. Plus dedicated single-language models for Mandarin, LATAM Spanish, Brazilian Portuguese, Spain Spanish, Portugal Portuguese, and Hindi.

We will cover each one when it's available.
