Describe the speaker, the scene, and the delivery, and DramaBox generates the performance. Every expressive output is watermarked for provenance.
Traditional TTS provides a consistent voice but offers no way to direct the performance. Paralinguistic cues (pauses, breath, and emotional shifts) are stripped away, leaving the output robotic. DramaBox restores this control, using your prompt to define the speaker identity, emotional range, and paralinguistic texture of the final audio.
Describe the scene
Write a prompt describing the speaker and the delivery. Dialogue goes inside double quotes. Stage directions — sighs, pauses, a voice cracking — go outside them and are never spoken aloud.
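To check a prompt before sending it, it can help to see which parts would be spoken and which would be treated as direction. The sketch below is only an illustration of the quoting convention described here, not DramaBox's own parser: the function name `split_prompt` and the sample prompt are hypothetical, and it handles straight double quotes only.

```python
import re


def split_prompt(prompt: str) -> list[tuple[str, str]]:
    """Split a prompt into ("spoken", text) and ("direction", text) segments.

    Text inside straight double quotes is labeled "spoken"; everything
    outside them is labeled "direction". Hypothetical helper for illustration.
    """
    segments = []
    # Each match is either a quoted run (group 1) or an unquoted run (group 2).
    for match in re.finditer(r'"([^"]*)"|([^"]+)', prompt):
        spoken, direction = match.group(1), match.group(2)
        if spoken is not None:
            if spoken.strip():
                segments.append(("spoken", spoken.strip()))
        elif direction.strip():
            segments.append(("direction", direction.strip()))
    return segments


# Hypothetical prompt: a speaker description, two lines of dialogue, one stage direction.
prompt = (
    'A tired detective, late at night. "I told you not to come here." '
    'She sighs, voice cracking. "Just go."'
)
for kind, text in split_prompt(prompt):
    print(f"{kind:>9}: {text}")
```

Reading the labeled segments back makes it easy to spot a stage direction that accidentally ended up inside the quotes, where it would be read aloud.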
Clone a target voice
Provide 10+ seconds of reference audio. DramaBox applies that timbre to your prompt’s performance. Skip the reference and the model invents a voice to match the description.
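A quick local check can confirm that a reference clip meets the 10-second minimum before it is uploaded. The following is a minimal sketch using Python's standard `wave` module; it assumes the reference is an uncompressed WAV file, and the helper names (`wav_duration_seconds`, `is_usable_reference`, `make_silent_wav`) are hypothetical and not part of any DramaBox API.

```python
import io
import wave

MIN_REFERENCE_SECONDS = 10.0  # from the guidance above: 10+ seconds of reference audio


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file, given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(source, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(source) >= minimum


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


# Stand-in clips; in practice, pass the path to the real reference recording.
print(is_usable_reference(make_silent_wav(12)))
print(is_usable_reference(make_silent_wav(4)))
```

Duration is only a lower bound on quality; the check says nothing about noise or whether the clip contains a single speaker.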
Get an expressive output
The model produces 48 kHz stereo audio in approximately 2.5 seconds on a warm H100 server. Every output is watermarked with PerTh at generation.

A single prompt directing a full emotional arc. No post-processing.


