Multilingual. Turbo. Emotion control. Super fast. Consistently outperforms ElevenLabs in blind evaluations. Built for developers, creators, and enterprises who demand both quality and freedom.
High quality, fast text-to-speech with emotion control and zero-shot voice cloning.
Hear samples →
Open source TTS in 23 languages with full control and zero-shot voice cloning.
Learn more →
Blazing fast inference with paralinguistic tagging for non-speech sounds.
Learn more →
Chatterbox ships everything you expect from a premium closed-source model — without the lock-in.
The first open source model with emotion exaggeration control. Adjust intensity from monotone to dramatically expressive with a single parameter.
Faster-than-realtime inference with alignment-informed generation. Perfect for voice assistants, agents, and interactive media.
Clone any voice with just a few seconds of reference audio. No training required. Voice conversion scripts included out of the box.
Built-in PerTh watermarking on every generation. Know when content was created by Chatterbox while maintaining high audio quality.
Simple pip install, comprehensive docs, and a permissive MIT license. Available on GitHub and Hugging Face.
Supports 23+ languages for truly global content — all with the same quality and controllability as English.
Zero-shot clones generated directly from short reference clips. No fine-tuning, no prompt engineering, no post-processing.
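"Faster-than-realtime" is usually quantified as the real-time factor (RTF): generation time divided by output audio duration, where values below 1.0 mean synthesis stays ahead of playback. A minimal sketch with hypothetical timings (the numbers below are illustrative, not measured Chatterbox benchmarks):

```python
# Real-time factor (RTF): generation time divided by audio duration.
# RTF < 1.0 means the model synthesizes speech faster than it plays back.

def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    return generation_seconds / audio_seconds

# Hypothetical timings for illustration: 2.5 s to generate 10 s of audio.
rtf = real_time_factor(2.5, 10.0)
print(rtf)  # 0.25 -> four times faster than playback
```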
Dramatic delivery with full range of emotion exaggeration, cloned from a 5-second reference.
Accent and persona preserved across wildly different contexts, maintaining timbre and cadence.
Steer performance directly from the input text — no separate prompts, no knobs to tune.
Capitalization actually shifts emphasis — a rare degree of controllability in any TTS model.
The only TTS family that combines open source freedom with production quality, real-time latency, and on-premise deployment.
| Capability | Chatterbox | ElevenLabs | OpenAI TTS | Azure TTS |
|---|---|---|---|---|
| Open Source | ✓ MIT License | × Closed | × Closed | × Closed |
| Multilingual | ✓ 23+ languages | × Limited | × Limited | × Limited |
| Emotion Control | ✓ Unique feature | × Limited | × None | × Basic |
| Voice Cloning | ✓ Zero-shot | ✓ Premium only | × None | ✓ Limited |
| Latency | ✓ ~200ms | 200–300ms | ~300ms | ~300ms |
| On-Premise Deploy | ✓ Full control | × Cloud only | × Cloud only | × Cloud only |
| Cost | ✓ Free forever | $0.15 / 1K chars | $15 / 1M chars | $24 / 1M chars |
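To make the cost row concrete, here is what synthesizing one million characters costs at the listed rates. (Self-hosting Chatterbox carries no per-character fee, though you still pay for your own compute.)

```python
# Cost of synthesizing 1,000,000 characters at the rates in the table above.
chars = 1_000_000

elevenlabs = chars / 1_000 * 0.15      # $0.15 per 1K chars
openai_tts = chars / 1_000_000 * 15    # $15 per 1M chars
azure_tts = chars / 1_000_000 * 24     # $24 per 1M chars
chatterbox = 0.0                       # open source, self-hosted

print(elevenlabs, openai_tts, azure_tts, chatterbox)  # 150.0 15.0 24.0 0.0
```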
We ran a blind evaluation through Podonos designed to assess the performance of Chatterbox and ElevenLabs in generating natural, high-quality speech. Both systems generated audio from identical text inputs using 7–20 second reference clips — zero-shot, no prompt engineering, no post-processing.
63.75% of blind evaluators preferred Chatterbox (combined A4 and A5 responses, the two ratings favoring Chatterbox, which was blinded as Model B).
Evaluators listened to paired samples and rated preference on a 5-point scale from “ElevenLabs strongly preferred” to “Chatterbox strongly preferred.”
Read the full report on Podonos →
Every audio file generated by Chatterbox includes PerTh — a Perceptual Threshold deep neural watermarker that embeds data in an imperceptible and difficult-to-detect way. Not just a feature — our commitment to responsible AI deployment.
PerTh operates on principles of psychoacoustics, exploiting the way we perceive audio to find regions where added sound is inaudible and encoding data there. It relies on a quirk of human hearing: louder tones essentially "mask" nearby tones of lesser amplitude.
When someone speaks and produces peaks at specific frequencies, PerTh can embed structured tones within a few hertz that remain imperceptible to listeners while staying robust against removal attempts.
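The masking idea can be sketched numerically. This toy example is not the actual PerTh algorithm (which is a deep neural watermarker), and the frequencies and amplitudes are arbitrary; it simply hides a quiet tone a few hertz from a loud carrier peak, far below the masking threshold:

```python
import numpy as np

sr = 16_000                  # sample rate (Hz)
t = np.arange(sr) / sr       # one second of audio

carrier = np.sin(2 * np.pi * 440.0 * t)          # loud tone: a spectral peak
payload = 0.001 * np.sin(2 * np.pi * 443.0 * t)  # quiet tone 3 Hz away
watermarked = carrier + payload

# The payload sits about 60 dB below the carrier, under the psychoacoustic
# masking threshold, so listeners cannot hear it.
ratio_db = 20 * np.log10(np.abs(payload).max() / np.abs(carrier).max())
print(round(ratio_db, 1))  # -60.0
```

A real watermarker must also survive compression, resampling, and deliberate removal attempts, which is why PerTh uses a learned embedding rather than fixed tones.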
View PerTh on GitHub →
A production-grade watermark that travels with every generation, built for real-world distribution.
Install from pip, point at a reference clip, and generate. No API keys, no rate limits, no sign-ups.
A permissive MIT license, a single pip package, and comprehensive docs. Deploy locally, on GPU, or fully on-premise — whichever fits your stack.
```bash
# Install
pip install chatterbox-tts
```

```python
# Generate
from chatterbox.tts import ChatterboxTTS
import torchaudio

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    text="Chatterbox is fast, expressive, and open source.",
    audio_prompt_path="reference.wav",
    exaggeration=0.7,
)
torchaudio.save("output.wav", wav, model.sr)
```
Everything you need to know about licensing, deployment, and the Chatterbox family.