Multilingual. Turbo. Emotion control. Super fast. Consistently outperforms ElevenLabs in blind evaluations. Built for developers, creators, and enterprises who demand both quality and freedom.
High quality, fast text-to-speech with emotion control and zero-shot voice cloning.
Hear samples →
Open source TTS in 23 languages with full control and zero-shot voice cloning.
Learn more →
Blazing fast inference with paralinguistic tagging for non-speech sounds.
Learn more →
Chatterbox ships everything you expect from a premium closed-source model — without the lock-in.
The first open source model with emotion exaggeration control. Adjust intensity from monotone to dramatically expressive with a single parameter.
Faster-than-realtime inference with alignment-informed generation. Perfect for voice assistants, agents, and interactive media.
Clone any voice with just a few seconds of reference audio. No training required. Voice conversion scripts included out of the box.
Built-in PerTh watermarking on every generation. Know when content was created by Chatterbox while maintaining high audio quality.
Simple pip install, comprehensive docs, and a permissive MIT license. Available on GitHub and Hugging Face.
Supports 23+ languages for truly global content — all with the same quality and controllability as English.
Zero-shot clones generated directly from short reference clips. No fine-tuning, no prompt engineering, no post-processing.
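"Faster-than-realtime" is usually quantified as the real-time factor (RTF): generation time divided by output audio duration, where values below 1.0 mean synthesis stays ahead of playback. A minimal sketch with hypothetical timings (the numbers below are illustrative, not measured Chatterbox benchmarks):

```python
# Real-time factor (RTF): generation time divided by audio duration.
# RTF < 1.0 means the model synthesizes speech faster than it plays back.

def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    return generation_seconds / audio_seconds

# Hypothetical timings for illustration: 2.5 s to generate 10 s of audio.
rtf = real_time_factor(2.5, 10.0)
print(rtf)  # 0.25 -> four times faster than playback
```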
Dramatic delivery with full range of emotion exaggeration, cloned from a 5-second reference.
Accent and persona preserved across wildly different contexts, maintaining timbre and cadence.
Steer performance directly from the input text — no separate prompts, no knobs to tune.
Capitalization actually shifts emphasis — a rare degree of controllability in any TTS model.
The only TTS family that combines open source freedom with production quality, real-time latency, and on-premise deployment.
| Capability | Chatterbox | ElevenLabs | OpenAI TTS | Azure TTS |
|---|---|---|---|---|
| Open Source | ✓ MIT License | × Closed | × Closed | × Closed |
| Multilingual | ✓ 23+ languages | × Limited | × Limited | × Limited |
| Emotion Control | ✓ Unique feature | × Limited | × None | × Basic |
| Voice Cloning | ✓ Zero-shot | ✓ Premium only | × None | ✓ Limited |
| Latency | ✓ ~200ms | 200–300ms | ~300ms | ~300ms |
| On-Premise Deploy | ✓ Full control | × Cloud only | × Cloud only | × Cloud only |
| Cost | ✓ Free forever | $0.15 / 1K chars | $15 / 1M chars | $24 / 1M chars |
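To make the cost row concrete, here is what synthesizing one million characters costs at the listed rates. (Self-hosting Chatterbox carries no per-character fee, though you still pay for your own compute.)

```python
# Cost of synthesizing 1,000,000 characters at the rates in the table above.
chars = 1_000_000

elevenlabs = chars / 1_000 * 0.15      # $0.15 per 1K chars
openai_tts = chars / 1_000_000 * 15    # $15 per 1M chars
azure_tts = chars / 1_000_000 * 24     # $24 per 1M chars
chatterbox = 0.0                       # open source, self-hosted

print(elevenlabs, openai_tts, azure_tts, chatterbox)  # 150.0 15.0 24.0 0.0
```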
We ran a blind evaluation through Podonos designed to assess the performance of Chatterbox and ElevenLabs in generating natural, high-quality speech. Both systems generated audio from identical text inputs using 7–20 second reference clips — zero-shot, no prompt engineering, no post-processing.
63.75% of blind evaluators preferred Chatterbox (combined A4 and A5 responses, the two ratings favoring Chatterbox, which was blinded as Model B).
Evaluators listened to paired samples and rated preference on a 5-point scale from “ElevenLabs strongly preferred” to “Chatterbox strongly preferred.”
Read the full report on Podonos →
Every audio file generated by Chatterbox includes PerTh — a Perceptual Threshold deep neural watermarker that embeds data in an imperceptible and difficult-to-detect way. Not just a feature — our commitment to responsible AI deployment.
PerTh operates on principles of psychoacoustics, exploiting the way we perceive audio to find regions where added sound is inaudible and encoding data there. It relies on a quirk of human hearing: louder tones essentially "mask" nearby tones of lesser amplitude.
When someone speaks and produces peaks at specific frequencies, PerTh can embed structured tones within a few hertz that remain imperceptible to listeners while staying robust against removal attempts.
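The masking idea can be sketched numerically. This toy example is not the actual PerTh algorithm (which is a deep neural watermarker), and the frequencies and amplitudes are arbitrary; it simply hides a quiet tone a few hertz from a loud carrier peak, far below the masking threshold:

```python
import numpy as np

sr = 16_000                  # sample rate (Hz)
t = np.arange(sr) / sr       # one second of audio

carrier = np.sin(2 * np.pi * 440.0 * t)          # loud tone: a spectral peak
payload = 0.001 * np.sin(2 * np.pi * 443.0 * t)  # quiet tone 3 Hz away
watermarked = carrier + payload

# The payload sits about 60 dB below the carrier, under the psychoacoustic
# masking threshold, so listeners cannot hear it.
ratio_db = 20 * np.log10(np.abs(payload).max() / np.abs(carrier).max())
print(round(ratio_db, 1))  # -60.0
```

A real watermarker must also survive compression, resampling, and deliberate removal attempts, which is why PerTh uses a learned embedding rather than fixed tones.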
View PerTh on GitHub →
A production-grade watermark that travels with every generation, built for real-world distribution.
Install from pip, point at a reference clip, and generate. No API keys, no rate limits, no sign-ups.
A permissive MIT license, a single pip package, and comprehensive docs. Deploy locally, on GPU, or fully on-premise — whichever fits your stack.
```bash
# Install
pip install chatterbox-tts
```

```python
# Generate
from chatterbox.tts import ChatterboxTTS
import torchaudio

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    text="Chatterbox is fast, expressive, and open source.",
    audio_prompt_path="reference.wav",
    exaggeration=0.7,
)
torchaudio.save("output.wav", wav, model.sr)
```
Everything you need to know about licensing, deployment, and the Chatterbox family.