
Chatterbox: Leading Open Source Voice Cloning AI Model

MIT licensed. Emotion control. Super fast. Consistently outperforms ElevenLabs in blind evaluations. Built for developers, creators, and enterprises who demand both quality and freedom.

TRUSTED BY DEVELOPERS AT

Age of Learning
Red Games
Netflix

Built for Production, Designed for Developers


Unique Emotion Control

First open source model with emotion exaggeration control. Adjust intensity from monotone to dramatically expressive with a single parameter.
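As a rough sketch of what that single parameter can look like in code, the snippet below assumes the `chatterbox-tts` package and the `ChatterboxTTS.from_pretrained(...)` and `generate(..., exaggeration=...)` calls as shown in the project's public README. None of these names appear on this page, so verify them against the current docs. The helper names and the intensity values are illustrative only, not documented limits.

```python
def emotion_sweep(levels=(0.25, 0.5, 1.0)):
    """Return one settings dict per intensity level (hypothetical helper).

    The default levels are illustrative values, not documented limits.
    """
    return [{"exaggeration": float(level)} for level in levels]


def render_sweep(text, out_prefix="sample", device="cuda"):
    """Synthesize the same text once per intensity level.

    Assumes `pip install chatterbox-tts`. Imports are deferred so that
    emotion_sweep() can be used without the model installed.
    """
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device=device)
    paths = []
    for settings in emotion_sweep():
        # Only the exaggeration value changes between runs.
        wav = model.generate(text, **settings)
        path = f"{out_prefix}_{settings['exaggeration']}.wav"
        ta.save(path, wav, model.sr)
        paths.append(path)
    return paths
```

Listening to the resulting files side by side is a quick way to pick an intensity for a given script.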

Real-Time Voice Synthesis

Faster-than-real-time inference with alignment-informed generation. Well suited to real-time applications, voice assistants, and interactive media.
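"Faster than real time" is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1.0 means audio is generated faster than it plays back. A minimal sketch for measuring it on your own hardware follows; the function names are hypothetical and the numbers in the usage note are made up for illustration.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, if a call wrapped in `timed` takes 2.0 seconds and yields 8.0 seconds of audio, `real_time_factor(2.0, 8.0)` gives 0.25.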

Zero-Shot Voice Cloning

Clone any voice with just a few seconds of reference audio. No training required. Includes easy voice conversion scripts.
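Before cloning, it can help to confirm that the reference clip really is a few seconds long. The sketch below checks a PCM WAV file with Python's standard `wave` module. The 3-second lower bound is an assumption made here, and the 20-second upper bound simply mirrors the longest clips mentioned in the evaluation on this page; neither is a documented requirement. According to the project's public README, the checked path would then be passed as `audio_prompt_path` to `generate`, which is likewise an assumption as far as this page is concerned.

```python
import wave


def clip_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference(path, min_seconds=3.0, max_seconds=20.0):
    """Return the clip duration, or raise if it falls outside the assumed bounds."""
    duration = clip_seconds(path)
    if not min_seconds <= duration <= max_seconds:
        raise ValueError(
            f"reference clip is {duration:.1f}s; "
            f"expected {min_seconds}-{max_seconds}s"
        )
    return duration
```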

Watermarked & Secure

Built-in watermarking for generated audio. Know when content was created by Chatterbox while maintaining high audio quality.

Developer First

Simple pip install, comprehensive docs. Built by developers, for developers. Available on GitHub and Hugging Face.

Proven Performance

Consistently preferred over ElevenLabs in side-by-side evaluations. Trained on 500K hours of cleaned, high-quality data.

Try it Yourself

Hear how emotion control and high-quality synthesis make your content come alive.

How We Stack Up

The only TTS solution that combines enterprise-grade quality with complete transparency and control.

Provider                  | Open Source   | Emotion Control  | Voice Cloning  | Latency              | On-Premise Deploy | Cost
Chatterbox by Resemble AI | ✓ MIT License | ✓ Unique Feature | ✓ Zero-shot    | ✓ ~200ms             | ✓ Full Control    | ✓ Free Forever
ElevenLabs                | ✗ Closed      | ✗ Limited        | ✓ Premium Only | 200-300ms on average | ✗ Cloud Only      | $0.15/1,000 chars
OpenAI TTS                | ✗ Closed      | ✗ None           | ✗ None         | ~300ms               | ✗ Cloud Only      | $15/1M chars
Azure TTS                 | ✗ Closed      | ✗ Basic          | ✓ Limited      | ~300ms               | ✗ Cloud Only      | $24/1M chars
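The Cost column mixes per-1,000 and per-1M character prices, so the short sketch below normalizes them to a common volume. The prices are copied from the comparison above and may be out of date. The Chatterbox entry reflects only the free licence; it does not include the compute cost of self-hosting.

```python
# (price in USD, number of characters that price covers), taken from the table above.
PRICES = {
    "Chatterbox (licence only)": (0.0, 1),
    "ElevenLabs": (0.15, 1_000),
    "OpenAI TTS": (15.0, 1_000_000),
    "Azure TTS": (24.0, 1_000_000),
}


def cost_usd(price, per_chars, n_chars):
    """Scale a price quoted for per_chars characters to n_chars characters."""
    return price * (n_chars / per_chars)


def cost_table(n_chars=1_000_000):
    """Listed cost per provider for synthesizing n_chars characters."""
    return {
        name: cost_usd(price, per_chars, n_chars)
        for name, (price, per_chars) in PRICES.items()
    }


if __name__ == "__main__":
    for name, usd in cost_table().items():
        print(f"{name}: ${usd:,.2f} per 1M characters")
```

At one million characters, the listed ElevenLabs rate works out to $150, against $15 for OpenAI TTS and $24 for Azure TTS.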

Subjective Evaluation

We ran a test through Podonos to assess how natural and high-quality the speech generated by Resemble AI’s Chatterbox and ElevenLabs sounds. Both systems generated audio from reference clips 7 to 20 seconds long and from identical text inputs (zero-shot, with no prompt engineering or audio processing).

Answer | Score                                                     | Responses | Share
A1     |  2 (ElevenLabs, Model A, is strongly preferred)           | 9         | 11.25%
A2     |  1                                                        | 13        | 16.25%
A3     |  0 (no preference)                                        | 7         | 8.75%
A4     | -1                                                        | 20        | 25.00%
A5     | -2 (Resemble AI Chatterbox, Model B, is strongly preferred) | 31        | 38.75%

Result: 63.75% of evaluators preferred Chatterbox over ElevenLabs
(Combined A4 + A5 responses showing preference for Model B)
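The headline figure can be reproduced directly from the response counts. The sketch below recomputes the shares from the tallies above; the mean score is an extra summary derived here from the same counts, not a number reported by the evaluation.

```python
# Score -> number of responses, as listed above.
# Positive scores favour ElevenLabs (Model A), negative scores favour Chatterbox (Model B).
RESPONSES = {2: 9, 1: 13, 0: 7, -1: 20, -2: 31}


def summarize(responses):
    """Return total responses, preference shares in percent, and the mean score."""
    total = sum(responses.values())

    def share(predicate):
        return 100 * sum(n for score, n in responses.items() if predicate(score)) / total

    return {
        "total": total,
        "elevenlabs_pct": share(lambda s: s > 0),
        "no_preference_pct": share(lambda s: s == 0),
        "chatterbox_pct": share(lambda s: s < 0),
        "mean_score": sum(score * n for score, n in responses.items()) / total,
    }


if __name__ == "__main__":
    for key, value in summarize(RESPONSES).items():
        print(f"{key}: {value}")
```

With 80 responses in total, 51 fall on the Chatterbox side (63.75%), 22 on the ElevenLabs side (27.5%), and 7 express no preference (8.75%).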

Marked by Resemble AI’s PerTh Watermarker

Every audio file generated by Chatterbox includes Resemble AI’s PerTh (Perceptual Threshold) Watermarker — a deep neural network watermarker that embeds data in an imperceptible and difficult-to-detect way. This isn’t just a feature; it’s our commitment to responsible AI deployment.

The watermarker operates on principles of psychoacoustics — exploiting the way we perceive audio to find sounds that are inaudible, and then encoding data into these regions. It relies on a quirk of how humans process audio, by which tones with high audibility essentially “mask” nearby tones of lesser amplitude.

When a speaker's voice produces peaks at specific frequencies, PerTh can embed structured tones within a few hertz of those peaks. The tones remain imperceptible to listeners while being robust against removal attempts.
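To make the masking idea concrete, here is a toy illustration and not PerTh itself: PerTh is a neural network whose internals are not described on this page. The sketch mixes a loud "masker" tone with a much quieter tone a few hertz away. The 3 Hz offset and the -40 dB relative level are arbitrary assumptions chosen for illustration, and the code makes no claim that the result is inaudible or robust.

```python
import math


def tone(freq_hz, amplitude, sample_rate, seconds):
    """Generate a sine tone as a list of float samples."""
    n = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def add_quiet_neighbor(peak_hz, offset_hz=3.0, relative_db=-40.0,
                       sample_rate=16_000, seconds=0.1):
    """Mix a full-scale tone at peak_hz with a quieter tone offset_hz away.

    Returns (mixed_samples, quiet_amplitude). The quiet tone sits
    relative_db decibels below the loud one.
    """
    quiet_amplitude = 10 ** (relative_db / 20)  # dB -> linear amplitude
    loud = tone(peak_hz, 1.0, sample_rate, seconds)
    quiet = tone(peak_hz + offset_hz, quiet_amplitude, sample_rate, seconds)
    mixed = [a + b for a, b in zip(loud, quiet)]
    return mixed, quiet_amplitude
```

Calling `add_quiet_neighbor(440.0)` places a tone at 443 Hz with one hundredth of the amplitude of the 440 Hz tone. A real watermarker has to choose such placements adaptively from the speech signal and encode recoverable data in them.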

Ready to Build with Generative Voice?

Join developers already using Chatterbox in production