Chatterbox: Leading Open Source Voice Cloning AI Model
MIT licensed. Emotion control. Super fast. Consistently outperforms ElevenLabs in blind evaluations. Built for developers, creators, and enterprises who demand both quality and freedom.
TRUSTED BY DEVELOPERS AT
Voice Cloning with 5 seconds of reference audio
Expressive Speech
Accent Control
Text-based Controllability
Case Sensitive
Built for Production, Designed for Developers
Unique Emotion Control
First open source model with emotion exaggeration control. Adjust intensity from monotone to dramatically expressive with a single parameter.
Real-Time Voice Synthesis
Faster-than-real-time inference with alignment-informed generation. Perfect for real-time applications, voice assistants, and interactive media.
Zero-Shot Voice Cloning
Clone any voice with just a few seconds of reference audio. No training required. Includes easy voice conversion scripts.
Watermarked & Secure
Built-in watermarking for generated audio. Know when content was created by Chatterbox while maintaining high audio quality.
Developer First
Simple pip install, comprehensive docs. Built by developers, for developers. Available on GitHub and Hugging Face.
Proven Performance
Consistently preferred over ElevenLabs in side-by-side evaluations. Trained on 500K hours of cleaned, high-quality data.
Try it Yourself
Hear how emotion control and high-quality synthesis make your content come alive.
How We Stack Up
The only TTS solution that combines enterprise-grade quality with complete transparency and control.
Subjective Evaluation
We ran a blind evaluation through Podonos to assess how well Resemble AI’s Chatterbox and ElevenLabs generate natural, high-quality speech. Both systems produced audio from identical text inputs, conditioned on 7- to 20-second reference clips (zero-shot, with no prompt engineering or audio post-processing).
(Combined A4 + A5 responses showing preference for Model B)
Marked by Resemble AI’s PerTh Watermarker
Every audio file generated by Chatterbox includes Resemble AI’s PerTh (Perceptual Threshold) Watermarker — a deep neural network watermarker that embeds data in an imperceptible and difficult-to-detect way. This isn’t just a feature; it’s our commitment to responsible AI deployment.
The watermarker operates on principles of psychoacoustics — exploiting the way we perceive audio to find regions that are inaudible, and then encoding data into them. It relies on a quirk of human hearing: tones with high audibility essentially “mask” nearby tones of lesser amplitude.
When someone speaks and produces peaks at specific frequencies, PerTh can embed structured tones within a few hertz of those peaks that remain imperceptible to listeners while being robust against removal attempts.
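The masking principle can be sketched in a few lines of NumPy. This is an illustrative toy, not the PerTh algorithm itself: it embeds a tone roughly 60 dB below a dominant spectral peak, just a few hertz away, where a louder tone masks it from listeners — yet a sufficiently long FFT still resolves it as a distinct bin.

```python
import numpy as np

SR = 16_000                 # sample rate (Hz)
t = np.arange(SR) / SR      # 1 second of audio

carrier_hz = 440.0          # dominant tone, standing in for a voiced-speech peak
watermark_hz = 443.0        # a few hertz away, perceptually masked by the carrier

signal = np.sin(2 * np.pi * carrier_hz * t)
# Watermark tone at 1/1000 the amplitude (~ -60 dB relative to the carrier).
watermarked = signal + 0.001 * np.sin(2 * np.pi * watermark_hz * t)

# A 1-second FFT has 1 Hz resolution, enough to separate 440 Hz from 443 Hz.
spectrum = np.abs(np.fft.rfft(watermarked))
freqs = np.fft.rfftfreq(len(watermarked), 1 / SR)

carrier_bin = int(np.argmin(np.abs(freqs - carrier_hz)))
mark_bin = int(np.argmin(np.abs(freqs - watermark_hz)))
print(freqs[carrier_bin], freqs[mark_bin])  # 440.0 443.0
```

The watermark bin stands far above the noise floor even though it is inaudible next to the carrier — which is exactly the headroom a perceptual watermarker exploits to hide data.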
Ready to Build with Generative Voice?
Join developers already using Chatterbox in production