Chatterbox: Leading Open Source Voice Cloning AI Model
MIT licensed. Emotion control. Super fast. Consistently outperforms ElevenLabs in blind evaluations. Built for developers, creators, and enterprises who demand both quality and freedom.
TRUSTED BY DEVELOPERS AT
Voice Cloning with 5 seconds of reference audio
Expressive Speech
Accent Control
Text-based Controllability
Case Sensitive
Built for Production, Designed for Developers
Unique Emotion Control
First open source model with emotion exaggeration control. Adjust intensity from monotone to dramatically expressive with a single parameter.
Real-Time Voice Synthesis
Faster-than-real-time inference with alignment-informed generation. Perfect for real-time applications, voice assistants, and interactive media.
Zero-Shot Voice Cloning
Clone any voice with just a few seconds of reference audio. No training required. Includes easy voice conversion scripts.
Watermarked & Secure
Built-in watermarking for generated audio. Know when content was created by Chatterbox while maintaining high audio quality.
Developer First
Simple pip install, comprehensive docs. Built by developers, for developers. Available on GitHub and Hugging Face.
Proven Performance
Consistently preferred over ElevenLabs in side-by-side evaluations. Trained on 500K hours of cleaned, high-quality data.
Try it Yourself
Hear how emotion control and high-quality synthesis make your content come alive.
How We Stack Up
The only TTS solution that combines enterprise-grade quality with complete transparency and control.
Subjective Evaluation
We ran a blind evaluation through Podonos to assess how well Resemble AI’s Chatterbox and ElevenLabs generate natural, high-quality speech. Both systems produced audio from identical text inputs, conditioned on 7- to 20-second reference clips (zero-shot, with no prompt engineering or audio post-processing).
(Combined A4 + A5 responses showing preference for Model B)
Marked by Resemble AI’s PerTh Watermarker
Every audio file generated by Chatterbox includes Resemble AI’s PerTh (Perceptual Threshold) Watermarker — a deep neural network watermarker that embeds data in an imperceptible and difficult-to-detect way. This isn’t just a feature; it’s our commitment to responsible AI deployment.
The watermarker operates on principles of psychoacoustics — exploiting the way we perceive audio to find regions that are inaudible, and then encoding data into them. It relies on a quirk of human hearing: tones with high audibility essentially “mask” nearby tones of lesser amplitude.
When someone speaks and produces peaks at specific frequencies, PerTh can embed structured tones within a few hertz of those peaks that remain imperceptible to listeners while being robust against removal attempts.
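The masking principle can be sketched in a few lines of NumPy. This is an illustrative toy, not the PerTh algorithm itself: it embeds a tone roughly 60 dB below a dominant spectral peak, just a few hertz away, where a louder tone masks it from listeners — yet a sufficiently long FFT still resolves it as a distinct bin.

```python
import numpy as np

SR = 16_000                 # sample rate (Hz)
t = np.arange(SR) / SR      # 1 second of audio

carrier_hz = 440.0          # dominant tone, standing in for a voiced-speech peak
watermark_hz = 443.0        # a few hertz away, perceptually masked by the carrier

signal = np.sin(2 * np.pi * carrier_hz * t)
# Watermark tone at 1/1000 the amplitude (~ -60 dB relative to the carrier).
watermarked = signal + 0.001 * np.sin(2 * np.pi * watermark_hz * t)

# A 1-second FFT has 1 Hz resolution, enough to separate 440 Hz from 443 Hz.
spectrum = np.abs(np.fft.rfft(watermarked))
freqs = np.fft.rfftfreq(len(watermarked), 1 / SR)

carrier_bin = int(np.argmin(np.abs(freqs - carrier_hz)))
mark_bin = int(np.argmin(np.abs(freqs - watermark_hz)))
print(freqs[carrier_bin], freqs[mark_bin])  # 440.0 443.0
```

The watermark bin stands far above the noise floor even though it is inaudible next to the carrier — which is exactly the headroom a perceptual watermarker exploits to hide data.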
Ready to Build with Generative Voice?
Join developers already using Chatterbox in production