The only open-source TTS with built-in watermarking. Up to 6× faster than real-time on a GPU, with paralinguistic prompting, zero-shot cloning, and MIT license.
Outperforms proprietary closed-source models in head-to-head evaluations. Same prompts, same reference audio: zero-shot, no prompt engineering, no post-processing.
Chatterbox Turbo is the first open-source TTS that doesn't force you to choose between speed, expressiveness, and openness. It's fast, expressive, and MIT licensed, with every output authenticated by PerTh watermarking, so you can build voice AI that's both open and accountable.
Streaming-ready inference for voice assistants, interactive media, and low-latency agent loops.
Lean model size — alignment-informed generation keeps latency tight without sacrificing quality.
Zero-shot cloning from a few seconds of reference audio. No training run, no fine-tune required.
Everything you expect from a modern TTS model — and a few things no other open-source model ships with.
First open-source model with emotion exaggeration control. Adjust intensity from monotone to dramatically expressive with a single parameter.
Faster-than-realtime inference with alignment-informed generation. Perfect for real-time applications, voice assistants, and interactive media.
Clone any voice with just a few seconds of reference audio. No training required. Includes easy voice conversion scripts (see the usage sketch after this feature list).
Built-in PerTh watermarking on every generated audio. Know when content was created by Chatterbox while maintaining high audio quality.
Text-based tags that tell the model to perform natural vocal reactions in the cloned voice. Supported tags include [sigh], [gasp], [cough], and more.
A single pip install, comprehensive docs, and reference code. Built by developers, for developers — available on GitHub and Hugging Face.
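As a minimal quick-start for the cloning and exaggeration features above, here is a sketch that assumes the Python API published in the Chatterbox repo README (ChatterboxTTS.from_pretrained and a generate method with audio_prompt_path and exaggeration arguments); the Turbo checkpoint may expose a slightly different entry point, so treat the names as assumptions.

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

# Load the model onto a GPU; pass device="cpu" if CUDA is unavailable.
model = ChatterboxTTS.from_pretrained(device="cuda")

# Zero-shot cloning: condition generation on a few seconds of reference audio.
# "reference.wav" is a placeholder for your own clip.
wav = model.generate(
    "The launch is confirmed for tomorrow morning.",
    audio_prompt_path="reference.wav",
    exaggeration=0.7,  # single emotion-intensity knob: low = flat, high = dramatic
)

# model.sr is the model's output sample rate.
ta.save("output.wav", wav, model.sr)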
We ran a test through Podonos comparing Chatterbox Turbo against ElevenLabs Turbo v2.5, Cartesia Sonic 3, and VibeVoice 7B. All systems produced audio from 5–10 second reference clips and identical text inputs — zero-shot, no prompt engineering.
Voice AI that sounds human — complete with reactions and emotions expressed in sighs, laughs, and more.
Chatterbox Turbo introduces paralinguistic prompting: text-based tags that tell the model to perform natural vocal reactions in the cloned voice.
The model performs these reactions naturally, in the same cloned voice, with the same emotional tone — no post-processing, no splicing, no manual audio editing.
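To make that concrete, here is a sketch of paralinguistic prompting under the same assumed API as the quick-start above: the bracketed tags listed in the FAQ below go straight into the input text.

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Tags are written inline; the model performs them as vocal reactions
# in the cloned voice instead of reading the words aloud.
text = "Honestly [sigh] I did not see that coming. [laugh] Okay, one more take."
wav = model.generate(text, audio_prompt_path="reference.wav")
ta.save("reaction.wav", wav, model.sr)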
Every audio file generated by Chatterbox includes Resemble AI's PerTh (Perceptual Threshold) Watermarker, a deep neural network that embeds data imperceptibly and in a way that is difficult for third parties to detect or remove. This isn't just a feature; it's our commitment to responsible AI deployment.
The watermarker operates on principles of psychoacoustics: it exploits how humans perceive audio to find regions of the signal we can't hear, then encodes data into those regions.
The result: authenticated audio that sounds unchanged to humans but stays traceable for detection, provenance, and incident response.
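Detection follows the same pattern as embedding. The sketch below assumes the open-source perth package (resemble-perth on PyPI) and its PerthImplicitWatermarker interface as referenced from the Chatterbox repo; verify the names against your installed version.

import librosa
import perth

# Load audio that may have been generated by Chatterbox.
audio, sr = librosa.load("output.wav", sr=None)

# The same watermarker class used for embedding can extract the mark.
watermarker = perth.PerthImplicitWatermarker()
watermark = watermarker.get_watermark(audio, sample_rate=sr)

# A value near 1.0 indicates a PerTh watermark; near 0.0 means none detected.
print(f"Extracted watermark: {watermark}")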
Learn more about PerTh →
A single install, a permissive license, and two distribution channels. Pick your path; the model is the same on all of them.
Try Chatterbox Turbo inside Resemble AI — no install required. Use the hosted playground to test voices, tune emotion, and ship straight to production.
Use on Resemble AI →
MIT-licensed source, reference scripts, and voice conversion tools. Clone the repo, read the docs, open a PR.
View repo on GitHub →
Weights hosted on Hugging Face for fast pulls, versioning, and Spaces integration. Use with transformers or pin a revision.
$ pip install chatterbox-tts
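Once the package is installed, the voice conversion scripts mentioned above follow a pattern like this sketch, which assumes the ChatterboxVC interface from the repo's example scripts; the argument names are an assumption, so check the repo's voice conversion example for the exact signature.

import torchaudio as ta
from chatterbox.vc import ChatterboxVC

model = ChatterboxVC.from_pretrained(device="cuda")

# Convert the speech in source.wav into the voice heard in target.wav.
# Both paths are placeholders; argument names follow the repo's VC example.
wav = model.generate(audio="source.wav", target_voice_path="target.wav")
ta.save("converted.wav", wav, model.sr)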
Quick answers on licensing, performance, and what’s in the box.
Supported tags include [sigh], [gasp], [cough], and [laugh], which the model performs in the cloned voice with matching emotional tone. No splicing or manual editing required.