Chatterbox Multilingual V3 • Resemble TTS Research

Multilingual text-to-speech with compliance-oriented watermarking

Clone a voice once and hold it across 20+ languages. Every output is watermarked by default.

THE PROBLEM

Voice cloning across languages breaks down fast.

Most multilingual TTS models prioritize language count over output quality. Clone a voice in English, switch to Spanish, and the accent drifts, the rhythm changes, and the speaker sounds like someone else.

VOICE IDENTITY

The speaker sounds like themselves across every language.

Provide 10 seconds or more of reference audio. Chatterbox V3 captures voice identity, including timbre, accent, and rhythm, and holds it across every target language.
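
A reference clip can be checked against that 10-second minimum before it is submitted. The sketch below is a minimal illustration using only Python's standard wave module. It assumes the reference audio is an uncompressed WAV file; the function names and the MIN_REFERENCE_SECONDS constant are hypothetical helpers, not part of Chatterbox.

```python
import wave

# Minimum reference length stated above; the constant name is illustrative.
MIN_REFERENCE_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True when the clip meets the minimum reference duration."""
    return wav_duration_seconds(path) >= minimum
```

Rejecting a short clip up front is cheaper than discovering after generation that the clone had too little audio to work from.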

LANGUAGE COVERAGE

One model. No rebuilding per market.

Submit text in any of the 20+ supported languages. Chatterbox V3 synthesizes speech in the cloned voice with no separate model, training run, or vendor per language.

SECURITY

Every clip is watermarked before it leaves the model.

PerTh watermarking is embedded at generation, imperceptible to listeners, persistent through re-encoding, and verifiable on demand.

HOW THE MODEL WORKS

Broad coverage, consistent identity.

Chatterbox Multilingual is designed as a single model for teams who need broad language coverage without managing a separate model per language. The latest version improves across the dimensions that break down most in production: speaker consistency, output stability, and delivery quality.

WHAT'S NEW IN CHATTERBOX MULTILINGUAL V3

Three improvements that matter in production.


SPEAKER SIMILARITY

The cloned voice stays the cloned voice.

Voice identity and accent hold more consistently across language switches. Switch from English to Arabic or Japanese and the speaker still sounds like themselves.

HALLUCINATION REDUCTION

The model says what you gave it.

Less unwanted continuation, repetition, and off-prompt speech. These were persistent problems in earlier multilingual models, especially on longer inputs.

NATURALNESS

Sounds like it was meant to be said.

Better rhythm and delivery across all supported languages, optimized for voice agents and conversational AI where flat output breaks the experience.

WHAT IT DOES

One model for every language your product needs.

Multilingual by default. Production-ready by design.

Zero-shot voice cloning
Clone any voice from 10 seconds of reference audio. No fine-tuning, no training run. Ready across all supported languages immediately.
20+ languages
Arabic, Chinese, Czech, Dutch, English, Finnish, French, German, Hebrew, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and Vietnamese.
Expressive emotion control
Adjust emotion and intensity per generation. Consistent expressive range across all supported languages.
Secure by design
Every clip is watermarked with PerTh at generation, before it leaves the model. Imperceptible to listeners, persistent through re-encoding, and verifiable on demand. Because Resemble builds the model, Resemble Detect identifies any Chatterbox-generated audio with 100% accuracy.
Built for conversational AI
Fluid rhythm, natural delivery, and stable output on the short, varied inputs that voice agents produce at scale.
MIT licensed
Full model weights on Hugging Face. Self-host via pip or deploy on-premise. No vendor lock-in, no commercial restrictions.
LANGUAGES
20+ languages, supported at launch
Arabic (ar)
Chinese (zh)
Czech (cs)
Dutch (nl)
English (en)
Finnish (fi)
French (fr)
German (de)
Hebrew (he)
Hindi (hi)
Italian (it)
Japanese (ja)
Korean (ko)
Norwegian (no)
Polish (pl)
Portuguese (pt-pt, pt-br)
Russian (ru)
Spanish (es-es, es-mx)
Swedish (sv)
Turkish (tr)
Vietnamese (vi)
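
The language tags in the list above can be validated before text is submitted for synthesis. The following is a minimal sketch, not the Chatterbox API: the function names are hypothetical, and the codes "sv" for Swedish and "vi" for Vietnamese are assumed ISO 639-1 values. Because the list gives only regional tags for Portuguese and Spanish, the sketch rejects a bare "pt" or "es" rather than guessing a region.

```python
# Tags as given in the language list. "sv" and "vi" are assumed ISO 639-1
# codes; they are an assumption of this sketch, not confirmed identifiers.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "zh": "Chinese", "cs": "Czech", "nl": "Dutch",
    "en": "English", "fi": "Finnish", "fr": "French", "de": "German",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "no": "Norwegian", "pl": "Polish",
    "pt-pt": "Portuguese (Portugal)", "pt-br": "Portuguese (Brazil)",
    "ru": "Russian", "es-es": "Spanish (Spain)", "es-mx": "Spanish (Mexico)",
    "sv": "Swedish", "tr": "Turkish", "vi": "Vietnamese",
}


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and use hyphens, so 'pt_BR' becomes 'pt-br'."""
    return tag.strip().lower().replace("_", "-")


def resolve_language(tag: str) -> str:
    """Return a supported tag for the request, or raise ValueError."""
    normalized = normalize_tag(tag)
    if normalized in SUPPORTED_LANGUAGES:
        return normalized
    # Fall back from a regional tag to its base language, e.g. 'fr-ca' -> 'fr'.
    base = normalized.split("-")[0]
    if base in SUPPORTED_LANGUAGES:
        return base
    raise ValueError(f"Unsupported language tag: {tag!r}")
```

For example, resolve_language("pt_BR") returns "pt-br" and resolve_language("fr-CA") falls back to "fr", while an unlisted tag raises ValueError so the caller can handle it before any audio is generated.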
SINGLE LANGUAGE PACK

Dedicated models for priority languages.

When a specific language needs tighter quality control, stronger dialect behavior, or regional pronunciation accuracy, the Single Language Pack provides a purpose-built model for that language.

Chinese (Mandarin)
Stronger tone accuracy and regional pronunciation control.
View on Hugging Face
Spanish (Latin America)
Optimized for Latin American dialect coverage and regional accent fidelity.
View on Hugging Face
Spanish (Spain)
Castilian Spanish with Spain-specific pronunciation and cadence.
View on Hugging Face
Portuguese (Brazil)
Stronger dialect behavior and prosody control for Brazilian Portuguese.
View on Hugging Face
Portuguese (Portugal)
European Portuguese with regional accent and pronunciation accuracy.
View on Hugging Face
Hindi
Stronger language-specific pronunciation and natural delivery in Hindi.
View on Hugging Face
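
The choice between the general multilingual model and a dedicated Single Language Pack model can be written as a small routing rule. This is a hypothetical sketch: the locale tags, the model labels, and the needs_regional_accuracy flag are illustrative names, not published identifiers, and mapping Latin American Spanish to "es-mx" is an assumption.

```python
from dataclasses import dataclass

# Locales with a dedicated model, per the list above. The tags are
# illustrative; "es-mx" standing in for Latin American Spanish is assumed.
DEDICATED_LOCALES = {"zh", "es-mx", "es-es", "pt-br", "pt-pt", "hi"}


@dataclass(frozen=True)
class ModelChoice:
    kind: str    # "single-language-pack" or "multilingual"
    locale: str


def choose_model(locale: str, needs_regional_accuracy: bool) -> ModelChoice:
    """Pick a dedicated model only when one exists and the use case needs it."""
    tag = locale.strip().lower()
    if needs_regional_accuracy and tag in DEDICATED_LOCALES:
        return ModelChoice("single-language-pack", tag)
    return ModelChoice("multilingual", tag)
```

Defaulting to the multilingual model keeps a single deployment for most traffic and reserves the dedicated models for the markets where dialect or pronunciation accuracy matters most.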
LISTEN

Hear Chatterbox Multilingual

Six languages. Single Language Pack samples generated from dedicated per-language models.

CHINESE (MANDARIN)
但人生之中还是有一些白日梦的成分呐,像那个 Water Meet 一样,就是你会希望呃有一些故事发生。
HINDI
अब यह स्पष्ट है कि ग्लोबल वार्मिंग हो रही है औसत तापमान पच्चीस डिग्री सेल्सियस...
PORTUGUESE - BRAZIL
Havia um zumbido suave no ar enquanto o dia dava lugar à noite. As primeiras estrelas começaram a aparecer no céu...
PORTUGUESE - PORTUGAL
Olá, boa tarde. Fala o Tiago do apoio ao cliente da MEO. Só para confirmar, estou a falar com o titular da conta?
SPANISH - SPAIN
Hola. Llamo en nombre de SaludVital. ¿Es un buen momento para hablar?
SPANISH - LATIN AMERICA
Había un zumbido silencioso en el aire mientras el día daba paso a la noche. Las primeras estrellas comenzaron a aparecer en el cielo...
BUILT ON RESEMBLE AI

Chatterbox Multilingual works with the full Resemble stack.

Resemble Watermarker
Embed watermarks into generated audio and verify provenance via API. Every Chatterbox Multilingual output includes PerTh watermarking at generation.
Resemble Voice Creation
Clone a voice from 10 seconds of audio or generate one from a text description, then deploy it across all supported languages.
Resemble Detect
Because Resemble builds Chatterbox, Resemble Detect identifies Chatterbox-generated audio with 100% accuracy. Pair with PerTh watermarking for full provenance and detection coverage.
Frequently asked questions
What is Chatterbox Multilingual?
Chatterbox Multilingual is Resemble AI's general-purpose multilingual TTS model. It supports zero-shot voice cloning across 20+ languages from a single model, with improvements in speaker similarity, hallucination reduction, and naturalness of delivery.
What improved in the latest version?
Chatterbox V3 improves across three areas: speaker similarity (the cloned voice holds more consistently across language switches), hallucination reduction (less unwanted repetition or off-prompt continuation), and naturalness (more fluid, conversational output across all supported languages).
What is the Single Language Pack?
A set of dedicated per-language models for six priority languages and dialects: Chinese (Mandarin), Latin American Spanish, Castilian Spanish, Brazilian Portuguese, European Portuguese, and Hindi. Use it when a specific language needs tighter quality control, stronger dialect accuracy, or regional pronunciation behavior beyond what Chatterbox V3 provides as a general-purpose model.
Which languages does Chatterbox Multilingual support?
Chatterbox Multilingual supports 20+ languages with zero-shot voice cloning, including Arabic, Chinese, Czech, Dutch, English, Finnish, French, German, Hebrew, Hindi, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and Vietnamese.
How is Chatterbox Multilingual secured against misuse?
Every clip is automatically watermarked with PerTh at the point of generation. The watermark is imperceptible to listeners, survives re-encoding and format conversion, and can be verified on demand to confirm origin. Because Resemble builds and maintains the model, Resemble Detect can identify any Chatterbox-generated audio with 100% accuracy, giving teams a complete provenance and detection layer without additional setup.
Is Chatterbox Multilingual free to use commercially?
Yes. Chatterbox V3 is released under the MIT license. Use it in commercial products, modify it, and redistribute it, provided the MIT copyright and license notice is retained. PerTh watermarking is embedded in every generated clip so synthetic content remains detectable downstream.
How does Chatterbox V3 compare to ElevenLabs for multilingual voice cloning?
In blind evaluations of the Chatterbox model family, listeners preferred Chatterbox over ElevenLabs 63.75% of the time. Chatterbox V3 extends that quality benchmark across 20+ languages, with full model weights you can inspect, fine-tune, and self-host.