BENCHMARKS

Benchmarks for Deepfake Detection and Voice AI Models

Every result comes from public leaderboards or reproducible evaluations. We show methodology, source, and test conditions — not just the number.
97.9%
Audio detection accuracy • Speech DF Arena • pooled
#1
DFBench • Speech and Image detection
2 to 1
Listeners preferred Chatterbox Turbo over ElevenLabs Turbo v2.5
160+
Generative models covered in detection evaluations
Model

DETECT-3B Omni ranked #1 across audio, image, and video detection

DETECT-3B Omni is a 3-billion-parameter multimodal model, evaluated on public leaderboards across 50+ languages and 160+ generative models.
96.7%
Detection accuracy, DFBench Speech
3B params · Updated March 2026
AUDIO
97.9%
Pooled accuracy · Speech DF Arena
96.7% on DFBench Speech. EER of 2.099% across 40+ languages, including audio passed through telephony codecs G.711 and G.723.1.
#1 DFBENCH SPEECH
IMAGE
85.1%
Overall accuracy · DFBench Image 2025
94.7% on real images. 75.4% on fakes. 90.4% on PNG. 90.8% on TIFF. Evaluated on text-to-image and image-to-image outputs.
#1 DFBENCH IMAGE
VIDEO
~2.5%
Overall EER, video generation models
>99% on Veo 2. ~95% on Veo 3. No prior enrollment required — raw signal analysis.
FACE SWAP • LIP SYNC • FULL BODY
Deepfake detection benchmarks
RESEMBLE AI

COMPETITOR
Speech deepfake detection — Equal Error Rate (EER)
Lower scores indicate better discrimination between real and synthetic speech.
Resemble Detect 3B · 2.099%
Hiya Authenticity Verif. · 2.324%
DLMSL SpeakSure v0.1 · 6.142%
Whispeak · 8.060%
DF-Raptor · 8.35%
EER % (lower = better)
Speech DeepFake Arena · Hugging Face · March 2026 · Source
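EER is the operating point where the false accept rate (real speech flagged as fake) equals the false reject rate (fake speech passed as real). A minimal sketch of how it can be computed from detector scores, assuming higher scores mean "more likely synthetic":

```python
import numpy as np

def equal_error_rate(real_scores, fake_scores):
    """Find the threshold where the false accept rate (real flagged as
    fake) comes closest to the false reject rate (fake passed as real).
    Scores: higher = detector believes the audio is more likely synthetic."""
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.concatenate([real_scores, fake_scores])):
        far = float(np.mean(real_scores >= t))   # real audio flagged
        frr = float(np.mean(fake_scores < t))    # fake audio missed
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

# Toy demo with synthetic score distributions
rng = np.random.default_rng(0)
real = rng.normal(0.2, 0.1, 1000)   # scores for genuine clips
fake = rng.normal(0.8, 0.1, 1000)   # scores for deepfake clips
print(f"EER: {equal_error_rate(real, fake):.3%}")
```

In practical terms, an EER of 2.099% means that at the balanced operating point roughly 2 in 100 real clips are flagged and roughly 2 in 100 fakes slip through.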
Audio deepfake detection — Accuracy (Speech DF Arena)
Pooled accuracy across all test sets · Higher = better · March 2026
Resemble Detect 3B · 97.9%
Hiya Authenticity Verif. · 97.7%
DLMSL SpeakSure v0.1 · 93.9%
DF-Raptor · 92.3%
Whispeak · 91.9%
Pooled accuracy (higher = better)
Test sets: ASV2019, ASV2021, Codecfake, ADD 2022, and 8 other held-out sets. Source
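The leaderboard's exact pooling scheme isn't specified here; a common convention is the micro-average, where every sample counts equally regardless of which test set it came from, so larger sets weigh more. A sketch with made-up set sizes:

```python
def pooled_accuracy(per_set):
    """Micro-averaged accuracy: total correct / total samples across all
    test sets, so larger sets carry proportionally more weight."""
    correct = sum(c for c, n in per_set)
    total = sum(n for c, n in per_set)
    return correct / total

# (correct, size) per test set -- made-up numbers for illustration
sets = [(980, 1000), (450, 500), (95, 100)]
print(pooled_accuracy(sets))                     # micro average
print(sum(c / n for c, n in sets) / len(sets))   # unweighted macro average
```

The two averages diverge whenever test sets differ in size, which is why pooled figures and per-set figures (like the DFBench Speech number above) do not match.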
LANGUAGE COVERAGE
Over 50 languages detected
ALSO SUPPORTED BY CHATTERBOX TTS
Arabic
Chinese
Danish
Dutch
English
Finnish
French
German
Greek
Hebrew
Hindi
Italian
Japanese
Korean
Malay
Norwegian
Polish
Portuguese
Russian
Spanish
Swahili
Swedish
Turkish
Ukrainian
Vietnamese
Thai
Indonesian
Romanian
Czech
Slovak
+ over 20 additional languages
Validated against MLAADv8. Detection relies on generation artifact patterns, not language-specific features — enabling generalization to languages not seen during training. EER and accuracy figures from the Hugging Face Speech DF Arena and DFBench Speech/Image leaderboards, March 2026. Resemble AI does not control test set composition. Image figures from DFBench Image 2025. Video EER from internal evaluations on held-out test sets. Read full methodology →
GENERATE

Chatterbox Turbo preferred 2 to 1 over ElevenLabs in blind evaluation

Human listeners tested Chatterbox Turbo in blind A/B comparisons. ELO ratings were derived from those results, with ~2,500 pairwise evaluations per matchup.
RESEMBLE AI

COMPETITOR
65.3%
Win rate vs. ElevenLabs Turbo v2.5
Blind A/B · ~2,500 evals · Podonos
Chatterbox Turbo TTS model rankings — ELO Score
Evaluators rated pairs without knowing the source model · Higher ELO = better
Chatterbox Turbo · 1,200
Qwen 3 TTS* · 1,177
Cartesia Sonic 3 · 1,165
VibeVoice 7B · 1,102
ElevenLabs Turbo v2.5 · 1,050
ELO rating
Chatterbox Turbo anchored at 1,200. Bradley-Terry model fit to blind A/B listening tests. *Qwen 3 TTS: ~150 evals, treat as directional; all others ~2,500.
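Under the Bradley-Terry model used here, a pairwise win probability maps directly onto an ELO-style rating gap. A sketch of that conversion; splitting ties evenly between the two models is an assumption, not something the results state:

```python
import math

def elo_gap(p_win):
    """Rating difference implied by a pairwise win probability under the
    Bradley-Terry / logistic model (base 10, scale 400, as in chess ELO)."""
    return 400 * math.log10(p_win / (1 - p_win))

# ElevenLabs matchup: 65.3% wins, 24.5% losses, 10.2% ties.
# Assumption: credit half of the ties to each model.
p = 0.653 + 0.102 / 2
print(f"implied gap: {elo_gap(p):.0f} ELO points")
```

With ties split evenly, this implies a gap of roughly 150 points, consistent with the 1,200 vs. 1,050 ratings in the chart above.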
Chatterbox Turbo head-to-head win rates
Percentage indicates preference
vs. ElevenLabs Turbo v2.5 · 65.3% win · 24.5% loss
vs. VibeVoice 7B · 59.0% win · 31.6% loss
vs. Cartesia Sonic 3 · 49.8% win · 39.8% loss
vs. Qwen 3 TTS* · 42.7% win · 36.0% loss
Win rate % (higher = better)
Tie rates excluded from bars. vs ElevenLabs: 10.2% · vs VibeVoice: 9.3% · vs Cartesia: 10.4% · vs Qwen 3: 21.3%
Latency by model variant
Time to first audio · Cloud API · A100 GPU · 50-word input
Model · Time to first audio · Best for · Languages · License
Chatterbox Turbo · <150 ms · Voice agents, real-time · English · MIT
Chatterbox Pro · <300 ms · Enterprise, on-prem · 23 languages · Enterprise
Chatterbox · <500 ms · Narration, content · English · MIT
Chatterbox Multilingual · <500 ms · Global deployment · 23 languages · MIT
Latency = time from API call to first audio byte. Cloud API figures on A100 GPU. On-prem latency varies by hardware configuration.
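Time to first audio is straightforward to measure client-side: start a timer when the request is issued and stop it when the first chunk arrives. A generic sketch, with `stream_fn` standing in for whatever streaming call your client library exposes (the simulated stream is hypothetical, not a Resemble API):

```python
import time

def time_to_first_audio(stream_fn):
    """Time from issuing the request to receiving the first audio chunk.
    stream_fn: zero-arg callable returning an iterator of byte chunks."""
    start = time.perf_counter()
    chunks = stream_fn()          # issue the request
    first = next(chunks)          # block until the first chunk lands
    return time.perf_counter() - start, first

# Simulated stream standing in for a real streaming TTS call.
def simulated_stream(delay_s=0.05):
    time.sleep(delay_s)           # model + network latency before first byte
    yield b"\x00" * 320           # 10 ms of 16-bit, 16 kHz audio
    yield b"\x00" * 320

latency, first_chunk = time_to_first_audio(simulated_stream)
print(f"time to first audio: {latency * 1000:.0f} ms")
```

Measuring from the client like this captures network transit as well as model inference, which is why published figures should state the test conditions (here: cloud API, A100 GPU, 50-word input).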
VERIFY

PerTh watermarker: survives compression, re-encoding, and attacks

Detection accuracy across 18 real-world attack conditions. PerTh V2 ships with improved robustness on pitch shift and spectral manipulation.
~100%
Detection on clean and compressed audio
PerTh V2 · No-attack + standard codecs
STRONG (>90%)
MODERATE (70-90%)
WEAK (<70%)
PerTh — Production
Current live model · 18 attack conditions
wav_dither_attack · 100%
random_wav_wavelet_noise · 90%
random_wav_reverb_attack · 95%
random_wav_resample_attack · 100%
random_wav_precision_attack · 100%
random_wav_pitch_shift_attack · 10%
random_wav_mulaw_attack · 100%
random_wav_high_pass_attack · 100%
random_wav_gaussian_noise_clipped · 45%
random_spec_time_mask · 100%
random_spec_stretch · 100%
random_spec_scale · 100%
random_spec_lowclip · 100%
random_spec_highclip · 100%
random_spec_gaussian_noise_clipped · 100%
random_spec_contiguous_band_mask · 72%
no_watermark · 100%
no_attack · 100%
Detection accuracy (higher = better)
PerTh V2 — Coming soon
Gains on pitch shift and spectral attacks; regression on reverb
wav_dither_attack · 100%
random_wav_wavelet_noise · 100%
random_wav_reverb_attack · 40%
random_wav_resample_attack · 100%
random_wav_precision_attack · 100%
random_wav_pitch_shift_attack · 75%
random_wav_mulaw_attack · 100%
random_wav_high_pass_attack · 100%
random_wav_gaussian_noise_clipped · 100%
random_spec_time_mask · 98%
random_spec_stretch · 98%
random_spec_scale · 100%
random_spec_lowclip · 100%
random_spec_highclip · 100%
random_spec_gaussian_noise_clipped · 98%
random_spec_contiguous_band_mask · 98%
no_watermark · 100%
no_attack · 100%
Detection accuracy (higher = better)
Watermark robustness = detection accuracy after each attack transform. "no_attack" = clean audio. "no_watermark" = accuracy on unwatermarked audio, i.e. correct rejections (100% = zero false positives). PerTh V2 figures are pre-release internal evaluations.
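The robustness methodology generalizes to any watermarker: embed, apply an attack transform, then check whether detection still succeeds. A toy spread-spectrum sketch of that harness; the embed/detect scheme here is purely illustrative, not how PerTh works:

```python
import numpy as np

def embed(audio, key, alpha=0.05):
    """Toy spread-spectrum embed: add a low-amplitude pseudorandom
    pattern derived from a secret key."""
    pattern = np.random.default_rng(key).standard_normal(len(audio))
    return audio + alpha * pattern

def detect(audio, key, threshold=3.0):
    """Normalized correlation against the key's pattern. For unwatermarked
    noise-like audio the score is ~N(0, 1), so 3.0 keeps false positives rare."""
    pattern = np.random.default_rng(key).standard_normal(len(audio))
    score = float(audio @ pattern) / np.sqrt(len(audio))
    return score > threshold

# Robustness harness: run the detector after each attack transform.
rng = np.random.default_rng(0)
attacks = {
    "no_attack": lambda x: x,
    "gaussian_noise": lambda x: x + 0.01 * rng.standard_normal(len(x)),
    "precision": lambda x: np.round(x * 127) / 127,  # crude 8-bit quantization
}
host = rng.standard_normal(16000)       # 1 s stand-in for real audio
marked = embed(host, key=42)
for name, attack in attacks.items():
    print(name, "detected" if detect(attack(marked), key=42) else "missed")
```

Attacks like pitch shift desynchronize the sample-aligned pattern entirely, which is why plain correlation schemes fail on them and why production watermarkers add synchronization machinery on top.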
DEPLOYMENT / INFRA

Deploy on the infrastructure that meets your needs.

Every Resemble AI model — Chatterbox, DETECT-3B Omni, and PerTh — runs on cloud API, on-prem, or air-gapped. Pick the deployment model that fits your security posture and latency requirements.

CLOUD API
Managed service

RESTful API, 99.9% uptime SLA, auto-scaling. Fastest time to value.

ON-PREMISES
Your infrastructure

Docker / Kubernetes. No data leaves your network. NVIDIA GPU optimized.

AIR-GAPPED
Zero internet dependency

Run Resemble AI with no external connectivity: fully isolated compute, private networking, and compliance with your security policies, in your own environment or a dedicated AWS, GCP, or Azure VPC.

All infra partners + integrations
Get complete generative AI security
Join thousands of developers and enterprises building securely with Resemble AI