BENCHMARKS

Benchmarks for Deepfake Detection and Voice AI Models

Every result comes from public leaderboards or reproducible evaluations. We show methodology, source, and test conditions — not just the number.
97.9%
Audio detection accuracy • Speech DF Arena • pooled
#1
DFBench • Speech and Image detection
2 to 1
Listeners preferred Chatterbox Turbo over ElevenLabs Turbo v2.5
160+
Generative models covered in detection evaluations
Model

DETECT-3B Omni ranked #1 across audio, image, and video detection

DETECT-3B Omni is a 3-billion-parameter multimodal model, evaluated on public leaderboards across 50+ languages and 160+ generative models.
96.7%
Detection accuracy, DFBench Speech
3B params · Updated March 2026
AUDIO
97.9%
Pooled accuracy · Speech DF Arena
96.7% on DFBench Speech. EER of 2.099% across 40+ languages, including audio passed through telephony codecs G.711 and G.723.1.
#1 DFBENCH SPEECH
IMAGE
85.1%
Overall accuracy · DFBench Image 2025
94.7% on real images. 75.4% on fakes. 90.4% on PNG. 90.8% on TIFF. Evaluated on text-to-image and image-to-image outputs.
#1 DFBENCH IMAGE
VIDEO
~2.5%
Overall EER, video generation models
>99% on Veo 2. ~95% on Veo 3. No prior enrollment required — raw signal analysis.
FACE SWAP • LIP SYNC • FULL BODY
Deepfake detection benchmarks
RESEMBLE AI

COMPETITOR
Speech deepfake detection — Equal Error Rate (EER)
Lower scores indicate better discrimination between real and synthetic speech.
Resemble Detect 3B · 2.099%
Hiya Authenticity Verif. · 2.324%
DLMSL SpeakSure v0.1 · 6.142%
Whispeak · 8.060%
DF-Raptor · 8.35%
EER % (lower = better)
Speech DeepFake Arena · Hugging Face · March 2026 · Source
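EER is the operating point where the false accept rate (real speech flagged as fake) equals the false reject rate (fake speech passed as real). A minimal sketch of how it can be computed from detector scores, assuming higher scores mean "more likely synthetic":

```python
import numpy as np

def equal_error_rate(real_scores, fake_scores):
    """Find the threshold where the false accept rate (real flagged as
    fake) comes closest to the false reject rate (fake passed as real).
    Scores: higher = detector believes the audio is more likely synthetic."""
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.concatenate([real_scores, fake_scores])):
        far = float(np.mean(real_scores >= t))   # real audio flagged
        frr = float(np.mean(fake_scores < t))    # fake audio missed
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

# Toy demo with synthetic score distributions
rng = np.random.default_rng(0)
real = rng.normal(0.2, 0.1, 1000)   # scores for genuine clips
fake = rng.normal(0.8, 0.1, 1000)   # scores for deepfake clips
print(f"EER: {equal_error_rate(real, fake):.3%}")
```

In practical terms, an EER of 2.099% means that at the balanced operating point roughly 2 in 100 real clips are flagged and roughly 2 in 100 fakes slip through.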
Audio deepfake detection — Accuracy (Speech DF Arena)
Pooled accuracy across all test sets · Higher = better · March 2026
Resemble Detect 3B · 97.9%
Hiya Authenticity Verif. · 97.7%
DLMSL SpeakSure v0.1 · 93.9%
DF-Raptor · 92.3%
Whispeak · 91.9%
Pooled accuracy (higher = better)
Test sets: ASV2019, ASV2021, Codecfake, ADD 2022, and 8 other held-out sets. Source
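The leaderboard's exact pooling scheme isn't specified here; a common convention is the micro-average, where every sample counts equally regardless of which test set it came from, so larger sets weigh more. A sketch with made-up set sizes:

```python
def pooled_accuracy(per_set):
    """Micro-averaged accuracy: total correct / total samples across all
    test sets, so larger sets carry proportionally more weight."""
    correct = sum(c for c, n in per_set)
    total = sum(n for c, n in per_set)
    return correct / total

# (correct, size) per test set -- made-up numbers for illustration
sets = [(980, 1000), (450, 500), (95, 100)]
print(pooled_accuracy(sets))                     # micro average
print(sum(c / n for c, n in sets) / len(sets))   # unweighted macro average
```

The two averages diverge whenever test sets differ in size, which is why pooled figures and per-set figures (like the DFBench Speech number above) do not match.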
LANGUAGE COVERAGE
Over 50 languages detected
ALSO SUPPORTED BY CHATTERBOX TTS
Arabic
Chinese
Danish
Dutch
English
Finnish
French
German
Greek
Hebrew
Hindi
Italian
Japanese
Korean
Malay
Norwegian
Polish
Portuguese
Russian
Spanish
Swahili
Swedish
Turkish
Ukrainian
Vietnamese
Thai
Indonesian
Romanian
Czech
Slovak
+ over 20 additional languages
Validated against MLAADv8. Detection relies on generation artifact patterns, not language-specific features — enabling generalization to languages not seen during training. EER and accuracy figures from the Hugging Face Speech DF Arena and DFBench Speech/Image leaderboards, March 2026. Resemble AI does not control test set composition. Image figures from DFBench Image 2025. Video EER from internal evaluations on held-out test sets. Read full methodology →
GENERATE

Chatterbox Turbo preferred 2 to 1 over ElevenLabs in blind evaluation

Human listeners tested Chatterbox Turbo in blind A/B comparisons. ELO ratings were derived from those results, with ~2,500 pairwise evaluations per matchup.
RESEMBLE AI

COMPETITOR
65.3%
Win rate vs. ElevenLabs Turbo v2.5
Blind A/B · ~2,500 evals · Podonos
Chatterbox Turbo TTS model rankings — ELO Score
Evaluators rated pairs without knowing the source model · Higher ELO = better
Chatterbox Turbo · 1,200
Qwen 3 TTS* · 1,177
Cartesia Sonic 3 · 1,165
VibeVoice 7B · 1,102
ElevenLabs Turbo v2.5 · 1,050
ELO rating
Chatterbox Turbo anchored at 1,200. Bradley-Terry model fit to blind A/B listening tests. *Qwen 3 TTS: ~150 evals, treat as directional; all others ~2,500.
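Under the Bradley-Terry model used here, a pairwise win probability maps directly onto an ELO-style rating gap. A sketch of that conversion; splitting ties evenly between the two models is an assumption, not something the results state:

```python
import math

def elo_gap(p_win):
    """Rating difference implied by a pairwise win probability under the
    Bradley-Terry / logistic model (base 10, scale 400, as in chess ELO)."""
    return 400 * math.log10(p_win / (1 - p_win))

# ElevenLabs matchup: 65.3% wins, 24.5% losses, 10.2% ties.
# Assumption: credit half of the ties to each model.
p = 0.653 + 0.102 / 2
print(f"implied gap: {elo_gap(p):.0f} ELO points")
```

With ties split evenly, this implies a gap of roughly 150 points, consistent with the 1,200 vs. 1,050 ratings in the chart above.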
Chatterbox Turbo head-to-head win rates
Percentage indicates preference
vs. ElevenLabs Turbo v2.5 · 65.3% win · 24.5% loss
vs. VibeVoice 7B · 59.0% win · 31.6% loss
vs. Cartesia Sonic 3 · 49.8% win · 39.8% loss
vs. Qwen 3 TTS* · 42.7% win · 36.0% loss
Win rate % (higher = better)
Tie rates excluded from bars. vs ElevenLabs: 10.2% · vs VibeVoice: 9.3% · vs Cartesia: 10.4% · vs Qwen 3: 21.3%
Latency by model variant
Time to first audio · Cloud API · A100 GPU · 50-word input
Model · Time to first audio · Best for · Languages · License
Chatterbox Turbo · <150 ms · Voice agents, real-time · English · MIT
Chatterbox Pro · <300 ms · Enterprise, on-prem · 23 languages · Enterprise
Chatterbox · <500 ms · Narration, content · English · MIT
Chatterbox Multilingual · <500 ms · Global deployment · 23 languages · MIT
Latency = time from API call to first audio byte. Cloud API figures on A100 GPU. On-prem latency varies by hardware configuration.
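Time to first audio is straightforward to measure client-side: start a timer when the request is issued and stop it when the first chunk arrives. A generic sketch, with `stream_fn` standing in for whatever streaming call your client library exposes (the simulated stream is hypothetical, not a Resemble API):

```python
import time

def time_to_first_audio(stream_fn):
    """Time from issuing the request to receiving the first audio chunk.
    stream_fn: zero-arg callable returning an iterator of byte chunks."""
    start = time.perf_counter()
    chunks = stream_fn()          # issue the request
    first = next(chunks)          # block until the first chunk lands
    return time.perf_counter() - start, first

# Simulated stream standing in for a real streaming TTS call.
def simulated_stream(delay_s=0.05):
    time.sleep(delay_s)           # model + network latency before first byte
    yield b"\x00" * 320           # 10 ms of 16-bit, 16 kHz audio
    yield b"\x00" * 320

latency, first_chunk = time_to_first_audio(simulated_stream)
print(f"time to first audio: {latency * 1000:.0f} ms")
```

Measuring from the client like this captures network transit as well as model inference, which is why published figures should state the test conditions (here: cloud API, A100 GPU, 50-word input).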
VERIFY

PerTh watermarker: survives compression, re-encoding, and attacks

Detection accuracy across 18 real-world attack conditions. PerTh V2 ships with improved robustness on pitch shift and spectral manipulation.
~100%
Detection on clean and compressed audio
PerTh V2 · No-attack + standard codecs
STRONG (>90%)
MODERATE (70-90%)
WEAK (<70%)
PerTh — Production
Current live model · 18 attack conditions
wav_dither_attack · 100%
random_wav_wavelet_noise · 90%
random_wav_reverb_attack · 95%
random_wav_resample_attack · 100%
random_wav_precision_attack · 100%
random_wav_pitch_shift_attack · 10%
random_wav_mulaw_attack · 100%
random_wav_high_pass_attack · 100%
random_wav_gaussian_noise_clipped · 45%
random_spec_time_mask · 100%
random_spec_stretch · 100%
random_spec_scale · 100%
random_spec_lowclip · 100%
random_spec_highclip · 100%
random_spec_gaussian_noise_clipped · 100%
random_spec_contiguous_band_mask · 72%
no_watermark · 100%
no_attack · 100%
Detection accuracy (higher = better)
PerTh V2 — Coming soon
Gains on pitch shift and spectral attacks; regression on reverb
wav_dither_attack · 100%
random_wav_wavelet_noise · 100%
random_wav_reverb_attack · 40%
random_wav_resample_attack · 100%
random_wav_precision_attack · 100%
random_wav_pitch_shift_attack · 75%
random_wav_mulaw_attack · 100%
random_wav_high_pass_attack · 100%
random_wav_gaussian_noise_clipped · 100%
random_spec_time_mask · 98%
random_spec_stretch · 98%
random_spec_scale · 100%
random_spec_lowclip · 100%
random_spec_highclip · 100%
random_spec_gaussian_noise_clipped · 98%
random_spec_contiguous_band_mask · 98%
no_watermark · 100%
no_attack · 100%
Detection accuracy (higher = better)
Watermark robustness = detection accuracy after each attack transform. "no_attack" = clean audio. "no_watermark" = accuracy on unwatermarked audio, i.e. correct rejections (100% = zero false positives). PerTh V2 figures are pre-release internal evaluations.
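The robustness methodology generalizes to any watermarker: embed, apply an attack transform, then check whether detection still succeeds. A toy spread-spectrum sketch of that harness; the embed/detect scheme here is purely illustrative, not how PerTh works:

```python
import numpy as np

def embed(audio, key, alpha=0.05):
    """Toy spread-spectrum embed: add a low-amplitude pseudorandom
    pattern derived from a secret key."""
    pattern = np.random.default_rng(key).standard_normal(len(audio))
    return audio + alpha * pattern

def detect(audio, key, threshold=3.0):
    """Normalized correlation against the key's pattern. For unwatermarked
    noise-like audio the score is ~N(0, 1), so 3.0 keeps false positives rare."""
    pattern = np.random.default_rng(key).standard_normal(len(audio))
    score = float(audio @ pattern) / np.sqrt(len(audio))
    return score > threshold

# Robustness harness: run the detector after each attack transform.
rng = np.random.default_rng(0)
attacks = {
    "no_attack": lambda x: x,
    "gaussian_noise": lambda x: x + 0.01 * rng.standard_normal(len(x)),
    "precision": lambda x: np.round(x * 127) / 127,  # crude 8-bit quantization
}
host = rng.standard_normal(16000)       # 1 s stand-in for real audio
marked = embed(host, key=42)
for name, attack in attacks.items():
    print(name, "detected" if detect(attack(marked), key=42) else "missed")
```

Attacks like pitch shift desynchronize the sample-aligned pattern entirely, which is why plain correlation schemes fail on them and why production watermarkers add synchronization machinery on top.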
DEPLOYMENT / INFRA

Deploy on the infrastructure that meets your needs.

Every Resemble AI model — Chatterbox, DETECT-3B Omni, and PerTh — runs on cloud API, on-prem, or air-gapped. Pick the deployment model that fits your security posture and latency requirements.

CLOUD API
Managed service

RESTful API, 99.9% uptime SLA, auto-scaling. Fastest time to value.

ON-PREMISES
Your infrastructure

Docker / Kubernetes. No data leaves your network. NVIDIA GPU optimized.

AIR-GAPPED
Zero internet dependency

Run Resemble AI with no external connectivity: fully isolated compute, private networking, and compliance with your security policies, in your own environment or a dedicated AWS, GCP, or Azure VPC.

All infra partners + integrations
Get complete generative AI security
Join thousands of developers and enterprises building securely with Resemble AI