DETECT-3B Omni | Resemble Research

The leading multimodal deepfake detection model

Three billion parameters and three modalities. One unified model that ranks #1 on public benchmarks for detecting AI-generated voice, image, and video. Available through a single API and deployable on-premises.

The leading deepfake detector on Hugging Face.

DETECT-3B Omni holds the top rank across the major public detection leaderboards — both legacy speech-only benchmarks and the newer multi-modal suites that cover image as well as speech detection.

#1

Speech DeepFake Arena

Top of the public leaderboard for synthetic speech detection — evaluated across diverse TTS and voice-cloning methods.

#1

DFBench — Image

Ranked first for image deepfake detection across GAN, diffusion, and commercial generator outputs.

#1

DFBench — Speech

Ranked first for speech deepfake detection on the DFBench evaluation suite.

One attack. Many modalities.

A deepfaked executive might appear in a video call, a cloned voice message, and a manipulated photograph — often within the same attack campaign. Defending against these threats requires detection that spans every modality without sacrificing accuracy in any single one.

A unified 3B-parameter model. DETECT-3B Omni combines our proven audio detection capabilities with new vision models trained on millions of images and hundreds of thousands of video clips. The result is state-of-the-art detection across speech, image, and video through a single API.

Built on DETECT-2B. The audio component shares architectural DNA with DETECT-2B but adds significantly expanded training data, improved robustness to real-world conditions, and protection against emerging attack vectors like replay attacks.

New vision stack. The image and video components introduce comprehensive coverage of major generative architectures and commercial AI tools, with particular attention to the rapid evolution of video generation models.

One API, three modalities. Submit media individually or in batches. The API automatically routes content to the appropriate sub-model and returns granular predictions along with aggregated authenticity scores.
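As a sketch of what "aggregated authenticity scores" could mean in practice, here is one possible roll-up of per-segment predictions into a single score. This is an illustrative assumption — the actual aggregation method and response fields of the Detect API are not specified here:

```python
def authenticity_score(fake_probs: list[float]) -> float:
    """Roll per-segment fake probabilities up into one authenticity score.

    Hypothetical aggregation for illustration: the clip is treated as
    only as authentic as its most suspicious segment, so we return
    1 - max. The real Detect API's aggregation is not documented here.
    """
    if not fake_probs:
        raise ValueError("no predictions to aggregate")
    return 1.0 - max(fake_probs)
```

Under this scheme, segment fake probabilities of 0.05, 0.10, and 0.50 would yield an authenticity score of 0.5: the overall verdict is driven by the most suspicious segment rather than the average.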

Video Audio Image
A single model covering the three modalities adversaries combine in coordinated attacks — audio, image, and video — routed through one unified API.

Audio, image, video — one model.

Each modality is handled by a specialized sub-model tuned for the generators and attack surfaces it will encounter in production. Benchmarks below reflect current reported accuracy.

Audio

Voice & speech detection

Equal error rate consistently below 6%. Stable across MP3, OGG, AAC, and telephony codecs — including G.711, G.723.1, and 8-bit PCMu/PCMa — for call-center and VoIP pipelines.

  • Overall EER: < 6%
  • Speech DeepFake Arena: #1
  • DFBench — Speech: #1
  • Language coverage: 51+
  • Replay attack defense: Yes

Image

Image deepfake detection

Trained on millions of images across major generative architectures and commercial tools. Also flags partially edited content where only a region of an authentic image has been synthesized.

  • StyleGAN 2 / 3: > 99%
  • GPT-4o / Flux v2 / Gemini 2.0: > 99%
  • DALL·E 3: 98%
  • Midjourney v7: 98%
  • Stable Diffusion: 94%
  • Partial-edit detection: > 85%

Video

Video deepfake detection

Overall EER of ~4.5%. Strong coverage of the most advanced video generators, with reliable performance across varying video lengths, resolutions, and compression levels.

  • Overall EER: ~ 4.5%
  • Google Veo 2 accuracy: > 99%
  • Google Veo 3 accuracy: ~ 95%
  • Length / resolution: Stable
  • Compression robustness: Yes

Hardened for production audio, not demos.

The audio model shares DNA with DETECT-2B, but every layer has been upgraded for the messy reality of real-world audio pipelines — compressed, re-encoded, telephony-routed, or replayed through physical devices.

Expanded training data

Substantially more data than DETECT-2B, with augmentations for noise conditions, compression artifacts, and recording-quality variations.

Telephony & codec coverage

Stable across MP3, OGG, AAC plus telephony codecs G.711, G.723.1, and 8-bit PCMu/PCMa — critical for enterprise call-center deployments.

51 languages

Validated against MLAADv9. Performs well even on languages not seen during training, indicating language-agnostic artifact learning.

Modern attack coverage

Detects outputs from IndexTTS, Qwen2.5-Omni, NaturalSpeech2, and other state-of-the-art voice AI providers. Training refreshed continuously.

Replay attack protection

Specific defenses against adversaries who record synthetic audio through a physical device to evade digital-artifact detectors. Research accepted at Interspeech 2025.

< 6% EER, sustained

Equal error rate consistently below 6% across evaluation datasets — and #1 on the public Speech DeepFake Arena benchmark.
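Equal error rate is the operating point where the false-accept rate (real audio flagged as fake) equals the false-reject rate (fake audio passed as real). A minimal computation sketch, assuming higher detector scores mean "more likely fake":

```python
import numpy as np

def equal_error_rate(real_scores: np.ndarray, fake_scores: np.ndarray) -> float:
    """Sweep score thresholds; report the point where FAR and FRR cross.

    real_scores / fake_scores are detector outputs on genuine and
    synthetic samples, with higher meaning "more likely fake".
    """
    thresholds = np.sort(np.concatenate([real_scores, fake_scores]))
    eer, gap = 1.0, float("inf")
    for t in thresholds:
        far = float(np.mean(real_scores >= t))  # real flagged as fake
        frr = float(np.mean(fake_scores < t))   # fake passed as real
        if abs(far - frr) < gap:                # closest FAR/FRR crossing
            gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

On perfectly separated scores the function returns 0.0; in the worst case it approaches 0.5 (the detector is guessing). A sub-6% EER means both error rates can be held below 6% simultaneously at a single threshold.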

Coverage that keeps up with the generators.

As commercial image and video generation tools proliferate, detection has to keep pace with their outputs. The vision stack is trained on millions of images and hundreds of thousands of video clips from foundational architectures, commercial APIs, and partially edited content.

Foundational architectures

>99% accuracy on StyleGAN 2 and StyleGAN 3. DALL·E 3 at 98%, Stable Diffusion variants at 94%.

Commercial tools

>99% accuracy on GPT-4o, Flux v2, Gemini 2.0 Flash. Midjourney v7 detected at 98%. Coverage expands as new tools ship.

Partial-edit detection

Many real-world threats involve authentic images with localized AI-generated modifications. Partial-edit accuracy exceeds 85%.

Google Veo 2 & 3

>99% accuracy on Veo 2 and ~95% on Veo 3 — covering the most advanced publicly available video generators.

Resolution-agnostic

Reliable performance across varying video lengths, resolutions, and compression levels — from social uploads to broadcast.

~9% pooled image EER

Across pooled test sets spanning many generators, the image sub-model achieves an equal error rate of approximately 9%.

< 6%
Audio equal error rate across evaluation datasets — #1 on Speech DeepFake Arena.
~ 9%
Pooled image EER across GAN, diffusion, and commercial generator test sets.
~ 4.5%
Video equal error rate — >99% accuracy on Veo 2, ~95% on Veo 3, stable across lengths and resolutions.

Detection that speaks the world's languages.

Validated against MLAADv9, which spans 51 languages across diverse linguistic families. Because the model learns language-agnostic manipulation artifacts, performance holds up even on languages not seen during training.

51 languages

Significant expansion from the initial DETECT-2B release. Benchmarked against MLAADv9 — a multi-language anti-spoofing dataset covering 51 languages.

Language-agnostic artifacts. Detection ability transfers to unseen languages because the model learns generator fingerprints in frequency and temporal structure — not phonetic content. Critical for deployments where user-generated audio arrives in dozens of locales.

English
Spanish
French
German
Italian
Portuguese
Dutch
Polish
Russian
Ukrainian
Czech
Romanian
Greek
Turkish
Arabic
Hebrew
Hindi
Bengali
Tamil
Chinese
Japanese
Korean
Vietnamese
+28 more

More about DETECT-3B Omni

Everything teams usually want to know before shipping multimodal deepfake detection in production.

How is DETECT-3B Omni different from DETECT-2B?

DETECT-3B Omni is Resemble AI's first multi-modal deepfake detection model. While DETECT-2B focused exclusively on audio, DETECT-3B Omni extends coverage to images and video while incorporating significant improvements to audio detection. Key differences:
  • Multi-modal coverage across audio, image, and video
  • Expanded language support (51 languages, validated against MLAADv9)
  • Improved codec and telephony robustness
  • Replay attack protection (published at Interspeech 2025)
  • Coverage of recent image and video generation models
What do “3B” and “Omni” mean?

“3B” refers to the combined 3 billion parameters across the model's audio and vision components. “Omni” refers to its omni-modality capability — the ability to analyze and detect synthetic content across speech, images, and video through a single unified system.
How does DETECT-3B Omni perform on public benchmarks?

DETECT-3B Omni ranks #1 on multiple public benchmarks, including the Speech DeepFake Arena plus both image and speech deepfake detection on DFBench. Performance by modality:
  • Audio: equal error rate consistently below 6%
  • Image: overall EER ~9%, with >99% accuracy on StyleGAN variants and >98% on most commercial tools
  • Video: overall EER ~4.5%, with >99% accuracy on Veo 2 and ~95% on Veo 3
Which audio codecs and telephony formats are supported?

Performance is stable across common audio codecs including MP3, OGG, AAC, and others. We have specifically improved robustness for telephony applications, with reliable detection on 8-bit PCMu/PCMa and telephony codecs such as G.711 and G.723.1.
Which languages does DETECT-3B Omni support?

DETECT-3B Omni supports 51 languages — validated against the MLAADv9 dataset, which spans those languages across diverse linguistic families. The model performs well even on languages not seen during training, indicating robust learning of language-agnostic manipulation artifacts.
Which image generators does it detect?

The model achieves strong coverage across major generative architectures and commercial tools:
  • StyleGAN 2 / 3: >99% accuracy
  • DALL·E 3: 98% accuracy
  • Stable Diffusion: 94% accuracy
  • GPT-4o: >99% accuracy
  • Flux v2: >99% accuracy
  • Gemini 2.0 Flash: >99% accuracy
  • Midjourney v7: 98% accuracy
It also detects partially edited images where only portions have been synthetically manipulated.
How does it perform on AI-generated video?

The model achieves >99% accuracy on Google Veo 2 and approximately 95% accuracy on Google Veo 3 — covering the most advanced publicly available video generation models. Performance remains reliable across varying video lengths, resolutions, and compression levels.
What is a replay attack?

A replay attack occurs when an adversary records synthetic audio through a physical device — for instance, playing a deepfake through speakers and re-recording it with a microphone — to evade detection systems that rely on digital artifacts. DETECT-3B Omni includes specific defenses against this attack vector, based on research accepted at Interspeech 2025.
How do I integrate DETECT-3B Omni?

DETECT-3B Omni is available through the same Resemble Detect API. Existing audio detection integrations require no additional work. For image and video detection, the API accepts media files and automatically routes content to the appropriate sub-model based on media type. A web-based dashboard is also available for manual analysis, and on-premise deployment options are available for enterprise requirements.
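The media-type routing can be approximated client-side for pre-flight validation. A minimal sketch using Python's standard mimetypes module — the real API performs this routing server-side, and the modality names below are assumptions for illustration:

```python
import mimetypes

def route_modality(filename: str) -> str:
    """Guess which detection sub-model a file would be routed to.

    Illustrative only: infers the modality from the file's MIME type,
    mirroring the server-side routing described above.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot infer media type for {filename!r}")
    kind = mime.split("/")[0]
    if kind in ("audio", "image", "video"):
        return kind
    raise ValueError(f"unsupported media type: {mime}")
```

For example, `route_modality("call.mp3")` returns `"audio"`, while a `.png` routes to `"image"` and an `.mp4` to `"video"` — anything else is rejected before an upload is attempted.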
Get complete generative AI security
Join thousands of developers and enterprises securing their content with Resemble AI