DETECT-3B Omni | Resemble Research

The leading multimodal deepfake detection model

Three billion parameters and three modalities. One unified model that ranks #1 on public benchmarks for detecting AI-generated voice, image, and video. Available through a single API and deployable on-premises.

The leading deepfake detector on Hugging Face.

DETECT-3B Omni holds the top rank across the major public detection leaderboards — both legacy speech-only benchmarks and the newer multi-modal suites that cover image as well as speech detection.

#1

Speech DeepFake Arena

Top of the public leaderboard for synthetic speech detection — evaluated across diverse TTS and voice-cloning methods.

#1

DFBench — Image

Ranked first for image deepfake detection across GAN, diffusion, and commercial generator outputs.

#1

DFBench — Speech

Ranked first for speech deepfake detection on the DFBench evaluation suite.

One attack. Many modalities.

A deepfaked executive might appear in a video call, a cloned voice message, and a manipulated photograph — often within the same attack campaign. Defending against these threats requires detection that spans every modality without sacrificing accuracy in any single one.

A unified 3B-parameter model. DETECT-3B Omni combines our proven audio detection capabilities with new vision models trained on millions of images and hundreds of thousands of video clips. The result is state-of-the-art detection across speech, image, and video through a single API.

Built on DETECT-2B. The audio component shares architectural DNA with DETECT-2B but adds significantly expanded training data, improved robustness to real-world conditions, and protection against emerging attack vectors like replay attacks.

New vision stack. The image and video components introduce comprehensive coverage of major generative architectures and commercial AI tools, with particular attention to the rapid evolution of video generation models.

One API, three modalities. Submit media individually or in batches. The API automatically routes content to the appropriate sub-model and returns granular predictions along with aggregated authenticity scores.
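As a sketch of what "aggregated authenticity scores" could mean in practice, here is one possible roll-up of per-segment predictions into a single score. This is an illustrative assumption — the actual aggregation method and response fields of the Detect API are not specified here:

```python
def authenticity_score(fake_probs: list[float]) -> float:
    """Roll per-segment fake probabilities up into one authenticity score.

    Hypothetical aggregation for illustration: the clip is treated as
    only as authentic as its most suspicious segment, so we return
    1 - max. The real Detect API's aggregation is not documented here.
    """
    if not fake_probs:
        raise ValueError("no predictions to aggregate")
    return 1.0 - max(fake_probs)
```

Under this scheme, segment fake probabilities of 0.05, 0.10, and 0.50 would yield an authenticity score of 0.5: the overall verdict is driven by the most suspicious segment rather than the average.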

Video Audio Image
A single model covering the three modalities adversaries combine in coordinated attacks — audio, image, and video — routed through one unified API.

Audio, image, video — one model.

Each modality is handled by a specialized sub-model tuned for the generators and attack surfaces it will encounter in production. Benchmarks below reflect current reported accuracy.

Audio

Voice & speech detection

Equal error rate consistently below 6%. Stable across MP3, OGG, AAC, and telephony codecs — including G.711, G.723.1, and 8-bit PCMu/PCMa — for call-center and VoIP pipelines.

  • Overall EER: < 6%
  • Speech DeepFake Arena: #1
  • DFBench — Speech: #1
  • Language coverage: 51+
  • Replay attack defense: Yes

Image

Image deepfake detection

Trained on millions of images across major generative architectures and commercial tools. Also flags partially edited content where only a region of an authentic image has been synthesized.

  • StyleGAN 2 / 3: > 99%
  • GPT-4o / Flux v2 / Gemini 2.0: > 99%
  • DALL·E 3: 98%
  • Midjourney v7: 98%
  • Stable Diffusion: 94%
  • Partial-edit detection: > 85%

Video

Video deepfake detection

Overall EER of ~4.5%. Strong coverage of the most advanced video generators, with reliable performance across varying video lengths, resolutions, and compression levels.

  • Overall EER: ~ 4.5%
  • Google Veo 2 accuracy: > 99%
  • Google Veo 3 accuracy: ~ 95%
  • Length / resolution: Stable
  • Compression robustness: Yes

Hardened for production audio, not demos.

The audio model shares DNA with DETECT-2B, but every layer has been upgraded for the messy reality of real-world audio pipelines — compressed, re-encoded, telephony-routed, or replayed through physical devices.

Expanded training data

Substantially more data than DETECT-2B, with augmentations for noise conditions, compression artifacts, and recording-quality variations.

Telephony & codec coverage

Stable across MP3, OGG, AAC plus telephony codecs G.711, G.723.1, and 8-bit PCMu/PCMa — critical for enterprise call-center deployments.

51 languages

Validated against MLAADv9. Performs well even on languages not seen during training, indicating language-agnostic artifact learning.

Modern attack coverage

Detects outputs from IndexTTS, Qwen2.5-Omni, NaturalSpeech2, and other state-of-the-art voice AI providers. Training refreshed continuously.

Replay attack protection

Specific defenses against adversaries who record synthetic audio through a physical device to evade digital-artifact detectors. Research accepted at Interspeech 2025.

< 6% EER, sustained

Equal error rate consistently below 6% across evaluation datasets — and #1 on the public Speech DeepFake Arena benchmark.
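Equal error rate is the operating point where the false-accept rate (real audio flagged as fake) equals the false-reject rate (fake audio passed as real). A minimal computation sketch, assuming higher detector scores mean "more likely fake":

```python
import numpy as np

def equal_error_rate(real_scores: np.ndarray, fake_scores: np.ndarray) -> float:
    """Sweep score thresholds; report the point where FAR and FRR cross.

    real_scores / fake_scores are detector outputs on genuine and
    synthetic samples, with higher meaning "more likely fake".
    """
    thresholds = np.sort(np.concatenate([real_scores, fake_scores]))
    eer, gap = 1.0, float("inf")
    for t in thresholds:
        far = float(np.mean(real_scores >= t))  # real flagged as fake
        frr = float(np.mean(fake_scores < t))   # fake passed as real
        if abs(far - frr) < gap:                # closest FAR/FRR crossing
            gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

On perfectly separated scores the function returns 0.0; in the worst case it approaches 0.5 (the detector is guessing). A sub-6% EER means both error rates can be held below 6% simultaneously at a single threshold.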

Coverage that keeps up with the generators.

As commercial image and video generation tools proliferate, detection has to keep pace with their outputs. The vision stack is trained on millions of images and hundreds of thousands of video clips from foundational architectures, commercial APIs, and partially edited content.

Foundational architectures

>99% accuracy on StyleGAN 2 and StyleGAN 3. DALL·E 3 at 98%, Stable Diffusion variants at 94%.

Commercial tools

>99% accuracy on GPT-4o, Flux v2, Gemini 2.0 Flash. Midjourney v7 detected at 98%. Coverage expands as new tools ship.

Partial-edit detection

Many real-world threats involve authentic images with localized AI-generated modifications. Partial-edit accuracy exceeds 85%.

Google Veo 2 & 3

>99% accuracy on Veo 2 and ~95% on Veo 3 — covering the most advanced publicly available video generators.

Resolution-agnostic

Reliable performance across varying video lengths, resolutions, and compression levels — from social uploads to broadcast.

~9% pooled image EER

Across pooled test sets spanning many generators, the image sub-model achieves an equal error rate of approximately 9%.

< 6%
Audio equal error rate across evaluation datasets — #1 on Speech DeepFake Arena.
~ 9%
Pooled image EER across GAN, diffusion, and commercial generator test sets.
~ 4.5%
Video equal error rate — >99% accuracy on Veo 2, ~95% on Veo 3, stable across lengths and resolutions.

Detection that speaks the world's languages.

Validated against MLAADv9, which spans 51 languages across diverse linguistic families. Because the model learns language-agnostic manipulation artifacts, performance holds up even on languages not seen during training.

51 languages

Significant expansion from the initial DETECT-2B release. Benchmarked against MLAADv9 — a multi-language anti-spoofing dataset covering 51 languages.

Language-agnostic artifacts. Detection ability transfers to unseen languages because the model learns generator fingerprints in frequency and temporal structure — not phonetic content. Critical for deployments where user-generated audio arrives in dozens of locales.

English
Spanish
French
German
Italian
Portuguese
Dutch
Polish
Russian
Ukrainian
Czech
Romanian
Greek
Turkish
Arabic
Hebrew
Hindi
Bengali
Tamil
Chinese
Japanese
Korean
Vietnamese
+28 more

More about DETECT-3B Omni

Everything teams usually want to know before shipping multimodal deepfake detection in production.

How is DETECT-3B Omni different from DETECT-2B?

DETECT-3B Omni is Resemble AI's first multi-modal deepfake detection model. While DETECT-2B focused exclusively on audio, DETECT-3B Omni extends coverage to images and video while incorporating significant improvements to audio detection. Key differences:
  • Multi-modal coverage across audio, image, and video
  • Expanded language support (51 languages, validated against MLAADv9)
  • Improved codec and telephony robustness
  • Replay attack protection (published at Interspeech 2025)
  • Coverage of recent image and video generation models
What do “3B” and “Omni” mean?

“3B” refers to the combined 3 billion parameters across the model's audio and vision components. “Omni” refers to its omni-modality capability — the ability to analyze and detect synthetic content across speech, images, and video through a single unified system.
How does DETECT-3B Omni perform on public benchmarks?

DETECT-3B Omni ranks #1 on multiple public benchmarks, including the Speech DeepFake Arena plus both image and speech deepfake detection on DFBench. Performance by modality:
  • Audio: equal error rate consistently below 6%
  • Image: overall EER ~9%, with >99% accuracy on StyleGAN variants and >98% on most commercial tools
  • Video: overall EER ~4.5%, with >99% accuracy on Veo 2 and ~95% on Veo 3
Which audio codecs and telephony formats are supported?

Performance is stable across common audio codecs including MP3, OGG, AAC, and others. We have specifically improved robustness for telephony applications, with reliable detection on 8-bit PCMu/PCMa and telephony codecs such as G.711 and G.723.1.
Which languages does DETECT-3B Omni support?

DETECT-3B Omni supports 51 languages — validated against the MLAADv9 dataset, which spans those languages across diverse linguistic families. The model performs well even on languages not seen during training, indicating robust learning of language-agnostic manipulation artifacts.
Which image generators does it detect?

The model achieves strong coverage across major generative architectures and commercial tools:
  • StyleGAN 2 / 3: >99% accuracy
  • DALL·E 3: 98% accuracy
  • Stable Diffusion: 94% accuracy
  • GPT-4o: >99% accuracy
  • Flux v2: >99% accuracy
  • Gemini 2.0 Flash: >99% accuracy
  • Midjourney v7: 98% accuracy
It also detects partially edited images where only portions have been synthetically manipulated.
How does it perform on AI-generated video?

The model achieves >99% accuracy on Google Veo 2 and approximately 95% accuracy on Google Veo 3 — covering the most advanced publicly available video generation models. Performance remains reliable across varying video lengths, resolutions, and compression levels.
What is a replay attack?

A replay attack occurs when an adversary records synthetic audio through a physical device — for instance, playing a deepfake through speakers and re-recording it with a microphone — to evade detection systems that rely on digital artifacts. DETECT-3B Omni includes specific defenses against this attack vector, based on research accepted at Interspeech 2025.
How do I integrate DETECT-3B Omni?

DETECT-3B Omni is available through the same Resemble Detect API. Existing audio detection integrations require no additional work. For image and video detection, the API accepts media files and automatically routes content to the appropriate sub-model based on media type. A web-based dashboard is also available for manual analysis, and on-premise deployment options are available for enterprise requirements.
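The media-type routing can be approximated client-side for pre-flight validation. A minimal sketch using Python's standard mimetypes module — the real API performs this routing server-side, and the modality names below are assumptions for illustration:

```python
import mimetypes

def route_modality(filename: str) -> str:
    """Guess which detection sub-model a file would be routed to.

    Illustrative only: infers the modality from the file's MIME type,
    mirroring the server-side routing described above.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        raise ValueError(f"cannot infer media type for {filename!r}")
    kind = mime.split("/")[0]
    if kind in ("audio", "image", "video"):
        return kind
    raise ValueError(f"unsupported media type: {mime}")
```

For example, `route_modality("call.mp3")` returns `"audio"`, while a `.png` routes to `"image"` and an `.mp4` to `"video"` — anything else is rejected before an upload is attempted.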
Get complete generative AI security
Join thousands of developers and enterprises securing their content with Resemble AI