Three billion parameters and three modalities. One unified model that ranks #1 on public benchmarks for detecting AI-generated voice, image, and video. Available through a single API and deployable on-premises.
DETECT-3B Omni holds the top rank across the major public detection leaderboards — both the legacy speech-only benchmarks and the newer multimodal suites that cover image detection.
Top of the public leaderboard for synthetic speech detection — evaluated across diverse TTS and voice-cloning methods.
Ranked first for image deepfake detection across GAN, diffusion, and commercial generator outputs.
Ranked first for speech deepfake detection on the DFBench evaluation suite.
A deepfaked executive might appear in a video call, a cloned voice message, and a manipulated photograph — often within the same attack campaign. Defending against these threats requires detection that spans every modality without sacrificing accuracy in any single one.
A unified 3B-parameter model. DETECT-3B Omni combines our proven audio detection capabilities with new vision models trained on millions of images and hundreds of thousands of video clips. The result is state-of-the-art detection across speech, image, and video through a single API.
Built on DETECT-2B. The audio component shares architectural DNA with DETECT-2B but adds significantly expanded training data, improved robustness to real-world conditions, and protection against emerging attack vectors like replay attacks.
New vision stack. The image and video components introduce comprehensive coverage of major generative architectures and commercial AI tools, with particular attention to the rapid evolution of video generation models.
One API, three modalities. Submit media individually or in batches. The API automatically routes content to the appropriate sub-model and returns granular predictions along with aggregated authenticity scores.
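The automatic routing described above happens server-side, but the idea is easy to picture from the client. The sketch below is illustrative only — it infers the modality from a file's MIME type, which is an assumption about how routing could work, not the documented API behavior:

```python
import mimetypes

def modality_for(path: str) -> str:
    """Guess which sub-model (audio, image, or video) a file routes to.

    Illustrative client-side sketch; the actual platform performs this
    routing server-side on the uploaded bytes, not on the filename."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"unrecognized media type: {path}")
    kind = mime.split("/", 1)[0]
    if kind not in ("audio", "image", "video"):
        raise ValueError(f"unsupported media type: {mime}")
    return kind

print(modality_for("statement.mp3"))   # audio
print(modality_for("headshot.png"))    # image
print(modality_for("townhall.mp4"))    # video
```

In practice you submit the bytes and the platform decides; the point is that one endpoint fans out to three specialized sub-models.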
Each modality is handled by a specialized sub-model tuned for the generators and attack surfaces it will encounter in production. The benchmarks below reflect currently reported accuracy.
Equal error rate consistently below 6%. Stable across MP3, OGG, AAC, and telephony codecs — including G.711, G.723.1, and 8-bit PCMu/PCMa — for call-center and VoIP pipelines.
Trained on millions of images across major generative architectures and commercial tools. Also flags partially edited content where only a region of an authentic image has been synthesized.
Overall EER of ~4.5%. Strong coverage of the most advanced video generators, with reliable performance across varying video lengths, resolutions, and compression levels.
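Equal error rate (EER) is the operating point where the false-acceptance rate (synthetic media passed as real) equals the false-rejection rate (authentic media flagged as synthetic). A minimal sketch of how it is computed from raw per-sample scores — illustrative, not the evaluation harness behind the numbers above:

```python
def equal_error_rate(real_scores, fake_scores):
    """EER given per-sample synthetic-likelihood scores.

    real_scores: scores for authentic media (should be low).
    fake_scores: scores for synthetic media (should be high).
    Sweeps every observed score as a decision threshold and returns
    the point where the two error rates are closest."""
    best_gap, eer = 2.0, 1.0
    for t in sorted(set(real_scores) | set(fake_scores)):
        frr = sum(r >= t for r in real_scores) / len(real_scores)  # real flagged as fake
        far = sum(f < t for f in fake_scores) / len(fake_scores)   # fakes missed
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy scores: one authentic clip (0.4) overlaps the synthetic range.
print(equal_error_rate([0.1, 0.2, 0.3, 0.4], [0.35, 0.6, 0.7, 0.8]))  # 0.25
```

A lower EER means both error types can be driven down together; a 4.5% EER says there exists a threshold where roughly 4.5% of fakes slip through and 4.5% of real clips are wrongly flagged.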
The audio model shares DNA with DETECT-2B, but every layer has been upgraded for the messy reality of real-world audio pipelines — compressed, re-encoded, telephony-routed, or replayed through physical devices.
Substantially more data than DETECT-2B, with augmentations for noise conditions, compression artifacts, and recording-quality variations.
Stable across MP3, OGG, AAC plus telephony codecs G.711, G.723.1, and 8-bit PCMu/PCMa — critical for enterprise call-center deployments.
Validated against MLAADv9. Performs well even on languages not seen during training, indicating language-agnostic artifact learning.
Detects outputs from IndexTTS, Qwen2.5-Omni, NaturalSpeech2, and other state-of-the-art voice AI providers. Training refreshed continuously.
Specific defenses against adversaries who record synthetic audio through a physical device to evade digital-artifact detectors. Research accepted at Interspeech 2025.
Equal error rate consistently below 6% across evaluation datasets — and #1 on the public Speech DeepFake Arena benchmark.
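G.711-style companding is one of the degradations the audio model has to survive: 16-bit samples are squeezed into 8 bits on a logarithmic curve, quantizing away fine spectral structure. The roundtrip below uses the simplified continuous μ-law formula (the real G.711 codec uses a segmented approximation of the same curve) to show the kind of distortion telephony introduces:

```python
import math

MU = 255.0  # mu-law companding constant used by PCMu telephony audio

def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an 8-bit code (continuous mu-law)."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))

def mulaw_decode(code: int) -> float:
    """Expand an 8-bit code back to a sample in [-1, 1]."""
    y = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Roundtrip error stays small relative to amplitude, but fine structure
# below the 8-bit quantization step is lost — detectors trained only on
# clean studio audio tend to break here.
x = 0.5
print(abs(mulaw_decode(mulaw_encode(x)) - x) < 0.01)  # True
```

Training with this class of augmentation is what keeps the reported EER stable across telephony codecs rather than collapsing the moment audio transits a call center.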
As commercial image and video generation tools proliferate, detection has to keep pace with their outputs. The vision stack is trained on millions of images and hundreds of thousands of video clips from foundational architectures, commercial APIs, and partially edited content.
>99% accuracy on StyleGAN2 and StyleGAN3. DALL·E 3 at 98%, Stable Diffusion variants at 94%.
>99% accuracy on GPT-4o, Flux v2, Gemini 2.0 Flash. Midjourney v7 detected at 98%. Coverage expands as new tools ship.
Many real-world threats involve authentic images with localized AI-generated modifications. Partial-edit accuracy exceeds 85%.
>99% accuracy on Veo 2 and ~95% on Veo 3 — covering the most advanced publicly available video generators.
Reliable performance across varying video lengths, resolutions, and compression levels — from social uploads to broadcast.
Across test sets spanning many generators, the image sub-model achieves a pooled equal error rate of approximately 9%.
Validated against MLAADv9, which spans 51 languages across diverse linguistic families. Because the model learns language-agnostic manipulation artifacts, performance holds up even on languages not seen during training.
A significant expansion of language coverage over the initial DETECT-2B release, benchmarked against MLAADv9 — a multi-language anti-spoofing dataset covering 51 languages.
Language-agnostic artifacts. Detection ability transfers to unseen languages because the model learns generator fingerprints in frequency and temporal structure — not phonetic content. Critical for deployments where user-generated audio arrives in dozens of locales.
DETECT-3B Omni ships with a web dashboard for analysts and a single unified API for developers. Submit audio, image, or video files individually or in batches — the platform routes content to the right sub-model and returns granular predictions with aggregated authenticity scores.
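The aggregated authenticity score mentioned above has to combine per-modality predictions into one verdict. The policy below — let the most suspicious modality dominate — is an illustrative assumption, not the platform's documented aggregation rule:

```python
def aggregate_authenticity(modality_scores: dict) -> float:
    """Combine per-modality synthetic-probability scores into one score.

    Illustrative policy (assumption, not the documented behavior): a
    deepfake only needs to fool you in one modality, so the highest
    per-modality score drives the aggregate."""
    if not modality_scores:
        raise ValueError("no modality scores to aggregate")
    return max(modality_scores.values())

# A video call with cloned audio but an unremarkable video track:
print(aggregate_authenticity({"audio": 0.97, "video": 0.22}))  # 0.97
```

A max policy is conservative by design: averaging would let a convincing video track dilute a blatantly synthetic voice.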
Detect AI-generated content while browsing: our Chrome extension puts a deepfake detector on every image, video, and audio clip across the web.
Real-time deepfake detection for Google Meet, Teams, Zoom, and Webex — protection for synchronous video calls.
Real-time multimodal detection across audio, video, and images. Battle-tested against 160+ generative AI models with explainability built in.
If you are already integrated with Resemble Detect, DETECT-3B Omni is available through the same API. Existing audio integrations gain multimodal coverage through the same endpoints — adding image and video support requires only handling those media types in your requests.
Endpoints, authentication, request/response shapes, and modality-specific options. Batch and streaming supported.
Self-host DETECT-3B Omni inside your VPC for data residency and latency-sensitive workloads. Enterprise support included.
Read the DETECT-2B research — the audio-only foundation DETECT-3B Omni builds on, with full method and evaluation details.
Pair detection with provenance. PerTh embeds an invisible watermark into Resemble-generated audio so you can verify origin later.
Gen-AI-based training that teaches your organization to recognize deepfakes in calls, emails, and meetings.
Real-time database of the latest deepfake incidents — the public intelligence feed from the Resemble research team.
Everything teams usually want to know before shipping multimodal deepfake detection in production.