The leading deepfake detector on HuggingFace
- #1 on Speech DeepFake Arena
- #1 in Image Deepfake Detection on DFBench
- #1 in Speech Deepfake Detection on DFBench
As generative AI capabilities expand beyond audio into images and video, synthetic media threats have become fundamentally multi-modal. A deepfaked executive might appear in a video call, a cloned voice message, or a manipulated photograph—often within the same attack campaign. Defending against these threats requires detection capabilities that span all three modalities without sacrificing accuracy in any single domain.
To meet this challenge, Resemble AI is introducing DETECT-3B Omni, our first multi-modal deepfake detection model. DETECT-3B Omni combines our proven audio detection capabilities with new vision models trained on millions of images and hundreds of thousands of video clips. The result is a unified 3 billion parameter model that delivers state-of-the-art detection across speech, image, and video content through a single API.
Building on the strong foundation of DETECT-2B, the audio component of 3B Omni incorporates significantly expanded training data, improved robustness to real-world audio conditions, and protection against emerging attack vectors like replay attacks. The vision component introduces comprehensive coverage of major generative architectures and commercial AI tools, with particular attention to the rapid evolution of video generation models.
Audio Analysis
Advances from DETECT-2B: While DETECT-3B Omni’s audio model shares architectural DNA with DETECT-2B, several significant improvements have been made to enhance real-world performance and coverage.
Expanded Training Data: The audio model has been trained on substantially more data than its predecessor, with particular emphasis on challenging scenarios. Training augmentations now include a wider variety of noise conditions, compression artifacts, and recording quality variations to ensure robust performance on media encountered in production environments.
Codec and Format Support: Performance is now stable across common audio codecs including MP3, OGG, AAC, and others. We have specifically improved robustness for telephony applications, with reliable detection on 8-bit PCMu/PCMa and telephony codecs such as G.711 and G.723.1. This telephony hardening is critical for enterprise deployments where audio often arrives via call center infrastructure or VoIP systems.
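To illustrate what telephony robustness involves, 8-bit G.711 μ-law (PCMu) audio stores each sample as a complemented, companded byte that must be expanded to linear PCM before most signal processing. The sketch below implements the standard G.711 μ-law expansion; it is purely illustrative and not part of the Detect API, which accepts the encoded formats directly.

```python
def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 mu-law code to a 16-bit linear PCM sample."""
    code = ~code & 0xFF                 # mu-law bytes are stored complemented
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

# Decode a short mu-law frame to linear samples
frame = bytes([0xFF, 0x7F, 0x00, 0x80])
linear = [ulaw_to_linear(b) for b in frame]
```

The expansion maps the full 8-bit code range onto roughly ±32124, the standard G.711 peak magnitude, which is why detection models trained only on 16-bit studio audio can struggle on telephony input without this kind of augmentation.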
Language Coverage: DETECT-3B Omni now supports over 40 languages, a significant expansion from the initial DETECT-2B release. This extended coverage has been validated against the MLAADv8 dataset, which includes 40 languages across diverse linguistic families.
Modern Attack Method Coverage: The model accurately detects outputs from current commercial voice AI providers and research synthesis methods, including IndexTTS, Qwen2.5-Omni, NaturalSpeech2, and numerous other state-of-the-art approaches. We continuously update our training data to include samples from newly released generation methods.
Replay Attack Protection: We have implemented specific defenses against replay attacks, where an adversary records synthetic audio through a physical device to evade detection systems that rely on digital artifacts. This research was published and accepted at Interspeech 2025, representing a novel contribution to the deepfake detection literature.
Image Analysis
The image detection component of DETECT-3B Omni has been trained on millions of images spanning major generative architectures, commercial AI tools, and partially edited content.
Generative Architecture Coverage: The model achieves strong performance across foundational generative architectures. On StyleGAN 2 and StyleGAN 3 generated images, accuracy exceeds 99%. DALL-E 3 outputs are detected at 98% accuracy, while Stable Diffusion variants are detected at 94% accuracy.
Commercial Model Coverage: As commercial image generation tools proliferate, detection must keep pace with their outputs. DETECT-3B Omni achieves greater than 99% accuracy on images generated by GPT-4o, Flux v2, and Gemini 2.0 Flash. Midjourney v7 outputs are detected at 98% accuracy. We continue to expand coverage as new commercial tools emerge.
Partial Edit Detection: Not all synthetic media is fully generated; many real-world threats involve authentic images with localized AI-generated modifications. DETECT-3B Omni detects partially edited images, achieving high accuracy on content where only portions of the image have been synthetically manipulated. The overall equal error rate (EER) across pooled test sets is approximately 9%, indicating strong generalization across the diverse landscape of image generation methods.
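Equal error rate is the operating point where the false-accept rate on synthetic media equals the false-reject rate on authentic media. A minimal sketch of how EER can be computed from two pools of detector scores (assuming higher scores mean "more likely synthetic"; the function name and threshold sweep are our own illustration, not the evaluation harness used for these benchmarks):

```python
def equal_error_rate(synthetic_scores, authentic_scores):
    """Find the threshold where miss rate and false-accept rate cross.

    synthetic_scores: detector scores for known-fake media
    authentic_scores: detector scores for known-real media
    Higher scores are assumed to indicate synthetic content.
    """
    best_gap, eer = 2.0, None
    for t in sorted(set(synthetic_scores) | set(authentic_scores)):
        # Miss rate: synthetic media scoring below the threshold
        frr = sum(s < t for s in synthetic_scores) / len(synthetic_scores)
        # False accept: authentic media scoring at or above the threshold
        far = sum(s >= t for s in authentic_scores) / len(authentic_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A lower EER means the detector separates real from synthetic more cleanly at its best single threshold; a 9% EER says that, at that threshold, roughly 9% of each class is misclassified.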
Performance Evaluation
DETECT-3B Omni achieves state-of-the-art performance across all three modalities when evaluated against our comprehensive test sets.
Audio Performance: The audio model achieves an equal error rate consistently below 6% across our evaluation datasets. On the public Speech DeepFake Arena benchmark, DETECT-3B Omni’s audio model currently ranks #1. Performance remains strong across all 40+ supported languages, including languages not seen during training, indicating robust learning of language-agnostic manipulation artifacts.
Image Performance: Pooled across all image test sets, the model achieves an overall EER of approximately 9%. Performance varies by generation method, with particularly strong results on GAN-based architectures (>99% accuracy on StyleGAN variants) and commercial tools (>98% accuracy on most major platforms). Partial edit detection achieves >85% accuracy, addressing the increasingly common threat of localized image manipulation.
Video Performance: The video model achieves an overall EER of approximately 4.5%. Coverage of recent video generation models is particularly strong, with >99% accuracy on Veo 2 and ~95% on Veo 3. The model maintains reliable performance across varying video lengths, resolutions, and compression levels.

These results demonstrate that DETECT-3B Omni provides comprehensive, production-ready detection capabilities across the full spectrum of synthetic media types.
Integrating DETECT-3B Omni
For customers already using Resemble Detect, DETECT-3B Omni is available through the same API interface. Existing integrations gain multi-modal capabilities with no additional work for audio detection; image and video detection require only the addition of appropriate media handling.
Media files can be submitted individually or in batches across any supported modality. The API automatically routes content to the appropriate sub-model based on media type and returns granular predictions along with aggregated authenticity scores.

We also offer a web-based dashboard for users who prefer visual interaction with detection results. The dashboard supports upload and analysis of audio, image, and video files with detailed result visualization and export capabilities.

For enterprise deployments with data residency or latency requirements, on-premise deployment options are available.
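As an illustration of the batch submission flow, the sketch below groups local files by modality using their MIME types before upload. This mirrors the routing the API performs server-side; the function name and bucket structure here are our own example, not part of the Detect API surface.

```python
import mimetypes

def group_media_by_modality(paths):
    """Bucket file paths into audio / image / video batches by MIME type."""
    batches = {"audio": [], "image": [], "video": []}
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        modality = (mime or "").split("/")[0]
        if modality not in batches:
            raise ValueError(f"unsupported media type for {path}: {mime}")
        batches[modality].append(path)
    return batches

# Example: a mixed batch from a single investigation
batches = group_media_by_modality(["call.mp3", "photo.png", "clip.mp4"])
```

Client-side grouping like this is optional, but it makes it easy to attach per-modality metadata or apply format checks before files ever leave your infrastructure.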
Future Work
With the release of DETECT-3B Omni, Resemble AI establishes a foundation for comprehensive multi-modal deepfake detection. However, the synthetic media landscape continues to evolve rapidly, and our research agenda reflects the need for continuous advancement.
Expanded Video Coverage: As video generation models continue to improve, we will expand coverage of new generation methods. We are particularly focused on emerging models that combine multiple modalities or produce longer-form content.
Real-Time Detection: We are developing detection suitable for real-time analysis of live video and audio streams. This capability is critical for protecting synchronous communication channels like video calls.
Adversarial Robustness: We continue to research defenses against adversarial attacks specifically designed to evade detection. This includes both digital adversarial perturbations and physical-world attacks like the replay attacks addressed in our Interspeech 2025 publication.
Cross-Modal Analysis: Future versions may incorporate cross-modal reasoning that analyzes relationships between audio, image, and video components of multimedia content to identify inconsistencies that single-modality analysis might miss.

We are committed to staying at the forefront of detection technology as generative AI capabilities advance. DETECT-3B Omni represents our current best effort, but our research continues toward even more robust and comprehensive solutions.
More FAQs
What is DETECT-3B Omni and how does it differ from DETECT-2B?
DETECT-3B Omni is Resemble AI’s first multi-modal deepfake detection model. While DETECT-2B focused exclusively on audio deepfake detection, DETECT-3B Omni extends coverage to images and video while also incorporating significant improvements to audio detection. Key differences include:
- Multi-modal coverage across audio, image, and video
- Expanded language support (40+ languages, validated against MLAADv8)
- Improved codec and telephony robustness
- Replay attack protection (published at Interspeech 2025)
- Coverage of recent image and video generation models
What does "3B" mean in DETECT-3B Omni?
“3B” refers to the combined 3 billion parameters across the model’s audio and vision components. “Omni” refers to the model’s omni-modality capability—its ability to analyze and detect synthetic content across multiple media types (speech, images, and video) through a unified system.
What are the model's accuracy benchmarks?
DETECT-3B Omni ranks #1 on multiple public benchmarks including the Speech DeepFake Arena, and both image and speech deepfake detection on DFBench. Performance by modality:
- Audio: Equal error rate consistently below 6%
- Image: Overall EER of approximately 9%, with >99% accuracy on StyleGAN variants and >98% on most commercial tools
- Video: Overall EER of approximately 4.5%, with >99% accuracy on Veo 2 and ~95% on Veo 3
What audio formats and codecs does DETECT-3B Omni support?
Performance is stable across common audio codecs including MP3, OGG, AAC, and others. We have specifically improved robustness for telephony applications, with reliable detection on 8-bit PCMu/PCMa and telephony codecs such as G.711 and G.723.1.
What languages does DETECT-3B Omni support?
DETECT-3B Omni supports over 40 languages, a significant expansion from DETECT-2B. This coverage has been validated against the MLAADv8 dataset, which includes 40 languages across diverse linguistic families. The model performs well even on languages not seen during training, indicating robust learning of language-agnostic manipulation artifacts.
What image generation methods can DETECT-3B Omni detect?
The model achieves strong coverage across major generative architectures and commercial tools:
- StyleGAN 2/3: >99% accuracy
- DALL-E 3: 98% accuracy
- Stable Diffusion: 94% accuracy
- GPT-4o: >99% accuracy
- Flux v2: >99% accuracy
- Gemini 2.0 Flash: >99% accuracy
- Midjourney v7: 98% accuracy
The model also detects partially edited images where only portions have been synthetically manipulated.
What video generation methods can DETECT-3B Omni detect?
The model achieves >99% accuracy on Google Veo 2 and approximately 95% accuracy on Google Veo 3, representing coverage of the most advanced publicly available video generation models. The model maintains reliable performance across varying video lengths, resolutions, and compression levels.
What is replay attack protection?
A replay attack occurs when an adversary records synthetic audio through a physical device (such as playing a deepfake through speakers and re-recording it with a microphone) to evade detection systems that rely on digital artifacts. DETECT-3B Omni includes specific defenses against this attack vector, based on research published and accepted at Interspeech 2025.
How can I integrate DETECT-3B Omni?
DETECT-3B Omni is available through the same Resemble Detect API. Existing audio detection integrations require no additional work. For image and video detection, the API accepts media files and automatically routes content to the appropriate sub-model based on media type. A web-based dashboard is also available for manual analysis, and on-premise deployment options are available for enterprise requirements.