Replay Attacks: The Blind Spot in Audio Deepfake Detection

May 22, 2025

We’re thrilled to announce that groundbreaking research from our team at Resemble AI and collaborators, detailed in the paper “Replay Attacks Against Audio Deepfake Detection,” has been accepted for presentation at the prestigious Interspeech 2025 conference! This work uncovers a new challenge in the deepfake detection landscape and is already sharpening our cutting-edge detection technologies.

Deepfake detection systems face a critical vulnerability that has largely gone unexamined – until now. Our paper documents how the digital-to-physical boundary compromises current detection methods. The work was co-authored by our team at Resemble AI (Nicolas Müller, Piotr Kawa, and Aditya Tirumala Bukkapatnam) alongside researchers from Fraunhofer AISEC, Wrocław University of Science and Technology, Technical University of Cluj-Napoca, Neodyme AG, and TU Munich.

The Replay Attack: Deceptively Simple, Strikingly Effective

The study reveals a counterintuitive phenomenon: the simple act of playing Generative AI audio through speakers and re-recording it can imbue the audio with real-world acoustic properties. This “replay attack” can make deepfakes appear authentic to some detection models by masking the subtle digital artifacts these systems are trained to identify.

This isn’t a fundamental flaw in detection; it’s the next frontier. At Resemble AI, we see every new challenge as an opportunity to fortify our defenses and stay ahead of those who would misuse Generative AI models.

Photographs from the recording labs, where we play bona fide and spoofed audio samples over a loudspeaker and record them via microphone.

ReplayDF: An Intelligence Goldmine for Robust Detection

A cornerstone of this research is the introduction of ReplayDF, a comprehensive dataset meticulously designed to study and combat replay attacks. This isn’t just another collection of audio files; it’s an acoustic battleground for forging more resilient detection.

Consider the scale and methodology:

  • Diverse Conditions: ReplayDF features 109 unique speaker-microphone combinations, capturing deepfakes and genuine audio across six languages (English, German, French, Italian, Polish, and Spanish).
  • Real-World Scenarios: The audio, derived from M-AILABS (bona fide) and MLAAD (spoofed using four different TTS models), was physically played and re-recorded, mimicking how deepfakes might be laundered in the wild.
  • Systematic Approach: The creation pipeline (detailed as “Algorithm 1” in the paper) ensured a balanced dataset of 132.5 hours of audio, with each sample meticulously cataloged. Metadata includes original and recorded file paths, attack types (Bark, VITS, XTTS v1.1, XTTS v2.0), language, hardware, setup images, and even Room Impulse Responses (RIRs).
  • Quality Spectrum Insights: The dataset includes human-rated Mean Opinion Scores (MOS) and objective Perceptual Evaluation of Speech Quality (PESQ) scores. The research found a correlation between recording quality and detection accuracy (Pearson correlation of 0.423 with MOS and 0.509 with PESQ for the W2V2-AASIST model; see the sketch below). This suggests, somewhat counterintuitively, that more aggressively re-recorded (and thus potentially lower quality) fakes can be harder to detect by standard models.
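
That last correlation is straightforward to reproduce: given a per-condition quality score (MOS or PESQ) and the detector's accuracy under each speaker-microphone condition, a Pearson correlation summarizes how strongly the two move together. A minimal sketch in Python (the variable names are illustrative, not taken from the paper's code):

```python
from scipy.stats import pearsonr

def quality_vs_detection(quality_scores, detection_accuracy):
    """Pearson correlation between per-condition recording quality
    (e.g. mean MOS or PESQ) and detection accuracy on those recordings."""
    r, p_value = pearsonr(quality_scores, detection_accuracy)
    return r, p_value

# e.g. quality_vs_detection(mos_per_condition, w2v2_aasist_accuracy)
# The paper reports r = 0.423 (MOS) and r = 0.509 (PESQ) for W2V2-AASIST.
```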

The paper showed that replay attacks significantly degrade the performance of several open-source detection models, with the EER of a top-performing model, W2V2-AASIST, increasing from 4.7% on original audio to 18.2% on the replayed audio from ReplayDF. Crucially, the study found this degradation primarily affects the classification of spoofed samples, making them appear genuine, while bona fide samples remain largely unaffected. Furthermore, the performance drop isn’t merely due to added noise; replaying seems to remove or alter key artifacts that detectors rely on.
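For context, the equal error rate (EER) quoted above is the operating point at which the rate of spoofed audio accepted as bona fide equals the rate of bona fide audio rejected as spoofed, so lower is better. A minimal sketch of the computation, assuming a detector whose score is higher for "more likely bona fide" (the labels and scores here are hypothetical inputs, not the paper's code):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 = bona fide, 0 = spoof; scores: higher means more bona fide.
    Returns the EER, where false acceptance and false rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1.0 - tpr                        # false-rejection rate for bona fide audio
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the two error rates cross
    return float((fpr[idx] + fnr[idx]) / 2.0)
```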

ReplayDF is more than data; it’s strategic intelligence, providing a blueprint for building the next generation of unbreakable detection systems.

Overview of recording quality in ReplayDF, measured by MOS and PESQ (blue, green). Detection performance on audio deepfakes (red) correlates with recording quality, showing Pearson correlations of 0.423 and 0.509, respectively. This suggests that the more aggressive the replay attack, the worse the detection performance.

Resemble Detect: Leading the Charge Against Sophisticated Fakes

This research underscores the dynamic nature of deepfake threats. While the study showed vulnerabilities in existing models when faced with these novel replay attacks, Resemble AI’s Detect platform is built for this evolving landscape.

Resemble Detect is a state-of-the-art neural model designed to expose deepfake audio in real-time, working across various media types and against modern speech synthesis solutions. Our current Detect platform identifies fake audio with impressive accuracy (with DETECT-2B showing >94% accuracy against diverse datasets). DETECT-2B, for instance, is an ensemble of multiple sub-models leveraging pre-trained self-supervised audio representations and efficient fine-tuning techniques to identify subtle synthetic artifacts. This sophisticated architecture, trained on vast and diverse datasets, positions us uniquely to address challenges like replay attacks head-on.
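
The exact DETECT-2B architecture is proprietary, but the general pattern described above, several classification heads sitting on top of pre-trained self-supervised speech representations with their outputs combined at inference, can be sketched roughly as follows. Module names and dimensions are placeholders for illustration, not our production code:

```python
import torch
import torch.nn as nn

class SubDetector(nn.Module):
    """One hypothetical sub-model: a small classification head on top of a
    pre-trained self-supervised speech encoder (wav2vec 2.0-style features)."""
    def __init__(self, ssl_encoder: nn.Module, feature_dim: int):
        super().__init__()
        self.encoder = ssl_encoder  # pre-trained, typically frozen or lightly fine-tuned
        self.head = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        features = self.encoder(waveform)      # (batch, time, feature_dim)
        pooled = features.mean(dim=1)          # temporal average pooling
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # P(bona fide)

def ensemble_score(sub_models, waveform):
    """Average the bona fide probabilities across all sub-models."""
    with torch.no_grad():
        return torch.stack([m(waveform) for m in sub_models]).mean(dim=0)
```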

The paper itself explores a promising mitigation strategy: adaptive retraining using RIRs. Augmenting training data with RIRs from ReplayDF improved the W2V2-AASIST model’s EER on replayed samples from 18.2% to 11.0%. This demonstrates that by understanding and incorporating these acoustic transformations, detection models can become more robust—a principle central to our development philosophy at Resemble AI.
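In practice, this kind of augmentation amounts to convolving clean training audio with measured room impulse responses so the model sees "replayed" acoustics during training. A minimal sketch, assuming a waveform and an RIR already loaded as NumPy arrays at the same sample rate (the helper name is ours, not from the paper):

```python
import numpy as np
from scipy.signal import fftconvolve

def augment_with_rir(waveform: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate playback and re-recording by convolving the signal with a
    room impulse response, then matching the original peak level."""
    wet = fftconvolve(waveform, rir, mode="full")[: len(waveform)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(waveform)) / peak)
    return wet
```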

Beyond Detection: A Multi-Layered Defense Strategy

What sets Resemble AI apart is not just superior detection but our comprehensive approach to audio authenticity and security:

  • AI Watermarker: Our PerTh watermarker embeds imperceptible, persistent identifiers within audio content. This technology, designed to survive model training processes, offers a crucial verification layer that complements our detection capabilities. It can even help trace if watermarked data was used in training other AI models.
  • Deepfake Detection Dashboard: For enterprise customers, our intuitive Deepfake Detection Dashboard provides real-time analysis, making complex authentication accessible to non-technical users.
  • Multimodal Protection: Recognizing that threats extend beyond audio, our Detect Multimodal platform now supports image and video content, providing comprehensive protection across media types.

Leading the Evolution in Generative AI Safety

This research, featuring contributions from our own team, doesn’t just highlight weaknesses; it showcases our commitment to proactive innovation. While others may wait for new attack vectors to emerge, we are actively mapping the battlefield, anticipating tomorrow’s challenges, and building defenses today.

By incorporating insights from the ReplayDF study directly into our development roadmap, we ensure our clients remain protected against even these sophisticated evasion techniques. The ReplayDF dataset and the findings are being shared with the research community for non-commercial use, reflecting our dedication to advancing the entire field of synthetic media authentication.

The Path Forward: Trust in an Increasingly Generative World

The fight against deepfakes is an ongoing endeavor, a continuous dialogue about establishing trust in our digital interactions. By identifying and addressing nuanced challenges like replay attacks before they become widespread, Resemble AI continues to set the standard for reliable, robust, and future-proof detection.

This transparent and research-driven approach ensures that as synthetic media evolves, our collective ability to verify authenticity and maintain digital integrity evolves in lockstep. In the intricate dance of Generative AI, Resemble AI is not just keeping pace—we are leading the next move.

Congratulations to Nicolas Müller, Piotr Kawa, and Aditya Tirumala Bukkapatnam from Resemble AI, and to co-authors Wei-Herng Choong, Adriana Stan, Karla Pizzi, Alexander Wagner, and Philip Sperl, on the acceptance of “Replay Attacks Against Audio Deepfake Detection” at Interspeech 2025.
