"It's likely over for us."

— Rhett Reese, Deadpool & Wolverine screenwriter, reacting to a Seedance 2.0 deepfake

Three days ago, ByteDance launched Seedance 2.0. Within hours, an Irish filmmaker named Ruairi Robinson typed a two-line prompt and generated a fight scene between Tom Cruise and Brad Pitt on a rooftop. The clip hit over a million views on X. The MPA, SAG-AFTRA, and the Human Artistry Campaign all issued formal statements within 24 hours. ByteDance itself suspended certain features because the deepfake implications were so obvious.

At Resemble AI, we were doing something different in those same 72 hours. We were pulling Seedance samples, running adversarial tests, and training our DETECT model to classify its outputs correctly. We're already seeing strong initial results.

That's the job. And Seedance is just this week's chapter.

[Image: side-by-side of the viral Seedance 2.0 Cruise vs. Pitt deepfake alongside Resemble DETECT identifying it as AI-generated in real time.]

The Pace Is a Challenge

The conversation around AI-generated video tends to focus on individual models. Seedance is the story this week. Before that it was Sora. Before that it was Kling. But the real story is the pace at which these models are shipping.

In February 2026 alone, Kling 3.0 dropped on February 4 and Seedance 2.0 followed on February 8-10. That's two major video model releases in under a week, alongside ongoing iterations of Sora 2, Veo 3.1, Runway Gen-4, and Wan 2.6.

To put it in perspective, here's what the AI video generation landscape looks like in just the first six weeks of 2026:

| Model | Developer | Release / Update | Key Capability |
|---|---|---|---|
| Veo 3.1 | Google DeepMind | January 2026 | Native 4K, vertical video, improved character consistency, "Ingredients to Video" reference system |
| Wan 2.6 | Alibaba | January 2026 | Cost-efficient generation (~$0.05/sec), open-source, improved motion quality |
| Kling 3.0 | Kuaishou | February 4, 2026 | First native 4K/60fps AI video, multi-shot sequences with subject consistency, multi-character audio |
| Seedance 2.0 | ByteDance | February 8-10, 2026 | 12-file multimodal input, native audio-visual sync, phoneme-level lip sync, 90%+ usable output rate |
| Sora 2 | OpenAI | Continuous updates | Physics simulation benchmark, 25-second generation, Storyboard editing |
| Runway Gen-4 | Runway | Continuous updates | Production-grade editing workflows, professional tooling, Gen-4.5 iteration |
| Pika 2.5 | Pika | Continuous updates | Fast iteration, social media-optimized effects and transitions |

That's seven actively shipping AI video models from seven different companies, and we're only halfway through February. And that's just video, before accounting for audio and image model releases from OpenAI, Google, ElevenLabs, Stability, and others that detection systems also need to cover. Each release introduces new architectures, new capabilities, and new artifact signatures.

This isn't a once-a-year event anymore. It's a rolling cycle where new generation capabilities emerge weekly, each one more capable than the last. For anyone building detection infrastructure, the challenge isn't detecting any single model. It's building systems that can adapt to new architectures continuously, without starting from scratch every time something new drops.

What Makes Seedance 2.0 Different

Seedance 2.0 isn't just another text-to-video model. It's built on ByteDance's dual-branch diffusion transformer architecture, and it introduces several capabilities that make detection harder than it was for previous-generation models.

Multimodal input at scale

Seedance accepts up to 12 reference files simultaneously, including images, video clips, audio tracks, and text prompts. This means generated outputs aren't just responding to text descriptions. They're synthesizing visual, temporal, and auditory references into a coherent clip. The resulting outputs carry a different artifact signature than pure text-to-video models.
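
To make that concrete, here's what a multi-reference request might look like. The endpoint, field names, and parameters below are illustrative assumptions, not ByteDance's actual API.

```python
# Hypothetical request shape for a multi-reference generation call.
# Endpoint, field names, and limits are illustrative only; this is NOT
# ByteDance's actual Seedance API.
import requests

payload = {
    "prompt": "Two actors trade blows on a rain-slicked rooftop at night",
    "references": [  # up to 12 mixed-modality files
        {"type": "image", "url": "https://example.com/face_a.jpg"},
        {"type": "image", "url": "https://example.com/face_b.jpg"},
        {"type": "video", "url": "https://example.com/fight_choreo.mp4"},
        {"type": "audio", "url": "https://example.com/ambience.wav"},
    ],
    "audio": "native",  # audio and video generated in a single pass
    "duration_seconds": 10,
}

resp = requests.post("https://api.example.com/v1/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["video_url"])  # assumed response field
```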

Native audio-visual synchronization

Unlike most video generators that treat audio as a post-processing step, Seedance generates video and audio in a single pass. Lip movements are synchronized at the phoneme level. This eliminates one of the most common tells in previous-generation deepfakes: mismatched lip sync.
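
For readers who want the mechanics, here's a minimal sketch of that legacy lip-sync check. It assumes the inputs were extracted upstream: per-frame mouth openness from any face-landmark model, and a per-frame audio RMS envelope sampled at the same rate.

```python
# A minimal sketch of the legacy lip-sync tell: correlate mouth motion
# against speech energy over a small range of time offsets.
import numpy as np

def sync_score(mouth_open: np.ndarray, audio_rms: np.ndarray, max_lag: int = 10):
    """Best correlation between mouth motion and speech energy over small lags.

    Low peak correlation, or a peak at a large lag, was the classic
    bolted-on-audio giveaway.
    """
    assert len(mouth_open) == len(audio_rms) > 2 * max_lag
    m = (mouth_open - mouth_open.mean()) / (mouth_open.std() + 1e-8)
    a = (audio_rms - audio_rms.mean()) / (audio_rms.std() + 1e-8)
    best_corr, best_lag = -1.0, 0
    for lag in range(-max_lag, max_lag + 1):
        c = (np.corrcoef(m[lag:], a[:len(a) - lag])[0, 1] if lag >= 0
             else np.corrcoef(m[:lag], a[-lag:])[0, 1])
        if c > best_corr:
            best_corr, best_lag = float(c), lag
    return best_corr, best_lag
```

Seedance-class models pass this check cleanly, which is exactly why no single cue can carry detection anymore.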

Character consistency across shots

Seedance maintains face, clothing, and scene consistency across multi-shot sequences. Previous models would often "drift" between cuts, with subtle but detectable changes in facial geometry or lighting. Seedance's consistency makes frame-by-frame comparison, a traditional detection technique, significantly less effective.
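
Here's what that traditional check looks like in a minimal sketch, assuming 68-point facial landmarks (dlib convention) have already been extracted for each frame; the landmark model and flagging threshold are left open.

```python
# A minimal sketch of the frame-by-frame consistency check: measure how much
# the face's internal geometry changes between consecutive frames.
# landmarks has shape (num_frames, 68, 2).
import numpy as np

def geometry_drift(landmarks: np.ndarray) -> np.ndarray:
    """Mean change in normalized inter-landmark distances between frames."""
    diffs = landmarks[:, :, None, :] - landmarks[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)            # (F, 68, 68)
    iod = dists[:, 36, 45][:, None, None] + 1e-8      # outer eye corners
    dists = dists / iod                               # scale-invariant geometry
    return np.abs(np.diff(dists, axis=0)).mean(axis=(1, 2))  # (F-1,)
```

Older generators spiked this signal across cuts. Seedance keeps it flat, so the check can still run inside an ensemble but no longer delivers a verdict on its own.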

Scene-specific sound design

The model generates contextually appropriate sound effects and background audio, not just dialogue. A glass breaking sounds like a glass breaking. This audio coherence makes the content more convincing and removes another category of artifacts that detection systems have historically relied on.

ByteDance reports a 90%+ usable output rate, meaning 9 out of 10 generations produce commercially viable video without re-generation. Compare that to roughly 60-70% for Sora 2 and 75% for Kling 3.0 based on community benchmarks. Higher consistency means fewer visible artifacts in circulation, which means detection needs to work harder.

How We Detect It

The reason we can respond to something like Seedance in days instead of months comes down to how Resemble AI is built.

Most companies working on deepfake detection are detection-only. They're training classifiers against generation outputs they don't fully understand. We build both sides. Resemble develops generative voice AI alongside our detection stack. That dual understanding of how generative models work from the inside, what patterns they produce, what artifacts they leave, and where they succeed and fail, gives us a structural advantage when new models emerge.

When Seedance dropped, our team didn't start from zero. We started from DETECT-3B Omni, our multimodal deepfake detection model. DETECT-3B Omni is a 3 billion parameter model trained across audio, image, and video modalities. It's been battle-tested against 160+ generative AI models and achieves 98% accuracy across more than 40 languages. The model is already robust against major generative architectures, including diffusion-based video generators.
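
As an illustration of what a programmatic check looks like in practice, here's a hypothetical client sketch. The endpoint, field names, and response schema are placeholders for illustration, not our published API.

```python
# Hypothetical detection client. The URL, fields, and response schema are
# placeholder assumptions, not a published API.
import requests

def check_media(path: str, api_key: str) -> dict:
    """Upload a media file and return the detector's verdict."""
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.example.com/v1/detect",  # placeholder endpoint
            headers={"Authorization": f"Bearer {api_key}"},
            files={"media": f},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()  # e.g. {"label": "ai_generated", "score": 0.97}

print(check_media("suspect_clip.mp4", api_key="YOUR_KEY"))
```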

From there, the cycle is:

1. Collect samples across a range of prompt types, input configurations, and output qualities
2. Run adversarial tests against our existing model to identify where it already generalizes and where the new architecture introduces novel patterns
3. Fine-tune against the new generation signatures without retraining from scratch (sketched below)
4. Validate against our full test suite to ensure no regression, then deploy
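
Step 3 is what makes the days-not-months turnaround possible: keep the learned representations, adapt only what needs to change. A minimal sketch of the idea, with a toy stand-in for the real model and data:

```python
# A minimal sketch of step 3: freeze the shared backbone, adapt only the
# classification head on fresh samples from the new generator, then rerun
# the regression suite. The model and data here are toy stand-ins.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Detector(nn.Module):
    """Stand-in for a pretrained detector: backbone features + small head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())  # placeholder
        self.classifier = nn.Linear(256, 1)

    def forward(self, x):
        return self.classifier(self.backbone(x)).squeeze(-1)

model = Detector()
for p in model.backbone.parameters():  # keep what prior generators already taught
    p.requires_grad = False

optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

# Toy stand-in for new-generator features plus matched real samples
features, labels = torch.randn(256, 512), torch.randint(0, 2, (256,)).float()
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

for epoch in range(3):  # short adaptation pass, not a full training run
    for x, y in loader:
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Step 4: rerun the full held-out suite before deploying to confirm no regression.
```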

This cycle takes days. And because we understand generative modeling at a deep level, we know where to look.

Why This Matters Beyond Hollywood

The Seedance conversation has been dominated by Hollywood, and understandably so. The MPA called it "unauthorized use of U.S. copyrighted works on a massive scale." SAG-AFTRA condemned the "blatant infringement." The Human Artistry Campaign called it "an attack on every creator around the world."

But the implications extend far beyond entertainment.

Financial fraud

A deepfake video convincing enough to fool a casual viewer is also convincing enough to fool a colleague on a video call. In 2024, an employee at a multinational company was tricked into transferring $25 million after a video call with deepfakes impersonating senior management. As video generation quality improves, these attacks become cheaper and more scalable.

Disinformation

A generated video of a public figure saying something they never said, distributed during a news cycle or election, can cause real damage before any correction reaches the audience. The speed of generation, measured in seconds, already outpaces the speed of verification for most platforms.

Identity theft and non-consensual content

ByteDance suspended Seedance's face-to-voice feature after a tech reviewer demonstrated that uploading a single facial photo could generate video with his exact voice, without any audio sample. The privacy implications are immediate and severe.

Enterprise trust

As AI-generated content becomes indistinguishable from real footage at a glance, organizations need programmatic ways to verify content provenance. Human judgment alone is no longer a reliable filter. A recent study found that only 0.1% of participants correctly identified all real and fake content in a mixed set.
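
Here's one minimal provenance pattern, sketched with symmetric keys for brevity; real deployments would use asymmetric signatures and standard manifests such as C2PA.

```python
# One minimal provenance pattern (not a full C2PA implementation): check a
# file's hash against a signed manifest entry. Symmetric HMAC keeps the
# sketch short; production systems would use asymmetric signatures.
import hashlib
import hmac

def verify_provenance(path: str, manifest_hash: str, signature: str, key: bytes) -> bool:
    """True only if the file is unaltered AND the manifest entry is authentic."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != manifest_hash:
        return False  # content was modified after the manifest was signed
    expected = hmac.new(key, manifest_hash.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```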

The "Will Smith eating spaghetti" era of AI video is dead. We've crossed into territory where most people will be wrong more often than right when guessing whether something is real or generated. That's not a dystopian prediction. That's today.

Detection Is Keeping Pace

The narrative around AI video right now is overwhelmingly one of doom. "Hollywood is cooked." "It's over for us." The generation side of the equation gets all the attention because the outputs are visual, viral, and visceral.

But detection is advancing at the same pace, and almost nobody is covering that side of the story.

Resemble AI's DETECT-3B Omni achieves greater than 99% accuracy on outputs from StyleGAN, DALL-E 3, GPT-4o, Flux v2, and Gemini 2.0 Flash. It detects Midjourney v7 outputs at 98%. It covers 40+ languages for audio deepfakes. It runs in under 300 milliseconds. And it deploys on-premise for organizations that require air-gapped environments.

When Seedance dropped, we didn't issue a press statement. We started training against it. The results are strong and we'll share specifics as we complete validation.

This is the cycle we've built for. A new model launches. Content goes viral. And somewhere in a Slack channel, our team is already pulling samples and updating DETECT.

It's Not Over

The real question isn't whether AI can generate convincing video. It can. The question is whether detection infrastructure can keep pace with the generation side of the equation.

We believe it can, for a structural reason: creation and detection aren't opposing forces. They're two sides of the same discipline and they advance together. When you build generative AI, you learn what makes synthetic content synthetic. When you build detection, you learn what makes the next generation of models harder to detect, and you build for that.

There's also a cultural point worth making. The panic around Seedance assumes that generated content will replace the real thing. We don't think that's right. The Cruise vs Pitt clip is impressive as a tech demo. But nobody is going to sit in a theater for two hours watching generated actors and feel something. The content people love is made by people they care about. The performances, the choices, the humanity behind it. A generated Tom Cruise isn't Tom Cruise. It's a convincing copy of his face, and there's a massive difference.

The entertainment industry shouldn't have to rely on press statements and takedown requests. Enterprises and governments shouldn't have to guess. People scrolling their feeds shouldn't have to wonder.

The tools to detect this exist and are advancing fast. The appetite for real, human-created content isn't going anywhere. And the next model will probably drop in a few weeks. We'll be ready for that one too.

Detect Deepfakes Across Audio, Video, and Images

DETECT-3B Omni is a 3 billion parameter multimodal detection model battle-tested against 160+ generative AI models. Real-time detection in under 300ms. On-premise deployment available. Already training against Seedance 2.0.