PerTh Multimodal Watermarking Model

THE PROBLEM

When any modality can be faked, every modality needs a mark.

The tools to create convincing synthetic content now require almost no technical skill. EU AI Act Article 50 requires machine-readable marking of AI-generated content by August 2026. PerTh Multimodal embeds an invisible, persistent watermark at creation across audio, video, image, and text so the proof of origin travels with the content.



STEP 1 • ENCODE

Ownership embedded at creation

PerTh Multimodal embeds a data payload into perceptually masked regions of the signal. Inaudible in audio, imperceptible in image and video, semantically neutral in text. Add a custom identifier and it travels with the content.



STEP 2 • SURVIVE

Withstand real-world handling

The payload persists through format conversion, re-encoding, compression, and editing. PerTh Multimodal is trained against the transformations content encounters in the real world.



STEP 3 • DECODE

Verify on demand

Run any file through the decoder and PerTh Multimodal returns your custom identifier, along with any C2PA signatures and SynthID marks it finds, in a single provenance report.

WHAT PERTH MULTIMODAL DOES

PerTh Multimodal extends the architecture to every modality

The same perceptual masking principles that made PerTh work for audio now apply to video, image, and text with explicit payloads, so every file carries a verifiable identifier.

Multimodal by design

PerTh Multimodal extends PerTh's neural watermarking architecture to audio, video, image, and text. Each modality uses algorithms tuned to its signal characteristics: psychoacoustic masking for audio, pixel-domain masking for image and video, semantic rewriting for text.

Explicit payload with custom identifier

Encode your organization name, system ID, or any string that fits the bit limit directly into the mark. Audio supports a 16-bit payload; image and video support payloads of up to 256 bits. Anyone who decodes the mark gets that identifier back, satisfying the EU AI Act's system identifier requirement.
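To make those bit budgets concrete, here is a minimal sketch of packing an identifier into a fixed-width payload. It is an illustration under stated assumptions, not the PerTh encoder: the helper names, the UTF-8 packing, the example identifier, and the `REGISTRY` lookup are all hypothetical. A 256-bit payload holds up to 32 bytes of UTF-8 text, while 16 bits hold a number from 0 to 65535, so this sketch gives the short payload a numeric ID that a lookup table maps back to a name.

```python
def text_to_bits(text: str, n_bits: int) -> list[int]:
    """Pack UTF-8 text into exactly n_bits bits, zero-padded at the end."""
    if n_bits % 8:
        raise ValueError("n_bits must be a multiple of 8")
    data = text.encode("utf-8")
    if len(data) * 8 > n_bits:
        raise ValueError(f"needs {len(data) * 8} bits, payload holds {n_bits}")
    data = data.ljust(n_bits // 8, b"\x00")
    # Most significant bit of each byte first.
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def bits_to_text(bits: list[int]) -> str:
    """Inverse of text_to_bits: rebuild the bytes, drop padding, decode."""
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return out.rstrip(b"\x00").decode("utf-8")


def int_to_bits(value: int, n_bits: int) -> list[int]:
    """Write a non-negative integer as n_bits bits, most significant first."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError(f"{value} does not fit in {n_bits} bits")
    return [(value >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


def bits_to_int(bits: list[int]) -> int:
    """Inverse of int_to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


# Hypothetical identifier: 20 bytes of text fits a 256-bit payload.
image_bits = text_to_bits("acme-media/system-42", 256)
print(len(image_bits), bits_to_text(image_bits))

# A 16-bit payload carries a numeric ID; the registry is an assumption.
REGISTRY = {4242: "acme-media/system-42"}
audio_bits = int_to_bits(4242, 16)
print(REGISTRY[bits_to_int(audio_bits)])
```

The design choice worth noting is the lookup table: two bytes cannot spell out an organization name, so a short payload works best as an index into a record you maintain.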

Reads third-party marks

Decodes PerTh watermarks, C2PA signatures, and SynthID in a single pass. Designed to extend to new marks as other providers release them.

Resilient to real-world attacks · Audio

Near-100% data recovery across resampling, re-encoding, noise injection, compression, and pitch shifting.

Improved accuracy over baseline · Image and video

Fine-tuned beyond the Meta open-source baseline. Improved recovery under compression, cropping, brightness changes, and blur.

Pair with Detect

Use PerTh Multimodal alongside Resemble Detect to verify Resemble-generated content and flag synthetic content from any source.

HOW THE MODEL WORKS

Perceptual masking, applied per signal

PerTh Multimodal's architecture applies the core insight of the original PerTh model (encode data only into the regions humans can't perceive) to four modalities, each with signal-appropriate algorithms.

Audio psychoacoustic masking

Auditory masking creates a perceptual blanket in amplitude-frequency-time space where quieter sounds are hidden by louder ones nearby. The watermark lives inside that region. The model is trained with regularization against resampling, re-encoding, noise injection, and time-stretching, producing near-100% recovery across a standard attack suite.
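The masking idea can be sketched with a toy calculation. This is not PerTh's model, which is a trained neural network: the function names, the 12 dB offset, the 10 dB-per-band falloff, and the 3 dB safety margin are all made-up values for illustration, and real psychoacoustic models use more detailed, asymmetric spreading functions. The sketch only shows the principle that a loud band raises the level below which a neighbouring sound goes unheard.

```python
def masking_threshold_db(band_levels_db, offset_db=12.0, slope_db_per_band=10.0):
    """Toy simultaneous-masking threshold per frequency band.

    Every band masks itself and its neighbours; the masking effect drops
    linearly (in dB) with distance in bands. The threshold for a band is
    the strongest masking contribution it receives from any band.
    """
    n = len(band_levels_db)
    return [
        max(
            band_levels_db[k] - offset_db - slope_db_per_band * abs(b - k)
            for k in range(n)
        )
        for b in range(n)
    ]


def watermark_budget_db(band_levels_db, safety_margin_db=3.0):
    """Highest watermark level per band that stays under the threshold."""
    return [t - safety_margin_db for t in masking_threshold_db(band_levels_db)]


# Hypothetical band levels in dB: two loud bands around two quiet ones.
levels = [60.0, 20.0, 20.0, 55.0]
print(masking_threshold_db(levels))
print(watermark_budget_db(levels))
```

In this example the quiet middle bands get a threshold well above their own level, because the loud neighbours dominate: that headroom is the kind of region the text describes the watermark living in.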

Image and video pixel-domain modification

Imperceptible pixel-domain modifications carry the mark and survive compression, cropping, brightness adjustment, color jitter, and blur, with accuracy exceeding the Meta open-source baseline.

Text semantic rewriting

A semantic rewriting model makes meaning-preserving changes to word choice and phrasing. The mark is embedded in the pattern of the rewrite. The decoder looks for its own pattern to determine authorship.
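A toy version of "the mark is in the pattern of the rewrite" can be built from synonym choices. This is a deliberately simple stand-in, not the semantic rewriting model described above: the synonym list, the one-bit-per-word rule, and the function names are all assumptions made for illustration. Each synonym pair is a slot, and picking the first or second word of the pair writes a 0 or a 1.

```python
import re

# Hypothetical synonym pairs: the first word encodes 0, the second encodes 1.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("start", "begin"), ("help", "assist")]

# Map each word to (pair index, bit it encodes).
WORD_TO_SLOT = {
    word: (index, bit)
    for index, pair in enumerate(SYNONYM_PAIRS)
    for bit, word in enumerate(pair)
}


def embed(text: str, bits: list[int]) -> str:
    """Rewrite each known word to the synonym that encodes the next bit.

    Words met after the bits run out are left unchanged. The toy works on
    lowercase text and does not preserve capitalisation.
    """
    remaining = list(bits)

    def swap(match: re.Match) -> str:
        word = match.group(0)
        slot = WORD_TO_SLOT.get(word.lower())
        if slot is None or not remaining:
            return word
        return SYNONYM_PAIRS[slot[0]][remaining.pop(0)]

    return re.sub(r"[A-Za-z]+", swap, text)


def extract(text: str) -> list[int]:
    """Read one bit from every known synonym, in order of appearance."""
    return [
        WORD_TO_SLOT[word.lower()][1]
        for word in re.findall(r"[A-Za-z]+", text)
        if word.lower() in WORD_TO_SLOT
    ]


original = "a quick way to start is to help a big team"
marked = embed(original, [1, 0, 1, 1])
print(marked)
print(extract(marked))
```

The decoder here needs no copy of the original sentence; it only needs to know the synonym table, which plays the role of the pattern the decoder looks for.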

VERIFY

PerTh watermarker: survives compression, re-encoding, and attack

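Detection accuracy across 18 test conditions: 16 attack transforms plus clean and unwatermarked controls. PerTh V2 ships with improved robustness to pitch shift, noise, and spectral manipulation. In the results below, pitch-shift detection rises from 10% to 94%, clipped Gaussian noise from 45% to 100%, and contiguous band masking from 72% to 98%, while reverb moves from 95% to 88%.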

~100%

Detection on clean and compressed audio

PerTh V2 · No-attack + standard codecs

Legend: STRONG (>90%) · MODERATE (70-90%) · WEAK (<70%)

Condition                            PerTh (open source, audio only)    PerTh Multimodal
wav_dither_attack                    100%                               100%
random_wav_wavelet_noise             90%                                100%
random_wav_reverb_attack             95%                                88%
random_wav_resample_attack           100%                               100%
random_wav_precision_attack          100%                               100%
random_wav_pitch_shift_attack        10%                                94%
random_wav_mulaw_attack              100%                               100%
random_wav_high_pass_attack          100%                               100%
random_wav_gaussian_noise_clipped    45%                                100%
random_spec_time_mask                100%                               100%
random_spec_stretch                  100%                               98%
random_spec_scale                    100%                               100%
random_spec_lowclip                  100%                               100%
random_spec_highclip                 100%                               100%
random_spec_gaussian_noise_clipped   100%                               100%
random_spec_contiguous_band_mask     72%                                98%
no_watermark                         100%                               100%
no_attack                            100%                               100%
Accuracy                             100%                               100%

Watermark robustness is the detection accuracy after each attack transform, shown as a percentage. "no_attack" is clean audio. "no_watermark" is accuracy on unwatermarked audio, so 100% means no false positives.
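The results above can be checked against the STRONG / MODERATE / WEAK legend with a few lines of code. The numbers are copied from the two result sets; the function names are hypothetical, and treating exactly 90% as MODERATE is an assumption, since the legend gives ">90%" for STRONG and "70-90%" for MODERATE.

```python
# Detection accuracy (%) per condition: (open-source PerTh, PerTh Multimodal).
RESULTS = {
    "wav_dither_attack": (100, 100),
    "random_wav_wavelet_noise": (90, 100),
    "random_wav_reverb_attack": (95, 88),
    "random_wav_resample_attack": (100, 100),
    "random_wav_precision_attack": (100, 100),
    "random_wav_pitch_shift_attack": (10, 94),
    "random_wav_mulaw_attack": (100, 100),
    "random_wav_high_pass_attack": (100, 100),
    "random_wav_gaussian_noise_clipped": (45, 100),
    "random_spec_time_mask": (100, 100),
    "random_spec_stretch": (100, 98),
    "random_spec_scale": (100, 100),
    "random_spec_lowclip": (100, 100),
    "random_spec_highclip": (100, 100),
    "random_spec_gaussian_noise_clipped": (100, 100),
    "random_spec_contiguous_band_mask": (72, 98),
    "no_watermark": (100, 100),
    "no_attack": (100, 100),
}


def tier(accuracy_pct: float) -> str:
    """Map an accuracy to the legend tier (90% itself counts as MODERATE)."""
    if accuracy_pct > 90:
        return "STRONG"
    if accuracy_pct >= 70:
        return "MODERATE"
    return "WEAK"


def below_strong(column: int) -> dict[str, str]:
    """Conditions not in the STRONG tier for one model (0 = open source, 1 = Multimodal)."""
    return {
        name: tier(scores[column])
        for name, scores in RESULTS.items()
        if tier(scores[column]) != "STRONG"
    }


def largest_changes(n: int = 3) -> list[tuple[str, int]]:
    """The n conditions with the biggest change in percentage points."""
    deltas = {name: new - old for name, (old, new) in RESULTS.items()}
    return sorted(deltas.items(), key=lambda item: abs(item[1]), reverse=True)[:n]


print(below_strong(0))
print(below_strong(1))
print(largest_changes())
```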

Frequently asked questions

How does PerTh Multimodal embed the watermark without affecting signal quality?

PerTh Multimodal places the data payload inside perceptually masked regions: frequencies already masked by louder sounds in audio, imperceptible pixel modifications in image and video, and meaning-preserving rewrites in text. Nothing is added above the perceptual threshold.

What attack types does PerTh Multimodal survive?

For audio, near-100% data recovery across resampling, re-encoding, MP3 compression, pitch shifting, time-stretching, noise injection, high and low pass filtering, and added delay. For image and video, the mark survives compression, cropping, brightness changes, color jitter, and blur.

What is the 16-bit explicit payload and what can it encode?

The explicit payload lets you embed a custom identifier (an organization name, system ID, or any string within the bit limit) directly into the mark. The limit is 16 bits for audio and up to 256 bits for image and video. Decoding returns that identifier, enabling verifiable origin attribution rather than a yes/no watermark detection. This satisfies the EU AI Act Article 50 requirement for a system identifier in the mark.

Does the decoder require the original file?

No. For audio, the payload is distributed across the waveform so any non-silent segment is sufficient for recovery. You don't need access to the original generation request or file.

Which third-party marks does the decoder read?

PerTh Multimodal reads PerTh watermarks, C2PA signatures, and SynthID in a single pass. The architecture is designed to extend to additional marks as other providers expose their formats.

How does PerTh Multimodal map to EU AI Act Article 50?

Article 50 requires machine-readable marking across modalities, dual independent provenance layers, and a system identifier in the mark. PerTh Multimodal covers all four modalities, embeds both a neural watermark and C2PA signature simultaneously, and supports a custom identifier via the explicit payload.

How do I get started?

Talk to us about your use case. Whether it's a voice agent, content platform, media production, or authentication, we’ll scope the right setup.

A multimodal watermark for every piece of content you generate

PerTh Multimodal is one layer of a broader safety stack