Module 2.2 — Synthetic Audio Detection: What’s Catchable

50-minute lecture · Day 2 morning · Hands-on Python in the lab

Learning objectives

By the end of this module, students can:

Name the current state of the AASIST detector family and its successors (AASIST3, hybrid SSL backbones)
Choose from at least three currently-available open-source audio deepfake detection models on Hugging Face
Build a working Python audio deepfake detection pipeline that combines pretrained model inference with spectral-heuristic fallback
Identify three classes of evasion technique that defeat current detectors — and articulate when an audio detector is not the right control for the threat

The honest state of audio deepfake detection in mid-2026

Five years ago, distinguishing AI-generated audio from human speech was an open research question. Today, artifact-level detection is measurably losing ground to generation quality.

The 2026 reality: commercial voice clones from ElevenLabs v3, OpenAI Voice, Cartesia, and Resemble AI regularly produce output that scores below threshold on detectors trained on older datasets. Day 1’s anti-pattern lesson (Module 1.6: “we have an audio detector, we’re deepfake-safe”) applies here in concrete form: the audio detector is a useful signal among many, not a binary decision-maker.

This module teaches students how to build the detector competently while internalizing its limits. The durable workflow control covered in Module 2.4 is what catches the cases the audio detector misses.

The AASIST family — past, present, near-future

AASIST (2021), short for Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks, was the gold-standard end-to-end audio anti-spoofing detector for several years. It pairs a RawNet2-style raw-waveform encoder with graph attention for spectro-temporal fusion. Reference benchmark: ASVspoof 2019/2021.

AASIST3 (late 2024) replaced the MLP layers of the original architecture with Kolmogorov-Arnold Networks (KAN) and reported substantial accuracy gains on cross-dataset evaluation. AASIST3 is now the baseline for new audio anti-spoofing research.

Current SOTA in 2026 is a hybrid pattern: AASIST3 (or a successor) used as the classifier head on top of a large self-supervised speech foundation model — XLS-R 2B, wav2vec2-XLSR-53, or Whisper-Large-v3 encoders. The SSL backbone provides generalization across speakers, languages, and recording conditions; the AASIST3 head provides the spoofing-discrimination training signal.

Practical implication for the detection engineer: if you’re picking a single model to deploy, look for ones that combine an SSL foundation backbone with an AASIST-style head. The pure-RawNet detectors of 2021 are now baselines, not deployment choices.

Open-source models available May 2026

Three categories of model the detection engineer should know:

Production-ready on Hugging Face

garystafford/wav2vec2-deepfake-voice-detector — Wav2Vec2-XLSR fine-tuned for binary deepfake classification. Reported ROC-AUC 0.998 and balanced accuracy 98.4% at a 0.9 threshold on the developer’s eval set. License: per model card. Verified May 2026.
Gustking/wav2vec2-large-xlsr-deepfake-audio-classification — the base model garystafford fine-tunes from. Pretrained on 53 languages, then deepfake-classified.
Pranjal-Pravesh’s anti-spoofing models — multiple models tagged anti-spoofing on Hugging Face from this author; useful as ensemble members.

Caution: model card metrics are self-reported by the model’s developer. Always validate against your own audio environment (recording quality, codec, language mix) before deployment. Module 2.6 covers the calibration discipline.
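
To make the caution concrete, the two metrics quoted on model cards (ROC-AUC and balanced accuracy at a threshold) can be recomputed on your own labelled audio. The following is a minimal sketch, assuming you already have one score per clip (higher means more likely synthetic) and a ground-truth label per clip. The function names and the demo numbers are illustrative and do not come from any model card.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC as the probability that a random synthetic clip (label 1)
    outscores a random real clip (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one clip of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores: Sequence[float], labels: Sequence[int],
                      threshold: float) -> float:
    """Mean of true-positive rate and true-negative rate at one threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one clip of each class")
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Illustrative scores only: three synthetic clips, three real clips.
demo_scores = [0.95, 0.91, 0.60, 0.20, 0.85, 0.10]
demo_labels = [1, 1, 1, 0, 0, 0]
print("ROC-AUC:", roc_auc(demo_scores, demo_labels))
print("Balanced accuracy @0.9:", balanced_accuracy(demo_scores, demo_labels, 0.9))
```

The pairwise AUC is quadratic in the number of clips, which is fine for a held-out set of a few hundred samples. For larger sets a library implementation is the more practical choice.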

Research-tier (current AASIST3 variants)

These are typically distributed as research code on GitHub rather than as Hugging Face models:

AASIST3 reference implementations
XLS-R + AASIST3 hybrids (typically require PyTorch + custom training/inference scripts)
ASVspoof 5 challenge top systems (ASVspoof 5, run in 2024, is the most recent edition of this periodic benchmark)

Commercial APIs

Resemble AI Detect — proprietary, API-based. Cross-dataset accuracy ~94% reported.
Reality Defender — multi-modal (audio + image + video). API-based; enterprise pricing.
Pindrop — voice biometric and deepfake detection for contact centers. Heavily used in financial services.

The commercial APIs offer faster integration, but they cost more per call at SOC volume and their detection logic is harder to audit. For most SOC deployments, an open-source model + heuristic fallback (below) is the right starting point.

A working detection pipeline (the lab handout)

The pipeline below is the production pattern: try a pretrained model first, and fall back to spectral-heuristic scoring when no model can be loaded. Codex generated this implementation; we validated it against syntax checks and reviewed the logic.

Architecture

input audio (WAV/MP3, any sample rate)
    ↓ librosa.load → mono 16 kHz, float32
    ↓
[1] Spectral feature extraction
    - MFCC (13 coefficients)
    - Spectral centroid
    - Spectral rolloff
    - Zero-crossing rate
    - Chroma (12 bins)
    ↓
[2] Hugging Face model inference (best-effort)
    - Try MODEL_CANDIDATES in order
    - Use AutoFeatureExtractor + AutoModelForAudioClassification
    - Map model logits to synthetic probability via label-text heuristics
    - Returns None if no model loads
    ↓
[3] Spectral-heuristic fallback (always available)
    - Low MFCC dynamic range → oversmoothing artifact (synthetic indicator)
    - High zero-crossing-rate variance → transient artifacts (synthetic indicator)
    - Spectral-centroid drift → vocoder residue (synthetic indicator)
    - Combines into one heuristic confidence score 0-1
    ↓
[4] Combine model + heuristic confidence
    - If model loaded: weighted combination favoring model
    - If model failed: pure heuristic
    - Output: dict {confidence, verdict, features, model_id, mode}
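
Step [2] above maps model logits to a synthetic probability via label-text heuristics. One way this could look is sketched below, assuming the label names come from a mapping such as the model config's id2label dictionary. The hint lists and function names are hypothetical and are not taken from the lab implementation.

```python
import math
from typing import Dict, List, Optional

# Hypothetical keyword lists; extend them for the label names you actually see.
SYNTHETIC_HINTS = ("fake", "spoof", "synthetic", "generated")
REAL_HINTS = ("real", "bonafide", "bona fide", "genuine", "human")


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def synthetic_probability(logits: List[float],
                          id2label: Dict[int, str]) -> Optional[float]:
    """Return P(synthetic), or None when no label name is recognizable."""
    probs = softmax(logits)
    names = {i: id2label.get(i, "").lower() for i in range(len(probs))}
    synth = [i for i, n in names.items() if any(h in n for h in SYNTHETIC_HINTS)]
    real = [i for i, n in names.items() if any(h in n for h in REAL_HINTS)]
    if synth:
        # Sum the probability mass of every label that looks synthetic.
        return sum(probs[i] for i in synth)
    if real:
        # Only a "real" label is recognizable: take its complement.
        return 1.0 - sum(probs[i] for i in real)
    return None  # e.g. generic names such as LABEL_0 / LABEL_1


print(synthetic_probability([0.2, 1.7], {0: "real", 1: "fake"}))
print(synthetic_probability([0.2, 1.7], {0: "LABEL_0", 1: "LABEL_1"}))
```

Returning None for unrecognized labels lets the caller drop to the heuristic path instead of guessing which class index means "synthetic". Substring matching is crude (a label such as "not_fake" would be misread), so inspect each candidate's labels by hand before trusting the mapping.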

Key pipeline pattern (excerpt)

def detect(audio_path: str, threshold: float = 0.7) -> dict:
    """Main entry point: returns confidence, verdict, and full diagnostic."""
    y, sr = load_audio(audio_path)
    features = compute_spectral_features(y, sr)

    # Try pretrained model first
    model_result = try_hugging_face_model(y, sr)
    model_confidence = model_result.get("confidence") if model_result else None

    # Always compute heuristic
    heuristic_score, heuristic_factors = heuristic_confidence(features, sr)

    # Combine
    if model_confidence is not None:
        combined = 0.7 * model_confidence + 0.3 * heuristic_score
        mode = f"model+heuristic ({model_result['model_id']})"
    else:
        combined = heuristic_score
        mode = "heuristic-only"

    verdict = (
        "likely_synthetic" if combined >= threshold
        else "likely_real" if combined <= (1 - threshold)
        else "uncertain"
    )

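    return {
        "confidence": combined,
        "verdict": verdict,
        "mode": mode,
        # model_id is part of the documented output; None in heuristic-only mode
        "model_id": model_result.get("model_id") if model_result else None,
        "features": features,
        "heuristic_factors": heuristic_factors,
    }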

The full implementation is 271 lines including CLI handling, JSON output, and error paths. Production deployment requires you to verify which models actually load in your environment. Codex’s model-candidate list contains references that may or may not be live on Hugging Face at any given time; the pipeline’s fallback handles this gracefully, but you should audit your candidate list quarterly.
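
The excerpt calls heuristic_confidence without showing its body. The sketch below is one plausible shape for it, based only on the three indicators named in the architecture: low MFCC dynamic range, high zero-crossing-rate variance, and spectral-centroid drift. The feature keys, the ramp limits, and the equal weighting are all assumptions made for illustration, not values from the lab code, and they would need calibration against real recordings.

```python
from typing import Dict, Tuple


def ramp(value: float, low: float, high: float) -> float:
    """Linearly map value from [low, high] onto [0, 1], clamped at both ends."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def heuristic_confidence(features: Dict[str, float],
                         sr: int) -> Tuple[float, Dict[str, float]]:
    """Combine three spectral indicators into one 0-1 synthetic score.

    sr is accepted only to match the excerpt's call signature; this
    sketch does not use it. All numeric limits are placeholders.
    """
    factors = {
        # Narrow MFCC dynamic range suggests oversmoothing, so invert the ramp.
        "low_mfcc_dynamic_range": 1.0 - ramp(features["mfcc_dynamic_range"], 20.0, 60.0),
        # Large zero-crossing-rate variance suggests transient artifacts.
        "high_zcr_variance": ramp(features["zcr_variance"], 0.001, 0.01),
        # Large spectral-centroid drift (Hz) suggests vocoder residue.
        "centroid_drift": ramp(features["centroid_drift"], 100.0, 800.0),
    }
    score = sum(factors.values()) / len(factors)  # equal weights (assumption)
    return score, factors


example = {"mfcc_dynamic_range": 40.0, "zcr_variance": 0.0055, "centroid_drift": 450.0}
print(heuristic_confidence(example, sr=16000))
```

Returning the per-indicator factors alongside the score keeps the verdict explainable: an analyst can see which indicator drove a high score instead of receiving a bare number.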

Dependencies

librosa
numpy
soundfile
audioread
torch
transformers

Why heuristic fallback matters

Two scenarios where the model-only approach fails:

Air-gapped or restricted environments — corporate SOCs without Hugging Face Hub access cannot load arbitrary models. The heuristic fallback provides a usable signal even when no model loads.
Model unavailability — Hugging Face models can be deprecated, moved, or rate-limited. The fallback ensures the pipeline doesn’t simply fail closed.

Production-grade extensions to the basic pipeline:

Cache model weights locally on first download
Pre-validate the candidate model list at startup; only retain models that loaded
Log every detection event with the model_id and confidence to the SIEM — pattern your downstream rules off the detector telemetry, not the audio file
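
The second and third extensions above can be sketched as follows. The loader is passed in as a parameter so the example runs without network access; in a real deployment it would presumably wrap the model-loading call the pipeline already uses. The event field names are hypothetical and should be adapted to your SIEM schema.

```python
import json
import logging
import time
from typing import Callable, Iterable, List

log = logging.getLogger("audio_detector")


def validate_candidates(candidates: Iterable[str],
                        loader: Callable[[str], object]) -> List[str]:
    """Try each candidate once at startup and keep only those that load."""
    usable = []
    for model_id in candidates:
        try:
            loader(model_id)
        except Exception as exc:  # any load failure disqualifies the candidate
            log.warning("candidate %s rejected: %s", model_id, exc)
        else:
            usable.append(model_id)
    return usable


def detection_event(result: dict, audio_ref: str) -> str:
    """Serialize one detection result as a single JSON line for SIEM ingestion."""
    event = {
        "ts": time.time(),
        "event_type": "audio_deepfake_detection",  # hypothetical field names
        "audio_ref": audio_ref,
        "model_id": result.get("model_id"),
        "mode": result.get("mode"),
        "confidence": round(float(result["confidence"]), 4),
        "verdict": result["verdict"],
    }
    return json.dumps(event, sort_keys=True)


def fake_loader(model_id: str) -> object:
    """Stand-in loader for the demo: pretends one repository is missing."""
    if model_id.startswith("missing/"):
        raise OSError("repository not found")
    return object()


print(validate_candidates(["example/model-a", "missing/model-b"], fake_loader))
print(detection_event({"confidence": 0.62, "verdict": "uncertain",
                       "mode": "heuristic-only"}, "call-0001"))
```

Logging the model_id and mode with every event means a downstream rule can distinguish a model-backed score from a heuristic-only score, which matters when the two have different error rates.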

Evasion techniques that defeat 2026 audio detectors

Three categories of evasion are documented in 2025-2026 research literature:

1. Projected Gradient Descent (PGD) perturbations

Adversarial perturbations at the audio waveform level, optimized against a known detector, can defeat both raw-waveform detectors (RawNet3, original AASIST) and spectrogram-based detectors. The perturbation is typically inaudible to a human listener but moves the detector’s confidence below threshold.

Defender response: train detectors with adversarial augmentation; ensemble multiple detectors with different architectures; treat single-detector confidence as one input among many, not as ground truth.
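
To see why a bounded perturbation can move a score below threshold, consider a deliberately tiny stand-in for a detector: a logistic model over three features. The sketch below runs L-infinity PGD against that toy model. It is not an attack on any real audio detector, which would need automatic differentiation over waveform samples. The weights, inputs, and step sizes are invented for illustration.

```python
import math
from typing import List


def detector_score(x: List[float], w: List[float], b: float) -> float:
    """Toy detector: logistic regression returning P(synthetic)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def pgd_evade(x0: List[float], w: List[float], eps: float,
              step: float, iters: int) -> List[float]:
    """L-infinity PGD that pushes the toy detector's score downward.

    For a linear logit the gradient with respect to x is simply w, so
    each iteration steps against sign(w) and then projects back into
    the eps-ball around the original input x0.
    """
    x = list(x0)
    for _ in range(iters):
        x = [xi - step * sign(wi) for xi, wi in zip(x, w)]               # gradient step
        x = [min(o + eps, max(o - eps, xi)) for xi, o in zip(x, x0)]     # projection
    return x


w, b = [2.0, -1.0, 1.5], -1.0     # invented weights
x0 = [1.0, -0.5, 0.8]             # invented "synthetic" input
x_adv = pgd_evade(x0, w, eps=0.5, step=0.1, iters=10)
print("score before:", detector_score(x0, w, b))
print("score after: ", detector_score(x_adv, w, b))
```

The perturbation never exceeds eps in any coordinate, yet the score drops substantially. That is the reason the defender response treats a single detector's confidence as one input among many.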

2. Neural Codec Smoothing (NCS)

Source-tracing detectors trained to identify the residual fingerprint of specific neural codecs (EnCodec, SNAC) used by commercial voice-clone services can be defeated by a re-encoding pass that smooths or randomizes those fingerprints. Active research area in late 2025.

Defender response: don’t rely on codec-fingerprint detectors alone; combine with semantic and behavioral analysis (Module 2.4 workflow gap).

3. Resampling and phase-vocoder attacks

Static forensic models can be defeated by simple resampling chains and phase-vocoder transformations that shift generative spectral signatures into frequency bands the detector wasn’t trained on.

Defender response: train detectors with diverse augmentation including bandlimited and resampled versions; deploy multiple detectors targeting different spectral regions.
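
Deploying multiple detectors raises the question of how to combine their scores. One simple possibility is sketched below: escalate when any detector crosses the threshold, and also when the detectors disagree strongly, since large disagreement can indicate that one of them has been evaded. The detector names, the 0.7 threshold, and the 0.4 disagreement limit are assumptions for illustration.

```python
from statistics import mean
from typing import Dict


def ensemble_verdict(scores: Dict[str, float], threshold: float = 0.7,
                     disagreement: float = 0.4) -> dict:
    """Summarize per-detector synthetic scores and decide whether to escalate."""
    if not scores:
        raise ValueError("at least one detector score is required")
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "max": max(values),
        "spread": spread,
        # Escalate on any confident detector OR on strong disagreement.
        "escalate": max(values) >= threshold or spread >= disagreement,
    }


# Hypothetical detectors looking at different spectral regions.
print(ensemble_verdict({"lowband": 0.15, "fullband": 0.20, "highband": 0.85}))
print(ensemble_verdict({"lowband": 0.10, "fullband": 0.12, "highband": 0.18}))
```

Using the maximum instead of the mean is a deliberate bias toward recall: an attack that shifts artifacts out of one detector's band should not be averaged away by the detectors it fooled.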

The pattern: every advance in detection prompts a corresponding advance in evasion. The detection engineer’s stance should be that any single audio detector has a half-life of 12-18 months at most against motivated adversaries.

Tuning thresholds for your environment

The Codex pipeline uses a default 0.7 threshold for “likely synthetic.” This is calibration-dependent. Steps to calibrate:

1. Collect a held-out set of 200-500 known-real and 200-500 known-synthetic audio samples representative of your environment (caller languages, audio codecs, recording quality, average call duration).
2. Run the pipeline against every sample and record confidence scores.
3. Plot the score distribution for real vs synthetic. The threshold should sit at the cleanest separation point, which is not necessarily 0.7.
4. Choose your operating point based on false-positive cost vs false-negative cost:
   - High-FP-cost environments (legitimate users being blocked): threshold higher, fewer alerts
   - High-FN-cost environments (executive impersonation must never reach the wire transfer): threshold lower, more alerts, more escalations
5. Recalibrate quarterly as voice-clone technology evolves.
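
The operating-point choice can be made mechanical once the scores are recorded. The sketch below sweeps a threshold grid and picks the value that minimizes a weighted sum of false-positive rate and false-negative rate. The cost weights and the demo scores are invented, and the cost function deliberately ignores how rare synthetic calls are in production traffic, which you would fold in for a real decision.

```python
from typing import List, Optional, Sequence, Tuple


def rates_at(threshold: float, real_scores: Sequence[float],
             synth_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) at one threshold."""
    fpr = sum(s >= threshold for s in real_scores) / len(real_scores)
    fnr = sum(s < threshold for s in synth_scores) / len(synth_scores)
    return fpr, fnr


def pick_threshold(real_scores: Sequence[float], synth_scores: Sequence[float],
                   fp_cost: float, fn_cost: float,
                   grid: Optional[List[float]] = None) -> Tuple[float, float, float, float]:
    """Return (threshold, cost, fpr, fnr) for the lowest-cost grid point.

    Ties resolve to the lowest threshold because only a strictly
    lower cost replaces the current best.
    """
    grid = grid or [i / 100 for i in range(1, 100)]
    best = None
    for t in grid:
        fpr, fnr = rates_at(t, real_scores, synth_scores)
        cost = fp_cost * fpr + fn_cost * fnr
        if best is None or cost < best[1]:
            best = (t, cost, fpr, fnr)
    return best


# Invented calibration scores, far fewer than the 200-500 per class you need.
real = [0.05, 0.10, 0.20, 0.35, 0.55]
synth = [0.45, 0.65, 0.80, 0.90, 0.95]
print("FN-averse:", pick_threshold(real, synth, fp_cost=1, fn_cost=10))
print("FP-averse:", pick_threshold(real, synth, fp_cost=10, fn_cost=1))
```

With the same score distributions, the two cost settings land on different thresholds. The threshold is therefore a business decision encoded as a number, not a property of the model.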

When NOT to deploy an audio detector

There are scenarios where an audio detector is the wrong control:

Customer-facing IVR / call center: false positives are catastrophic for customer experience; threshold ends up so high that detection is near-zero
High-volume telephony with poor audio quality: detectors trained on clean studio audio degrade severely on phone-quality lossy codecs
Workflows where out-of-band verification is already mandatory: the workflow gap (Module 2.4) is the durable control; the audio detector adds cost without proportional benefit

In these cases, route investment to the workflow-gap detection layer instead.
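
The arithmetic behind these scenarios is a base-rate calculation: when genuine calls vastly outnumber synthetic ones, even a small false-positive rate produces far more false alerts than true ones. The sketch below works through one invented example; the call volume, prevalence, and error rates are assumptions chosen for illustration, not measurements.

```python
def daily_alert_math(calls_per_day: float, prevalence: float,
                     tpr: float, fpr: float) -> dict:
    """Expected daily alert counts and alert precision for a detector.

    prevalence: fraction of calls that are actually synthetic
    tpr: true-positive rate (recall) at the chosen threshold
    fpr: false-positive rate at the chosen threshold
    """
    synthetic_calls = calls_per_day * prevalence
    legit_calls = calls_per_day - synthetic_calls
    true_alerts = synthetic_calls * tpr
    false_alerts = legit_calls * fpr
    total = true_alerts + false_alerts
    return {
        "true_alerts": true_alerts,
        "false_alerts": false_alerts,
        "precision": true_alerts / total if total else 0.0,
    }


# Invented numbers: 20,000 calls/day, 1 in 2,000 synthetic, 90% TPR, 2% FPR.
demo = daily_alert_math(20_000, 0.0005, 0.90, 0.02)
print(demo)
```

In this invented setting roughly 9 true alerts are buried under about 400 false ones each day, so only around 2% of alerts point at a real deepfake. Raising the threshold until the false alerts are tolerable is what drives detection toward zero in the call-center scenario.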

Discussion questions (~10 min)

Your CISO mandates an audio deepfake detector for all incoming executive calls. You measure the detector against your environment and find 12% FPR at your chosen threshold. Calculate the daily false-positive volume at 1,000 executive calls/day and recommend whether to deploy.
The Codex-generated pipeline includes the Sara1708/deepfake-audio-wav2vec2 model as a candidate. When we built this module, that model was not findable on Hugging Face by direct ID search. Walk through what the pipeline’s fallback logic does when this candidate fails to load — and why “fail closed” would be the wrong design here.
Ferrari’s executive defeated a deepfake by asking a book-recommendation question (Module 2.1). What category of detection is this — artifact-level, workflow-level, or social-level? How would you SIEM-detect that the question was asked?

Common mistakes

Mistake	Better approach
Deploying one detector and trusting its threshold	Ensemble multiple detectors; calibrate against your audio environment quarterly
Building only model-based detection (no fallback)	Heuristic fallback ensures pipeline doesn’t fail closed when models can’t load
Trusting vendor-reported accuracy	Validate against your own held-out set in your environment
Treating detector confidence as a decision boundary	One signal among many; pair with workflow-gap detection (Module 2.4)
Static threshold at 0.7 forever	Score distribution drifts; recalibrate as voice-clone tech improves

What’s next

Module 2.3 covers synthetic video detection — and why it’s structurally harder than audio. Then Module 2.4 introduces the workflow-gap SIEM detection that catches what artifact-level detectors miss.