Module 3.5 — The Guardrails Stack as Detection Telemetry

50-minute lecture · Day 3 afternoon

Learning objectives

By the end of this module, students can:

Identify and choose between the four major guardrail systems for LLM applications in 2026: Llama Guard 3, Prompt Guard 2, NVIDIA NeMo Guardrails, and Microsoft Azure AI Content Safety Prompt Shields
Recognize the structural shift from “guardrail as silent middleware” to “guardrail as SIEM telemetry source”
Deploy a working Codex-generated integration that wires Llama Guard 3 and Azure Prompt Shields into the SIEM event stream
Identify known evasions of each guardrail system and design ensemble layouts that don’t rely on any single one

The architectural shift

Most enterprises that deploy guardrails on their LLM applications treat the guardrail as a content filter — a yes/no gate that allows safe content through and blocks unsafe content. The guardrail is silent middleware; the SOC has no visibility into what it caught or what it let through.

The detection engineer’s reframe: the guardrail is an event source. Every classification decision the guardrail makes is a structured event that can be shipped to the SIEM, correlated with other telemetry, and analyzed in aggregate.

Why this matters:

Adversary tradecraft is visible. If your guardrail catches 30 prompt-injection attempts per day across all users, that’s an adversary-volume baseline. Spikes are intelligence.
False-positive analysis becomes possible. When the guardrail blocks legitimate content, the SOC can see it and tune.
Correlation across systems. A prompt-injection attempt against your customer-facing chatbot, plus a credential-stuffing attempt against the same account, plus an unusual login pattern — three independent low-fidelity signals become one high-fidelity incident.
Coverage measurement. You can answer “how many LLM interactions did we screen today?” — a question most SOCs cannot currently answer.

This module covers the four major guardrails and the integration pattern that wires them as SIEM event sources.

The four major guardrails (May 2026)

Llama Guard 3 (Meta, open-weight)

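Llama Guard 3 is a family of Llama models fine-tuned for content-safety classification. The 8B variant is built on Llama-3.1-8B; the 1B variant is built on Llama-3.2-1B. Both are on Hugging Face at meta-llama/Llama-Guard-3-8B and meta-llama/Llama-Guard-3-1B. License: Llama 3.1 Community License for the 8B model and Llama 3.2 Community License for the 1B model (both free for under 700M MAU).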

Capabilities:

Classifies content on both LLM inputs (prompt classification) and outputs (response classification)
Multi-turn conversation evaluation
Multilingual support (8 languages: English plus Spanish, French, German, Hindi, Italian, Portuguese, and Thai)
Tool-use safety classification for code-interpreter and search-tool calls
Aligned to MLCommons standardized hazards taxonomy

Categories: 14 — the 13 MLCommons hazards (S1-S13: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Defamation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content, Elections) plus S14 (Code Interpreter Abuse).
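
The category codes above are what a downstream parser has to turn into SIEM-friendly labels. The following is a minimal sketch, assuming the model's generated text is either "safe" or "unsafe" followed by a line of comma-separated codes (check the model card for the exact format of the version you deploy). The function name and the snake_case labels are illustrative, not part of any Meta API.

```python
# Codes S1-S14 follow the category list above; the snake_case labels are
# this sketch's own naming, chosen to match the "S6_specialized_advice" style.
HAZARD_CATEGORIES = {
    "S1": "violent_crimes",
    "S2": "non_violent_crimes",
    "S3": "sex_related_crimes",
    "S4": "child_sexual_exploitation",
    "S5": "defamation",
    "S6": "specialized_advice",
    "S7": "privacy",
    "S8": "intellectual_property",
    "S9": "indiscriminate_weapons",
    "S10": "hate",
    "S11": "suicide_and_self_harm",
    "S12": "sexual_content",
    "S13": "elections",
    "S14": "code_interpreter_abuse",
}


def parse_llama_guard_output(generated: str) -> dict:
    """Convert raw generated text into a verdict and a list of labelled categories."""
    lines = [line.strip() for line in generated.splitlines() if line.strip()]
    if not lines or lines[0].lower() not in ("safe", "unsafe"):
        # Anything unexpected is reported as uncertain rather than guessed at.
        return {"verdict": "uncertain", "categories": []}
    if lines[0].lower() == "safe":
        return {"verdict": "safe", "categories": []}
    # Assumed format: the line(s) after "unsafe" hold comma-separated codes.
    codes = [code.strip() for code in ",".join(lines[1:]).split(",") if code.strip()]
    categories = [
        f"{code}_{HAZARD_CATEGORIES[code]}" if code in HAZARD_CATEGORIES else code
        for code in codes
    ]
    return {"verdict": "unsafe", "categories": categories}
```

Unknown codes are passed through unchanged so that a taxonomy update shows up in the SIEM as a new value instead of being dropped.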

Note: Meta also released Llama Guard 4 (12B) in 2025 — heavier, more capable. For SOC deployments where latency matters, Llama Guard 3 8B is the practical default; deploy Llama Guard 4 for high-stakes filtering where additional accuracy justifies the latency cost.

Prompt Guard 2 (Meta, open-weight, DeBERTa-based)

Llama Prompt Guard 2 is a specialized classifier specifically for detecting prompt injections and jailbreaks (in contrast to Llama Guard 3’s broader content-safety scope). Available as meta-llama/Llama-Prompt-Guard-2-86M and meta-llama/Llama-Prompt-Guard-2-22M on Hugging Face. License: Llama 4 Community License (effective April 5, 2025).

Function: Classifies prompts as benign or malicious. Optimized for low latency and adversarial-attack-resistant tokenization. The 86M model supports English + non-English attack patterns; the 22M model is English-only but faster.

Deployment pattern: Use Prompt Guard 2 as a first-line fast classifier (cheap to run on every prompt). Route prompts flagged as malicious to a slower, more capable model (or to human review). Llama Guard 3 then runs on the output to classify the response.
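
The two-tier routing described in the deployment pattern can be sketched as follows. Both classifiers are passed in as plain callables so the sketch stays independent of any model-loading code; `fast_score`, `slow_verdict`, and the 0.5 threshold are assumptions made for illustration, not values from Meta's documentation.

```python
from typing import Callable


def screen_prompt(
    prompt: str,
    fast_score: Callable[[str], float],
    slow_verdict: Callable[[str], str],
    threshold: float = 0.5,
) -> dict:
    """Run the cheap classifier on every prompt; escalate only the flagged ones."""
    score = fast_score(prompt)
    if score < threshold:
        return {"stage": "fast", "verdict": "benign", "score": score}
    # Flagged: pay for the slower, more capable check (or queue for human review).
    return {"stage": "escalated", "verdict": slow_verdict(prompt), "score": score}
```

The point of the design is cost: the expensive check only runs on the small fraction of prompts the fast classifier flags.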

NVIDIA NeMo Guardrails

NeMo Guardrails is NVIDIA’s open-source framework for adding programmable rails to LLM applications. The current version uses Colang 2.0 — an event-driven syntax with parallel execution (IORails), asynchronous actions, and Python-style interop. Available at github.com/NVIDIA-NeMo/Guardrails.

How it differs from Llama Guard 3 and Prompt Guard 2: those two are classifier models, whereas NeMo Guardrails is a flow-control framework. You write rails that constrain the LLM’s behavior at the conversation level (topical rails, dialogue rails, tool-call rails). The two kinds are often combined: Llama Guard 3 classifies content, NeMo Guardrails enforces flow.

Production use: Strong for agentic systems where you want to constrain which tools the agent can call, in what order, under what conditions — exactly the LLM06 (Excessive Agency) risk from Module 3.3.
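
To make "which tools, in what order, under what conditions" concrete, here is a plain-Python illustration of the idea behind a tool-call rail. It is deliberately not NeMo Guardrails or Colang code: the `ToolRail` class and the tool names are hypothetical, and a real deployment would express the same constraint in the framework's own rail syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToolRail:
    """Allow-list of tools plus simple ordering prerequisites."""
    allowed: Set[str]
    # tool name -> tool that must already have run in this session
    requires_prior: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def check(self, tool: str) -> bool:
        """Return True and record the call if the tool is permitted right now."""
        if tool not in self.allowed:
            return False
        prerequisite = self.requires_prior.get(tool)
        if prerequisite is not None and prerequisite not in self.history:
            return False
        self.history.append(tool)
        return True
```

Each rejected call is itself a useful telemetry event, since it records the agent attempting something outside its permitted flow.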

Microsoft Azure AI Content Safety — Prompt Shields

Azure’s cloud-hosted prompt-injection-detection service. Two modes:

Direct Attack detection: classic jailbreak detection — content explicitly attempting to override system prompts
Indirect Attack (XPIA) detection: detects Cross-Prompt Injection Attacks, meaning injected instructions that arrive through RAG, document upload, or other indirect channels (this is the EchoLeak class)

API pattern: {endpoint}/contentsafety/text:shieldPrompt?api-version=2024-02-15-preview (verify current version at delivery; the API version moves).
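
A minimal sketch of building that request and reading the result, using only the standard library. The request fields (`userPrompt`, `documents`), the `Ocp-Apim-Subscription-Key` header, and the response fields (`userPromptAnalysis`, `documentsAnalysis`, `attackDetected`) reflect the service documentation as best understood at the time of writing; treat them as assumptions and confirm them against the current API reference. The helper names are hypothetical, and the sketch builds the request without sending it.

```python
import json
import urllib.request
from typing import List

API_VERSION = "2024-02-15-preview"  # value quoted above; verify the current version


def build_shield_request(endpoint: str, api_key: str, user_prompt: str,
                         documents: List[str]) -> urllib.request.Request:
    """Build (but do not send) a shieldPrompt request."""
    url = f"{endpoint.rstrip('/')}/contentsafety/text:shieldPrompt?api-version={API_VERSION}"
    body = json.dumps({"userPrompt": user_prompt, "documents": documents}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": api_key,  # assumed key-auth header name
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def attack_detected(response_body: dict) -> bool:
    """True if the user prompt or any supplied document was flagged."""
    user_hit = response_body.get("userPromptAnalysis", {}).get("attackDetected", False)
    doc_hits = [d.get("attackDetected", False)
                for d in response_body.get("documentsAnalysis", [])]
    return bool(user_hit or any(doc_hits))
```

Sending the request is a call to `urllib.request.urlopen(request)` followed by `json.load` on the response; retrieved RAG chunks go in `documents`, which is what exercises the indirect-attack mode.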

Production use: Best fit for orgs already on Azure, especially those running Azure OpenAI Service. Comes with FedRAMP High authorization (Day 1 Module 1.2 covered this), enabling US government use.

The Codex-generated integration pattern

The integration sketch at .boss-pattern-work/day3/guardrails_telemetry.py (468 lines) demonstrates the architectural pattern: wrap each guardrail into a uniform classify_input(text) function, then emit structured SIEM events on every classification.

The uniform classification interface

from typing import Literal

def classify_input(text: str, classifier: Literal["llama_guard_3", "prompt_shields"]) -> dict:
    """Classify an input text via the named guardrail.

    Returns a dict of the form:
       {
           "classifier": "llama_guard_3" or "prompt_shields",
           "verdict": "safe" | "unsafe" | "uncertain",
           "categories": list[str],  # e.g., ["S6_specialized_advice"]
           "raw": dict,  # full underlying response
       }
    """

Behind this interface:

Llama Guard 3 backend — loads the model via transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Llama-Guard-3-8B") and invokes the model’s standard prompt-template
Azure Prompt Shields backend — calls POST {endpoint}/contentsafety/text:shieldPrompt with API key authentication

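Both backends fall back gracefully when their dependencies (model weights, API key) aren’t available: the function returns a "verdict": "uncertain" classification rather than failing. Callers that must fail closed, as in Pattern 1 below, need to treat "uncertain" as a block; otherwise a missing dependency silently disables the guardrail.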
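
One way the dispatch-with-fallback behavior could look is sketched below. This is not the contents of guardrails_telemetry.py: the `BACKENDS` registry, `_uncertain`, and `classify_with_fallback` are hypothetical names, and real backends would wrap the model call or the HTTP request.

```python
from typing import Callable, Dict

# Hypothetical registry: backend name -> callable returning {"verdict", "categories", ...}.
BACKENDS: Dict[str, Callable[[str], dict]] = {}


def _uncertain(classifier: str, reason: str) -> dict:
    """Build the fallback classification, keeping the reason for later triage."""
    return {"classifier": classifier, "verdict": "uncertain",
            "categories": [], "raw": {"error": reason}}


def classify_with_fallback(text: str, classifier: str) -> dict:
    """Dispatch to a registered backend; degrade to "uncertain" instead of raising."""
    backend = BACKENDS.get(classifier)
    if backend is None:
        return _uncertain(classifier, "backend not configured")
    try:
        raw = backend(text)
    except Exception as exc:  # missing weights, missing API key, network error, ...
        return _uncertain(classifier, f"{type(exc).__name__}: {exc}")
    return {"classifier": classifier,
            "verdict": raw.get("verdict", "uncertain"),
            "categories": raw.get("categories", []),
            "raw": raw}
```

Keeping the failure reason in `raw` means a sudden run of "uncertain" events in the SIEM can be traced to an outage instead of being mistaken for ambiguous content.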

The SIEM event emitter

Every classification call also emits a structured event:

import json
from datetime import datetime, timezone

def emit_siem_event(classification: dict, source_metadata: dict) -> None:
    """Emit a structured log event suitable for SIEM ingestion."""
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12.
    timestamp = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    event = {
        "@timestamp": timestamp,
        "event.kind": "alert" if classification["verdict"] == "unsafe" else "event",
        # "llm_security" is a custom value, not a standard ECS category.
        "event.category": ["intrusion_detection", "llm_security"],
        "event.action": "prompt_classification",
        "rule.name": classification["classifier"],
        "llm.classifier.verdict": classification["verdict"],
        "llm.classifier.categories": classification["categories"],
        "source.application": source_metadata.get("application"),
        "source.user_id": source_metadata.get("user_id"),
        "source.session_id": source_metadata.get("session_id"),
        # ... additional ECS-style fields
    }
    print(json.dumps(event))  # in production, ship to your SIEM

The resulting log stream gives the SOC:

A per-classification event with timestamps, application, user, session
Verdict (safe/unsafe/uncertain) and categories matched
The classifier that made the decision
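
For shipping, one common approach is to write one JSON object per line and let an existing log forwarder pick the stream up. The sketch below uses only the standard `logging` module; the logger name and helper names are hypothetical, and the choice of forwarder is left to your SIEM.

```python
import json
import logging
from typing import TextIO


def make_event_logger(stream: TextIO) -> logging.Logger:
    """Build a logger that writes exactly one JSON object per line to `stream`."""
    logger = logging.getLogger("guardrail.telemetry")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep events out of the application's root log
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def ship_event(logger: logging.Logger, event: dict) -> None:
    """Serialize with sorted keys so identical events produce identical lines."""
    logger.info(json.dumps(event, sort_keys=True))
```

Pointing `stream` at a file or at standard output is enough for most forwarders; sorted keys make the lines easy to deduplicate and diff.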

Aggregation patterns at the SIEM

Once the events flow to the SIEM, the detection engineer builds correlations:

Spike detection: unusual increase in unsafe classifications from a specific user, application, or session
Category drill-down: which categories of attack are growing? (S1 violent crimes vs S6 specialized advice vs S14 code-interpreter abuse — different attacker profiles)
Cross-system correlation: prompt-injection attempts against the customer chatbot + login anomalies on the same user account + outbound traffic anomalies = high-confidence incident
Coverage metric: total LLM interactions classified per day; this is the denominator for security-effectiveness reporting
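
In practice these correlations live in the SIEM's query language, but the spike-detection logic is easy to show in Python. This is a minimal sketch under stated assumptions: events use the field names from the emitter above, the baseline is a precomputed per-entity daily average, and the multiplier of 3 and minimum count of 5 are arbitrary starting points to tune.

```python
from collections import Counter
from typing import Dict, Iterable, List


def unsafe_counts(events: Iterable[dict], key: str) -> Counter:
    """Count unsafe classifications per value of `key` (e.g. "source.user_id")."""
    return Counter(
        e[key] for e in events
        if e.get("llm.classifier.verdict") == "unsafe" and e.get(key) is not None
    )


def find_spikes(today: Dict[str, int], baseline: Dict[str, float],
                multiplier: float = 3.0, min_count: int = 5) -> List[str]:
    """Flag entities whose unsafe count exceeds `multiplier` times their baseline."""
    return sorted(
        entity for entity, count in today.items()
        if count >= min_count and count > multiplier * baseline.get(entity, 0.0)
    )
```

The coverage metric is simply the total number of events for the day, whatever their verdict. The `min_count` floor stops a user with a baseline of zero from alerting on a single blocked prompt.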

Known evasions and how to design around them

No single guardrail catches everything. Documented evasions through 2025-2026:

Multilingual evasion (Llama Guard 3, Prompt Guard 2)

Classifier training data is dominated by English. Adversaries craft attacks in languages that are underrepresented in that data (for example Zulu, Turkish, or Vietnamese) and that the classifiers haven’t seen at scale. Mitigation: ensemble Llama Guard 3 + Prompt Guard 2 (86M, which has broader language coverage) + a translation pre-pass that normalizes to English before classification.
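
The ensemble-plus-translation mitigation can be sketched as below. The classifiers and the translator are injected as callables, so nothing here assumes a particular translation service; `ensemble_verdict` and the "worst verdict wins" rule are illustrative choices, not a documented algorithm.

```python
from typing import Callable, Iterable


def ensemble_verdict(
    text: str,
    classifiers: Iterable[Callable[[str], str]],
    translate_to_english: Callable[[str], str],
) -> str:
    """Classify the original text and its English translation; worst verdict wins."""
    variants = [text]
    translated = translate_to_english(text)
    if translated != text:
        variants.append(translated)
    # Every classifier sees every variant, so an English-centric model still
    # gets a chance to catch an attack written in another language.
    verdicts = [classify(variant) for classify in classifiers for variant in variants]
    if "unsafe" in verdicts:
        return "unsafe"
    if "uncertain" in verdicts:
        return "uncertain"
    return "safe"
```

Classifying the original alongside the translation matters: translation can itself strip or soften an attack, so the untranslated text should never be discarded.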

AgentPoison (Azure Prompt Shields, RAG poisoning)

The attacker seeds adversarial documents in a RAG database; when retrieved, they steer agentic tool-calling into exfiltration paths. Mitigation: ingestion-time provenance checks, instruction-stripping on retrieved content (treat retrieval as data, not instructions), and canary tokens in the corpus.
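
Two of these mitigations, the provenance check and the canary tokens, are simple enough to sketch directly. The source prefixes and token values below are hypothetical placeholders, and the document shape (a dict with a `source` field) is an assumption about your ingestion pipeline.

```python
from typing import Iterable, List

# Hypothetical values: replace with your own trusted sources and seeded canaries.
TRUSTED_SOURCE_PREFIXES = ("sharepoint://policies/", "wiki://runbooks/")
CANARY_TOKENS = ("canary-7f3a9c", "canary-b21e04")


def admit_to_corpus(document: dict) -> bool:
    """Ingestion-time provenance check: only index documents from trusted sources."""
    return str(document.get("source", "")).startswith(TRUSTED_SOURCE_PREFIXES)


def canary_hits(outbound_text: str,
                canaries: Iterable[str] = CANARY_TOKENS) -> List[str]:
    """Return any seeded canary tokens found in text that is about to leave the system."""
    return [token for token in canaries if token in outbound_text]
```

Run `canary_hits` on tool-call arguments and outbound requests: a canary that only exists inside the corpus has no legitimate reason to appear there, so any hit is a high-fidelity exfiltration signal.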

AdvJudge-Zero (NeMo Guardrails, LLM-as-Judge bypass)

Logic-based fuzzing that exploits decision-making logic in LLM-as-Judge components of NeMo’s Colang flows. Mitigation: combine LLM-as-Judge with deterministic verifiers; never rely on LLM judgment alone for high-stakes decisions.
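
A minimal sketch of pairing an LLM judge with a deterministic verifier follows. The deterministic rule chosen here (every URL in a proposed action must be on a domain allow-list) is only one example of such a check; `ALLOWED_DOMAINS`, `urls_are_allowed`, and `approve_action` are hypothetical names, and the judge is passed in as a callable.

```python
import re
from typing import Callable
from urllib.parse import urlparse

ALLOWED_DOMAINS = ("example.com",)  # hypothetical allow-list
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def urls_are_allowed(action_text: str) -> bool:
    """Deterministic verifier: every URL in the action must be on the allow-list."""
    for url in URL_PATTERN.findall(action_text):
        host = (urlparse(url).hostname or "").lower()
        on_list = any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
        if not on_list:
            return False
    return True


def approve_action(action_text: str, llm_judge: Callable[[str], bool]) -> bool:
    """Approve only if the deterministic check AND the LLM judge both agree."""
    if not urls_are_allowed(action_text):
        return False  # the judge is never consulted, so it cannot be talked round
    return bool(llm_judge(action_text))
```

Because the deterministic check runs first and cannot be argued with, fuzzing the judge's reasoning gains the attacker nothing for actions that fail it.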

The general principle: layered defense. Use Prompt Guard 2 as the fast first pass, Llama Guard 3 as the content-safety classifier, NeMo Guardrails as flow control, and Azure Prompt Shields for XPIA detection on RAG inputs. No single layer catches everything; because the layers fail in different ways, an attack has to evade all of them at once, which gives the ensemble a materially lower miss rate.

Architectural patterns for production deployments

Pattern 1: Pre-LLM and post-LLM classification

User prompt → Prompt Guard 2 (fast, fails closed)
              ↓ benign
            LLM
              ↓ response
            Llama Guard 3 (slower, content safety)
              ↓ safe
            Deliver to user

Each classifier call emits a SIEM event. Running both filters in series screens the input side and the output side of every interaction.
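
Pattern 1 can be sketched as a single function. All four components are injected as callables, and the verdict strings ("benign", "safe") are assumptions chosen to match the diagram. Note how "fails closed" is implemented: anything other than an explicit pass, including "uncertain", stops the request.

```python
from typing import Callable, Optional


def guarded_completion(
    prompt: str,
    input_guard: Callable[[str], str],
    llm: Callable[[str], str],
    output_guard: Callable[[str, str], str],
    emit: Callable[[dict], None],
) -> Optional[str]:
    """Return the LLM response only if both guards pass; otherwise return None."""
    verdict_in = input_guard(prompt)
    emit({"stage": "input", "verdict": verdict_in})
    if verdict_in != "benign":  # fail closed: "malicious" and "uncertain" both block
        return None
    response = llm(prompt)
    verdict_out = output_guard(prompt, response)
    emit({"stage": "output", "verdict": verdict_out})
    return response if verdict_out == "safe" else None
```

The LLM is never called for a blocked prompt, and an event is emitted at each stage whether or not the request proceeds.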

Pattern 2: RAG-augmented LLM with XPIA defense

User query → LLM
              ↓ retrieval query
            Vector DB
              ↓ retrieved chunks
            Azure Prompt Shields (XPIA mode)
              ↓ chunks free of indirect injection
            LLM generates response
              ↓
            Llama Guard 3 (output filter)
              ↓
            User

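Critical: the XPIA-mode check is on retrieved content, not just user input. This is the control aimed at the EchoLeak class of attack. It is not sufficient on its own: the published EchoLeak research reported bypassing Microsoft’s XPIA classifier, so keep the output filter and the other layers in place.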
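
The retrieval-side check in Pattern 2 amounts to screening each chunk before it reaches the prompt. A minimal sketch, with the detector injected as a callable (in production it would wrap the Prompt Shields call) and with hypothetical event field values:

```python
from typing import Callable, List


def screen_retrieved_chunks(
    chunks: List[str],
    is_injection: Callable[[str], bool],
    emit: Callable[[dict], None],
) -> List[str]:
    """Drop retrieved chunks flagged as indirect injection; emit one event per chunk."""
    clean = []
    for index, chunk in enumerate(chunks):
        flagged = bool(is_injection(chunk))
        emit({"event.action": "xpia_check", "chunk_index": index, "flagged": flagged})
        if not flagged:
            clean.append(chunk)
    return clean
```

Emitting an event for clean chunks as well as flagged ones is what makes the coverage metric meaningful for RAG traffic.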

Pattern 3: Agentic system with full guardrail stack

User goal → LLM (planner)
              ↓ proposed actions
            NeMo Guardrails (Colang flow rails)
              ↓ approved actions only
            Tool execution
              ↓ tool outputs
            Llama Guard 3 (output filter)
              ↓
            LLM (next planning step)
              ↓
            (loop until done OR HITL gate)
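
The loop in Pattern 3 can be sketched as follows. The planner, rail, tool runner, and output filter are all injected callables, and the step budget and the "hitl" status (hand-off to a human-in-the-loop gate) are assumptions made for illustration.

```python
from typing import Callable, List, Optional


def run_agent(
    goal: str,
    plan_next: Callable[[List[str]], Optional[str]],
    rail_approves: Callable[[str], bool],
    run_tool: Callable[[str], str],
    output_is_safe: Callable[[str], bool],
    max_steps: int = 5,
) -> dict:
    """Plan, check, execute, filter; stop when done or hand off to a human."""
    context = [goal]
    for step in range(max_steps):
        action = plan_next(context)
        if action is None:  # planner reports the goal is complete
            return {"status": "done", "steps": step}
        if not rail_approves(action):
            return {"status": "hitl", "steps": step, "reason": f"rail rejected: {action}"}
        output = run_tool(action)
        if not output_is_safe(output):
            # Unsafe tool output never reaches the planner's context.
            return {"status": "hitl", "steps": step, "reason": "unsafe tool output"}
        context.append(output)
    return {"status": "hitl", "steps": max_steps, "reason": "step budget exhausted"}
```

The step budget is a deliberate backstop: an agent that keeps proposing approved actions without finishing still ends at the human gate.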

Day 4 covers agentic detection in depth.

Discussion questions (~10 min)

Your org runs Microsoft 365 Copilot internally. Azure Prompt Shields is available, but adding it to the Copilot processing path is a Microsoft-side change you don’t control. What complementary controls can the SOC deploy that don’t require Microsoft to add Prompt Shields?
Llama Guard 3 emits a classification on every prompt. At 1M prompts/day across an enterprise’s LLM stack, that’s 1M SIEM events daily — a non-trivial logging volume. What sampling or aggregation strategy makes this volume manageable while preserving incident-investigation utility?
The multilingual-evasion attack against classifier guardrails is well-known. Should an English-only org deploy a multilingual classifier (slower, slightly less accurate on English) or an English-only one (faster, more accurate on English, broken on non-English attacks)? Frame the tradeoff for your CISO.

Common mistakes

Mistake	Better approach
Treating guardrails as silent middleware	Emit a SIEM event on every classification — your guardrail is your highest-fidelity LLM-activity log
Deploying one guardrail and considering the problem solved	Multilayer: Prompt Guard 2 + Llama Guard 3 + NeMo + Azure Prompt Shields each have different gaps
Building rails only for direct attacks	Indirect/XPIA attacks (the EchoLeak class) require explicit detection of injection in retrieved content
Passing the LLM’s outputs through without inspection	Running Llama Guard 3 on outputs is the defense you control against the LLM saying something it shouldn’t
Storing classification logs locally without SIEM integration	The aggregate is the value; ship to SIEM and correlate across all LLM-touching services

What’s next

Module 3.6 closes Day 3 with Simon Willison’s Lethal Trifecta framing — an architectural lens for auditing every LLM-touching system in your org against three structural properties. EchoLeak, GTG-1002, ForcedLeak, and the broader pattern all manifest the same three properties simultaneously.