Module 3.5 — The Guardrails Stack as Detection Telemetry
50-minute lecture · Day 3 afternoon
Learning objectives
By end of this module, students can:
- Identify and choose between the four major guardrail systems for LLM applications in 2026: Llama Guard 3, Prompt Guard 2, NVIDIA NeMo Guardrails, and Microsoft Azure AI Content Safety Prompt Shields
- Recognize the structural shift from “guardrail as silent middleware” to “guardrail as SIEM telemetry source”
- Deploy a working Codex-generated integration that wires Llama Guard 3 and Azure Prompt Shields into the SIEM event stream
- Identify known evasions of each guardrail system and design ensemble layouts that don’t rely on any single one
The architectural shift
Most enterprises that deploy guardrails on their LLM applications treat the guardrail as a content filter — a yes/no gate that allows safe content through and blocks unsafe content. The guardrail is silent middleware; the SOC has no visibility into what it caught or what it let through.
The detection engineer’s reframe: the guardrail is an event source. Every classification decision the guardrail makes is a structured event that can be shipped to the SIEM, correlated with other telemetry, and analyzed in aggregate.
Why this matters:
- Adversary tradecraft is visible. If your guardrail catches 30 prompt-injection attempts per day across all users, that’s an adversary-volume baseline. Spikes are intelligence.
- False-positive analysis becomes possible. When the guardrail blocks legitimate content, the SOC can see it and tune.
- Correlation across systems. A prompt-injection attempt against your customer-facing chatbot, plus a credential-stuffing attempt against the same account, plus an unusual login pattern — three independent low-fidelity signals become one high-fidelity incident.
- Coverage measurement. You can answer “how many LLM interactions did we screen today?” — a question most SOCs cannot currently answer.
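The correlation bullet above can be sketched as a join on account and time window. This is a minimal illustration only: the flat event shape (`account`, `signal`, `ts`), the three signal names, and the 30-minute window are all hypothetical placeholders, not values taken from the course integration or from any particular SIEM.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical signal names; substitute whatever your SIEM actually emits.
REQUIRED_SIGNALS = {"prompt_injection", "credential_stuffing", "login_anomaly"}


def correlate(events: list[dict], window: timedelta = timedelta(minutes=30)) -> list[str]:
    """Return accounts that show every required signal inside one time window."""
    by_account: dict[str, list[dict]] = defaultdict(list)
    for event in events:
        by_account[event["account"]].append(event)

    incidents = []
    for account, account_events in by_account.items():
        account_events.sort(key=lambda e: e["ts"])
        # Treat each event as a candidate window start and collect the signals in range.
        for start in account_events:
            seen = {
                e["signal"]
                for e in account_events
                if start["ts"] <= e["ts"] <= start["ts"] + window
            }
            if REQUIRED_SIGNALS <= seen:
                incidents.append(account)
                break
    return incidents
```

In a real deployment this logic would normally live in the SIEM's own correlation rules; the Python version is only meant to make the "three weak signals, one incident" idea concrete.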
This module covers the four major guardrails and the integration pattern that wires them as SIEM event sources.
The four major guardrails (May 2026)
Llama Guard 3 (Meta, open-weight)
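Llama Guard 3 is a family of Llama models fine-tuned for content-safety classification. The 8B variant is built on Llama-3.1-8B and the 1B variant on Llama-3.2-1B; both are on Hugging Face at meta-llama/Llama-Guard-3-8B and meta-llama/Llama-Guard-3-1B. License: Llama 3.1 Community License for the 8B model and Llama 3.2 Community License for the 1B model (free to use below 700M monthly active users; above that, a separate license from Meta is required).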
Capabilities:
- Classifies content on both LLM inputs (prompt classification) and outputs (response classification)
- Multi-turn conversation evaluation
- Multilingual support (8 languages: English, French, German, Hindi, Italian, Portuguese, Spanish, Thai)
- Tool-use safety classification for code-interpreter and search-tool calls
- Aligned to MLCommons standardized hazards taxonomy
Categories: 14 — the 13 MLCommons hazards (S1-S13: Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Defamation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content, Elections) plus S14 (Code Interpreter Abuse).
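To turn the model's text verdict into the category list a SIEM event needs, a small parser is enough. The sketch below assumes the generated text is either `safe`, or `unsafe` followed by a line of comma-separated category codes, which is the format the model card describes; verify it against the model version you deploy. The function name `parse_llama_guard_output` is hypothetical.

```python
# Category codes as listed in this module (S1-S13 MLCommons hazards plus S14).
CATEGORY_NAMES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}


def parse_llama_guard_output(generated: str) -> dict:
    """Parse the classifier's text output into a verdict and named categories."""
    lines = [line.strip() for line in generated.strip().splitlines() if line.strip()]
    if not lines:
        return {"verdict": "uncertain", "categories": []}
    verdict = lines[0].lower()
    if verdict == "safe":
        return {"verdict": "safe", "categories": []}
    if verdict != "unsafe":
        # Anything unexpected is surfaced as "uncertain" rather than guessed at.
        return {"verdict": "uncertain", "categories": []}
    codes = [c.strip().upper() for c in lines[1].split(",")] if len(lines) > 1 else []
    categories = [f"{c}: {CATEGORY_NAMES[c]}" for c in codes if c in CATEGORY_NAMES]
    return {"verdict": "unsafe", "categories": categories}
```

Mapping unparseable output to `uncertain` keeps malformed generations visible in the telemetry instead of silently counting them as safe.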
Note: Meta also released Llama Guard 4 (12B) in 2025 — heavier, more capable. For SOC deployments where latency matters, Llama Guard 3 8B is the practical default; deploy Llama Guard 4 for high-stakes filtering where additional accuracy justifies the latency cost.
Prompt Guard 2 (Meta, open-weight, DeBERTa-based)
Llama Prompt Guard 2 is a classifier specialized for detecting prompt injections and jailbreaks (in contrast to Llama Guard 3’s broader content-safety scope). Available as meta-llama/Llama-Prompt-Guard-2-86M and meta-llama/Llama-Prompt-Guard-2-22M on Hugging Face. License: Llama 4 Community License (effective April 5, 2025).
Function: Classifies prompts as benign or malicious. Optimized for low latency and adversarial-attack-resistant tokenization. The 86M model supports English + non-English attack patterns; the 22M model is English-only but faster.
Deployment pattern: Use Prompt Guard 2 as a first-line fast classifier (cheap to run on every prompt). Route prompts flagged as malicious to a slower, more capable model (or to human review). Llama Guard 3 then runs on the output to classify the response.
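The first-line routing decision can be sketched as a threshold on the fast classifier's score. In this sketch `fast_score` is a hypothetical stand-in for a Prompt Guard 2 call that returns a probability of "malicious"; the 0.5 threshold is an arbitrary placeholder to be tuned on your own traffic.

```python
from typing import Callable


def route_prompt(
    prompt: str,
    fast_score: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Return 'llm' for prompts scored as benign, 'review' for everything else."""
    try:
        score = fast_score(prompt)
    except Exception:
        # Fail closed: a classifier outage should not silently open the gate.
        return "review"
    return "review" if score >= threshold else "llm"
```

The `review` branch is where the slower model or the human queue described above would sit.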
NVIDIA NeMo Guardrails
NeMo Guardrails is NVIDIA’s open-source framework for adding programmable rails to LLM applications. The current version uses Colang 2.0 — an event-driven syntax with parallel execution (IORails), asynchronous actions, and Python-style interop. Available at github.com/NVIDIA-NeMo/Guardrails.
How it differs from Llama Guard 3 and Prompt Guard 2: those are classifier models, whereas NeMo Guardrails is a flow-control framework. You write rails that constrain the LLM’s behavior at the conversation level (topical rails, dialogue rails, tool-call rails). The two are often combined: Llama Guard 3 classifies content, NeMo Guardrails enforces flow.
Production use: Strong for agentic systems where you want to constrain which tools the agent can call, in what order, under what conditions — exactly the LLM06 (Excessive Agency) risk from Module 3.3.
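The kind of constraint a tool-call rail expresses (which tools, in what order) can be illustrated in plain Python. This is not Colang and not the NeMo Guardrails API; the tool names and the prerequisite table are hypothetical, chosen only to show the shape of the policy.

```python
# Hypothetical policy: which tools the agent may call, and what must come first.
ALLOWED_TOOLS = {"search_kb", "read_ticket", "send_email"}
PREREQUISITES = {"send_email": {"read_ticket"}}  # send_email only after read_ticket


def check_tool_call(tool: str, history: list[str]) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call given the prior calls."""
    if tool not in ALLOWED_TOOLS:
        return False, f"tool '{tool}' is not on the allow-list"
    missing = PREREQUISITES.get(tool, set()) - set(history)
    if missing:
        return False, f"tool '{tool}' requires prior call(s): {sorted(missing)}"
    return True, "ok"
```

Each `(allowed, reason)` pair is itself a candidate SIEM event, which is what makes the rail a telemetry source rather than a silent gate.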
Microsoft Azure AI Content Safety — Prompt Shields
Azure’s cloud-hosted prompt-injection-detection service. Two modes:
- Direct Attack detection: classic jailbreak detection — content explicitly attempting to override system prompts
- Indirect Attack (XPIA) detection: Cross-Prompt Injection Attack detection — detecting injection content arriving through RAG, document upload, or other indirect channels (this is the EchoLeak class)
API pattern: {endpoint}/contentsafety/text:shieldPrompt?api-version=2024-02-15-preview (verify current version at delivery; the API version moves).
Production use: Best fit for orgs already on Azure, especially those running Azure OpenAI Service. Comes with FedRAMP High authorization (Day 1 Module 1.2 covered this), enabling US government use.
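A sketch of the request and response handling for the API pattern above, with no network call so it can be unit-tested. The JSON field names (`userPrompt`, `documents`, `userPromptAnalysis`, `documentsAnalysis`, `attackDetected`) and the `Ocp-Apim-Subscription-Key` header reflect the Azure documentation as best understood here; treat them as assumptions and confirm against the API version you target. The helper names are hypothetical.

```python
import json


def build_shield_request(
    endpoint: str,
    api_key: str,
    api_version: str,
    user_prompt: str,
    documents: list[str],
) -> dict:
    """Assemble URL, headers, and JSON body for a shieldPrompt call (not sent here)."""
    url = f"{endpoint.rstrip('/')}/contentsafety/text:shieldPrompt?api-version={api_version}"
    return {
        "url": url,
        "headers": {
            "Ocp-Apim-Subscription-Key": api_key,
            "Content-Type": "application/json",
        },
        "body": json.dumps({"userPrompt": user_prompt, "documents": documents}),
    }


def parse_shield_response(payload: dict) -> dict:
    """Reduce the response to a direct-attack flag and indices of flagged documents."""
    direct = bool(payload.get("userPromptAnalysis", {}).get("attackDetected", False))
    flagged_docs = [
        index
        for index, doc in enumerate(payload.get("documentsAnalysis", []))
        if doc.get("attackDetected", False)
    ]
    return {"direct_attack": direct, "indirect_attack_docs": flagged_docs}
```

Passing `api_version` in from configuration, rather than hard-coding it, fits the note above that the version moves.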
The Codex-generated integration pattern
The integration sketch at .boss-pattern-work/day3/guardrails_telemetry.py (468 lines) demonstrates the architectural pattern: wrap each guardrail into a uniform classify_input(text) function, then emit structured SIEM events on every classification.
The uniform classification interface
from typing import Literal


def classify_input(text: str, classifier: Literal["llama_guard_3", "prompt_shields"]) -> dict:
    """Classify an input text via the named guardrail.

    Returns:
        {
            "classifier": "llama_guard_3" or "prompt_shields",
            "verdict": "safe" | "unsafe" | "uncertain",
            "categories": list[str],  # e.g., ["S6_specialized_advice"]
            "raw": dict,              # full underlying response
        }
    """
    ...  # backend dispatch elided in this excerpt
Behind this interface:
- Llama Guard 3 backend: loads the model via transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Llama-Guard-3-8B") and invokes the model’s standard prompt template
- Azure Prompt Shields backend: calls POST {endpoint}/contentsafety/text:shieldPrompt with API key authentication
Both backends include graceful fallback when their dependencies (model weights, API key) aren’t available — the function returns a "verdict": "uncertain" classification rather than failing.
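The fallback behavior can be sketched as a thin wrapper around any backend. This is an illustration of the idea, not the code in the integration file; `classify_with_fallback` and the `backend` callable are hypothetical names.

```python
from typing import Callable


def classify_with_fallback(
    text: str,
    classifier: str,
    backend: Callable[[str], dict],
) -> dict:
    """Call a guardrail backend; on any failure, report 'uncertain' instead of raising."""
    try:
        result = backend(text)
        return {
            "classifier": classifier,
            "verdict": result["verdict"],
            "categories": result.get("categories", []),
            "raw": result,
        }
    except Exception as exc:
        # Missing weights, missing API key, timeouts: all surface as "uncertain".
        return {
            "classifier": classifier,
            "verdict": "uncertain",
            "categories": [],
            "raw": {"error": type(exc).__name__},
        }
```

One design point worth making explicit: `uncertain` keeps the application running, but a gate that treats `uncertain` like `safe` fails open. Whether to block or allow on `uncertain` is a separate policy decision, and the volume of `uncertain` events is itself worth alerting on.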
The SIEM event emitter
Every classification call also emits a structured event:
import json
from datetime import datetime, timezone


def emit_siem_event(classification: dict, source_metadata: dict) -> None:
    """Emit a structured log event suitable for SIEM ingestion."""
    event = {
        # Timezone-aware UTC; datetime.utcnow() is deprecated as of Python 3.12.
        "@timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "event.kind": "alert" if classification["verdict"] == "unsafe" else "event",
        # "llm_security" is a custom value, not one of the ECS-defined categories.
        "event.category": ["intrusion_detection", "llm_security"],
        "event.action": "prompt_classification",
        "rule.name": classification["classifier"],
        "llm.classifier.verdict": classification["verdict"],
        "llm.classifier.categories": classification["categories"],
        "source.application": source_metadata.get("application"),
        "source.user_id": source_metadata.get("user_id"),
        "source.session_id": source_metadata.get("session_id"),
        # ... additional ECS-style fields
    }
    print(json.dumps(event))  # in production, ship to your SIEM
The resulting log stream gives the SOC:
- A per-classification event with timestamps, application, user, session
- Verdict (safe/unsafe/uncertain) and categories matched
- The classifier that made the decision
Aggregation patterns at the SIEM
Once the events flow to the SIEM, the detection engineer builds correlations:
- Spike detection: unusual increase in unsafe classifications from a specific user, application, or session
- Category drill-down: which categories of attack are growing? (S1 violent crimes vs S6 specialized advice vs S14 code-interpreter abuse; different categories suggest different attacker profiles)
- Cross-system correlation: prompt-injection attempts against the customer chatbot + login anomalies on the same user account + outbound traffic anomalies = high-confidence incident
- Coverage metric: total LLM interactions classified per day; this is the denominator for security-effectiveness reporting
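Spike detection and the coverage denominator from the list above can be sketched with a simple z-score over daily counts. The 7-day minimum history and the threshold of 3 standard deviations are assumptions for illustration; a production rule would more likely be written in the SIEM's query language and tuned per application.

```python
from statistics import mean, pstdev


def is_spike(daily_unsafe_counts: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's unsafe count if it sits far above the historical baseline."""
    if len(daily_unsafe_counts) < 7:
        return False  # not enough history to call anything a spike
    baseline = mean(daily_unsafe_counts)
    spread = pstdev(daily_unsafe_counts)
    if spread == 0:
        return today > baseline
    return (today - baseline) / spread >= z_threshold


def unsafe_rate(unsafe: int, total_classified: int) -> float:
    """Unsafe verdicts as a fraction of everything screened (the coverage denominator)."""
    return unsafe / total_classified if total_classified else 0.0
```

Reporting the rate alongside the raw count matters: a doubling of unsafe verdicts means something different when total screened traffic also doubled.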
Known evasions and how to design around them
No single guardrail catches everything. Documented evasions through 2025-2026:
Multilingual evasion (Llama Guard 3, Prompt Guard 2)
Classifier training data is dominated by English. Adversaries craft attacks in low-resource languages (Zulu, Turkish, Vietnamese) that classifiers haven’t seen at scale. Mitigation: ensemble Llama Guard 3 + Prompt Guard 2 (86M, which has broader language coverage) + a translation pre-pass that normalizes to English before classification.
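The ensemble-plus-translation mitigation can be sketched as follows. Every classifier sees both the original text and its English normalization, and a single `unsafe` vote is enough to flag. The classifier and translator callables are hypothetical stand-ins; the sketch does not assume any particular translation service.

```python
from typing import Callable

Classifier = Callable[[str], str]  # returns "safe" or "unsafe"


def ensemble_verdict(
    text: str,
    classifiers: dict[str, Classifier],
    translate_to_english: Callable[[str], str],
) -> dict:
    """Run every classifier on the original and the normalized text; any hit flags."""
    candidates = [text]
    normalized = translate_to_english(text)
    if normalized != text:
        candidates.append(normalized)

    flagged_by = sorted(
        {
            name
            for name, classify in classifiers.items()
            for candidate in candidates
            if classify(candidate) == "unsafe"
        }
    )
    return {"verdict": "unsafe" if flagged_by else "safe", "flagged_by": flagged_by}
```

Classifying the original as well as the translation is deliberate: translation can itself strip or soften an attack, so the normalized text supplements the original rather than replacing it.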
AgentPoison (Azure Prompt Shields, RAG poisoning)
Adversarial documents seeded in RAG databases that, when retrieved, force agentic tool-calling into exfiltration paths. Mitigation: ingestion-time provenance checks, instruction-stripping on retrieved content (treat retrieval as data, not instructions), canary tokens in the corpus.
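Two of these mitigations, instruction-stripping and canary tokens, can be sketched in a few lines. The regex patterns and the canary string are hypothetical and deliberately simplistic; pattern matching like this is easy to evade and is shown only to make the "retrieval is data, not instructions" idea concrete.

```python
import re

# Hypothetical canary; real canaries should be unguessable and unique per corpus.
CANARY = "CANARY-7f3a9c"

# Illustrative patterns only; a real deployment needs far more than regexes.
INSTRUCTION_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"ignore (all|any|previous|prior) .*instructions",
        r"you (must|should) now",
        r"system prompt",
    )
]


def strip_instructions(chunk: str) -> str:
    """Drop sentences in a retrieved chunk that look like instructions to the model."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    kept = [s for s in sentences if not any(p.search(s) for p in INSTRUCTION_PATTERNS)]
    return " ".join(kept)


def leaked_canary(outbound_text: str) -> bool:
    """True if a canary planted in the corpus shows up in outbound content."""
    return CANARY in outbound_text
```

A canary hit on outbound traffic is a high-fidelity event: benign use has no reason to reproduce the token, so it is worth shipping to the SIEM as an alert rather than an informational event.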
AdvJudge-Zero (NeMo Guardrails, LLM-as-Judge bypass)
Logic-based fuzzing that exploits decision-making logic in LLM-as-Judge components of NeMo’s Colang flows. Mitigation: combine LLM-as-Judge with deterministic verifiers; never rely on LLM judgment alone for high-stakes decisions.
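Pairing the judge with a deterministic verifier can be sketched as a logical AND in which the hard-coded policy runs first. The refund scenario, the limit, and the action names are hypothetical; `llm_judge` stands in for whatever LLM-as-Judge call the application makes.

```python
from typing import Callable

# Hypothetical policy for a refund-issuing agent.
MAX_REFUND = 100.00
ALLOWED_ACTIONS = {"issue_refund", "send_receipt"}


def deterministic_ok(action: dict) -> bool:
    """Hard policy checks that no prompt can argue its way around."""
    if action.get("type") not in ALLOWED_ACTIONS:
        return False
    if action.get("type") == "issue_refund" and action.get("amount", 0) > MAX_REFUND:
        return False
    return True


def approve(action: dict, llm_judge: Callable[[dict], bool]) -> bool:
    """Approve only when the deterministic check and the LLM judge both agree."""
    return deterministic_ok(action) and bool(llm_judge(action))
```

Even a judge that has been manipulated into always answering yes cannot approve an out-of-policy action here, which is the property the mitigation is after.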
The general principle: layered defense. Use Prompt Guard 2 as the fast first pass, Llama Guard 3 as the content-safety classifier, NeMo Guardrails as flow control, and Azure Prompt Shields for XPIA detection on RAG inputs. No single one catches everything; because the layers fail on different inputs, the ensemble should have a materially lower miss rate than any one of them, provided their blind spots do not fully overlap.
Architectural patterns for production deployments
Pattern 1: Pre-LLM and post-LLM classification
User prompt → Prompt Guard 2 (fast, fails closed)
↓ benign
LLM
↓ response
Llama Guard 3 (slower, content safety)
↓ safe
Deliver to user
Each classifier call emits a SIEM event. Running both filters in series means every interaction is screened twice, once on the way in and once on the way out.
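Pattern 1 can be sketched as a single function with the classifiers, the LLM, and the event emitter injected as callables. All names here are hypothetical, and the verdict vocabulary (`safe`, `unsafe`, `uncertain`) follows the uniform interface from earlier in this module.

```python
from typing import Callable

REFUSAL = "Request blocked by policy."  # hypothetical user-facing message


def guarded_completion(
    prompt: str,
    input_check: Callable[[str], str],
    llm: Callable[[str], str],
    output_check: Callable[[str], str],
    emit: Callable[[dict], None],
) -> str:
    """Screen the prompt, call the LLM, screen the response; emit one event per check."""
    try:
        verdict_in = input_check(prompt)
    except Exception:
        verdict_in = "uncertain"
    emit({"stage": "input", "verdict": verdict_in})
    if verdict_in != "safe":
        return REFUSAL  # fail closed: "unsafe" and "uncertain" both stop here

    response = llm(prompt)
    try:
        verdict_out = output_check(response)
    except Exception:
        verdict_out = "uncertain"
    emit({"stage": "output", "verdict": verdict_out})
    return response if verdict_out == "safe" else REFUSAL
```

Note that a blocked prompt produces exactly one event and never reaches the LLM, so the input-stage event count is the coverage denominator and the output-stage count tells you how many requests were actually served.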
Pattern 2: RAG-augmented LLM with XPIA defense
User query → LLM
↓ retrieval query
Vector DB
↓ retrieved chunks
Azure Prompt Shields (XPIA mode)
↓ chunks free of indirect injection
LLM generates response
↓
Llama Guard 3 (output filter)
↓
User
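Critical: the XPIA-mode check runs on retrieved content, not just user input. This is the control aimed at the EchoLeak class of attack. It is not a guarantee: the published EchoLeak research reported bypassing Microsoft’s XPIA classifiers, so treat this check as one layer among several.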
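The retrieval-side screen in Pattern 2 can be sketched as a filter over chunks, with one event per chunk. `detect_injection` is a hypothetical stand-in for an indirect-attack check such as a Prompt Shields call; the event field names are illustrative.

```python
from typing import Callable


def screen_retrieved_chunks(
    chunks: list[str],
    detect_injection: Callable[[str], bool],
    emit: Callable[[dict], None],
) -> list[str]:
    """Return only chunks with no detected indirect injection; emit an event per chunk."""
    clean = []
    for index, chunk in enumerate(chunks):
        try:
            flagged = bool(detect_injection(chunk))
        except Exception:
            flagged = True  # fail closed: an unscreened chunk is not passed to the LLM
        emit({"event.action": "xpia_screen", "chunk_index": index, "flagged": flagged})
        if not flagged:
            clean.append(chunk)
    return clean
```

Dropping a flagged chunk rather than rejecting the whole request keeps the application usable while still recording which document in the corpus triggered the detection.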
Pattern 3: Agentic system with full guardrail stack
User goal → LLM (planner)
↓ proposed actions
NeMo Guardrails (Colang flow rails)
↓ approved actions only
Tool execution
↓ tool outputs
Llama Guard 3 (output filter)
↓
LLM (next planning step)
↓
(loop until done OR HITL gate)
Day 4 covers agentic detection in depth.
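As a preview, the Pattern 3 loop can be sketched with every component injected as a callable. The status strings, the step limit, and all parameter names are hypothetical; the point is only to show where the rail, the output filter, and the HITL gate sit relative to tool execution.

```python
from typing import Any, Callable, Optional


def run_agent(
    plan_next: Callable[[list], Optional[Any]],
    rail_check: Callable[[Any, list], bool],
    execute: Callable[[Any], str],
    output_check: Callable[[str], str],
    needs_human: Callable[[Any], bool],
    max_steps: int = 5,
) -> dict:
    """Plan, gate, execute, and filter in a loop until done, blocked, or handed off."""
    history: list = []
    for _ in range(max_steps):
        action = plan_next(history)
        if action is None:
            return {"status": "done", "history": history}
        if needs_human(action):
            # HITL gate: stop before executing and hand the pending action to a person.
            return {"status": "hitl", "pending": action, "history": history}
        if not rail_check(action, history):
            return {"status": "blocked", "pending": action, "history": history}
        output = execute(action)
        if output_check(output) != "safe":
            return {"status": "blocked_output", "pending": action, "history": history}
        history.append({"action": action, "output": output})
    return {"status": "step_limit", "history": history}
```

Every return path corresponds to a distinct event type for the SIEM, and the step limit bounds how long a manipulated planner can keep the loop running.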
Discussion questions (~10 min)
- Your org runs Microsoft 365 Copilot internally. Azure Prompt Shields is available, but adding it to the Copilot processing path is a Microsoft-side change you don’t control. What complementary controls can the SOC deploy that don’t require Microsoft to add Prompt Shields?
- Llama Guard 3 emits a classification on every prompt. At 1M prompts/day across an enterprise’s LLM stack, that’s 1M SIEM events daily — a non-trivial logging volume. What sampling or aggregation strategy makes this volume manageable while preserving incident-investigation utility?
- The multilingual-evasion attack against classifier guardrails is well-known. Should an English-only org deploy a multilingual classifier (slower, slightly less accurate on English) or an English-only one (faster, more accurate on English, broken on non-English attacks)? Frame the tradeoff for your CISO.
Common mistakes
| Mistake | Better approach |
|---|---|
| Treating guardrails as silent middleware | Emit a SIEM event on every classification — your guardrail is your highest-fidelity LLM-activity log |
| Deploying one guardrail and considering the problem solved | Multilayer: Prompt Guard 2 + Llama Guard 3 + NeMo + Azure Prompt Shields each have different gaps |
| Building rails only for direct attacks | Indirect/XPIA attacks (the EchoLeak class) require explicit detection of injection in retrieved content |
| Allow-listing the LLM’s outputs without inspection | Llama Guard 3 on outputs is the controllable defense against the LLM saying something it shouldn’t |
| Storing classification logs locally without SIEM integration | The aggregate is the value; ship to SIEM and correlate across all LLM-touching services |
What’s next
Module 3.6 closes Day 3 with Simon Willison’s Lethal Trifecta framing — an architectural lens for auditing every LLM-touching system in your org against three structural properties. EchoLeak, GTG-1002, ForcedLeak, and the broader pattern all manifest the same three properties simultaneously.