Module 4.3 — Hardening Your Own Agents
50-minute lecture · Day 4 afternoon · Hands-on Python in the lab
Learning objectives
By end of this module, students can:
- Name and apply the five agent design patterns from Anthropic’s Building Effective Agents (Dec 2024): prompt chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer
- Use LangGraph HITL primitives (interrupt(), Command(resume=...), checkpointing) to gate cross-domain or destructive agent actions
- Apply the action-criticality matrix to decide which actions require auto-execution, audit-only, HITL approval, or dual-control
- Deploy a working multi-agent SOC workflow (Codex-generated) that triages alerts, enriches them in parallel, and proposes containment with explicit HITL gates and audit-log emission
The defender’s reframe
Day 4 has covered adversaries running their own agents. The flip side: your own SOC may run agents too — and those agents need hardening against the same role-play exploit, the same lethal trifecta, the same indirect-prompt-injection vectors that adversaries use against external systems.
The detection engineer’s job in Module 4.3 is architectural defense of the org’s own agents before they become the lateral-movement vehicle for an adversary. The five canonical agent patterns from Anthropic’s Building Effective Agents (Dec 2024) frame how to think about agent design.
The five agent patterns (Anthropic, Building Effective Agents)
Each pattern has different risk and detection implications. Walk through them before designing the SOC’s own agent workflow.
Pattern 1: Prompt Chaining
Definition: Decompose a complex task into a sequence of smaller LLM calls, where each step’s output feeds the next.
Example: Generate a technical document by (a) creating an outline, (b) drafting sections, (c) polishing the final text. Three sequential LLM calls.
SOC application: Alert triage chain — (a) classify alert severity, (b) enrich with relevant context, (c) propose response. Three sequential LLM calls.
Risk: Errors compound across the chain. If step (a) misclassifies, step (b) and step (c) build on the wrong foundation. Mitigation: validate each step’s output before passing to the next; surface the chain’s intermediate states to the analyst.
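The validate-between-steps mitigation can be sketched as follows. This is a minimal illustration, not the lab implementation: `call_llm` is a hypothetical stub that returns canned text so the example runs offline, and the severity label set is an assumption.

```python
from typing import Callable, Dict

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # assumed label set


def call_llm(prompt: str) -> str:
    """Hypothetical LLM stub: returns canned text so the sketch runs offline."""
    if prompt.startswith("classify"):
        return "high"
    if prompt.startswith("enrich"):
        return "sender domain registered 2 days ago"
    return "quarantine message; notify recipient"


def run_chain(alert: str, llm: Callable[[str], str] = call_llm) -> Dict[str, str]:
    """Run classify -> enrich -> propose, validating each output before the next step."""
    trace = {"alert": alert}  # intermediate states, kept so an analyst can inspect them

    severity = llm(f"classify severity: {alert}").strip().lower()
    if severity not in ALLOWED_SEVERITIES:
        # Stop early instead of letting later steps build on a bad label.
        raise ValueError(f"step (a) returned unexpected severity: {severity!r}")
    trace["severity"] = severity

    context = llm(f"enrich ({severity}): {alert}").strip()
    if not context:
        raise ValueError("step (b) returned empty enrichment")
    trace["context"] = context

    trace["proposal"] = llm(f"propose response given {severity} / {context}").strip()
    return trace


result = run_chain("Suspicious login link emailed to finance team")
print(result)
```

The returned `trace` dictionary is what would be surfaced to the analyst: every intermediate state, not only the final proposal.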
Pattern 2: Routing
Definition: Classify an input and direct it to a specialized prompt or model optimized for that category.
Example: A support agent routing a request to the “Billing” expert vs the “Technical Troubleshooting” expert.
SOC application: Route alerts by category to specialized triage prompts — phishing alerts to a phishing-specific prompt, lateral-movement alerts to a different one. Each specialized prompt has tighter context and better accuracy.
Risk: Routing errors send alerts to the wrong specialist. Mitigation: maintain a “general triage” fallback prompt for alerts the router can’t classify confidently.
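A minimal sketch of routing with a general-triage fallback. The `classify` function is a hypothetical keyword stub standing in for an LLM or model-based router, and the prompts, categories, and `MIN_SCORE` threshold are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Tuple

# Hypothetical specialist prompts keyed by alert category.
SPECIALISTS: Dict[str, str] = {
    "phishing": "You are a phishing triage specialist.",
    "lateral_movement": "You are a lateral-movement triage specialist.",
}
GENERAL_TRIAGE = "You are a general triage analyst."
MIN_SCORE = 0.7  # assumed threshold; tune against labeled alerts


def classify(alert_text: str) -> Tuple[str, float]:
    """Hypothetical classifier stub returning (category, score)."""
    text = alert_text.lower()
    if "credential" in text or "link" in text:
        return "phishing", 0.9
    if "psexec" in text or "smb" in text:
        return "lateral_movement", 0.85
    return "unknown", 0.3


def route(alert_text: str,
          classifier: Callable[[str], Tuple[str, float]] = classify) -> str:
    """Return a specialist prompt when the router is confident, else the fallback."""
    category, score = classifier(alert_text)
    if score < MIN_SCORE or category not in SPECIALISTS:
        return GENERAL_TRIAGE
    return SPECIALISTS[category]


print(route("User clicked link in email"))
print(route("Unusual DNS query volume"))
```

Note that the score here only selects which prompt handles the alert; it does not decide whether a human approves an action.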
Pattern 3: Parallelization
Definition: Execute multiple LLM calls simultaneously — either for independent sub-tasks (Sectioning) or to reach consensus (Voting).
Example: Summarizing five book chapters simultaneously; or running three agents on the same query to find the best SQL solution.
SOC application: Parallel enrichment — when an alert needs URL reputation, sender reputation, and recent-campaign correlation, run all three lookups simultaneously. Combine results before passing to the response step.
Risk: Cost multiplication. Running 3 LLM calls per alert at 10M alerts/month is 30M LLM calls — verify cost economics before scaling.
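The call-volume arithmetic above can be turned into a quick cost check. The tokens-per-call figure and the price per million tokens below are placeholder assumptions, not quotes from any provider; substitute your own numbers.

```python
from typing import Tuple


def monthly_llm_cost(alerts_per_month: int, calls_per_alert: int,
                     tokens_per_call: int,
                     usd_per_million_tokens: float) -> Tuple[int, float]:
    """Return (total LLM calls per month, estimated monthly cost in USD)."""
    calls = alerts_per_month * calls_per_alert
    cost = calls * tokens_per_call / 1_000_000 * usd_per_million_tokens
    return calls, cost


# 10M alerts/month and 3 calls/alert come from the text above;
# 1,500 tokens/call and $1.00 per million tokens are assumed placeholders.
calls, cost = monthly_llm_cost(10_000_000, 3, 1_500, 1.00)
print(f"{calls:,} calls/month, about ${cost:,.0f}/month")
# prints: 30,000,000 calls/month, about $45,000/month
```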
Pattern 4: Orchestrator-Worker
Definition: A central LLM analyzes a task, delegates sub-tasks to specialized workers, and synthesizes results.
Example: An engineering agent breaks a feature request into frontend/backend/test sub-tasks for sub-agents to execute.
SOC application: Multi-agent IR — orchestrator agent receives a major incident, delegates investigation to specialized worker agents (one for endpoint forensics, one for network flow analysis, one for log correlation), synthesizes the findings into a single response plan.
Risk: The orchestrator becomes a single point of failure and a high-value target for prompt injection. Mitigation: orchestrator agents need especially strong input filtering and HITL gates for cross-domain actions.
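One way to limit what a prompt-injected orchestrator can do is to honor only an allowlist of delegations. The sketch below assumes hypothetical worker functions and a `plan` stub that stands in for the orchestrator's LLM call; the last planned task simulates an injected instruction.

```python
from typing import Callable, Dict, List


# Hypothetical workers; each would wrap its own prompt and tool set.
def endpoint_forensics(incident: str) -> str:
    return f"endpoint: no persistence found for {incident}"


def network_flows(incident: str) -> str:
    return f"network: beaconing to one external IP for {incident}"


def log_correlation(incident: str) -> str:
    return f"logs: 3 failed admin logins before {incident}"


WORKERS: Dict[str, Callable[[str], str]] = {
    "endpoint": endpoint_forensics,
    "network": network_flows,
    "logs": log_correlation,
}


def plan(incident: str) -> List[str]:
    """Hypothetical planner stub standing in for the orchestrator LLM call."""
    return ["endpoint", "network", "logs", "delete_all_backups"]


def orchestrate(incident: str) -> Dict[str, object]:
    """Delegate only to allowlisted workers, then synthesize their findings."""
    findings: Dict[str, str] = {}
    rejected: List[str] = []
    for task in plan(incident):
        worker = WORKERS.get(task)
        if worker is None:
            rejected.append(task)  # unknown delegation: drop it and record it
            continue
        findings[task] = worker(incident)
    return {
        "findings": findings,
        "rejected": rejected,
        "summary": "; ".join(findings.values()),
    }


report = orchestrate("INC-1042")
print(report["summary"])
print("rejected:", report["rejected"])
```

Rejected delegations are worth logging as audit events, since an unexpected task name is a useful signal that the orchestrator's input was manipulated.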
Pattern 5: Evaluator-Optimizer
Definition: An iterative loop where a Generator produces an output and an Evaluator provides feedback for refinement.
Example: High-stakes code generation where a second agent reviews the output for security bugs and forces a rewrite if any are found.
SOC application: Auto-generated detection rules where a “generator” agent proposes a Sigma rule and an “evaluator” agent reviews it for known false-positive patterns before deploying.
Risk: Generator-evaluator collusion if both run on the same model. Mitigation: use different models (or significantly different prompts) for generator and evaluator.
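A minimal sketch of the generate-evaluate loop with a bounded number of rounds. Both functions are hypothetical stubs (in practice each would call a different model), and the rule strings are simplified placeholders, not valid Sigma.

```python
from typing import Callable, List, Tuple


def generate_rule(feedback: List[str]) -> str:
    """Hypothetical generator stub: tightens the rule once it has feedback."""
    if not feedback:
        return "process.name == 'powershell.exe'"
    return "process.name == 'powershell.exe' and cmdline contains '-enc'"


def evaluate_rule(rule: str) -> Tuple[bool, str]:
    """Hypothetical evaluator stub: rejects rules with a single broad condition."""
    if " and " not in rule:
        return False, "matches all PowerShell; add a narrowing condition"
    return True, "ok"


def refine(max_rounds: int = 3,
           generator: Callable[[List[str]], str] = generate_rule,
           evaluator: Callable[[str], Tuple[bool, str]] = evaluate_rule
           ) -> Tuple[str, int]:
    """Loop generator -> evaluator until accepted or max_rounds is reached."""
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        rule = generator(feedback)
        accepted, note = evaluator(rule)
        if accepted:
            return rule, round_no
        feedback.append(note)  # evaluator feedback drives the next draft
    raise RuntimeError("no acceptable rule within max_rounds; escalate to a human")


rule, rounds = refine()
print(f"accepted after {rounds} round(s): {rule}")
```

The round limit matters: without it, a generator and evaluator that never agree would loop indefinitely and burn budget.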
LangGraph HITL primitives
LangGraph (LangChain’s agent-orchestration framework) provides built-in primitives for human-in-the-loop control flow:
interrupt(payload)
Pauses the agent graph inside a node and surfaces the payload to the caller, which relays it to the human reviewer (typically via a Slack approval bot, web UI, or CLI). The agent does not advance until resumed.
Semantic guarantees:
- Deterministic resumption at the start of the interrupted node
- Idempotency requirement: any side-effects in the interrupted node should be idempotent because the node may execute again on resume
- State is persisted via checkpointer (Postgres/Mongo) — survives process restarts
- JSON-safe serialization: interrupt payloads and resume values should be JSON-serializable so that state can be persisted and transferred across environments
Command(resume=value)
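Resumes an interrupted graph. The value passed in resume becomes the return value of interrupt() inside the paused node, so it carries the reviewer’s decision back to the agent.
# In a human-review UI handler: resume by invoking the graph on the same thread
config = {"configurable": {"thread_id": thread_id}}
graph.invoke(Command(resume="approved"), config=config)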
Command(goto=node_name)
Allows the human reviewer to redirect the agent to a different node (e.g., “this needs more investigation, route back to enrichment”).
Checkpointer
Stores graph state in durable storage. Required for HITL because the agent must survive the wait for human review (which can take minutes to hours).
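from langgraph.checkpoint.postgres import PostgresSaver
# from_conn_string is used as a context manager in current langgraph-checkpoint-postgres releases
with PostgresSaver.from_conn_string(POSTGRES_URL) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)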
The action-criticality matrix
The right HITL gate depends on action criticality, not on the model’s self-reported confidence. The matrix:
| Action class | Default policy | Override conditions |
|---|---|---|
| Read-only enrichment, lookups, tagging | Auto — no HITL needed | Override to require HITL only for tagged-as-sensitive data sources |
| Ticket creation, internal documentation | Auto with audit — log every action | Periodic review of agent decisions; no per-action HITL |
| User-facing notification (email to employee, Slack DM) | Auto with audit + rate limit | Rate limit to prevent spam; HITL if rate exceeded |
| Email-to-external (vendor, customer) | HITL required | Always — external comms have brand and legal exposure |
| Host isolation, credential reset, firewall rule change | HITL required | Always — operational impact too high for auto |
| Cross-domain action (AD/Okta, cloud IAM, EDR config) | Dual-control HITL | Two-human approval required |
| Financial transaction, wire transfer | HITL + secondary OOB verification | Module 2.4’s workflow-gap pattern applies |
Never gate HITL on model self-reported confidence. A model can be 0.95-confident about taking the wrong action. Use action-criticality, not certainty.
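The matrix translates naturally into a lookup table. The sketch below is one possible encoding: the action-class keys and `Gate` names are hypothetical labels for the rows above, and failing closed on unknown action classes is an assumption of this sketch rather than something the matrix specifies.

```python
from enum import Enum
from typing import Dict, Optional


class Gate(Enum):
    AUTO = "auto"
    AUTO_AUDIT = "auto_with_audit"
    AUTO_AUDIT_RATE_LIMITED = "auto_with_audit_and_rate_limit"
    HITL = "hitl_required"
    DUAL_CONTROL = "dual_control_hitl"
    HITL_OOB = "hitl_plus_oob_verification"


# Hypothetical action-class keys, one or more per matrix row.
POLICY: Dict[str, Gate] = {
    "read_only_enrichment": Gate.AUTO,
    "ticket_creation": Gate.AUTO_AUDIT,
    "user_notification": Gate.AUTO_AUDIT_RATE_LIMITED,
    "external_email": Gate.HITL,
    "host_isolation": Gate.HITL,
    "credential_reset": Gate.HITL,
    "firewall_rule_change": Gate.HITL,
    "cross_domain_change": Gate.DUAL_CONTROL,
    "financial_transaction": Gate.HITL_OOB,
}


def required_gate(action_class: str, confidence: Optional[float] = None) -> Gate:
    """Look up the gate by action class; confidence is deliberately ignored."""
    # Unknown classes fail closed to dual control (an assumption of this sketch).
    return POLICY.get(action_class, Gate.DUAL_CONTROL)


print(required_gate("host_isolation", confidence=0.99))
print(required_gate("read_only_enrichment", confidence=0.10))
```

The `confidence` parameter exists only to make the point explicit: a caller can pass it, and it changes nothing.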
The Codex-generated multi-agent SOC workflow
The full implementation is at .boss-pattern-work/day4/multi_agent_soc.py. Three agent nodes:
TriageAgent
Receives an incoming alert (JSON). Classifies severity, identifies suspected MITRE techniques, and recommends next queries. Uses an LLM call (Claude / GPT / Llama 3.x, abstracted behind a function).
EnrichmentAgent
Parallel tool fan-out (Module 4.3 Pattern 3 — Parallelization):
- Mock URL reputation lookup
- Mock sender reputation lookup
- Mock recent-campaign correlation against a 30-day ticket store
Aggregates the three results into an enrichment summary.
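The fan-out and aggregation can be sketched with asyncio. This is not the code from multi_agent_soc.py; the three lookup coroutines are hypothetical mocks with made-up return values, and the alert fields are assumptions.

```python
import asyncio
from typing import Any, Dict


async def url_reputation(url: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)  # stands in for network latency
    return {"url_reputation": "malicious" if "login-verify" in url else "unknown"}


async def sender_reputation(sender: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)
    return {"sender_reputation": "new_domain"}


async def campaign_correlation(subject: str) -> Dict[str, str]:
    await asyncio.sleep(0.01)
    return {"campaign_match": "none_in_30_days"}


async def enrich(alert: Dict[str, str]) -> Dict[str, Any]:
    """Run the three mock lookups concurrently and merge their results."""
    results = await asyncio.gather(
        url_reputation(alert["url"]),
        sender_reputation(alert["sender"]),
        campaign_correlation(alert["subject"]),
        return_exceptions=True,  # one failed lookup should not sink the others
    )
    summary: Dict[str, Any] = {"errors": []}
    for result in results:
        if isinstance(result, Exception):
            summary["errors"].append(repr(result))
        else:
            summary.update(result)
    return summary


alert = {
    "url": "http://login-verify.example/reset",
    "sender": "it-support@example.net",
    "subject": "Password expiry notice",
}
summary = asyncio.run(enrich(alert))
print(summary)
```

Using `return_exceptions=True` means a failed lookup shows up in `errors` instead of aborting the whole enrichment step, which lets the response step see partial results.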
ResponseAgent
Reads triage + enrichment, proposes containment actions. For any action above “auto” tier in the criticality matrix, the workflow inserts interrupt() before the action is executed.
Audit-log emission
Every agent decision emits a structured event:
import json
from datetime import datetime, timezone

def emit_audit(
    agent: str,
    decision: str,
    prompt_hash: str,
    model_version: str,
    tool_args: dict,
    latency_ms: int,
    user_id: str | None,
    confidence: float | None,
) -> None:
    """Emit a structured audit log event for an agent decision.

    Fields:
    - who: agent name + user context
    - what: decision string + tool args
    - prompt_hash: SHA-256 of the prompt for reproducibility
    - model_version: exact model ID used (Claude Sonnet 4.6, Llama 3.1-70B, etc.)
    - latency_ms: time taken
    - confidence: agent's self-reported confidence (logged but NOT used for HITL decisions)
    """
    event = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated in Python 3.12+
        "@timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "event.kind": "event",
        "event.category": ["intrusion_detection", "agent_decision"],
        "agent.name": agent,
        "agent.decision": decision,
        "agent.prompt_hash": prompt_hash,
        "agent.model_version": model_version,
        "agent.tool_args": tool_args,  # must be JSON-serializable
        "agent.latency_ms": latency_ms,
        "agent.confidence": confidence,
        "user.id": user_id,
    }
    print(json.dumps(event))  # production: ship to SIEM
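The prompt_hash field can be computed with the standard library. The helper below is a hypothetical sketch; splitting the prompt into system and user parts, and the NUL separator between them, are assumptions for illustration.

```python
import hashlib


def prompt_hash(system_prompt: str, user_prompt: str) -> str:
    """SHA-256 over both prompt parts, separated so ('ab', 'c') != ('a', 'bc')."""
    material = system_prompt.encode("utf-8") + b"\x00" + user_prompt.encode("utf-8")
    return hashlib.sha256(material).hexdigest()


digest = prompt_hash("You are a triage agent.", "Alert: suspicious login link")
print(digest)
```

A hash only proves which prompt was used; reproducing a decision also requires storing the prompt text itself somewhere the hash can be checked against.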
Production audit-schema references
Two industry-standard schemas the SOC should consider:
- OCSF (Open Cybersecurity Schema Framework): LangSmith and others publish OCSF-compliant agent audit events. Map: actor.agent_id, operation, affected_resources, ocsf_class: 6003 (API Activity)
- OpenInference (Arize Phoenix): span-based observability with reasoning trace; fields: span_id, label, score, explanation, attributes.mcp_server_uri
For production deployments, choose one schema and stick with it. The Codex-generated workflow emits a simplified version; in production, swap for the OCSF schema your SIEM ingests.
Production case studies
The 2025-2026 production-deployment landscape for agentic SOCs is still maturing. A few documented patterns:
- Financial services (e.g., autonomous loan approval / payment triage): gated execution, where transactions above a dollar threshold ($5,000 is a common watershed) trigger a hard interrupt() that requires a human manager’s digital signature
- Healthcare (e.g., AI-assisted dosage or clinical note review): confidence-based escalation, where low-confidence decisions are queued for “human above the loop” audit rather than blocking
- Anthropic Claude Code internal use: immutable audit trails written to a write-only, tamper-evident ledger prevent agent self-deception about prior actions
Instructor note: the specific public case studies in this space are still emerging in 2025-2026; verify particular vendor / org claims at delivery time. The architectural patterns (gated execution, immutable audit, confidence escalation) are stable; the named examples may shift.
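The immutable-audit pattern is often approximated with a hash chain, where each entry commits to the one before it. The sketch below shows that general technique only; it is not a description of how any organization named above implements its ledger, and all function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
    """Hash the previous entry's hash together with a canonical JSON body."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(ledger: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append an event that commits to the hash of the previous entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"event": event, "prev": prev_hash,
                   "hash": _digest(prev_hash, event)})


def verify(ledger: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger: List[Dict[str, Any]] = []
append_event(ledger, {"agent": "ResponseAgent", "decision": "propose_isolation"})
append_event(ledger, {"agent": "ResponseAgent", "decision": "awaiting_approval"})
intact = verify(ledger)

ledger[0]["event"]["decision"] = "no_action"  # simulate tampering
tampered_ok = verify(ledger)
print(intact, tampered_ok)
```

A hash chain makes tampering detectable, not impossible; the chain head still needs to be stored somewhere the agent cannot write.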
The “rule of two” applied
From Day 3 Module 3.6: any agent should hold at most two legs of the lethal trifecta (private data + untrusted content + external communication).
Applied to your own SOC agents:
- Triage agent — has private data (alerts) + untrusted content (alert payloads which can be adversary-crafted). Should NOT have external communication. It can write to your ticket system (internal) but not directly email customers.
- Enrichment agent — has external communication (calling threat-intel APIs) + may see untrusted content (in alert payloads). Should NOT have private data access — give it the alert metadata, not the full ticket store.
- Response agent — has private data (the org’s response runbook) + external communication (can isolate hosts, send notifications). Should NOT see untrusted content. Strip adversary-controlled fields from alerts before passing to the response agent.
The Codex multi-agent workflow implements this decomposition: TriageAgent reads alerts but cannot externally communicate; EnrichmentAgent calls external services but processes only sanitized fields; ResponseAgent acts but receives only the triage+enrichment summary, never raw adversary-controlled content.
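The decomposition above can be checked mechanically at design time. The sketch below is a hypothetical lint, not part of the Codex-generated workflow: the capability labels and the `violations` helper are assumptions, and the capability sets mirror the three bullets above.

```python
from typing import Dict, FrozenSet

LEGS: FrozenSet[str] = frozenset(
    {"private_data", "untrusted_content", "external_comms"}
)

# Capability sets taken from the three-agent decomposition described above.
AGENT_CAPABILITIES: Dict[str, FrozenSet[str]] = {
    "TriageAgent": frozenset({"private_data", "untrusted_content"}),
    "EnrichmentAgent": frozenset({"untrusted_content", "external_comms"}),
    "ResponseAgent": frozenset({"private_data", "external_comms"}),
}


def violations(capabilities: Dict[str, FrozenSet[str]]) -> Dict[str, FrozenSet[str]]:
    """Return agents that hold all three legs or declare an unknown capability."""
    bad: Dict[str, FrozenSet[str]] = {}
    for agent, caps in capabilities.items():
        if not caps <= LEGS or len(caps & LEGS) > 2:
            bad[agent] = caps
    return bad


print(violations(AGENT_CAPABILITIES))

# A single agent with every permission fails the check.
monolith = dict(AGENT_CAPABILITIES, MonolithAgent=LEGS)
print(sorted(violations(monolith)))
```

A check like this only validates declared capabilities; it is only as accurate as the mapping from real tool permissions to the three labels.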
Discussion questions (~10 min)
- Your SOC wants to auto-execute “block this IOC across all firewalls” without HITL because the action is reversible. The action-criticality matrix places firewall rule changes in the HITL-required tier. Is auto-execution defensible because it’s reversible? Walk through the failure modes.
- The Codex multi-agent workflow uses interrupt() before any action above the auto tier. What’s the worst-case latency the SOC should plan for, and how does that compare to manual analyst response time pre-agent?
- The “rule of two” decomposition splits the agent into three sub-agents. What new attack surface does this create (e.g., adversary manipulating one sub-agent’s output to influence another)? Identify mitigation.
Common mistakes
| Mistake | Better approach |
|---|---|
| Building one big agent with all permissions | Decompose by rule-of-two; smaller agents each break a leg of the trifecta |
| Gating HITL on model confidence score | Action-criticality matrix; never trust model self-reported certainty |
| Skipping the audit log because it’s “verbose” | The audit log is the SIEM event stream for agent decisions — non-optional |
| Using the same model for generator and evaluator | Collusion bias; use different models or significantly different prompts |
| Treating LangGraph interrupt() as a Python input() | The semantic guarantees (durable, idempotent, JSON-safe) are what make it production-grade |
What’s next
Module 4.4 covers AI supply-chain compromise — the LiteLLM/Mercor March 2026 case in technical detail, plus the JFrog Hugging Face 2024 disclosure and the PyTorch torchtriton 2022 incident. The defender’s discipline: model SBOM, provenance pinning, picklescan and safetensors-scan.