Module 4.3 — Hardening Your Own Agents

50-minute lecture · Day 4 afternoon · Hands-on Python in the lab

Learning objectives

By end of this module, students can:

Name and apply the five agent design patterns from Anthropic’s Building Effective Agents (Dec 2024): prompt chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer
Use LangGraph HITL primitives (interrupt(), Command(resume=...), checkpointing) to gate cross-domain or destructive agent actions
Apply the action-criticality matrix to decide which actions require auto-execution, audit-only, HITL approval, or dual-control
Deploy a working multi-agent SOC workflow (Codex-generated) that triages alerts, enriches them in parallel, and proposes containment with explicit HITL gates and audit-log emission

The defender’s reframe

Day 4 has covered adversaries running their own agents. The flip side: your own SOC may run agents too — and those agents need hardening against the same role-play exploit, the same lethal trifecta, the same indirect-prompt-injection vectors that adversaries use against external systems.

The detection engineer’s job in Module 4.3 is architectural defense of the org’s own agents before they become the lateral-movement vehicle for an adversary. The five canonical agent patterns from Anthropic’s Building Effective Agents (Dec 2024) frame how to think about agent design.

The five agent patterns (Anthropic, Building Effective Agents)

Each pattern has different risk and detection implications. Walk through them before designing the SOC’s own agent workflow.

Pattern 1: Prompt Chaining

Definition: Decompose a complex task into a sequence of smaller LLM calls, where each step’s output feeds the next.

Example: Generate a technical document by (a) creating an outline, (b) drafting sections, (c) polishing the final text. Three sequential LLM calls.

SOC application: Alert triage chain — (a) classify alert severity, (b) enrich with relevant context, (c) propose response. Three sequential LLM calls.

Risk: Errors compound across the chain. If step (a) misclassifies, step (b) and step (c) build on the wrong foundation. Mitigation: validate each step’s output before passing to the next; surface the chain’s intermediate states to the analyst.

Pattern 2: Routing

Definition: Classify an input and direct it to a specialized prompt or model optimized for that category.

Example: A support agent routing a request to the “Billing” expert vs the “Technical Troubleshooting” expert.

SOC application: Route alerts by category to specialized triage prompts — phishing alerts to a phishing-specific prompt, lateral-movement alerts to a different one. Each specialized prompt has tighter context and better accuracy.

Risk: Routing errors send alerts to the wrong specialist. Mitigation: maintain a “general triage” fallback prompt for alerts the router can’t classify confidently.

Pattern 3: Parallelization

Definition: Execute multiple LLM calls simultaneously — either for independent sub-tasks (Sectioning) or to reach consensus (Voting).

Example: Summarizing five book chapters simultaneously; or running three agents on the same query to find the best SQL solution.

SOC application: Parallel enrichment — when an alert needs URL reputation, sender reputation, and recent-campaign correlation, run all three lookups simultaneously. Combine results before passing to the response step.

Risk: Cost multiplication. Running 3 LLM calls per alert at 10M alerts/month is 30M LLM calls — verify cost economics before scaling.

Pattern 4: Orchestrator-Worker

Definition: A central LLM analyzes a task, delegates sub-tasks to specialized workers, and synthesizes results.

Example: An engineering agent breaks a feature request into frontend/backend/test sub-tasks for sub-agents to execute.

SOC application: Multi-agent IR — orchestrator agent receives a major incident, delegates investigation to specialized worker agents (one for endpoint forensics, one for network flow analysis, one for log correlation), synthesizes the findings into a single response plan.

Risk: The orchestrator becomes a single point of failure and a high-value target for prompt injection. Mitigation: orchestrator agents need especially strong input filtering and HITL gates for cross-domain actions.

Pattern 5: Evaluator-Optimizer

Definition: An iterative loop where a Generator produces an output and an Evaluator provides feedback for refinement.

Example: High-stakes code generation where a second agent reviews the output for security bugs and forces a rewrite if any are found.

SOC application: Auto-generated detection rules where a “generator” agent proposes a Sigma rule and an “evaluator” agent reviews it for known false-positive patterns before deploying.

Risk: Generator-evaluator collusion if both run on the same model. Mitigation: use different models (or significantly different prompts) for generator and evaluator.

LangGraph HITL primitives

LangGraph (LangChain’s agent-orchestration framework) provides built-in primitives for human-in-the-loop control flow:

`interrupt(payload)`

Pauses the agent graph at a node and returns control to the human. The payload is delivered to the human reviewer (typically via Slack approval bot, web UI, or CLI). The agent does not advance until resumed.

Semantic guarantees:

Deterministic resumption at the start of the interrupted node
Idempotency requirement: any side-effects in the interrupted node should be idempotent because the node may execute again on resume
State is persisted via checkpointer (Postgres/Mongo) — survives process restarts
JSON-safe data serialization for cross-environment state transfers

`Command(resume=value)`

Resumes an interrupted graph, optionally passing a value back as the interrupt’s “decision.”

# In a human-review UI handler:
graph.update_state(thread_id, Command(resume="approved"))

`Command(goto=node_name)`

Allows the human reviewer to redirect the agent to a different node (e.g., “this needs more investigation, route back to enrichment”).

`Checkpointer`

Stores graph state in durable storage. Required for HITL because the agent must survive the wait for human review (which can take minutes to hours).

from langgraph.checkpoint.postgres import PostgresSaver

checkpointer = PostgresSaver.from_conn_string(POSTGRES_URL)
graph = builder.compile(checkpointer=checkpointer)

The action-criticality matrix

The right HITL gate depends on action criticality, not on the model’s self-reported confidence. The matrix:

Action class	Default policy	Override conditions
Read-only enrichment, lookups, tagging	Auto — no HITL needed	Override to require HITL only for tagged-as-sensitive data sources
Ticket creation, internal documentation	Auto with audit — log every action	Periodic review of agent decisions; no per-action HITL
User-facing notification (email to employee, Slack DM)	Auto with audit + rate limit	Rate limit to prevent spam; HITL if rate exceeded
Email-to-external (vendor, customer)	HITL required	Always — external comms have brand and legal exposure
Host isolation, credential reset, firewall rule change	HITL required	Always — operational impact too high for auto
Cross-domain action (AD/Okta, cloud IAM, EDR config)	Dual-control HITL	Two-human approval required
Financial transaction, wire transfer	HITL + secondary OOB verification	Module 2.4’s workflow-gap pattern applies

Never gate HITL on model self-reported confidence. A model can be 0.95-confident about taking the wrong action. Use action-criticality, not certainty.

The Codex-generated multi-agent SOC workflow

The full implementation is at .boss-pattern-work/day4/multi_agent_soc.py. Three agent nodes:

TriageAgent

Receives an incoming alert (JSON). Classifies severity, suspected MITRE techniques, recommended next queries. Uses an LLM call (Claude / GPT / Llama 3.x — abstracted behind a function).

EnrichmentAgent

Parallel tool fan-out (Module 4.3 Pattern 3 — Parallelization):

Mock URL reputation lookup
Mock sender reputation lookup
Mock recent-campaign correlation against a 30-day ticket store

Aggregates the three results into an enrichment summary.

ResponseAgent

Reads triage + enrichment, proposes containment actions. For any action above “auto” tier in the criticality matrix, the workflow inserts interrupt() before the action is executed.

Audit-log emission

Every agent decision emits a structured event:

def emit_audit(
    agent: str,
    decision: str,
    prompt_hash: str,
    model_version: str,
    tool_args: dict,
    latency_ms: int,
    user_id: str | None,
    confidence: float | None,
) -> None:
    """Emit a structured audit log event for an agent decision.

    Fields:
      - who: agent name + user context
      - what: decision string + tool args
      - prompt_hash: SHA-256 of the prompt for reproducibility
      - model_version: exact model ID used (Claude Sonnet 4.6, Llama 3.1-70B, etc.)
      - latency_ms: time taken
      - confidence: agent's self-reported confidence (logged but NOT used for HITL decisions)
    """
    event = {
        "@timestamp": datetime.utcnow().isoformat() + "Z",
        "event.kind": "event",
        "event.category": ["intrusion_detection", "agent_decision"],
        "agent.name": agent,
        "agent.decision": decision,
        "agent.prompt_hash": prompt_hash,
        "agent.model_version": model_version,
        "agent.tool_args": tool_args,
        "agent.latency_ms": latency_ms,
        "agent.confidence": confidence,
        "user.id": user_id,
    }
    print(json.dumps(event))  # production: ship to SIEM

Production audit-schema references

Two industry-standard schemas the SOC should adopt:

OCSF (Open Cybersecurity Schema Framework) — LangSmith and others publish OCSF-compliant agent audit events. Map: actor.agent_id, operation, affected_resources, ocsf_class: 6003 (API Activity)
OpenInference (Arize Phoenix) — span-based observability with reasoning trace; fields: span_id, label, score, explanation, attributes.mcp_server_uri

For production deployments, choose one schema and stick with it. The Codex-generated workflow emits a simplified version; in production, swap for the OCSF schema your SIEM ingests.

Production case studies

The 2025-2026 production-deployment landscape for agentic SOCs is still maturing. A few documented patterns:

Financial services (e.g., autonomous loan approval / payment triage) — gated execution where transactions above a dollar threshold ($5,000 is a common watershed) trigger hard interrupt() requiring human manager’s digital signature
Healthcare (e.g., AI-assisted dosage or clinical note review) — confidence-based escalation where low-confidence decisions are queued for “human above the loop” audit rather than blocking
Anthropic Claude Code internal use — immutable audit trails to write-only tamper-evident ledger prevents agent self-deception about prior actions

Instructor note: the specific public case studies in this space are still emerging in 2025-2026; verify particular vendor / org claims at delivery time. The architectural patterns (gated execution, immutable audit, confidence escalation) are stable; the named examples may shift.

The “rule of two” applied

From Day 3 Module 3.6: any agent should satisfy a maximum of two legs of the lethal trifecta (private data + untrusted content + external communication).

Applied to your own SOC agents:

Triage agent — has private data (alerts) + untrusted content (alert payloads which can be adversary-crafted). Should NOT have external communication. It can write to your ticket system (internal) but not directly email customers.
Enrichment agent — has external communication (calling threat-intel APIs) + may see untrusted content (in alert payloads). Should NOT have private data access — give it the alert metadata, not the full ticket store.
Response agent — has private data (the org’s response runbook) + external communication (can isolate hosts, send notifications). Should NOT see untrusted content. Strip adversary-controlled fields from alerts before passing to the response agent.

The Codex multi-agent workflow implements this decomposition: TriageAgent reads alerts but cannot externally communicate; EnrichmentAgent calls external services but processes only sanitized fields; ResponseAgent acts but receives only the triage+enrichment summary, never raw adversary-controlled content.

Discussion questions (~10 min)

Your SOC wants to auto-execute “block this IOC across all firewalls” without HITL because the action is reversible. The action-criticality matrix classifies this as cross-domain (firewall config change). Is auto-execution defensible because it’s reversible? Walk through the failure modes.
The Codex multi-agent workflow uses interrupt() before any action above the auto tier. What’s the worst-case latency the SOC should plan for — and how does that compare to manual analyst response time pre-agent?
The “rule of two” decomposition splits the agent into three sub-agents. What new attack surface does this create (e.g., adversary manipulating one sub-agent’s output to influence another)? Identify mitigation.

Common mistakes

Mistake	Better approach
Building one big agent with all permissions	Decompose by rule-of-two; smaller agents each break a leg of the trifecta
Gating HITL on model confidence score	Action-criticality matrix; never trust model self-reported certainty
Skipping the audit log because it’s “verbose”	The audit log is the SIEM event stream for agent decisions — non-optional
Using the same model for generator and evaluator	Collusion bias; use different models or significantly different prompts
Treating LangGraph `interrupt()` as a Python `input()`	The semantic guarantees (durable, idempotent, JSON-safe) are what make it production-grade

What’s next

Module 4.4 covers AI supply-chain compromise — the LiteLLM/Mercor March 2026 case in technical detail, plus the JFrog Hugging Face 2024 disclosure and the PyTorch torchtriton 2022 incident. The defender’s discipline: model SBOM, provenance pinning, picklescan and safetensors-scan.