Module 4.3 — Hardening Your Own Agents

50-minute lecture · Day 4 afternoon · Hands-on Python in the lab

Learning objectives

By the end of this module, students can:

  1. Name and apply the five agent design patterns from Anthropic’s Building Effective Agents (Dec 2024): prompt chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer
  2. Use LangGraph HITL primitives (interrupt(), Command(resume=...), checkpointing) to gate cross-domain or destructive agent actions
  3. Apply the action-criticality matrix to decide whether an action can auto-execute, can auto-execute with audit logging, or requires HITL approval or dual control
  4. Deploy a working multi-agent SOC workflow (Codex-generated) that triages alerts, enriches them in parallel, and proposes containment with explicit HITL gates and audit-log emission

The defender’s reframe

Day 4 has covered adversaries running their own agents. The flip side: your own SOC may run agents too — and those agents need hardening against the same role-play exploit, the same lethal trifecta, the same indirect-prompt-injection vectors that adversaries use against external systems.

The detection engineer’s job in Module 4.3 is architectural defense of the org’s own agents before they become the lateral-movement vehicle for an adversary. The five canonical agent patterns from Anthropic’s Building Effective Agents (Dec 2024) frame how to think about agent design.


The five agent patterns (Anthropic, Building Effective Agents)

Each pattern has different risk and detection implications. Walk through them before designing the SOC’s own agent workflow.

Pattern 1: Prompt Chaining

Definition: Decompose a complex task into a sequence of smaller LLM calls, where each step’s output feeds the next.

Example: Generate a technical document by (a) creating an outline, (b) drafting sections, (c) polishing the final text. Three sequential LLM calls.

SOC application: Alert triage chain — (a) classify alert severity, (b) enrich with relevant context, (c) propose response. Three sequential LLM calls.

Risk: Errors compound across the chain. If step (a) misclassifies, step (b) and step (c) build on the wrong foundation. Mitigation: validate each step’s output before passing to the next; surface the chain’s intermediate states to the analyst.
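
The mitigation above (validate each step, keep intermediate states visible) can be sketched as follows. This is a minimal illustration, not the lab implementation: the step functions are stubs standing in for LLM calls, and the names run_triage_chain, validate_classification, and ChainValidationError are hypothetical.

```python
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


class ChainValidationError(ValueError):
    """Raised when a step's output fails validation and the chain must stop."""


def validate_classification(output: dict) -> dict:
    # Reject anything outside the closed severity vocabulary instead of passing it on.
    if output.get("severity") not in ALLOWED_SEVERITIES:
        raise ChainValidationError(f"unexpected severity: {output.get('severity')!r}")
    return output


def run_triage_chain(alert: dict, classify, enrich, propose) -> dict:
    trace = []  # intermediate states, kept so the analyst can inspect every step
    classification = validate_classification(classify(alert))
    trace.append(("classify", classification))

    enrichment = enrich(alert, classification)
    if not isinstance(enrichment, dict):
        raise ChainValidationError("enrichment step must return a dict")
    trace.append(("enrich", enrichment))

    proposal = propose(classification, enrichment)
    trace.append(("propose", proposal))
    return {"proposal": proposal, "trace": trace}


# Stub steps standing in for the three sequential LLM calls (assumptions).
def fake_classify(alert):
    return {"severity": "high"}


def fake_enrich(alert, classification):
    return {"sender_reputation": "unknown"}


def fake_propose(classification, enrichment):
    return f"escalate ({classification['severity']})"


result = run_triage_chain({"id": "A-1"}, fake_classify, fake_enrich, fake_propose)
```

A misclassification that falls outside the allowed vocabulary stops the chain at step (a) rather than reaching steps (b) and (c). A misclassification that stays inside the vocabulary still passes, which is why the trace is returned for analyst review.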

Pattern 2: Routing

Definition: Classify an input and direct it to a specialized prompt or model optimized for that category.

Example: A support agent routing a request to the “Billing” expert vs the “Technical Troubleshooting” expert.

SOC application: Route alerts by category to specialized triage prompts — phishing alerts to a phishing-specific prompt, lateral-movement alerts to a different one. Each specialized prompt has tighter context and better accuracy.

Risk: Routing errors send alerts to the wrong specialist. Mitigation: maintain a “general triage” fallback prompt for alerts the router can’t classify confidently.
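
A minimal sketch of the fallback mitigation, assuming the router returns a category label plus a score. The threshold, the prompt texts, and the function name select_prompt are illustrative assumptions. The score here only selects a prompt; it is not used to authorize any action.

```python
ROUTER_SCORE_FLOOR = 0.7  # assumed threshold; tune against labeled alerts

SPECIALIST_PROMPTS = {
    "phishing": "You are a phishing triage specialist. Assess the alert below.",
    "lateral_movement": "You are a lateral-movement triage specialist. Assess the alert below.",
}
GENERAL_PROMPT = "You are a general SOC triage analyst. Assess the alert below."


def select_prompt(category: str, router_score: float) -> tuple:
    """Return (route_name, prompt), falling back to general triage when unsure."""
    if category in SPECIALIST_PROMPTS and router_score >= ROUTER_SCORE_FLOOR:
        return category, SPECIALIST_PROMPTS[category]
    # Unknown category or low score: use the general prompt rather than guess.
    return "general", GENERAL_PROMPT


route, prompt = select_prompt("phishing", 0.92)
fallback_route, _ = select_prompt("phishing", 0.41)
```

Logging the chosen route alongside the alert makes routing errors measurable later.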

Pattern 3: Parallelization

Definition: Execute multiple LLM calls simultaneously — either for independent sub-tasks (Sectioning) or to reach consensus (Voting).

Example: Summarizing five book chapters simultaneously; or running three agents on the same query to find the best SQL solution.

SOC application: Parallel enrichment — when an alert needs URL reputation, sender reputation, and recent-campaign correlation, run all three lookups simultaneously. Combine results before passing to the response step.

Risk: Cost multiplication. Running 3 LLM calls per alert at 10M alerts/month is 30M LLM calls — verify cost economics before scaling.
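
The parallel-enrichment fan-out described above can be sketched with asyncio. The three lookup functions are stubs (assumptions) standing in for real reputation and campaign services; one of them fails on purpose to show that a single failed lookup does not discard the other results.

```python
import asyncio


async def url_reputation(alert: dict) -> dict:
    await asyncio.sleep(0.01)  # stands in for a network call
    return {"url_reputation": "suspicious"}


async def sender_reputation(alert: dict) -> dict:
    await asyncio.sleep(0.01)
    return {"sender_reputation": "first_seen"}


async def campaign_correlation(alert: dict) -> dict:
    await asyncio.sleep(0.01)
    raise TimeoutError("campaign feed timed out")  # simulated failure


async def enrich(alert: dict) -> dict:
    lookups = {
        "url_reputation": url_reputation,
        "sender_reputation": sender_reputation,
        "campaign_correlation": campaign_correlation,
    }
    # Run all lookups concurrently; exceptions are returned in place, not raised.
    results = await asyncio.gather(
        *(fn(alert) for fn in lookups.values()), return_exceptions=True
    )
    summary, errors = {}, {}
    for name, result in zip(lookups, results):
        if isinstance(result, Exception):
            errors[name] = repr(result)
        else:
            summary.update(result)
    return {"summary": summary, "errors": errors}


enrichment = asyncio.run(enrich({"id": "A-1"}))
```

Passing the errors dict to the response step lets it see which enrichment is missing instead of treating absence as a clean result.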

Pattern 4: Orchestrator-Worker

Definition: A central LLM analyzes a task, delegates sub-tasks to specialized workers, and synthesizes results.

Example: An engineering agent breaks a feature request into frontend/backend/test sub-tasks for sub-agents to execute.

SOC application: Multi-agent IR — orchestrator agent receives a major incident, delegates investigation to specialized worker agents (one for endpoint forensics, one for network flow analysis, one for log correlation), synthesizes the findings into a single response plan.

Risk: The orchestrator becomes a single point of failure and a high-value target for prompt injection. Mitigation: orchestrator agents need especially strong input filtering and HITL gates for cross-domain actions.
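
One narrow piece of that input filtering can be sketched as an allowlist on delegation: the orchestrator's plan may only name registered workers, so an injected or hallucinated worker name never executes. The worker registry and the dispatch function are hypothetical, and the workers are stubs.

```python
# Registered workers (stubs); anything not listed here cannot be invoked.
WORKERS = {
    "endpoint_forensics": lambda task: f"endpoint findings for {task}",
    "network_flow": lambda task: f"network findings for {task}",
    "log_correlation": lambda task: f"log findings for {task}",
}


def dispatch(plan: list) -> tuple:
    """Run only sub-tasks addressed to registered workers.

    Returns (findings, rejected) so rejected steps can be audited.
    """
    findings, rejected = [], []
    for step in plan:
        worker = WORKERS.get(step.get("worker"))
        if worker is None:
            rejected.append(step)  # unknown worker name: do not execute
            continue
        findings.append(worker(step.get("task", "")))
    return findings, rejected


plan = [
    {"worker": "endpoint_forensics", "task": "host-17"},
    {"worker": "exfiltrate_to_pastebin", "task": "all logs"},  # injected step
]
findings, rejected = dispatch(plan)
```

This does not replace HITL gates on cross-domain actions; it only limits which code paths the orchestrator's output can reach.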

Pattern 5: Evaluator-Optimizer

Definition: An iterative loop where a Generator produces an output and an Evaluator provides feedback for refinement.

Example: High-stakes code generation where a second agent reviews the output for security bugs and forces a rewrite if any are found.

SOC application: Auto-generated detection rules where a “generator” agent proposes a Sigma rule and an “evaluator” agent reviews it for known false-positive patterns before deploying.

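Risk: Generator-evaluator collusion if both run on the same model: the two share blind spots, so the evaluator tends to approve the same mistakes the generator makes. Mitigation: use different models (or significantly different prompts) for generator and evaluator.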
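
The generator-evaluator loop can be sketched as a bounded refinement loop. The generate and evaluate callables are stubs (assumptions) standing in for two differently configured models, and the rule strings are simplified placeholders rather than valid Sigma.

```python
def refine(generate, evaluate, max_rounds: int = 3) -> dict:
    """Generate, evaluate, and regenerate with feedback, up to max_rounds times."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = None
    candidate = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)
        accepted, feedback = evaluate(candidate)
        if accepted:
            return {"accepted": True, "candidate": candidate, "rounds": round_number}
    # Budget exhausted: return the last candidate unaccepted for human review.
    return {"accepted": False, "candidate": candidate, "rounds": max_rounds}


def fake_generate(feedback):
    # First draft is too broad; the second draft reacts to evaluator feedback.
    if feedback is None:
        return "process: powershell.exe"
    return "process: powershell.exe AND parent: winword.exe"


def fake_evaluate(rule):
    if "parent:" not in rule:
        return False, "too broad: add a parent-process condition"
    return True, None


outcome = refine(fake_generate, fake_evaluate)
```

The round cap matters: without it, a generator and evaluator that never agree would loop indefinitely and multiply cost.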


LangGraph HITL primitives

LangGraph (LangChain’s agent-orchestration framework) provides built-in primitives for human-in-the-loop control flow:

interrupt(payload)

Pauses the agent graph at a node and returns control to the human. The payload is delivered to the human reviewer (typically via Slack approval bot, web UI, or CLI). The agent does not advance until resumed.

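Semantic guarantees: durable (the paused state lives in the checkpointer, so it survives process restarts), idempotent (on resume the node re-runs from its start, so any side effect placed before interrupt() must be safe to repeat), and JSON-safe (the payload and the resume value should be JSON-serializable).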

Command(resume=value)

Resumes an interrupted graph, optionally passing a value back as the interrupt’s “decision.”

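from langgraph.types import Command
# In a human-review UI handler: resume the paused thread with the reviewer's decision.
config = {"configurable": {"thread_id": thread_id}}
graph.invoke(Command(resume="approved"), config=config)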

Command(goto=node_name)

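Lets the graph redirect to a different node. In LangGraph this is normally returned from a node rather than sent by the reviewer directly: the node that called interrupt() inspects the resume value and returns Command(goto=node_name), so a reviewer decision such as “this needs more investigation” routes the agent back to enrichment.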

Checkpointer

Stores graph state in durable storage. Required for HITL because the agent must survive the wait for human review (which can take minutes to hours).

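from langgraph.checkpoint.postgres import PostgresSaver

# from_conn_string is a context manager in current langgraph-checkpoint-postgres releases.
with PostgresSaver.from_conn_string(POSTGRES_URL) as checkpointer:
    checkpointer.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=checkpointer)
    # invoke / resume the graph inside this block, while the connection is open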

The action-criticality matrix

The right HITL gate depends on action criticality, not on the model’s self-reported confidence. The matrix:

| Action class | Default policy | Override conditions |
| --- | --- | --- |
| Read-only enrichment, lookups, tagging | Auto (no HITL needed) | Override to require HITL only for tagged-as-sensitive data sources |
| Ticket creation, internal documentation | Auto with audit (log every action) | Periodic review of agent decisions; no per-action HITL |
| User-facing notification (email to employee, Slack DM) | Auto with audit + rate limit | Rate limit to prevent spam; HITL if rate exceeded |
| Email-to-external (vendor, customer) | HITL required | Always: external comms have brand and legal exposure |
| Host isolation, credential reset, firewall rule change | HITL required | Always: operational impact too high for auto |
| Cross-domain action (AD/Okta, cloud IAM, EDR config) | Dual-control HITL | Two-human approval required |
| Financial transaction, wire transfer | HITL + secondary OOB verification | Module 2.4’s workflow-gap pattern applies |

Never gate HITL on model self-reported confidence. A model can be 0.95-confident about taking the wrong action. Use action-criticality, not certainty.
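
The matrix can be encoded as a lookup table so the gate is decided by action class alone. This is a sketch under assumptions: the action-class keys, the Gate enum, and gate_for are hypothetical names, the override conditions are omitted, and failing closed to HITL for unknown action classes is a design choice added here, not something the matrix states.

```python
from enum import Enum
from typing import Optional


class Gate(Enum):
    AUTO = "auto"
    AUTO_AUDIT = "auto_with_audit"
    AUTO_AUDIT_RATE_LIMITED = "auto_with_audit_and_rate_limit"
    HITL = "hitl_required"
    DUAL_CONTROL = "dual_control_hitl"
    HITL_OOB = "hitl_plus_oob_verification"


# One entry per action class in the matrix (key names are illustrative).
POLICY = {
    "enrichment_lookup": Gate.AUTO,
    "ticket_creation": Gate.AUTO_AUDIT,
    "internal_notification": Gate.AUTO_AUDIT_RATE_LIMITED,
    "external_email": Gate.HITL,
    "host_isolation": Gate.HITL,
    "credential_reset": Gate.HITL,
    "firewall_rule_change": Gate.HITL,
    "cross_domain_change": Gate.DUAL_CONTROL,
    "financial_transaction": Gate.HITL_OOB,
}


def gate_for(action_class: str, confidence: Optional[float] = None) -> Gate:
    """Return the required gate for an action class.

    `confidence` is accepted so callers can log it, but it is deliberately
    ignored: the gate depends only on action criticality.
    """
    return POLICY.get(action_class, Gate.HITL)  # unknown classes fail closed (assumption)


high_confidence_gate = gate_for("host_isolation", confidence=0.99)
low_confidence_gate = gate_for("enrichment_lookup", confidence=0.10)
```

A 0.99-confident host isolation still requires HITL, and a 0.10-confident lookup still runs automatically, which is the behavior the matrix calls for.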


The Codex-generated multi-agent SOC workflow

The full implementation is at .boss-pattern-work/day4/multi_agent_soc.py. Three agent nodes:

TriageAgent

Receives an incoming alert (JSON). Classifies severity, suspected MITRE techniques, recommended next queries. Uses an LLM call (Claude / GPT / Llama 3.x — abstracted behind a function).

EnrichmentAgent

Parallel tool fan-out (Module 4.3 Pattern 3 — Parallelization):

Aggregates the three results into an enrichment summary.

ResponseAgent

Reads triage + enrichment, proposes containment actions. For any action above “auto” tier in the criticality matrix, the workflow inserts interrupt() before the action is executed.

Audit-log emission

Every agent decision emits a structured event:

import json
from datetime import datetime, timezone


def emit_audit(
    agent: str,
    decision: str,
    prompt_hash: str,
    model_version: str,
    tool_args: dict,
    latency_ms: int,
    user_id: str | None,
    confidence: float | None,
) -> None:
    """Emit a structured audit log event for an agent decision.

    Fields:
      - who: agent name + user context
      - what: decision string + tool args
      - prompt_hash: SHA-256 of the prompt for reproducibility
      - model_version: exact model ID used (Claude Sonnet 4.6, Llama 3.1-70B, etc.)
      - latency_ms: time taken
      - confidence: agent's self-reported confidence (logged but NOT used for HITL decisions)
    """
    event = {
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "@timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "event.kind": "event",
        "event.category": ["intrusion_detection", "agent_decision"],
        "agent.name": agent,
        "agent.decision": decision,
        "agent.prompt_hash": prompt_hash,
        "agent.model_version": model_version,
        "agent.tool_args": tool_args,
        "agent.latency_ms": latency_ms,
        "agent.confidence": confidence,
        "user.id": user_id,
    }
    print(json.dumps(event))  # production: ship to SIEM
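
The prompt_hash field is described as a SHA-256 of the prompt. A minimal sketch of computing it, assuming the prompt is hashed as UTF-8 text exactly as sent to the model (the helper name compute_prompt_hash is hypothetical):

```python
import hashlib


def compute_prompt_hash(prompt: str) -> str:
    """Return the hex SHA-256 digest of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


digest = compute_prompt_hash("Classify the severity of alert A-1.")
same_digest = compute_prompt_hash("Classify the severity of alert A-1.")
other_digest = compute_prompt_hash("Classify the severity of alert A-1. ")  # trailing space
```

Any change to the prompt, including whitespace, changes the digest, so hash the final rendered prompt rather than the template if the goal is to reproduce a specific decision.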

Production audit-schema references

Two industry-standard schemas the SOC should adopt:

For production deployments, choose one schema and stick with it. The Codex-generated workflow emits a simplified version; in production, swap for the OCSF schema your SIEM ingests.


Production case studies

The 2025-2026 production-deployment landscape for agentic SOCs is still maturing. A few documented patterns:

Instructor note: the specific public case studies in this space are still emerging in 2025-2026; verify particular vendor / org claims at delivery time. The architectural patterns (gated execution, immutable audit, confidence escalation) are stable; the named examples may shift.


The “rule of two” applied

From Day 3 Module 3.6: any agent should satisfy a maximum of two legs of the lethal trifecta (private data + untrusted content + external communication).

Applied to your own SOC agents:

The Codex multi-agent workflow implements this decomposition: TriageAgent reads alerts but cannot externally communicate; EnrichmentAgent calls external services but processes only sanitized fields; ResponseAgent acts but receives only the triage+enrichment summary, never raw adversary-controlled content.
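
The rule can be checked mechanically at design time. In this sketch the capability flags assigned to each agent are illustrative assumptions read off the description above, not values taken from the workflow's code, and MonolithAgent is a hypothetical counter-example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    name: str
    private_data: bool       # can read private or sensitive data
    untrusted_content: bool  # processes adversary-controllable content
    external_comms: bool     # can communicate or act outside the trust boundary

    def legs(self) -> int:
        """Number of lethal-trifecta legs this agent holds."""
        return int(self.private_data) + int(self.untrusted_content) + int(self.external_comms)


def rule_of_two_violations(agents: list) -> list:
    """Return the names of agents holding all three legs."""
    return [agent.name for agent in agents if agent.legs() > 2]


agents = [
    AgentCapabilities("TriageAgent", private_data=True, untrusted_content=True, external_comms=False),
    AgentCapabilities("EnrichmentAgent", private_data=False, untrusted_content=False, external_comms=True),
    AgentCapabilities("ResponseAgent", private_data=True, untrusted_content=False, external_comms=True),
    AgentCapabilities("MonolithAgent", private_data=True, untrusted_content=True, external_comms=True),
]
violations = rule_of_two_violations(agents)
```

Running a check like this in CI against each agent's declared tools catches a permission added later that quietly gives one agent the third leg.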


Discussion questions (~10 min)

  1. Your SOC wants to auto-execute “block this IOC across all firewalls” without HITL because the action is reversible. The action-criticality matrix puts firewall rule changes in the HITL-required tier. Is auto-execution defensible because it’s reversible? Walk through the failure modes.
  2. The Codex multi-agent workflow uses interrupt() before any action above the auto tier. What’s the worst-case latency the SOC should plan for — and how does that compare to manual analyst response time pre-agent?
  3. The “rule of two” decomposition splits the agent into three sub-agents. What new attack surface does this create (e.g., adversary manipulating one sub-agent’s output to influence another)? Identify a mitigation for each.

Common mistakes

| Mistake | Better approach |
| --- | --- |
| Building one big agent with all permissions | Decompose by rule-of-two; smaller agents each break a leg of the trifecta |
| Gating HITL on model confidence score | Action-criticality matrix; never trust model self-reported certainty |
| Skipping the audit log because it’s “verbose” | The audit log is the SIEM event stream for agent decisions (non-optional) |
| Using the same model for generator and evaluator | Collusion bias; use different models or significantly different prompts |
| Treating LangGraph interrupt() as a Python input() | The semantic guarantees (durable, idempotent, JSON-safe) are what make it production-grade |

What’s next

Module 4.4 covers AI supply-chain compromise — the LiteLLM/Mercor March 2026 case in technical detail, plus the JFrog Hugging Face 2024 disclosure and the PyTorch torchtriton 2022 incident. The defender’s discipline: model SBOM, provenance pinning, picklescan and safetensors-scan.