Module 1.2 — The Detector’s AI Deployment Decision

50-minute lecture · Day 1 morning

Learning objectives

By the end of this module, students can:

  1. Build a defensible decision matrix for cloud-API vs on-prem open-weight LLM deployment for a given detection-engineering workload
  2. Compute total cost of inference at realistic SOC volume (millions of triage events per month) for Claude Sonnet 4.6, GPT-5.4, Gemini 2.5 Flash, and on-prem Llama/Qwen-class options
  3. Identify which workloads must stay on-prem under GDPR, CJIS, ITAR, and DoD Impact-Level 4/5/6 regimes — and which workloads are now legally fine on cloud given the 2025 FedRAMP wave
  4. Justify a hybrid architecture where the same RAG pipeline routes some queries to cloud and others to on-prem based on data sensitivity

The decision is rarely binary

When detection engineers first encounter the cloud-vs-on-prem question, the instinct is to pick one. That’s usually wrong. The 2026 landscape rewards hybrid deployments where the same retrieval pipeline routes queries through different model backends based on the sensitivity of the data being processed.

In practice, a competent SOC AI stack in mid-2026 runs:

The detection engineer’s job is to draw the routing rules — not to evangelize one backend.


Current cloud pricing snapshot (verify the week of delivery)

The model and pricing landscape moves faster than slide decks. Instructors should verify these against vendor pages the morning of delivery. These figures are from May 2026:

Vendor | Model | Input $/Mtok | Output $/Mtok | Source
Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | platform.claude.com/docs/en/about-claude/pricing
OpenAI | GPT-5.4 | $2.50 | $15.00 | openai.com/api/pricing
OpenAI | GPT-5.5 | $5.00 | $30.00 | openai.com/api/pricing
Google | Gemini 2.5 Pro | $1.25 (≤200k ctx) / $2.50 (>200k) | $10.00 / $15.00 | ai.google.dev/pricing
Google | Gemini 2.5 Flash | $0.15 | $0.60 | ai.google.dev/pricing

Critical observation: Gemini 2.5 Flash at $0.15/$0.60 is roughly 20x cheaper than Claude Sonnet 4.6 on input tokens and 25x cheaper on output tokens, for similar latency budgets on triage-class workloads. For a SOC ingesting 10M alert candidates/month at avg 2k tokens in / 500 tokens out per triage, the math:
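The cloud side of that arithmetic can be sketched in a few lines. This is a minimal sketch that uses only the per-Mtok prices from the snapshot table and the stated volume (10M events, 2k tokens in, 500 tokens out); the function and variable names are illustrative. It does not cover the on-prem side, because on-prem cost depends on hardware, power, and staffing figures that have to come from your own environment.

```python
# Prices in USD per million tokens (input, output), copied from the snapshot table.
PRICES = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-5.4": (2.50, 15.00),
    "Gemini 2.5 Flash": (0.15, 0.60),
}


def monthly_cost(events, tokens_in, tokens_out, price_in, price_out):
    """Monthly USD cost for `events` triage calls at the given per-Mtok prices."""
    mtok_in = events * tokens_in / 1_000_000    # total input volume in Mtok
    mtok_out = events * tokens_out / 1_000_000  # total output volume in Mtok
    return mtok_in * price_in + mtok_out * price_out


# 10M alert candidates/month, 2k tokens in and 500 tokens out per triage.
costs = {
    model: monthly_cost(10_000_000, 2_000, 500, p_in, p_out)
    for model, (p_in, p_out) in PRICES.items()
}

for model, usd in costs.items():
    print(f"{model:<20} ${usd:>10,.0f}/month")
```

Under these assumptions the monthly totals come to $135,000 for Claude Sonnet 4.6, $125,000 for GPT-5.4, and $6,000 for Gemini 2.5 Flash, a blended gap of about 22.5x between Sonnet 4.6 and Flash. Retries, caching, and batch discounts are not modeled.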

An on-prem Llama 70B deployment is dramatically cheaper at scale, but it introduces capacity ceilings and ops burden, and its zero-shot accuracy trails the frontier cloud models. The right answer depends entirely on workload mix.


Open-weight options worth knowing in May 2026

The open-weight ecosystem in 2026 is genuinely competitive with cloud for SOC-class workloads. Names a detection engineer should know:

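For SOC detection workloads, the sweet spot is usually a 20-70B parameter model running on a single A100/H100 (quantized at the upper end of that range) or, on AWS, a g5.4xlarge or p4d instance. Note that a g5.4xlarge carries a single 24 GB A10G, so it only fits the quantized low end of the range. Below 20B, accuracy degrades noticeably on adversary-content classification. Above 70B, latency starts hurting triage throughput without proportional accuracy gains.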


Regulatory snapshot (May 2026)

The 2025 wave of FedRAMP and DoD authorizations changed the regulatory math significantly. As of May 2026:

FedRAMP High and DoD Impact Level:

Practical implication for the detection engineer: A federal SOC operating at IL5 or below now has multiple cleared cloud LLM options. The “we can’t use cloud LLMs because of compliance” argument is largely obsolete for federal civilian and most DoD work below IL6.

Still requires on-prem (or near-on-prem):


The four-axis decision matrix

For each detection workload, the engineer should evaluate against these four axes:

Axis | When to choose cloud | When to choose on-prem
Data sensitivity | Sanitized alerts; public threat intel; non-PII telemetry | Raw logs containing PII, customer content, classified, ITAR, IP
Volume | Bursty workloads; <10M tokens/day | Sustained workloads; >100M tokens/day (per-token cloud fees dominate TCO)
Latency budget | P95 ≥ 2 seconds acceptable (typical for triage) | P95 < 200 ms required (real-time inline filtering)
Accuracy ceiling | High-reasoning workloads, novel attacks, escalation triage | Bulk first-pass triage on known-pattern alerts

A workload that scores “cloud” on all four axes is a clear cloud-API workload. A workload that scores “on-prem” on all four is a clear on-prem workload. The interesting cases score mixed, and that’s where hybrid routing earns its complexity.
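The matrix can be turned into a small scoring helper. What follows is a sketch, not a policy engine: the `Workload` fields and function names are hypothetical, the thresholds are the ones in the table, and values that fall between the two columns (for example 10M to 100M tokens/day) are scored "either" as an assumption the table does not state.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    has_sensitive_data: bool    # raw PII, customer content, classified, ITAR, IP
    tokens_per_day: int
    p95_latency_budget_ms: int
    needs_high_reasoning: bool  # novel attacks, escalation triage


def score_axes(w: Workload) -> dict[str, str]:
    """Return 'cloud', 'on-prem', or 'either' for each of the four axes."""
    scores = {"data_sensitivity": "on-prem" if w.has_sensitive_data else "cloud"}

    if w.tokens_per_day < 10_000_000:
        scores["volume"] = "cloud"
    elif w.tokens_per_day > 100_000_000:
        scores["volume"] = "on-prem"
    else:
        scores["volume"] = "either"  # assumption: the table leaves this band open

    if w.p95_latency_budget_ms < 200:
        scores["latency"] = "on-prem"
    elif w.p95_latency_budget_ms >= 2000:
        scores["latency"] = "cloud"
    else:
        scores["latency"] = "either"  # assumption: the table leaves this band open

    scores["accuracy"] = "cloud" if w.needs_high_reasoning else "on-prem"
    return scores


def recommend(w: Workload) -> str:
    """All axes agree -> that backend; any disagreement -> hybrid routing."""
    votes = set(score_axes(w).values()) - {"either"}
    if votes == {"cloud"}:
        return "cloud"
    if votes == {"on-prem"}:
        return "on-prem"
    return "hybrid"


print(recommend(Workload("phishing escalation", False, 5_000_000, 5000, True)))
print(recommend(Workload("inline DNS filter", True, 500_000_000, 100, False)))
print(recommend(Workload("raw-log escalation", True, 5_000_000, 3000, True)))
```

The third workload is the mixed case: it wants frontier-model reasoning but carries sensitive data, so it lands on hybrid routing rather than on either pure option.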


The reference architecture detection engineers should know is:

SIEM event
   ↓

[1] Edge classifier (open-weight 7-8B on-prem)
   ↓ confidence > 0.95  →  auto-triage, store decision
   ↓ confidence ≤ 0.95

[2] Mid-tier reasoner (open-weight 32-70B on-prem)
   ↓ if PII/classified content present → finalize here, never leave boundary
   ↓ if no sensitive content

[3] Frontier cloud model (Claude Sonnet 4.6 / GPT-5.4 / Gemini 2.5 Pro)
   ↓ for high-reasoning escalation, attribution, novel-attack reasoning

Triage decision + audit trail

The key design decisions for the detection engineer:

  1. The PII/classified gate at Tier 2 is non-negotiable. Cloud APIs must never see raw sensitive content. Use a deterministic classifier (regex, NER, classification model) before the routing decision is made; never trust the reasoning LLM to enforce its own data-handling policy.
  2. Audit every Tier 3 cloud call. Log prompt-hash, model-version, response-hash, token-counts, and the decision the SOC took on the response. This is your evidence trail if a compliance officer asks “did any sensitive content leave the boundary?”
  3. Cache aggressively. Triage on alerts that have been seen before (same hash, same enrichment context) should hit a result cache, never re-call any model.
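Points 2 and 3 can be sketched with the standard library alone. The helper names, the in-memory dict, and the `timestamp` field are assumptions made for illustration; a production system would write audit records to an append-only store and use a shared cache with expiry.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def alert_fingerprint(alert: dict, enrichment: dict) -> str:
    """Stable key: the same alert plus the same enrichment context gives the same hash."""
    canonical = json.dumps(
        {"alert": alert, "enrichment": enrichment}, sort_keys=True, default=str
    )
    return sha256_hex(canonical)


def audit_record(prompt, response, model_version, tokens_in, tokens_out, soc_decision):
    """Audit entry for one Tier 3 call. Stores hashes, never the raw text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_hash": sha256_hex(prompt),
        "response_hash": sha256_hex(response),
        "model_version": model_version,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "soc_decision": soc_decision,
    }


# Result cache: fingerprint -> triage decision (in-memory for the sketch only).
_result_cache: dict[str, dict] = {}


def cached_triage(alert: dict, enrichment: dict, triage_fn) -> dict:
    """Return the cached decision if this alert was seen before; else triage once."""
    key = alert_fingerprint(alert, enrichment)
    if key not in _result_cache:
        _result_cache[key] = triage_fn(alert)
    return _result_cache[key]
```

Serializing with `sort_keys=True` makes the fingerprint independent of field order, so two copies of the same alert with reordered keys still hit the cache. Changing the enrichment context changes the key, which forces a fresh triage.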

Code: a minimal routing decision in Python

"""
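Minimal triage router demonstrating the three-tier pattern.
Production version should add audit logging, retries, circuit breakers.
"""
import json
from typing import Literal

import requests            # for on-prem Ollama
from openai import OpenAI  # for cloud; reads OPENAI_API_KEY from the environment

# (endpoint, model tag) per on-prem tier; hostnames and tags are placeholders
ON_PREM_FAST = ("http://ollama-fast.internal:11434/api/chat", "llama3.1:8b")
ON_PREM_REASONER = ("http://ollama-reasoner.internal:11434/api/chat", "llama4:scout")
CLOUD = OpenAI(base_url="https://api.openai.com/v1")
CLOUD_MODEL = "gpt-5.4"  # must be a model served by CLOUD; verify model id at delivery

# Every tier is asked for the same JSON shape so triage() can compare confidences.
PROMPT = ('Triage this security alert. Reply with JSON only, with keys "verdict" '
          '(string) and "confidence" (number from 0 to 1).\n\n')

def classify_sensitivity(alert: dict) -> Literal["public", "pii", "classified"]:
    """
    Deterministic content-sensitivity classifier.
    NEVER let the reasoning LLM make this decision.
    """
    fields = " ".join(str(v) for v in alert.values())
    # In production, use a proper NER + classification model, not regex
    if any(marker in fields for marker in ["TS//", "SECRET//", "CONFIDENTIAL//"]):
        return "classified"
    if any(marker in fields for marker in ["SSN", "passport", "DOB", "@", "CardNumber"]):
        return "pii"
    return "public"

def call_ollama(backend: tuple[str, str], alert: dict) -> dict:
    """POST the alert to an Ollama /api/chat endpoint and parse the JSON verdict."""
    url, model = backend
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT + json.dumps(alert, default=str)}],
        "stream": False,   # one JSON body instead of a token stream
        "format": "json",  # ask Ollama to constrain output to valid JSON
    }
    resp = requests.post(url, json=payload, timeout=30)
    resp.raise_for_status()
    return json.loads(resp.json()["message"]["content"])

def serialize_for_cloud(alert: dict) -> str:
    """Build the cloud prompt. Only reached for alerts classified as public."""
    return PROMPT + json.dumps(alert, default=str)

def parse_cloud_response(completion) -> dict:
    """Extract the JSON verdict from a chat-completions response."""
    return json.loads(completion.choices[0].message.content)

def triage(alert: dict) -> dict:
    sensitivity = classify_sensitivity(alert)

    # Tier 1: edge classifier (always on-prem)
    fast_result = call_ollama(ON_PREM_FAST, alert)
    if fast_result["confidence"] > 0.95:
        return {"tier": 1, **fast_result}

    # Tier 2: on-prem reasoner
    reasoner_result = call_ollama(ON_PREM_REASONER, alert)

    # Tier 3: cloud reasoner, ONLY if content is non-sensitive
    if sensitivity == "public" and reasoner_result["confidence"] < 0.85:
        cloud_result = CLOUD.chat.completions.create(
            model=CLOUD_MODEL,
            messages=[{"role": "user", "content": serialize_for_cloud(alert)}],
            response_format={"type": "json_object"},
        )
        return {"tier": 3, **parse_cloud_response(cloud_result)}

    return {"tier": 2, **reasoner_result}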

The production version of this is much more elaborate (audit logging, circuit breakers, cache, schema validation), but the routing logic is exactly this pattern. Day 4 of this course builds the full version with HITL gates layered on top.


Discussion questions (~10 min)

  1. A federal CJIS workload now has Azure OpenAI authorized at IL5. Does that mean the workload can use it? What additional checks (beyond FedRAMP authorization) does the engineering team need to perform?
  2. Gemini 2.5 Flash is ~20x cheaper than Claude Sonnet 4.6 for triage. Why might a SOC still choose Claude for the same workload?
  3. Your org’s CISO says “no customer email content goes to cloud LLMs, ever.” A user reports a phishing email with a malicious attachment. The defender’s RAG corpus contains MITRE ATT&CK technique descriptions (public) but the alert payload contains the email body (customer content). How do you architect the routing so the LLM can still reason about the email without sending it to cloud?

Common mistakes

Mistake | Better approach
Picking one backend and standardizing | Hybrid is the production default in 2026
Letting the reasoning LLM enforce data-handling policy | Deterministic pre-filter outside the LLM
Caching at the model level instead of the result level | Cache the decision keyed on alert fingerprint, not the raw model output
Ignoring the cost of cloud retries on 429 errors | Cloud retry budgets matter: set explicit per-incident caps
Treating “FedRAMP High” as “compliant for everything” | FedRAMP High ≠ ITAR ≠ CJIS personnel screening ≠ NIS2 ≠ HIPAA. Check each.
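A per-incident retry cap for 429 responses can be sketched as follows. `RetryBudget`, `RateLimited`, and `call_with_budget` are hypothetical names, and `RateLimited` stands in for whatever exception your cloud client raises on HTTP 429. The exponential backoff is an assumption about a sensible default, not something this module prescribes.

```python
import time


class RateLimited(Exception):
    """Stand-in for the exception a cloud client raises on HTTP 429."""


class RetryBudget:
    """Caps the total number of cloud retries charged to a single incident."""

    def __init__(self, max_retries: int):
        self.remaining = max_retries

    def try_spend(self) -> bool:
        """Consume one retry if any are left; return False when exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_budget(call, budget: RetryBudget, base_delay_s=1.0, sleep=time.sleep):
    """Run `call`, retrying on RateLimited until the incident's budget runs out."""
    attempt = 0
    while True:
        try:
            return call()
        except RateLimited:
            if not budget.try_spend():
                raise  # budget exhausted: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)  # exponential backoff
            attempt += 1
```

Sharing one `RetryBudget` instance across every cloud call made for the same incident is what makes the cap per-incident rather than per-call. When the budget is exhausted, the caller can fall back to the Tier 2 on-prem result.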

What’s next

Module 1.3 covers embeddings — the highest-ROI primitive the detection engineer has in 2026, and what most SOCs reach for too late. Embeddings drive both the routing logic above (sensitivity classifiers, near-duplicate alert detection) and the RAG pipeline we build in Module 1.4.