Module 1.2 — The Detector’s AI Deployment Decision

50-minute lecture · Day 1 morning

Learning objectives

By the end of this module, students can:

Build a defensible decision matrix for cloud-API vs on-prem open-weight LLM deployment for a given detection-engineering workload
Compute total cost of inference at realistic SOC volume (millions of triage events per month) for Claude Sonnet 4.6, GPT-5.4, Gemini 2.5 Flash, and on-prem Llama/Qwen-class options
Identify which workloads must stay on-prem under GDPR, CJIS, ITAR, and DoD Impact-Level 4/5/6 regimes — and which workloads are now legally fine on cloud given the 2025 FedRAMP wave
Justify a hybrid architecture where the same RAG pipeline routes some queries to cloud and others to on-prem based on data sensitivity

The decision is rarely binary

When detection engineers first encounter the cloud-vs-on-prem question, the instinct is to pick one. That’s usually wrong. The 2026 landscape rewards hybrid deployments where the same retrieval pipeline routes queries through different model backends based on the sensitivity of the data being processed.

In practice, a competent SOC AI stack in mid-2026 runs:

A frontier cloud model (Claude Sonnet 4.6, GPT-5.4, or Gemini 2.5 Pro) for high-reasoning workloads where the data has been sanitized or is already non-sensitive
A medium-sized open-weight model on-prem (Llama-class 70B, Qwen-class 32B, or DeepSeek-V3 variants) for high-volume bulk triage where every query touches potentially-sensitive log content
A small open-weight model (Llama-class 8B, Phi-3, Qwen 3 8B) running on edge nodes for fastest-path triage at the SIEM ingest layer

The detection engineer’s job is to draw the routing rules — not to evangelize one backend.

Current cloud pricing snapshot (verify the week of delivery)

The model and pricing landscape moves faster than slide decks. Instructors should verify these against vendor pages the morning of delivery. These figures are from May 2026:

Vendor	Model	Input $/Mtok	Output $/Mtok	Source
Anthropic	Claude Sonnet 4.6	$3.00	$15.00	platform.claude.com/docs/en/about-claude/pricing
OpenAI	GPT-5.4	$2.50	$15.00	openai.com/api/pricing
OpenAI	GPT-5.5	$5.00	$30.00	openai.com/api/pricing
Google	Gemini 2.5 Pro	$1.25 (≤200k ctx) / $2.50 (>200k)	$10.00 / $15.00	ai.google.dev/pricing
Google	Gemini 2.5 Flash	$0.15	$0.60	ai.google.dev/pricing

Critical observation: Gemini 2.5 Flash at $0.15/$0.60 is 20x cheaper than Claude Sonnet 4.6 on input tokens and 25x cheaper on output tokens, for similar latency budgets on triage-class workloads. For a SOC ingesting 10M alert candidates/month at an average of 2k tokens in / 500 tokens out per triage, the math is:

Claude Sonnet 4.6: 10M × (2k × $3 / 1M + 0.5k × $15 / 1M) = 10M × ($0.006 + $0.0075) = ~$135,000/month
Gemini 2.5 Flash: 10M × (2k × $0.15 / 1M + 0.5k × $0.60 / 1M) = 10M × ($0.0003 + $0.0003) = ~$6,000/month
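Self-hosted open-weight model on a single g5.2xlarge ($1.21/hr on-demand × 720 hr): ~$870/month for hardware (capacity-bound, not token-bound). Caveat: a g5.2xlarge carries one 24 GB A10G GPU, which holds an 8B-class model but not Llama 70B, even at 4-bit quantization (~35 GB of weights). A 70B deployment needs a larger multi-GPU instance at several times this price, and any sustained-throughput figure (such as ~50 RPS) must be benchmarked on the actual hardware.

The self-hosted option is dramatically cheaper at scale, even on larger GPU instances, but introduces capacity ceilings, ops burden, and worse zero-shot accuracy than the frontier cloud models. The right answer depends entirely on workload mix.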
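The arithmetic above can be reproduced with a short script, which makes it easy to re-run when prices change. This is a minimal sketch: the function names are illustrative, the prices are the May 2026 figures from the table, and the 720-hour month is an assumption chosen to match the ~$870 hardware figure.

```python
def api_monthly_cost(events, tokens_in, tokens_out, usd_in_per_mtok, usd_out_per_mtok):
    """Monthly API spend in USD for a model priced per million tokens."""
    input_cost = events * tokens_in * usd_in_per_mtok / 1_000_000
    output_cost = events * tokens_out * usd_out_per_mtok / 1_000_000
    return round(input_cost + output_cost, 2)


def instance_monthly_cost(usd_per_hour, hours_per_month=720):
    """Monthly spend for an always-on instance; does not vary with token volume."""
    return round(usd_per_hour * hours_per_month, 2)


# Workload from the text: 10M triage events/month, 2k tokens in, 500 tokens out.
EVENTS, TOK_IN, TOK_OUT = 10_000_000, 2_000, 500

sonnet = api_monthly_cost(EVENTS, TOK_IN, TOK_OUT, 3.00, 15.00)
flash = api_monthly_cost(EVENTS, TOK_IN, TOK_OUT, 0.15, 0.60)
gpu = instance_monthly_cost(1.21)  # assumes 720 billed hours

print(f"Claude Sonnet 4.6:   ${sonnet:,.0f}/month")
print(f"Gemini 2.5 Flash:    ${flash:,.0f}/month")
print(f"Single GPU instance: ${gpu:,.0f}/month")
print(f"Sonnet / Flash ratio: {sonnet / flash:.1f}x")
```

The blended ratio lands between the 20x input and 25x output price gaps because this workload is input-heavy. Changing the token mix shifts it toward one end or the other.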

Open-weight options worth knowing in May 2026

The open-weight ecosystem in 2026 is genuinely competitive with cloud for SOC-class workloads. Names a detection engineer should know:

Llama 4 family (Meta): Llama 4 Maverick (~400B total parameters, ~17B active, MoE) and Llama 4 Scout (~109B total, ~17B active, MoE). Released under the Llama 4 Community License, not Apache 2.0, so review its use restrictions before deployment. A strong general-purpose reasoning option in the open-weight tier as of May 2026.
Qwen 3 family (Alibaba): Qwen 3 Coder 32B, Qwen 3 235B. Apache 2.0. Notable: the earlier-generation Qwen2.5-Coder-32B-Instruct is the model that APT28’s PROMPTSTEAL malware queries from Hugging Face (see Module 1.1), so the defender can run the same model lineage the attacker uses.
DeepSeek V3: High parameter count (~671B MoE), MIT-licensed. Strong on reasoning benchmarks.
Mistral Small 3 and Mistral Medium 3: Apache 2.0. Smaller footprint; common choice for edge deployment.
gpt-oss (OpenAI): gpt-oss-120b and gpt-oss-20b, released as open weights under Apache 2.0.

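For SOC detection workloads, the sweet spot is usually a 20-70B parameter model running on a single 80 GB A100/H100 (quantized toward the 70B end) or a comparable cloud GPU instance. Size the instance to the model: a g5.4xlarge has one 24 GB A10G and fits only the lower end of that range when quantized, while a p4d provides eight A100s. Below 20B, accuracy degrades noticeably on adversary-content classification. Above 70B, latency starts hurting triage throughput without proportional accuracy gains.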
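Whether a given model fits a given GPU is mostly arithmetic on weight memory: parameters times bits per weight. The sketch below is a rough sizing aid, not a capacity plan. The function names are illustrative, and the 20% headroom reserved for KV cache and activations is an assumed rule of thumb that varies with context length and batch size.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, gpu_memory_gb, headroom=0.2):
    """True if the weights fit while leaving `headroom` of GPU memory free.

    The headroom stands in for KV cache and activations (assumed value).
    """
    usable_gb = gpu_memory_gb * (1 - headroom)
    return weight_memory_gb(params_billion, bits_per_weight) <= usable_gb


# 24 GB matches an A10G (g5 instances); 80 GB matches the larger A100/H100 parts.
for params in (8, 32, 70):
    for bits in (16, 8, 4):
        need = weight_memory_gb(params, bits)
        print(
            f"{params}B @ {bits}-bit: {need:.0f} GB weights | "
            f"24 GB GPU: {fits(params, bits, 24)} | 80 GB GPU: {fits(params, bits, 80)}"
        )
```

Under these assumptions a 70B model needs quantization to fit a single 80 GB GPU, and a 24 GB GPU tops out around a 4-bit 32B model.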

Regulatory snapshot (May 2026)

The 2025 wave of FedRAMP and DoD authorizations changed the regulatory math significantly. As of May 2026:

FedRAMP High and DoD Impact Level:

Claude (Anthropic): FedRAMP High via AWS GovCloud (Apr 2025) and Google Cloud (Jun 2025). DoD IL4 and IL5 approved within Amazon Bedrock.
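Azure OpenAI: FedRAMP High certification finalized Dec 2025, covering GPT-4o, the o1 series, and (per Microsoft’s announcement) the full GenAI suite. Azure OpenAI is also authorized for IL6, which covers classified data up to Secret. Top Secret workloads sit outside the DoD Impact Level scale and require separate Intelligence Community authorization; Microsoft describes Azure OpenAI as the first major commercial LLM service cleared across all US Government data classification levels.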
AWS Bedrock: FedRAMP High and DoD IL5, hosting both Anthropic and Meta models.
Google Gemini: FedRAMP High authorized March 2025.

Practical implication for the detection engineer: A federal SOC operating at IL5 or below now has multiple cleared cloud LLM options. The “we can’t use cloud LLMs because of compliance” argument is largely obsolete for federal civilian and most DoD work below IL6.

Still requires on-prem (or near-on-prem):

ITAR-controlled environments — Defense contractors handling export-controlled technical data still face restrictions even with FedRAMP-authorized services, because ITAR compliance hinges on personnel access controls and data residency in ways FedRAMP doesn’t fully address.
GDPR-strict workloads — EU SOCs handling personal data under Article 9 (special categories) often choose on-prem to avoid the legal complexity of cross-border processing agreements, even when cloud LLMs have GDPR-compatible regions.
CJIS for law enforcement — Some FBI CJIS Security Policy requirements (specifically around personnel screening for anyone with access to CJI) are still easier to meet on-prem than via cloud vendor staff.
Intelligence Community (IC) workloads above the level of commercial-cloud authorization, particularly anything that touches SCI.

The four-axis decision matrix

For each detection workload, the engineer should evaluate against these four axes:

Axis	When to choose cloud	When to choose on-prem
Data sensitivity	Sanitized alerts; public threat intel; non-PII telemetry	Raw logs containing PII, customer content, classified, ITAR, IP
Volume	Bursty workloads; <10M tokens/day	Sustained workloads; >100M tokens/day (per-token cloud fees dominate TCO)
Latency budget	P95 ≥ 2 seconds acceptable (typical for triage)	P95 < 200 ms required (real-time inline filtering)
Accuracy ceiling	High-reasoning workloads, novel attacks, escalation triage	Bulk first-pass triage on known-pattern alerts

A workload that scores “cloud” on all four axes is a clear cloud-API workload. A workload that scores “on-prem” on all four is a clear on-prem workload. The interesting cases score mixed, and that’s where hybrid routing earns its complexity.
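The matrix can be turned into a first-pass scoring function. The following is a sketch under stated assumptions: the `Workload` fields and function names are hypothetical, the numeric thresholds come from the table, and values that fall between the table's thresholds are scored "mixed" (the table does not say how to treat them).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    sensitive_data: bool        # raw logs with PII, customer content, classified, ITAR, IP
    tokens_per_day: int
    p95_budget_ms: int
    needs_high_reasoning: bool  # novel attacks, escalation triage


def score_axes(w: Workload) -> dict:
    """Score each axis as "cloud", "on-prem", or "mixed" using the table's thresholds."""
    if w.tokens_per_day < 10_000_000:
        volume = "cloud"
    elif w.tokens_per_day > 100_000_000:
        volume = "on-prem"
    else:
        volume = "mixed"  # between thresholds: assumed, not defined by the table

    if w.p95_budget_ms >= 2000:
        latency = "cloud"
    elif w.p95_budget_ms < 200:
        latency = "on-prem"
    else:
        latency = "mixed"

    return {
        "data_sensitivity": "on-prem" if w.sensitive_data else "cloud",
        "volume": volume,
        "latency": latency,
        "accuracy": "cloud" if w.needs_high_reasoning else "on-prem",
    }


def recommend(w: Workload) -> str:
    """Unanimous axes give a clear answer; anything else is a hybrid-routing candidate."""
    votes = set(score_axes(w).values())
    if votes == {"cloud"}:
        return "cloud"
    if votes == {"on-prem"}:
        return "on-prem"
    return "hybrid"


escalation = Workload(False, 5_000_000, 3000, True)
bulk_triage = Workload(True, 500_000_000, 100, False)
sensitive_escalation = Workload(True, 5_000_000, 3000, True)

print(recommend(escalation))            # cloud
print(recommend(bulk_triage))           # on-prem
print(recommend(sensitive_escalation))  # hybrid
```

Treat the output as a starting point for discussion. In practice the data-sensitivity axis acts as a hard constraint rather than one vote among four.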

Hybrid architecture pattern (the recommended default)

The reference architecture detection engineers should know is:

SIEM event
   ↓
[1] Edge classifier (open-weight 7-8B on-prem)
   ↓ confidence > 0.95  →  auto-triage, store decision
   ↓ confidence ≤ 0.95
   ↓
[2] Mid-tier reasoner (open-weight 32-70B on-prem)
   ↓ if PII/classified content present → finalize here, never leave boundary
   ↓ if no sensitive content
   ↓
[3] Frontier cloud model (Claude Sonnet 4.6 / GPT-5.4 / Gemini 2.5 Pro)
   ↓ for high-reasoning escalation, attribution, novel-attack reasoning
   ↓
Triage decision + audit trail

The key design decisions for the detection engineer:

The PII/classified gate at Tier 2 is non-negotiable. Cloud APIs must never see raw sensitive content. Use a deterministic classifier (regex, NER, classification model) before the routing decision is made, and never trust the reasoning LLM to enforce its own data-handling policy.
Audit every Tier 3 cloud call. Log prompt-hash, model-version, response-hash, token-counts, and the decision the SOC took on the response. This is your evidence trail if a compliance officer asks “did any sensitive content leave the boundary?”
Cache aggressively. Triage on alerts that have been seen before (same hash, same enrichment context) should hit a result cache, never re-call any model.
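The audit requirement above can be sketched as a small record builder. This is one possible shape, not a prescribed schema: the field names, the JSON Lines file, and the function names are all assumptions. It stores SHA-256 hashes rather than raw text, so the audit log itself never holds prompt or response content.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    """Hex SHA-256 of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(prompt, response_text, model_version,
                       input_tokens, output_tokens, soc_decision):
    """Build one audit entry for a Tier 3 cloud call (hypothetical field names)."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "prompt_sha256": sha256_hex(prompt),
        "response_sha256": sha256_hex(response_text),
        "model_version": model_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "soc_decision": soc_decision,
    }


def append_audit(record, path="tier3_audit.jsonl"):
    """Append the record as one JSON line; production would use write-once storage."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


record = build_audit_record(
    prompt="Summarize this sanitized alert: ...",
    response_text='{"verdict": "benign", "confidence": 0.91}',
    model_version="example-model-id",  # placeholder, not a real model id
    input_tokens=1850,
    output_tokens=42,
    soc_decision="closed_as_benign",
)
print(json.dumps(record, indent=2))
```

Hashing lets an auditor later prove that a retained prompt matches what was sent, without the audit log becoming a second copy of the data.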

Code: a minimal routing decision in Python

"""
Minimal triage router demonstrating the three-tier pattern.
Production version should add audit logging, retries, circuit breakers.
"""
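import json
from typing import Literal

from openai import OpenAI  # for cloud
import requests           # for on-prem ollama

ON_PREM_FAST = "http://ollama-fast.internal:11434/api/chat"
ON_PREM_REASONER = "http://ollama-reasoner.internal:11434/api/chat"
# Ollama model tag served at each endpoint (assumed tags; match your deployment)
OLLAMA_MODELS = {ON_PREM_FAST: "llama3.1:8b", ON_PREM_REASONER: "llama4:scout"}
CLOUD = OpenAI(base_url="https://api.openai.com/v1")  # reads OPENAI_API_KEY from the environment
CLOUD_MODEL = "gpt-5.4"  # must be a model this endpoint serves; verify model id at delivery
PROMPT = 'Triage this alert. Reply only with JSON: {"verdict": "<label>", "confidence": <0.0-1.0>}\n'

def classify_sensitivity(alert: dict) -> Literal["public", "pii", "classified"]:
    """
    Deterministic content-sensitivity classifier.
    NEVER let the reasoning LLM make this decision.
    """
    fields = " ".join(str(v) for v in alert.values())
    # In production, use a proper NER + classification model, not substring matching
    if any(marker in fields for marker in ["TS//", "SECRET//", "CONFIDENTIAL//"]):
        return "classified"
    if any(marker in fields for marker in ["SSN", "passport", "DOB", "@", "CardNumber"]):
        return "pii"
    return "public"

def call_ollama(url: str, alert: dict) -> dict:
    """POST the alert to an Ollama /api/chat endpoint and parse its JSON verdict."""
    resp = requests.post(
        url,
        json={
            "model": OLLAMA_MODELS[url],
            "messages": [{"role": "user", "content": PROMPT + json.dumps(alert, default=str)}],
            "format": "json",  # constrain the reply to valid JSON
            "stream": False,   # one response object instead of a token stream
        },
        timeout=30,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["message"]["content"])

def serialize_for_cloud(alert: dict) -> str:
    """Build the cloud prompt. Only reached for alerts classified "public"."""
    return PROMPT + json.dumps(alert, default=str)

def parse_cloud_response(response) -> dict:
    """Extract the JSON verdict from a chat-completions response."""
    return json.loads(response.choices[0].message.content)

def triage(alert: dict) -> dict:
    # The sensitivity gate runs first and is never delegated to a model
    sensitivity = classify_sensitivity(alert)

    # Tier 1: edge classifier (always on-prem)
    fast_result = call_ollama(ON_PREM_FAST, alert)
    if fast_result["confidence"] > 0.95:
        return {"tier": 1, **fast_result}

    # Tier 2: on-prem reasoner
    reasoner_result = call_ollama(ON_PREM_REASONER, alert)

    # Tier 3: cloud reasoner, ONLY if content is non-sensitive
    if sensitivity == "public" and reasoner_result["confidence"] < 0.85:
        cloud_result = CLOUD.chat.completions.create(
            model=CLOUD_MODEL,
            messages=[{"role": "user", "content": serialize_for_cloud(alert)}],
            response_format={"type": "json_object"},  # request a JSON-only reply
        )
        return {"tier": 3, **parse_cloud_response(cloud_result)}

    return {"tier": 2, **reasoner_result}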

The production version of this is much more elaborate (audit logging, circuit breakers, cache, schema validation), but the routing logic is exactly this pattern. Day 4 of this course builds the full version with HITL gates layered on top.

Discussion questions (~10 min)

Azure OpenAI now holds FedRAMP High and DoD Impact Level authorizations. Does that mean a CJIS workload can use it? What additional checks (beyond FedRAMP authorization) does the engineering team need to perform?
Gemini 2.5 Flash is ~20x cheaper than Claude Sonnet 4.6 for triage. Why might a SOC still choose Claude for the same workload?
Your org’s CISO says “no customer email content goes to cloud LLMs, ever.” A user reports a phishing email with a malicious attachment. The defender’s RAG corpus contains MITRE ATT&CK technique descriptions (public) but the alert payload contains the email body (customer content). How do you architect the routing so the LLM can still reason about the email without sending it to cloud?

Common mistakes

Mistake	Better approach
Picking one backend and standardizing	Hybrid is the production default in 2026
Letting the reasoning LLM enforce data-handling policy	Deterministic pre-filter outside the LLM
Caching at the model level instead of the result level	Cache the decision keyed on alert fingerprint, not the raw model output
Ignoring the cost of cloud retries on 429 errors	Cloud retry budgets matter — set explicit per-incident caps
Treating “FedRAMP High” as “compliant for everything”	FedRAMP High ≠ ITAR ≠ CJIS personnel screening ≠ NIS2 ≠ HIPAA. Check each.
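The result-level caching row can be illustrated with a small in-memory cache keyed on an alert fingerprint. This is a sketch with assumed names: `VOLATILE_FIELDS`, `alert_fingerprint`, and `DecisionCache` are hypothetical, and which fields count as volatile depends on your alert schema. A production cache would also need expiry and invalidation when enrichment context changes.

```python
import hashlib
import json

# Fields that differ between otherwise identical alerts (assumed; schema-specific)
VOLATILE_FIELDS = {"timestamp", "event_id"}


def alert_fingerprint(alert: dict) -> str:
    """Stable SHA-256 over the alert's non-volatile fields, including enrichment."""
    stable = {k: v for k, v in alert.items() if k not in VOLATILE_FIELDS}
    canonical = json.dumps(stable, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DecisionCache:
    """Caches the final triage decision, not the raw model output."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, alert: dict, triage_fn):
        key = alert_fingerprint(alert)
        if key not in self._store:
            self._store[key] = triage_fn(alert)  # only cache misses reach a model
        return self._store[key]


calls = []

def fake_triage(alert):
    """Stand-in for the real router so the example runs without any model."""
    calls.append(alert)
    return {"tier": 1, "verdict": "benign", "confidence": 0.97}


cache = DecisionCache()
first = cache.get_or_compute({"rule": "R-1042", "host": "web-01", "timestamp": "09:00"}, fake_triage)
second = cache.get_or_compute({"rule": "R-1042", "host": "web-01", "timestamp": "09:05"}, fake_triage)
print(len(calls))  # 1: the repeat alert was served from the cache
```

Keying on the fingerprint rather than on the prompt text means a change to the prompt template or model does not silently split the cache, though it does mean cached decisions must be flushed when the triage logic changes.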

What’s next

Module 1.3 covers embeddings — the highest-ROI primitive the detection engineer has in 2026, and what most SOCs reach for too late. Embeddings drive both the routing logic above (sensitivity classifiers, near-duplicate alert detection) and the RAG pipeline we build in Module 1.4.