Model Routing Rationale¶
Overview¶
V3 Pragmatica routes work across three models (Opus, Sonnet, Haiku) in four assignment tiers, with vulnerability impact potential driving model selection. Every sub-agent dispatched by the wave coordinator is assigned a model and thinking budget based on the maximum severity of findings that agent could produce.
The Core Principle¶
Not all testing tasks require the same reasoning depth. SQL injection exploitation demands creative bypass construction and response analysis, while TLS certificate validation is a deterministic checklist. Assigning Opus to both wastes budget on the latter and risks insufficient reasoning on the former.
The routing key is a compound lookup: skill-scope (e.g., test-injection-sqli) is tried first, then the bare skill if no scope-level entry matches, and finally the Sonnet default.
Four-Tier Model Assignment¶
Tier 1: Opus (Creative/Critical)¶
Opus handles tasks where the ceiling for finding severity is Critical or High, and where success depends on creative reasoning, response interpretation, or multi-step attack chain construction.
| Agent | Justification |
|---|---|
| test-sqli | Blind SQLi requires observing subtle timing/content differences across dozens of responses |
| test-xss | WAF bypass demands iterative payload mutation and context-aware encoding |
| test-cmdi | 8 blind timing variants need systematic evaluation of response deltas |
| test-oauth | State null-byte, IDN homograph, and pre-ATO attacks require multi-step protocol reasoning |
| test-logic-business | Financial bypass, negative amounts, and overdraft exploitation are inherently creative |
| test-logic-race | Race condition window detection requires precise timing analysis |
| test-infra-smuggling | CL.TE/TE.CL desync requires byte-level reasoning about parser differences |
| test-dom | DOM XSS source-sink tracing through complex JS call chains |
| test-access-idor | Neighbor-ID, role escalation, and field injection patterns need contextual reasoning |
| route | Endpoint-to-test mapping requires understanding the full attack surface |
| verify | False positive elimination demands careful response comparison against baselines |
| chain | Correlating findings into attack chains (SSRF to cloud creds, XSS to ATO) |
Tier 2: Opus with Medium Effort¶
A subset of Opus-routed agents handles tasks that are high-impact but more procedural. These keep Opus for quality but get lower thinking budgets to control cost.
| Agent | Justification |
|---|---|
| test-ssrf-vector | Bypass patterns are catalogued; needs Opus for response analysis, not creative generation |
| test-csrf-cors | SameSite analysis and Content-Type downgrade are structured checks |
| test-api-rest | REST endpoint testing follows systematic patterns but needs reasoning for auth bypass |
| test-api-graphql | Introspection, batching, and WS auth bypass are well-defined attack trees |
| test-deser | Gadget chain detection is pattern matching with Opus-level response analysis |
| test-advanced-* | HPP, CRLF, MFA bypass, and host header attacks are structured but need quality FP filtering |
| test-cloud-* | S3/GCS misconfig and subdomain takeover follow known patterns |
| test-supply-chain | Dependency analysis and SRI checks are systematic |
| test-infra-cache | Cache poisoning/deception follows documented techniques |
Tier 3: Sonnet (Passive/Deterministic)¶
Sonnet handles tasks with lower severity ceilings or highly deterministic execution paths.
| Agent | Justification |
|---|---|
| test-crypto | TLS/SSL checks are tool-driven (testssl.sh output parsing) |
| test-exceptions | Stack trace and debug mode detection is pattern matching |
| test-llm | Prompt injection testing follows structured probe categories |
| test-mobile | Binary analysis with tooling (apktool, jadx) is procedural |
Tier 4: Haiku (Tool Execution)¶
Haiku handles pure tool orchestration where no security reasoning is needed.
| Agent | Justification |
|---|---|
| recon | Subfinder, httpx, dnsx are fire-and-parse |
| scan | Nuclei and nikto execution and output collection |
Thinking Budget Tiers¶
Thinking budgets control how many tokens the model spends on internal reasoning before producing output. Higher budgets improve quality on complex reasoning tasks but increase cost linearly.
| Tier | Budget Range | Assignment |
|---|---|---|
| HIGH (Opus) | 10,000-16,000 tokens | SQLi, XSS, CMDi, OAuth, business logic, race conditions, smuggling, DOM XSS, IDOR, verify, chain, route |
| MEDIUM (Opus) | 5,000-8,000 tokens | SSRF vectors, CSRF/CORS, API testing, deserialization, advanced checks, cloud, supply chain, GraphQL |
| Sonnet | 3,000-5,000 tokens | Crypto, exceptions, LLM, mobile |
| Haiku (implicit default) | 2,000 tokens | Recon, scan |
Route gets the highest budget
The /route skill receives 16,000 thinking tokens because it must analyze the entire resource map, injectable parameters, and scan results to produce an accurate endpoint-to-test mapping. Mistakes here cascade into missed coverage downstream.
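The tier and budget tables above can be sketched as lookup tables. This is a minimal illustration, not the project's actual configuration: MODEL_MAP and THINKING_MAP are the names used by the dispatch snippet later in this document, but the Opus and Haiku model IDs below are placeholders, and only a few representative agents are shown.

```shell
# Sketch of the routing tables implied by the tiers above.
# Opus/Haiku model IDs are placeholders, not real identifiers.
declare -A MODEL_MAP=(
  [test-injection-sqli]="claude-opus-placeholder"   # Tier 1 (HIGH)
  [test-ssrf-vector]="claude-opus-placeholder"      # Tier 2 (MEDIUM)
  [test-crypto]="claude-sonnet-4-6"                 # Tier 3 (Sonnet)
  [recon]="claude-haiku-placeholder"                # Tier 4 (Haiku)
)
declare -A THINKING_MAP=(
  [test-injection-sqli]=16000   # HIGH ceiling
  [test-ssrf-vector]=8000       # MEDIUM ceiling
  [test-crypto]=4000            # Sonnet tier
  [recon]=2000                  # Haiku default
)
echo "${MODEL_MAP[test-crypto]} with ${THINKING_MAP[test-crypto]} thinking tokens"
```

Keeping the two maps keyed identically means a single lookup key resolves both the model and its budget in one pass.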
Cost-Quality Tradeoff¶
The tiered approach costs approximately 2-2.5x the tokens of running everything on Sonnet. The justification is measurable: lab evaluations show that Opus-routed injection and auth testing finds 30-40% more vulnerabilities than the same skills on Sonnet, particularly for blind and time-based attacks that require multi-step reasoning.
The budget is controlled by:
- Per-agent request limits: `500 / N_CONCURRENT_AGENTS` requests per agent
- JITTER_MULT scaling: combined request rate stays within stealth limits regardless of concurrency
- Kill switches: 45-minute timeout (60 for injection), hard stop at 500 requests per skill
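The per-agent cap in the first item is simple integer division of the shared 500-request budget. A minimal sketch of that arithmetic (the concurrency value here is illustrative):

```shell
# Illustrative arithmetic behind the per-agent request cap: a shared
# budget of 500 requests split evenly across concurrent agents.
N_CONCURRENT_AGENTS=4
PER_AGENT_LIMIT=$(( 500 / N_CONCURRENT_AGENTS ))
echo "$PER_AGENT_LIMIT requests per agent at concurrency $N_CONCURRENT_AGENTS"
# -> 125 requests per agent at concurrency 4
```

Because the division scales inversely with concurrency, the total request volume stays bounded at 500 no matter how many agents run in parallel.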
V3 Pragmatica vs Previous Versions¶
| Version | Strategy | Agents | Issue |
|---|---|---|---|
| V1 | Single model (Opus for everything) | 16 monolithic | Excessive cost, context degradation in long-running agents |
| V2 | 2-tier (Opus/Sonnet) | 16 monolithic | Better cost, but monolithic agents still degraded after 100+ turns |
| V3 Pragmatica | Tiered models with effort levels, scope decomposition | 31+ sub-agents | Focused agents with isolated context, model matched to task complexity |
V3 Pragmatica's key insight is that smaller, focused agents outperform larger monolithic ones even when using the same model, because context window degradation is the primary quality bottleneck in long-running penetration tests. By decomposing skills into scopes and routing each scope to the appropriate model with a calibrated thinking budget, V3 achieves both higher finding rates and lower per-engagement cost.
Compound Lookup Keys¶
The dispatch function resolves models using compound keys for scope-level granularity:
```bash
# Build compound lookup key
local lookup_key="$skill"
[ -n "$scope" ] && lookup_key="${skill}-${scope}"

# Resolve model (default: Sonnet)
local model="${MODEL_MAP[$lookup_key]:-${MODEL_MAP[$skill]:-claude-sonnet-4-6}}"

# Resolve thinking budget (default: 2000)
local thinking="${THINKING_MAP[$lookup_key]:-${THINKING_MAP[$skill]:-2000}}"
```
This allows different scopes of the same skill to receive different models. For example, test-injection-sqli gets Opus HIGH while test-injection-misc gets Opus MEDIUM.
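To exercise the fallback chain, the resolution logic can be wrapped in a standalone function. This is a hedged sketch: the function name and map values below are placeholder labels for illustration, not real model IDs or project code.

```shell
# Sketch: the dispatch resolution wrapped in a function so the
# scope -> skill -> default fallback chain can be exercised.
# Map values are placeholder labels, not real model IDs.
declare -A MODEL_MAP=(
  [test-injection-sqli]="opus-high"   # scope-level entry
  [test-injection]="opus-medium"      # skill-level fallback
)

resolve_model() {
  local skill="$1" scope="$2"
  local lookup_key="$skill"
  [ -n "$scope" ] && lookup_key="${skill}-${scope}"
  # Scope key first, then bare skill key, then the Sonnet default
  echo "${MODEL_MAP[$lookup_key]:-${MODEL_MAP[$skill]:-claude-sonnet-4-6}}"
}

resolve_model test-injection sqli   # scope match    -> opus-high
resolve_model test-injection misc   # skill fallback -> opus-medium
resolve_model test-crypto ""        # default        -> claude-sonnet-4-6
```

The nested `${var:-default}` expansions keep the whole three-step fallback in a single line per map, with no branching logic in the dispatcher.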