# Token Budget Analysis
## Budget Baseline
A full pentest engagement against a medium-complexity application (30-80 known vulnerabilities) consumes approximately 1.8-2.5M input tokens and 450-600K output tokens per run (figures updated for the 1M context window and higher thinking budgets). With the V3 Pragmatica wave architecture, a weekly evaluation cadence of 8-12 full scans across 3 lab targets means budgeting for up to 20x the weekly baseline.
## Per-Engagement Estimates
| Phase | Input Tokens | Output Tokens | Notes |
|---|---|---|---|
| Phase 0 (Context) | 15-25K | 5-10K | Fingerprinting, tech detection |
| Phase 0.5 (Walkthrough) | 80-120K | 20-40K | Playwright crawl, all roles |
| Phase 1 (Recon) | 30-50K | 10-20K | Haiku agents, tool output parsing |
| Phase 2 (Discovery) | 100-150K | 30-50K | ffuf, jsluice, arjun, source maps |
| Phase 3 (Scanning) | 50-80K | 15-25K | Nuclei, nikto output |
| Phase 3.5 (Route) | 50-80K | 20-35K | Opus HIGH 24K thinking for test plan |
| Phase 4 (Testing) | 900-1.2M | 250-350K | 31+ sub-agents across 12 waves, higher thinking budgets |
| Phase 5 (Verify) | 100-150K | 30-50K | Exploit verification, dual-verify |
| Phase 5b (Chain) | 40-60K | 15-25K | Attack chain correlation |
| Phase 6 (Report) | 50-80K | 30-50K | Report generation |
| Total | ~1.8-2.5M | ~450-600K | |
Phase 4 dominates at approximately 50-55% of total tokens. This is expected: the 31 sub-agents each load the skill boilerplate, context.json, and test-plan.json, with higher thinking budgets (16-24K HIGH, 8-14K MED) producing deeper reasoning at the cost of more tokens.
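A minimal sketch totaling the per-phase ranges from the table above (the straight sums land slightly below the rounded headline totals); the phase names and the `total_range` helper are illustrative, not part of the actual harness:

```python
# (input_range, output_range) in tokens, transcribed from the phase table.
PHASES = {
    "0_context":       ((15_000, 25_000),      (5_000, 10_000)),
    "0.5_walkthrough": ((80_000, 120_000),     (20_000, 40_000)),
    "1_recon":         ((30_000, 50_000),      (10_000, 20_000)),
    "2_discovery":     ((100_000, 150_000),    (30_000, 50_000)),
    "3_scanning":      ((50_000, 80_000),      (15_000, 25_000)),
    "3.5_route":       ((50_000, 80_000),      (20_000, 35_000)),
    "4_testing":       ((900_000, 1_200_000),  (250_000, 350_000)),
    "5_verify":        ((100_000, 150_000),    (30_000, 50_000)),
    "5b_chain":        ((40_000, 60_000),      (15_000, 25_000)),
    "6_report":        ((50_000, 80_000),      (30_000, 50_000)),
}

def total_range(phases):
    """Sum the low and high ends of each phase's (input, output) ranges."""
    in_lo = sum(p[0][0] for p in phases.values())
    in_hi = sum(p[0][1] for p in phases.values())
    out_lo = sum(p[1][0] for p in phases.values())
    out_hi = sum(p[1][1] for p in phases.values())
    return (in_lo, in_hi), (out_lo, out_hi)
```

Summing confirms that Phase 4 alone accounts for roughly half of both input and output budgets.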
## Per-Wave Token Breakdown
Within Phase 4, token consumption varies significantly by model tier and thinking budget:
| Wave | Agents | Model | Thinking Budget | Est. Input | Est. Output |
|---|---|---|---|---|---|
| Wave 0 | sqli, xss, idor | Opus HIGH | 16-20K | 120-160K | 35-50K |
| Wave 1 | cmdi, authz, jwt | Opus HIGH | 16-20K | 110-150K | 30-42K |
| Wave 2 | oauth, session, csrf | Opus HIGH/MED | 10-20K | 95-130K | 25-38K |
| Wave 3 | dom, ssrf-core, ssti | Opus HIGH | 16-20K | 110-150K | 30-45K |
| Wave 4 | ssrf-vector, rest, graphql | Opus MED | 10-14K | 80-110K | 20-30K |
| Wave 5 | logic, race, mfa | Opus HIGH/MED | 10-20K | 95-130K | 25-35K |
| Wave 6 | hpp-crlf, bypass, smuggling | Opus MED/HIGH | 8-20K | 75-105K | 20-28K |
| Wave 7 | cache, host-method, matrix | Opus MED/HIGH | 8-16K | 70-95K | 18-25K |
| Wave 8 | prototype, upload, deser | Opus MED | 8-10K | 60-80K | 15-22K |
| Wave 9 | cloud (3 scopes) | Opus MED | 8-10K | 55-75K | 14-20K |
| Wave 10 | crypto, exceptions, supply | Sonnet/Opus | 6-10K | 45-65K | 12-18K |
| Wave 11 | misc injection, misc client | Opus MED | 8-10K | 35-50K | 10-14K |
## Thinking Budget Impact on Quality
Extended thinking is the single largest contributor to output quality in security testing. Lab evaluations demonstrate measurable quality differences:
| Thinking Budget | SQLi Detection Rate | XSS Detection Rate | FP Rate |
|---|---|---|---|
| 2,000 (Haiku default) | ~40% | ~35% | ~25% |
| 5,000 (Sonnet medium) | ~55% | ~50% | ~18% |
| 8,000 (Opus medium) | ~70% | ~65% | ~12% |
| 12,000 (Opus high) | ~82% | ~78% | ~8% |
| 20,000 (Opus high, 1M) | ~87-88% | ~84% | ~6% |
With 1M context windows, higher thinking budgets show continued returns up to ~20K. The 16,000-24,000 cap for HIGH-tier agents balances detection rate improvement against diminishing returns. Beyond 24,000, gains flatten significantly.
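A hypothetical helper capturing the tier caps described above (the tier names and the linear interpolation are assumptions; only the numeric ranges come from the text):

```python
# Thinking-budget ranges per tier; HIGH is capped at 24K, where gains flatten.
TIER_BUDGETS = {
    "haiku":      (2_000, 2_000),
    "sonnet_med": (5_000, 6_000),
    "opus_med":   (8_000, 14_000),
    "opus_high":  (16_000, 24_000),
}

def thinking_budget(tier: str, scale: float = 1.0) -> int:
    """Pick a budget within the tier's range; clamp at the tier cap."""
    lo, hi = TIER_BUDGETS[tier]
    return min(hi, max(lo, int(lo + (hi - lo) * scale)))
```

Clamping at the cap encodes the diminishing-returns finding: requesting more than 24K for a HIGH-tier agent spends tokens without a matching detection-rate gain.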
## Cost Comparison: V3 vs Previous
| Architecture | Tokens/Engagement | Relative Cost | Finding Rate |
|---|---|---|---|
| V1 (single Opus) | ~800K input | 1.0x | Baseline |
| V2 (2-tier, monolithic) | ~1.0M input | 1.3x | +15% findings |
| V3 Pragmatica | ~1.3M input | 2.0-2.5x | +35-40% findings |
The 2-2.5x cost increase buys:
- 31 isolated context windows instead of 16 degrading ones
- Focused thinking budgets matched to task complexity
- Level 3 endpoint splitting for uniform coverage on large APIs
- Dual-verify on Critical findings to eliminate false positives
## Weekly Capacity Planning
At 12-14 hours per day over a 5-day evaluation week:
| Metric | Value |
|---|---|
| Full scans per week | 8-12 (3-4 per lab target) |
| Tokens per week (input) | ~14-20M |
| Tokens per week (output) | ~4-6M |
| Budget headroom | 20x weekly baseline accommodates re-runs and debugging |
| Cost per engagement | $12-22 (at standard API pricing, higher thinking budgets) |
| Cost per week | $100-250 depending on lab complexity |
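A back-of-envelope cost check, with hypothetical blended per-million-token rates (the $5/M input and $15/M output figures are assumptions chosen to land inside the $12-22 range above; actual rates depend on model mix and caching):

```python
def engagement_cost(input_tokens: int, output_tokens: int,
                    in_rate_per_m: float = 5.0,
                    out_rate_per_m: float = 15.0) -> float:
    """Dollar cost of one engagement at blended per-million-token rates."""
    return (input_tokens / 1e6 * in_rate_per_m
            + output_tokens / 1e6 * out_rate_per_m)
```

At the midpoint of 2M input and 500K output tokens, these assumed rates give about $17.50 per engagement, consistent with the $12-22 range in the table.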
## Cache Optimization
The Claude API's prompt caching significantly reduces effective input costs:
- Skill boilerplate (shared across all 31 agents): cached after first load
- context.json (loaded by every agent): cached after Phase 0
- test-plan.json (loaded by Phase 4 agents): cached after /route
Effective cache hit rates of 40-60% on input tokens reduce the practical cost to approximately 1.3-1.5x compared to V2, making the quality improvement essentially free at scale.
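A sketch of the cache arithmetic: if cached reads are billed at a fraction of the base input rate (the 0.1 multiplier here is an assumption, not a quoted price), a 40-60% hit rate shrinks billable input tokens substantially:

```python
def effective_input_tokens(total_input: int, cache_hit_rate: float,
                           cached_read_multiplier: float = 0.1) -> float:
    """Billable input tokens after discounting cache hits."""
    cached = total_input * cache_hit_rate
    uncached = total_input - cached
    return uncached + cached * cached_read_multiplier
```

Under these assumptions, a 50% hit rate bills only 55% of raw input tokens, which is how V3's 2-2.5x raw token cost compresses toward the ~1.3-1.5x practical figure.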
## Budget Controls
Multiple mechanisms prevent runaway token consumption:
| Control | Mechanism |
|---|---|
| Per-skill request limit | 500 requests max, divided by concurrent agents |
| Per-agent timeout | 45 minutes (60 for injection) |
| Kill switch at 400 requests | Warning logged, agent prepares to wrap up |
| Kill switch at 500 requests | Hard stop, partial results saved |
| 3x 429 detection | Agent stops if target returns 3 rate-limit responses |
| Wave health check | Expired sessions caught before wasting next wave's budget |
| --max-turns 250 | Claude CLI hard limit per agent session |
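The two request kill switches can be sketched as a simple counter; the class and method names here are illustrative, not taken from the actual harness:

```python
class RequestBudget:
    """Warn at the soft limit, hard-stop at the hard limit."""

    def __init__(self, soft_limit: int = 400, hard_limit: int = 500):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.count = 0
        self.warned = False

    def record(self) -> bool:
        """Count one outbound request; False means the agent must stop
        and save partial results."""
        self.count += 1
        if self.count >= self.hard_limit:
            return False
        if self.count >= self.soft_limit and not self.warned:
            self.warned = True
            print(f"warning: {self.count}/{self.hard_limit} requests used")
        return True
```

The 3x 429 detection would layer on top of this: a separate counter of consecutive rate-limit responses that forces an early stop regardless of the request budget.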