Token Budget Analysis

Budget Baseline

A full pentest engagement on a medium-complexity application (30-80 known vulnerabilities) consumes approximately 1.8-2.5M input tokens and 450-600K output tokens per run (figures updated for the 1M context window and higher thinking budgets). Under the V3 Pragmatica wave architecture, a weekly evaluation cadence of 8-12 full scans across 3 lab targets means budgeting up to 20x the weekly baseline to absorb re-runs and debugging.

Per-Engagement Estimates

| Phase | Input Tokens | Output Tokens | Notes |
|-------|--------------|---------------|-------|
| Phase 0 (Context) | 15-25K | 5-10K | Fingerprinting, tech detection |
| Phase 0.5 (Walkthrough) | 80-120K | 20-40K | Playwright crawl, all roles |
| Phase 1 (Recon) | 30-50K | 10-20K | Haiku agents, tool output parsing |
| Phase 2 (Discovery) | 100-150K | 30-50K | ffuf, jsluice, arjun, source maps |
| Phase 3 (Scanning) | 50-80K | 15-25K | Nuclei, nikto output |
| Phase 3.5 (Route) | 50-80K | 20-35K | Opus HIGH 24K thinking for test plan |
| Phase 4 (Testing) | 900-1.2M | 250-350K | 31+ sub-agents across 12 waves, higher thinking budgets |
| Phase 5 (Verify) | 100-150K | 30-50K | Exploit verification, dual-verify |
| Phase 5b (Chain) | 40-60K | 15-25K | Attack chain correlation |
| Phase 6 (Report) | 50-80K | 30-50K | Report generation |
| **Total** | ~1.8-2.5M | ~450-600K | |

Phase 4 dominates at approximately 50-55% of total tokens. This is expected: the 31 sub-agents each load the skill boilerplate, context.json, and test-plan.json, with higher thinking budgets (16-24K HIGH, 8-14K MED) producing deeper reasoning at the cost of more tokens.
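As a quick arithmetic check (using range midpoints from the figures above), Phase 4's share of the combined input-plus-output budget works out to roughly half:

```python
# Phase 4 share of the combined token budget, using range midpoints
# (all figures in K tokens, taken from the estimates above).
phase4_mid = (900 + 1200) / 2 + (250 + 350) / 2    # input + output midpoints
total_mid = (1800 + 2500) / 2 + (450 + 600) / 2
share = phase4_mid / total_mid
print(f"Phase 4 share: {share:.0%}")               # ~50%
```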

Per-Wave Token Breakdown

Within Phase 4, token consumption varies significantly by model tier and thinking budget:

| Wave | Agents | Model | Thinking Budget | Est. Input | Est. Output |
|------|--------|-------|-----------------|------------|-------------|
| Wave 0 | sqli, xss, idor | Opus HIGH | 16-20K | 120-160K | 35-50K |
| Wave 1 | cmdi, authz, jwt | Opus HIGH | 16-20K | 110-150K | 30-42K |
| Wave 2 | oauth, session, csrf | Opus HIGH/MED | 10-20K | 95-130K | 25-38K |
| Wave 3 | dom, ssrf-core, ssti | Opus HIGH | 16-20K | 110-150K | 30-45K |
| Wave 4 | ssrf-vector, rest, graphql | Opus MED | 10-14K | 80-110K | 20-30K |
| Wave 5 | logic, race, mfa | Opus HIGH/MED | 10-20K | 95-130K | 25-35K |
| Wave 6 | hpp-crlf, bypass, smuggling | Opus MED/HIGH | 8-20K | 75-105K | 20-28K |
| Wave 7 | cache, host-method, matrix | Opus MED/HIGH | 8-16K | 70-95K | 18-25K |
| Wave 8 | prototype, upload, deser | Opus MED | 8-10K | 60-80K | 15-22K |
| Wave 9 | cloud (3 scopes) | Opus MED | 8-10K | 55-75K | 14-20K |
| Wave 10 | crypto, exceptions, supply | Sonnet/Opus | 6-10K | 45-65K | 12-18K |
| Wave 11 | misc injection, misc client | Opus MED | 8-10K | 35-50K | 10-14K |
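Summing the per-wave input estimates is a useful cross-check against the Phase 4 budget; a minimal sketch (K tokens, ranges from the rows above):

```python
# Sum the per-wave input ranges (K tokens); the total should roughly
# reassemble Phase 4's 900K-1.2M input budget.
WAVE_INPUT_K = [
    (120, 160), (110, 150), (95, 130), (110, 150),   # waves 0-3
    (80, 110), (95, 130), (75, 105), (70, 95),       # waves 4-7
    (60, 80), (55, 75), (45, 65), (35, 50),          # waves 8-11
]
lo = sum(a for a, _ in WAVE_INPUT_K)
hi = sum(b for _, b in WAVE_INPUT_K)
print(f"Wave inputs sum to {lo}-{hi}K")              # 950-1300K
```

The high end slightly exceeds the 1.2M phase-level estimate; that figure assumes not every wave hits its worst case in the same run.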

Thinking Budget Impact on Quality

Extended thinking is the single largest contributor to output quality in security testing. Lab evaluations demonstrate measurable quality differences:

| Thinking Budget | SQLi Detection Rate | XSS Detection Rate | FP Rate |
|-----------------|---------------------|--------------------|---------|
| 2,000 (Haiku default) | ~40% | ~35% | ~25% |
| 5,000 (Sonnet medium) | ~55% | ~50% | ~18% |
| 8,000 (Opus medium) | ~70% | ~65% | ~12% |
| 12,000 (Opus high) | ~82% | ~78% | ~8% |
| 20,000 (Opus high, 1M) | ~87-88% | ~84% | ~6% |

With 1M context windows, higher thinking budgets show continued returns up to ~20K. The 16,000-24,000 cap for HIGH-tier agents balances detection rate improvement against diminishing returns. Beyond 24,000, gains flatten significantly.
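The diminishing returns are easiest to see as marginal detection gain per additional thinking token; a quick sketch using the SQLi figures from the table above:

```python
# Marginal SQLi detection gain per additional 1K thinking tokens,
# computed from the lab figures above.
budgets = [2, 5, 8, 12, 20]                  # thinking budget, K tokens
sqli = [0.40, 0.55, 0.70, 0.82, 0.875]      # SQLi detection rate
pairs = list(zip(budgets, sqli))
gains = [(d1 - d0) / (b1 - b0) for (b0, d0), (b1, d1) in zip(pairs, pairs[1:])]
for (b0, _), (b1, _), g in zip(pairs, pairs[1:], gains):
    print(f"{b0}K -> {b1}K: +{g:.3f} detection per 1K thinking tokens")
```

The marginal gain drops roughly sevenfold between the 8-12K and 12-20K intervals, which is the quantitative basis for capping HIGH-tier agents rather than scaling thinking further.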

Cost Comparison: V3 vs Previous

| Architecture | Tokens/Engagement | Relative Cost | Finding Rate |
|--------------|-------------------|---------------|--------------|
| V1 (single Opus) | ~800K input | 1.0x | Baseline |
| V2 (2-tier, monolithic) | ~1.0M input | 1.3x | +15% findings |
| V3 Pragmatica | ~1.3M input | 2.0-2.5x | +35-40% findings |

The 2-2.5x cost increase buys:

  • 31 isolated context windows instead of 16 degrading ones
  • Focused thinking budgets matched to task complexity
  • Level 3 endpoint splitting for uniform coverage on large APIs
  • Dual-verify on Critical findings to eliminate false positives

Weekly Capacity Planning

At 12-14 hours per day over a 5-day evaluation week:

| Metric | Value |
|--------|-------|
| Full scans per week | 8-12 (3-4 per lab target) |
| Tokens per week (input) | ~14-20M |
| Tokens per week (output) | ~4-6M |
| Budget headroom | 20x weekly baseline accommodates re-runs and debugging |
| Cost per engagement | $12-22 (at standard API pricing, higher thinking budgets) |
| Cost per week | $100-250 depending on lab complexity |
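The weekly cost range follows directly from the per-engagement figures; a minimal arithmetic sketch:

```python
# Weekly cost bounds implied by the scan cadence and per-scan cost above.
scans_per_week = (8, 12)
cost_per_scan_usd = (12, 22)
weekly = (scans_per_week[0] * cost_per_scan_usd[0],
          scans_per_week[1] * cost_per_scan_usd[1])
print(f"${weekly[0]}-{weekly[1]} per week")          # $96-264
```

This brackets the quoted $100-250 planning figure.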

Cache Optimization

The Claude API's prompt caching significantly reduces effective input costs:

  • Skill boilerplate (shared across all 31 agents): cached after first load
  • context.json (loaded by every agent): cached after Phase 0
  • test-plan.json (loaded by Phase 4 agents): cached after /route

Effective cache hit rates of 40-60% on input tokens reduce the practical cost to approximately 1.3-1.5x compared to V2, making the quality improvement essentially free at scale.
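A sketch of the effective-cost arithmetic, assuming cache reads bill at roughly 10% of the base input rate (the typical prompt-caching read discount) and ignoring cache-write surcharges:

```python
def effective_multiplier(raw, hit_rate, read_discount=0.10):
    """Cache hits bill at read_discount of base price; misses at full price."""
    return raw * (1 - hit_rate * (1 - read_discount))

# V3's raw 2.25x midpoint at a 50% cache hit rate:
print(f"{effective_multiplier(2.25, 0.5):.2f}x")     # ~1.24x
```

Across the observed 40-60% hit rates this lands near the quoted 1.3-1.5x band; cache-write surcharges, ignored here, nudge the figure upward.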

Budget Controls

Multiple mechanisms prevent runaway token consumption:

| Control | Mechanism |
|---------|-----------|
| Per-skill request limit | 500 requests max, divided by concurrent agents |
| Per-agent timeout | 45 minutes (60 for injection) |
| Kill switch at 400 requests | Warning logged, agent prepares to wrap up |
| Kill switch at 500 requests | Hard stop, partial results saved |
| 3x 429 detection | Agent stops if target returns 3 rate-limit responses |
| Wave health check | Expired sessions caught before wasting the next wave's budget |
| `--max-turns 250` | Claude CLI hard limit per agent session |
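The request-count controls can be sketched as a small state machine (class and method names here are illustrative, not the actual implementation):

```python
# Illustrative sketch of the per-agent request guard: warn at 400
# requests, hard-stop at 500, and abort after three 429 responses.
class RequestBudget:
    def __init__(self, soft_limit=400, hard_limit=500, max_429s=3):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.max_429s = max_429s
        self.requests = 0
        self.rate_limited = 0

    def record(self, status_code):
        """Return 'ok', 'wrap_up', or 'stop' after each request."""
        self.requests += 1
        if status_code == 429:
            self.rate_limited += 1
        if self.rate_limited >= self.max_429s or self.requests >= self.hard_limit:
            return "stop"
        if self.requests >= self.soft_limit:
            return "wrap_up"
        return "ok"
```

At "wrap_up" the agent finishes its current test and writes partial results; "stop" abandons the wave immediately so the next wave's budget is preserved.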