# Token Budget Analysis
## Budget Baseline
A full pentest engagement against a medium-complexity application (30-80 known vulnerabilities) consumes approximately 1.8-2.5M input tokens and 450-600K output tokens per run (figures updated for the 1M context window and higher thinking budgets). With the V3 Pragmatica wave architecture, a weekly evaluation cadence of 8-12 full scans across 3 lab targets means budgeting for up to 20x the weekly baseline.
## Per-Engagement Estimates
| Phase | Input Tokens | Output Tokens | Notes |
|---|---|---|---|
| Phase 0 (Context) | 15-25K | 5-10K | Fingerprinting, tech detection |
| Phase 0.5 (Walkthrough) | 80-120K | 20-40K | Playwright crawl, all roles |
| Phase 1 (Recon) | 30-50K | 10-20K | Haiku agents, tool output parsing |
| Phase 2 (Discovery) | 100-150K | 30-50K | ffuf, jsluice, arjun, source maps |
| Phase 3 (Scanning) | 50-80K | 15-25K | Nuclei, nikto output |
| Phase 3.5 (Route) | 50-80K | 20-35K | Opus HIGH 24K thinking for test plan |
| Phase 4 (Testing) | 900-1.2M | 250-350K | 31+ sub-agents across 12 waves, higher thinking budgets |
| Phase 5 (Verify) | 100-150K | 30-50K | Exploit verification, dual-verify |
| Phase 5b (Chain) | 40-60K | 15-25K | Attack chain correlation |
| Phase 6 (Report) | 50-80K | 30-50K | Report generation |
| Total | ~1.8-2.5M | ~450-600K | |
Phase 4 dominates at approximately 50-55% of total tokens. This is expected: the 31 sub-agents each load the skill boilerplate, context.json, and test-plan.json, with higher thinking budgets (16-24K HIGH, 8-14K MED) producing deeper reasoning at the cost of more tokens.
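A minimal sketch totaling the per-phase ranges from the table above (the straight sums land slightly below the rounded headline totals); the phase names and the `total_range` helper are illustrative, not part of the actual harness:

```python
# (input_range, output_range) in tokens, transcribed from the phase table.
PHASES = {
    "0_context":       ((15_000, 25_000),      (5_000, 10_000)),
    "0.5_walkthrough": ((80_000, 120_000),     (20_000, 40_000)),
    "1_recon":         ((30_000, 50_000),      (10_000, 20_000)),
    "2_discovery":     ((100_000, 150_000),    (30_000, 50_000)),
    "3_scanning":      ((50_000, 80_000),      (15_000, 25_000)),
    "3.5_route":       ((50_000, 80_000),      (20_000, 35_000)),
    "4_testing":       ((900_000, 1_200_000),  (250_000, 350_000)),
    "5_verify":        ((100_000, 150_000),    (30_000, 50_000)),
    "5b_chain":        ((40_000, 60_000),      (15_000, 25_000)),
    "6_report":        ((50_000, 80_000),      (30_000, 50_000)),
}

def total_range(phases):
    """Sum the low and high ends of each phase's (input, output) ranges."""
    in_lo = sum(p[0][0] for p in phases.values())
    in_hi = sum(p[0][1] for p in phases.values())
    out_lo = sum(p[1][0] for p in phases.values())
    out_hi = sum(p[1][1] for p in phases.values())
    return (in_lo, in_hi), (out_lo, out_hi)
```

Summing confirms that Phase 4 alone accounts for roughly half of both input and output budgets.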
## Per-Wave Token Breakdown
Within Phase 4, token consumption varies significantly by model tier and thinking budget:
| Wave | Agents | Model | Thinking Budget | Est. Input | Est. Output |
|---|---|---|---|---|---|
| Wave 0 | sqli, xss, idor | Opus HIGH | 16-20K | 120-160K | 35-50K |
| Wave 1 | cmdi, authz, jwt | Opus HIGH | 16-20K | 110-150K | 30-42K |
| Wave 2 | oauth, session, csrf | Opus HIGH/MED | 10-20K | 95-130K | 25-38K |
| Wave 3 | dom, ssrf-core, ssti | Opus HIGH | 16-20K | 110-150K | 30-45K |
| Wave 4 | ssrf-vector, rest, graphql | Opus MED | 10-14K | 80-110K | 20-30K |
| Wave 5 | logic, race, mfa | Opus HIGH/MED | 10-20K | 95-130K | 25-35K |
| Wave 6 | hpp-crlf, bypass, smuggling | Opus MED/HIGH | 8-20K | 75-105K | 20-28K |
| Wave 7 | cache, host-method, matrix | Opus MED/HIGH | 8-16K | 70-95K | 18-25K |
| Wave 8 | prototype, upload, deser | Opus MED | 8-10K | 60-80K | 15-22K |
| Wave 9 | cloud (3 scopes) | Opus MED | 8-10K | 55-75K | 14-20K |
| Wave 10 | crypto, exceptions, supply | Sonnet/Opus | 6-10K | 45-65K | 12-18K |
| Wave 11 | misc injection, misc client | Opus MED | 8-10K | 35-50K | 10-14K |
## Thinking Budget Impact on Quality
Extended thinking is the single largest contributor to output quality in security testing. Lab evaluations demonstrate measurable quality differences:
| Thinking Budget | SQLi Detection Rate | XSS Detection Rate | FP Rate |
|---|---|---|---|
| 2,000 (Haiku default) | ~40% | ~35% | ~25% |
| 5,000 (Sonnet medium) | ~55% | ~50% | ~18% |
| 8,000 (Opus medium) | ~70% | ~65% | ~12% |
| 12,000 (Opus high) | ~82% | ~78% | ~8% |
| 20,000 (Opus high, 1M) | ~87-88% | ~84% | ~6% |
With 1M context windows, higher thinking budgets show continued returns up to ~20K. The 16,000-24,000 cap for HIGH-tier agents balances detection rate improvement against diminishing returns. Beyond 24,000, gains flatten significantly.
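A hypothetical helper capturing the tier caps described above (the tier names and the linear interpolation are assumptions; only the numeric ranges come from the text):

```python
# Thinking-budget ranges per tier; HIGH is capped at 24K, where gains flatten.
TIER_BUDGETS = {
    "haiku":      (2_000, 2_000),
    "sonnet_med": (5_000, 6_000),
    "opus_med":   (8_000, 14_000),
    "opus_high":  (16_000, 24_000),
}

def thinking_budget(tier: str, scale: float = 1.0) -> int:
    """Pick a budget within the tier's range; clamp at the tier cap."""
    lo, hi = TIER_BUDGETS[tier]
    return min(hi, max(lo, int(lo + (hi - lo) * scale)))
```

Clamping at the cap encodes the diminishing-returns finding: requesting more than 24K for a HIGH-tier agent spends tokens without a matching detection-rate gain.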
## Cost Comparison: V3 vs Previous
| Architecture | Tokens/Engagement | Relative Cost | Finding Rate |
|---|---|---|---|
| V1 (single Opus) | ~800K input | 1.0x | Baseline |
| V2 (2-tier, monolithic) | ~1.0M input | 1.3x | +15% findings |
| V3 Pragmatica | ~1.3M input | 2.0-2.5x | +35-40% findings |
The 2-2.5x cost increase buys:
- 31 isolated context windows instead of 16 degrading ones
- Focused thinking budgets matched to task complexity
- Level 3 endpoint splitting for uniform coverage on large APIs
- Dual-verify on Critical findings to eliminate false positives
## Weekly Capacity Planning
At 12-14 hours per day over a 5-day evaluation week:
| Metric | Value |
|---|---|
| Full scans per week | 8-12 (3-4 per lab target) |
| Tokens per week (input) | ~14-20M |
| Tokens per week (output) | ~4-6M |
| Budget headroom | 20x weekly baseline accommodates re-runs and debugging |
| Cost per engagement | $12-22 (at standard API pricing, higher thinking budgets) |
| Cost per week | $100-250 depending on lab complexity |
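A back-of-envelope cost check, with hypothetical blended per-million-token rates (the $5/M input and $15/M output figures are assumptions chosen to land inside the $12-22 range above; actual rates depend on model mix and caching):

```python
def engagement_cost(input_tokens: int, output_tokens: int,
                    in_rate_per_m: float = 5.0,
                    out_rate_per_m: float = 15.0) -> float:
    """Dollar cost of one engagement at blended per-million-token rates."""
    return (input_tokens / 1e6 * in_rate_per_m
            + output_tokens / 1e6 * out_rate_per_m)
```

At the midpoint of 2M input and 500K output tokens, these assumed rates give about $17.50 per engagement, consistent with the $12-22 range in the table.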
## Cache Optimization
The Claude API's prompt caching significantly reduces effective input costs:
- Skill boilerplate (shared across all 31 agents): cached after first load
- context.json (loaded by every agent): cached after Phase 0
- test-plan.json (loaded by Phase 4 agents): cached after /route
Effective cache hit rates of 40-60% on input tokens reduce the practical cost to approximately 1.3-1.5x compared to V2, making the quality improvement essentially free at scale.
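A sketch of the cache arithmetic: if cached reads are billed at a fraction of the base input rate (the 0.1 multiplier here is an assumption, not a quoted price), a 40-60% hit rate shrinks billable input tokens substantially:

```python
def effective_input_tokens(total_input: int, cache_hit_rate: float,
                           cached_read_multiplier: float = 0.1) -> float:
    """Billable input tokens after discounting cache hits."""
    cached = total_input * cache_hit_rate
    uncached = total_input - cached
    return uncached + cached * cached_read_multiplier
```

Under these assumptions, a 50% hit rate bills only 55% of raw input tokens, which is how V3's 2-2.5x raw token cost compresses toward the ~1.3-1.5x practical figure.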
## Budget Controls
Multiple mechanisms prevent runaway token consumption:
| Control | Mechanism |
|---|---|
| Per-skill request limit | 500 requests max, divided by concurrent agents |
| Per-agent timeout | 45 minutes (60 for injection) |
| Kill switch at 400 requests | Warning logged, agent prepares to wrap up |
| Kill switch at 500 requests | Hard stop, partial results saved |
| 3x 429 detection | Agent stops if target returns 3 rate-limit responses |
| Wave health check | Expired sessions caught before wasting next wave's budget |
| --max-turns 250 | Claude CLI hard limit per agent session |
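The two request kill switches can be sketched as a simple counter; the class and method names here are illustrative, not taken from the actual harness:

```python
class RequestBudget:
    """Warn at the soft limit, hard-stop at the hard limit."""

    def __init__(self, soft_limit: int = 400, hard_limit: int = 500):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.count = 0
        self.warned = False

    def record(self) -> bool:
        """Count one outbound request; False means the agent must stop
        and save partial results."""
        self.count += 1
        if self.count >= self.hard_limit:
            return False
        if self.count >= self.soft_limit and not self.warned:
            self.warned = True
            print(f"warning: {self.count}/{self.hard_limit} requests used")
        return True
```

The 3x 429 detection would layer on top of this: a separate counter of consecutive rate-limit responses that forces an early stop regardless of the request budget.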