Eval Settings¶
Complete parameter reference for --eval mode (Lab Max-Score Mode), covering every setting that changes relative to a standard pentest engagement.
Activation¶
The --eval flag:
- Sets the `LAB_MAX_SCORE=true` environment variable
- Implies `--fast` (skips Phase 1 recon, disables stealth)
- Skips the `/intake` prompt (labs have no business context)
- Skips authorization confirmation (labs are self-owned)
- Auto-loads credentials from `lab-config.json`
- Records `EVAL_START_TS` for duration tracking
- Auto-runs `lab-scorer.py` after report generation
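The activation steps above can be sketched as a shell function. Only `LAB_MAX_SCORE` and `EVAL_START_TS` are documented names; the other variables are hypothetical internal toggles, not the real implementation:

```shell
# Sketch of --eval activation. FAST_MODE, SKIP_INTAKE, and SKIP_AUTH_CONFIRM
# are illustrative assumptions about the orchestrator's internals.
activate_eval_mode() {
  export LAB_MAX_SCORE=true
  export EVAL_START_TS=$(date +%s)   # recorded for duration tracking
  FAST_MODE=1          # implied --fast: skip Phase 1 recon, disable stealth
  SKIP_INTAKE=1        # labs have no business context
  SKIP_AUTH_CONFIRM=1  # labs are self-owned
}
```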
Parameter Comparison¶
Budgets and Limits¶
| Parameter | Default | Eval Mode | Multiplier |
|---|---|---|---|
| Requests per skill | 500 | 1,500 | 3x |
| Skill timeout | 45 min | 90 min | 2x |
| test-injection timeout | 60 min | 120 min | 2x |
| Max-turns per agent | 250 | 500 | 2x |
| RRE escalations | 3 | 6 | 2x |
| RRE queries per escalation | 5 | 8 | 1.6x |
| RRE attempts per escalation | 10 | 15 | 1.5x |
| RRE total queries | 15 | 30 | 2x |
| RRE total attempts | 30 | 60 | 2x |
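A minimal sketch of how the orchestrator might select between the two budget columns, assuming it branches on `LAB_MAX_SCORE` (the helper name is illustrative; timeouts are in seconds):

```shell
# Select budgets per the table above: 45 min = 2700 s, 90 min = 5400 s.
set_budgets() {
  if [ "$LAB_MAX_SCORE" = "true" ]; then
    MAX_REQUESTS=1500; SKILL_TIMEOUT=5400; MAX_TURNS=500
  else
    MAX_REQUESTS=500;  SKILL_TIMEOUT=2700; MAX_TURNS=250
  fi
}
```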
Wave Execution¶
| Parameter | Default | Eval Mode | Effect |
|---|---|---|---|
| Agents per wave | 3 | 5 | Wider waves, fewer total waves |
| Total waves | ~12 | ~7-8 | Same 31 agents, compressed schedule |
| Thinking budget | 1.0x | 1.5x | +50% on all agents |
| JITTER_MULT | N_CONCURRENT (3) | 1 | No stealth on local labs |
| Stealth | ON | OFF | No WAF, no rate limits |
Coverage Gates¶
| Parameter | Default | Eval Mode | Effect |
|---|---|---|---|
| Coverage gate | 80% | 95% | Must test 95% of discovered endpoints |
| Endpoint split threshold | 20 | 10 | Splits into parallel agents at 10 endpoints |
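The gate itself reduces to a ratio check. A sketch assuming tested and discovered endpoint counts are tracked (the helper name is hypothetical; `COMPLETION_COVERAGE_GATE` is the documented variable):

```shell
# Succeeds when tested/discovered coverage meets the gate (default 0.80).
coverage_gate_met() {
  awk -v t="$1" -v d="$2" -v g="${COMPLETION_COVERAGE_GATE:-0.80}" \
    'BEGIN { exit !(d > 0 && t / d >= g) }'
}
```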
Initialization Sequence¶
graph TD
A["Parse --eval flag"] --> B["Set LAB_MAX_SCORE=true"]
B --> C["Imply --fast<br/>(skip recon, disable stealth)"]
C --> D["Record EVAL_START_TS"]
D --> E["Detect lab from target URL"]
E --> F["Load lab-config.json"]
F --> G["Auto-create credentials.json"]
G --> H["Skip authorization prompt"]
H --> I["Begin Phase 0"]
style A fill:#4a148c,color:#fff
style B fill:#6a1b9a,color:#fff
style C fill:#7b1fa2,color:#fff
style D fill:#8e24aa,color:#fff
style E fill:#9c27b0,color:#fff
style F fill:#ab47bc,color:#fff
style G fill:#0277bd,color:#fff
style H fill:#00838f,color:#fff
style I fill:#00695c,color:#fff
Credential Auto-Loading¶
The lab name is extracted from the target URL (e.g., http://vulnhr.test:7331/ becomes vulnhr). The corresponding evals/labs/&lt;lab&gt;/lab-config.json is loaded, and credentials_template is written to the engagement's credentials.json.
LAB_NAME=$(echo "$TARGET_URL" | sed 's|https\?://||;s|[:/].*||;s|\..*$||')
LAB_CONFIG="evals/labs/$LAB_NAME/lab-config.json"
# Extract credentials_template → write credentials.json
jq '.credentials_template' "$LAB_CONFIG" > credentials.json
Budget Distribution¶
Per-Skill Request Budget¶
With 1,500 requests per skill and 5 agents per wave, each agent operates independently with its own budget slice, kill-switch counters, and timeout.
Thinking Budget Scaling¶
The LAB_THINKING_MULTIPLIER (default 1.5) is applied in dispatch_agent() and increases per-agent thinking tokens:
| Model Tier | Default Thinking | Eval Thinking |
|---|---|---|
| Opus HIGH | 16,000 | 24,000 |
| Opus MEDIUM | 10,000 | 15,000 |
| Haiku | 4,000 | 6,000 |
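The scaling can be sketched with awk; the base budgets come from the table above, and the helper name is an assumption about dispatch_agent()'s internals:

```shell
# Scale a tier's base thinking budget by LAB_THINKING_MULTIPLIER (default 1.0).
scaled_thinking() {
  awk -v base="$1" -v mult="${LAB_THINKING_MULTIPLIER:-1.0}" \
    'BEGIN { printf "%d\n", base * mult }'
}
```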
Runtime Research Escalation (RRE) in Eval Mode¶
RRE is enabled by default in eval mode (can be disabled with --no-research). Doubled budgets allow more aggressive technique discovery:
| Parameter | Default | Eval | Description |
|---|---|---|---|
| Escalations | 3 | 6 | Times RRE can trigger per skill |
| Queries/escalation | 5 | 8 | Web searches per trigger |
| Attempts/escalation | 10 | 15 | Technique attempts per trigger |
| Total queries | 15 | 30 | Cap across all escalations |
| Total attempts | 30 | 60 | Cap across all escalations |
Trigger Condition¶
RRE triggers when confidence is 30-79% with strong signal but exhausted static techniques. Multi-source search: X/Twitter, security blogs, Reddit, YouTube, GitHub. All techniques pass through the 4-layer safety validator before execution.
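The trigger condition can be sketched as a gate function; `MAX_RESEARCH_ESCALATIONS` is the documented variable, while the function name, confidence encoding, and counter name are illustrative assumptions:

```shell
# RRE fires only in the 30-79% confidence band, after static techniques are
# exhausted, and while the escalation budget remains.
should_trigger_rre() {
  confidence=$1 static_exhausted=$2
  [ "$confidence" -ge 30 ] && [ "$confidence" -le 79 ] &&
  [ "$static_exhausted" = "true" ] &&
  [ "${ESCALATIONS_USED:-0}" -lt "${MAX_RESEARCH_ESCALATIONS:-3}" ]
}
```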
Wave Schedule Comparison¶
Default Mode (3 agents/wave, ~12 waves)¶
Wave 0: test-crypto, test-supply-chain, test-exceptions (Tier 1)
Wave 1: test-cloud (Tier 2)
Wave 2: test-injection --scope sqli, test-injection --scope xss, test-auth --scope jwt
Wave 3: test-injection --scope ssti-xxe, test-injection --scope cmdi, test-auth --scope oauth
...
Wave 11: test-advanced --scope host-method, chain-findings, verify
Eval Mode (5 agents/wave, ~7-8 waves)¶
Same 31 sub-agents, compressed into fewer, wider waves:
Wave 0: test-crypto, test-supply-chain, test-exceptions, test-cloud, test-injection --scope sqli
Wave 1: test-injection --scope xss, test-injection --scope ssti-xxe, test-injection --scope cmdi,
test-auth --scope jwt, test-auth --scope oauth
...
Wave 6-7: verify, chain-findings + remaining
Stealth Configuration¶
Eval mode disables all stealth mechanisms:
| Feature | Default | Eval |
|---|---|---|
| Chrome User-Agent | Yes | No (tool defaults) |
| curl-impersonate TLS | Yes | No (standard curl) |
| Random jitter | 1-4s | None |
| Wordlist shuffle | Yes | No |
| Phase breaks | 30s | None |
| DNS-over-HTTPS | Yes | No |
| Cookie jar warming | Yes | No |
| Rate limiting | 1-3 req/sec | 50 req/sec |
| Concurrency | 2 threads | 25 threads |
| JITTER_MULT | N_CONCURRENT | 1 |
This is safe because local labs have no WAF, no rate limiting, and no detection systems.
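A hypothetical toggle mirroring the table; `EVAL_MODE`, `JITTER_MULT`, and `N_CONCURRENT_AGENTS` are documented variables, while the rest are illustrative:

```shell
# Disable stealth wholesale for local labs; restore defaults otherwise.
configure_stealth() {
  if [ "$EVAL_MODE" = "true" ]; then
    JITTER_MULT=1; RATE_LIMIT=50; THREADS=25   # no WAF, no rate limits
  else
    JITTER_MULT="${N_CONCURRENT_AGENTS:-3}"; RATE_LIMIT=3; THREADS=2
  fi
}
```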
Auto-Scoring¶
After the report phase, --eval automatically invokes the lab scorer:
EVAL_END_TS=$(date +%s)
EVAL_DURATION=$((EVAL_END_TS - EVAL_START_TS))
python evals/lab-scorer.py "$LAB_NAME" "$EDIR" \
--save --html --narrative --duration "$EVAL_DURATION"
Output Files¶
| File | Location | Content |
|---|---|---|
| Scoring result | evals/labs/&lt;lab&gt;/history/&lt;timestamp&gt;.json | TP/FN/FP counts, per-vuln matches, duration |
| HTML report | evals/labs/&lt;lab&gt;/score_report.html | Visual score report |
| Gap analysis | evals/labs/&lt;lab&gt;/gap-analysis.md | Per-vulnerability FN analysis by skill |
Scoring Metadata¶
{
"lab_id": "vulnhr",
"start_ts": 1234567890,
"end_ts": 1234580000,
"duration_seconds": 12110,
"tokens_input": 1200000,
"tokens_output": 340000,
"cache_read_input_tokens": 50000,
"cache_creation_input_tokens": 30000,
"cost_usd": 12.30,
"num_turns": 245,
"status": "completed"
}
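The metadata is self-consistent: duration_seconds equals end_ts minus start_ts, which a consumer can recompute with plain shell arithmetic:

```shell
# Recompute the duration from the recorded timestamps in the example above.
START_TS=1234567890
END_TS=1234580000
DURATION=$((END_TS - START_TS))   # matches duration_seconds: 12110
```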
Parallel Eval Suite¶
Use /labs-eval to benchmark across all labs simultaneously:
/labs-eval # All labs, parallel (one terminal tab each)
/labs-eval --only vulnhr,dvwa # Subset
/labs-eval --sequential # One at a time
/labs-eval --results # Aggregate latest scores
/labs-eval --results 2026-03-13_14 # Aggregate specific run
Parallel Execution Model¶
- Generate wrapper scripts per lab in `evals/runs/<timestamp>/`
- Open terminal tabs (Windows Terminal / macOS Terminal / gnome-terminal)
- Each tab runs `claude -p "/pentest <target> --eval"` independently
- Token usage captured from `--output-format stream-json`
- `lab-scorer.py` runs automatically per lab
- `--results` aggregates all scores into a suite report
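Wrapper generation can be sketched as follows; the lab list and `RUNS_BASE` override are illustrative (the real lab list comes from evals/labs/):

```shell
# Generate one wrapper script per lab under evals/runs/<timestamp>/.
RUN_DIR="${RUNS_BASE:-evals/runs}/$(date +%Y-%m-%d_%H)"
mkdir -p "$RUN_DIR"
for lab in vulnhr dvwa; do
  cat > "$RUN_DIR/run-$lab.sh" <<EOF
#!/bin/sh
claude -p "/pentest http://$lab.test --eval" --output-format stream-json
EOF
  chmod +x "$RUN_DIR/run-$lab.sh"
done
```

Each tab then executes its wrapper independently, so a slow lab never blocks the others.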
Suite Report¶
The aggregation produces:
- Cross-lab score comparison table
- Per-skill weakness analysis (which skills miss the most across labs)
- Token consumption and cost per lab
- Duration comparison
- Comparison against previous runs (trend detection)
Environment Variables¶
All eval-mode settings are propagated via environment variables to sub-agents:
| Variable | Value | Description |
|---|---|---|
| EVAL_MODE | true | Master eval flag |
| LAB_MAX_SCORE | true | Activates boosted budgets |
| MAX_REQUESTS | 1500 | 3x request budget |
| SKILL_TIMEOUT | 5400 | 90 min timeout |
| N_CONCURRENT_AGENTS | 5 | Agents per wave |
| JITTER_MULT | 1 | No stealth scaling |
| LAB_THINKING_MULTIPLIER | 1.5 | +50% thinking |
| COMPLETION_COVERAGE_GATE | 0.95 | 95% coverage required |
| ENDPOINT_SPLIT_THRESHOLD | 10 | Lower split point |
| LAB_MAX_TURNS | 500 | 2x max-turns |
| MAX_RESEARCH_ESCALATIONS | 6 | 2x RRE budget |
| MAX_RESEARCH_QUERIES_PER_ESC | 8 | Queries per trigger |
| MAX_RESEARCH_ATTEMPTS_PER_ESC | 15 | Attempts per trigger |
| MAX_RESEARCH_QUERIES_TOTAL | 30 | Total query cap |
| MAX_RESEARCH_ATTEMPTS_TOTAL | 60 | Total attempt cap |
| EVAL_START_TS | Unix timestamp | For duration calculation |