Eval Settings¶
Complete parameter reference for --eval mode (Lab Max-Score Mode), covering every setting that changes relative to a standard pentest engagement.
Activation¶
The --eval flag:
- Sets the `LAB_MAX_SCORE=true` environment variable
- Implies `--fast` (skips Phase 1 recon, disables stealth)
- Skips the `/intake` prompt (labs have no business context)
- Skips authorization confirmation (labs are self-owned)
- Auto-loads credentials from `lab-config.json`
- Records `EVAL_START_TS` for duration tracking
- Auto-runs `lab-scorer.py` after report generation
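The activation steps above can be sketched as a shell function. Only `LAB_MAX_SCORE` and `EVAL_START_TS` are documented names; the other variables are hypothetical internal toggles, not the real implementation:

```shell
# Sketch of --eval activation. FAST_MODE, SKIP_INTAKE, and SKIP_AUTH_CONFIRM
# are illustrative assumptions about the orchestrator's internals.
activate_eval_mode() {
  export LAB_MAX_SCORE=true
  export EVAL_START_TS=$(date +%s)   # recorded for duration tracking
  FAST_MODE=1          # implied --fast: skip Phase 1 recon, disable stealth
  SKIP_INTAKE=1        # labs have no business context
  SKIP_AUTH_CONFIRM=1  # labs are self-owned
}
```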
Parameter Comparison¶
Budgets and Limits¶
| Parameter | Default | Eval Mode | Multiplier |
|---|---|---|---|
| Requests per skill | 500 | 1,500 | 3x |
| Skill timeout | 45 min | 90 min | 2x |
| test-injection timeout | 60 min | 120 min | 2x |
| Max-turns per agent | 250 | 500 | 2x |
| RRE escalations | 3 | 6 | 2x |
| RRE queries per escalation | 5 | 8 | 1.6x |
| RRE attempts per escalation | 10 | 15 | 1.5x |
| RRE total queries | 15 | 30 | 2x |
| RRE total attempts | 30 | 60 | 2x |
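A minimal sketch of how the orchestrator might select between the two budget columns, assuming it branches on `LAB_MAX_SCORE` (the helper name is illustrative; timeouts are in seconds):

```shell
# Select budgets per the table above: 45 min = 2700 s, 90 min = 5400 s.
set_budgets() {
  if [ "$LAB_MAX_SCORE" = "true" ]; then
    MAX_REQUESTS=1500; SKILL_TIMEOUT=5400; MAX_TURNS=500
  else
    MAX_REQUESTS=500;  SKILL_TIMEOUT=2700; MAX_TURNS=250
  fi
}
```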
Wave Execution¶
| Parameter | Default | Eval Mode | Effect |
|---|---|---|---|
| Agents per wave | 3 | 5 | Wider waves, fewer total waves |
| Total waves | ~12 | ~7-8 | Same 31 agents, compressed schedule |
| Thinking budget | 1.0x | 1.5x | +50% on all agents |
| JITTER_MULT | N_CONCURRENT (3) | 1 | No stealth on local labs |
| Stealth | ON | OFF | No WAF, no rate limits |
Coverage Gates¶
| Parameter | Default | Eval Mode | Effect |
|---|---|---|---|
| Coverage gate | 80% | 95% | Must test 95% of discovered endpoints |
| Endpoint split threshold | 20 | 10 | Splits into parallel agents at 10 endpoints |
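The gate itself reduces to a ratio check. A sketch assuming tested and discovered endpoint counts are tracked (the helper name is hypothetical; `COMPLETION_COVERAGE_GATE` is the documented variable):

```shell
# Succeeds when tested/discovered coverage meets the gate (default 0.80).
coverage_gate_met() {
  awk -v t="$1" -v d="$2" -v g="${COMPLETION_COVERAGE_GATE:-0.80}" \
    'BEGIN { exit !(d > 0 && t / d >= g) }'
}
```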
Initialization Sequence¶
graph TD
A["Parse --eval flag"] --> B["Set LAB_MAX_SCORE=true"]
B --> C["Imply --fast<br/>(skip recon, disable stealth)"]
C --> D["Record EVAL_START_TS"]
D --> E["Detect lab from target URL"]
E --> F["Load lab-config.json"]
F --> G["Auto-create credentials.json"]
G --> H["Skip authorization prompt"]
H --> I["Begin Phase 0"]
style A fill:#4a148c,color:#fff
style B fill:#6a1b9a,color:#fff
style C fill:#7b1fa2,color:#fff
style D fill:#8e24aa,color:#fff
style E fill:#9c27b0,color:#fff
style F fill:#ab47bc,color:#fff
style G fill:#0277bd,color:#fff
style H fill:#00838f,color:#fff
style I fill:#00695c,color:#fff
Credential Auto-Loading¶
The lab name is extracted from the target URL (e.g., http://vulnhr.test:7331/ becomes vulnhr). The corresponding evals/labs/&lt;lab&gt;/lab-config.json is loaded, and credentials_template is written to the engagement's credentials.json.
LAB_NAME=$(echo "$TARGET_URL" | sed 's|https\?://||;s|[:/].*||;s|\..*$||')
LAB_CONFIG="evals/labs/$LAB_NAME/lab-config.json"
# Extract credentials_template → write credentials.json
jq '.credentials_template' "$LAB_CONFIG" > credentials.json
Budget Distribution¶
Per-Skill Request Budget¶
With 1,500 requests per skill and 5 agents per wave, each agent operates independently with its own budget slice, kill-switch counters, and timeout.
Thinking Budget Scaling¶
The LAB_THINKING_MULTIPLIER (default 1.5) is applied in dispatch_agent() and increases per-agent thinking tokens:
| Model Tier | Default Thinking | Eval Thinking |
|---|---|---|
| Opus HIGH | 16,000 | 24,000 |
| Opus MEDIUM | 10,000 | 15,000 |
| Haiku | 4,000 | 6,000 |
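The scaling can be sketched with awk; the base budgets come from the table above, and the helper name is an assumption about dispatch_agent()'s internals:

```shell
# Scale a tier's base thinking budget by LAB_THINKING_MULTIPLIER (default 1.0).
scaled_thinking() {
  awk -v base="$1" -v mult="${LAB_THINKING_MULTIPLIER:-1.0}" \
    'BEGIN { printf "%d\n", base * mult }'
}
```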
Runtime Research Escalation (RRE) in Eval Mode¶
RRE is enabled by default in eval mode (can be disabled with --no-research). Doubled budgets allow more aggressive technique discovery:
| Parameter | Default | Eval | Description |
|---|---|---|---|
| Escalations | 3 | 6 | Times RRE can trigger per skill |
| Queries/escalation | 5 | 8 | Web searches per trigger |
| Attempts/escalation | 10 | 15 | Technique attempts per trigger |
| Total queries | 15 | 30 | Cap across all escalations |
| Total attempts | 30 | 60 | Cap across all escalations |
Trigger Condition¶
RRE triggers when confidence is 30-79% with strong signal but exhausted static techniques. Multi-source search: X/Twitter, security blogs, Reddit, YouTube, GitHub. All techniques pass through the 4-layer safety validator before execution.
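The trigger condition can be sketched as a gate function; `MAX_RESEARCH_ESCALATIONS` is the documented variable, while the function name, confidence encoding, and counter name are illustrative assumptions:

```shell
# RRE fires only in the 30-79% confidence band, after static techniques are
# exhausted, and while the escalation budget remains.
should_trigger_rre() {
  confidence=$1 static_exhausted=$2
  [ "$confidence" -ge 30 ] && [ "$confidence" -le 79 ] &&
  [ "$static_exhausted" = "true" ] &&
  [ "${ESCALATIONS_USED:-0}" -lt "${MAX_RESEARCH_ESCALATIONS:-3}" ]
}
```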
Wave Schedule Comparison¶
Default Mode (3 agents/wave, ~12 waves)¶
Wave 0: test-crypto, test-supply-chain, test-exceptions (Tier 1)
Wave 1: test-cloud (Tier 2)
Wave 2: test-injection --scope sqli, test-injection --scope xss, test-auth --scope jwt
Wave 3: test-injection --scope ssti-xxe, test-injection --scope cmdi, test-auth --scope oauth
...
Wave 11: test-advanced --scope host-method, chain-findings, verify
Eval Mode (5 agents/wave, ~7-8 waves)¶
Same 31 sub-agents, compressed into fewer, wider waves:
Wave 0: test-crypto, test-supply-chain, test-exceptions, test-cloud, test-injection --scope sqli
Wave 1: test-injection --scope xss, test-injection --scope ssti-xxe, test-injection --scope cmdi,
test-auth --scope jwt, test-auth --scope oauth
...
Wave 6-7: verify, chain-findings + remaining
Stealth Configuration¶
Eval mode disables all stealth mechanisms:
| Feature | Default | Eval |
|---|---|---|
| Chrome User-Agent | Yes | No (tool defaults) |
| curl-impersonate TLS | Yes | No (standard curl) |
| Random jitter | 1-4s | None |
| Wordlist shuffle | Yes | No |
| Phase breaks | 30s | None |
| DNS-over-HTTPS | Yes | No |
| Cookie jar warming | Yes | No |
| Rate limiting | 1-3 req/sec | 50 req/sec |
| Concurrency | 2 threads | 25 threads |
| JITTER_MULT | N_CONCURRENT | 1 |
This is safe because local labs have no WAF, no rate limiting, and no detection systems.
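A hypothetical toggle mirroring the table; `EVAL_MODE`, `JITTER_MULT`, and `N_CONCURRENT_AGENTS` are documented variables, while the rest are illustrative:

```shell
# Disable stealth wholesale for local labs; restore defaults otherwise.
configure_stealth() {
  if [ "$EVAL_MODE" = "true" ]; then
    JITTER_MULT=1; RATE_LIMIT=50; THREADS=25   # no WAF, no rate limits
  else
    JITTER_MULT="${N_CONCURRENT_AGENTS:-3}"; RATE_LIMIT=3; THREADS=2
  fi
}
```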
Auto-Scoring¶
After the report phase, --eval automatically invokes the lab scorer:
EVAL_END_TS=$(date +%s)
EVAL_DURATION=$((EVAL_END_TS - EVAL_START_TS))
python evals/lab-scorer.py "$LAB_NAME" "$EDIR" \
--save --html --narrative --duration "$EVAL_DURATION"
Output Files¶
| File | Location | Content |
|---|---|---|
| Scoring result | evals/labs/&lt;lab&gt;/history/&lt;timestamp&gt;.json | TP/FN/FP counts, per-vuln matches, duration |
| HTML report | evals/labs/&lt;lab&gt;/score_report.html | Visual score report |
| Gap analysis | evals/labs/&lt;lab&gt;/gap-analysis.md | Per-vulnerability FN analysis by skill |
Scoring Metadata¶
{
"lab_id": "vulnhr",
"start_ts": 1234567890,
"end_ts": 1234580000,
"duration_seconds": 12110,
"tokens_input": 1200000,
"tokens_output": 340000,
"cache_read_input_tokens": 50000,
"cache_creation_input_tokens": 30000,
"cost_usd": 12.30,
"num_turns": 245,
"status": "completed"
}
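The metadata is self-consistent: duration_seconds equals end_ts minus start_ts, which a consumer can recompute with plain shell arithmetic:

```shell
# Recompute the duration from the recorded timestamps in the example above.
START_TS=1234567890
END_TS=1234580000
DURATION=$((END_TS - START_TS))   # matches duration_seconds: 12110
```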
Parallel Eval Suite¶
Use /labs-eval to benchmark across all labs simultaneously:
/labs-eval # All labs, parallel (one terminal tab each)
/labs-eval --only vulnhr,dvwa # Subset
/labs-eval --sequential # One at a time
/labs-eval --results # Aggregate latest scores
/labs-eval --results 2026-03-13_14 # Aggregate specific run
Parallel Execution Model¶
- Generate wrapper scripts per lab in `evals/runs/<timestamp>/`
- Open terminal tabs (Windows Terminal / macOS Terminal / gnome-terminal)
- Each tab runs `claude -p "/pentest <target> --eval"` independently
- Token usage captured from `--output-format stream-json`
- `lab-scorer.py` runs automatically per lab
- `--results` aggregates all scores into a suite report
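Wrapper generation can be sketched as follows; the lab list and `RUNS_BASE` override are illustrative (the real lab list comes from evals/labs/):

```shell
# Generate one wrapper script per lab under evals/runs/<timestamp>/.
RUN_DIR="${RUNS_BASE:-evals/runs}/$(date +%Y-%m-%d_%H)"
mkdir -p "$RUN_DIR"
for lab in vulnhr dvwa; do
  cat > "$RUN_DIR/run-$lab.sh" <<EOF
#!/bin/sh
claude -p "/pentest http://$lab.test --eval" --output-format stream-json
EOF
  chmod +x "$RUN_DIR/run-$lab.sh"
done
```

Each tab then executes its wrapper independently, so a slow lab never blocks the others.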
Suite Report¶
The aggregation produces:
- Cross-lab score comparison table
- Per-skill weakness analysis (which skills miss the most across labs)
- Token consumption and cost per lab
- Duration comparison
- Comparison against previous runs (trend detection)
Environment Variables¶
All eval-mode settings are propagated via environment variables to sub-agents:
| Variable | Value | Description |
|---|---|---|
| EVAL_MODE | true | Master eval flag |
| LAB_MAX_SCORE | true | Activates boosted budgets |
| MAX_REQUESTS | 1500 | 3x request budget |
| SKILL_TIMEOUT | 5400 | 90 min timeout |
| N_CONCURRENT_AGENTS | 5 | Agents per wave |
| JITTER_MULT | 1 | No stealth scaling |
| LAB_THINKING_MULTIPLIER | 1.5 | +50% thinking |
| COMPLETION_COVERAGE_GATE | 0.95 | 95% coverage required |
| ENDPOINT_SPLIT_THRESHOLD | 10 | Lower split point |
| LAB_MAX_TURNS | 500 | 2x max-turns |
| MAX_RESEARCH_ESCALATIONS | 6 | 2x RRE budget |
| MAX_RESEARCH_QUERIES_PER_ESC | 8 | Queries per trigger |
| MAX_RESEARCH_ATTEMPTS_PER_ESC | 15 | Attempts per trigger |
| MAX_RESEARCH_QUERIES_TOTAL | 30 | Total query cap |
| MAX_RESEARCH_ATTEMPTS_TOTAL | 60 | Total attempt cap |
| EVAL_START_TS | Unix timestamp | For duration calculation |