Eval Settings

A complete parameter reference for --eval mode (Lab Max-Score Mode), covering every setting that changes relative to a standard pentest engagement.


Activation

```
/pentest http://target.test:PORT --eval
```

The --eval flag:

  1. Sets LAB_MAX_SCORE=true environment variable
  2. Implies --fast (skips Phase 1 recon, disables stealth)
  3. Skips /intake prompt (labs have no business context)
  4. Skips authorization confirmation (labs are self-owned)
  5. Auto-loads credentials from lab-config.json
  6. Records EVAL_START_TS for duration tracking
  7. Auto-runs lab-scorer.py after report generation

Parameter Comparison

Budgets and Limits

| Parameter | Default | Eval Mode | Multiplier |
|---|---|---|---|
| Requests per skill | 500 | 1,500 | 3x |
| Skill timeout | 45 min | 90 min | 2x |
| test-injection timeout | 60 min | 120 min | 2x |
| Max-turns per agent | 250 | 500 | 2x |
| RRE escalations | 3 | 6 | 2x |
| RRE queries per escalation | 5 | 8 | 1.6x |
| RRE attempts per escalation | 10 | 15 | 1.5x |
| RRE total queries | 15 | 30 | 2x |
| RRE total attempts | 30 | 60 | 2x |

Wave Execution

| Parameter | Default | Eval Mode | Effect |
|---|---|---|---|
| Agents per wave | 3 | 5 | Wider waves, fewer total waves |
| Total waves | ~12 | ~7-8 | Same 31 agents, compressed schedule |
| Thinking budget | 1.0x | 1.5x | +50% on all agents |
| `JITTER_MULT` | N_CONCURRENT (3) | 1 | No stealth on local labs |
| Stealth | ON | OFF | No WAF, no rate limits |

Coverage Gates

| Parameter | Default | Eval Mode | Effect |
|---|---|---|---|
| Coverage gate | 80% | 95% | Must test 95% of discovered endpoints |
| Endpoint split threshold | 20 | 10 | Splits into parallel agents at 10 endpoints |
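The split can be sketched as ceiling division of discovered endpoints by the threshold. This is an illustrative sketch only; the helper name `split_count` and the exact chunking arithmetic are assumptions, not the real implementation:

```shell
# Hypothetical sketch: how many parallel agents a skill might be split into
# once the discovered-endpoint count crosses ENDPOINT_SPLIT_THRESHOLD.
split_count() {
    endpoints=$1
    threshold=${ENDPOINT_SPLIT_THRESHOLD:-10}
    if [ "$endpoints" -lt "$threshold" ]; then
        echo 1                                               # below threshold: single agent
    else
        echo $(( (endpoints + threshold - 1) / threshold ))  # ceiling division
    fi
}

split_count 8    # prints 1 (under the eval-mode threshold of 10)
split_count 25   # prints 3 (split across three parallel agents)
```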

Initialization Sequence

```mermaid
graph TD
    A["Parse --eval flag"] --> B["Set LAB_MAX_SCORE=true"]
    B --> C["Imply --fast<br/>(skip recon, disable stealth)"]
    C --> D["Record EVAL_START_TS"]
    D --> E["Detect lab from target URL"]
    E --> F["Load lab-config.json"]
    F --> G["Auto-create credentials.json"]
    G --> H["Skip authorization prompt"]
    H --> I["Begin Phase 0"]

    style A fill:#4a148c,color:#fff
    style B fill:#6a1b9a,color:#fff
    style C fill:#7b1fa2,color:#fff
    style D fill:#8e24aa,color:#fff
    style E fill:#9c27b0,color:#fff
    style F fill:#ab47bc,color:#fff
    style G fill:#0277bd,color:#fff
    style H fill:#00838f,color:#fff
    style I fill:#00695c,color:#fff
```

Credential Auto-Loading

The lab name is extracted from the target URL (e.g., http://vulnhr.test:7331/ becomes vulnhr). The corresponding evals/labs/<lab>/lab-config.json is loaded, and credentials_template is written to the engagement's credentials.json.

```shell
# Strip the scheme, then the port/path, then the .test suffix:
# http://vulnhr.test:7331/ -> vulnhr
LAB_NAME=$(echo "$TARGET_URL" | sed 's|https\?://||;s|[:/].*||;s|\..*||')
LAB_CONFIG="evals/labs/$LAB_NAME/lab-config.json"
# Extract credentials_template → write credentials.json, e.g.:
jq '.credentials_template' "$LAB_CONFIG" > credentials.json
```

Budget Distribution

Per-Skill Request Budget

With 1,500 requests per skill and 5 agents per wave:

1,500 requests / 5 agents = 300 requests per agent

Each agent operates independently with its own budget slice, kill switch counters, and timeout.
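The division above follows directly from the documented environment variables:

```shell
# Per-agent slice of the eval-mode request budget, derived from the
# documented MAX_REQUESTS and N_CONCURRENT_AGENTS variables.
MAX_REQUESTS=${MAX_REQUESTS:-1500}
N_CONCURRENT_AGENTS=${N_CONCURRENT_AGENTS:-5}

AGENT_BUDGET=$(( MAX_REQUESTS / N_CONCURRENT_AGENTS ))
echo "$AGENT_BUDGET"   # prints 300
```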

Thinking Budget Scaling

The LAB_THINKING_MULTIPLIER (default 1.5) is applied in dispatch_agent():

```shell
if [ "$LAB_MAX_SCORE" = "true" ]; then
    # Bash arithmetic is integer-only, so scale by 3/2 rather than 1.5
    thinking=$(( thinking * 3 / 2 ))
fi
```

This increases per-agent thinking tokens:

| Model Tier | Default Thinking | Eval Thinking |
|---|---|---|
| Opus HIGH | 16,000 | 24,000 |
| Opus MEDIUM | 10,000 | 15,000 |
| Haiku | 4,000 | 6,000 |
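If the fractional LAB_THINKING_MULTIPLIER itself needs to be honored (rather than the hard-coded 3/2), shell arithmetic has to go through awk or bc, since `$(( ))` is integer-only. A sketch, with `scale_thinking` as an illustrative helper name:

```shell
# Apply a fractional multiplier to a thinking-token budget via awk.
LAB_THINKING_MULTIPLIER=${LAB_THINKING_MULTIPLIER:-1.5}

scale_thinking() {
    awk -v t="$1" -v m="$LAB_THINKING_MULTIPLIER" 'BEGIN { printf "%d\n", t * m }'
}

scale_thinking 16000   # Opus HIGH   -> prints 24000
scale_thinking 10000   # Opus MEDIUM -> prints 15000
scale_thinking 4000    # Haiku       -> prints 6000
```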

Runtime Research Escalation (RRE) in Eval Mode

RRE is enabled by default in eval mode (can be disabled with --no-research). Doubled budgets allow more aggressive technique discovery:

| Parameter | Default | Eval | Description |
|---|---|---|---|
| Escalations | 3 | 6 | Times RRE can trigger per skill |
| Queries/escalation | 5 | 8 | Web searches per trigger |
| Attempts/escalation | 10 | 15 | Technique attempts per trigger |
| Total queries | 15 | 30 | Cap across all escalations |
| Total attempts | 30 | 60 | Cap across all escalations |

Trigger Condition

RRE triggers when confidence sits in the 30-79% range: the signal is strong enough to keep pursuing, but all static techniques have been exhausted. Searches draw on multiple sources (X/Twitter, security blogs, Reddit, YouTube, GitHub), and every discovered technique passes through the 4-layer safety validator before execution.
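The trigger condition reduces to a simple band check. A hypothetical sketch; the function and parameter names are illustrative, not the actual implementation:

```shell
# Sketch of the RRE trigger: confidence in the 30-79% band AND no static
# techniques left to try.
should_trigger_rre() {
    confidence=$1          # integer percent
    static_remaining=$2    # count of untried static techniques
    if [ "$confidence" -ge 30 ] && [ "$confidence" -le 79 ] && [ "$static_remaining" -eq 0 ]; then
        echo yes
    else
        echo no
    fi
}

should_trigger_rre 65 0   # prints yes (strong signal, static exhausted)
should_trigger_rre 85 0   # prints no  (high confidence, no research needed)
should_trigger_rre 50 4   # prints no  (static techniques still untried)
```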


Wave Schedule Comparison

Default Mode (3 agents/wave, ~12 waves)

```
Wave 0:  test-crypto, test-supply-chain, test-exceptions     (Tier 1)
Wave 1:  test-cloud                                          (Tier 2)
Wave 2:  test-injection --scope sqli, test-injection --scope xss, test-auth --scope jwt
Wave 3:  test-injection --scope ssti-xxe, test-injection --scope cmdi, test-auth --scope oauth
...
Wave 11: test-advanced --scope host-method, chain-findings, verify
```

Eval Mode (5 agents/wave, ~7-8 waves)

Same 31 sub-agents, compressed into fewer, wider waves:

```
Wave 0:  test-crypto, test-supply-chain, test-exceptions, test-cloud, test-injection --scope sqli
Wave 1:  test-injection --scope xss, test-injection --scope ssti-xxe, test-injection --scope cmdi,
         test-auth --scope jwt, test-auth --scope oauth
...
Wave 6-7: verify, chain-findings + remaining
```
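Ignoring tier-ordering constraints, the wave counts follow from ceiling division of the 31 sub-agents by the wave width; tier ordering adds a wave or two on top, which is why the document quotes ~12 and ~7-8:

```shell
# Lower bound on wave count: ceil(31 / agents-per-wave).
waves() { echo $(( (31 + $1 - 1) / $1 )); }

waves 3   # prints 11 (default mode; ~12 once tier constraints apply)
waves 5   # prints 7  (eval mode; ~7-8 once tier constraints apply)
```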

Stealth Configuration

Eval mode disables all stealth mechanisms:

| Feature | Default | Eval |
|---|---|---|
| Chrome User-Agent | Yes | No (tool defaults) |
| curl-impersonate TLS | Yes | No (standard curl) |
| Random jitter | 1-4s | None |
| Wordlist shuffle | Yes | No |
| Phase breaks | 30s | None |
| DNS-over-HTTPS | Yes | No |
| Cookie jar warming | Yes | No |
| Rate limiting | 1-3 req/sec | 50 req/sec |
| Concurrency | 2 threads | 25 threads |
| `JITTER_MULT` | N_CONCURRENT | 1 |

This is safe because local labs have no WAF, no rate limiting, and no detection systems.
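The jitter collapse can be sketched as follows. This is an assumption about the mechanism: everything except `EVAL_MODE` and `JITTER_MULT` (which appear in the tables) is an illustrative name:

```shell
# Hypothetical sketch: the per-request stealth delay drops to zero in eval
# mode; otherwise a 1-4s random jitter is scaled by JITTER_MULT.
stealth_delay() {
    if [ "${EVAL_MODE:-false}" = "true" ]; then
        echo 0                                   # local labs: no jitter at all
    else
        base=$(( (RANDOM % 4) + 1 ))             # 1-4 s random jitter
        echo $(( base * ${JITTER_MULT:-3} ))     # scaled by concurrency
    fi
}

EVAL_MODE=true
stealth_delay   # prints 0
```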


Auto-Scoring

After the report phase, --eval automatically invokes the lab scorer:

```shell
EVAL_END_TS=$(date +%s)
EVAL_DURATION=$((EVAL_END_TS - EVAL_START_TS))

python evals/lab-scorer.py "$LAB_NAME" "$EDIR" \
    --save --html --narrative --duration "$EVAL_DURATION"
```

Output Files

| File | Location | Content |
|---|---|---|
| Scoring result | `evals/labs/<lab>/history/<timestamp>.json` | TP/FN/FP counts, per-vuln matches, duration |
| HTML report | `evals/labs/<lab>/score_report.html` | Visual score report |
| Gap analysis | `evals/labs/<lab>/gap-analysis.md` | Per-vulnerability FN analysis by skill |

Scoring Metadata

```json
{
  "lab_id": "vulnhr",
  "start_ts": 1234567890,
  "end_ts": 1234580000,
  "duration_seconds": 12110,
  "tokens_input": 1200000,
  "tokens_output": 340000,
  "cache_read_input_tokens": 50000,
  "cache_creation_input_tokens": 30000,
  "cost_usd": 12.30,
  "num_turns": 245,
  "status": "completed"
}
```

Parallel Eval Suite

Use /labs-eval to benchmark across all labs simultaneously:

```
/labs-eval                          # All labs, parallel (one terminal tab each)
/labs-eval --only vulnhr,dvwa       # Subset
/labs-eval --sequential             # One at a time
/labs-eval --results                # Aggregate latest scores
/labs-eval --results 2026-03-13_14  # Aggregate specific run
```

Parallel Execution Model

  1. Generate wrapper scripts per lab in evals/runs/<timestamp>/
  2. Open terminal tabs (Windows Terminal / macOS Terminal / gnome-terminal)
  3. Each tab runs claude -p "/pentest <target> --eval" independently
  4. Token usage captured from --output-format stream-json
  5. lab-scorer.py runs automatically per lab
  6. --results aggregates all scores into a suite report
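Steps 1-3 can be sketched as a wrapper-generation loop. The lab list, port, and script layout here are illustrative assumptions; only the `evals/runs/<timestamp>/` path and the `claude -p` invocation come from the list above:

```shell
# Hypothetical sketch of step 1: one wrapper script per lab under
# evals/runs/<timestamp>/, each launching an independent --eval run.
RUN_DIR="evals/runs/$(date +%Y-%m-%d_%H%M%S)"
mkdir -p "$RUN_DIR"

for lab in vulnhr dvwa; do
    cat > "$RUN_DIR/run-$lab.sh" <<EOF
#!/bin/sh
claude -p "/pentest http://$lab.test:7331 --eval" --output-format stream-json
EOF
    chmod +x "$RUN_DIR/run-$lab.sh"
done

ls "$RUN_DIR"   # lists run-dvwa.sh and run-vulnhr.sh
```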

Suite Report

The aggregation produces:

  • Cross-lab score comparison table
  • Per-skill weakness analysis (which skills miss the most across labs)
  • Token consumption and cost per lab
  • Duration comparison
  • Comparison against previous runs (trend detection)

Environment Variables

All eval-mode settings are propagated via environment variables to sub-agents:

| Variable | Value | Description |
|---|---|---|
| `EVAL_MODE` | true | Master eval flag |
| `LAB_MAX_SCORE` | true | Activates boosted budgets |
| `MAX_REQUESTS` | 1500 | 3x request budget |
| `SKILL_TIMEOUT` | 5400 | 90 min timeout |
| `N_CONCURRENT_AGENTS` | 5 | Agents per wave |
| `JITTER_MULT` | 1 | No stealth scaling |
| `LAB_THINKING_MULTIPLIER` | 1.5 | +50% thinking |
| `COMPLETION_COVERAGE_GATE` | 0.95 | 95% coverage required |
| `ENDPOINT_SPLIT_THRESHOLD` | 10 | Lower split point |
| `LAB_MAX_TURNS` | 500 | 2x max-turns |
| `MAX_RESEARCH_ESCALATIONS` | 6 | 2x RRE budget |
| `MAX_RESEARCH_QUERIES_PER_ESC` | 8 | Queries per trigger |
| `MAX_RESEARCH_ATTEMPTS_PER_ESC` | 15 | Attempts per trigger |
| `MAX_RESEARCH_QUERIES_TOTAL` | 30 | Total query cap |
| `MAX_RESEARCH_ATTEMPTS_TOTAL` | 60 | Total attempt cap |
| `EVAL_START_TS` | Unix timestamp | For duration calculation |
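The propagation could look like the following export block (a sketch: the values mirror the table above, but the export mechanism itself is an assumption about the dispatcher):

```shell
# Sketch: exporting the boosted eval-mode budgets before dispatching sub-agents.
export EVAL_MODE=true
export LAB_MAX_SCORE=true
export MAX_REQUESTS=1500
export SKILL_TIMEOUT=5400            # 90 min, in seconds
export N_CONCURRENT_AGENTS=5
export JITTER_MULT=1
export LAB_THINKING_MULTIPLIER=1.5
export COMPLETION_COVERAGE_GATE=0.95
export ENDPOINT_SPLIT_THRESHOLD=10
export LAB_MAX_TURNS=500
export EVAL_START_TS=$(date +%s)
```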