# Lab Evaluation Framework

## Purpose
The lab framework provides a controlled, repeatable environment for measuring the pentest suite's vulnerability detection capabilities. Each lab is a known-vulnerable application with a curated answer key, enabling objective scoring of true positives, false negatives, and false positives across engagements.
## Lab Targets
Three lab targets are currently registered, covering different tech stacks and vulnerability profiles:
| Lab | Application | Tech Stack | Vulnerabilities | Difficulty Distribution |
|---|---|---|---|---|
| VulnHR | HR Portal (Meridian Solutions) | Laravel/PHP, MySQL | 81 | ~30 easy, ~30 medium, ~20 hard |
| SuperSecureBank | Banking Application | .NET 8, SQL Server | 37 | Mixed |
| AltoroMutual | Banking Application | Spring Boot, React | 29 | Mixed |
### VulnHR

The largest and most comprehensive lab target. Covers 12 OWASP categories plus business logic and additional vulnerability classes. Features 6 user roles with distinct privilege levels and supports 3 authentication methods (form login, LDAP, Sanctum API).

- **Target:** `http://vulnhr.test:7331/`
- **Pentest flags:** `--fast`
- **Roles:** `admin`, `hr_manager`, `hr_specialist`, `manager`, `employee`, `observer`
- **Auth methods:** form-post, LDAP form, Sanctum form-api
### SuperSecureBank
A .NET 8 banking application focused on financial security patterns. Tests cover transaction manipulation, authentication bypass, and .NET-specific vulnerabilities.
### AltoroMutual
A Spring Boot + React banking application. The React SPA frontend exercises the suite's JavaScript analysis and SPA crawling capabilities. Uses TLS with a proxy configuration.
## Answer Keys

Each lab has an `answer-key.json` file at `evals/labs/<lab>/answer-key.json` that defines every known vulnerability:
```json
{
  "lab_name": "VulnHR - HR Portal",
  "total_vulnerabilities": 81,
  "categories": {
    "A01": {"name": "Broken Access Control", "count": 12},
    "A03": {"name": "Injection", "count": 13}
  },
  "vulnerabilities": [
    {
      "id": "VULN-001",
      "name": "SQL Injection in Employee Search",
      "category": "A03",
      "severity": "Critical",
      "difficulty": "Facile",
      "skill": "test-injection",
      "endpoint": "/api/v1/employees?search=",
      "cwe": "CWE-89",
      "payload_hint": "' OR 1=1--"
    }
  ]
}
```
Each vulnerability entry includes:

| Field | Purpose |
|---|---|
| `id` | Unique identifier for matching |
| `name` | Human-readable vulnerability name |
| `category` | OWASP Top 10 category (A01-A10) or custom (BL, X) |
| `severity` | Critical, High, Medium, or Low |
| `difficulty` | Facile, Media, or Difficile (Italian for easy, medium, hard; used for gap analysis) |
| `skill` | Which `/test-*` skill should detect this |
| `endpoint` | Affected endpoint path |
| `cwe` | CWE identifier |
| `payload_hint` | Hint for gap analysis (not used during testing) |
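An answer key's declared counts can drift from its entries as labs evolve. A minimal consistency check, sketched here against the fields shown above (this helper is illustrative, not part of the suite):

```python
from collections import Counter

def check_answer_key(key: dict) -> list[str]:
    """Return a list of consistency problems in an answer key (empty if clean)."""
    problems = []
    vulns = key.get("vulnerabilities", [])
    # total_vulnerabilities must equal the number of entries
    if len(vulns) != key.get("total_vulnerabilities"):
        problems.append("total_vulnerabilities does not match entry count")
    # per-category counts must match the declared category counts
    per_category = Counter(v["category"] for v in vulns)
    for cat, meta in key.get("categories", {}).items():
        if per_category.get(cat, 0) != meta["count"]:
            problems.append(
                f"category {cat}: declared {meta['count']}, "
                f"found {per_category.get(cat, 0)}"
            )
    # ids must be unique, since scoring matches findings by id
    seen = set()
    for v in vulns:
        if v["id"] in seen:
            problems.append(f"duplicate id {v['id']}")
        seen.add(v["id"])
    return problems
```

Running a check like this over `evals/labs/<lab>/answer-key.json` before an evaluation guards against scoring with a stale key.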
## Docker Compose Setup

Labs run as Docker Compose stacks. The `/labs-up` skill handles startup, health checking, and credential verification.
### Starting Labs

```bash
/labs-up                  # Start all registered labs
/labs-up --only vulnhr    # Start a specific lab
/labs-up --rebuild        # Force rebuild images
```
The startup process:

- Read `evals/labs/registry.json` for lab definitions
- Run `docker compose up -d` in each lab's project directory
- Execute first-run setup commands (migrations, seeding)
- Poll health endpoints until responsive (configurable timeout)
- Verify **all** credentials against each auth method
- Report Docker performance metrics (CPU, memory, disk)
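The health-polling step can be sketched with nothing but the standard library; the URL and timing values below are placeholders, not the suite's actual configuration:

```python
import time
import urllib.request

def wait_for_health(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll a health endpoint until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # container still starting (connection refused / timeout); retry
        time.sleep(interval)
    return False
```

A real implementation would also report which container failed so the startup summary can point at the offending lab.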
### Registry

The `evals/labs/registry.json` file is the central configuration for all labs. It defines Docker configuration, URLs, authentication methods, credentials, healthcheck parameters, and container names for each registered lab. New labs are added via the `/labs-add` skill.

The registry supports multiple authentication methods per lab (`auth_methods[]`), each with its own type (`json-api`, `form-post`, `form-api`), login endpoint, field names, success indicators, and credential list. This handles applications like VulnHR that expose both a web form login and API token endpoints.
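Credential verification then reduces to walking every lab, every auth method, and every credential. In this sketch the top-level `labs` key and the `login_endpoint`/`credentials` field names are assumptions for illustration; only `auth_methods[]` and the method type are documented above:

```python
def credential_checks(registry: dict):
    """Yield one (lab_id, auth_type, endpoint, username) tuple per credential,
    across every auth method of every registered lab."""
    for lab_id, lab in registry.get("labs", {}).items():
        for method in lab.get("auth_methods", []):
            for cred in method.get("credentials", []):
                yield lab_id, method["type"], method["login_endpoint"], cred["username"]
```

Each yielded tuple is one login attempt to perform during startup verification.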
## Lab Management Skills

### /labs-add
Registers a new lab from a GitHub repository URL. Auto-detects Docker configuration, credentials, authentication methods, and port assignments.
The skill:
- Clones the repository into the parent projects directory
- Analyzes docker-compose for ports, services, and profiles
- Checks for port conflicts with existing labs
- Searches for default credentials in READMEs, seeders, and env files
- Detects authentication type and login endpoint
- Presents a summary for user confirmation
- Updates `registry.json` and creates `evals/labs/<id>/lab-config.json`
- Updates the system hosts file with the lab hostname
- Validates no port conflicts across all registered labs
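The port-conflict validation in the final step boils down to detecting host ports claimed by more than one lab. A minimal sketch:

```python
def find_port_conflicts(lab_ports: dict[str, list[int]]) -> dict[int, list[str]]:
    """Map each host port claimed by more than one lab to the labs claiming it."""
    claims: dict[int, list[str]] = {}
    for lab, ports in lab_ports.items():
        for port in ports:
            claims.setdefault(port, []).append(lab)
    # keep only ports with multiple claimants
    return {port: labs for port, labs in claims.items() if len(labs) > 1}
```

An empty result means the new lab's ports can be registered safely.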
### /labs-up

Starts registered labs, verifies health and credentials, checks Docker performance, and opens browser tabs. Supports `--only` for single-lab startup and `--rebuild` for image rebuilds.
### /labs-eval
Launches parallel pentest evaluations across all (or selected) labs. Creates wrapper scripts for each lab, opens terminal tabs for parallel execution, and aggregates results into a suite report.
```bash
/labs-eval                          # All labs, parallel
/labs-eval --only vulnhr            # Single lab
/labs-eval --only vulnhr,altoro     # Subset
/labs-eval --sequential             # Run one at a time
/labs-eval --results                # Aggregate from latest run
/labs-eval --results 2026-03-13_14  # Aggregate from specific run
```
The parallel execution model:

- Generates wrapper scripts per lab in `evals/runs/<timestamp>/`
- Opens terminal tabs (Windows Terminal, macOS Terminal, or gnome-terminal)
- Each tab runs an independent `claude -p "/pentest <target> --eval"` session
- Token usage is captured from `--output-format stream-json`
- `lab-scorer.py` runs automatically after each pentest completes
- `--results` aggregates all lab scores into a suite report with cross-lab skill-weakness analysis
### The `--eval` Flag

When `/pentest` is invoked with `--eval`, it activates lab evaluation mode:

- Skips `brief.json` prompts (labs have no business context)
- Auto-creates `credentials.json` from the lab config if available
- Skips authorization confirmation prompts
- Implies `--fast` (skips Phase 1 recon, since scope is pre-defined)
- After the report phase, reminds the user to score with `lab-scorer.py`
> **Eval isolation is enforced.** During `--eval` pentests, the same eval isolation rules apply: no consulting past results, no reading answer keys, no referencing `gap-analysis.md`. Each evaluation must discover vulnerabilities from scratch to produce valid measurements.
## History Tracking

Every scored evaluation is saved to `evals/labs/<lab>/history/<timestamp>.json`. This enables:

- **Trend tracking:** comparing detection rates across code changes
- **Regression detection:** identifying skills that degraded after updates
- **Gap analysis:** the `--narrative` flag on `lab-scorer.py` generates `gap-analysis.md` with per-vulnerability tables and skill coverage breakdowns
The `/labs-eval --results` aggregation reads from these history files to produce cross-lab suite reports with per-skill weakness identification and comparison against previous runs.
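Turning a lab's history directory into a detection-rate trend could look like the following sketch; the `detected` and `total` field names are assumptions about `lab-scorer.py`'s history format:

```python
import json
from pathlib import Path

def detection_trend(history_dir: str) -> list[tuple[str, float]]:
    """Return (timestamp, detection_rate) pairs, oldest run first.
    Timestamped filenames sort chronologically, so lexical order suffices."""
    trend = []
    for path in sorted(Path(history_dir).glob("*.json")):
        run = json.loads(path.read_text())
        total = run.get("total", 0)
        rate = run.get("detected", 0) / total if total else 0.0
        trend.append((path.stem, rate))
    return trend
```

A drop between consecutive entries is exactly the regression signal the history files exist to surface.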