Lab Evaluation Framework

Purpose

The lab framework provides a controlled, repeatable environment for measuring the pentest suite's vulnerability detection capabilities. Each lab is a known-vulnerable application with a curated answer key, enabling objective scoring of true positives, false negatives, and false positives across engagements.

Lab Targets

Three lab targets are currently registered, covering different tech stacks and vulnerability profiles:

Lab              Application                     Tech Stack          Vulnerabilities  Difficulty Distribution
VulnHR           HR Portal (Meridian Solutions)  Laravel/PHP, MySQL  81               ~30 easy, ~30 medium, ~20 hard
SuperSecureBank  Banking Application             .NET 8, SQL Server  37               Mixed
AltoroMutual     Banking Application             Spring Boot, React  29               Mixed

VulnHR

The largest and most comprehensive lab target. Covers 12 OWASP categories plus business logic and extra vulnerability classes. Features 6 user roles with distinct privilege levels and supports 3 authentication methods (form login, LDAP, Sanctum API).

Target:       http://vulnhr.test:7331/
Pentest flags: --fast
Roles:        admin, hr_manager, hr_specialist, manager, employee, observer
Auth methods: form-post, LDAP form, Sanctum form-api

SuperSecureBank

A .NET 8 banking application focused on financial security patterns. Tests cover transaction manipulation, authentication bypass, and .NET-specific vulnerabilities.

Target:       http://supersecurebank.test:45127/
Pentest flags: --fast

AltoroMutual

A Spring Boot + React banking application. The React SPA frontend exercises the suite's JavaScript analysis and SPA crawling capabilities. Uses TLS with a proxy configuration.

Target:       https://altoromutual.test:8443/
Pentest flags: --fast --proxy 127.0.0.1:9000

Answer Keys

Each lab has an answer-key.json file at evals/labs/<lab>/answer-key.json that defines every known vulnerability:

{
  "lab_name": "VulnHR - HR Portal",
  "total_vulnerabilities": 81,
  "categories": {
    "A01": {"name": "Broken Access Control", "count": 12},
    "A03": {"name": "Injection", "count": 13}
  },
  "vulnerabilities": [
    {
      "id": "VULN-001",
      "name": "SQL Injection in Employee Search",
      "category": "A03",
      "severity": "Critical",
      "difficulty": "Facile",
      "skill": "test-injection",
      "endpoint": "/api/v1/employees?search=",
      "cwe": "CWE-89",
      "payload_hint": "' OR 1=1--"
    }
  ]
}

Each vulnerability entry includes:

Field         Purpose
id            Unique identifier used to match findings against scanner output
name          Human-readable vulnerability name
category      OWASP Top 10 category (A01-A10) or custom (BL, X)
severity      Critical, High, Medium, or Low
difficulty    Facile, Media, or Difficile (easy/medium/hard; used for gap analysis)
skill         The /test-* skill expected to detect this vulnerability
endpoint      Affected endpoint path
cwe           CWE identifier
payload_hint  Hint for gap analysis (not used during testing)
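The consistency constraints above (declared totals, per-category counts, unique ids) can be checked mechanically. A minimal sketch, where the validate_answer_key helper and the inline key are illustrative rather than part of the suite:

```python
from collections import Counter

def validate_answer_key(key: dict) -> list[str]:
    """Return a list of consistency problems in an answer key (empty = OK)."""
    problems = []
    vulns = key.get("vulnerabilities", [])
    # The declared total must match the number of entries.
    if key.get("total_vulnerabilities") != len(vulns):
        problems.append(
            f"total_vulnerabilities={key.get('total_vulnerabilities')} "
            f"but {len(vulns)} entries found"
        )
    # Per-category counts must agree with the category index.
    seen = Counter(v["category"] for v in vulns)
    for cat, meta in key.get("categories", {}).items():
        if meta["count"] != seen.get(cat, 0):
            problems.append(
                f"category {cat}: declared {meta['count']}, found {seen.get(cat, 0)}"
            )
    # Ids must be unique so the scorer can match findings unambiguously.
    ids = [v["id"] for v in vulns]
    if len(ids) != len(set(ids)):
        problems.append("duplicate vulnerability ids")
    return problems

key = {
    "total_vulnerabilities": 1,
    "categories": {"A03": {"name": "Injection", "count": 1}},
    "vulnerabilities": [{"id": "VULN-001", "category": "A03"}],
}
print(validate_answer_key(key))  # []
```

Running such a check before an evaluation catches a stale answer key early, before it skews detection-rate numbers.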

Docker Compose Setup

Labs run as Docker Compose stacks. The /labs-up skill handles startup, health checking, and credential verification.

Starting Labs

/labs-up                    # Start all registered labs
/labs-up --only vulnhr      # Start a specific lab
/labs-up --rebuild          # Force rebuild images

The startup process:

  1. Read evals/labs/registry.json for lab definitions
  2. Run docker compose up -d in each lab's project directory
  3. Execute first-run setup commands (migrations, seeding)
  4. Poll health endpoints until responsive (configurable timeout)
  5. Verify ALL credentials against each auth method
  6. Report Docker performance metrics (CPU, memory, disk)
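Step 4's health polling can be sketched as follows; wait_healthy is a hypothetical helper, and the real skill reads its endpoint and timeout from the registry:

```python
import time
import urllib.request
import urllib.error

def wait_healthy(url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll a health endpoint until it answers below HTTP 400 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container still booting; retry after a short sleep
        time.sleep(interval_s)
    return False
```

Polling with a deadline rather than a fixed retry count keeps the timeout meaningful regardless of how slowly individual requests fail.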

Registry

The evals/labs/registry.json file is the central configuration for all labs. It defines Docker configuration, URLs, authentication methods, credentials, healthcheck parameters, and container names for each registered lab. New labs are added via the /labs-add skill.

The registry supports multiple authentication methods per lab (auth_methods[]), each with its own type (json-api, form-post, form-api), login endpoint, field names, success indicators, and credential list. This handles applications like VulnHR that expose both web form login and API token endpoints.
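A minimal sketch of selecting auth methods from such an entry; the entry below is hypothetical and only mirrors the fields described above, since the exact layout is defined by registry.json itself:

```python
# Hypothetical registry entry shaped after the fields described above.
registry_entry = {
    "id": "vulnhr",
    "url": "http://vulnhr.test:7331/",
    "auth_methods": [
        {"type": "form-post", "login_endpoint": "/login",
         "credentials": [{"email": "admin@vulnhr.test", "password": "secret"}]},
        {"type": "form-api", "login_endpoint": "/api/v1/login",
         "credentials": [{"email": "admin@vulnhr.test", "password": "secret"}]},
    ],
}

def methods_of_type(entry: dict, auth_type: str) -> list[dict]:
    """Return every auth method of a given type for a lab entry."""
    return [m for m in entry.get("auth_methods", []) if m["type"] == auth_type]

print([m["login_endpoint"] for m in methods_of_type(registry_entry, "form-api")])
# ['/api/v1/login']
```

Keeping auth_methods as a list rather than a single object is what lets one lab carry both a web form login and an API token endpoint.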

Lab Management Skills

/labs-add

Registers a new lab from a GitHub repository URL. Auto-detects Docker configuration, credentials, authentication methods, and port assignments.

/labs-add https://github.com/org/vuln-app
/labs-add https://github.com/org/vuln-app my-custom-id

The skill:

  1. Clones the repository into the parent projects directory
  2. Analyzes docker-compose for ports, services, and profiles
  3. Checks for port conflicts with existing labs
  4. Searches for default credentials in READMEs, seeders, and env files
  5. Detects authentication type and login endpoint
  6. Presents a summary for user confirmation
  7. Updates registry.json and creates evals/labs/<id>/lab-config.json
  8. Updates the system hosts file with the lab hostname
  9. Validates no port conflicts across all registered labs
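The port-conflict checks in steps 3 and 9 reduce to a pure function over the host ports each lab claims; the lab-to-ports mapping shown is illustrative:

```python
def port_conflicts(labs: dict[str, list[int]]) -> dict[int, list[str]]:
    """Map each host port to the labs that claim it, keeping only collisions."""
    owners: dict[int, list[str]] = {}
    for lab, ports in labs.items():
        for port in ports:
            owners.setdefault(port, []).append(lab)
    return {p: ls for p, ls in owners.items() if len(ls) > 1}

labs = {"vulnhr": [7331], "supersecurebank": [45127], "new-lab": [7331, 9090]}
print(port_conflicts(labs))  # {7331: ['vulnhr', 'new-lab']}
```

Reporting every colliding lab per port, rather than just a boolean, tells the user exactly which registry entries need new port assignments.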

/labs-up

Starts registered labs, verifies health and credentials, checks Docker performance, and opens browser tabs. Supports --only for single-lab startup and --rebuild for image rebuilds.

/labs-eval

Launches parallel pentest evaluations across all (or selected) labs. Creates wrapper scripts for each lab, opens terminal tabs for parallel execution, and aggregates results into a suite report.

/labs-eval                          # All labs, parallel
/labs-eval --only vulnhr            # Single lab
/labs-eval --only vulnhr,altoro     # Subset
/labs-eval --sequential             # Run one at a time
/labs-eval --results                # Aggregate from latest run
/labs-eval --results 2026-03-13_14  # Aggregate from specific run

The parallel execution model:

  1. Generates wrapper scripts per lab in evals/runs/<timestamp>/
  2. Opens terminal tabs (Windows Terminal, macOS Terminal, or gnome-terminal)
  3. Each tab runs an independent claude -p "/pentest <target> --eval" session
  4. Token usage is captured from --output-format stream-json
  5. lab-scorer.py runs automatically after each pentest completes
  6. --results aggregates all lab scores into a suite report with cross-lab skill weakness analysis
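Step 1's wrapper generation might look like the sketch below. The write_wrapper helper and the lab-scorer.py arguments are assumptions for illustration; only the claude -p invocation and the --output-format stream-json flag come from the description above:

```python
from pathlib import Path

def write_wrapper(run_dir: Path, lab_id: str, target: str) -> Path:
    """Write a per-lab wrapper script that runs the eval pentest, then the scorer."""
    script = run_dir / f"run-{lab_id}.sh"
    script.write_text(
        "#!/bin/sh\n"
        # Capture the stream-json output so token usage can be read back later.
        f'claude -p "/pentest {target} --eval" --output-format stream-json '
        f"> {run_dir}/{lab_id}.stream.jsonl\n"
        # Hypothetical scorer invocation; the real arguments belong to lab-scorer.py.
        f"python lab-scorer.py --lab {lab_id} --run {run_dir}\n"
    )
    script.chmod(0o755)  # make the wrapper executable for the terminal tab
    return script
```

Chaining the scorer after the pentest inside one script is what lets each terminal tab run fully unattended.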

The --eval Flag

When /pentest is invoked with --eval, it activates lab evaluation mode:

  • Skips brief.json prompts (labs have no business context)
  • Auto-creates credentials.json from lab config if available
  • Skips authorization confirmation prompts
  • Implies --fast (skips Phase 1 recon, since scope is pre-defined)
  • After the report phase, reminds the user to score with lab-scorer.py

Eval isolation is enforced

During --eval pentests, the same eval isolation rules apply: no consulting past results, no reading answer keys, no referencing gap-analysis.md. Each evaluation must discover vulnerabilities from scratch to produce valid measurements.

History Tracking

Every scored evaluation is saved to evals/labs/<lab>/history/<timestamp>.json. This enables:

  • Trend tracking: Comparing detection rates across code changes
  • Regression detection: Identifying skills that degraded after updates
  • Gap analysis: The --narrative flag on lab-scorer.py generates gap-analysis.md with per-vulnerability tables and skill coverage breakdowns

The /labs-eval --results aggregation reads from these history files to produce cross-lab suite reports with per-skill weakness identification and comparison against previous runs.
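Reading detection rates back out of those history files can be sketched as below; the true_positives and total field names are assumptions, since the actual record schema is produced by lab-scorer.py:

```python
import json
from pathlib import Path

def detection_trend(history_dir: Path) -> list[tuple[str, float]]:
    """Return (timestamp, detection_rate) pairs, oldest first.

    Assumes each history file records true_positives and total counts;
    the real schema is whatever lab-scorer.py writes.
    """
    points = []
    # Timestamped filenames sort chronologically, so glob order is run order.
    for path in sorted(history_dir.glob("*.json")):
        data = json.loads(path.read_text())
        points.append((path.stem, data["true_positives"] / data["total"]))
    return points
```

A drop between consecutive points in this series is exactly the regression signal the history files exist to surface.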