Lab Evaluation Framework

Purpose

The lab framework provides a controlled, repeatable environment for measuring the pentest suite's vulnerability detection capabilities. Each lab is a known-vulnerable application with a curated answer key, enabling objective scoring of true positives, false negatives, and false positives across engagements.

Lab Targets

Three lab targets are currently registered, covering different tech stacks and vulnerability profiles:

Lab              Application                     Tech Stack          Vulnerabilities  Difficulty Distribution
VulnHR           HR Portal (Meridian Solutions)  Laravel/PHP, MySQL  81               ~30 easy, ~30 medium, ~20 hard
SuperSecureBank  Banking Application             .NET 8, SQL Server  37               Mixed
AltoroMutual     Banking Application             Spring Boot, React  29               Mixed

VulnHR

The largest and most comprehensive lab target. Covers 12 OWASP categories plus business logic and extra vulnerability classes. Features 6 user roles with distinct privilege levels and supports 3 authentication methods (form login, LDAP, Sanctum API).

Target:       http://vulnhr.test:7331/
Pentest flags: --fast
Roles:        admin, hr_manager, hr_specialist, manager, employee, observer
Auth methods: form-post, LDAP form, Sanctum form-api

SuperSecureBank

A .NET 8 banking application focused on financial security patterns. Tests cover transaction manipulation, authentication bypass, and .NET-specific vulnerabilities.

Target:       http://supersecurebank.test:45127/
Pentest flags: --fast

AltoroMutual

A Spring Boot + React banking application. The React SPA frontend exercises the suite's JavaScript analysis and SPA crawling capabilities. Uses TLS with a proxy configuration.

Target:       https://altoromutual.test:8443/
Pentest flags: --fast --proxy 127.0.0.1:9000

Answer Keys

Each lab has an answer-key.json file at evals/labs/<lab>/answer-key.json that defines every known vulnerability:

{
  "lab_name": "VulnHR - HR Portal",
  "total_vulnerabilities": 81,
  "categories": {
    "A01": {"name": "Broken Access Control", "count": 12},
    "A03": {"name": "Injection", "count": 13}
  },
  "vulnerabilities": [
    {
      "id": "VULN-001",
      "name": "SQL Injection in Employee Search",
      "category": "A03",
      "severity": "Critical",
      "difficulty": "Facile",
      "skill": "test-injection",
      "endpoint": "/api/v1/employees?search=",
      "cwe": "CWE-89",
      "payload_hint": "' OR 1=1--"
    }
  ]
}

Each vulnerability entry includes:

Field         Purpose
id            Unique identifier used to match findings against scanner output
name          Human-readable vulnerability name
category      OWASP Top 10 category (A01-A10) or custom (BL, X)
severity      Critical, High, Medium, or Low
difficulty    Facile, Media, or Difficile (easy/medium/hard; used for gap analysis)
skill         The /test-* skill expected to detect this vulnerability
endpoint      Affected endpoint path
cwe           CWE identifier
payload_hint  Hint for gap analysis (not used during testing)
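The consistency constraints above (declared totals, per-category counts, unique ids) can be checked mechanically. A minimal sketch, where the validate_answer_key helper and the inline key are illustrative rather than part of the suite:

```python
from collections import Counter

def validate_answer_key(key: dict) -> list[str]:
    """Return a list of consistency problems in an answer key (empty = OK)."""
    problems = []
    vulns = key.get("vulnerabilities", [])
    # The declared total must match the number of entries.
    if key.get("total_vulnerabilities") != len(vulns):
        problems.append(
            f"total_vulnerabilities={key.get('total_vulnerabilities')} "
            f"but {len(vulns)} entries found"
        )
    # Per-category counts must agree with the category index.
    seen = Counter(v["category"] for v in vulns)
    for cat, meta in key.get("categories", {}).items():
        if meta["count"] != seen.get(cat, 0):
            problems.append(
                f"category {cat}: declared {meta['count']}, found {seen.get(cat, 0)}"
            )
    # Ids must be unique so the scorer can match findings unambiguously.
    ids = [v["id"] for v in vulns]
    if len(ids) != len(set(ids)):
        problems.append("duplicate vulnerability ids")
    return problems

key = {
    "total_vulnerabilities": 1,
    "categories": {"A03": {"name": "Injection", "count": 1}},
    "vulnerabilities": [{"id": "VULN-001", "category": "A03"}],
}
print(validate_answer_key(key))  # []
```

Running such a check before an evaluation catches a stale answer key early, before it skews detection-rate numbers.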

Docker Compose Setup

Labs run as Docker Compose stacks. The /labs-up skill handles startup, health checking, and credential verification.

Starting Labs

/labs-up                    # Start all registered labs
/labs-up --only vulnhr      # Start a specific lab
/labs-up --rebuild          # Force rebuild images

The startup process:

  1. Read evals/labs/registry.json for lab definitions
  2. Run docker compose up -d in each lab's project directory
  3. Execute first-run setup commands (migrations, seeding)
  4. Poll health endpoints until responsive (configurable timeout)
  5. Verify ALL credentials against each auth method
  6. Report Docker performance metrics (CPU, memory, disk)
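Step 4's health polling can be sketched as follows; wait_healthy is a hypothetical helper, and the real skill reads its endpoint and timeout from the registry:

```python
import time
import urllib.request
import urllib.error

def wait_healthy(url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll a health endpoint until it answers below HTTP 400 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status < 400:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container still booting; retry after a short sleep
        time.sleep(interval_s)
    return False
```

Polling with a deadline rather than a fixed retry count keeps the timeout meaningful regardless of how slowly individual requests fail.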

Registry

The evals/labs/registry.json file is the central configuration for all labs. It defines Docker configuration, URLs, authentication methods, credentials, healthcheck parameters, and container names for each registered lab. New labs are added via the /labs-add skill.

The registry supports multiple authentication methods per lab (auth_methods[]), each with its own type (json-api, form-post, form-api), login endpoint, field names, success indicators, and credential list. This handles applications like VulnHR that expose both web form login and API token endpoints.
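A minimal sketch of selecting auth methods from such an entry; the entry below is hypothetical and only mirrors the fields described above, since the exact layout is defined by registry.json itself:

```python
# Hypothetical registry entry shaped after the fields described above.
registry_entry = {
    "id": "vulnhr",
    "url": "http://vulnhr.test:7331/",
    "auth_methods": [
        {"type": "form-post", "login_endpoint": "/login",
         "credentials": [{"email": "admin@vulnhr.test", "password": "secret"}]},
        {"type": "form-api", "login_endpoint": "/api/v1/login",
         "credentials": [{"email": "admin@vulnhr.test", "password": "secret"}]},
    ],
}

def methods_of_type(entry: dict, auth_type: str) -> list[dict]:
    """Return every auth method of a given type for a lab entry."""
    return [m for m in entry.get("auth_methods", []) if m["type"] == auth_type]

print([m["login_endpoint"] for m in methods_of_type(registry_entry, "form-api")])
# ['/api/v1/login']
```

Keeping auth_methods as a list rather than a single object is what lets one lab carry both a web form login and an API token endpoint.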

Lab Management Skills

/labs-add

Registers a new lab from a GitHub repository URL. Auto-detects Docker configuration, credentials, authentication methods, and port assignments.

/labs-add https://github.com/org/vuln-app
/labs-add https://github.com/org/vuln-app my-custom-id

The skill:

  1. Clones the repository into the parent projects directory
  2. Analyzes docker-compose for ports, services, and profiles
  3. Checks for port conflicts with existing labs
  4. Searches for default credentials in READMEs, seeders, and env files
  5. Detects authentication type and login endpoint
  6. Presents a summary for user confirmation
  7. Updates registry.json and creates evals/labs/<id>/lab-config.json
  8. Updates the system hosts file with the lab hostname
  9. Validates no port conflicts across all registered labs
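The port-conflict checks in steps 3 and 9 reduce to a pure function over the host ports each lab claims; the lab-to-ports mapping shown is illustrative:

```python
def port_conflicts(labs: dict[str, list[int]]) -> dict[int, list[str]]:
    """Map each host port to the labs that claim it, keeping only collisions."""
    owners: dict[int, list[str]] = {}
    for lab, ports in labs.items():
        for port in ports:
            owners.setdefault(port, []).append(lab)
    return {p: ls for p, ls in owners.items() if len(ls) > 1}

labs = {"vulnhr": [7331], "supersecurebank": [45127], "new-lab": [7331, 9090]}
print(port_conflicts(labs))  # {7331: ['vulnhr', 'new-lab']}
```

Reporting every colliding lab per port, rather than just a boolean, tells the user exactly which registry entries need new port assignments.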

/labs-up

Starts registered labs, verifies health and credentials, checks Docker performance, and opens browser tabs. Supports --only for single-lab startup and --rebuild for image rebuilds.

/labs-eval

Launches parallel pentest evaluations across all (or selected) labs. Creates wrapper scripts for each lab, opens terminal tabs for parallel execution, and aggregates results into a suite report.

/labs-eval                          # All labs, parallel
/labs-eval --only vulnhr            # Single lab
/labs-eval --only vulnhr,altoro     # Subset
/labs-eval --sequential             # Run one at a time
/labs-eval --results                # Aggregate from latest run
/labs-eval --results 2026-03-13_14  # Aggregate from specific run

The parallel execution model:

  1. Generates wrapper scripts per lab in evals/runs/<timestamp>/
  2. Opens terminal tabs (Windows Terminal, macOS Terminal, or gnome-terminal)
  3. Each tab runs an independent claude -p "/pentest <target> --eval" session
  4. Token usage is captured from --output-format stream-json
  5. lab-scorer.py runs automatically after each pentest completes
  6. --results aggregates all lab scores into a suite report with cross-lab skill weakness analysis
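Step 1's wrapper generation might look like the sketch below. The write_wrapper helper and the lab-scorer.py arguments are assumptions for illustration; only the claude -p invocation and the --output-format stream-json flag come from the description above:

```python
from pathlib import Path

def write_wrapper(run_dir: Path, lab_id: str, target: str) -> Path:
    """Write a per-lab wrapper script that runs the eval pentest, then the scorer."""
    script = run_dir / f"run-{lab_id}.sh"
    script.write_text(
        "#!/bin/sh\n"
        # Capture the stream-json output so token usage can be read back later.
        f'claude -p "/pentest {target} --eval" --output-format stream-json '
        f"> {run_dir}/{lab_id}.stream.jsonl\n"
        # Hypothetical scorer invocation; the real arguments belong to lab-scorer.py.
        f"python lab-scorer.py --lab {lab_id} --run {run_dir}\n"
    )
    script.chmod(0o755)  # make the wrapper executable for the terminal tab
    return script
```

Chaining the scorer after the pentest inside one script is what lets each terminal tab run fully unattended.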

The --eval Flag

When /pentest is invoked with --eval, it activates lab evaluation mode:

  • Skips brief.json prompts (labs have no business context)
  • Auto-creates credentials.json from lab config if available
  • Skips authorization confirmation prompts
  • Implies --fast (skips Phase 1 recon, since scope is pre-defined)
  • After the report phase, reminds the user to score with lab-scorer.py

Eval isolation is enforced

During --eval pentests, the same eval isolation rules apply: no consulting past results, no reading answer keys, no referencing gap-analysis.md. Each evaluation must discover vulnerabilities from scratch to produce valid measurements.

History Tracking

Every scored evaluation is saved to evals/labs/<lab>/history/<timestamp>.json. This enables:

  • Trend tracking: Comparing detection rates across code changes
  • Regression detection: Identifying skills that degraded after updates
  • Gap analysis: The --narrative flag on lab-scorer.py generates gap-analysis.md with per-vulnerability tables and skill coverage breakdowns

The /labs-eval --results aggregation reads from these history files to produce cross-lab suite reports with per-skill weakness identification and comparison against previous runs.
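Reading detection rates back out of those history files can be sketched as below; the true_positives and total field names are assumptions, since the actual record schema is produced by lab-scorer.py:

```python
import json
from pathlib import Path

def detection_trend(history_dir: Path) -> list[tuple[str, float]]:
    """Return (timestamp, detection_rate) pairs, oldest first.

    Assumes each history file records true_positives and total counts;
    the real schema is whatever lab-scorer.py writes.
    """
    points = []
    # Timestamped filenames sort chronologically, so glob order is run order.
    for path in sorted(history_dir.glob("*.json")):
        data = json.loads(path.read_text())
        points.append((path.stem, data["true_positives"] / data["total"]))
    return points
```

A drop between consecutive points in this series is exactly the regression signal the history files exist to surface.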