Lab Catalog¶

10 registered lab targets covering different tech stacks, vulnerability profiles, and scoring modes.

Summary¶

Lab	Tech Stack	Vulns	Auth	Scoring	Difficulty
VulnHR	Laravel/PHP, MySQL, LDAP	81	Sanctum + LDAP + Form	Answer key	~30 easy / ~30 medium / ~20 hard
Juice Shop	Express.js, Angular, SQLite	55	JWT REST API	Answer key	Mixed (6 difficulty tiers)
SuperSecureBank	.NET 8, SQL Server	37	MVC Form Login	Answer key	Mixed
AltoroMutual	Spring Boot, React, PostgreSQL	29	JWT API	Answer key	Mixed
DVWA	PHP, MariaDB	28	PHP Session Form	Answer key	4 security levels
DVRA	FastAPI, MongoDB	12	OAuth2 Token	Answer key	Mixed
XBOW	Mixed (104 Docker containers)	104	None	Flag-based (`FLAG{...}`)	L1: 45, L2: 51, L3: 8
HackBench	Mixed (16 real-world CVE stacks)	16	Per-challenge	Flag-based (`ev{...}`) + XSS alert	9 easy, 2 medium, 5 hard
Gandalf (Lakera)	LLM Prompt Injection	8	None	Level completion	Progressive (8 levels)
HackMerlin	LLM Prompt Injection	7	None	Level completion	Progressive (7 levels)
PortSwigger Academy	Web Security (all categories)	250+	None	Lab completion	Apprentice / Practitioner / Expert
VibeApps (Neo)	3 AI-generated apps (see below)	74	JWT / NextAuth / Custom Session	Answer key + Neo baseline	Mixed (8C, 13H, 16M, 25L, 12I)

Total: 676+ distinct vulnerabilities/challenges across 14 targets.

VulnHR¶

The largest and most comprehensive lab target. HR Portal for a fictional company (Meridian Solutions). Covers 12 OWASP categories plus business logic and extra vulnerability classes.

Property	Value
Target	`http://vulnhr.test:7331/`
Tech Stack	Laravel/PHP, MySQL, Redis, Nginx, OpenLDAP
Vulnerabilities	81
Roles	6 (admin, hr_manager, hr_specialist, manager, employee, observer)
Auth Methods	REST API (Sanctum), LDAP form, Web form
Containers	hrportal-nginx, hrportal-php, hrportal-postgres, hrportal-redis, hrportal-ldap
Pentest Flags	`--fast`

OWASP Category Breakdown¶

Category	Count
A01 - Broken Access Control	12
A02 - Cryptographic Failures	3
A03 - Injection	13
A04 - Insecure Design	5
A05 - Security Misconfiguration	9
A06 - Vulnerable Components	2
A07 - Auth Failures	5
A08 - Data Integrity	3
A09 - Logging Failures	2
A10 - SSRF	2
X - Extra	13
BL - Business Logic	12

Difficulty Distribution¶

Easy (~30 vulns): DAST-detectable, standard scanner coverage
Medium (~30 vulns): Needs specific configuration or partial manual analysis
Hard (~20 vulns): Requires deep manual analysis or exploit chains

Juice Shop¶

OWASP's flagship vulnerable web application. Express.js + Angular SPA with 111 built-in challenges; 55 are testable via automated pentest (excludes coding challenges, tutorial-only, and UI-puzzle challenges).

Property	Value
Target	`http://juiceshop.test:3000/`
Tech Stack	Express.js, Angular, SQLite
Vulnerabilities	55 (of 111 total challenges)
Roles	4 (admin, customer, demo, accountant)
Auth Method	REST API (JWT)
Container	juice-shop
Pentest Flags	`--fast`

OWASP Category Breakdown¶

Category	Count
A01 - Broken Access Control	12
A02 - Cryptographic Failures	4
A03 - Injection	12
A04 - Insecure Design	4
A05 - Security Misconfiguration	6
A06 - Vulnerable Components	3
A07 - Auth Failures	5
A08 - Data Integrity Failures	3
A09 - Logging Failures	2
A10 - SSRF	2
XSS - Cross-Site Scripting	2

Challenge status API

Use /api/v1/challenges to check solve status for all 111 challenges.

SuperSecureBank¶

A .NET 8 banking application focused on financial security patterns. Tests cover transaction manipulation, authentication bypass, and .NET-specific vulnerabilities.

Property	Value
Target	`http://supersecurebank.test:45127/`
Tech Stack	.NET 8, SQL Server
Vulnerabilities	37
Roles	5 (admin + 4 users)
Auth Method	MVC Form Login
Containers	supersecure-db, supersecure-be, supersecure-fe
Pentest Flags	`--fast`

AltoroMutual¶

A Spring Boot + React banking application. The React SPA frontend exercises the suite's JavaScript analysis and SPA crawling capabilities. Uses TLS with a proxy configuration.

Property	Value
Target	`https://altoromutual.test:8443/`
Tech Stack	Spring Boot, React, PostgreSQL
Vulnerabilities	29
Roles	5 (admin + 4 customers)
Auth Method	JWT API
Containers	altoro-postgres, altoro-app
Pentest Flags	`--fast --proxy 127.0.0.1:9000`

TLS configuration

AltoroMutual uses HTTPS with a self-signed certificate. The --proxy flag routes traffic through a local proxy that handles TLS verification.

DVWA¶

Classic Damn Vulnerable Web Application. PHP + MariaDB with 4 configurable security levels (low, medium, high, impossible). Eval runs on low level.

Property	Value
Target	`http://dvwa.test:4280/`
Tech Stack	PHP, MariaDB
Vulnerabilities	28
Roles	5 (admin + 4 users)
Auth Method	PHP Session Form
Containers	dvwa-dvwa-1, dvwa-db-1
Pentest Flags	`--fast`
Security Level	`low` (configurable via `DEFAULT_SECURITY_LEVEL`)

OWASP Category Breakdown¶

Category	Count
A01 - Broken Access Control	3
A02 - Cryptographic Failures	2
A03 - Injection	10
A04 - Insecure Design	2
A05 - Security Misconfiguration	4
A07 - Auth Failures	3
A08 - Data Integrity Failures	2
A09 - Logging Failures	1
X - Client-Side	1

First-run setup required

DVWA requires /setup.php to create the database before first use. The registry setup_commands handle this automatically via /labs-up.

DVRA¶

Damn Vulnerable RESTaurant API Game. A pure REST API target (no web UI) built with FastAPI + MongoDB. Tests API-specific vulnerabilities including authentication, injection, and access control.

Property	Value
Target	`http://dvra.test:8091/`
Tech Stack	FastAPI (Python), MongoDB
Vulnerabilities	12
Roles	2 types (3 customers + 2 employees)
Auth Method	OAuth2 Token (FastAPI)
Containers	web, db
Pentest Flags	`--fast`

OWASP Category Breakdown¶

Category	Count
A01 - Broken Access Control	3
A02 - Cryptographic Failures	1
A03 - Injection	1
A04 - Insecure Design	2
A05 - Security Misconfiguration	2
A07 - Auth Failures	2
A10 - SSRF	1

XBOW¶

XBOW Validation Benchmarks: 104 independent CTF challenges, each running as a separate Docker Compose stack. Flag-based scoring (FLAG{...}) instead of answer-key matching.

Property	Value
Type	CTF collection (104 challenges)
Port Range	`10001`-`10104` (one per benchmark)
Flag Format	`FLAG{...}` (SHA256-based, generated by `common.mk`)
Auth	None (per-challenge)
Pentest Flags	`--fast`

Difficulty Distribution¶

Level	Count	Description
Level 1	45	Entry-level
Level 2	51	Intermediate
Level 3	8	Advanced

Running XBOW benchmarks¶

XBOW uses dedicated tooling instead of the standard /labs-up + /pentest --eval flow:

# Single benchmark
/pentest-xbow benchmark-name

# Batch by level
/pentest-xbow --level 1

# Batch by tag
/pentest-xbow --tag sqli

# All benchmarks
/pentest-xbow --all

Each benchmark is launched individually via xbow-launcher.py, pentested with /pentest, and scored by xbow-scorer.py using flag comparison.

Lab Management¶

Registry¶

All labs are defined in evals/labs/registry.json. Each entry specifies:

Docker configuration (profile, build, setup commands)
URLs and health check parameters
Authentication methods and credentials
Container names for monitoring
Path to lab-config.json for eval scoring

Adding a New Lab¶

/labs-add https://github.com/org/vuln-app
/labs-add https://github.com/org/vuln-app custom-id

The /labs-add skill auto-detects Docker config, credentials, auth methods, port assignments, and updates the registry + hosts file.

Starting Labs¶

/labs-up                    # Start all
/labs-up --only vulnhr      # Single lab
/labs-up --rebuild          # Force rebuild

Startup process: read registry, docker compose up -d, run setup commands, poll health endpoints, verify ALL credentials against each auth method, report Docker performance.

HackBench¶

Real-world CVE-based CTF challenges from ElectrovoltSec. 16 independent web exploitation challenges, each running as a separate Docker Compose stack. Tests exploit discovery across SQLi, XSS, auth bypass, IDOR, RCE, and n-day vulnerability identification.

Property	Value
Type	CTF collection (16 challenges)
Port Range	`10201`-`10216` (one per challenge)
Flag Format	`ev{hex}` (runtime-generated unique flags) + `alert()` for XSS challenges
Auth	Per-challenge (varies: form login, API, onboarding wizard)
Source	ElectrovoltSec/HackBench
Launcher	`evals/labs/hackbench/hackbench-launcher.py`
Scorer	`evals/labs/hackbench/hackbench-scorer.py`

Challenge Overview¶

Difficulty	Count	Points Each	Total Points
Easy	9	100	900
Medium	2	300	600
Hard	5	500	2500
Total	16		4000

Exploit Categories Covered¶

Category	Challenges	Skills Tested
SQL Injection (UNION, Blind)	2	test-injection
NoSQL Injection	1	test-injection
Stored / DOM XSS	4	test-injection
JWT / Auth Bypass	3	test-auth
IDOR / BOLA	2	test-access
Command Injection / RCE	2	test-injection
Known CVE / N-day	2	CVE search + test-injection

Running HackBench¶

# Single challenge
/pentest-hackbench EV-01

# By difficulty
/pentest-hackbench --difficulty easy

# All challenges
/pentest-hackbench --all

Each challenge is launched via hackbench-launcher.py, pentested with /pentest --eval, and scored by hackbench-scorer.py using flag comparison (string match for flag challenges, alert() detection for XSS challenges).

Infrastructure notes

Some challenges require onboarding/setup via browser before API testing is possible
XSS challenges (EV-09, EV-10, EV-11) use alert(document.domain) as win condition, not string flags
Runtime flags are stored in runtime-flags.json — scorer reads this for validation
Docker Compose port merging requires patched + override file pattern (handled by launcher)

HackMerlin¶

HackMerlin: 7-level progressive LLM prompt injection challenge. Each level adds stronger defenses (input filter, output filter, LLM-as-judge, active deception). Hosted target at hackmerlin.io (no Docker required).

Property	Value
Type	LLM prompt injection (7 levels)
Target	`https://hackmerlin.io/`
Docker	Not required (hosted)
Scoring	Level completion (password extraction)
Skills Tested	test-llm, prompt engineering
Best Score	7/7 (100%) — 2026-03-20

Defense Layers¶

Level	Defenses
L1-L3	None → persona → basic output filter
L4-L5	Input filter + output filter
L6	Complex output filter (reversed + case-insensitive)
L7	Input filter + output filter + LLM-as-judge + active deception layer

Running HackMerlin¶

/pentest-hackmerlin                # All 7 levels
/pentest-hackmerlin --level 7      # Start from level 7

Gandalf¶

Gandalf by Lakera: 8-level progressive LLM prompt injection challenge. Each level adds stronger input/output filters. Hosted target (no Docker needed). Tests prompt injection, system prompt extraction, and filter bypass techniques.

Property	Value
Type	LLM prompt injection (8 levels)
Target	External hosted (Lakera)
Docker	Not required
Scoring	Level completion (password extraction)
Skills Tested	test-llm, prompt engineering

Running Gandalf¶

/pentest-gandalf                # All 8 levels
/pentest-gandalf --level 5      # Start from level 5

PortSwigger Academy¶

PortSwigger Web Security Academy: 250+ labs across all web security categories. Playwright-driven lab launcher creates ephemeral instances on PortSwigger's infrastructure. Each lab has a specific vulnerability to exploit and a "solved" status.

Property	Value
Type	Web security labs (250+ labs)
Target	External hosted (PortSwigger)
Docker	Not required
Scoring	Lab completion (auto-detected by PortSwigger)
Categories	SQLi, XSS, CSRF, CORS, clickjacking, DOM-based, SSRF, XXE, OS command injection, directory traversal, access control, auth, business logic, HTTP request smuggling, WebSockets, deserialization, information disclosure, race conditions, prototype pollution, GraphQL, JWT, OAuth, SSTI, web cache poisoning

Difficulty Tiers¶

Tier	Description
Apprentice	Guided, single-step exploits
Practitioner	Multi-step, real-world scenarios
Expert	Advanced, chained exploits

Running PortSwigger Labs¶

/pentest-portswigger                           # All labs
/pentest-portswigger --category sql-injection   # Single category
/pentest-portswigger --difficulty practitioner   # By difficulty
/pentest-portswigger --batch 10                 # First 10

VibeApps (Neo Benchmark)¶

Three AI-generated web applications from ProjectDiscovery's Vibe-Coding research. Benchmark for comparing AI security scanners against Neo (ProjectDiscovery's AI scanner). 74 confirmed vulnerabilities across 3 apps, each built with a different AI coding tool and tech stack.

App	Domain	Stack	Built with	LOC	Vulns	Port
VaultBank	Banking	React 18, FastAPI, SQLAlchemy, JWT, PostgreSQL	Claude Code (Sonnet 4.6)	10,470	30	8101
MedPortal	Healthcare	Next.js 14, Prisma, PostgreSQL, NextAuth.js	Codex (gpt-5-codex)	4,528	20	8102
ClaimFlow	Insurance	SvelteKit, Drizzle ORM, SQLite, Custom Auth	Cursor	12,368	24	8103

Roles per App¶

App	Roles (5 each)
VaultBank	Admin, Branch Manager, Compliance Officer, Teller, Customer
MedPortal	Admin, Doctor, Nurse, Lab Technician, Patient
ClaimFlow	Admin, Underwriter, Adjuster, Agent/Broker, Policyholder

Vulnerability Distribution¶

Severity	VaultBank	MedPortal	ClaimFlow	Total
Critical	6	0	2	8
High	3	6	4	13
Medium	6	1	9	16
Low	13	7	5	25
Info	2	6	4	12
Total	30	20	24	74

Neo Baseline (ProjectDiscovery)¶

Metric	Neo	Claude (PD)	Snyk	Invicti
True Positives	66/74	41/74	0/74	10/74
False Positives	5	24	5	10
Precision	93%	63%	0%	50%
Critical+High	21/21	13/21	0/21	0/21

BeDefended Results (2026-03-24)¶

Metric	BeDefended (blind)	Neo	Delta
True Positives	61/74	66/74	-5
False Positives	8	5	+3
Precision	88.4%	93.0%	-4.6pp
Extra vulns (outside 74)	8	0	+8
Total real vulns found	69	66	+3

Per-app breakdown (first blind run):

App	BeDefended	Neo	Delta
VaultBank	23/30	27/30	-4
MedPortal	17/20	17/20	0 (tied)
ClaimFlow	21/24	22/24	-1

Key Vulnerability Categories Found¶

Category	Examples
Business Logic	Self-deposit money creation, dispute refund bypass, race condition double-spend, unlimited loan amounts
Broken Access Control	IDOR on patient records, cross-user dispute filing, manager cross-branch freeze, body-param IDOR
Authentication	Hardcoded JWT secret, JWT reuse after logout, weak password policy, no account lockout
Mass Assignment	Prisma/Drizzle raw body to ORM update, role escalation via user update
Information Disclosure	Password hash exposure via ORM, staff user IDs in responses, server version
Cryptographic Failures	SHA-256 with hardcoded salt, missing HSTS
File Upload	Unrestricted MIME types on dispute evidence and message attachments

Running the Benchmark¶

/pentest-neo --all                    # All 3 apps sequentially
/pentest-neo vaultbank                # Single app
/pentest-neo vaultbank --code-only    # White-box only
/pentest-neo vaultbank --dynamic-only # Black-box only

Scoring¶

python evals/labs/vibeapps-scorer.py engagements/<dir> --app all --html --save

Scorer features: App filtering by App: tag, global optimal matching (score matrix), CWE family matching, stem-aware keyword matching. Compares vs Neo baseline automatically.