Benchmarks¶
Controlled, repeatable environments for measuring the platform's vulnerability detection capabilities. Each benchmark is a known-vulnerable application with a curated answer key, enabling objective TP/FN/FP scoring across engagements.
Quick Start¶
```
/labs-up                                  # Start all labs
/labs-up --only vulnhr                    # Start a single lab
/pentest http://vulnhr.test:7331 --eval   # Run benchmark pentest
/labs-eval                                # Parallel eval across all labs
/labs-eval --results                      # Aggregate latest scores
```
Sections¶
| Page | Description |
|---|---|
| Benchmark Catalog | Complete catalog of all 35 benchmarks across 4 domains (web, LLM, mobile, code security) with detailed descriptions |
| Latest Scores | Updated scoreboard with the latest result for each lab, breakdown by category/skill/difficulty, and run history |
| Lab Catalog | Local Docker lab targets with tech stack, vulnerability counts, auth methods, and OWASP category breakdowns |
| Eval Settings | Complete parameter reference for --eval mode: budgets, timeouts, wave config, RRE, coverage gates |
| Scoring | TP/FN/FP scoring methodology, gap analysis, history tracking |
| Skill Evals | Per-skill evaluation and benchmarking |
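The TP/FN/FP methodology referenced in the Scoring row can be sketched as simple set arithmetic over matched finding IDs. This is an illustrative sketch only: the `score` function and its ID-based matching are assumptions, not the real `lab-scorer.py` interface.

```python
# Hypothetical sketch of TP/FN/FP scoring: matches findings to the
# answer key by vulnerability ID. Not the actual lab-scorer.py API.
def score(answer_key: set[str], findings: set[str]) -> dict[str, float]:
    tp = len(answer_key & findings)   # known vulns the run found
    fn = len(answer_key - findings)   # known vulns the run missed
    fp = len(findings - answer_key)   # reported vulns not in the key
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fn": fn, "fp": fp,
            "precision": precision, "recall": recall}
```

Recall against the answer key is the headline number; FP count feeds the gap analysis.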
Benchmark Portfolio (35 benchmarks, ~22,885 challenges)¶
Web Application Security (17 benchmarks)¶
Local Docker labs, CTF collections, AI-generated apps, external platforms, and academic benchmarks covering web vulnerabilities, real CVEs, WAF bypass, and attack chains.
NEW: BountyBench (NeurIPS 2025, AISI-endorsed), CVE-Bench v2.0 (ICML 2025), PACEbench (attack chains + WAF), AutoPenBench (EMNLP 2025, milestone scoring), Wiz Arena (257 challenges, industry leaderboard)
LLM / AI Agent Security (7 benchmarks)¶
Progressive prompt injection, automated LLM probing, agent exploitation, and MCP security testing.
NEW: Garak (NIST, 200+ probes), CyberSecEval-2 (Meta AI, 8 vuln classes), PINT (Lakera formal benchmark), DVAA (vulnerable AI agent), Gandalf Agent Breaker
Mobile Application Security (5 benchmarks)¶
Reverse engineering, OWASP Mobile Top 10, static analysis, and dynamic testing.
ALL NEW: OWASP MAS Crackmes (L1-L4), InjuredAndroid (15+ CTF challenges), DroidBench 2.0 (120 SAST tests), PIVAA (scanner benchmark), InsecureBankv2 (OWASP MASTG)
Code Security / Secure Code Review (6 benchmarks)¶
Secure code generation, three-gate repair validation, FP measurement, and per-CWE calibration.
NEW: SusVibes (CMU, 200 tasks), AICGSecEval (Tencent, 29 CWEs), OpenSSF CVE (200 JS/TS CVEs), OWASP Benchmark (21K Java tests), Juliet (NIST, per-CWE calibration)
Architecture Overview¶
```mermaid
graph TB
    R["evals/labs/registry.json"] --> UP["/labs-up<br/>Start + Health Check + Credential Verify"]
    UP --> P["/pentest TARGET --eval<br/>Lab Max-Score Mode"]
    P --> S["lab-scorer.py<br/>TP / FN / FP Scoring"]
    S --> H["history/<timestamp>.json"]
    S --> G["gap-analysis.md"]
    S --> HTML["score_report.html"]
    style R fill:#4a148c,color:#fff
    style UP fill:#6a1b9a,color:#fff
    style P fill:#0277bd,color:#fff
    style S fill:#00838f,color:#fff
    style H fill:#00695c,color:#fff
    style G fill:#00695c,color:#fff
    style HTML fill:#00695c,color:#fff
```
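The scorer's `history/<timestamp>.json` output step can be sketched as appending one timestamped record per run. The function name and field names below are illustrative assumptions, not the real `lab-scorer.py` schema.

```python
import json
import time
from pathlib import Path

# Hypothetical sketch of the history-writing step: one JSON file per
# scored run, named by timestamp. Field names are illustrative only.
def write_history(lab_dir: Path, scores: dict) -> Path:
    ts = time.strftime("%Y%m%dT%H%M%S")
    out = lab_dir / "history" / f"{ts}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"timestamp": ts, **scores}, indent=2))
    return out
```

One file per run keeps history append-only, so the Latest Scores page can diff any two runs without a database.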
Eval Isolation¶
Strict isolation during --eval pentests
During benchmark pentests, the same rules apply as real engagements: no consulting past results, no reading answer keys, no referencing gap-analysis.md. Each evaluation discovers vulnerabilities from scratch to produce valid measurements.
Blocked file patterns during active /pentest¶
- `evals/labs/*/answer-key.json`
- `evals/labs/*/history/*.json`
- `evals/labs/*/score_report.*`
- `evals/labs/*/gap-analysis.md`
- `.claude/worktrees/`
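Enforcement of such a blocklist can be sketched with standard glob matching. The `is_blocked` helper is an assumption for illustration; the platform's actual file-access hook may work differently.

```python
from fnmatch import fnmatch

# Patterns copied from the blocklist above; the trailing "*" on the
# worktrees entry is added here so the glob covers files beneath it.
BLOCKED = [
    "evals/labs/*/answer-key.json",
    "evals/labs/*/history/*.json",
    "evals/labs/*/score_report.*",
    "evals/labs/*/gap-analysis.md",
    ".claude/worktrees/*",
]

# Hypothetical helper: returns True if a path matches any blocked glob.
def is_blocked(path: str) -> bool:
    return any(fnmatch(path, pat) for pat in BLOCKED)
```

Note that Python's `fnmatch` lets `*` span `/`, so deeper paths under a blocked directory still match, which is the desired fail-closed behavior here.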