Benchmarks

Controlled, repeatable environments for measuring the platform's vulnerability detection capabilities. Each benchmark is a known-vulnerable application with a curated answer key, enabling objective TP/FN/FP scoring across engagements.
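The TP/FN/FP framing can be pictured as a set comparison between the findings a run reports and the curated answer key. The sketch below is illustrative only — the finding identifiers are made up, and the real lab-scorer.py may match on richer criteria (endpoint, parameter, vulnerability class):

```python
# Minimal sketch of TP/FN/FP scoring against a curated answer key.
# Finding IDs are hypothetical; actual matching logic may differ.

def score(reported: set[str], answer_key: set[str]) -> dict:
    tp = reported & answer_key   # real vulns the run found
    fn = answer_key - reported   # real vulns the run missed
    fp = reported - answer_key   # reports not in the key
    precision = len(tp) / len(reported) if reported else 0.0
    recall = len(tp) / len(answer_key) if answer_key else 0.0
    return {"TP": len(tp), "FN": len(fn), "FP": len(fp),
            "precision": precision, "recall": recall}

result = score({"sqli-login", "xss-search", "idor-profile"},
               {"sqli-login", "xss-search", "ssrf-webhook"})
```

Here the run earns 2 TPs, misses one key entry (1 FN), and reports one extra finding (1 FP), so precision and recall both land at 2/3.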


Quick Start

/labs-up                           # Start all labs
/labs-up --only vulnhr             # Start a single lab
/pentest http://vulnhr.test:7331 --eval   # Run benchmark pentest
/labs-eval                         # Parallel eval across all labs
/labs-eval --results               # Aggregate latest scores

Sections

Page               Description
Benchmark Catalog  Complete catalog of all 35 benchmarks across 4 domains (web, LLM, mobile, code security) with detailed descriptions
Latest Scores      Scoreboard with the latest result for each lab, breakdowns by category/skill/difficulty, and run history
Lab Catalog        Local Docker lab targets with tech stack, vulnerability counts, auth methods, and OWASP category breakdowns
Eval Settings      Complete parameter reference for --eval mode: budgets, timeouts, wave config, RRE, coverage gates
Scoring            TP/FN/FP scoring methodology, gap analysis, and history tracking
Skill Evals        Per-skill evaluation and benchmarking

Benchmark Portfolio (35 benchmarks, ~22,885 challenges)

Web Application Security (17 benchmarks)

Local Docker labs, CTF collections, AI-generated apps, external platforms, and academic benchmarks covering web vulnerabilities, real CVEs, WAF bypass, and attack chains.

NEW: BountyBench (NeurIPS 2025, AISI-endorsed), CVE-Bench v2.0 (ICML 2025), PACEbench (attack chains + WAF), AutoPenBench (EMNLP 2025, milestone scoring), Wiz Arena (257 challenges, industry leaderboard)

LLM / AI Agent Security (7 benchmarks)

Progressive prompt injection, automated LLM probing, agent exploitation, and MCP security testing.

NEW: Garak (NIST, 200+ probes), CyberSecEval-2 (Meta AI, 8 vuln classes), PINT (Lakera formal benchmark), DVAA (vulnerable AI agent), Gandalf Agent Breaker

Mobile Application Security (5 benchmarks)

Reverse engineering, OWASP Mobile Top 10, static analysis, and dynamic testing.

ALL NEW: OWASP MAS Crackmes (L1-L4), InjuredAndroid (15+ CTF challenges), DroidBench 2.0 (120 SAST tests), PIVAA (scanner benchmark), InsecureBankv2 (OWASP MASTG)

Code Security / Secure Code Review (6 benchmarks)

Secure code generation, three-gate repair validation, FP measurement, and per-CWE calibration.

NEW: SusVibes (CMU, 200 tasks), AICGSecEval (Tencent, 29 CWEs), OpenSSF CVE (200 JS/TS CVEs), OWASP Benchmark (21K Java tests), Juliet (NIST, per-CWE calibration)


Architecture Overview

graph TB
    R["evals/labs/registry.json"] --> UP["/labs-up<br/>Start + Health Check + Credential Verify"]
    UP --> P["/pentest TARGET --eval<br/>Lab Max-Score Mode"]
    P --> S["lab-scorer.py<br/>TP / FN / FP Scoring"]
    S --> H["history/&lt;timestamp&gt;.json"]
    S --> G["gap-analysis.md"]
    S --> HTML["score_report.html"]

    style R fill:#4a148c,color:#fff
    style UP fill:#6a1b9a,color:#fff
    style P fill:#0277bd,color:#fff
    style S fill:#00838f,color:#fff
    style H fill:#00695c,color:#fff
    style G fill:#00695c,color:#fff
    style HTML fill:#00695c,color:#fff

Eval Isolation

Strict isolation during --eval pentests

During benchmark pentests, the same rules apply as real engagements: no consulting past results, no reading answer keys, no referencing gap-analysis.md. Each evaluation discovers vulnerabilities from scratch to produce valid measurements.

Blocked file patterns during active /pentest

  • evals/labs/*/answer-key.json
  • evals/labs/*/history/*.json
  • evals/labs/*/score_report.*
  • evals/labs/*/gap-analysis.md
  • .claude/worktrees/
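One way a blocklist like this could be enforced is a glob check ahead of any file read. This is a hedged sketch — only the pattern list comes from the docs; the guard function and the actual enforcement hook are assumptions:

```python
# Sketch of a glob-based guard over the blocked patterns above.
# Note: fnmatch's "*" also matches "/", which is acceptable for
# this illustration but looser than pathspec-style globbing.

from fnmatch import fnmatch

BLOCKED_GLOBS = [
    "evals/labs/*/answer-key.json",
    "evals/labs/*/history/*.json",
    "evals/labs/*/score_report.*",
    "evals/labs/*/gap-analysis.md",
]
BLOCKED_DIRS = [".claude/worktrees/"]

def is_blocked(path: str) -> bool:
    """Return True if path matches any blocked pattern or directory."""
    return (any(path.startswith(d) for d in BLOCKED_DIRS)
            or any(fnmatch(path, g) for g in BLOCKED_GLOBS))
```

For example, `evals/labs/vulnhr/answer-key.json` is blocked while `evals/labs/vulnhr/README.md` is not, so a run can still read lab metadata without seeing scored answers.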