# Benchmark Catalog
Comprehensive catalog of all 35 benchmarks integrated into the BeDefended automated pentesting suite, organized by domain. Each benchmark provides controlled, repeatable evaluation of specific security testing capabilities.
## Summary by Domain

| Domain | Benchmarks | Challenges | Scoring Types | Key Venues |
|---|---|---|---|---|
| Web Application | 17 | ~725 | Flag, answer-key, dollar-impact, milestone, pass@3 | NeurIPS, ICML, EMNLP |
| LLM / AI Agent | 7 | 400+ probes/scenarios | Probe, level, ESR, flag | NIST, Meta AI, Lakera |
| Mobile | 5 | ~160 | Flag, binary, answer-key | OWASP, Academic |
| Code Security | 6 | ~21,600 | Dual, three-gate, F1, ROC, per-CWE | ICLR, OWASP, NIST, OpenSSF, Tencent |
| TOTAL | 35 | ~22,885 | 16 distinct scoring methods | 3 AISI-endorsed |
## Web Application Security

### Local Docker Labs (Answer-Key Based)

| Lab | Tech Stack | Vulns | Skill | Status |
|---|---|---|---|---|
| VulnHR | Laravel/PHP, MySQL, LDAP | 81 | /pentest | 100% (87/87) |
| Juice Shop | Express.js, Angular, SQLite | 55 | /pentest | Not evaluated |
| SuperSecureBank | .NET 8, SQL Server | 37 | /pentest | 91.9% (31/37) |
| AltoroMutual | Spring Boot, React, PostgreSQL | 29 | /pentest | 40.5% |
| DVWA | PHP, MariaDB | 28 | /pentest | Not evaluated |
| DVRA | FastAPI, MongoDB | 12 | /pentest | Not evaluated |
### CTF Benchmarks (Flag-Based)

| Benchmark | Challenges | Source | Scoring | Skill |
|---|---|---|---|---|
| XBOW | 104 | ProjectDiscovery | Flag (FLAG{...}) | /pentest-xbow |
| XBOW Blind | 104 | ProjectDiscovery (no hints) | Flag | /pentest-xbow-nohint |
| HackBench | 16 | ElectrovoltSec | Flag (ev{...}) + points | /pentest-hackbench |
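Flag scoring reduces to pattern matching on agent output. A minimal sketch, assuming a hypothetical `extract_flag` helper; the two flag formats come from the table above:

```python
import re

# Flag formats per benchmark, taken from the table above.
FLAG_PATTERNS = {
    "xbow": re.compile(r"FLAG\{[^}]+\}"),
    "hackbench": re.compile(r"ev\{[^}]+\}"),
}

def extract_flag(output: str, benchmark: str) -> str | None:
    """Return the first flag-formatted token in agent output, if any."""
    match = FLAG_PATTERNS[benchmark].search(output)
    return match.group(0) if match else None

assert extract_flag("pwned: FLAG{s3cr3t}", "xbow") == "FLAG{s3cr3t}"
```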
### AI-Generated App Benchmarks

| Benchmark | Apps | Vulns | Source | Scoring | Skill |
|---|---|---|---|---|---|
| Neo / Vibe-Coding | 3 (VaultBank, MedPortal, ClaimFlow) | 74 | ProjectDiscovery | Finding-based (vs Neo baseline) | /pentest-neo |

### External Platforms

| Platform | Challenges | Source | Scoring | Skill |
|---|---|---|---|---|
| PortSwigger Academy | 250+ | PortSwigger | Lab completion | /pentest-portswigger |
| Root-Me | Variable | root-me.org | Points | /pentest-rootme |
| HackingHub | 67 hubs | HackingHub | Flag + multi-flag | /pentest-hackinghub |
### NEW — Academic & Industry Web Benchmarks

| Benchmark | Challenges | Source | Venue | Scoring | Skill |
|---|---|---|---|---|---|
| BountyBench | 120 tasks (40 bounties × 3) | Stanford/CRFM | NeurIPS 2025 | Flag + dollar-impact ($10–$30,485) | /pentest-bountybench |
| CVE-Bench v2.0 | 40 CVEs | UIUC (Daniel Kang Lab) | ICML 2025 Spotlight | Flag + ABC validation | /pentest-cvebench |
| PACEbench | 32 scenarios | Liu et al. | arXiv Oct 2025 | Pass@3 + chain tracking | /pentest-pacebench |
| AutoPenBench | 33+ | Politecnico di Torino | EMNLP 2025 Industry | Milestone-based (MC + MS) | /pentest-autopenbench |
| Wiz AI Cyber Arena | 257 (web + API + cloud + CVE + zero-day) | Wiz Research | Industry leaderboard | Pass@3, deterministic | /pentest-wizarena |
### BountyBench Details
- AISI-endorsed: Used by US AI Safety Institute & UK AI Safety Institute for model evaluation
- Dollar-impact scoring: each bounty carries a real-world dollar value ($10 to $30,485); see the weighting sketch after this list
- Three task types: Detect (find vuln), Exploit (prove exploitation), Patch (fix vulnerability)
- 25 real OSS targets: Lunary, LibreChat, MLflow, Django, FastAPI, curl, and more
- 27 CWEs covering 9 of the 10 OWASP Top 10 categories
- Reference scores: Claude Code 57.5% exploit, 87.5% patch
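A minimal sketch of how dollar-impact weighting differs from a plain solve rate, using made-up bounty values in the documented range; this is not BountyBench's official scorer:

```python
# Hypothetical (bounty value in USD, solved?) results for three bounties.
results = [
    (30_485, True),
    (2_500, False),
    (10, True),
]

solved_value = sum(value for value, solved in results if solved)
total_value = sum(value for value, _ in results)

plain_rate = sum(solved for _, solved in results) / len(results)
weighted_rate = solved_value / total_value

print(f"unweighted solve rate: {plain_rate:.1%}")    # 66.7%
print(f"dollar-weighted rate:  {weighted_rate:.1%}") # ~92.4%: high-value solves dominate
```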
### CVE-Bench v2.0 Details
- ICML 2025 Spotlight paper with rigorous ABC (Agentic Benchmark Checklist) validation
- US AI Safety Institute contributed to development
- 40 critical-severity CVEs in containerized sandboxes
- ABC validation: Pre-patch exploit → verifier confirms → post-patch check ensures fix
- Top agent: 13% success rate — massive improvement headroom
### PACEbench Details
- 4 scenario types: A-CVE (single vuln), B-CVE (blended), C-CVE (chained/lateral movement), D-CVE (WAF-protected)
- D-CVE uses production WAFs: ModSecurity, Naxsi, Coraza
- No model has solved ANY D-CVE — frontier-pushing benchmark
- Pass@3 scoring: 3 attempts per scenario for statistical rigor
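A minimal sketch of pass@3 aggregation over hypothetical scenario outcomes; the real harness also tracks exploit chains:

```python
# Hypothetical scenario -> three boolean attempt outcomes.
attempts = {
    "A-CVE-01": [False, True, False],
    "C-CVE-02": [False, False, False],
    "D-CVE-01": [False, False, False],  # no model has solved any D-CVE
}

# A scenario passes if ANY of its three attempts succeeds.
pass_at_3 = {name: any(runs) for name, runs in attempts.items()}
score = sum(pass_at_3.values()) / len(pass_at_3)
print(f"pass@3: {score:.1%}")  # 33.3%
```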
### AutoPenBench Details
- EMNLP 2025 Industry Track publication
- Milestone-based scoring: not just pass/fail; intermediate progress is tracked (see the sketch after this list)
- MC (Command Milestones): Specific commands completed (3-16 per scenario)
- MS (Stage Milestones): Major stages reached (recon, vuln ID, exploit, post-exploit)
- Kali Linux attack containers: Tools run inside Kali, not host
- MIT license: Fully open source
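A minimal sketch of milestone-based partial credit, with hypothetical milestone names; AutoPenBench defines the real MC/MS sets per scenario:

```python
# Hypothetical milestones for one scenario: which were reached?
command_milestones = {"nmap_scan": True, "ssh_login": True, "read_flag": False}
stage_milestones = {"recon": True, "vuln_id": True, "exploit": False, "post_exploit": False}

mc_score = sum(command_milestones.values()) / len(command_milestones)
ms_score = sum(stage_milestones.values()) / len(stage_milestones)

# A failed run still earns credit for the ground it covered.
print(f"MC: {mc_score:.0%}, MS: {ms_score:.0%}")  # MC: 67%, MS: 50%
```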
### Wiz AI Cyber Model Arena Details
- Industry-standard leaderboard: 25 agent-model combinations tested
- 5 domains: Zero-day discovery, CVE detection, API security, web CTF, cloud security
- Current leader: Claude Opus 4.6 with Claude Code
- Deterministic scoring: No LLM-as-judge, dynamic validation prevents hardcoding
- Network-isolated Docker containers: Each challenge fully sandboxed
## LLM / AI Agent Security

### Existing LLM Benchmarks

| Benchmark | Levels | Source | Type | Skill |
|---|---|---|---|---|
| Gandalf | 8 + Agent Breaker | Lakera | Progressive prompt injection | /pentest-gandalf |
| HackMerlin | 7 | bgalek | Progressive prompt injection (strong defense) | /pentest-hackmerlin |
### NEW — LLM Security Benchmarks

| Benchmark | Probes/Scenarios | Source | Scoring | Skill |
|---|---|---|---|---|
| Garak | 200+ probes | NIST (Leon Derczynski) | Probe pass/fail, category profiles | /pentest-garak |
| CyberSecEval-2 | 200+ scenarios | Meta AI | ESR per vulnerability class | /pentest-cyberseceval |
| PINT Benchmark | Large taxonomy | Lakera | ESR per attack vector | /pentest-pint |
| DVAA | ~15 scenarios | OpenA2A | Exploitation success | /pentest-dvaa |
| Gandalf Agent Breaker | 8+ levels | Lakera | Level completion | /pentest-gandalf --agent-breaker |
### Garak Details
- NIST researcher created — de facto standard for automated LLM red-teaming
- 200+ probes across: prompt injection, jailbreak, data extraction, hallucination, toxicity, encoding (see the aggregation sketch after this list)
- Dual-mode: standalone Garak eval OR comparison with /test-llm findings
- Docker-based: Runs inside pentest-tools container
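A minimal sketch of probe pass/fail aggregation into category profiles, assuming a hypothetical `(probe, passed)` result list rather than Garak's actual report format:

```python
from collections import defaultdict

# Hypothetical parsed results: probe category and whether the target
# model resisted that probe (True = passed / resisted).
results = [
    ("promptinject", False),
    ("promptinject", True),
    ("encoding", True),
    ("jailbreak", False),
]

by_category: dict[str, list[bool]] = defaultdict(list)
for probe, passed in results:
    by_category[probe].append(passed)

for probe, outcomes in by_category.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{probe}: {rate:.0%} pass rate ({len(outcomes)} attempts)")
```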
### CyberSecEval-2 Details
- Meta AI comprehensive LLM security evaluation
- 8 vulnerability classes: Prompt injection (direct + indirect), jailbreak, code injection, data exfiltration, agent/tool abuse, plugin exploitation, multi-step social engineering, safety bypass
- ESR metric: Exploit Success Rate, the industry-standard measure (see the sketch after this list)
- 25-50% injection success documented as baseline
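A minimal sketch of per-class ESR computation with made-up counts; not Meta's harness:

```python
# Hypothetical class -> (successful exploits, total attempts).
trials = {
    "direct_prompt_injection": (12, 40),
    "indirect_prompt_injection": (19, 40),
    "code_injection": (4, 40),
}

for vuln_class, (hits, total) in trials.items():
    print(f"{vuln_class}: ESR = {hits / total:.1%}")
# The 25-50% injection baseline above corresponds to ESRs in this range.
```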
### PINT Benchmark Details
- Lakera (Gandalf creators) formal academic benchmark
- Taxonomy-mapped attacks: direct, indirect (RAG/documents), plugin-based, RAG-based
- Per-attack metadata: category, vector, target, severity, stealthiness, transferability
- Extends Gandalf/HackMerlin eval with structured evaluation
### DVAA Details
- First intentionally vulnerable AI agent platform
- MCP-specific attacks: Tool interface abuse, cross-agent trust exploitation
- OWASP Agentic AI Top 10 coverage
- Docker-based: Containerized vulnerable AI agent
## Mobile Application Security
All NEW — no mobile benchmarks existed before this integration.
| Benchmark | Challenges | Source | Platform | Scoring | Skill |
|---|---|---|---|---|---|
| OWASP MAS Crackmes | 4+ levels | OWASP | Android + iOS | Flag (secret extraction) | /pentest-crackmes |
| InjuredAndroid | 15+ | B3nac (bug bounty) | Android | Flag per challenge | /pentest-injuredandroid |
| DroidBench 2.0 | 120 | Uni Paderborn | Android | Binary (leak detected) | /pentest-droidbench |
| PIVAA | OWASP M1-M10 | High-Tech Bridge | Android | Answer-key | /pentest-pivaa |
| InsecureBankv2 | Multiple | Data Theorem / OWASP | Android + Server | Answer-key | /pentest-insecurebank |
### OWASP MAS Crackmes Details
- THE standard mobile reverse engineering benchmark
- Progressive difficulty: L1 (root detection bypass) → L4 (combined protections)
- Tests: Root detection bypass, anti-debugging, native code analysis, obfuscation handling
- Tools: Frida, Objection, jadx, apktool, adb
### InjuredAndroid Details
- Based on real bug bounty findings
- Categories: Hardcoded credentials, insecure data storage, exported activities, deep link abuse, WebView XSS, Firebase misconfiguration, broadcast receiver abuse, certificate pinning
- CTF-style with flags — automatable
### DroidBench 2.0 Details
- Standard SAST benchmark for Android analysis tools
- 120 test cases across 9 categories: activity communication, lifecycle, field sensitivity, emulator detection, ICC, reflection, threading, callbacks, general Java
- Automated by design — built for tool comparison
- Ground truth labels for each test case (leak/no-leak)
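A minimal sketch of binary leak scoring against the ground-truth labels, with hypothetical test-case names:

```python
# Ground truth: does this test case actually leak? Reported: did the
# agent flag a leak? (Hypothetical cases and verdicts.)
ground_truth = {"ActivityCommunication1": True, "Reflection3": True, "Callbacks2": False}
reported = {"ActivityCommunication1": True, "Reflection3": False, "Callbacks2": True}

tp = sum(ground_truth[t] and reported[t] for t in ground_truth)
fn = sum(ground_truth[t] and not reported[t] for t in ground_truth)
fp = sum(not ground_truth[t] and reported[t] for t in ground_truth)

print(f"TP={tp} FN={fn} FP={fp}")  # TP=1 FN=1 FP=1
```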
### PIVAA Details
- Created specifically as a scanner benchmark
- Maps directly to OWASP Mobile Top 10 2024 (M1-M10)
- Modern replacement for DIVA (Damn Insecure and Vulnerable App)
### InsecureBankv2 Details
- Official OWASP MASTG testing app (MASTG-APP-0010)
- Banking domain = high-value vulnerability types
- Hybrid: Docker server component + Android client APK
- Categories: Auth bypass, weak crypto, insecure storage, WebView, broadcast abuse, content provider leaks
## Code Security / Secure Code Review

### Existing Code Review Benchmarks

| Benchmark | Type | Skill |
|---|---|---|
| WebGoat | Java/Spring code review only | /code-review |
### NEW — Code Security Benchmarks

| Benchmark | Test Cases | Source | Venue | Scoring | Skill |
|---|---|---|---|---|---|
| SusVibes | 200 tasks, 77 CWEs | CMU/Columbia (Lei Li Lab) | ICLR 2026 submission | Dual (functional + security) | /review-susvibes |
| AICGSecEval / A.S.E | 29 CWEs, repo-level | Tencent Security | 2025 | Three-gate (compile + test + security) | /review-aicgseceval |
| OpenSSF CVE Benchmark | ~200 JS/TS CVEs | Open Source Security Foundation | OpenSSF | Precision/Recall/F1, FP rate | /review-ossf |
| OWASP Benchmark | 21,041 test cases | OWASP Foundation | OWASP | ROC scorecard per category | /review-owasp-benchmark |
| Juliet Test Suite | Thousands | NIST SAMATE | CC0 Public Domain | Per-CWE detection rate | /review-juliet |
### SusVibes Details
- Tests SECURE CODE GENERATION (not just vuln detection)
- 200 tasks from 108 real OSS projects
- 77 CWEs covered
- Dual scoring: functional correctness AND security correctness (see the sketch after this list)
- Shocking baseline: Claude 4 Sonnet 47.5% functional, only 8.25% secure
- Real repo-level tasks: avg 160K LOC, 867 files, 170 lines to generate
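A minimal sketch of dual scoring over hypothetical task results; a task counts only if it is functionally correct AND secure:

```python
# Hypothetical per-task outcomes.
tasks = [
    {"functional": True, "secure": False},   # works, but vulnerable
    {"functional": True, "secure": True},
    {"functional": False, "secure": True},   # secure, but broken
    {"functional": True, "secure": False},
]

functional = sum(t["functional"] for t in tasks) / len(tasks)
dual = sum(t["functional"] and t["secure"] for t in tasks) / len(tasks)
print(f"functional: {functional:.1%}, functional+secure: {dual:.1%}")
# 75.0% vs 25.0%, mirroring the large gap in the baseline above.
```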
### AICGSecEval Details

- Most rigorous code review benchmark available
- Three-gate repair validation (see the sketch after this list):
  - Gate 1: generated fix compiles
  - Gate 2: generated fix passes the test suite
  - Gate 3: generated fix passes security verification (Semgrep/CodeQL/PoV)
- Cross-file dependency analysis required
- 29 CWEs with containerized build environments per CVE
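A minimal sketch of the three-gate short-circuit, with simulated gate outcomes; the real benchmark runs compilers, test suites, and Semgrep/CodeQL/PoV checks inside per-CVE containers:

```python
def three_gate(fix: dict) -> bool:
    # Gates run in order; failing any gate short-circuits the rest,
    # so a fix that does not compile never reaches the security check.
    gates = ("compiles", "tests_pass", "security_pass")
    return all(fix[g] for g in gates)

# Hypothetical candidate fix: builds and passes tests, but Semgrep/CodeQL
# still flags the vulnerability.
candidate = {"compiles": True, "tests_pass": True, "security_pass": False}
print(three_gate(candidate))  # False
```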
### OpenSSF CVE Benchmark Details
- OpenSSF = highest credibility for open-source security
- ~200 JavaScript/TypeScript CVEs with vulnerable/patched code pairs
- Explicit FP rate measurement — critical for real-world code review (see the sketch after this list)
- Built-in comparison against ESLint, CodeQL, nodejsscan
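A minimal sketch of precision/recall/F1 with made-up counts; not the OpenSSF harness:

```python
# Hypothetical tallies: CVEs flagged correctly, safe code flagged
# wrongly, and CVEs missed.
tp, fp, fn = 140, 25, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# The patched half of each vulnerable/patched pair is what makes the
# FP rate directly measurable.
```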
### OWASP Benchmark Details
- THE original SAST/DAST benchmark — OWASP Foundation
- 21,041 test cases in a Java web application = massive statistical significance
- ROC curves per vulnerability category (see the scorecard sketch after this list)
- Standard reference used by all commercial SAST vendors
- Categories: SQLi, XSS, CMDi, Path Traversal, LDAP Injection, Weak Crypto, Weak Hashing, Weak Random, Trust Boundary, XPath Injection, Secure Cookie
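A minimal sketch of a scorecard entry, with made-up counts; the scorecard plots TPR against FPR per category, and the headline number is commonly reported as TPR minus FPR:

```python
# Hypothetical category -> (TP, FN, FP, TN) over the benchmark's
# true-vulnerable and deliberately-safe test cases.
categories = {
    "sqli": (220, 52, 30, 240),
    "xss": (180, 66, 90, 120),
}

for name, (tp, fn, fp, tn) in categories.items():
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(f"{name}: TPR={tpr:.2f} FPR={fpr:.2f} score={tpr - fpr:.2f}")
```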
### Juliet Test Suite Details
- NIST-maintained, CC0 license (unrestricted commercial use)
- Per-CWE calibration: identifies which CWEs the agent handles well vs poorly (see the sketch after this list)
- Known vulnerable + known safe pairs enable precise TP/FP measurement
- 100+ CWEs across C/C++, Java, C#
- Standard reference used by ALL SAST tool vendors
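A minimal sketch of per-CWE calibration over Juliet-style paired vulnerable/safe cases, with made-up counts:

```python
# Hypothetical CWE -> (detected on vulnerable cases, total vulnerable,
#                      false alarms on safe twins, total safe).
per_cwe = {
    "CWE-89": (47, 50, 2, 50),
    "CWE-78": (31, 50, 9, 50),
}

for cwe, (hits, vuln_total, alarms, safe_total) in per_cwe.items():
    print(f"{cwe}: detection {hits / vuln_total:.0%}, FP rate {alarms / safe_total:.0%}")
# The per-CWE view separates strong areas (SQLi here) from weak ones
# (OS command injection here).
```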
## Benchmark Commands Reference

### Web Application

```bash
# Local Docker Labs
/pentest http://vulnhr.test:7331 --eval         # VulnHR
/pentest http://juiceshop.test:3000 --eval      # Juice Shop
/pentest http://localhost:8080 --eval           # SuperSecureBank / DVWA / DVRA

# CTF Benchmarks
/pentest-xbow XBEN-001                          # Single XBOW benchmark
/pentest-xbow --batch 1                         # First 10 XBOW benchmarks
/pentest-hackbench EV-01                        # Single HackBench challenge
/pentest-hackbench --all                        # All 16 HackBench challenges

# NEW — Academic Web Benchmarks
/pentest-bountybench BB-001                     # Single BountyBench bounty
/pentest-bountybench --task exploit --all       # All bounties, exploit mode
/pentest-bountybench --task patch --all         # All bounties, patch mode
/pentest-cvebench CVE-2024-XXXXX                # Single CVE-Bench challenge
/pentest-cvebench --all                         # All 40 CVEs
/pentest-pacebench --type d-cve                 # WAF bypass scenarios only
/pentest-autopenbench --category web            # Web category only
/pentest-wizarena --domain cloud                # Cloud security challenges

# External Platforms
/pentest-portswigger --category sql-injection   # PortSwigger category
/pentest-rootme --category web-server           # Root-Me category
/pentest-hackinghub HH-WEB-001                  # Single HackingHub hub
```
### LLM / AI Agent

```bash
# Existing
/pentest-gandalf                                # Standard 8 levels
/pentest-hackmerlin                             # 7 levels with strong defense

# NEW
/pentest-gandalf --agent-breaker                # Agent Breaker mode
/pentest-garak https://target.com/api/chat      # Garak 200+ probes
/pentest-garak https://target.com/api/chat --compare-test-llm  # Compare with /test-llm
/pentest-cyberseceval https://target.com/api    # CyberSecEval-2
/pentest-pint https://target.com/api/chat       # PINT taxonomy eval
/pentest-dvaa --all                             # All DVAA scenarios
```
### Mobile

```bash
# ALL NEW
/pentest-crackmes --level 1                     # OWASP Crackme Level 1
/pentest-crackmes --all                         # All levels
/pentest-injuredandroid --all                   # All InjuredAndroid challenges
/pentest-droidbench --all                       # All 120 DroidBench test cases
/pentest-pivaa --all                            # PIVAA full scan
/pentest-insecurebank --all                     # InsecureBankv2 full assessment
```
### Code Security

```bash
# ALL NEW
/review-susvibes --batch 1                      # First 10 SusVibes tasks
/review-susvibes --cwe CWE-79                   # All XSS tasks
/review-aicgseceval --axis repair               # Repair evaluation only
/review-ossf --tool-compare codeql              # Compare with CodeQL
/review-owasp-benchmark --category sqli         # SQL injection category
/review-owasp-benchmark --scorecard             # Generate full scorecard
/review-juliet --language java --cwe CWE-89     # Java SQLi calibration
```
### Orchestrators

```bash
/pentest-all-labs                               # Run ALL benchmarks sequentially
/pentest-all-labs --only-web                    # Web benchmarks only
/pentest-all-labs --only-llm                    # LLM benchmarks only
/pentest-all-labs --only-mobile                 # Mobile benchmarks only
/pentest-all-labs --only-code-review            # Code review benchmarks only
/pentest-neo --all                              # All 3 VibeApps (vs Neo baseline)
/labs-eval                                      # Parallel eval across all labs
```
## AISI-Endorsed Benchmarks
Three benchmarks are endorsed/used by government AI Safety Institutes:
| Benchmark | AISI Usage | Venue |
|---|---|---|
| BountyBench | US AISI + UK AISI model evaluation | NeurIPS 2025 |
| CVE-Bench v2.0 | US AISI contributed to development | ICML 2025 |
| Cybench (reference) | US AISI + UK AISI pre-deployment testing | Stanford/CMU |
## Scoring Methods
| Method | Description | Used By |
|---|---|---|
| Answer-key | TP/FN/FP against curated vulnerability list | VulnHR, Juice Shop, SSB, AltoroMutual, DVWA, DVRA |
| Flag-based | Extract specific flag string | XBOW, HackBench, HackingHub, BountyBench, CVE-Bench |
| Level completion | Progressive challenge levels | Gandalf, HackMerlin |
| Lab completion | "Congratulations" detection | PortSwigger |
| Points-based | Variable points per challenge | Root-Me, HackBench |
| Finding-based | Compare findings vs ground truth | Neo/VibeApps |
| Dollar-impact | Real-world bounty value weighting | BountyBench |
| Pass@3 | 3 attempts, pass if any succeed | PACEbench, Wiz Arena |
| Milestone-based | Track intermediate progress (MC + MS) | AutoPenBench |
| Probe pass/fail | Automated probe success rate | Garak, PINT |
| ESR | Exploit Success Rate per class | CyberSecEval-2 |
| Dual (functional + security) | Must be correct AND secure | SusVibes |
| Three-gate | Compile + tests + security check | AICGSecEval |
| ROC scorecard | TPR vs FPR per category | OWASP Benchmark |
| Per-CWE | Detection rate per CWE ID | Juliet |
| Precision/Recall/F1 | Standard IR metrics | OpenSSF |