# Benchmark Catalog
Comprehensive catalog of all 35 benchmarks integrated into the BeDefended automated pentesting suite, organized by domain. Each benchmark provides controlled, repeatable evaluation of specific security testing capabilities.
## Summary by Domain

| Domain | Benchmarks | Challenges | Scoring Types | Key Venues |
|---|---|---|---|---|
| Web Application | 17 | ~725 | Flag, answer-key, dollar-impact, milestone, pass@3 | NeurIPS, ICML, EMNLP |
| LLM / AI Agent | 7 | 400+ probes/scenarios | Probe, level, ESR, flag | NIST, Meta AI, Lakera |
| Mobile | 5 | ~160 | Flag, binary, answer-key | OWASP, Academic |
| Code Security | 6 | ~21,600 | Dual, three-gate, F1, ROC, per-CWE | ICLR, OWASP, NIST, OpenSSF, Tencent |
| TOTAL | 35 | ~22,885 | 16 distinct scoring methods | 3 AISI-endorsed |
## Web Application Security

### Local Docker Labs (Answer-Key Based)

| Lab | Tech Stack | Vulns | Skill | Status |
|---|---|---|---|---|
| VulnHR | Laravel/PHP, MySQL, LDAP | 81 | /pentest | 100% (87/87) |
| Juice Shop | Express.js, Angular, SQLite | 55 | /pentest | Not evaluated |
| SuperSecureBank | .NET 8, SQL Server | 37 | /pentest | 91.9% (31/37) |
| AltoroMutual | Spring Boot, React, PostgreSQL | 29 | /pentest | 40.5% |
| DVWA | PHP, MariaDB | 28 | /pentest | Not evaluated |
| DVRA | FastAPI, MongoDB | 12 | /pentest | Not evaluated |
### CTF Benchmarks (Flag-Based)

| Benchmark | Challenges | Source | Scoring | Skill |
|---|---|---|---|---|
| XBOW | 104 | ProjectDiscovery | Flag (FLAG{...}) | /pentest-xbow |
| XBOW Blind | 104 | ProjectDiscovery (no hints) | Flag | /pentest-xbow-nohint |
| HackBench | 16 | ElectrovoltSec | Flag (ev{...}) + points | /pentest-hackbench |
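Flag scoring reduces to pattern matching on agent output. A minimal sketch, assuming a hypothetical `extract_flag` helper; the two flag formats come from the table above:

```python
import re

# Flag formats per benchmark, taken from the table above.
FLAG_PATTERNS = {
    "xbow": re.compile(r"FLAG\{[^}]+\}"),
    "hackbench": re.compile(r"ev\{[^}]+\}"),
}

def extract_flag(output: str, benchmark: str) -> str | None:
    """Return the first flag-formatted token in agent output, if any."""
    match = FLAG_PATTERNS[benchmark].search(output)
    return match.group(0) if match else None

assert extract_flag("pwned: FLAG{s3cr3t}", "xbow") == "FLAG{s3cr3t}"
```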
### AI-Generated App Benchmarks

| Benchmark | Apps | Vulns | Source | Scoring | Skill |
|---|---|---|---|---|---|
| Neo / Vibe-Coding | 3 (VaultBank, MedPortal, ClaimFlow) | 74 | ProjectDiscovery | Finding-based (vs Neo baseline) | /pentest-neo |

### External Platforms

| Platform | Challenges | Source | Scoring | Skill |
|---|---|---|---|---|
| PortSwigger Academy | 250+ | PortSwigger | Lab completion | /pentest-portswigger |
| Root-Me | Variable | root-me.org | Points | /pentest-rootme |
| HackingHub | 67 hubs | HackingHub | Flag + multi-flag | /pentest-hackinghub |
### NEW — Academic & Industry Web Benchmarks

| Benchmark | Challenges | Source | Venue | Scoring | Skill |
|---|---|---|---|---|---|
| BountyBench | 120 tasks (40 bounties × 3) | Stanford/CRFM | NeurIPS 2025 | Flag + dollar-impact ($10–$30,485) | /pentest-bountybench |
| CVE-Bench v2.0 | 40 CVEs | UIUC (Daniel Kang Lab) | ICML 2025 Spotlight | Flag + ABC validation | /pentest-cvebench |
| PACEbench | 32 scenarios | Liu et al. | arXiv Oct 2025 | Pass@3 + chain tracking | /pentest-pacebench |
| AutoPenBench | 33+ | Politecnico di Torino | EMNLP 2025 Industry | Milestone-based (MC + MS) | /pentest-autopenbench |
| Wiz AI Cyber Arena | 257 (web + API + cloud + CVE + zero-day) | Wiz Research | Industry leaderboard | Pass@3, deterministic | /pentest-wizarena |
### BountyBench Details
- AISI-endorsed: Used by US AI Safety Institute & UK AI Safety Institute for model evaluation
- Dollar-impact scoring: each bounty carries a real-world dollar value ($10 to $30,485); see the weighting sketch after this list
- Three task types: Detect (find vuln), Exploit (prove exploitation), Patch (fix vulnerability)
- 25 real OSS targets: Lunary, LibreChat, MLflow, Django, FastAPI, curl, and more
- 27 CWEs covering 9 of the 10 OWASP Top 10 categories
- Reference scores: Claude Code 57.5% exploit, 87.5% patch
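A minimal sketch of how dollar-impact weighting differs from a plain solve rate, using made-up bounty values in the documented range; this is not BountyBench's official scorer:

```python
# Hypothetical (bounty value in USD, solved?) results for three bounties.
results = [
    (30_485, True),
    (2_500, False),
    (10, True),
]

solved_value = sum(value for value, solved in results if solved)
total_value = sum(value for value, _ in results)

plain_rate = sum(solved for _, solved in results) / len(results)
weighted_rate = solved_value / total_value

print(f"unweighted solve rate: {plain_rate:.1%}")    # 66.7%
print(f"dollar-weighted rate:  {weighted_rate:.1%}") # ~92.4%: high-value solves dominate
```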
### CVE-Bench v2.0 Details
- ICML 2025 Spotlight paper with rigorous ABC (Agentic Benchmark Checklist) validation
- US AI Safety Institute contributed to development
- 40 critical-severity CVEs in containerized sandboxes
- ABC validation: Pre-patch exploit → verifier confirms → post-patch check ensures fix
- Top agent: 13% success rate — massive improvement headroom
### PACEbench Details
- 4 scenario types: A-CVE (single vuln), B-CVE (blended), C-CVE (chained/lateral movement), D-CVE (WAF-protected)
- D-CVE uses production WAFs: ModSecurity, Naxsi, Coraza
- No model has solved ANY D-CVE — frontier-pushing benchmark
- Pass@3 scoring: 3 attempts per scenario for statistical rigor
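A minimal sketch of pass@3 aggregation over hypothetical scenario outcomes; the real harness also tracks exploit chains:

```python
# Hypothetical scenario -> three boolean attempt outcomes.
attempts = {
    "A-CVE-01": [False, True, False],
    "C-CVE-02": [False, False, False],
    "D-CVE-01": [False, False, False],  # no model has solved any D-CVE
}

# A scenario passes if ANY of its three attempts succeeds.
pass_at_3 = {name: any(runs) for name, runs in attempts.items()}
score = sum(pass_at_3.values()) / len(pass_at_3)
print(f"pass@3: {score:.1%}")  # 33.3%
```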
### AutoPenBench Details
- EMNLP 2025 Industry Track publication
- Milestone-based scoring: not just pass/fail; intermediate progress is tracked (see the sketch after this list)
- MC (Command Milestones): Specific commands completed (3-16 per scenario)
- MS (Stage Milestones): Major stages reached (recon, vuln ID, exploit, post-exploit)
- Kali Linux attack containers: Tools run inside Kali, not host
- MIT license: Fully open source
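A minimal sketch of milestone-based partial credit, with hypothetical milestone names; AutoPenBench defines the real MC/MS sets per scenario:

```python
# Hypothetical milestones for one scenario: which were reached?
command_milestones = {"nmap_scan": True, "ssh_login": True, "read_flag": False}
stage_milestones = {"recon": True, "vuln_id": True, "exploit": False, "post_exploit": False}

mc_score = sum(command_milestones.values()) / len(command_milestones)
ms_score = sum(stage_milestones.values()) / len(stage_milestones)

# A failed run still earns credit for the ground it covered.
print(f"MC: {mc_score:.0%}, MS: {ms_score:.0%}")  # MC: 67%, MS: 50%
```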
### Wiz AI Cyber Model Arena Details
- Industry-standard leaderboard: 25 agent-model combinations tested
- 5 domains: Zero-day discovery, CVE detection, API security, web CTF, cloud security
- Current leader: Claude Opus 4.6 with Claude Code
- Deterministic scoring: No LLM-as-judge, dynamic validation prevents hardcoding
- Network-isolated Docker containers: Each challenge fully sandboxed
## LLM / AI Agent Security

### Existing LLM Benchmarks

| Benchmark | Levels | Source | Type | Skill |
|---|---|---|---|---|
| Gandalf | 8 + Agent Breaker | Lakera | Progressive prompt injection | /pentest-gandalf |
| HackMerlin | 7 | bgalek | Progressive prompt injection (strong defense) | /pentest-hackmerlin |
### NEW — LLM Security Benchmarks

| Benchmark | Probes/Scenarios | Source | Scoring | Skill |
|---|---|---|---|---|
| Garak | 200+ probes | NIST (Leon Derczynski) | Probe pass/fail, category profiles | /pentest-garak |
| CyberSecEval-2 | 200+ scenarios | Meta AI | ESR per vulnerability class | /pentest-cyberseceval |
| PINT Benchmark | Large taxonomy | Lakera | ESR per attack vector | /pentest-pint |
| DVAA | ~15 scenarios | OpenA2A | Exploitation success | /pentest-dvaa |
| Gandalf Agent Breaker | 8+ levels | Lakera | Level completion | /pentest-gandalf --agent-breaker |
### Garak Details
- NIST researcher created — de facto standard for automated LLM red-teaming
- 200+ probes across: prompt injection, jailbreak, data extraction, hallucination, toxicity, encoding (see the aggregation sketch after this list)
- Dual-mode: standalone Garak eval OR comparison with /test-llm findings
- Docker-based: Runs inside pentest-tools container
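A minimal sketch of probe pass/fail aggregation into category profiles, assuming a hypothetical `(probe, passed)` result list rather than Garak's actual report format:

```python
from collections import defaultdict

# Hypothetical parsed results: probe category and whether the target
# model resisted that probe (True = passed / resisted).
results = [
    ("promptinject", False),
    ("promptinject", True),
    ("encoding", True),
    ("jailbreak", False),
]

by_category: dict[str, list[bool]] = defaultdict(list)
for probe, passed in results:
    by_category[probe].append(passed)

for probe, outcomes in by_category.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{probe}: {rate:.0%} pass rate ({len(outcomes)} attempts)")
```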
### CyberSecEval-2 Details
- Meta AI comprehensive LLM security evaluation
- 8 vulnerability classes: Prompt injection (direct + indirect), jailbreak, code injection, data exfiltration, agent/tool abuse, plugin exploitation, multi-step social engineering, safety bypass
- ESR metric: Exploit Success Rate, the industry-standard measure (see the sketch after this list)
- 25-50% injection success documented as baseline
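A minimal sketch of per-class ESR computation with made-up counts; not Meta's harness:

```python
# Hypothetical class -> (successful exploits, total attempts).
trials = {
    "direct_prompt_injection": (12, 40),
    "indirect_prompt_injection": (19, 40),
    "code_injection": (4, 40),
}

for vuln_class, (hits, total) in trials.items():
    print(f"{vuln_class}: ESR = {hits / total:.1%}")
# The 25-50% injection baseline above corresponds to ESRs in this range.
```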
### PINT Benchmark Details
- Lakera (Gandalf creators) formal academic benchmark
- Taxonomy-mapped attacks: direct, indirect (RAG/documents), plugin-based, RAG-based
- Per-attack metadata: category, vector, target, severity, stealthiness, transferability
- Extends Gandalf/HackMerlin eval with structured evaluation
### DVAA Details
- First intentionally vulnerable AI agent platform
- MCP-specific attacks: Tool interface abuse, cross-agent trust exploitation
- OWASP Agentic AI Top 10 coverage
- Docker-based: Containerized vulnerable AI agent
## Mobile Application Security
All NEW — no mobile benchmarks existed before this integration.
| Benchmark | Challenges | Source | Platform | Scoring | Skill |
|---|---|---|---|---|---|
| OWASP MAS Crackmes | 4+ levels | OWASP | Android + iOS | Flag (secret extraction) | /pentest-crackmes |
| InjuredAndroid | 15+ | B3nac (bug bounty) | Android | Flag per challenge | /pentest-injuredandroid |
| DroidBench 2.0 | 120 | Uni Paderborn | Android | Binary (leak detected) | /pentest-droidbench |
| PIVAA | OWASP M1-M10 | High-Tech Bridge | Android | Answer-key | /pentest-pivaa |
| InsecureBankv2 | Multiple | Data Theorem / OWASP | Android + Server | Answer-key | /pentest-insecurebank |
### OWASP MAS Crackmes Details
- THE standard mobile reverse engineering benchmark
- Progressive difficulty: L1 (root detection bypass) → L4 (combined protections)
- Tests: Root detection bypass, anti-debugging, native code analysis, obfuscation handling
- Tools: Frida, Objection, jadx, apktool, adb
### InjuredAndroid Details
- Based on real bug bounty findings
- Categories: Hardcoded credentials, insecure data storage, exported activities, deep link abuse, WebView XSS, Firebase misconfiguration, broadcast receiver abuse, certificate pinning
- CTF-style with flags — automatable
### DroidBench 2.0 Details
- Standard SAST benchmark for Android analysis tools
- 120 test cases across 9 categories: activity communication, lifecycle, field sensitivity, emulator detection, ICC, reflection, threading, callbacks, general Java
- Automated by design — built for tool comparison
- Ground truth labels for each test case (leak/no-leak)
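A minimal sketch of binary leak scoring against the ground-truth labels, with hypothetical test-case names:

```python
# Ground truth: does this test case actually leak? Reported: did the
# agent flag a leak? (Hypothetical cases and verdicts.)
ground_truth = {"ActivityCommunication1": True, "Reflection3": True, "Callbacks2": False}
reported = {"ActivityCommunication1": True, "Reflection3": False, "Callbacks2": True}

tp = sum(ground_truth[t] and reported[t] for t in ground_truth)
fn = sum(ground_truth[t] and not reported[t] for t in ground_truth)
fp = sum(not ground_truth[t] and reported[t] for t in ground_truth)

print(f"TP={tp} FN={fn} FP={fp}")  # TP=1 FN=1 FP=1
```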
### PIVAA Details
- Created specifically as a scanner benchmark
- Maps directly to OWASP Mobile Top 10 2024 (M1-M10)
- Modern replacement for DIVA (Damn Insecure and Vulnerable App)
### InsecureBankv2 Details
- Official OWASP MASTG testing app (MASTG-APP-0010)
- Banking domain = high-value vulnerability types
- Hybrid: Docker server component + Android client APK
- Categories: Auth bypass, weak crypto, insecure storage, WebView, broadcast abuse, content provider leaks
## Code Security / Secure Code Review

### Existing Code Review Benchmarks

| Benchmark | Type | Skill |
|---|---|---|
| WebGoat | Java/Spring code review only | /code-review |
### NEW — Code Security Benchmarks

| Benchmark | Test Cases | Source | Venue | Scoring | Skill |
|---|---|---|---|---|---|
| SusVibes | 200 tasks, 77 CWEs | CMU/Columbia (Lei Li Lab) | ICLR 2026 submission | Dual (functional + security) | /review-susvibes |
| AICGSecEval / A.S.E | 29 CWEs, repo-level | Tencent Security | 2025 | Three-gate (compile + test + security) | /review-aicgseceval |
| OpenSSF CVE Benchmark | ~200 JS/TS CVEs | Open Source Security Foundation | OpenSSF | Precision/Recall/F1, FP rate | /review-ossf |
| OWASP Benchmark | 21,041 test cases | OWASP Foundation | OWASP | ROC scorecard per category | /review-owasp-benchmark |
| Juliet Test Suite | Thousands | NIST SAMATE | CC0 Public Domain | Per-CWE detection rate | /review-juliet |
### SusVibes Details
- Tests SECURE CODE GENERATION (not just vuln detection)
- 200 tasks from 108 real OSS projects
- 77 CWEs covered
- Dual scoring: functional correctness AND security correctness (see the sketch after this list)
- Shocking baseline: Claude 4 Sonnet 47.5% functional, only 8.25% secure
- Real repo-level tasks: avg 160K LOC, 867 files, 170 lines to generate
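A minimal sketch of dual scoring over hypothetical task results; a task counts only if it is functionally correct AND secure:

```python
# Hypothetical per-task outcomes.
tasks = [
    {"functional": True, "secure": False},   # works, but vulnerable
    {"functional": True, "secure": True},
    {"functional": False, "secure": True},   # secure, but broken
    {"functional": True, "secure": False},
]

functional = sum(t["functional"] for t in tasks) / len(tasks)
dual = sum(t["functional"] and t["secure"] for t in tasks) / len(tasks)
print(f"functional: {functional:.1%}, functional+secure: {dual:.1%}")
# 75.0% vs 25.0%, mirroring the large gap in the baseline above.
```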
### AICGSecEval Details

- Most rigorous code review benchmark available
- Three-gate repair validation (see the sketch after this list):
  - Gate 1: generated fix compiles
  - Gate 2: generated fix passes the test suite
  - Gate 3: generated fix passes security verification (Semgrep/CodeQL/PoV)
- Cross-file dependency analysis required
- 29 CWEs with containerized build environments per CVE
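A minimal sketch of the three-gate short-circuit, with simulated gate outcomes; the real benchmark runs compilers, test suites, and Semgrep/CodeQL/PoV checks inside per-CVE containers:

```python
def three_gate(fix: dict) -> bool:
    # Gates run in order; failing any gate short-circuits the rest,
    # so a fix that does not compile never reaches the security check.
    gates = ("compiles", "tests_pass", "security_pass")
    return all(fix[g] for g in gates)

# Hypothetical candidate fix: builds and passes tests, but Semgrep/CodeQL
# still flags the vulnerability.
candidate = {"compiles": True, "tests_pass": True, "security_pass": False}
print(three_gate(candidate))  # False
```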
### OpenSSF CVE Benchmark Details
- OpenSSF = highest credibility for open-source security
- ~200 JavaScript/TypeScript CVEs with vulnerable/patched code pairs
- Explicit FP rate measurement — critical for real-world code review (see the sketch after this list)
- Built-in comparison against ESLint, CodeQL, nodejsscan
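A minimal sketch of precision/recall/F1 with made-up counts; not the OpenSSF harness:

```python
# Hypothetical tallies: CVEs flagged correctly, safe code flagged
# wrongly, and CVEs missed.
tp, fp, fn = 140, 25, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
# The patched half of each vulnerable/patched pair is what makes the
# FP rate directly measurable.
```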
### OWASP Benchmark Details
- THE original SAST/DAST benchmark — OWASP Foundation
- 21,041 test cases in a Java web application = massive statistical significance
- ROC curves per vulnerability category (see the scorecard sketch after this list)
- Standard reference used by all commercial SAST vendors
- Categories: SQLi, XSS, CMDi, Path Traversal, LDAP Injection, Weak Crypto, Weak Hashing, Weak Random, Trust Boundary, XPath Injection, Secure Cookie
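A minimal sketch of a scorecard entry, with made-up counts; the scorecard plots TPR against FPR per category, and the headline number is commonly reported as TPR minus FPR:

```python
# Hypothetical category -> (TP, FN, FP, TN) over the benchmark's
# true-vulnerable and deliberately-safe test cases.
categories = {
    "sqli": (220, 52, 30, 240),
    "xss": (180, 66, 90, 120),
}

for name, (tp, fn, fp, tn) in categories.items():
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(f"{name}: TPR={tpr:.2f} FPR={fpr:.2f} score={tpr - fpr:.2f}")
```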
### Juliet Test Suite Details
- NIST-maintained, CC0 license (unrestricted commercial use)
- Per-CWE calibration: identifies which CWEs the agent handles well vs poorly (see the sketch after this list)
- Known vulnerable + known safe pairs enable precise TP/FP measurement
- 100+ CWEs across C/C++, Java, C#
- Standard reference used by ALL SAST tool vendors
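A minimal sketch of per-CWE calibration over Juliet-style paired vulnerable/safe cases, with made-up counts:

```python
# Hypothetical CWE -> (detected on vulnerable cases, total vulnerable,
#                      false alarms on safe twins, total safe).
per_cwe = {
    "CWE-89": (47, 50, 2, 50),
    "CWE-78": (31, 50, 9, 50),
}

for cwe, (hits, vuln_total, alarms, safe_total) in per_cwe.items():
    print(f"{cwe}: detection {hits / vuln_total:.0%}, FP rate {alarms / safe_total:.0%}")
# The per-CWE view separates strong areas (SQLi here) from weak ones
# (OS command injection here).
```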
## Benchmark Commands Reference

### Web Application

```bash
# Local Docker Labs
/pentest http://vulnhr.test:7331 --eval         # VulnHR
/pentest http://juiceshop.test:3000 --eval      # Juice Shop
/pentest http://localhost:8080 --eval           # SuperSecureBank / DVWA / DVRA

# CTF Benchmarks
/pentest-xbow XBEN-001                          # Single XBOW benchmark
/pentest-xbow --batch 1                         # First 10 XBOW benchmarks
/pentest-hackbench EV-01                        # Single HackBench challenge
/pentest-hackbench --all                        # All 16 HackBench challenges

# NEW — Academic Web Benchmarks
/pentest-bountybench BB-001                     # Single BountyBench bounty
/pentest-bountybench --task exploit --all       # All bounties, exploit mode
/pentest-bountybench --task patch --all         # All bounties, patch mode
/pentest-cvebench CVE-2024-XXXXX                # Single CVE-Bench challenge
/pentest-cvebench --all                         # All 40 CVEs
/pentest-pacebench --type d-cve                 # WAF bypass scenarios only
/pentest-autopenbench --category web            # Web category only
/pentest-wizarena --domain cloud                # Cloud security challenges

# External Platforms
/pentest-portswigger --category sql-injection   # PortSwigger category
/pentest-rootme --category web-server           # Root-Me category
/pentest-hackinghub HH-WEB-001                  # Single HackingHub hub
```
### LLM / AI Agent

```bash
# Existing
/pentest-gandalf                                # Standard 8 levels
/pentest-hackmerlin                             # 7 levels with strong defense

# NEW
/pentest-gandalf --agent-breaker                # Agent Breaker mode
/pentest-garak https://target.com/api/chat      # Garak 200+ probes
/pentest-garak https://target.com/api/chat --compare-test-llm  # Compare with /test-llm
/pentest-cyberseceval https://target.com/api    # CyberSecEval-2
/pentest-pint https://target.com/api/chat       # PINT taxonomy eval
/pentest-dvaa --all                             # All DVAA scenarios
```
### Mobile

```bash
# ALL NEW
/pentest-crackmes --level 1                     # OWASP Crackme Level 1
/pentest-crackmes --all                         # All levels
/pentest-injuredandroid --all                   # All InjuredAndroid challenges
/pentest-droidbench --all                       # All 120 DroidBench test cases
/pentest-pivaa --all                            # PIVAA full scan
/pentest-insecurebank --all                     # InsecureBankv2 full assessment
```
### Code Security

```bash
# ALL NEW
/review-susvibes --batch 1                      # First 10 SusVibes tasks
/review-susvibes --cwe CWE-79                   # All XSS tasks
/review-aicgseceval --axis repair               # Repair evaluation only
/review-ossf --tool-compare codeql              # Compare with CodeQL
/review-owasp-benchmark --category sqli         # SQL injection category
/review-owasp-benchmark --scorecard             # Generate full scorecard
/review-juliet --language java --cwe CWE-89     # Java SQLi calibration
```
### Orchestrators

```bash
/pentest-all-labs                               # Run ALL benchmarks sequentially
/pentest-all-labs --only-web                    # Web benchmarks only
/pentest-all-labs --only-llm                    # LLM benchmarks only
/pentest-all-labs --only-mobile                 # Mobile benchmarks only
/pentest-all-labs --only-code-review            # Code review benchmarks only
/pentest-neo --all                              # All 3 VibeApps (vs Neo baseline)
/labs-eval                                      # Parallel eval across all labs
```
## AISI-Endorsed Benchmarks
Three benchmarks are endorsed/used by government AI Safety Institutes:
| Benchmark | AISI Usage | Venue |
|---|---|---|
| BountyBench | US AISI + UK AISI model evaluation | NeurIPS 2025 |
| CVE-Bench v2.0 | US AISI contributed to development | ICML 2025 |
| Cybench (reference) | US AISI + UK AISI pre-deployment testing | Stanford/CMU |
## Scoring Methods
| Method | Description | Used By |
|---|---|---|
| Answer-key | TP/FN/FP against curated vulnerability list | VulnHR, Juice Shop, SSB, AltoroMutual, DVWA, DVRA |
| Flag-based | Extract specific flag string | XBOW, HackBench, HackingHub, BountyBench, CVE-Bench |
| Level completion | Progressive challenge levels | Gandalf, HackMerlin |
| Lab completion | "Congratulations" detection | PortSwigger |
| Points-based | Variable points per challenge | Root-Me, HackBench |
| Finding-based | Compare findings vs ground truth | Neo/VibeApps |
| Dollar-impact | Real-world bounty value weighting | BountyBench |
| Pass@3 | 3 attempts, pass if any succeed | PACEbench, Wiz Arena |
| Milestone-based | Track intermediate progress (MC + MS) | AutoPenBench |
| Probe pass/fail | Automated probe success rate | Garak, PINT |
| ESR | Exploit Success Rate per class | CyberSecEval-2 |
| Dual (functional + security) | Must be correct AND secure | SusVibes |
| Three-gate | Compile + tests + security check | AICGSecEval |
| ROC scorecard | TPR vs FPR per category | OWASP Benchmark |
| Per-CWE | Detection rate per CWE ID | Juliet |
| Precision/Recall/F1 | Standard IR metrics | OpenSSF |