
Benchmark Catalog

Comprehensive catalog of all 35 benchmarks integrated into the BeDefended automated pentesting suite, organized by domain. Each benchmark provides controlled, repeatable evaluation of specific security testing capabilities.


Summary by Domain

Domain | Benchmarks | Challenges | Scoring Types | Key Venues
Web Application | 17 | ~725 | Flag, answer-key, dollar-impact, milestone, pass@3 | NeurIPS, ICML, EMNLP
LLM / AI Agent | 7 | 400+ probes/scenarios | Probe, level, ESR, flag | NVIDIA, Meta AI, Lakera
Mobile | 5 | ~160 | Flag, binary, answer-key | OWASP, Academic
Code Security | 6 | ~21,600 | Dual, three-gate, F1, ROC, per-CWE | ICLR, OWASP, NIST, OpenSSF, Tencent
TOTAL | 35 | ~22,885 | 16 distinct scoring methods | 3 AISI-endorsed

Web Application Security

Local Docker Labs (Answer-Key Based)

Lab | Tech Stack | Vulns | Skill | Status
VulnHR | Laravel/PHP, MySQL, LDAP | 81 | /pentest | 100% (87/87)
Juice Shop | Express.js, Angular, SQLite | 55 | /pentest | Not evaluated
SuperSecureBank | .NET 8, SQL Server | 37 | /pentest | 91.9% (31/37)
AltoroMutual | Spring Boot, React, PostgreSQL | 29 | /pentest | 40.5%
DVWA | PHP, MariaDB | 28 | /pentest | Not evaluated
DVRA | FastAPI, MongoDB | 12 | /pentest | Not evaluated

CTF Benchmarks (Flag-Based)

Benchmark | Challenges | Source | Scoring | Skill
XBOW | 104 | ProjectDiscovery | Flag (FLAG{...}) | /pentest-xbow
XBOW Blind | 104 | ProjectDiscovery (no hints) | Flag | /pentest-xbow-nohint
HackBench | 16 | ElectrovoltSec | Flag (ev{...}) + points | /pentest-hackbench
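
Flag-based scoring reduces to string extraction plus an exact match against the known answer. A minimal Python sketch covering the two flag formats above (the regex and function names are illustrative, not part of any benchmark harness):

import re

# Accept both XBOW-style FLAG{...} and HackBench-style ev{...} flags.
FLAG_PATTERN = re.compile(r"(?:FLAG|ev)\{[^}]+\}")

def extract_flags(agent_output: str) -> set[str]:
    """Pull every candidate flag out of the agent's transcript."""
    return set(FLAG_PATTERN.findall(agent_output))

def challenge_solved(agent_output: str, expected_flag: str) -> bool:
    """A challenge counts only on an exact match with the expected flag."""
    return expected_flag in extract_flags(agent_output)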

AI-Generated App Benchmarks

Benchmark | Apps | Vulns | Source | Scoring | Skill
Neo / Vibe-Coding | 3 (VaultBank, MedPortal, ClaimFlow) | 74 | ProjectDiscovery | Finding-based (vs Neo baseline) | /pentest-neo

External CTF Platforms

Platform | Challenges | Source | Scoring | Skill
PortSwigger Academy | 250+ | PortSwigger | Lab completion | /pentest-portswigger
Root-Me | Variable | root-me.org | Points | /pentest-rootme
HackingHub | 67 hubs | HackingHub | Flag + multi-flag | /pentest-hackinghub

NEW — Academic & Industry Web Benchmarks

Benchmark | Challenges | Source | Venue | Scoring | Skill
BountyBench | 120 tasks (40 bounties × 3) | Stanford/CRFM | NeurIPS 2025 | Flag + dollar-impact ($10–$30,485) | /pentest-bountybench
CVE-Bench v2.0 | 40 CVEs | UIUC (Daniel Kang Lab) | ICML 2025 Spotlight | Flag + ABC validation | /pentest-cvebench
PACEbench | 32 scenarios | Liu et al. | arXiv Oct 2025 | Pass@3 + chain tracking | /pentest-pacebench
AutoPenBench | 33+ | Politecnico di Torino | EMNLP 2025 Industry | Milestone-based (MC + MS) | /pentest-autopenbench
Wiz AI Cyber Arena | 257 (web+API+cloud+CVE+zeroday) | Wiz Research | Industry leaderboard | Pass@3, deterministic | /pentest-wizarena

BountyBench Details

  • AISI-endorsed: Used by US AI Safety Institute & UK AI Safety Institute for model evaluation
  • Dollar-impact scoring: Each bounty has a real-world dollar value ($10 to $30,485); see the aggregation sketch after this list
  • Three task types: Detect (find the vulnerability), Exploit (prove exploitation), Patch (fix the vulnerability)
  • 25 real OSS targets: Lunary, LibreChat, MLflow, Django, FastAPI, curl, and more
  • 27 CWEs covering 9 of the OWASP Top 10
  • Reference scores: Claude Code 57.5% exploit, 87.5% patch
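
Because every bounty carries a dollar value, results can be aggregated as value captured rather than a raw solve rate. A sketch of that aggregation (the schema is an assumption; only the $10 to $30,485 range comes from the benchmark):

from dataclasses import dataclass

@dataclass
class BountyResult:
    bounty_id: str
    value_usd: float   # real-world payout, $10 to $30,485 per bounty
    solved: bool

def dollar_impact(results: list[BountyResult]) -> tuple[float, float]:
    """Return (dollars captured, share of total bounty value)."""
    total = sum(r.value_usd for r in results)
    captured = sum(r.value_usd for r in results if r.solved)
    return captured, captured / total if total else 0.0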

CVE-Bench v2.0 Details

  • ICML 2025 Spotlight paper with rigorous ABC (Agentic Benchmark Checklist) validation
  • US AI Safety Institute contributed to development
  • 40 critical-severity CVEs in containerized sandboxes
  • ABC validation: Pre-patch exploit → verifier confirms → post-patch check ensures the fix (flow sketched below)
  • Top agent: 13% success rate, leaving massive improvement headroom
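
The pre-patch/post-patch flow reads as three sequential checks. A sketch of that control flow (the three callables are hypothetical stand-ins for CVE-Bench's actual harness):

def abc_validate(run_exploit, verify_impact, apply_patch) -> bool:
    """An exploit counts only if it works pre-patch, is confirmed by the
    verifier, and stops working once the official fix is applied."""
    if not run_exploit():       # exploit must succeed on the vulnerable target
        return False
    if not verify_impact():     # independent verifier confirms real impact
        return False
    apply_patch()               # after applying the official fix...
    return not run_exploit()    # ...the same exploit must fail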

PACEbench Details

  • 4 scenario types: A-CVE (single vuln), B-CVE (blended), C-CVE (chained/lateral movement), D-CVE (WAF-protected)
  • D-CVE uses production WAFs: ModSecurity, Naxsi, Coraza
  • No model has solved ANY D-CVE, making this a frontier-pushing benchmark
  • Pass@3 scoring: 3 attempts per scenario for statistical rigor (estimator sketched below)
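
Pass@3 is the standard pass@k estimator at k=3: with c successes over n attempts, pass@k = 1 - C(n-c, k)/C(n, k). With exactly three attempts per scenario it reduces to "any attempt succeeded". A sketch of the general formula (not PACEbench-specific code):

from math import comb

def pass_at_k(n: int, c: int, k: int = 3) -> float:
    """Probability that at least one of k sampled attempts succeeds,
    given c observed successes across n total attempts."""
    if n - c < k:   # fewer failures than draws: success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k == 3: pass_at_k(3, 0) == 0.0; any success yields 1.0.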

AutoPenBench Details

  • EMNLP 2025 Industry Track publication
  • Milestone-based scoring: Not just pass/fail; intermediate progress is tracked (see the sketch after this list)
  • MC (Command Milestones): Specific commands completed (3-16 per scenario)
  • MS (Stage Milestones): Major stages reached (recon, vuln ID, exploit, post-exploit)
  • Kali Linux attack containers: Tools run inside Kali, not host
  • MIT license: Fully open source
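
AutoPenBench reports command milestones (MC) and stage milestones (MS) instead of a bare pass/fail. A minimal sketch of tabulating that progress (the aggregation shape is an assumption; only the MC/MS distinction comes from the benchmark):

def milestone_progress(commands_hit: int, commands_total: int,
                       stages_hit: int, stages_total: int) -> dict:
    """Report MC and MS completion separately, plus overall task success."""
    return {
        "mc_rate": commands_hit / commands_total,  # 3-16 MCs per scenario
        "ms_rate": stages_hit / stages_total,      # recon, vuln ID, exploit, post-exploit
        "solved": stages_hit == stages_total,
    }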

Wiz AI Cyber Model Arena Details

  • Industry-standard leaderboard: 25 agent-model combinations tested
  • 5 domains: Zero-day discovery, CVE detection, API security, web CTF, cloud security
  • Current leader: Claude Opus 4.6 with Claude Code
  • Deterministic scoring: No LLM-as-judge; dynamic validation prevents hardcoded answers
  • Network-isolated Docker containers: Each challenge fully sandboxed

LLM / AI Agent Security

Existing LLM Benchmarks

Benchmark | Levels | Source | Type | Skill
Gandalf | 8 + Agent Breaker | Lakera | Progressive prompt injection | /pentest-gandalf
HackMerlin | 7 | bgalek | Progressive prompt injection (strong defense) | /pentest-hackmerlin

NEW — LLM Security Benchmarks

Benchmark | Probes/Scenarios | Source | Scoring | Skill
Garak | 200+ probes | NVIDIA (Leon Derczynski) | Probe pass/fail, category profiles | /pentest-garak
CyberSecEval-2 | 200+ scenarios | Meta AI | ESR per vulnerability class | /pentest-cyberseceval
PINT Benchmark | Large taxonomy | Lakera | ESR per attack vector | /pentest-pint
DVAA | ~15 scenarios | OpenA2A | Exploitation success | /pentest-dvaa
Gandalf Agent Breaker | 8+ levels | Lakera | Level completion | /pentest-gandalf --agent-breaker

Garak Details

  • Created by Leon Derczynski (NVIDIA): the de facto standard for automated LLM red-teaming
  • 200+ probes across: prompt injection, jailbreak, data extraction, hallucination, toxicity, encoding
  • Dual-mode: Standalone Garak eval OR comparison with /test-llm findings
  • Docker-based: Runs inside pentest-tools container

CyberSecEval-2 Details

  • Meta AI comprehensive LLM security evaluation
  • 8 vulnerability classes: Prompt injection (direct + indirect), jailbreak, code injection, data exfiltration, agent/tool abuse, plugin exploitation, multi-step social engineering, safety bypass
  • ESR metric: Exploit Success Rate, the industry-standard measure (computed per class as sketched below)
  • 25-50% injection success documented as baseline
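
ESR is the fraction of attack scenarios within a class that succeed, reported per vulnerability class. A sketch of the aggregation (the input schema is illustrative):

from collections import defaultdict

def exploit_success_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """results: (vulnerability_class, exploited?) pairs.
    Returns ESR per class, e.g. {'prompt_injection_direct': 0.31, ...}."""
    attempts, successes = defaultdict(int), defaultdict(int)
    for vuln_class, exploited in results:
        attempts[vuln_class] += 1
        successes[vuln_class] += int(exploited)
    return {c: successes[c] / attempts[c] for c in attempts}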

PINT Benchmark Details

  • Formal benchmark from Lakera, the creators of Gandalf
  • Taxonomy-mapped attacks: direct, indirect (RAG/documents), plugin-based, RAG-based
  • Per-attack metadata: category, vector, target, severity, stealthiness, transferability
  • Extends the Gandalf/HackMerlin evals with structured, taxonomy-driven evaluation

DVAA Details

  • First intentionally vulnerable AI agent platform
  • MCP-specific attacks: Tool interface abuse, cross-agent trust exploitation
  • OWASP Agentic AI Top 10 coverage
  • Docker-based: Containerized vulnerable AI agent

Mobile Application Security

All NEW — no mobile benchmarks existed before this integration.

Benchmark | Challenges | Source | Platform | Scoring | Skill
OWASP MAS Crackmes | 4+ levels | OWASP | Android + iOS | Flag (secret extraction) | /pentest-crackmes
InjuredAndroid | 15+ | B3nac (bug bounty) | Android | Flag per challenge | /pentest-injuredandroid
DroidBench 2.0 | 120 | Uni Paderborn | Android | Binary (leak detected) | /pentest-droidbench
PIVAA | OWASP M1-M10 | High-Tech Bridge | Android | Answer-key | /pentest-pivaa
InsecureBankv2 | Multiple | Data Theorem / OWASP | Android + Server | Answer-key | /pentest-insecurebank

OWASP MAS Crackmes Details

  • THE standard mobile reverse engineering benchmark
  • Progressive difficulty: L1 (root detection bypass) → L4 (combined protections)
  • Tests: Root detection bypass, anti-debugging, native code analysis, obfuscation handling
  • Tools: Frida, Objection, jadx, apktool, adb

InjuredAndroid Details

  • Based on real bug bounty findings
  • Categories: Hardcoded credentials, insecure data storage, exported activities, deep link abuse, WebView XSS, Firebase misconfiguration, broadcast receiver abuse, certificate pinning
  • CTF-style with flags — automatable

DroidBench 2.0 Details

  • Standard SAST benchmark for Android analysis tools
  • 120 test cases across 9 categories: activity communication, lifecycle, field sensitivity, emulator detection, ICC, reflection, threading, callbacks, general Java
  • Automated by design — built for tool comparison
  • Ground truth labels for each test case (leak/no-leak); scoring sketched below
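
Binary leak/no-leak ground truth makes scoring a confusion-matrix exercise; the no-leak cases exist specifically to expose false positives. A sketch, assuming one (truth, reported) pair per test case (the harness shape is illustrative):

def droidbench_score(cases: list[tuple[bool, bool]]) -> dict[str, float]:
    """cases: (ground_truth_leak, tool_reported_leak) per test case."""
    tp = sum(1 for truth, found in cases if truth and found)
    fp = sum(1 for truth, found in cases if not truth and found)
    fn = sum(1 for truth, found in cases if truth and not found)
    tn = len(cases) - tp - fp - fn
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "recall": tp / (tp + fn) if tp + fn else 0.0}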

PIVAA Details

  • Created specifically as a scanner benchmark
  • Maps directly to OWASP Mobile Top 10 2024 (M1-M10)
  • Modern replacement for DIVA (Damn Insecure and Vulnerable App)

InsecureBankv2 Details

  • Official OWASP MASTG testing app (MASTG-APP-0010)
  • Banking domain = high-value vulnerability types
  • Hybrid: Docker server component + Android client APK
  • Categories: Auth bypass, weak crypto, insecure storage, WebView, broadcast abuse, content provider leaks

Code Security / Secure Code Review

Existing Code Review Benchmarks

Benchmark | Type | Skill
WebGoat | Java/Spring code review only | /code-review

NEW — Code Security Benchmarks

Benchmark | Test Cases | Source | Venue | Scoring | Skill
SusVibes | 200 tasks, 77 CWEs | CMU/Columbia (Lei Li Lab) | ICLR 2026 submission | Dual (functional + security) | /review-susvibes
AICGSecEval / A.S.E | 29 CWEs, repo-level | Tencent Security | 2025 | Three-gate (compile + test + security) | /review-aicgseceval
OpenSSF CVE Benchmark | ~200 JS/TS CVEs | Open Source Security Foundation | OpenSSF | Precision/Recall/F1, FP rate | /review-ossf
OWASP Benchmark | 21,041 test cases | OWASP Foundation | OWASP | ROC scorecard per category | /review-owasp-benchmark
Juliet Test Suite | Thousands | NIST SAMATE | CC0 Public Domain | Per-CWE detection rate | /review-juliet

SusVibes Details

  • Tests SECURE CODE GENERATION (not just vuln detection)
  • 200 tasks from 108 real OSS projects
  • 77 CWEs covered
  • Dual scoring: Functional correctness AND security correctness (conjunction sketched after this list)
  • Shocking baseline: Claude 4 Sonnet 47.5% functional, only 8.25% secure
  • Real repo-level tasks: avg 160K LOC, 867 files, 170 lines to generate
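
Dual scoring is a strict conjunction, which is why the secure rate (8.25%) sits so far below the functional rate (47.5%): generated code must pass its tests AND avoid the target vulnerability. A sketch (field names are assumptions, not the SusVibes schema):

def dual_score(tasks: list[dict]) -> dict[str, float]:
    """Each task records two independent checks: functional tests pass,
    and the security check finds no vulnerability."""
    n = len(tasks)
    functional = sum(t["passes_tests"] for t in tasks) / n
    secure = sum(t["passes_tests"] and t["no_vulnerability"] for t in tasks) / n
    return {"functional_rate": functional, "secure_rate": secure}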

AICGSecEval Details

  • Most rigorous code review benchmark available
  • Three-gate repair validation (pipeline sketched after this list):
    • Gate 1: Generated fix compiles
    • Gate 2: Generated fix passes test suite
    • Gate 3: Generated fix passes security verification (Semgrep/CodeQL/PoV)
  • Cross-file dependency analysis required
  • 29 CWEs with containerized build environments per CVE
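
The gates are sequential: a fix that fails to compile never reaches the security check. A sketch of the pipeline (the gate callables stand in for the compile, test-suite, and Semgrep/CodeQL/PoV stages; all names are illustrative):

def three_gate_validate(fix, compiles, tests_pass, security_clean) -> str:
    """Run a candidate fix through compile -> tests -> security checks,
    stopping at the first failed gate."""
    if not compiles(fix):
        return "failed: gate 1 (compile)"
    if not tests_pass(fix):
        return "failed: gate 2 (test suite)"
    if not security_clean(fix):   # Semgrep / CodeQL / proof-of-vuln replay
        return "failed: gate 3 (security verification)"
    return "valid repair"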

OpenSSF CVE Benchmark Details

  • OpenSSF = highest credibility for open-source security
  • ~200 JavaScript/TypeScript CVEs with vulnerable/patched code pairs
  • Explicit FP rate measurement, critical for real-world code review (metrics sketched below)
  • Built-in comparison against ESLint, CodeQL, nodejsscan
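
Vulnerable/patched pairs supply known negatives, so precision and FP rate are measured directly rather than estimated: patched code flagged as vulnerable is a false positive by construction. A sketch of the standard metrics:

def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision/recall/F1 plus the explicit false-positive rate."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "fp_rate": fp_rate}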

OWASP Benchmark Details

  • THE original SAST/DAST benchmark, maintained by the OWASP Foundation
  • 21,041 test cases in a Java web application = massive statistical significance
  • ROC curves per vulnerability category (scorecard computation sketched after this list)
  • Standard reference used by all commercial SAST vendors
  • Categories: SQLi, XSS, CMDi, Path Traversal, LDAP Injection, Weak Crypto, Weak Hashing, Weak Random, Trust Boundary, XPath Injection, Secure Cookie
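
The scorecard plots TPR against FPR per category, and the headline Benchmark score is the distance above the random-guess diagonal, i.e. TPR - FPR. A per-category sketch (the input shape is an assumption; each category contains both truly vulnerable and safe test cases by construction):

def category_scorecard(results: dict[str, dict[str, int]]) -> dict[str, float]:
    """results maps category (e.g. 'sqli') to its confusion counts.
    Score = TPR - FPR: 1.0 is perfect, 0.0 is no better than guessing."""
    scores = {}
    for category, r in results.items():
        tpr = r["tp"] / (r["tp"] + r["fn"])
        fpr = r["fp"] / (r["fp"] + r["tn"])
        scores[category] = tpr - fpr
    return scores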

Juliet Test Suite Details

  • NIST-maintained, CC0 license (unrestricted commercial use)
  • Per-CWE calibration: identifies which CWEs the agent handles well vs. poorly (sketched after this list)
  • Known vulnerable + known safe pairs enable precise TP/FP measurement
  • 100+ CWEs across C/C++, Java, C#
  • Standard reference used by ALL SAST tool vendors
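
Juliet pairs vulnerable ("bad") variants with safe ("good") ones, so detection rate and over-reporting can be separated per CWE. A calibration sketch (the case schema is an assumption):

from collections import defaultdict

def per_cwe_calibration(cases: list[dict]) -> dict[str, dict[str, float]]:
    """cases: {'cwe': 'CWE-89', 'is_bad': bool, 'flagged': bool}.
    Returns detection rate on bad variants, FP rate on good ones."""
    buckets = defaultdict(lambda: {"bad": 0, "bad_hit": 0,
                                   "good": 0, "good_hit": 0})
    for c in cases:
        b = buckets[c["cwe"]]
        kind = "bad" if c["is_bad"] else "good"
        b[kind] += 1
        b[kind + "_hit"] += int(c["flagged"])
    return {cwe: {"detection": b["bad_hit"] / b["bad"] if b["bad"] else 0.0,
                  "fp_rate": b["good_hit"] / b["good"] if b["good"] else 0.0}
            for cwe, b in buckets.items()}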

Benchmark Commands Reference

Web Application

# Local Docker Labs
/pentest http://vulnhr.test:7331 --eval        # VulnHR
/pentest http://juiceshop.test:3000 --eval     # Juice Shop
/pentest http://localhost:8080 --eval          # SuperSecureBank / DVWA / DVRA

# CTF Benchmarks
/pentest-xbow XBEN-001                         # Single XBOW benchmark
/pentest-xbow --batch 1                        # First 10 XBOW benchmarks
/pentest-hackbench EV-01                       # Single HackBench challenge
/pentest-hackbench --all                       # All 16 HackBench challenges

# NEW — Academic Web Benchmarks
/pentest-bountybench BB-001                    # Single BountyBench bounty
/pentest-bountybench --task exploit --all      # All bounties, exploit mode
/pentest-bountybench --task patch --all        # All bounties, patch mode
/pentest-cvebench CVE-2024-XXXXX              # Single CVE-Bench challenge
/pentest-cvebench --all                        # All 40 CVEs
/pentest-pacebench --type d-cve                # WAF bypass scenarios only
/pentest-autopenbench --category web           # Web category only
/pentest-wizarena --domain cloud               # Cloud security challenges

# External Platforms
/pentest-portswigger --category sql-injection  # PortSwigger category
/pentest-rootme --category web-server          # Root-Me category
/pentest-hackinghub HH-WEB-001                 # Single HackingHub hub

LLM / AI Agent

# Existing
/pentest-gandalf                               # Standard 8 levels
/pentest-hackmerlin                            # 7 levels with strong defense

# NEW
/pentest-gandalf --agent-breaker               # Agent Breaker mode
/pentest-garak https://target.com/api/chat     # Garak 200+ probes
/pentest-garak https://target.com/api/chat --compare-test-llm  # Compare with /test-llm
/pentest-cyberseceval https://target.com/api   # CyberSecEval-2
/pentest-pint https://target.com/api/chat      # PINT taxonomy eval
/pentest-dvaa --all                            # All DVAA scenarios

Mobile

# ALL NEW
/pentest-crackmes --level 1                    # OWASP Crackme Level 1
/pentest-crackmes --all                        # All levels
/pentest-injuredandroid --all                  # All InjuredAndroid challenges
/pentest-droidbench --all                      # All 120 DroidBench test cases
/pentest-pivaa --all                           # PIVAA full scan
/pentest-insecurebank --all                    # InsecureBankv2 full assessment

Code Security

# ALL NEW
/review-susvibes --batch 1                     # First 10 SusVibes tasks
/review-susvibes --cwe CWE-79                  # All XSS tasks
/review-aicgseceval --axis repair              # Repair evaluation only
/review-ossf --tool-compare codeql             # Compare with CodeQL
/review-owasp-benchmark --category sqli        # SQL injection category
/review-owasp-benchmark --scorecard            # Generate full scorecard
/review-juliet --language java --cwe CWE-89    # Java SQLi calibration

Orchestrators

/pentest-all-labs                              # Run ALL benchmarks sequentially
/pentest-all-labs --only-web                   # Web benchmarks only
/pentest-all-labs --only-llm                   # LLM benchmarks only
/pentest-all-labs --only-mobile                # Mobile benchmarks only
/pentest-all-labs --only-code-review           # Code review benchmarks only
/pentest-neo --all                             # All 3 VibeApps (vs Neo baseline)
/labs-eval                                     # Parallel eval across all labs

AISI-Endorsed Benchmarks

Three benchmarks are endorsed/used by government AI Safety Institutes:

Benchmark | AISI Usage | Venue
BountyBench | US AISI + UK AISI model evaluation | NeurIPS 2025
CVE-Bench v2.0 | US AISI contributed to development | ICML 2025
Cybench (reference) | US AISI + UK AISI pre-deployment testing | Stanford/CMU

Scoring Methods

Method | Description | Used By
Answer-key | TP/FN/FP against curated vulnerability list | VulnHR, Juice Shop, SSB, AltoroMutual, DVWA, DVRA
Flag-based | Extract specific flag string | XBOW, HackBench, HackingHub, BountyBench, CVE-Bench
Level completion | Progressive challenge levels | Gandalf, HackMerlin
Lab completion | "Congratulations" detection | PortSwigger
Points-based | Variable points per challenge | Root-Me, HackBench
Finding-based | Compare findings vs ground truth | Neo/VibeApps
Dollar-impact | Real-world bounty value weighting | BountyBench
Pass@3 | 3 attempts, pass if any succeed | PACEbench, Wiz Arena
Milestone-based | Track intermediate progress (MC + MS) | AutoPenBench
Probe pass/fail | Automated probe success rate | Garak, PINT
ESR | Exploit Success Rate per class | CyberSecEval-2
Dual (functional + security) | Must be correct AND secure | SusVibes
Three-gate | Compile + tests + security check | AICGSecEval
ROC scorecard | TPR vs FPR per category | OWASP Benchmark
Per-CWE | Detection rate per CWE ID | Juliet
Precision/Recall/F1 | Standard IR metrics | OpenSSF