Skip to content

Lab Scoring

Overview

lab-scorer.py compares pentest findings against a lab's answer key to produce objective detection metrics. It scores each finding as a true positive (TP), false negative (FN), or false positive (FP), with per-category, per-difficulty, and per-skill breakdowns.

Usage

# Basic scoring
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331

# Save to history
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save

# Generate HTML report
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save --html

# Full scoring with gap analysis narrative
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save --html --narrative

# Include execution metrics
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save --html --narrative \
    --duration 9720 --tokens-in 1200000 --tokens-out 340000

How Findings Are Matched

The scorer loads all FINDING-*.md files from the engagement's findings/ directory and matches each answer key vulnerability against them using a weighted scoring system:

Signal Weight Description
Endpoint match 40 points Normalized path components compared against finding content
Name keywords 25 points Vulnerability name words (>3 chars) matched against finding
CWE match 20 points CWE identifier found in finding content
Vuln type match 15 points Known vulnerability type keywords matched

Match Thresholds

Score Classification
>= 60 Found (True Positive)
35-59 Partial (detected but incomplete exploitation)
< 35 Missed (False Negative)

For endpoint matching, the scorer normalizes paths by splitting on /, removing empty segments and path parameters (segments starting with {), and requiring that all-but-one path components appear in the finding content. This handles variations like /api/v1/employees/{id} matching a finding that references /api/v1/employees/42.

Vulnerability Type Keywords

The scorer maintains a keyword map for common vulnerability types:

vuln_type_keywords = {
    "sqli": ["sql injection", "sqli", "union select", "blind sql"],
    "xss": ["cross-site scripting", "xss", "alert(", "innerhtml"],
    "csrf": ["csrf", "cross-site request forgery"],
    "ssrf": ["ssrf", "server-side request forgery"],
    "idor": ["idor", "insecure direct object"],
    "rce": ["remote code execution", "command injection", "rce"],
    "xxe": ["xxe", "xml external entity"],
    "deserialization": ["deserializ", "unserialize"],
    "path_traversal": ["path traversal", "directory traversal", "../"],
    "open_redirect": ["open redirect", "url redirect"],
}

Score Calculation

The final score uses a weighted formula:

score = found + (partial * 0.5)
score_pct = (score / total_vulns) * 100

Partial findings count as half credit, reflecting that the vulnerability was detected but not fully exploited or verified.

Output: Console Report

The console report provides immediate feedback with per-category and per-difficulty breakdowns:

======================================================================
  LAB SCORE: VulnHR - HR Portal (Meridian Solutions)
  41.5/81 = 51.2%
======================================================================

  Found: 35 | Partial: 13 | Missed: 33
  Extra findings: 5 | Total findings: 40

  Category                            Total Found  Part  Miss     %
  ----------------------------------------------------------------
  A01 - Broken Access Control            12     8     2     2   75%
  A03 - Injection                        13     9     1     3   73%

  Difficulty      Total Found  Part  Miss     %
  ------------------------------------------
  Facile             30    22     4     4   80%
  Media              30    10     6    14   43%
  Difficile          21     3     3    15   21%

  Skill gaps (lowest coverage):
    test-logic               5/12 (42%) -- 7 missed
    test-advanced            2/8 (25%) -- 6 missed

Output: HTML Report

The --html flag generates a standalone HTML page at evals/labs/<lab>/score_report.html with:

  • Color-coded overall score (green >= 80%, yellow >= 60%, red < 60%)
  • Summary statistics cards (found, partial, missed, extra)
  • Execution metrics (duration, token consumption) when provided
  • Full vulnerability table with status icons, severity coloring, matched finding IDs, and match scores

Output: Narrative Gap Analysis

The --narrative flag generates evals/labs/<lab>/gap-analysis.md containing:

Per-Vulnerability Table

A complete table showing every answer key vulnerability with its detection status, matched finding ID, category, severity, and difficulty.

False Negatives by Skill

Groups missed vulnerabilities by the skill that should have detected them. Each entry includes the endpoint and an optional payload hint from the answer key:

### test-injection (3 missed)

- **[VULN-015]** Blind SQLi in Leave Sort -- `/api/v1/leaves?sort=` -- *' AND SLEEP(3)--*
- **[VULN-023]** SSTI in Export Template -- `/api/v1/export` -- *{{7*7}}*

This directly identifies which skills need improvement and provides concrete guidance on what was missed.

Partial Findings

Lists findings that were detected but scored below the full match threshold, indicating incomplete exploitation.

Extra Findings (FP Candidates)

Findings not matched to any answer key entry. These may be valid additional vulnerabilities (0-days, variants) or false positives. The narrative flags them for manual review.

Per-Skill and Per-Category Summary Tables

Coverage percentages with color indicators, sorted from lowest to highest, identifying the weakest skills and OWASP categories for prioritized improvement.

Execution Metrics

Optional flags capture runtime performance data:

Flag Value Purpose
--duration <seconds> Pentest wall-clock time Track speed improvements
--tokens-in <count> Claude API input tokens Monitor cost per engagement
--tokens-out <count> Claude API output tokens Monitor cost per engagement

These metrics are embedded in the scoring JSON and displayed in both console and HTML reports. The /labs-eval skill passes these automatically by extracting them from the Claude CLI stream-json output.

History Tracking

With --save, results are persisted to evals/labs/<lab>/history/<timestamp>.json:

evals/labs/vulnhr/history/
  2026-03-10_1430.json
  2026-03-11_0915.json
  2026-03-13_1100.json

Each history file contains the complete scoring data: per-vulnerability match details (results array), by_category, by_difficulty, and by_skill breakdowns, plus execution metrics if provided. This enables programmatic trend analysis and regression detection across runs.

Integration with /labs-eval

The /labs-eval skill invokes lab-scorer.py automatically after each pentest completes:

python evals/lab-scorer.py "$SCORER_LAB" "$EDIR" --save --html --narrative \
    --duration "$DURATION" \
    --tokens-in "$TOKENS_IN" \
    --tokens-out "$TOKENS_OUT"

Results feed into the suite-level aggregation report that compares performance across all labs, identifies the weakest skills system-wide, and tracks progress against previous runs.