# Lab Scoring

## Overview

`lab-scorer.py` compares pentest findings against a lab's answer key to produce objective detection metrics. It classifies each answer-key vulnerability as found (true positive), partially found, or missed (false negative), flags unmatched findings as false-positive candidates, and reports per-category, per-difficulty, and per-skill breakdowns.

## Usage
```bash
# Basic scoring
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331

# Save to history
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save

# Generate HTML report
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save --html

# Full scoring with gap-analysis narrative
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save --html --narrative

# Include execution metrics
python evals/lab-scorer.py vulnhr engagements/vulnhr-test-7331 --save --html --narrative \
    --duration 9720 --tokens-in 1200000 --tokens-out 340000
```
## How Findings Are Matched

The scorer loads all `FINDING-*.md` files from the engagement's `findings/` directory and matches each answer-key vulnerability against them using a weighted scoring system:
| Signal | Weight | Description |
|---|---|---|
| Endpoint match | 40 points | Normalized path components compared against finding content |
| Name keywords | 25 points | Vulnerability name words (>3 chars) matched against finding |
| CWE match | 20 points | CWE identifier found in finding content |
| Vuln type match | 15 points | Known vulnerability type keywords matched |
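The weighted accumulation can be sketched roughly as follows. The answer-key field names (`endpoint`, `name`, `cwe`, `vuln_type`) and the all-or-nothing treatment of each signal are assumptions for illustration, not the scorer's actual internals:

```python
# Sketch of the weighted signal accumulation; field names are assumed.
VULN_TYPE_KEYWORDS = {"sqli": ["sql injection", "sqli", "union select", "blind sql"]}

def match_score(vuln: dict, finding_text: str) -> int:
    text = finding_text.lower()
    score = 0
    # Endpoint (40): all but one normalized path segment must appear.
    segs = [s for s in vuln.get("endpoint", "").split("/")
            if s and not s.startswith("{")]
    if segs and sum(s in text for s in segs) >= len(segs) - 1:
        score += 40
    # Name keywords (25): vulnerability-name words longer than 3 characters.
    words = [w for w in vuln.get("name", "").lower().split() if len(w) > 3]
    if words and any(w in text for w in words):
        score += 25
    # CWE identifier (20), e.g. "cwe-89".
    cwe = vuln.get("cwe", "").lower()
    if cwe and cwe in text:
        score += 20
    # Known vulnerability-type keywords (15).
    if any(k in text for k in VULN_TYPE_KEYWORDS.get(vuln.get("vuln_type", ""), [])):
        score += 15
    return score
```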
### Match Thresholds
| Score | Classification |
|---|---|
| >= 60 | Found (True Positive) |
| 35-59 | Partial (detected but incomplete exploitation) |
| < 35 | Missed (False Negative) |
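The thresholds translate directly into a classification step (sketch; the function name is assumed):

```python
def classify(score: int) -> str:
    """Map a weighted match score to a detection status per the thresholds above."""
    if score >= 60:
        return "found"    # true positive
    if score >= 35:
        return "partial"  # detected but incomplete exploitation
    return "missed"       # false negative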
For endpoint matching, the scorer normalizes paths by splitting on `/`, dropping empty segments and path parameters (segments starting with `{`), and requiring that all but one of the remaining segments appear in the finding content. This handles variations such as `/api/v1/employees/{id}` matching a finding that references `/api/v1/employees/42`.
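A minimal sketch of that normalization and comparison (function names are illustrative, not the tool's API):

```python
def normalize(path: str) -> list[str]:
    """Split on '/', dropping empty segments and '{param}' placeholders."""
    return [s for s in path.split("/") if s and not s.startswith("{")]

def endpoint_matches(key_path: str, finding_text: str) -> bool:
    segs = normalize(key_path)
    hits = sum(s in finding_text for s in segs)
    return bool(segs) and hits >= len(segs) - 1  # all but one must appear
```

Here `normalize("/api/v1/employees/{id}")` yields `["api", "v1", "employees"]`, so a finding that mentions `/api/v1/employees/42` matches.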
## Vulnerability Type Keywords

The scorer maintains a keyword map for common vulnerability types:
```python
vuln_type_keywords = {
    "sqli": ["sql injection", "sqli", "union select", "blind sql"],
    "xss": ["cross-site scripting", "xss", "alert(", "innerhtml"],
    "csrf": ["csrf", "cross-site request forgery"],
    "ssrf": ["ssrf", "server-side request forgery"],
    "idor": ["idor", "insecure direct object"],
    "rce": ["remote code execution", "command injection", "rce"],
    "xxe": ["xxe", "xml external entity"],
    "deserialization": ["deserializ", "unserialize"],
    "path_traversal": ["path traversal", "directory traversal", "../"],
    "open_redirect": ["open redirect", "url redirect"],
}
```
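One plausible use of this map, sketched here as a standalone helper (the scorer's actual lookup may differ), is tagging a finding's text with every type whose keywords it contains:

```python
def detect_types(finding_text: str, keyword_map: dict[str, list[str]]) -> set[str]:
    """Return every vulnerability type with at least one keyword in the text."""
    text = finding_text.lower()
    return {vt for vt, kws in keyword_map.items() if any(k in text for k in kws)}
```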
## Score Calculation

The final score uses a weighted formula: `score = found + 0.5 × partial`, expressed as a percentage of the total number of answer-key vulnerabilities. Partial findings count as half credit, reflecting that the vulnerability was detected but not fully exploited or verified.
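The arithmetic matches the sample console report (35 found + 13 partial + 33 missed gives 41.5/81 = 51.2%). A sketch:

```python
def overall_score(found: int, partial: int, missed: int) -> tuple[float, int, float]:
    """Return (points earned, total vulns, percentage); partials earn half credit."""
    total = found + partial + missed
    points = found + 0.5 * partial
    return points, total, round(100 * points / total, 1)
```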
## Output: Console Report

The console report provides immediate feedback with per-category and per-difficulty breakdowns:

```
======================================================================
LAB SCORE: VulnHR - HR Portal (Meridian Solutions)
41.5/81 = 51.2%
======================================================================
Found: 35 | Partial: 13 | Missed: 33
Extra findings: 5 | Total findings: 40

Category                        Total  Found  Part  Miss     %
----------------------------------------------------------------
A01 - Broken Access Control        12      8     2     2   75%
A03 - Injection                    13      9     1     3   73%

Difficulty    Total  Found  Part  Miss     %
------------------------------------------
Facile           30     22     4     4   80%
Media            30     10     6    14   43%
Difficile        21      3     3    15   21%

Skill gaps (lowest coverage):
  test-logic      5/12 (42%) -- 7 missed
  test-advanced    2/8 (25%) -- 6 missed
```
## Output: HTML Report

The `--html` flag generates a standalone HTML page at `evals/labs/<lab>/score_report.html` with:
- Color-coded overall score (green >= 80%, yellow >= 60%, red < 60%)
- Summary statistics cards (found, partial, missed, extra)
- Execution metrics (duration, token consumption) when provided
- Full vulnerability table with status icons, severity coloring, matched finding IDs, and match scores
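The color banding can be transcribed directly (sketch; the function name is assumed):

```python
def score_color(pct: float) -> str:
    """Map an overall score percentage to the report's traffic-light color."""
    if pct >= 80:
        return "green"
    if pct >= 60:
        return "yellow"
    return "red"
```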
## Output: Narrative Gap Analysis

The `--narrative` flag generates `evals/labs/<lab>/gap-analysis.md` containing:

### Per-Vulnerability Table

A complete table showing every answer-key vulnerability with its detection status, matched finding ID, category, severity, and difficulty.

### False Negatives by Skill

Groups missed vulnerabilities by the skill that should have detected them. Each entry includes the endpoint and an optional payload hint from the answer key:

```markdown
### test-injection (3 missed)

- **[VULN-015]** Blind SQLi in Leave Sort -- `/api/v1/leaves?sort=` -- *' AND SLEEP(3)--*
- **[VULN-023]** SSTI in Export Template -- `/api/v1/export` -- *{{7*7}}*
```

This directly identifies which skills need improvement and provides concrete guidance on what was missed.
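Rendering that section can be sketched as below; the answer-key field names (`skill`, `id`, `name`, `endpoint`, `payload_hint`) are assumptions about the schema:

```python
def fn_by_skill(missed: list[dict]) -> str:
    """Group missed vulns by responsible skill and emit a markdown section."""
    by_skill: dict[str, list[dict]] = {}
    for v in missed:
        by_skill.setdefault(v["skill"], []).append(v)
    lines = []
    for skill, vulns in sorted(by_skill.items()):
        lines.append(f"### {skill} ({len(vulns)} missed)")
        for v in vulns:
            hint = f" -- *{v['payload_hint']}*" if v.get("payload_hint") else ""
            lines.append(f"- **[{v['id']}]** {v['name']} -- `{v['endpoint']}`{hint}")
    return "\n".join(lines)
```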
### Partial Findings

Lists findings that were detected but scored below the full match threshold, indicating incomplete exploitation.

### Extra Findings (FP Candidates)

Findings not matched to any answer-key entry. These may be valid additional vulnerabilities (0-days, variants) or false positives. The narrative flags them for manual review.

### Per-Skill and Per-Category Summary Tables

Coverage percentages with color indicators, sorted from lowest to highest, identifying the weakest skills and OWASP categories for prioritized improvement.

## Execution Metrics

Optional flags capture runtime performance data:
| Flag | Value | Purpose |
|---|---|---|
| `--duration <seconds>` | Pentest wall-clock time | Track speed improvements |
| `--tokens-in <count>` | Claude API input tokens | Monitor cost per engagement |
| `--tokens-out <count>` | Claude API output tokens | Monitor cost per engagement |
These metrics are embedded in the scoring JSON and displayed in both console and HTML reports. The /labs-eval skill passes these automatically by extracting them from the Claude CLI stream-json output.
## History Tracking

With `--save`, results are persisted to `evals/labs/<lab>/history/<timestamp>.json`.

Each history file contains the complete scoring data: per-vulnerability match details (the `results` array), `by_category`, `by_difficulty`, and `by_skill` breakdowns, plus execution metrics if provided. This enables programmatic trend analysis and regression detection across runs.
## Integration with /labs-eval

The /labs-eval skill invokes `lab-scorer.py` automatically after each pentest completes:

```bash
python evals/lab-scorer.py "$SCORER_LAB" "$EDIR" --save --html --narrative \
    --duration "$DURATION" \
    --tokens-in "$TOKENS_IN" \
    --tokens-out "$TOKENS_OUT"
```
Results feed into the suite-level aggregation report that compares performance across all labs, identifies the weakest skills system-wide, and tracks progress against previous runs.
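The suite-level merge of per-lab `by_skill` breakdowns could work as in this sketch, assuming each breakdown has the shape `{"skill": {"found": n, "total": m}}` (the real aggregation report's schema may differ):

```python
def weakest_skills(per_lab: list[dict]) -> list[tuple[str, float]]:
    """Merge per-lab by_skill breakdowns and rank skills weakest-first."""
    totals: dict[str, list[int]] = {}
    for lab in per_lab:
        for skill, s in lab.items():
            t = totals.setdefault(skill, [0, 0])
            t[0] += s["found"]
            t[1] += s["total"]
    return sorted(((k, round(100 * f / n, 1)) for k, (f, n) in totals.items()),
                  key=lambda kv: kv[1])
```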