Skill Evaluations

Overview

The skill evaluation system measures individual skill quality independently of full pentest engagements. Using run-eval.sh for execution and score-eval.py for analysis, it tests each skill against predefined prompts and expected behaviors, producing reproducibility metrics and baseline comparisons.

Architecture

evals/
  configs/          Per-skill eval configurations (JSON)
  results/          Skill eval outputs and score reports
  baselines/        Baseline outputs (without skill context)
  run-eval.sh       Eval runner (dispatches claude -p)
  score-eval.py     Eval scorer (analyzes outputs)
  optimize-descriptions.py  Description quality analyzer

Eval Configurations

Each skill has a JSON configuration file at evals/configs/<skill>.json:

{
  "skill": "intake",
  "priority": "high",
  "test_prompts": [
    {
      "id": "intake-guided",
      "prompt": "/intake for https://example.com",
      "expected_behaviors": [
        "Asks about application type and business context",
        "Collects tech stack information",
        "Generates brief.json with _version field"
      ],
      "failure_indicators": [
        "Skips questionnaire entirely",
        "Produces invalid JSON"
      ]
    }
  ],
  "scoring_criteria": {
    "completeness": "All expected behaviors demonstrated",
    "accuracy": "Output matches skill specification",
    "structure": "Valid output format and required fields present"
  }
}

Configuration Fields

Field Description
test_prompts[].id Unique identifier for the test case
test_prompts[].prompt The exact prompt sent to claude -p
test_prompts[].expected_behaviors Behaviors that should appear in the output
test_prompts[].failure_indicators Patterns that indicate the skill malfunctioned
scoring_criteria Human-readable rubric for manual review
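Before dispatching runs, it can help to sanity-check a config against these fields. A minimal sketch (`validate_config` is a hypothetical helper, not part of the tooling; it assumes only the fields documented above):

```python
import json

# Fields every test prompt must carry (from the config schema above)
REQUIRED_PROMPT_FIELDS = {"id", "prompt", "expected_behaviors", "failure_indicators"}

def validate_config(config):
    """Return a list of human-readable problems; empty means the config looks valid."""
    problems = []
    if "skill" not in config:
        problems.append("missing 'skill'")
    prompts = config.get("test_prompts", [])
    if not prompts:
        problems.append("no test_prompts defined")
    for i, tp in enumerate(prompts):
        missing = REQUIRED_PROMPT_FIELDS - set(tp)
        if missing:
            problems.append("test_prompts[%d] missing %s" % (i, sorted(missing)))
    return problems
```

Usage would look like `validate_config(json.load(open("evals/configs/intake.json")))`.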

Running Evaluations

run-eval.sh

The eval runner dispatches prompts via claude -p and captures outputs:

# Run all evals for a skill (3 runs each for consistency)
./evals/run-eval.sh intake

# Run baseline (without skill context) for comparison
./evals/run-eval.sh intake --baseline

# Run a specific test case
./evals/run-eval.sh intake intake-guided

# Run all skill evals
./evals/run-eval.sh --all

# Score results
./evals/run-eval.sh --score intake

# Compare skill vs baseline
./evals/run-eval.sh --compare intake

Execution Details

Each test prompt runs 3 times (RUNS_PER_PROMPT=3) to measure consistency. For each run:

  1. The prompt is sent to claude -p with --max-turns 5 (skill mode) or --max-turns 1 (baseline mode)
  2. Output is captured to evals/results/<skill>/<test_id>_run<N>.txt
  3. Metadata is written to evals/results/<skill>/<test_id>_run<N>.meta.json

Metadata includes skill name, test ID, run number, duration, timestamp, output size, and the expected behaviors and failure indicators for later scoring.
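A sketch of what writing that sidecar file might look like. The field names here are illustrative guesses at the runner's schema, not its exact output:

```python
import json
import time

def write_run_metadata(path, skill, test_id, run_number, duration_s, output_text,
                       expected_behaviors, failure_indicators):
    """Persist per-run metadata next to the captured output (field names assumed)."""
    meta = {
        "skill": skill,
        "test_id": test_id,
        "run": run_number,
        "duration_seconds": duration_s,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "output_bytes": len(output_text.encode("utf-8")),
        # carried along so score-eval.py can score without re-reading the config
        "expected_behaviors": expected_behaviors,
        "failure_indicators": failure_indicators,
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```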

Baseline Comparison

Running with --baseline executes the same prompts without skill context (raw Claude response with --max-turns 1). This measures how much value the skill file adds over the model's built-in knowledge:

# Generate baseline
./evals/run-eval.sh intake --baseline

# Compare
./evals/run-eval.sh --compare intake

The comparison reports average output size difference, average duration difference, and detail improvement percentage per test case.
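Those three metrics can be sketched as a small aggregation. This approximates "detail improvement" as relative output-size growth, which is an assumption about how the runner defines it:

```python
def compare_to_baseline(skill_runs, baseline_runs):
    """Each run is a (output_bytes, duration_seconds) pair."""
    avg = lambda xs: sum(xs) / len(xs)
    skill_size, base_size = avg([r[0] for r in skill_runs]), avg([r[0] for r in baseline_runs])
    skill_dur, base_dur = avg([r[1] for r in skill_runs]), avg([r[1] for r in baseline_runs])
    return {
        "avg_size_delta_bytes": skill_size - base_size,
        "avg_duration_delta_s": skill_dur - base_dur,
        # relative size growth as a proxy for added detail (assumption)
        "detail_improvement_pct": 100.0 * (skill_size - base_size) / base_size if base_size else 0.0,
    }
```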

Scoring with score-eval.py

The scorer analyzes outputs against expected behaviors and failure indicators:

# Score a specific skill
python evals/score-eval.py intake

# Score with HTML report
python evals/score-eval.py intake --html

# Score all skills
python evals/score-eval.py --all --html

Scoring Method

  1. Keyword matching: Each expected behavior is decomposed into keywords (words longer than three characters). The ratio of keywords found in the output to total keywords produces a behavior score (0.0-1.0); scores above 0.3 classify the behavior as PASS.

  2. Failure indicator detection: Similar keyword matching with a 0.4 threshold. Detected failures incur a 15% penalty per indicator on the run score.

  3. Run score calculation:

    run_score = max(0, avg_behavior_score - (failure_count * 0.15))
    

  4. Test score: Average across all runs for that test case.

  5. Overall skill score: Average across all test cases.
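Steps 1-3 can be sketched as follows. The real scorer's tokenization and punctuation handling may differ; this just mirrors the thresholds and penalty stated above:

```python
def _keywords(text):
    # words longer than three characters, lowercased, punctuation stripped
    return {w.lower().strip(".,:;") for w in text.split() if len(w.strip(".,:;")) > 3}

def behavior_score(behavior, output):
    kw = _keywords(behavior)
    if not kw:
        return 0.0
    hits = sum(1 for k in kw if k in output.lower())
    return hits / len(kw)

def run_score(output, expected_behaviors, failure_indicators):
    scores = [behavior_score(b, output) for b in expected_behaviors]
    avg_behavior = sum(scores) / len(scores) if scores else 0.0
    # a failure indicator "fires" at the 0.4 match threshold
    failure_count = sum(1 for f in failure_indicators if behavior_score(f, output) > 0.4)
    return max(0.0, avg_behavior - failure_count * 0.15)
```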

Score Thresholds

Score Color Assessment
>= 80% Green Skill performs well
60-79% Yellow Needs improvement
40-59% Orange Significant gaps
< 40% Red Fundamental issues
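The band mapping above is a straightforward cascade, sketched here for reference:

```python
def assess(score_pct):
    """Map an overall percentage to the report's color band."""
    if score_pct >= 80:
        return "green", "Skill performs well"
    if score_pct >= 60:
        return "yellow", "Needs improvement"
    if score_pct >= 40:
        return "orange", "Significant gaps"
    return "red", "Fundamental issues"
```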

JSON and HTML Reports

Results are saved to evals/results/<skill>/score_report.json automatically. The --html flag additionally generates evals/results/<skill>/score_report.html with:

  • Color-coded overall score and skill priority
  • Per-test drill-down with pass/fail indicators per behavior
  • Per-run detail with failure indicator warnings
  • Baseline comparison deltas when available

Variance Analysis

With 3 runs per prompt, the scorer identifies inconsistent skills:

  • High variance (scores differ >20% across runs): The skill may have non-deterministic behavior, ambiguous instructions, or race conditions in tool execution
  • Consistent failure on specific behaviors: A behavior that fails across all 3 runs indicates a systematic gap in the skill that needs to be addressed
  • Consistent pass: Behaviors passing on all runs confirm stable, reliable skill coverage
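A sketch of that variance analysis (`variance_report` is a hypothetical helper; the scorer's internal representation may differ):

```python
def variance_report(run_scores, behavior_passes):
    """run_scores: one score per run; behavior_passes: behavior -> per-run booleans."""
    spread = max(run_scores) - min(run_scores)
    return {
        "high_variance": spread > 0.20,          # scores differ by more than 20%
        "spread": round(spread, 3),
        "consistent_failures": sorted(b for b, p in behavior_passes.items() if not any(p)),
        "consistent_passes": sorted(b for b, p in behavior_passes.items() if all(p)),
    }
```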

Description Optimization

optimize-descriptions.py analyzes and improves SKILL.md frontmatter descriptions for triggering accuracy:

# Analyze a single skill's description quality
python evals/optimize-descriptions.py analyze intake

# Get improvement suggestions
python evals/optimize-descriptions.py suggest intake

# Generate HTML comparison report for all skills
python evals/optimize-descriptions.py report

Quality Checks

The analyzer evaluates descriptions against several criteria:

Check Penalty Rationale
Too short (<30 chars) -15 pts Insufficient context for skill triggering
Too long (>200 chars) -15 pts Dilutes key triggering terms
No action verb -15 pts Description should state what the skill does
Generic words (various, multiple, etc.) -15 pts Vague terms reduce matching precision
Missing OWASP/CWE reference (test skills) -5 pts Taxonomy anchoring improves relevance
Missing output format mention -5 pts Clarity about what the skill produces
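The penalty table maps naturally onto a score that starts at 100. A sketch, with the caveat that the verb and generic-word lists here are illustrative, not the tool's actual lists:

```python
GENERIC_WORDS = {"various", "multiple", "different", "general"}   # illustrative
ACTION_VERBS = {"test", "scan", "analyze", "generate", "collect", "report"}  # illustrative

def description_score(desc, is_test_skill=False):
    """Apply the penalty table above to a SKILL.md description."""
    score = 100
    words = {w.lower().strip(".,") for w in desc.split()}
    if len(desc) < 30:
        score -= 15
    if len(desc) > 200:
        score -= 15
    if not words & ACTION_VERBS:
        score -= 15
    if words & GENERIC_WORDS:
        score -= 15
    if is_test_skill and not ("OWASP" in desc or "CWE" in desc):
        score -= 5
    # output-format-mention check omitted: the real heuristic is not specified here
    return score
```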

Suggested Improvements

For test skills, the optimizer generates structured description templates that include specific vulnerability types, safety constraints, and output guarantees:

Current:  "Test web application for injection vulnerabilities"
Suggested: "Test for SQLi, XSS, SSTI, CMDi, XXE, NoSQLi, LDAPi vulnerabilities.
            WAF-adaptive, non-destructive. Every finding verified with reproducible PoC."

Comparison Report

The report command generates evals/results/description_analysis.html with:

  • All skills ranked by description quality score (worst first)
  • Issue and suggestion counts per skill
  • Badges for AI-enabled skills (disable-model-invocation: false) and context-forked skills (context: fork)
  • Summary statistics: good/needs work/poor counts and average score across all skills

Integration Workflow

A typical skill improvement cycle:

  1. Run eval baseline: ./evals/run-eval.sh <skill> --baseline
  2. Run skill eval: ./evals/run-eval.sh <skill>
  3. Score and compare: python evals/score-eval.py <skill> --html
  4. Identify failing behaviors from the HTML report
  5. Update the skill's SKILL.md to address gaps
  6. Analyze description quality: python evals/optimize-descriptions.py analyze <skill>
  7. Apply description suggestions if score is below 80%
  8. Re-run eval to verify improvement: ./evals/run-eval.sh <skill>
  9. Compare with baseline: ./evals/run-eval.sh --compare <skill>
  10. Review improvement delta and iterate if needed