Skill Evaluations

Overview

The skill evaluation system measures individual skill quality independently of full pentest engagements. Using run-eval.sh for execution and score-eval.py for analysis, it tests each skill against predefined prompts and expected behaviors, producing reproducibility metrics and baseline comparisons.

Architecture

evals/
  configs/          Per-skill eval configurations (JSON)
  results/          Skill eval outputs and score reports
  baselines/        Baseline outputs (without skill context)
  run-eval.sh       Eval runner (dispatches claude -p)
  score-eval.py     Eval scorer (analyzes outputs)
  optimize-descriptions.py  Description quality analyzer

Eval Configurations

Each skill has a JSON configuration file at evals/configs/<skill>.json:

{
  "skill": "intake",
  "priority": "high",
  "test_prompts": [
    {
      "id": "intake-guided",
      "prompt": "/intake for https://example.com",
      "expected_behaviors": [
        "Asks about application type and business context",
        "Collects tech stack information",
        "Generates brief.json with _version field"
      ],
      "failure_indicators": [
        "Skips questionnaire entirely",
        "Produces invalid JSON"
      ]
    }
  ],
  "scoring_criteria": {
    "completeness": "All expected behaviors demonstrated",
    "accuracy": "Output matches skill specification",
    "structure": "Valid output format and required fields present"
  }
}

Configuration Fields

Field Description
test_prompts[].id Unique identifier for the test case
test_prompts[].prompt The exact prompt sent to claude -p
test_prompts[].expected_behaviors Behaviors that should appear in the output
test_prompts[].failure_indicators Patterns that indicate the skill malfunctioned
scoring_criteria Human-readable rubric for manual review
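Before dispatching runs, it can help to sanity-check a config against these fields. A minimal sketch (`validate_config` is a hypothetical helper, not part of the tooling; it assumes only the fields documented above):

```python
import json

# Fields every test prompt must carry (from the config schema above)
REQUIRED_PROMPT_FIELDS = {"id", "prompt", "expected_behaviors", "failure_indicators"}

def validate_config(config):
    """Return a list of human-readable problems; empty means the config looks valid."""
    problems = []
    if "skill" not in config:
        problems.append("missing 'skill'")
    prompts = config.get("test_prompts", [])
    if not prompts:
        problems.append("no test_prompts defined")
    for i, tp in enumerate(prompts):
        missing = REQUIRED_PROMPT_FIELDS - set(tp)
        if missing:
            problems.append("test_prompts[%d] missing %s" % (i, sorted(missing)))
    return problems
```

Usage would look like `validate_config(json.load(open("evals/configs/intake.json")))`.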

Running Evaluations

run-eval.sh

The eval runner dispatches prompts via claude -p and captures outputs:

# Run all evals for a skill (3 runs each for consistency)
./evals/run-eval.sh intake

# Run baseline (without skill context) for comparison
./evals/run-eval.sh intake --baseline

# Run a specific test case
./evals/run-eval.sh intake intake-guided

# Run all skill evals
./evals/run-eval.sh --all

# Score results
./evals/run-eval.sh --score intake

# Compare skill vs baseline
./evals/run-eval.sh --compare intake

Execution Details

Each test prompt runs 3 times (RUNS_PER_PROMPT=3) to measure consistency. For each run:

  1. The prompt is sent to claude -p with --max-turns 5 (skill mode) or --max-turns 1 (baseline mode)
  2. Output is captured to evals/results/<skill>/<test_id>_run<N>.txt
  3. Metadata is written to evals/results/<skill>/<test_id>_run<N>.meta.json

Metadata includes skill name, test ID, run number, duration, timestamp, output size, and the expected behaviors and failure indicators for later scoring.
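A sketch of what writing that sidecar file might look like. The field names here are illustrative guesses at the runner's schema, not its exact output:

```python
import json
import time

def write_run_metadata(path, skill, test_id, run_number, duration_s, output_text,
                       expected_behaviors, failure_indicators):
    """Persist per-run metadata next to the captured output (field names assumed)."""
    meta = {
        "skill": skill,
        "test_id": test_id,
        "run": run_number,
        "duration_seconds": duration_s,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "output_bytes": len(output_text.encode("utf-8")),
        # carried along so score-eval.py can score without re-reading the config
        "expected_behaviors": expected_behaviors,
        "failure_indicators": failure_indicators,
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```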

Baseline Comparison

Running with --baseline executes the same prompts without skill context (raw Claude response with --max-turns 1). This measures how much value the skill file adds over the model's built-in knowledge:

# Generate baseline
./evals/run-eval.sh intake --baseline

# Compare
./evals/run-eval.sh --compare intake

The comparison reports average output size difference, average duration difference, and detail improvement percentage per test case.
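Those three metrics can be sketched as a small aggregation. This approximates "detail improvement" as relative output-size growth, which is an assumption about how the runner defines it:

```python
def compare_to_baseline(skill_runs, baseline_runs):
    """Each run is a (output_bytes, duration_seconds) pair."""
    avg = lambda xs: sum(xs) / len(xs)
    skill_size, base_size = avg([r[0] for r in skill_runs]), avg([r[0] for r in baseline_runs])
    skill_dur, base_dur = avg([r[1] for r in skill_runs]), avg([r[1] for r in baseline_runs])
    return {
        "avg_size_delta_bytes": skill_size - base_size,
        "avg_duration_delta_s": skill_dur - base_dur,
        # relative size growth as a proxy for added detail (assumption)
        "detail_improvement_pct": 100.0 * (skill_size - base_size) / base_size if base_size else 0.0,
    }
```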

Scoring with score-eval.py

The scorer analyzes outputs against expected behaviors and failure indicators:

# Score a specific skill
python evals/score-eval.py intake

# Score with HTML report
python evals/score-eval.py intake --html

# Score all skills
python evals/score-eval.py --all --html

Scoring Method

  1. Keyword matching: Each expected behavior is decomposed into keywords (words longer than three characters). The ratio of keywords found in the output to total keywords produces a behavior score (0.0-1.0); scores above 0.3 classify the behavior as PASS.

  2. Failure indicator detection: Similar keyword matching with a 0.4 threshold. Detected failures incur a 15% penalty per indicator on the run score.

  3. Run score calculation:

    run_score = max(0, avg_behavior_score - (failure_count * 0.15))
    

  4. Test score: Average across all runs for that test case.

  5. Overall skill score: Average across all test cases.
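Steps 1-3 can be sketched as follows. The real scorer's tokenization and punctuation handling may differ; this just mirrors the thresholds and penalty stated above:

```python
def _keywords(text):
    # words longer than three characters, lowercased, punctuation stripped
    return {w.lower().strip(".,:;") for w in text.split() if len(w.strip(".,:;")) > 3}

def behavior_score(behavior, output):
    kw = _keywords(behavior)
    if not kw:
        return 0.0
    hits = sum(1 for k in kw if k in output.lower())
    return hits / len(kw)

def run_score(output, expected_behaviors, failure_indicators):
    scores = [behavior_score(b, output) for b in expected_behaviors]
    avg_behavior = sum(scores) / len(scores) if scores else 0.0
    # a failure indicator "fires" at the 0.4 match threshold
    failure_count = sum(1 for f in failure_indicators if behavior_score(f, output) > 0.4)
    return max(0.0, avg_behavior - failure_count * 0.15)
```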

Score Thresholds

Score Color Assessment
>= 80% Green Skill performs well
60-79% Yellow Needs improvement
40-59% Orange Significant gaps
< 40% Red Fundamental issues
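The band mapping above is a straightforward cascade, sketched here for reference:

```python
def assess(score_pct):
    """Map an overall percentage to the report's color band."""
    if score_pct >= 80:
        return "green", "Skill performs well"
    if score_pct >= 60:
        return "yellow", "Needs improvement"
    if score_pct >= 40:
        return "orange", "Significant gaps"
    return "red", "Fundamental issues"
```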

JSON and HTML Reports

Results are saved to evals/results/<skill>/score_report.json automatically. The --html flag additionally generates evals/results/<skill>/score_report.html with:

  • Color-coded overall score and skill priority
  • Per-test drill-down with pass/fail indicators per behavior
  • Per-run detail with failure indicator warnings
  • Baseline comparison deltas when available

Variance Analysis

With 3 runs per prompt, the scorer identifies inconsistent skills:

  • High variance (scores differ >20% across runs): The skill may have non-deterministic behavior, ambiguous instructions, or race conditions in tool execution
  • Consistent failure on specific behaviors: A behavior that fails across all 3 runs indicates a systematic gap in the skill that needs to be addressed
  • Consistent pass: Behaviors passing on all runs confirm stable, reliable skill coverage
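A sketch of that variance analysis (`variance_report` is a hypothetical helper; the scorer's internal representation may differ):

```python
def variance_report(run_scores, behavior_passes):
    """run_scores: one score per run; behavior_passes: behavior -> per-run booleans."""
    spread = max(run_scores) - min(run_scores)
    return {
        "high_variance": spread > 0.20,          # scores differ by more than 20%
        "spread": round(spread, 3),
        "consistent_failures": sorted(b for b, p in behavior_passes.items() if not any(p)),
        "consistent_passes": sorted(b for b, p in behavior_passes.items() if all(p)),
    }
```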

Description Optimization

optimize-descriptions.py analyzes and improves SKILL.md frontmatter descriptions for triggering accuracy:

# Analyze a single skill's description quality
python evals/optimize-descriptions.py analyze intake

# Get improvement suggestions
python evals/optimize-descriptions.py suggest intake

# Generate HTML comparison report for all skills
python evals/optimize-descriptions.py report

Quality Checks

The analyzer evaluates descriptions against several criteria:

Check Penalty Rationale
Too short (<30 chars) -15 pts Insufficient context for skill triggering
Too long (>200 chars) -15 pts Dilutes key triggering terms
No action verb -15 pts Description should state what the skill does
Generic words (various, multiple, etc.) -15 pts Vague terms reduce matching precision
Missing OWASP/CWE reference (test skills) -5 pts Taxonomy anchoring improves relevance
Missing output format mention -5 pts Clarity about what the skill produces
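The penalty table maps naturally onto a score that starts at 100. A sketch, with the caveat that the verb and generic-word lists here are illustrative, not the tool's actual lists:

```python
GENERIC_WORDS = {"various", "multiple", "different", "general"}   # illustrative
ACTION_VERBS = {"test", "scan", "analyze", "generate", "collect", "report"}  # illustrative

def description_score(desc, is_test_skill=False):
    """Apply the penalty table above to a SKILL.md description."""
    score = 100
    words = {w.lower().strip(".,") for w in desc.split()}
    if len(desc) < 30:
        score -= 15
    if len(desc) > 200:
        score -= 15
    if not words & ACTION_VERBS:
        score -= 15
    if words & GENERIC_WORDS:
        score -= 15
    if is_test_skill and not ("OWASP" in desc or "CWE" in desc):
        score -= 5
    # output-format-mention check omitted: the real heuristic is not specified here
    return score
```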

Suggested Improvements

For test skills, the optimizer generates structured description templates that include specific vulnerability types, safety constraints, and output guarantees:

Current:  "Test web application for injection vulnerabilities"
Suggested: "Test for SQLi, XSS, SSTI, CMDi, XXE, NoSQLi, LDAPi vulnerabilities.
            WAF-adaptive, non-destructive. Every finding verified with reproducible PoC."

Comparison Report

The report command generates evals/results/description_analysis.html with:

  • All skills ranked by description quality score (worst first)
  • Issue and suggestion counts per skill
  • Badges for AI-enabled skills (disable-model-invocation: false) and context-forked skills (context: fork)
  • Summary statistics: good/needs work/poor counts and average score across all skills

Integration Workflow

A typical skill improvement cycle:

  1. Run eval baseline: ./evals/run-eval.sh <skill> --baseline
  2. Run skill eval: ./evals/run-eval.sh <skill>
  3. Score and compare: python evals/score-eval.py <skill> --html
  4. Identify failing behaviors from the HTML report
  5. Update the skill's SKILL.md to address gaps
  6. Analyze description quality: python evals/optimize-descriptions.py analyze <skill>
  7. Apply description suggestions if score is below 80%
  8. Re-run eval to verify improvement: ./evals/run-eval.sh <skill>
  9. Compare with baseline: ./evals/run-eval.sh --compare <skill>
  10. Review improvement delta and iterate if needed