Skill Evaluations¶
Overview¶
The skill evaluation system measures individual skill quality independently from full pentest engagements. Using run-eval.sh for execution and score-eval.py for analysis, it tests each skill against predefined prompts and expected behaviors, producing reproducibility metrics and baseline comparisons.
Architecture¶
evals/
├── configs/                    Per-skill eval configurations (JSON)
├── results/                    Skill eval outputs and score reports
├── baselines/                  Baseline outputs (without skill context)
├── run-eval.sh                 Eval runner (dispatches claude -p)
├── score-eval.py               Eval scorer (analyzes outputs)
└── optimize-descriptions.py    Description quality analyzer
Eval Configurations¶
Each skill has a JSON configuration file at evals/configs/<skill>.json:
{
  "skill": "intake",
  "priority": "high",
  "test_prompts": [
    {
      "id": "intake-guided",
      "prompt": "/intake for https://example.com",
      "expected_behaviors": [
        "Asks about application type and business context",
        "Collects tech stack information",
        "Generates brief.json with _version field"
      ],
      "failure_indicators": [
        "Skips questionnaire entirely",
        "Produces invalid JSON"
      ]
    }
  ],
  "scoring_criteria": {
    "completeness": "All expected behaviors demonstrated",
    "accuracy": "Output matches skill specification",
    "structure": "Valid output format and required fields present"
  }
}
Configuration Fields¶
| Field | Description |
|---|---|
| test_prompts[].id | Unique identifier for the test case |
| test_prompts[].prompt | The exact prompt sent to claude -p |
| test_prompts[].expected_behaviors | Behaviors that should appear in the output |
| test_prompts[].failure_indicators | Patterns that indicate the skill malfunctioned |
| scoring_criteria | Human-readable rubric for manual review |
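The fields above can be sanity-checked before dispatching any runs. The sketch below is a hypothetical loader (the `load_eval_config` helper and its validation rules are illustrative, not part of the toolkit):

```python
import json
from pathlib import Path

# Fields every test case must carry, per the table above
REQUIRED_PROMPT_FIELDS = {"id", "prompt", "expected_behaviors", "failure_indicators"}

def load_eval_config(skill: str, root: str = "evals/configs") -> dict:
    """Load evals/configs/<skill>.json and check each test case for required fields."""
    config = json.loads(Path(root, f"{skill}.json").read_text())
    for case in config.get("test_prompts", []):
        missing = REQUIRED_PROMPT_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{skill}/{case.get('id', '?')}: missing {sorted(missing)}")
    return config
```

A loader like this fails fast on a malformed config instead of burning three runs per prompt on it.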
Running Evaluations¶
run-eval.sh¶
The eval runner dispatches prompts via claude -p and captures outputs:
# Run all evals for a skill (3 runs each for consistency)
./evals/run-eval.sh intake
# Run baseline (without skill context) for comparison
./evals/run-eval.sh intake --baseline
# Run a specific test case
./evals/run-eval.sh intake intake-guided
# Run all skill evals
./evals/run-eval.sh --all
# Score results
./evals/run-eval.sh --score intake
# Compare skill vs baseline
./evals/run-eval.sh --compare intake
Execution Details¶
Each test prompt runs 3 times (RUNS_PER_PROMPT=3) to measure consistency. For each run:
- The prompt is sent to claude -p with --max-turns 5 (skill mode) or --max-turns 1 (baseline mode)
- Output is captured to evals/results/<skill>/<test_id>_run<N>.txt
- Metadata is written to evals/results/<skill>/<test_id>_run<N>.meta.json
Metadata includes skill name, test ID, run number, duration, timestamp, output size, and the expected behaviors and failure indicators for later scoring.
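The per-run metadata write can be sketched like this; the exact field names are illustrative guesses, since run-eval.sh defines the real schema:

```python
import json
import time
from pathlib import Path

def write_run_metadata(skill, test_case, run_number, duration_s, output_path):
    """Write <test_id>_run<N>.meta.json next to the captured output (sketch)."""
    output_path = str(output_path)
    meta = {
        "skill": skill,
        "test_id": test_case["id"],
        "run": run_number,
        "duration_seconds": duration_s,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "output_bytes": Path(output_path).stat().st_size,
        # Carried along so the scorer can work without re-reading the config
        "expected_behaviors": test_case["expected_behaviors"],
        "failure_indicators": test_case["failure_indicators"],
    }
    meta_path = output_path.removesuffix(".txt") + ".meta.json"
    Path(meta_path).write_text(json.dumps(meta, indent=2))
    return meta_path
```

Embedding the expected behaviors in the metadata is what lets score-eval.py score results standalone.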
Baseline Comparison¶
Running with --baseline executes the same prompts without skill context (raw Claude response with --max-turns 1). This measures how much value the skill file adds over the model's built-in knowledge:
# Generate baseline
./evals/run-eval.sh intake --baseline
# Compare
./evals/run-eval.sh --compare intake
The comparison reports average output size difference, average duration difference, and detail improvement percentage per test case.
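Given the run metadata, the comparison deltas reduce to simple averages. A minimal sketch, assuming per-run dicts with `output_bytes` and `duration_seconds` keys, and reading "detail improvement" as relative output-size growth over baseline (an assumption about how the script computes it):

```python
def compare_runs(skill_metas: list, baseline_metas: list) -> dict:
    """Average skill-vs-baseline deltas for one test case (illustrative sketch)."""
    def avg(metas, key):
        return sum(m[key] for m in metas) / len(metas)

    baseline_bytes = avg(baseline_metas, "output_bytes")
    size_delta = avg(skill_metas, "output_bytes") - baseline_bytes
    return {
        "size_delta": size_delta,
        "duration_delta": avg(skill_metas, "duration_seconds") - avg(baseline_metas, "duration_seconds"),
        # Percentage growth in output size relative to the baseline runs
        "detail_improvement_pct": 100.0 * size_delta / baseline_bytes,
    }
```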
Scoring with score-eval.py¶
The scorer analyzes outputs against expected behaviors and failure indicators:
# Score a specific skill
python evals/score-eval.py intake
# Score with HTML report
python evals/score-eval.py intake --html
# Score all skills
python evals/score-eval.py --all --html
Scoring Method¶
- Keyword matching: Each expected behavior is decomposed into keywords (words >3 characters). The ratio of matched keywords to total keywords produces a behavior score (0.0-1.0). Scores above 0.3 are classified as PASS.
- Failure indicator detection: Similar keyword matching with a 0.4 threshold. Detected failures incur a 15% penalty per indicator on the run score.
- Run score calculation: Behavior scores for the run are averaged, then reduced by the failure-indicator penalties.
- Test score: Average across all runs for that test case.
- Overall skill score: Average across all test cases.
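The keyword-matching and penalty steps can be sketched as follows. The function names are illustrative, and the multiplicative form of the penalty is an assumption; the docs only state "15% penalty per indicator":

```python
import re

def _keywords(text: str) -> list:
    # Words longer than 3 characters, case-insensitive
    return [w for w in re.findall(r"[A-Za-z_]+", text.lower()) if len(w) > 3]

def behavior_score(behavior: str, output: str) -> float:
    """Fraction of a behavior's keywords found in the output (PASS above 0.3)."""
    keywords = _keywords(behavior)
    if not keywords:
        return 0.0
    text = output.lower()
    return sum(1 for kw in keywords if kw in text) / len(keywords)

def run_score(expected_behaviors, failure_indicators, output) -> float:
    """Score one run: average behavior score, minus 15% per detected failure."""
    scores = [behavior_score(b, output) for b in expected_behaviors]
    base = sum(scores) / len(scores) if scores else 0.0
    # Failure indicators use the higher 0.4 detection threshold
    failures = sum(1 for ind in failure_indicators
                   if behavior_score(ind, output) >= 0.4)
    return max(0.0, base * (1 - 0.15 * failures))
```

For example, "Collects tech stack information" yields the keywords collects/tech/stack/information; an output containing three of the four scores 0.75 and passes.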
Score Thresholds¶
| Score | Color | Assessment |
|---|---|---|
| >= 80% | Green | Skill performs well |
| 60-79% | Yellow | Needs improvement |
| 40-59% | Orange | Significant gaps |
| < 40% | Red | Fundamental issues |
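The thresholds map to a straightforward lookup; a minimal sketch (function name assumed):

```python
def assessment(score_pct: float) -> tuple:
    """Map an overall score (0-100) to its color band and assessment text."""
    if score_pct >= 80:
        return ("green", "Skill performs well")
    if score_pct >= 60:
        return ("yellow", "Needs improvement")
    if score_pct >= 40:
        return ("orange", "Significant gaps")
    return ("red", "Fundamental issues")
```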
JSON and HTML Reports¶
Results are saved to evals/results/<skill>/score_report.json automatically. The --html flag additionally generates evals/results/<skill>/score_report.html with:
- Color-coded overall score and skill priority
- Per-test drill-down with pass/fail indicators per behavior
- Per-run detail with failure indicator warnings
- Baseline comparison deltas when available
Variance Analysis¶
With 3 runs per prompt, the scorer identifies inconsistent skills:
- High variance (scores differ >20% across runs): The skill may have non-deterministic behavior, ambiguous instructions, or race conditions in tool execution
- Consistent failure on specific behaviors: A behavior that fails across all 3 runs indicates a systematic gap in the skill that needs to be addressed
- Consistent pass: Behaviors passing on all runs confirm stable, reliable skill coverage
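The three variance rules above can be sketched as a classifier over one test case's runs (the function name, flag strings, and input shapes are illustrative):

```python
def variance_flags(run_scores, behavior_passes):
    """Classify consistency across the runs of one test case.

    run_scores: one score per run (0.0-1.0)
    behavior_passes: {behavior: [pass_run1, pass_run2, pass_run3]}
    """
    flags = []
    # High variance: scores differ by more than 20% across runs
    if max(run_scores) - min(run_scores) > 0.20:
        flags.append("high variance")
    for behavior, passes in behavior_passes.items():
        if not any(passes):
            flags.append(f"systematic gap: {behavior}")  # fails every run
        elif all(passes):
            flags.append(f"stable: {behavior}")          # passes every run
    return flags
```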
Description Optimization¶
optimize-descriptions.py analyzes and improves SKILL.md frontmatter descriptions for triggering accuracy:
# Analyze a single skill's description quality
python evals/optimize-descriptions.py analyze intake
# Get improvement suggestions
python evals/optimize-descriptions.py suggest intake
# Generate HTML comparison report for all skills
python evals/optimize-descriptions.py report
Quality Checks¶
The analyzer evaluates descriptions against several criteria:
| Check | Penalty | Rationale |
|---|---|---|
| Too short (<30 chars) | -15 pts | Insufficient context for skill triggering |
| Too long (>200 chars) | -15 pts | Dilutes key triggering terms |
| No action verb | -15 pts | Description should state what the skill does |
| Generic words (various, multiple, etc.) | -15 pts | Vague terms reduce matching precision |
| Missing OWASP/CWE reference (test skills) | -5 pts | Taxonomy anchoring improves relevance |
| Missing output format mention | -5 pts | Clarity about what the skill produces |
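The penalty table reduces to a linear checklist. In this sketch the action-verb, generic-word, and output-term lists are guesses for illustration, not the analyzer's actual word lists:

```python
GENERIC_WORDS = {"various", "multiple", "different", "several"}   # illustrative
ACTION_VERBS = {"test", "scan", "generate", "analyze", "collect", "enumerate"}  # illustrative
OUTPUT_TERMS = ("output", "report", "finding", "json")            # illustrative

def score_description(desc: str, is_test_skill: bool = False) -> int:
    """Apply the penalty table above to a SKILL.md frontmatter description."""
    score = 100
    words = {w.strip(".,") for w in desc.lower().split()}
    if len(desc) < 30:
        score -= 15  # too short
    if len(desc) > 200:
        score -= 15  # too long
    if not words & ACTION_VERBS:
        score -= 15  # no action verb
    if words & GENERIC_WORDS:
        score -= 15  # generic words
    if is_test_skill and not any(t in desc.upper() for t in ("OWASP", "CWE")):
        score -= 5   # missing taxonomy anchor
    if not any(t in desc.lower() for t in OUTPUT_TERMS):
        score -= 5   # missing output format mention
    return score
```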
Suggested Improvements¶
For test skills, the optimizer generates structured description templates that include specific vulnerability types, safety constraints, and output guarantees:
Current: "Test web application for injection vulnerabilities"
Suggested: "Test for SQLi, XSS, SSTI, CMDi, XXE, NoSQLi, LDAPi vulnerabilities.
WAF-adaptive, non-destructive. Every finding verified with reproducible PoC."
Comparison Report¶
The report command generates evals/results/description_analysis.html with:
- All skills ranked by description quality score (worst first)
- Issue and suggestion counts per skill
- Badges for AI-enabled skills (disable-model-invocation: false) and context-forked skills (context: fork)
- Summary statistics: good/needs work/poor counts and average score across all skills
Integration Workflow¶
A typical skill improvement cycle:
- Run eval baseline: ./evals/run-eval.sh <skill> --baseline
- Run skill eval: ./evals/run-eval.sh <skill>
- Score and compare: python evals/score-eval.py <skill> --html
- Identify failing behaviors from the HTML report
- Update the skill's SKILL.md to address gaps
- Analyze description quality: python evals/optimize-descriptions.py analyze <skill>
- Apply description suggestions if score is below 80%
- Re-run eval to verify improvement: ./evals/run-eval.sh <skill>
- Compare with baseline: ./evals/run-eval.sh --compare <skill>
- Review improvement delta and iterate if needed