Confidence Calibration¶
A feedback loop that tracks false positive (FP) and true positive (TP) rates per vulnerability type and tech stack. Over time, the system learns which finding categories are reliable and which need extra verification.
How it works¶
- After an engagement, a pentester reviews each finding and marks it as a true positive or a false positive
- Each verdict is stored with the finding's vulnerability type (extracted from the title) and the target's tech stack (from `context.json`)
- The system maintains running FP rates per `(vuln_type, tech_stack)` pair
- On future engagements, the calibration data informs:
  - Whether a finding type historically has a high FP rate on this tech stack
  - A suggested confidence threshold for automated verification
Evidence strength levels¶
Every finding has an evidence strength tag:
| Level | Meaning | Example |
|---|---|---|
| L1 | Unverified scanner output | Nuclei template match |
| L2 | Heuristic match | Error-based detection |
| L3 | Confirmed with PoC | Working `alert(document.domain)` |
| L4 | Verified exploit chain | SSRF to internal API access |
| L5 | Full compromise demonstrated | RCE with `id` output |
Higher levels correlate with lower FP rates. The calibration data validates this and flags anomalies (e.g., an L3 finding type that still has a 40% FP rate).
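Anomaly flagging can be expressed as a ceiling FP rate per evidence level. The thresholds below are illustrative assumptions, not values documented here:

```python
# Assumed ceiling FP rates per evidence level (illustrative, not from the docs)
EXPECTED_MAX_FP = {"L1": 0.60, "L2": 0.40, "L3": 0.15, "L4": 0.05, "L5": 0.01}

def is_anomalous(level: str, observed_fp_rate: float) -> bool:
    """True when a category's observed FP rate exceeds what its
    evidence strength level should imply."""
    return observed_fp_rate > EXPECTED_MAX_FP[level]
```

With these ceilings, an L3 finding type sitting at a 40% FP rate would be flagged, since confirmed-PoC findings are expected to be far more reliable than that.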
API endpoints¶
Submit feedback¶
Response:
```json
{
  "status": "recorded",
  "vuln_type": "Reflected XSS",
  "tech_stack": "Spring Boot + PostgreSQL",
  "is_false_positive": true,
  "updated_fp_rate": 0.23,
  "total_data_points": 47
}
```
The `vuln_type` is auto-extracted from the finding title. The `tech_stack` is read from the engagement's `context.json`.
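The `updated_fp_rate` and `total_data_points` fields in the response reflect a running-rate recomputation after each verdict. A minimal sketch of that update (the function name and its signature are assumptions):

```python
def apply_verdict(fp_count: int, total: int, is_false_positive: bool) -> tuple[float, int]:
    """Fold one new pentester verdict into the running FP rate.
    Returns (updated_fp_rate, total_data_points)."""
    fp_count += 1 if is_false_positive else 0
    total += 1
    return fp_count / total, total
```

For example, a pair with 10 FPs out of 46 verdicts that receives one more false-positive verdict moves to 11/47, roughly the 0.23 shown in the response above.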
Get all calibration data¶
Returns all (vuln_type, tech_stack) pairs with their FP rates, TP counts, and average confidence scores.
Get calibration for a specific type¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vuln_type` | path | required | URL-encoded vulnerability type (e.g. `Reflected%20XSS`) |
| `tech_stack` | query | `*` | Filter by tech stack. `*` returns the aggregate across all stacks |
Response includes a recommendation field with a suggested confidence threshold and verification priority.
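One way the `recommendation` field could be derived from the FP rate. The cutoffs and field values below are illustrative assumptions, not the documented behavior:

```python
def recommendation(fp_rate: float) -> dict:
    """Map a category's FP rate to a verification priority and a
    suggested confidence threshold (thresholds are assumed values)."""
    if fp_rate >= 0.30:
        return {"verification_priority": "high",
                "suggested_confidence_threshold": 0.9}
    if fp_rate >= 0.15:
        return {"verification_priority": "elevated",
                "suggested_confidence_threshold": 0.8}
    return {"verification_priority": "standard",
            "suggested_confidence_threshold": 0.7}
```

Under these cutoffs, Reflected XSS on a React SPA at a 42% FP rate would come back as high priority, while SQL Injection on Laravel at 5% stays standard.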
Practical example¶
After 50 engagements, the system might show:
| Vuln type | Tech stack | FP rate | Recommendation |
|---|---|---|---|
| Reflected XSS | WordPress | 8% | Standard verification |
| Reflected XSS | React SPA | 42% | High priority verification |
| SQL Injection | Laravel | 5% | Standard verification |
| SSRF | Spring Boot | 31% | High priority verification |
This tells you: XSS findings on React SPAs need careful manual review, while SQLi on Laravel is almost always real.
Connections to other features¶
- Learning Loop: the Learning Loop uses calibration data to weight technique recommendations -- techniques that produce high-FP findings are ranked lower
- Verification phase: during Phase 5 (`/verify`), findings from high-FP categories get extra verification effort