Confidence Calibration

A feedback loop that tracks false positive (FP) and true positive (TP) rates per vulnerability type and tech stack. Over time, the system learns which finding categories are reliable and which need extra verification.


How it works

  1. After an engagement, a pentester reviews each finding and marks it as true positive or false positive
  2. Each verdict is stored with the finding's vulnerability type (extracted from the title) and the target's tech stack (from context.json)
  3. The system maintains running FP rates per (vuln_type, tech_stack) pair
  4. On future engagements, the calibration data informs:
    • Whether a finding type historically has a high FP rate on this tech stack
    • A suggested confidence threshold for automated verification
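The bookkeeping behind steps 2 and 3 can be sketched as a small tracker keyed on (vuln_type, tech_stack). This is an illustrative sketch, not the actual implementation; the class and method names are invented for the example:

```python
from collections import defaultdict

class CalibrationTracker:
    """Maintains running FP rates per (vuln_type, tech_stack) pair."""

    def __init__(self):
        # key -> [fp_count, total_count]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, vuln_type, tech_stack, is_false_positive):
        """Store one pentester verdict for a finding."""
        counts = self._counts[(vuln_type, tech_stack)]
        counts[0] += int(is_false_positive)
        counts[1] += 1

    def fp_rate(self, vuln_type, tech_stack):
        """Running FP rate for the pair, or None if no data yet."""
        fp, total = self._counts[(vuln_type, tech_stack)]
        return fp / total if total else None

tracker = CalibrationTracker()
tracker.record("Reflected XSS", "React SPA", True)
tracker.record("Reflected XSS", "React SPA", False)
print(tracker.fp_rate("Reflected XSS", "React SPA"))  # 0.5
```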

Evidence strength levels

Every finding has an evidence strength tag:

Level  Meaning                       Example
L1     Unverified scanner output     Nuclei template match
L2     Heuristic match               Error-based detection
L3     Confirmed with PoC            Working alert(document.domain)
L4     Verified exploit chain        SSRF to internal API access
L5     Full compromise demonstrated  RCE with id output

Higher levels correlate with lower FP rates. The calibration data validates this and flags anomalies (e.g., an L3 finding type that still has a 40% FP rate).
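The anomaly flagging can be sketched as a filter over per-type stats. The tuple layout and the 15% cutoff below are illustrative assumptions, not values from the system:

```python
def flag_anomalies(stats, max_expected_fp=0.15):
    """Return entries whose FP rate is anomalously high for their
    evidence level. L3+ findings are expected to have low FP rates,
    so anything above max_expected_fp at those levels is flagged."""
    return [
        (vuln_type, level, fp_rate)
        for (vuln_type, level, fp_rate) in stats
        if level >= 3 and fp_rate > max_expected_fp
    ]

stats = [
    ("Reflected XSS", 3, 0.40),   # L3 but 40% FP -- the anomaly
    ("SQL Injection", 3, 0.05),   # L3 with low FP, as expected
    ("Nuclei match",  1, 0.55),   # high FP is normal at L1
]
print(flag_anomalies(stats))  # [('Reflected XSS', 3, 0.4)]
```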


API endpoints

Submit feedback

POST /api/v1/confidence/engagements/{engagement_name}/findings/{finding_id}/feedback
{
  "is_false_positive": true,
  "fp_reason": "WAF injection in error page, not actual XSS"
}

Response:

{
  "status": "recorded",
  "vuln_type": "Reflected XSS",
  "tech_stack": "Spring Boot + PostgreSQL",
  "is_false_positive": true,
  "updated_fp_rate": 0.23,
  "total_data_points": 47
}

The vuln_type is auto-extracted from the finding title. The tech_stack is read from the engagement's context.json.
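A client-side sketch of submitting feedback, using only the standard library. The base URL, engagement name, and finding ID are placeholders for the example:

```python
import json
import urllib.request

def build_feedback_request(base_url, engagement, finding_id,
                           is_fp, fp_reason=None):
    """Build the feedback POST; send it with urllib.request.urlopen()."""
    url = (f"{base_url}/api/v1/confidence/engagements/"
           f"{engagement}/findings/{finding_id}/feedback")
    payload = {"is_false_positive": is_fp}
    if fp_reason:
        payload["fp_reason"] = fp_reason
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_feedback_request(
    "http://localhost:8000", "acme-q3", "F-012", True,
    "WAF injection in error page, not actual XSS")
print(req.full_url)
```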

Get all calibration data

GET /api/v1/confidence/calibration

Returns all (vuln_type, tech_stack) pairs with their FP rates, TP counts, and average confidence scores.

Get calibration for a specific type

GET /api/v1/confidence/calibration/{vuln_type}?tech_stack=Spring+Boot
Parameter   Type   Default   Description
vuln_type   path   required  URL-encoded vulnerability type (e.g. Reflected%20XSS)
tech_stack  query  *         Filter by tech stack; * returns the aggregate across all stacks

Response includes a recommendation field with a suggested confidence threshold and verification priority.
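Building the query URL with correct encoding can be sketched with the standard library (the base URL is a placeholder). Note that path segments percent-encode spaces while query values use +:

```python
from urllib.parse import quote, urlencode

def calibration_url(base_url, vuln_type, tech_stack="*"):
    """Build the calibration lookup URL; vuln_type must be URL-encoded."""
    path = f"/api/v1/confidence/calibration/{quote(vuln_type)}"
    query = urlencode({"tech_stack": tech_stack})
    return f"{base_url}{path}?{query}"

print(calibration_url("http://localhost:8000", "Reflected XSS", "Spring Boot"))
# http://localhost:8000/api/v1/confidence/calibration/Reflected%20XSS?tech_stack=Spring+Boot
```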


Practical example

After 50 engagements, the system might show:

Vuln type      Tech stack   FP rate  Recommendation
Reflected XSS  WordPress    8%       Standard verification
Reflected XSS  React SPA    42%      High priority verification
SQL Injection  Laravel      5%       Standard verification
SSRF           Spring Boot  31%      High priority verification

This tells you: XSS findings on React SPAs need careful manual review, while SQLi on Laravel is almost always real.
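The mapping from FP rate to recommendation can be sketched as a simple threshold. The 25% cutoff here is illustrative; the real threshold comes from the recommendation field in the calibration API:

```python
def verification_priority(fp_rate, high_threshold=0.25):
    """Map a historical FP rate to a verification recommendation.
    Rates at or above the threshold warrant extra manual review."""
    if fp_rate >= high_threshold:
        return "High priority verification"
    return "Standard verification"

for vuln, stack, rate in [("Reflected XSS", "React SPA", 0.42),
                          ("SQL Injection", "Laravel", 0.05)]:
    print(f"{vuln} on {stack}: {verification_priority(rate)}")
```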


Connections to other features

  • Learning Loop: the Learning Loop uses calibration data to weight technique recommendations; techniques that produce high-FP findings are ranked lower
  • Verification phase: during Phase 5 (/verify), findings from high-FP categories get extra verification effort