Confidence Calibration¶
A feedback loop that tracks false positive (FP) and true positive (TP) rates per vulnerability type and tech stack. Over time, the system learns which finding categories are reliable and which need extra verification.
How it works¶
- After an engagement, a pentester reviews each finding and marks it as a true positive or a false positive
- Each verdict is stored with the finding's vulnerability type (extracted from the title) and the target's tech stack (from `context.json`)
- The system maintains running FP rates per `(vuln_type, tech_stack)` pair
- On future engagements, the calibration data informs:
  - Whether a finding type historically has a high FP rate on this tech stack
  - A suggested confidence threshold for automated verification
Evidence strength levels¶
Every finding has an evidence strength tag:
| Level | Meaning | Example |
|---|---|---|
| L1 | Unverified scanner output | Nuclei template match |
| L2 | Heuristic match | Error-based detection |
| L3 | Confirmed with PoC | Working `alert(document.domain)` |
| L4 | Verified exploit chain | SSRF to internal API access |
| L5 | Full compromise demonstrated | RCE with `id` output |
Higher levels correlate with lower FP rates. The calibration data validates this and flags anomalies (e.g., an L3 finding type that still has a 40% FP rate).
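Anomaly flagging can be expressed as a ceiling FP rate per evidence level. The thresholds below are illustrative assumptions, not values documented here:

```python
# Assumed ceiling FP rates per evidence level (illustrative, not from the docs)
EXPECTED_MAX_FP = {"L1": 0.60, "L2": 0.40, "L3": 0.15, "L4": 0.05, "L5": 0.01}

def is_anomalous(level: str, observed_fp_rate: float) -> bool:
    """True when a category's observed FP rate exceeds what its
    evidence strength level should imply."""
    return observed_fp_rate > EXPECTED_MAX_FP[level]
```

With these ceilings, an L3 finding type sitting at a 40% FP rate would be flagged, since confirmed-PoC findings are expected to be far more reliable than that.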
API endpoints¶
Submit feedback¶
Response:
```json
{
  "status": "recorded",
  "vuln_type": "Reflected XSS",
  "tech_stack": "Spring Boot + PostgreSQL",
  "is_false_positive": true,
  "updated_fp_rate": 0.23,
  "total_data_points": 47
}
```
The `vuln_type` is auto-extracted from the finding title. The `tech_stack` is read from the engagement's `context.json`.
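The `updated_fp_rate` and `total_data_points` fields in the response reflect a running-rate recomputation after each verdict. A minimal sketch of that update (the function name and its signature are assumptions):

```python
def apply_verdict(fp_count: int, total: int, is_false_positive: bool) -> tuple[float, int]:
    """Fold one new pentester verdict into the running FP rate.
    Returns (updated_fp_rate, total_data_points)."""
    fp_count += 1 if is_false_positive else 0
    total += 1
    return fp_count / total, total
```

For example, a pair with 10 FPs out of 46 verdicts that receives one more false-positive verdict moves to 11/47, roughly the 0.23 shown in the response above.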
Get all calibration data¶
Returns all (vuln_type, tech_stack) pairs with their FP rates, TP counts, and average confidence scores.
Get calibration for a specific type¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `vuln_type` | path | required | URL-encoded vulnerability type (e.g. `Reflected%20XSS`) |
| `tech_stack` | query | `*` | Filter by tech stack. `*` returns the aggregate across all stacks |
Response includes a recommendation field with a suggested confidence threshold and verification priority.
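One way the `recommendation` field could be derived from the FP rate. The cutoffs and field values below are illustrative assumptions, not the documented behavior:

```python
def recommendation(fp_rate: float) -> dict:
    """Map a category's FP rate to a verification priority and a
    suggested confidence threshold (thresholds are assumed values)."""
    if fp_rate >= 0.30:
        return {"verification_priority": "high",
                "suggested_confidence_threshold": 0.9}
    if fp_rate >= 0.15:
        return {"verification_priority": "elevated",
                "suggested_confidence_threshold": 0.8}
    return {"verification_priority": "standard",
            "suggested_confidence_threshold": 0.7}
```

Under these cutoffs, Reflected XSS on a React SPA at a 42% FP rate would come back as high priority, while SQL Injection on Laravel at 5% stays standard.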
Practical example¶
After 50 engagements, the system might show:
| Vuln type | Tech stack | FP rate | Recommendation |
|---|---|---|---|
| Reflected XSS | WordPress | 8% | Standard verification |
| Reflected XSS | React SPA | 42% | High priority verification |
| SQL Injection | Laravel | 5% | Standard verification |
| SSRF | Spring Boot | 31% | High priority verification |
This tells you: XSS findings on React SPAs need careful manual review, while SQLi on Laravel is almost always real.
Connections to other features¶
- Learning Loop: the Learning Loop uses calibration data to weight technique recommendations -- techniques that produce high-FP findings are ranked lower
- Verification phase: during Phase 5 (`/verify`), findings from high-FP categories get extra verification effort