Dual-Engine Architecture: Claude + Codex

Overview

The platform now runs a dual-engine architecture with a clear split of responsibilities:

Claude is the primary engine for live offensive work.
Codex is the primary engine for bounded support lanes and the runtime advisory engine during analysis and exploitation.

The goal is not "two models doing the same thing." The goal is to reserve the most expensive Claude reasoning for the points where it adds the most value, and move the rest of the load to Codex without reducing detection quality.

Operating Split

Area	Primary Engine	Secondary Engine	Notes
Live pentest execution	Claude	Codex advisory	Claude sends requests, interprets target behavior, and decides next actions
Live bug bounty execution	Claude	Codex advisory	Claude remains the executor and final arbiter
Bounded review lanes	Codex	Claude fallback	Static review, synthesis, clustering, ranking, post-processing
Borderline finding verification	Claude	Codex `finding_verifier`	Claude keeps final decision authority
Reporting, retro, learn, threat model	Codex	Claude fallback	Token-heavy synthesis is offloaded
Session memory and next-step generation	Codex	Heuristic fallback	Used primarily in perpetual bug bounty workflows
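The table above can be read as a small routing registry. The sketch below is illustrative only: the area keys, the `Route` type, and `pick_executor` are hypothetical names, not taken from `agent-dispatch-config.json`. It assumes that a secondary engine marked "fallback" may take over execution, while a secondary marked "advisory" never executes.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Route:
    primary: str
    secondary: str
    secondary_is_fallback: bool  # False means the secondary only advises


# Hypothetical area keys; the engine assignments mirror the table above.
OPERATING_SPLIT = {
    "live_pentest": Route("claude", "codex", False),
    "live_bug_bounty": Route("claude", "codex", False),
    "bounded_review": Route("codex", "claude", True),
    "borderline_verification": Route("claude", "codex", False),
    "reporting": Route("codex", "claude", True),
    "session_memory": Route("codex", "heuristic", True),
}


def pick_executor(area: str, available: Set[str]) -> Optional[str]:
    """Return the engine that should execute `area`, or None if none can."""
    route = OPERATING_SPLIT[area]
    if route.primary in available:
        return route.primary
    # Only a true fallback may take over; an advisory engine never executes.
    if route.secondary_is_fallback and route.secondary in available:
        return route.secondary
    return None
```

Under these assumptions, `pick_executor("bounded_review", {"claude"})` falls back to Claude, while `pick_executor("live_pentest", {"codex"})` returns `None` because Codex only advises on live work.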

Core Principle

The system follows a strict division:

Claude executes high-ambiguity live exploitation
Codex advises, critiques, clusters, ranks, compresses, and offloads bounded work

That boundary matters because live testing depends on dynamic target feedback, while bounded work depends on structured analysis and high throughput.

Why Two Engines

Single-model offensive workflows fail in two predictable ways:

Reasoning convergence: after enough attempts, the same model tends to repeat the same attack pattern.
Context saturation: long-running hunts accumulate too much low-value context and start wasting premium reasoning budget.

Codex breaks both failure modes:

it brings a different reasoning profile
it works well on compact bounded prompts
it absorbs token-heavy synthesis, triage, and memory compaction

Model Policy

Claude Models

Usage	Model	Reasoning
Hard pentest lane	`claude-opus-4-6`	`high`
Standard pentest lane	`claude-opus-4-6`	`medium`
Claude fallback lane	`claude-sonnet-4-6`	`high`

Codex Models

Usage	Model	Reasoning
Default support lane	`gpt-5.4`	`high`
Stuck-breaker / hard second opinion	`gpt-5.4`	`xhigh`
Rare arbiter lane	`gpt-5.4-pro`	`xhigh`

Why This Split

`claude-opus-4-6` is reserved for live offensive reasoning and final verification.
`gpt-5.4` at `high` is the default Codex lane for bounded work and normal advisory.
`gpt-5.4` at `xhigh` is reserved for hard stuck states, chain expansion, and difficult ambiguity.
`gpt-5.4-pro` at `xhigh` is intentionally rare and used only as an arbiter, when the expected value justifies another premium pass.
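As a rough illustration, the model policy can be expressed as a lookup table. The lane keys, the `LanePolicy` type, and `resolve_policy` are hypothetical. Only the model names and reasoning levels come from the tables above, and the real registry format in `agent-dispatch-config.json` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanePolicy:
    engine: str     # "claude" or "codex"
    model: str      # model identifier as listed in the tables
    reasoning: str  # reasoning level as listed in the tables


# Hypothetical lane keys; values copied from the Model Policy tables.
MODEL_POLICY = {
    "pentest_hard": LanePolicy("claude", "claude-opus-4-6", "high"),
    "pentest_standard": LanePolicy("claude", "claude-opus-4-6", "medium"),
    "claude_fallback": LanePolicy("claude", "claude-sonnet-4-6", "high"),
    "codex_default": LanePolicy("codex", "gpt-5.4", "high"),
    "codex_stuck_breaker": LanePolicy("codex", "gpt-5.4", "xhigh"),
    "codex_arbiter": LanePolicy("codex", "gpt-5.4-pro", "xhigh"),
}


def resolve_policy(lane: str) -> LanePolicy:
    """Return the policy for a lane, rejecting unknown lane names."""
    try:
        return MODEL_POLICY[lane]
    except KeyError:
        raise ValueError(f"unknown lane: {lane!r}") from None
```

Rejecting unknown lanes explicitly, rather than silently defaulting, keeps a typo from routing work to a premium model by accident.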

Pentest Flow

Claude Responsibilities

During a pentest, Claude remains the live operator:

executes the actual tests against the target
chooses the next move
adapts payloads to live responses
validates whether a finding is real
makes the final severity and reportability call

Codex Runtime Advisory Checkpoints

Checkpoint	Trigger	Codex Role	Typical Output
Post-route	Test plan or route summary ready	`hypothesis_engine`	Orthogonal hypotheses and next tests
Mid main-testing stagnation	Phase 4 stalls or signals do not improve	`critic`	Blind spots, missed assumptions, pivot suggestions
Pre-verify	High-value surfaces identified but not fully closed	`chain_planner` / `critic`	Chain candidates and verification priorities
Borderline finding	Evidence exists but verdict is not clean	`finding_verifier`	Promote, downgrade, or retest guidance
Hard stuck	Repeated attempts with no useful signal	`stuck_breaker`	Three distinct attack angles
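One way to picture the checkpoint table is as a mapping from checkpoint to the Codex roles consulted there. The checkpoint keys and the `roles_for_checkpoint` helper below are hypothetical. The role names are the ones listed in the table.

```python
from typing import Dict, List

# Hypothetical checkpoint keys; role names are taken from the table above.
CHECKPOINT_ROLES: Dict[str, List[str]] = {
    "post_route": ["hypothesis_engine"],
    "mid_testing_stagnation": ["critic"],
    "pre_verify": ["chain_planner", "critic"],
    "borderline_finding": ["finding_verifier"],
    "hard_stuck": ["stuck_breaker"],
}


def roles_for_checkpoint(checkpoint: str) -> List[str]:
    """Return the Codex roles to consult; an unknown checkpoint consults none."""
    # Copy the list so callers cannot mutate the shared registry.
    return list(CHECKPOINT_ROLES.get(checkpoint, []))
```

Returning an empty list for an unknown checkpoint is a design assumption here: it means a missing mapping leaves Claude working alone instead of raising mid-engagement.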

Pentest Deconfliction Rules

Scenario	Action
Claude confirms, Codex agrees	Finding stands at highest confidence
Claude confirms, Codex disputes	Claude re-checks with the dispute in mind; human review if still ambiguous
Claude is uncertain, Codex finds a better angle	Claude retries using the suggested path
Both dispute	Finding is dropped
Codex unavailable	Claude-only mode continues without architectural failure

Claude is always the final decision maker for live pentest outcomes.
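The deconfliction table can be sketched as a small decision function. The enum names and returned action strings are hypothetical. The final branch, which covers combinations the table does not list, is an assumption based on Claude being the final decision maker.

```python
from enum import Enum
from typing import Optional


class ClaudeVerdict(Enum):
    CONFIRMS = "confirms"
    UNCERTAIN = "uncertain"
    DISPUTES = "disputes"


class CodexVerdict(Enum):
    AGREES = "agrees"
    DISPUTES = "disputes"
    BETTER_ANGLE = "better_angle"


def deconflict(claude: ClaudeVerdict, codex: Optional[CodexVerdict]) -> str:
    """Map a pair of verdicts to an action; codex=None means Codex is unavailable."""
    if codex is None:
        return "claude_only"
    if claude is ClaudeVerdict.CONFIRMS and codex is CodexVerdict.AGREES:
        return "stand_highest_confidence"
    if claude is ClaudeVerdict.CONFIRMS and codex is CodexVerdict.DISPUTES:
        # Claude re-checks with the dispute in mind; a human reviews if still ambiguous.
        return "claude_recheck"
    if claude is ClaudeVerdict.UNCERTAIN and codex is CodexVerdict.BETTER_ANGLE:
        return "claude_retry_suggested_path"
    if claude is ClaudeVerdict.DISPUTES and codex is CodexVerdict.DISPUTES:
        return "drop"
    # Assumption: combinations not listed in the table defer to Claude.
    return "claude_decides"
```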

Bug Bounty Flow

The perpetual bug bounty loop is intentionally more Codex-heavy than the pentest flow.

Claude Responsibilities

live interaction with the target
exploitation of the most promising surfaces
final decision on whether a lead is a real bug

Codex Primary Lanes

Lane	Primary Engine	Purpose
Program ranking support	Codex	Prioritize programs and candidate surfaces
Discovery digestion	Codex	Cluster surfaces, infer workflows, rank next tests
Runtime exploit support	Codex	Suggest payload ladders, bypasses, alternative angles
Candidate finding triage	Codex	Deduplicate and pre-score weak or partial signals
Session memory compaction	Codex	Persist compact state for the next session
Reporting and retrospectives	Codex	Generate token-heavy synthesis outputs

Persistent Bug Bounty Artifacts

Each run can now persist compact artifacts instead of forcing Claude to re-read raw logs:

Artifact	Purpose
`session-memory.json`	Tested surfaces, promising leads, dead ends, gaps
`discovery-digest.json`	Surface clustering and suspicious areas
`candidate-findings.json`	Weak signals and promoted candidates
`next-tests.json`	Prioritized next-step suggestions

Per-session copies are stored under each program's `memory/` directory, and the latest compact artifacts are also exposed at the program root for reuse by the next run.
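A minimal sketch of that storage layout follows. The `memory/<session_id>/` subdirectory shape, the function names, and the payload contents are assumptions made for illustration. The actual logic lives in `bugbounty/session_memory_compact.py` and may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Artifact file names as listed in the table above.
ARTIFACTS = (
    "session-memory.json",
    "discovery-digest.json",
    "candidate-findings.json",
    "next-tests.json",
)


def persist_artifacts(program_dir: Path, session_id: str, payloads: Dict[str, Any]) -> None:
    """Write a per-session copy and refresh the latest copy at the program root."""
    # Assumed layout: <program>/memory/<session_id>/<artifact>
    session_dir = program_dir / "memory" / session_id
    session_dir.mkdir(parents=True, exist_ok=True)
    for name in ARTIFACTS:
        if name not in payloads:
            continue
        text = json.dumps(payloads[name], indent=2, sort_keys=True)
        (session_dir / name).write_text(text, encoding="utf-8")
        (program_dir / name).write_text(text, encoding="utf-8")


def load_latest(program_dir: Path) -> Dict[str, Any]:
    """Load whichever latest artifacts exist at the program root."""
    latest: Dict[str, Any] = {}
    for name in ARTIFACTS:
        path = program_dir / name
        if path.exists():
            latest[name] = json.loads(path.read_text(encoding="utf-8"))
    return latest
```

The next run would call something like `load_latest(program_dir)` and hand the compact result to Claude instead of replaying raw logs.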

Bug Bounty Guardrails

Bug bounty work is more sensitive to noise than pentest work, so Codex support is deliberately guarded by aggressive fallback rules:

invalid schema output falls back to Claude
low-confidence output falls back to Claude
repeated non-novel advice falls back to Claude
high-impact ambiguous findings always return to Claude for the final verdict
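These guardrails can be sketched as a single check that returns the reason for handing control back to Claude. The thresholds, field names, and reason strings are hypothetical. The source does not state what counts as "low confidence" or how many repeats count as "repeated".

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.6   # hypothetical threshold, not from the source
NON_NOVEL_LIMIT = 2      # hypothetical: consecutive non-novel outputs tolerated


@dataclass
class AdvisoryResult:
    schema_valid: bool
    confidence: float
    non_novel_streak: int = 0          # consecutive non-novel outputs, including this one
    high_impact_ambiguous: bool = False


def fallback_reason(result: AdvisoryResult) -> Optional[str]:
    """Return why this output should fall back to Claude, or None to accept it."""
    if not result.schema_valid:
        return "invalid_schema"
    if result.confidence < CONFIDENCE_FLOOR:
        return "low_confidence"
    if result.non_novel_streak >= NON_NOVEL_LIMIT:
        return "repeated_non_novel"
    if result.high_impact_ambiguous:
        return "high_impact_ambiguous"
    return None
```

The schema check runs first because the other fields cannot be trusted if the output failed validation.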

Token Strategy

The dual-engine design is mainly about token allocation discipline:

Claude is used where live reasoning quality matters most.
Codex is used where context compression, bounded analysis, and repeated synthesis dominate.

Practical Effects

Category	Before	Now
Reporting and retrospectives	Claude-heavy	Codex primary
Bounded code review	Claude-heavy	Codex primary with Claude fallback
Runtime second opinions	Ad hoc	Standardized consults
Bug bounty session carry-over	Raw logs or human memory	Compact Codex-generated artifacts

The existing P9-P15 offload alone is estimated to save roughly 110K-150K Claude tokens per engagement, before counting the new bug bounty memory and digest lanes.

Advisory Roles

The main Codex runtime roles are:

Role	Purpose
`hypothesis_engine`	Generate orthogonal test hypotheses
`critic`	Challenge dominant assumptions and break tunnel vision
`chain_planner`	Combine partial primitives into exploitable chains
`finding_verifier`	Evaluate borderline findings before a final verdict
`stuck_breaker`	Generate fresh angles when exploitation stalls

These roles do not replace Claude. They make Claude spend fewer tokens on repeated bounded reasoning.

Routing and Configuration

The architecture is enforced through:

File	Purpose
`.claude/skills/pentest/helpers/agent-dispatch-config.json`	Lane registry and Claude/Codex routing policy
`scripts/model_routing_policy.py`	Exposes routing metadata to the runtime
`.claude/skills/pentest/helpers/codex-dispatch.md`	Dispatch protocol and advisory contracts
`.claude/skills/pentest/helpers/codex-role-contracts.md`	Structured role outputs
`scripts/ai_exec.py`	AI task chains, including bug bounty Codex lanes
`bugbounty/session_memory_compact.py`	Compact memory and digest generation

Metrics

The bug bounty runtime now records lightweight Codex effectiveness metrics:

File	Purpose
`bugbounty/.runtime/metrics/codex-advisory.jsonl`	Per-task advisory outcomes, confidence, and status
`bugbounty/.runtime/metrics/codex-loop.jsonl`	Per-hunt artifact usage and prompt injection tracking

These metrics exist so the architecture can be tuned from real runs instead of intuition.
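For illustration, appending and summarizing advisory records in a JSON Lines file might look like the sketch below. The record fields (`ts`, `task`, `status`, `confidence`), the `"accepted"` status value, and both function names are assumptions. Only the file name `codex-advisory.jsonl` comes from the table above.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict


def record_advisory(metrics_dir: Path, task: str, status: str, confidence: float) -> Dict[str, Any]:
    """Append one advisory outcome as a single JSON line and return the record."""
    record = {"ts": time.time(), "task": task, "status": status, "confidence": confidence}
    metrics_dir.mkdir(parents=True, exist_ok=True)
    with (metrics_dir / "codex-advisory.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def acceptance_rate(metrics_dir: Path) -> float:
    """Fraction of recorded advisories whose status is 'accepted' (0.0 if none)."""
    path = metrics_dir / "codex-advisory.jsonl"
    if not path.exists():
        return 0.0
    lines = [ln for ln in path.read_text(encoding="utf-8").splitlines() if ln.strip()]
    if not lines:
        return 0.0
    accepted = sum(1 for ln in lines if json.loads(ln)["status"] == "accepted")
    return accepted / len(lines)
```

An append-only file with one record per line keeps writes cheap during a hunt and lets tuning scripts stream the data afterwards.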

Compatibility

The system still degrades safely:

State	Behavior
Claude + Codex available	Full dual-engine workflow
Codex partially unavailable	Claude continues, bounded offload is skipped
Codex fully unavailable	Claude-only mode

The important point is that Codex is now a structural part of the design rather than an optional add-on, yet Claude alone remains sufficient to keep the workflow operational if Codex is temporarily unavailable.
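The three compatibility states can be derived from per-lane Codex health, as in the sketch below. The mode strings, the lane names, and the idea of a per-lane health map are hypothetical. The source does not describe how availability is detected.

```python
from typing import Dict


def select_mode(codex_lane_health: Dict[str, bool]) -> str:
    """Pick a workflow mode from a map of Codex lane name to availability."""
    healthy = [ok for ok in codex_lane_health.values() if ok]
    if codex_lane_health and len(healthy) == len(codex_lane_health):
        return "dual_engine"            # full dual-engine workflow
    if healthy:
        return "claude_partial_codex"   # Claude continues; unavailable offload is skipped
    return "claude_only"                # no usable Codex lane at all
```

An empty health map is treated as Codex being fully unavailable, so a misconfigured registry degrades to Claude-only mode instead of failing.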