Dual-Engine Architecture: Claude + Codex
Overview
The platform now runs a dual-engine architecture with a clear split of responsibilities:
- Claude is the primary engine for live offensive work.
- Codex is the primary engine for bounded support lanes and the runtime advisory engine during analysis and exploitation.
The goal is not "two models doing the same thing." The goal is to reserve the most expensive Claude reasoning for the points where it adds the most value, and move the rest of the load to Codex without reducing detection quality.
Operating Split
| Area | Primary Engine | Secondary Engine | Notes |
|---|---|---|---|
| Live pentest execution | Claude | Codex advisory | Claude sends requests, interprets target behavior, and decides next actions |
| Live bug bounty execution | Claude | Codex advisory | Claude remains the executor and final arbiter |
| Bounded review lanes | Codex | Claude fallback | Static review, synthesis, clustering, ranking, post-processing |
| Borderline finding verification | Claude | Codex `finding_verifier` | Claude keeps final decision authority |
| Reporting, retro, learn, threat model | Codex | Claude fallback | Token-heavy synthesis is offloaded |
| Session memory and next-step generation | Codex | Heuristic fallback | Used primarily in perpetual bug bounty workflows |
Core Principle
The system follows a strict division:
- Claude executes high-ambiguity live exploitation
- Codex advises, critiques, clusters, ranks, compresses, and offloads bounded work
That boundary matters because live testing depends on dynamic target feedback, while bounded work depends on structured analysis and high throughput.
Why Two Engines
Single-model offensive workflows fail in two predictable ways:
- Reasoning convergence: after enough attempts, the same model tends to repeat the same attack pattern.
- Context saturation: long-running hunts accumulate too much low-value context and start wasting premium reasoning budget.
Codex breaks both failure modes:
- it brings a different reasoning profile
- it works well on compact bounded prompts
- it absorbs token-heavy synthesis, triage, and memory compaction
Model Policy
Claude Models
| Usage | Model | Reasoning |
|---|---|---|
| Hard pentest lane | `claude-opus-4-6` | high |
| Standard pentest lane | `claude-opus-4-6` | medium |
| Claude fallback lane | `claude-sonnet-4-6` | high |
Codex Models
| Usage | Model | Reasoning |
|---|---|---|
| Default support lane | `gpt-5.4` | high |
| Stuck-breaker / hard second opinion | `gpt-5.4` | xhigh |
| Rare arbiter lane | `gpt-5.4-pro` | xhigh |
Why This Split
- `claude-opus-4-6` is reserved for live offensive reasoning and final verification.
- `gpt-5.4 high` is the default Codex lane for bounded work and normal advisory.
- `gpt-5.4 xhigh` is reserved for hard stuck states, chain expansion, and difficult ambiguity.
- `gpt-5.4-pro xhigh` is intentionally rare and only used as an arbiter when the value of another premium pass is justified.
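The model policy can be expressed as a small lookup table. The following is a minimal sketch, not the project's actual routing code: the lane keys (`pentest_hard`, `codex_default`, and so on), the `ModelChoice` type, and the `choose_model` helper are hypothetical names introduced for illustration. Only the model identifiers and reasoning levels are taken from the policy tables.

```python
from typing import NamedTuple


class ModelChoice(NamedTuple):
    engine: str
    model: str
    reasoning: str


# Lane keys are hypothetical; model names and reasoning levels mirror the policy tables.
MODEL_POLICY = {
    "pentest_hard": ModelChoice("claude", "claude-opus-4-6", "high"),
    "pentest_standard": ModelChoice("claude", "claude-opus-4-6", "medium"),
    "claude_fallback": ModelChoice("claude", "claude-sonnet-4-6", "high"),
    "codex_default": ModelChoice("codex", "gpt-5.4", "high"),
    "codex_stuck_breaker": ModelChoice("codex", "gpt-5.4", "xhigh"),
    "codex_arbiter": ModelChoice("codex", "gpt-5.4-pro", "xhigh"),
}


def choose_model(lane: str) -> ModelChoice:
    """Return the model choice for a lane, raising ValueError on unknown lanes."""
    try:
        return MODEL_POLICY[lane]
    except KeyError:
        raise ValueError(f"unknown lane: {lane!r}") from None
```

Failing loudly on an unknown lane is a design choice in this sketch: a silent default could route bounded work to a premium model without anyone noticing.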
Pentest Flow
Claude Responsibilities
During a pentest, Claude remains the live operator:
- executes the actual tests against the target
- chooses the next move
- adapts payloads to live responses
- validates whether a finding is real
- makes the final severity and reportability call
Codex Runtime Advisory Checkpoints
| Checkpoint | Trigger | Codex Role | Typical Output |
|---|---|---|---|
| Post-route | Test plan or route summary ready | `hypothesis_engine` | Orthogonal hypotheses and next tests |
| Mid main-testing stagnation | Phase 4 stalls or signals do not improve | `critic` | Blind spots, missed assumptions, pivot suggestions |
| Pre-verify | High-value surfaces identified but not fully closed | `chain_planner` / `critic` | Chain candidates and verification priorities |
| Borderline finding | Evidence exists but verdict is not clean | `finding_verifier` | Promote, downgrade, or retest guidance |
| Hard stuck | Repeated attempts with no useful signal | `stuck_breaker` | Three distinct attack angles |
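The checkpoint table maps naturally onto a dispatch lookup. The sketch below assumes hypothetical checkpoint keys (`post_route`, `hard_stuck`, and so on) and a hypothetical `roles_for_checkpoint` helper; only the role names come from the table. It is meant to show the shape of the mapping, not the real dispatch protocol.

```python
from typing import Dict, List

# Checkpoint keys are hypothetical identifiers; the roles come from the checkpoint table.
CHECKPOINT_ROLES: Dict[str, List[str]] = {
    "post_route": ["hypothesis_engine"],
    "mid_testing_stagnation": ["critic"],
    "pre_verify": ["chain_planner", "critic"],
    "borderline_finding": ["finding_verifier"],
    "hard_stuck": ["stuck_breaker"],
}


def roles_for_checkpoint(checkpoint: str, codex_available: bool = True) -> List[str]:
    """Return the Codex roles to consult at a checkpoint.

    An empty list means Claude proceeds alone, either because the checkpoint
    is unknown or because Codex is unavailable.
    """
    if not codex_available:
        return []
    # Copy the list so callers cannot mutate the shared table.
    return list(CHECKPOINT_ROLES.get(checkpoint, []))
```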
Pentest Deconfliction Rules
| Scenario | Action |
|---|---|
| Claude confirms, Codex agrees | Finding stands at highest confidence |
| Claude confirms, Codex disputes | Claude re-checks with the dispute in mind; human review if still ambiguous |
| Claude is uncertain, Codex finds a better angle | Claude retries using the suggested path |
| Both dispute | Finding is dropped |
| Codex unavailable | Claude-only mode continues without architectural failure |
Claude is always the final decision maker for live pentest outcomes.
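The deconfliction rules amount to a small decision function. The following is an illustrative sketch under stated assumptions: the `Verdict` enum, the action strings, and the `codex_has_better_angle` flag are all hypothetical, and the final catch-all branch covers combinations the table does not list by deferring to Claude.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CONFIRMS = "confirms"
    UNCERTAIN = "uncertain"
    DISPUTES = "disputes"


def deconflict(
    claude: Verdict,
    codex: Optional[Verdict],
    codex_has_better_angle: bool = False,
) -> str:
    """Map a (Claude, Codex) verdict pair to the next action.

    `codex=None` models Codex being unavailable. Action names are hypothetical.
    """
    if codex is None:
        return "claude_only"
    if claude is Verdict.CONFIRMS and codex is Verdict.CONFIRMS:
        return "finding_stands"
    if claude is Verdict.CONFIRMS and codex is Verdict.DISPUTES:
        # Claude re-checks with the dispute in mind; escalate to a human if still ambiguous.
        return "claude_recheck"
    if claude is Verdict.UNCERTAIN and codex_has_better_angle:
        return "claude_retry_suggested_path"
    if claude is Verdict.DISPUTES and codex is Verdict.DISPUTES:
        return "drop_finding"
    # Assumption: any pair the table does not cover stays with Claude.
    return "claude_decides"
```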
Bug Bounty Flow
The perpetual bug bounty loop is intentionally more Codex-heavy than the pentest flow.
Claude Responsibilities
- live interaction with the target
- exploitation of the most promising surfaces
- final decision on whether a lead is a real bug
Codex Primary Lanes
| Lane | Primary Engine | Purpose |
|---|---|---|
| Program ranking support | Codex | Prioritize programs and candidate surfaces |
| Discovery digestion | Codex | Cluster surfaces, infer workflows, rank next tests |
| Runtime exploit support | Codex | Suggest payload ladders, bypasses, alternative angles |
| Candidate finding triage | Codex | Deduplicate and pre-score weak or partial signals |
| Session memory compaction | Codex | Persist compact state for the next session |
| Reporting and retrospectives | Codex | Generate token-heavy synthesis outputs |
Persistent Bug Bounty Artifacts
Each run can now persist compact artifacts instead of forcing Claude to re-read raw logs:
| Artifact | Purpose |
|---|---|
| `session-memory.json` | Tested surfaces, promising leads, dead ends, gaps |
| `discovery-digest.json` | Surface clustering and suspicious areas |
| `candidate-findings.json` | Weak signals and promoted candidates |
| `next-tests.json` | Prioritized next-step suggestions |
Per-session copies are stored under each program's memory/ directory, and the latest compact artifacts are also exposed at the program root for reuse by the next run.
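A next run could pick up the latest artifacts from the program root roughly as follows. This is a sketch under assumptions: the `load_latest_artifacts` helper is hypothetical, and because the artifact schemas are not described here, the code treats each file as opaque JSON. Only the four file names come from the table.

```python
import json
from pathlib import Path
from typing import Any, Dict

# File names from the artifact table; their internal schema is not assumed.
ARTIFACT_NAMES = (
    "session-memory.json",
    "discovery-digest.json",
    "candidate-findings.json",
    "next-tests.json",
)


def load_latest_artifacts(program_root: Path) -> Dict[str, Any]:
    """Load whichever compact artifacts exist at the program root.

    Missing or unparsable files are skipped, so a fresh program yields an empty dict.
    """
    loaded: Dict[str, Any] = {}
    for name in ARTIFACT_NAMES:
        path = program_root / name
        if not path.is_file():
            continue
        try:
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # Treat a corrupt artifact as absent rather than failing the run.
            continue
    return loaded
```

Skipping corrupt files keeps the loop consistent with the rest of the design: a bad artifact degrades to "no memory" instead of blocking the hunt.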
Bug Bounty Guardrails
Bug bounty is more sensitive to noise than pentest work, so Codex support is intentionally governed by aggressive fallback:
- invalid schema output falls back to Claude
- low-confidence output falls back to Claude
- repeated non-novel advice falls back to Claude
- high-impact ambiguous findings always return to Claude for the final verdict
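The four fallback conditions can be sketched as a single gate in front of every Codex result. Everything specific here is an assumption made for illustration: the required keys, the `0.6` confidence threshold, and the simplification that advice counts as "non-novel" as soon as it has been seen once before. The real role contracts may differ.

```python
from typing import Any, Mapping, Set

# Hypothetical values: the confidence threshold and required fields are assumptions.
MIN_CONFIDENCE = 0.6
REQUIRED_KEYS = ("role", "confidence", "advice")


def should_fall_back_to_claude(
    output: Any,
    seen_advice: Set[str],
    high_impact_ambiguous: bool = False,
) -> bool:
    """Return True when a Codex result should be discarded in favor of Claude."""
    if high_impact_ambiguous:
        return True  # final verdict always belongs to Claude
    if not isinstance(output, Mapping) or any(k not in output for k in REQUIRED_KEYS):
        return True  # invalid schema
    confidence = output["confidence"]
    advice = output["advice"]
    if not isinstance(confidence, (int, float)) or not isinstance(advice, str):
        return True  # wrong field types also count as invalid schema
    if confidence < MIN_CONFIDENCE:
        return True  # low-confidence output
    if advice in seen_advice:
        return True  # non-novel advice (simplified: seen at least once before)
    return False
```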
Token Strategy
The dual-engine design is mainly about token allocation discipline:
- Claude is used where live reasoning quality matters most.
- Codex is used where context compression, bounded analysis, and repeated synthesis dominate.
Practical Effects
| Category | Before | Now |
|---|---|---|
| Reporting and retrospectives | Claude-heavy | Codex primary |
| Bounded code review | Claude-heavy | Codex primary with Claude fallback |
| Runtime second opinions | Ad hoc | Standardized consults |
| Bug bounty session carry-over | Raw logs or human memory | Compact Codex-generated artifacts |
The existing P9-P15 offload alone is estimated to save roughly 110K-150K Claude tokens per engagement, before counting the new bug bounty memory and digest lanes.
Advisory Roles
The main Codex runtime roles are:
| Role | Purpose |
|---|---|
| `hypothesis_engine` | Generate orthogonal test hypotheses |
| `critic` | Challenge dominant assumptions and break tunnel vision |
| `chain_planner` | Combine partial primitives into exploitable chains |
| `finding_verifier` | Evaluate borderline findings before a final verdict |
| `stuck_breaker` | Generate fresh angles when exploitation stalls |
These roles do not replace Claude. They make Claude spend fewer tokens on repeated bounded reasoning.
Routing and Configuration
The architecture is enforced through:
| File | Purpose |
|---|---|
| `.claude/skills/pentest/helpers/agent-dispatch-config.json` | Lane registry and Claude/Codex routing policy |
| `scripts/model_routing_policy.py` | Exposes routing metadata to the runtime |
| `.claude/skills/pentest/helpers/codex-dispatch.md` | Dispatch protocol and advisory contracts |
| `.claude/skills/pentest/helpers/codex-role-contracts.md` | Structured role outputs |
| `scripts/ai_exec.py` | AI task chains, including bug bounty Codex lanes |
| `bugbounty/session_memory_compact.py` | Compact memory and digest generation |
Metrics
The bug bounty runtime now records lightweight Codex effectiveness metrics:
| File | Purpose |
|---|---|
| `bugbounty/.runtime/metrics/codex-advisory.jsonl` | Per-task advisory outcomes, confidence, and status |
| `bugbounty/.runtime/metrics/codex-loop.jsonl` | Per-hunt artifact usage and prompt injection tracking |
These metrics exist so the architecture can be tuned from real runs instead of intuition.
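Since both metric files use the `.jsonl` extension, recording an outcome presumably means appending one JSON object per line. The sketch below shows that pattern for an advisory outcome. The `record_advisory_outcome` helper and the field names (`ts`, `task`, `status`, `confidence`) are assumptions, as the actual record layout is not specified here.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict


def record_advisory_outcome(
    metrics_file: Path, task: str, status: str, confidence: float
) -> Dict[str, Any]:
    """Append one advisory outcome as a single JSON line and return the record.

    Field names are hypothetical; the real metrics schema may differ.
    """
    record = {"ts": time.time(), "task": task, "status": status, "confidence": confidence}
    metrics_file.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier records intact; one object per line is the JSONL convention.
    with metrics_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

A caller might pass `Path("bugbounty/.runtime/metrics/codex-advisory.jsonl")` as `metrics_file`. Append-only lines keep writes cheap and let later tuning scripts stream the file without loading it whole.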
Compatibility
The system still degrades safely:
| State | Behavior |
|---|---|
| Claude + Codex available | Full dual-engine workflow |
| Codex partially unavailable | Claude continues, bounded offload is skipped |
| Codex fully unavailable | Claude-only mode |
The important point is that Codex is now a structural part of the design rather than an optional add-on, yet Claude alone remains sufficient to keep the workflow operational if Codex is temporarily unavailable.
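The three compatibility states can be derived from per-lane health checks. The following is a minimal sketch, assuming a hypothetical `workflow_mode` helper that receives a mapping of Codex lane name to health flag. The mode strings are illustrative labels, not identifiers from the codebase.

```python
from typing import Mapping


def workflow_mode(codex_lane_health: Mapping[str, bool]) -> str:
    """Classify a run from Codex lane health (lane names are supplied by the caller)."""
    healthy = [ok for ok in codex_lane_health.values() if ok]
    if codex_lane_health and len(healthy) == len(codex_lane_health):
        return "dual_engine"
    if healthy:
        # Some lanes are down: Claude continues and offload is skipped for those lanes.
        return "claude_with_partial_codex"
    # No healthy lanes, or no lanes reported at all.
    return "claude_only"
```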