Phase 0.5: Walkthrough (Authenticated Discovery)¶
Overview¶
Phase 0.5 uses a headless browser (Playwright with Chromium) to navigate the application as each user role. This automated crawling discovers pages, endpoints, functionality, and API calls that static tools would miss. It's particularly effective for Single Page Applications (SPAs) that render content dynamically.
Purpose: Build a comprehensive map of every user-accessible page and endpoint, with full authentication context for each role.
Why Walkthrough Matters¶
Many reconnaissance tools analyze static HTML and miss critical paths because:
- Dynamic JavaScript Rendering: SPAs render content after loading via JavaScript—static crawlers see only an empty <div id="app"></div>
- Hidden Navigation: Menus that appear only after login, modals, collapsible sections
- Lazy Loading: Content loaded on scroll or button click
- API Calls: Navigation endpoints may not appear in HTML links
- Role-Based Content: Admin dashboards, user profiles, moderator tools only visible to specific roles
- WebSocket APIs: Real-time features missed by HTTP-only crawlers
Execution Flow¶
graph TB
A["Start Phase 0.5<br/>Walkthrough"] --> B["Load credentials.json<br/>All user roles"]
B --> C{Multiple Roles?}
C -->|Yes| D["Parallel Crawl<br/>Max 3 users/batch"]
C -->|No| E["Sequential Crawl<br/>Single role"]
D --> F["Authenticate<br/>as each user"]
E --> F
F --> G["Navigate App<br/>BFS crawling"]
G --> H["Record URLs<br/>& API Endpoints"]
H --> I["Detect Errors<br/>403, 500, timeouts"]
I --> J["Merge Results<br/>app-map.json"]
J --> K["Continue to<br/>Phase 1: Recon"]
style A fill:#4a148c,color:#fff
style K fill:#4a148c,color:#fff
style J fill:#ab47bc,color:#fff
How It Works¶
1. Credential Setup¶
The walkthrough requires a credentials.json file at the project root:
{
"login": {
"url": "https://app.example.com/login",
"method": "form",
"wait_for": ".dashboard"
},
"users": {
"admin": {
"email": "admin@example.com",
"password": "SecurePass123!",
"role": "Administrator"
},
"user": {
"email": "user@example.com",
"password": "UserPass456!",
"role": "Regular User"
},
"moderator": {
"email": "mod@example.com",
"password": "ModPass789!",
"role": "Moderator"
}
}
}
CRITICAL: The login section MUST be present. Without it, the crawler runs unauthenticated and misses all protected pages.
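A sanity check on credentials.json before launching the crawl avoids wasting a full run on a malformed file. The sketch below is illustrative: the key names match the example file above, but the function name and return convention are assumptions, not the actual crawler.py API.

```python
import json

# Keys the example credentials.json above requires (assumed minimal schema)
REQUIRED_LOGIN_KEYS = {"url", "method", "wait_for"}
REQUIRED_USER_KEYS = {"email", "password", "role"}

def validate_credentials(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    login = config.get("login")
    if not isinstance(login, dict):
        # Without a login section the crawl runs unauthenticated
        problems.append("missing 'login' section - crawl would run unauthenticated")
    else:
        for key in sorted(REQUIRED_LOGIN_KEYS - login.keys()):
            problems.append(f"login section missing '{key}'")
    users = config.get("users")
    if not users:
        problems.append("no users defined")
    else:
        for name, user in users.items():
            for key in sorted(REQUIRED_USER_KEYS - user.keys()):
                problems.append(f"user '{name}' missing '{key}'")
    return problems

if __name__ == "__main__":
    with open("credentials.json") as fh:
        for issue in validate_credentials(json.load(fh)):
            print("WARN:", issue)
```

Running this before the Docker command below surfaces schema mistakes (a missing wait_for selector, a user without a password) in seconds rather than after a failed crawl.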
2. Parallel Crawling Strategy¶
For efficiency with multiple users:

- 1-3 users: Sequential crawling in a single process
- 4-6 users: Batch 1 (users 1-3) → Batch 2 (users 4-6), each batch parallel
- 7+ users: Multiple batches of at most 3 users each
This prevents timeouts and internal errors from too many concurrent Playwright instances.
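The batching rule above reduces to a simple split. This helper is a minimal sketch, not part of crawler.py; the function name is hypothetical.

```python
def make_batches(users: list[str], batch_size: int = 3) -> list[list[str]]:
    """Split the role list into batches of at most `batch_size` users,
    so no more than `batch_size` Playwright instances run concurrently."""
    return [users[i:i + batch_size] for i in range(0, len(users), batch_size)]

# Example: 5 roles -> two batches, crawled one batch at a time
# make_batches(["admin", "user", "moderator", "auditor", "guest"])
```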
Command example:
docker run --rm -v $(pwd):/work pentest-tools \
/opt/pentest-venv/bin/python3 browser/crawler.py \
--role admin --output app-map-admin.json
docker run --rm -v $(pwd):/work pentest-tools \
/opt/pentest-venv/bin/python3 browser/crawler.py \
--merge --admin app-map-admin.json --user app-map-user.json \
--output app-map.json
3. BFS Crawling Algorithm¶
The Playwright crawler uses breadth-first search to systematically discover pages:
- Start at the app root (e.g., https://app.example.com/dashboard)
- Extract all links: <a href="...">, <button onclick="...">, form actions
- Visit each link and record the response code (200, 302, 403, 500, etc.)
- Extract any API calls made (via DevTools network monitoring)
- Queue new URLs for visiting
- Continue until no new URLs remain or a timeout/depth limit is reached
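The steps above can be sketched as a BFS loop. To keep the sketch self-contained, page loading is injected as a `fetch(url) -> (status, links)` callable; in the real crawler that callable is a Playwright page load plus link extraction. The function and its shape are illustrative assumptions, not crawler.py's actual code.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def bfs_crawl(root: str, fetch, max_depth: int = 5) -> dict[str, int]:
    """Breadth-first crawl starting at `root`.
    `fetch(url)` must return (status_code, [links found on the page]).
    Returns {url: status}. Stays on the root's host and honors a depth limit."""
    host = urlparse(root).netloc
    seen = {root: None}
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        status, links = fetch(url)
        seen[url] = status
        if depth >= max_depth:
            continue
        for link in links:
            absolute = urljoin(url, link)  # resolve relative hrefs
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen[absolute] = None      # mark queued before visiting
                queue.append((absolute, depth + 1))
    return seen
```

BFS (rather than DFS) is the natural choice here: pages close to the dashboard are discovered first, so even a crawl cut short by a timeout covers the most prominent parts of the app.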
4. API Discovery During Crawl¶
Modern apps make API calls as you browse. The crawler:
- Monitors Network tab for all HTTP/HTTPS requests
- Flags requests carrying the X-Requested-With: XMLHttpRequest header (a common API-call indicator)
- Captures Content-Type: application/json endpoints
- Records request/response structure for later testing
- Identifies GraphQL, REST, and SOAP endpoints
Example output:
{
"api_endpoints": [
{
"method": "GET",
"path": "/api/v1/users/me",
"status": 200,
"content_type": "application/json",
"auth": "Bearer JWT"
},
{
"method": "POST",
"path": "/api/v1/tickets",
"status": 201,
"request_body_sample": {"title": "...", "description": "..."}
}
]
}
Output Files¶
| File | Content | Purpose |
|---|---|---|
| app-map.json | Complete crawl results | Master reference of all discovered URLs, status codes, roles |
| crawled-urls.txt | Plain text list | One URL per line for use by other tools |
| api-endpoints.txt | API-specific URLs | Only /api/* endpoints for Phase 2 parameter discovery |
| walkthrough-report.md | Human-readable summary | Markdown report with statistics, errors, warnings |
Example app-map.json¶
{
"target": "https://app.example.com",
"roles_tested": ["admin", "user", "moderator"],
"crawl_stats": {
"total_urls": 47,
"successful_200": 42,
"forbidden_403": 3,
"errors_500": 2,
"crawl_duration_seconds": 145
},
"urls_by_role": {
"admin": {
"/dashboard": {
"status": 200,
"title": "Admin Dashboard",
"methods": ["GET"],
"forms": [{"action": "/api/v1/users", "method": "POST"}]
},
"/settings": {"status": 200, "title": "Settings"},
"/admin/users": {"status": 200, "title": "User Management"},
"/reports": {"status": 403, "reason": "Forbidden in this role"}
},
"user": {
"/dashboard": {"status": 200, "title": "User Dashboard"},
"/profile": {"status": 200, "title": "My Profile"},
"/admin/users": {"status": 403, "reason": "Admin only"}
}
},
"api_endpoints_discovered": [
"/api/v1/auth/login",
"/api/v1/auth/logout",
"/api/v1/users",
"/api/v1/users/{id}",
"/api/v1/tickets"
]
}
Error Handling¶
The crawler diagnoses why pages fail and documents issues:
| Status | Meaning | Action |
|---|---|---|
| 200 | Successful page load | Continue crawling links on page |
| 302/301 | Redirect | Follow the redirect chain |
| 403 | Forbidden | Log as inaccessible to current role; expected for access control testing |
| 404 | Not found | Continue (bad link) |
| 500 | Server error | Log and skip; may indicate application crash |
| Timeout | Page takes >30s to load | Log and skip; may indicate performance issues |
Critical: If a role's login FAILS, the entire crawl for that role is invalid. The login_success flag is checked before proceeding.
Walkthrough Completeness Rules¶
❌ NOT COMPLETE if:
- Any user/role failed to authenticate (check login_success: false)
- Fewer URLs crawled than expected (e.g., admin crawled 5 URLs when there should be 20+)
- Warnings about incomplete navigation
- Errors in walkthrough-report.md
✅ COMPLETE if:
- All users in credentials.json successfully logged in
- roles_tested count matches number of users in credentials.json
- At least one URL discovered per role
- No critical errors in the crawl
If incomplete: Go back and fix credentials or target issues, then re-run crawl before proceeding to Phase 1.
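The completeness rules above can be automated as a gate before Phase 1. The sketch assumes app-map.json carries a per-role `login_results` section with a `login_success` flag; that field name is an assumption based on the flag mentioned earlier, not a confirmed schema.

```python
def walkthrough_complete(app_map: dict, credentials: dict) -> list[str]:
    """Check the completeness rules; an empty list means COMPLETE."""
    failures = []
    expected_roles = set(credentials.get("users", {}))
    tested = set(app_map.get("roles_tested", []))
    # roles_tested must match the users defined in credentials.json
    if tested != expected_roles:
        failures.append(
            f"roles_tested {sorted(tested)} != users {sorted(expected_roles)}")
    # At least one URL discovered per tested role
    for role in sorted(tested):
        if not app_map.get("urls_by_role", {}).get(role):
            failures.append(f"role '{role}' discovered no URLs")
    # Every role must have authenticated successfully
    for role, info in app_map.get("login_results", {}).items():
        if not info.get("login_success", False):
            failures.append(f"role '{role}' failed to authenticate")
    return failures
```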
Integration with Phases 1-2¶
The walkthrough output is used by:
- Phase 1 (Recon): URLs from crawled-urls.txt used for Wayback Machine searches
- Phase 2 (Discovery): API endpoints piped into ffuf and arjun for parameter discovery
- Phase 3 (Scan): All discovered URLs scanned with nuclei templates
- Phase 4 (Testing): URLs become the endpoints tested by manual skills
Special Cases¶
Single Page Applications (SPAs)¶
SPAs with heavy JavaScript rendering are where the walkthrough excels: it loads each page and executes its JavaScript, so dynamically created links are discovered.
Multi-Step Forms¶
Some apps have wizards or multi-step forms. The crawler:

- Fills form fields (if heuristics can guess field types)
- Submits forms
- Records resulting URLs
WebSocket APIs¶
Real-time apps using WebSockets are partially captured. The crawler logs WebSocket connections but doesn't fully participate in WebSocket conversations.
CAPTCHA / Rate Limiting¶
If the app requires CAPTCHA or rate-limits login attempts:
- Configure credentials.json with a pre-solved CAPTCHA token, or
- Provide cookies/tokens manually instead of a username/password, or
- Adjust wait_for selectors to skip CAPTCHA verification
Next Phase¶
After Phase 0.5 completes successfully with all roles crawled, continue to Phase 1: Recon to perform passive and active reconnaissance on the discovered attack surface.