# Backup & Disaster Recovery Plan

Compliance: GDPR (Art. 32), HIPAA (45 CFR §164.308(a)(7)), ISO 27001 (A.12.3), NIST SP 800-53 (CP-9, CP-10)
## 1. Overview

### 1.1 Purpose

Ensure rapid recovery of critical systems and data in case of:

- Hardware failure
- Ransomware/data destruction
- Natural disaster
- Accidental data deletion
- Cybersecurity incident
### 1.2 Key Objectives
- RTO (Recovery Time Objective): ≤ 4 hours for production systems
- RPO (Recovery Point Objective): ≤ 1 hour (data loss tolerance)
- Annual Recovery Test: 100% success rate
- Documentation: Complete runbooks for every recovery scenario
## 2. Backup Strategy

### 2.1 Data Requiring Backup

#### Critical
- SQLite database (engagement data, user accounts, findings)
- Generated reports (DOCX, XLSX, PDF files)
- Engagement directories (context.json, findings markdown)
- Encryption keys (AWS KMS master keys)
- TLS certificates (for HTTPS)
#### Important
- Application code (git repository, but also database export)
- Configuration files (docker-compose.yml, .env)
- Audit logs
#### Optional
- Development/test data (can be regenerated)
### 2.2 Backup Frequency
| Data Type | Frequency | Retention |
|---|---|---|
| SQLite Database | Daily (incremental every 1 hour) | 30 days |
| Reports Directory | Daily | 30 days |
| Engagements | Daily | 1 year |
| Application Code | Every commit to main | Infinite (git history) |
| Audit Logs | Daily | 2 years |
| Encryption Keys | On change only | Indefinite (offsite) |
### 2.3 Backup Execution

#### Automated Backup (Daily 02:00 UTC)

Runs `dashboard/scripts/backup.sh`:
- Database: SQLite → gzip → timestamp
- Reports: tar.gz all DOCX/XLSX/PDF
- Engagements: tar.gz all findings + metadata
- Manifest: Include metadata for restore
- Upload: To configured target (S3, SFTP, or local)
- Cleanup: Delete backups > 30 days old
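The steps above can be sketched roughly as follows. This is an illustrative sketch, not the actual contents of `dashboard/scripts/backup.sh`; the `BACKUP_DIR`/`DB_FILE` paths and the file-naming scheme are assumptions.

```shell
#!/bin/sh
# Sketch of the daily backup flow: compress, write a manifest, prune old files.
set -eu

BACKUP_DIR="${BACKUP_DIR:-/tmp/backup-demo}"
DB_FILE="${DB_FILE:-$BACKUP_DIR/dashboard.db}"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"

mkdir -p "$BACKUP_DIR"
# Demo only: fabricate a database file if none exists
[ -f "$DB_FILE" ] || printf 'demo data\n' > "$DB_FILE"

# Database: compress with a timestamped name
gzip -c "$DB_FILE" > "$BACKUP_DIR/dashboard-$STAMP.db.gz"

# Manifest: record enough metadata to drive a restore
{
  printf 'timestamp: %s\n' "$STAMP"
  printf 'db: dashboard-%s.db.gz\n' "$STAMP"
  printf 'bytes: %s\n' "$(wc -c < "$BACKUP_DIR/dashboard-$STAMP.db.gz")"
} > "$BACKUP_DIR/manifest-$STAMP.txt"

# Cleanup: drop compressed backups older than 30 days
find "$BACKUP_DIR" -name 'dashboard-*.db.gz' -mtime +30 -delete

echo "backup complete: dashboard-$STAMP.db.gz"
```

Uploading the archive and manifest to the configured target (S3, SFTP, or local) would follow the manifest step in the real script.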
#### Incremental Backup (Hourly)
- Database journaling (SQLite WAL mode)
- Write-ahead logs prevent data loss < 1 hour
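A minimal sketch of enabling WAL mode with the `sqlite3` CLI; the database path is an illustrative assumption. The mode is persistent, so it survives application restarts:

```shell
# Enable SQLite WAL mode so incremental protection comes from the
# write-ahead log rather than a rollback journal.
DB=/tmp/wal-demo.db
rm -f "$DB" "$DB-wal" "$DB-shm"

# Switch the journal mode; WAL persists across connections
sqlite3 "$DB" 'PRAGMA journal_mode=WAL;'

# Confirm the mode stuck (prints "wal")
sqlite3 "$DB" 'PRAGMA journal_mode;'
```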
#### Point-in-Time Recovery
- Full daily backup + transaction logs
- Allows recovery to any specific time within 24 hours
### 2.4 Backup Storage

#### Local Storage (On-Premises)
- Location: Separate physical server/NAS
- Capacity: 2TB (sufficient for 1-month retention)
- Redundancy: RAID-5 (survives single disk failure)
- Network: Isolated from internet (no direct access)
#### Off-Site Storage (AWS S3 or SFTP)
- Redundancy: Cross-region replication (S3)
- Encryption: AES-256-GCM (KMS or server-side)
- Lifecycle: Move to Glacier after 30 days (cost optimization)
- Retention: 1 year for engagements, 30 days for daily backups
- Versioning: Keep last 5 versions of each backup
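The Glacier transition and retention rules above can be expressed as an S3 lifecycle configuration. A sketch, assuming a hypothetical bucket and an `engagements/` key prefix:

```shell
# Write a lifecycle rule: transition to Glacier after 30 days,
# expire engagement backups after 1 year.
cat > /tmp/lifecycle-demo.json <<'EOF'
{
  "Rules": [
    {
      "ID": "engagement-backups",
      "Filter": { "Prefix": "engagements/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF
# Would be applied with (requires AWS CLI and credentials):
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket example-backup-bucket \
#     --lifecycle-configuration file:///tmp/lifecycle-demo.json
echo "wrote /tmp/lifecycle-demo.json"
```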
#### Encryption Key Backup
- Master Keys: Backed up in AWS KMS or Hardware Security Module (HSM)
- Escrow: Copy stored offline in physical vault
- Rotation: Keys rotated annually; old keys retained for decryption
## 3. Recovery Procedures

### 3.1 Scenario A: Database Corruption (RTO: 1 hour)
Symptoms:
- Application error when accessing database
- Integrity check fails (PRAGMA integrity_check)
- Duplicate key violations
Steps:

1. Stop Application: docker-compose down
2. Identify Latest Good Backup: Check manifest for timestamp
3. Restore Database: Decompress the latest good backup over the corrupted file
4. Verify Integrity:

   sqlite3 data/db/dashboard.db "PRAGMA integrity_check;"

5. Restart Application: docker-compose up -d
6. Monitor: Check logs for 30 minutes
7. Notify: Customers informed of recovery; offer replay of lost transactions (< 1 hour data)
Verification Checklist:

- [ ] Database integrity check passed
- [ ] Users can log in
- [ ] Last 5 engagements visible in dashboard
- [ ] Audit logs show recovery timestamp
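The restore-and-verify portion of this scenario can be sketched as below. The backup directory layout and demo setup are illustrative assumptions:

```shell
# Sketch: pick the newest backup, restore it, verify integrity before restart.
set -eu
BACKUP_DIR=/tmp/restore-demo
DB="$BACKUP_DIR/dashboard.db"
mkdir -p "$BACKUP_DIR"

# Demo setup: fabricate a known-good backup of a real SQLite database
rm -f "$BACKUP_DIR"/dashboard-*.db.gz "$BACKUP_DIR/src.db" "$DB"
sqlite3 "$BACKUP_DIR/src.db" 'CREATE TABLE engagements(id INTEGER);'
gzip -c "$BACKUP_DIR/src.db" > "$BACKUP_DIR/dashboard-20260317T020000Z.db.gz"

# Latest good backup (UTC timestamps sort lexically)
LATEST=$(ls "$BACKUP_DIR"/dashboard-*.db.gz | sort | tail -n 1)

# Restore over the corrupted database
gunzip -c "$LATEST" > "$DB"

# Integrity check must print "ok" before restarting the application
sqlite3 "$DB" 'PRAGMA integrity_check;'
```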
### 3.2 Scenario B: Complete System Failure (RTO: 4 hours)

Symptoms:

- Hardware failure (disk, motherboard)
- Ransomware encryption
- Catastrophic corruption
Steps:
1. Provision New Server: Spin up replacement VM (15 min)
2. Install Prerequisites: Docker, docker-compose, Python (15 min)
3. Restore Application Code: git clone from GitHub (10 min)
4. Restore Database: From latest off-site backup (15 min)
5. Restore Reports: Untar reports directory (10 min)
6. Restore Engagements: Untar engagement directory (20 min)
7. Restore Configuration: Copy encryption keys from vault (5 min)
8. Start Services: docker-compose up -d (5 min)
9. Validation: Run test user login, access old engagement (10 min)
10. Notify Customers: Downtime notification + estimated recovery time
Total Time: ~2 hours (RTO < 4 hours achieved)
Verification Checklist:

- [ ] Server online, accessible from internet
- [ ] HTTPS working (certificates valid)
- [ ] Database contains latest engagement data
- [ ] Reports directory restored with all files
- [ ] Users can log in and access engagements
- [ ] Audit logs show recovery
- [ ] No data loss (or < 1 hour loss, documented)
### 3.3 Scenario C: Ransomware Encryption (RTO: 2 hours)

Detection:

- Sudden file encryption noticed
- Ransom note on server

Response (DO NOT PAY RANSOM):

1. Isolate System: Disconnect from network immediately
2. Preserve Evidence: Keep encrypted files for forensics
3. Check Backups: Are local/off-site backups encrypted too?
   - If Yes → Ransomware likely in backup pipeline; restore from oldest clean backup (1+ weeks old)
   - If No → Restore from latest clean backup
4. Restore from Backup: Follow Scenario B procedure
5. Check for Persistence: Scan rebuilt system for backdoors (Rootkit Hunter, Lynis)
6. Change Credentials: All passwords, API keys, TLS certs
7. Report to Law Enforcement: FBI/Europol/local police

Prevention:

- Backups stored offline (not accessible from application server)
- Immutable backup storage (WORM: Write Once, Read Many) if using object storage
- Automated anomaly detection (excessive write activity → alert)
### 3.4 Scenario D: Accidental Data Deletion (RTO: 30 min)

Symptoms:

- Critical database table accidentally dropped
- Report files mass-deleted

Steps:

1. Point-in-Time Recovery: Use transaction logs to recover to 1 hour ago

   # SQLite: Restore from backup + replay WAL
   cp data/db/dashboard.db.backup data/db/dashboard.db
   # WAL automatically replayed on next connection

Prevention:

- Backups immutable (no delete permission even for admins)
- Soft-deletes only (logical deletion, physical deletion after 30 days)
- Deletion confirmation (require 2 admins for destructive operations)
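The soft-delete prevention measure can be sketched as follows; the `reports` table and its schema are hypothetical, not the application's actual schema:

```shell
# Sketch: deletion only flags a row; a purge job physically removes rows
# flagged more than 30 days ago.
set -eu
DB=/tmp/softdelete-demo.db
rm -f "$DB"

sqlite3 "$DB" <<'SQL'
CREATE TABLE reports(id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
INSERT INTO reports(name) VALUES ('q1-report.docx'), ('q2-report.docx');

-- "Deleting" a report only sets the flag; the row stays recoverable
UPDATE reports SET deleted_at = datetime('now') WHERE name = 'q1-report.docx';

-- Purge job: physical deletion only after the 30-day grace period
DELETE FROM reports
 WHERE deleted_at IS NOT NULL
   AND deleted_at < datetime('now', '-30 days');
SQL

# Both rows survive: the soft-deleted one is still inside its grace period
sqlite3 "$DB" 'SELECT COUNT(*) FROM reports;'
```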
## 4. Testing & Validation

### 4.1 Backup Verification (Weekly)

Automated test, runs `backup-restore-test.sh`:
- Restore latest backup to test database
- Run integrity checks
- Verify record counts match production
- Delete test database
Success Criteria:

- [ ] Backup extraction succeeds
- [ ] Database integrity check passes
- [ ] Record counts within 1% of production
- [ ] No errors in extraction process
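The weekly verification flow can be sketched as below; the `findings` table, paths, and the direct copy standing in for a real restore are all assumptions:

```shell
# Sketch of backup-restore-test.sh: restore into a scratch database, run
# integrity and record-count checks, then discard the scratch copy.
set -eu
PROD=/tmp/drill-prod.db
SCRATCH=/tmp/drill-scratch.db
rm -f "$PROD" "$SCRATCH"

# Demo setup: stand-in for the production database and its backup
sqlite3 "$PROD" 'CREATE TABLE findings(id INTEGER); INSERT INTO findings VALUES (1),(2),(3);'
cp "$PROD" "$SCRATCH"   # stands in for "restore latest backup"

# Integrity check on the restored copy
test "$(sqlite3 "$SCRATCH" 'PRAGMA integrity_check;')" = ok

# Record counts must match production
PROD_N=$(sqlite3 "$PROD" 'SELECT COUNT(*) FROM findings;')
SCRATCH_N=$(sqlite3 "$SCRATCH" 'SELECT COUNT(*) FROM findings;')
test "$PROD_N" -eq "$SCRATCH_N"

echo "restore test PASSED"
rm -f "$SCRATCH"   # delete the test database
```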
### 4.2 Disaster Recovery Drill (Quarterly)

Process:

1. Schedule: Q1/Q2/Q3/Q4, off-peak hours
2. Scenario: Rotate through A/B/C/D each quarter
3. Participants: Full team (eng, ops, support)
4. Timeline: Track time to restore (measure against RTO)
5. Validation: Functional test of restored system
6. Debrief: Identify improvements, update procedures

Q1 2026 Drill: Complete system failure recovery (Scenario B)

- Target RTO: 4 hours
- Actual RTO: [To be measured]
- Issues Found: [TBD]
- Improvements: [TBD]
### 4.3 Annual Full Test

Every March:

- Restore complete system from backup
- Run full test suite
- Validate all engagements/reports accessible
- Document results + any issues
- Update runbooks based on findings

Success Criteria:

- ✓ 100% of tests pass
- ✓ RTO < 4 hours verified
- ✓ Zero data loss (RPO met)
- ✓ All staff trained on procedures
## 5. Monitoring & Alerting

### 5.1 Backup Health Alerts
| Alert | Threshold | Action |
|---|---|---|
| Backup Failed | Any failed backup | Immediate investigation; manual retry |
| Backup Late | Backup not completed by 03:00 UTC | Alert on-call; run manual backup |
| Backup Size Anomaly | >50% variance from average | Check for data explosion or corruption |
| Backup Storage Full | >90% disk capacity | Expand storage; clean old backups |
| Restore Test Failed | Any failed test | Investigate backup; restore to verify |
### 5.2 Monitoring Dashboard
- Last Backup Time: Should be < 24 hours ago
- Backup Size Trend: Should be relatively stable (±20%)
- Storage Utilization: Should be < 70%
- Restore Test Status: Latest test should show "PASSED"
### 5.3 Daily Health Check

Automated, runs 03:30 UTC daily:
- Verify latest backup file exists
- Check file size is reasonable
- Verify restore test passed
- Email summary to ops team
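The first two checks above can be sketched as follows; the paths, the 1 KB size floor, and the stubbed email step are assumptions:

```shell
# Sketch of the daily health check: latest backup exists, is fresh, and has
# a plausible size.
set -eu
BACKUP_DIR=/tmp/health-demo
mkdir -p "$BACKUP_DIR"
# Demo setup: pretend last night's backup landed
head -c 2048 /dev/zero > "$BACKUP_DIR/dashboard-latest.db.gz"

LATEST=$(ls -t "$BACKUP_DIR"/*.gz | head -n 1)

# Freshness: modified within the last 24 hours
find "$LATEST" -mtime -1 | grep -q . || { echo "ALERT: backup is stale"; exit 1; }

# Size sanity: flag suspiciously small archives
SIZE=$(wc -c < "$LATEST")
[ "$SIZE" -ge 1024 ] || { echo "ALERT: backup too small ($SIZE bytes)"; exit 1; }

echo "health check OK: $LATEST ($SIZE bytes)"
# Email summary to ops team would go here (delivery stubbed in this sketch)
```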
## 6. Retention & Deletion

### 6.1 Data Retention Schedule
| Backup Type | Retention | Deletion Method |
|---|---|---|
| Daily DB backup | 30 days | Automatic (cron job) |
| Daily reports backup | 30 days | Automatic |
| Engagement backup | 1 year + 30 days post-retention | Automatic |
| Encryption keys | Indefinite | Manual (vault only) |
| Audit logs | 2 years | Automated purge script |
### 6.2 Deletion Process
- Backups deleted via secure overwrite (NIST SP 800-88)
- Deleted backups logged in manifest (date, size, method)
- Encrypted files destroyed (keys first, then encrypted backups)
## 7. Disaster Recovery Plan Contacts

| Role | Name | Email | Phone |
|---|---|---|---|
| Primary Backup Admin | [NAME] | [EMAIL] | [PHONE] |
| Secondary Backup Admin | [NAME] | [EMAIL] | [PHONE] |
| CTO (Escalation) | [CTO NAME] | [EMAIL] | [PHONE] |
| Infrastructure | [LEAD NAME] | [EMAIL] | [PHONE] |
## 8. Documentation & Updates
- Version: 1.0
- Last Tested: [TBD — first drill scheduled Q1 2026]
- Next Review: 2027-03-17
- Change Log: Maintained in git (docs/operations/backup-recovery.md)
Updates Required When:

- RTO/RPO changes
- New backup system introduced
- Incident lessons learned
- Annual drill findings
- Regulatory changes
Document Version: 1.0 | Effective: 2026-03-17 | Compliance: GDPR, HIPAA, ISO 27001, NIST SP 800-53