# Backup & Disaster Recovery Plan

Compliance: GDPR (Art. 32), HIPAA (45 CFR §164.308(a)(7)), ISO 27001 (A.12.3), NIST SP 800-53 (CP-9, CP-10)
## 1. Overview

### 1.1 Purpose

Ensure rapid recovery of critical systems and data in case of:

- Hardware failure
- Ransomware/data destruction
- Natural disaster
- Accidental data deletion
- Cybersecurity incident
### 1.2 Key Objectives
- RTO (Recovery Time Objective): ≤ 4 hours for production systems
- RPO (Recovery Point Objective): ≤ 1 hour (data loss tolerance)
- Annual Recovery Test: 100% success rate
- Documentation: Complete runbooks for every recovery scenario
## 2. Backup Strategy

### 2.1 Data Requiring Backup

#### Critical
- SQLite database (engagement data, user accounts, findings)
- Generated reports (DOCX, XLSX, PDF files)
- Engagement directories (context.json, findings markdown)
- Encryption keys (AWS KMS master keys)
- TLS certificates (for HTTPS)
#### Important
- Application code (git repository, but also database export)
- Configuration files (docker-compose.yml, .env)
- Audit logs
#### Optional
- Development/test data (can be regenerated)
### 2.2 Backup Frequency
| Data Type | Frequency | Retention |
|---|---|---|
| SQLite Database | Daily (incremental every 1 hour) | 30 days |
| Reports Directory | Daily | 30 days |
| Engagements | Daily | 1 year |
| Application Code | Every commit to main | Infinite (git history) |
| Audit Logs | Daily | 2 years |
| Encryption Keys | On change only | Indefinite (offsite) |
### 2.3 Backup Execution

#### Automated Backup (Daily 02:00 UTC)

Runs `dashboard/scripts/backup.sh`:
- Database: SQLite → gzip → timestamp
- Reports: tar.gz all DOCX/XLSX/PDF
- Engagements: tar.gz all findings + metadata
- Manifest: Include metadata for restore
- Upload: To configured target (S3, SFTP, or local)
- Cleanup: Delete backups > 30 days old
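The steps above can be sketched roughly as follows. This is an illustrative sketch, not the actual contents of `dashboard/scripts/backup.sh`; the `BACKUP_DIR`/`DB_FILE` paths and the file-naming scheme are assumptions.

```shell
#!/bin/sh
# Sketch of the daily backup flow: compress, write a manifest, prune old files.
set -eu

BACKUP_DIR="${BACKUP_DIR:-/tmp/backup-demo}"
DB_FILE="${DB_FILE:-$BACKUP_DIR/dashboard.db}"
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"

mkdir -p "$BACKUP_DIR"
# Demo only: fabricate a database file if none exists
[ -f "$DB_FILE" ] || printf 'demo data\n' > "$DB_FILE"

# Database: compress with a timestamped name
gzip -c "$DB_FILE" > "$BACKUP_DIR/dashboard-$STAMP.db.gz"

# Manifest: record enough metadata to drive a restore
{
  printf 'timestamp: %s\n' "$STAMP"
  printf 'db: dashboard-%s.db.gz\n' "$STAMP"
  printf 'bytes: %s\n' "$(wc -c < "$BACKUP_DIR/dashboard-$STAMP.db.gz")"
} > "$BACKUP_DIR/manifest-$STAMP.txt"

# Cleanup: drop compressed backups older than 30 days
find "$BACKUP_DIR" -name 'dashboard-*.db.gz' -mtime +30 -delete

echo "backup complete: dashboard-$STAMP.db.gz"
```

Uploading the archive and manifest to the configured target (S3, SFTP, or local) would follow the manifest step in the real script.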
#### Incremental Backup (Hourly)
- Database journaling (SQLite WAL mode)
- Write-ahead logs prevent data loss < 1 hour
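A minimal sketch of enabling WAL mode with the `sqlite3` CLI; the database path is an illustrative assumption. The mode is persistent, so it survives application restarts:

```shell
# Enable SQLite WAL mode so incremental protection comes from the
# write-ahead log rather than a rollback journal.
DB=/tmp/wal-demo.db
rm -f "$DB" "$DB-wal" "$DB-shm"

# Switch the journal mode; WAL persists across connections
sqlite3 "$DB" 'PRAGMA journal_mode=WAL;'

# Confirm the mode stuck (prints "wal")
sqlite3 "$DB" 'PRAGMA journal_mode;'
```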
#### Point-in-Time Recovery
- Full daily backup + transaction logs
- Allows recovery to any specific time within 24 hours
### 2.4 Backup Storage

#### Local Storage (On-Premises)
- Location: Separate physical server/NAS
- Capacity: 2TB (sufficient for 1-month retention)
- Redundancy: RAID-5 (survives single disk failure)
- Network: Isolated from internet (no direct access)
#### Off-Site Storage (AWS S3 or SFTP)
- Redundancy: Cross-region replication (S3)
- Encryption: AES-256-GCM (KMS or server-side)
- Lifecycle: Move to Glacier after 30 days (cost optimization)
- Retention: 1 year for engagements, 30 days for daily backups
- Versioning: Keep last 5 versions of each backup
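The Glacier transition and retention rules above can be expressed as an S3 lifecycle configuration. A sketch, assuming a hypothetical bucket and an `engagements/` key prefix:

```shell
# Write a lifecycle rule: transition to Glacier after 30 days,
# expire engagement backups after 1 year.
cat > /tmp/lifecycle-demo.json <<'EOF'
{
  "Rules": [
    {
      "ID": "engagement-backups",
      "Filter": { "Prefix": "engagements/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 30, "StorageClass": "GLACIER" } ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF
# Would be applied with (requires AWS CLI and credentials):
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket example-backup-bucket \
#     --lifecycle-configuration file:///tmp/lifecycle-demo.json
echo "wrote /tmp/lifecycle-demo.json"
```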
#### Encryption Key Backup
- Master Keys: Backed up in AWS KMS or Hardware Security Module (HSM)
- Escrow: Copy stored offline in physical vault
- Rotation: Keys rotated annually; old keys retained for decryption
## 3. Recovery Procedures

### 3.1 Scenario A: Database Corruption (RTO: 1 hour)
Symptoms:
- Application error when accessing database
- Integrity check fails (PRAGMA integrity_check)
- Duplicate key violations
Steps:

1. Stop Application: docker-compose down
2. Identify Latest Good Backup: Check manifest for timestamp
3. Restore Database: Decompress the latest good backup over the corrupted file
4. Verify Integrity:

   sqlite3 data/db/dashboard.db "PRAGMA integrity_check;"

5. Restart Application: docker-compose up -d
6. Monitor: Check logs for 30 minutes
7. Notify: Customers informed of recovery; offer replay of lost transactions (< 1 hour data)
Verification Checklist:

- [ ] Database integrity check passed
- [ ] Users can log in
- [ ] Last 5 engagements visible in dashboard
- [ ] Audit logs show recovery timestamp
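The restore-and-verify portion of this scenario can be sketched as below. The backup directory layout and demo setup are illustrative assumptions:

```shell
# Sketch: pick the newest backup, restore it, verify integrity before restart.
set -eu
BACKUP_DIR=/tmp/restore-demo
DB="$BACKUP_DIR/dashboard.db"
mkdir -p "$BACKUP_DIR"

# Demo setup: fabricate a known-good backup of a real SQLite database
rm -f "$BACKUP_DIR"/dashboard-*.db.gz "$BACKUP_DIR/src.db" "$DB"
sqlite3 "$BACKUP_DIR/src.db" 'CREATE TABLE engagements(id INTEGER);'
gzip -c "$BACKUP_DIR/src.db" > "$BACKUP_DIR/dashboard-20260317T020000Z.db.gz"

# Latest good backup (UTC timestamps sort lexically)
LATEST=$(ls "$BACKUP_DIR"/dashboard-*.db.gz | sort | tail -n 1)

# Restore over the corrupted database
gunzip -c "$LATEST" > "$DB"

# Integrity check must print "ok" before restarting the application
sqlite3 "$DB" 'PRAGMA integrity_check;'
```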
### 3.2 Scenario B: Complete System Failure (RTO: 4 hours)

Symptoms:

- Hardware failure (disk, motherboard)
- Ransomware encryption
- Catastrophic corruption
Steps:
1. Provision New Server: Spin up replacement VM (15 min)
2. Install Prerequisites: Docker, docker-compose, Python (15 min)
3. Restore Application Code: git clone from GitHub (10 min)
4. Restore Database: From latest off-site backup (15 min)
5. Restore Reports: Untar reports directory (10 min)
6. Restore Engagements: Untar engagement directory (20 min)
7. Restore Configuration: Copy encryption keys from vault (5 min)
8. Start Services: docker-compose up -d (5 min)
9. Validation: Run test user login, access old engagement (10 min)
10. Notify Customers: Downtime notification + estimated recovery time
Total Time: ~2 hours (RTO < 4 hours achieved)
Verification Checklist:

- [ ] Server online, accessible from internet
- [ ] HTTPS working (certificates valid)
- [ ] Database contains latest engagement data
- [ ] Reports directory restored with all files
- [ ] Users can log in and access engagements
- [ ] Audit logs show recovery
- [ ] No data loss (or < 1 hour loss, documented)
### 3.3 Scenario C: Ransomware Encryption (RTO: 2 hours)

Detection:

- Sudden file encryption noticed
- Ransom note on server

Response (DO NOT PAY RANSOM):

1. Isolate System: Disconnect from network immediately
2. Preserve Evidence: Keep encrypted files for forensics
3. Check Backups: Are local/off-site backups encrypted too?
   - If Yes → Ransomware likely in backup pipeline; restore from oldest clean backup (1+ weeks old)
   - If No → Restore from latest clean backup
4. Restore from Backup: Follow Scenario B procedure
5. Check for Persistence: Scan rebuilt system for backdoors (Rootkit Hunter, Lynis)
6. Change Credentials: All passwords, API keys, TLS certs
7. Report to Law Enforcement: FBI/Europol/local police

Prevention:

- Backups stored offline (not accessible from application server)
- Immutable backup storage (WORM: Write Once, Read Many) if using object storage
- Automated anomaly detection (excessive write activity → alert)
### 3.4 Scenario D: Accidental Data Deletion (RTO: 30 min)

Symptoms:

- Critical database table accidentally dropped
- Report files mass-deleted

Steps:

1. Point-in-Time Recovery: Use transaction logs to recover to 1 hour ago

   # SQLite: Restore from backup + replay WAL
   cp data/db/dashboard.db.backup data/db/dashboard.db
   # WAL automatically replayed on next connection

Prevention:

- Backups immutable (no delete permission even for admins)
- Soft-deletes only (logical deletion, physical deletion after 30 days)
- Deletion confirmation (require 2 admins for destructive operations)
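The soft-delete prevention measure can be sketched as follows; the `reports` table and its schema are hypothetical, not the application's actual schema:

```shell
# Sketch: deletion only flags a row; a purge job physically removes rows
# flagged more than 30 days ago.
set -eu
DB=/tmp/softdelete-demo.db
rm -f "$DB"

sqlite3 "$DB" <<'SQL'
CREATE TABLE reports(id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT);
INSERT INTO reports(name) VALUES ('q1-report.docx'), ('q2-report.docx');

-- "Deleting" a report only sets the flag; the row stays recoverable
UPDATE reports SET deleted_at = datetime('now') WHERE name = 'q1-report.docx';

-- Purge job: physical deletion only after the 30-day grace period
DELETE FROM reports
 WHERE deleted_at IS NOT NULL
   AND deleted_at < datetime('now', '-30 days');
SQL

# Both rows survive: the soft-deleted one is still inside its grace period
sqlite3 "$DB" 'SELECT COUNT(*) FROM reports;'
```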
## 4. Testing & Validation

### 4.1 Backup Verification (Weekly)

Automated test, runs `backup-restore-test.sh`:
- Restore latest backup to test database
- Run integrity checks
- Verify record counts match production
- Delete test database
Success Criteria:

- [ ] Backup extraction succeeds
- [ ] Database integrity check passes
- [ ] Record counts within 1% of production
- [ ] No errors in extraction process
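The weekly verification flow can be sketched as below; the `findings` table, paths, and the direct copy standing in for a real restore are all assumptions:

```shell
# Sketch of backup-restore-test.sh: restore into a scratch database, run
# integrity and record-count checks, then discard the scratch copy.
set -eu
PROD=/tmp/drill-prod.db
SCRATCH=/tmp/drill-scratch.db
rm -f "$PROD" "$SCRATCH"

# Demo setup: stand-in for the production database and its backup
sqlite3 "$PROD" 'CREATE TABLE findings(id INTEGER); INSERT INTO findings VALUES (1),(2),(3);'
cp "$PROD" "$SCRATCH"   # stands in for "restore latest backup"

# Integrity check on the restored copy
test "$(sqlite3 "$SCRATCH" 'PRAGMA integrity_check;')" = ok

# Record counts must match production
PROD_N=$(sqlite3 "$PROD" 'SELECT COUNT(*) FROM findings;')
SCRATCH_N=$(sqlite3 "$SCRATCH" 'SELECT COUNT(*) FROM findings;')
test "$PROD_N" -eq "$SCRATCH_N"

echo "restore test PASSED"
rm -f "$SCRATCH"   # delete the test database
```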
### 4.2 Disaster Recovery Drill (Quarterly)

Process:

1. Schedule: Q1/Q2/Q3/Q4, off-peak hours
2. Scenario: Rotate through A/B/C/D each quarter
3. Participants: Full team (eng, ops, support)
4. Timeline: Track time to restore (measure against RTO)
5. Validation: Functional test of restored system
6. Debrief: Identify improvements, update procedures

Q1 2026 Drill: Complete system failure recovery (Scenario B)

- Target RTO: 4 hours
- Actual RTO: [To be measured]
- Issues Found: [TBD]
- Improvements: [TBD]
### 4.3 Annual Full Test

Every March:

- Restore complete system from backup
- Run full test suite
- Validate all engagements/reports accessible
- Document results + any issues
- Update runbooks based on findings

Success Criteria:

- ✓ 100% of tests pass
- ✓ RTO < 4 hours verified
- ✓ Zero data loss (RPO met)
- ✓ All staff trained on procedures
## 5. Monitoring & Alerting

### 5.1 Backup Health Alerts
| Alert | Threshold | Action |
|---|---|---|
| Backup Failed | Any failed backup | Immediate investigation; manual retry |
| Backup Late | Backup not completed by 03:00 UTC | Alert on-call; run manual backup |
| Backup Size Anomaly | >50% variance from average | Check for data explosion or corruption |
| Backup Storage Full | >90% disk capacity | Expand storage; clean old backups |
| Restore Test Failed | Any failed test | Investigate backup; restore to verify |
### 5.2 Monitoring Dashboard
- Last Backup Time: Should be < 24 hours ago
- Backup Size Trend: Should be relatively stable (±20%)
- Storage Utilization: Should be < 70%
- Restore Test Status: Latest test should show "PASSED"
### 5.3 Daily Health Check

Automated, runs 03:30 UTC daily:
- Verify latest backup file exists
- Check file size is reasonable
- Verify restore test passed
- Email summary to ops team
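The first two checks above can be sketched as follows; the paths, the 1 KB size floor, and the stubbed email step are assumptions:

```shell
# Sketch of the daily health check: latest backup exists, is fresh, and has
# a plausible size.
set -eu
BACKUP_DIR=/tmp/health-demo
mkdir -p "$BACKUP_DIR"
# Demo setup: pretend last night's backup landed
head -c 2048 /dev/zero > "$BACKUP_DIR/dashboard-latest.db.gz"

LATEST=$(ls -t "$BACKUP_DIR"/*.gz | head -n 1)

# Freshness: modified within the last 24 hours
find "$LATEST" -mtime -1 | grep -q . || { echo "ALERT: backup is stale"; exit 1; }

# Size sanity: flag suspiciously small archives
SIZE=$(wc -c < "$LATEST")
[ "$SIZE" -ge 1024 ] || { echo "ALERT: backup too small ($SIZE bytes)"; exit 1; }

echo "health check OK: $LATEST ($SIZE bytes)"
# Email summary to ops team would go here (delivery stubbed in this sketch)
```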
## 6. Retention & Deletion

### 6.1 Data Retention Schedule
| Backup Type | Retention | Deletion Method |
|---|---|---|
| Daily DB backup | 30 days | Automatic (cron job) |
| Daily reports backup | 30 days | Automatic |
| Engagement backup | 1 year + 30 days post-retention | Automatic |
| Encryption keys | Indefinite | Manual (vault only) |
| Audit logs | 2 years | Automated purge script |
### 6.2 Deletion Process
- Backups deleted via secure overwrite (NIST SP 800-88)
- Deleted backups logged in manifest (date, size, method)
- Encrypted files destroyed (keys first, then encrypted backups)
## 7. Disaster Recovery Plan Contacts

| Role | Name | Email | Phone |
|---|---|---|---|
| Primary Backup Admin | [NAME] | [EMAIL] | [PHONE] |
| Secondary Backup Admin | [NAME] | [EMAIL] | [PHONE] |
| CTO (Escalation) | [CTO NAME] | [EMAIL] | [PHONE] |
| Infrastructure | [LEAD NAME] | [EMAIL] | [PHONE] |
## 8. Documentation & Updates
- Version: 1.0
- Last Tested: [TBD — first drill scheduled Q1 2026]
- Next Review: 2027-03-17
- Change Log: Maintained in git (docs/operations/backup-recovery.md)
Updates Required When:

- RTO/RPO changes
- New backup system introduced
- Incident lessons learned
- Annual drill findings
- Regulatory changes
Document Version: 1.0 | Effective: 2026-03-17 | Compliance: GDPR, HIPAA, ISO 27001, NIST SP 800-53