
Backup & Disaster Recovery Plan

Compliance: GDPR (Art. 32), HIPAA (45 CFR §164.308(a)(7)), ISO 27001 (A.12.3), NIST-800-53 (CP-9, CP-10)


1. Overview

1.1 Purpose

Ensure rapid recovery of critical systems and data in case of:

  • Hardware failure
  • Ransomware/data destruction
  • Natural disaster
  • Accidental data deletion
  • Cybersecurity incident

1.2 Key Objectives

  • RTO (Recovery Time Objective): ≤ 4 hours for production systems
  • RPO (Recovery Point Objective): ≤ 1 hour (data loss tolerance)
  • Annual Recovery Test: 100% success rate
  • Documentation: Complete runbooks for every recovery scenario

2. Backup Strategy

2.1 Data Requiring Backup

Critical

  • SQLite database (engagement data, user accounts, findings)
  • Generated reports (DOCX, XLSX, PDF files)
  • Engagement directories (context.json, findings markdown)
  • Encryption keys (AWS KMS master keys)
  • TLS certificates (for HTTPS)

Important

  • Application code (git repository, but also database export)
  • Configuration files (docker-compose.yml, .env)
  • Audit logs

Optional

  • Development/test data (can be regenerated)

2.2 Backup Frequency

| Data Type | Frequency | Retention |
|---|---|---|
| SQLite Database | Daily (incremental every 1 hour) | 30 days |
| Reports Directory | Daily | 30 days |
| Engagements | Daily | 1 year |
| Application Code | Every commit to main | Infinite (git history) |
| Audit Logs | Daily | 2 years |
| Encryption Keys | On change only | Indefinite (offsite) |

2.3 Backup Execution

Automated Backup (Daily 02:00 UTC)

# Runs: dashboard/scripts/backup.sh
- Database: SQLite → gzip → timestamped file
- Reports: tar.gz all DOCX/XLSX/PDF
- Engagements: tar.gz all findings + metadata
- Manifest: Include metadata for restore
- Upload: To configured target (S3, SFTP, or local)
- Cleanup: Delete backups > 30 days old
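The database leg of the daily job can be sketched in Python (already a prerequisite of the stack). This is an illustrative sketch, not the actual contents of dashboard/scripts/backup.sh; the `backup_database` function and the `backup_*`/manifest file naming are assumptions.

```python
import gzip
import json
import shutil
import sqlite3
import time
from pathlib import Path

def backup_database(db_path, backup_dir):
    """Daily backup step: consistent SQLite snapshot -> gzip -> timestamped file + manifest."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    ts = time.strftime("%Y%m%d_%H%M%S", time.gmtime())
    snapshot = backup_dir / f"backup_{ts}.db"

    # The SQLite online backup API copies a consistent image even while the app is writing.
    src, dst = sqlite3.connect(db_path), sqlite3.connect(str(snapshot))
    src.backup(dst)
    dst.close()
    src.close()

    # Compress the snapshot, then drop the uncompressed copy.
    gz_path = Path(str(snapshot) + ".gz")
    with open(snapshot, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    snapshot.unlink()

    # Manifest entry used at restore time to identify the latest good backup.
    manifest = {"file": gz_path.name, "created_utc": ts,
                "size_bytes": gz_path.stat().st_size}
    (backup_dir / f"backup_{ts}.manifest.json").write_text(json.dumps(manifest))
    return gz_path
```

Upload to S3/SFTP and cleanup of expired archives would follow the same pattern, one step per function.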

Incremental Backup (Hourly)

  • Database journaling (SQLite WAL mode)
  • Write-ahead logs prevent data loss < 1 hour
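Enabling WAL is a one-time pragma; a minimal sketch (the `enable_wal` helper name is illustrative):

```python
import sqlite3

def enable_wal(db_path):
    """Switch the database to write-ahead logging so hourly incrementals only ship the WAL."""
    conn = sqlite3.connect(db_path)
    # journal_mode is persistent: once set, every later connection reuses WAL.
    (mode,) = conn.execute("PRAGMA journal_mode=WAL;").fetchone()
    conn.execute("PRAGMA synchronous=NORMAL;")  # common pairing with WAL
    conn.close()
    return mode
```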

Point-in-Time Recovery

  • Full daily backup + transaction logs
  • Allows recovery to any specific time within 24 hours

2.4 Backup Storage

Local Storage (On-Premises)

  • Location: Separate physical server/NAS
  • Capacity: 2TB (sufficient for 1-month retention)
  • Redundancy: RAID-5 (survives single disk failure)
  • Network: Isolated from internet (no direct access)

Off-Site Storage (AWS S3 or SFTP)

  • Redundancy: Cross-region replication (S3)
  • Encryption: AES-256-GCM (KMS or server-side)
  • Lifecycle: Move to Glacier after 30 days (cost optimization)
  • Retention: 1 year for engagements, 30 days for daily backups
  • Versioning: Keep last 5 versions of each backup

Encryption Key Backup

  • Master Keys: Backed up in AWS KMS or Hardware Security Module (HSM)
  • Escrow: Copy stored offline in physical vault
  • Rotation: Keys rotated annually; old keys retained for decryption

3. Recovery Procedures

3.1 Scenario A: Database Corruption (RTO: 1 hour)

Symptoms:

  • Application error when accessing database
  • Integrity check fails (PRAGMA integrity_check)
  • Duplicate key violations

Steps:

  1. Stop Application: docker-compose down
  2. Identify Latest Good Backup: Check manifest for timestamp
  3. Restore Database:

     cd backups/
     gunzip -c bedefended_backup_20260317_020000_dashboard.db.gz > data/db/dashboard.db

  4. Verify Integrity: sqlite3 data/db/dashboard.db "PRAGMA integrity_check;"
  5. Restart Application: docker-compose up -d
  6. Monitor: Check logs for 30 minutes
  7. Notify: Customers informed of recovery; offer replay of lost transactions (< 1 hour of data)

Verification Checklist:

  - [ ] Database integrity check passed
  - [ ] Users can log in
  - [ ] Last 5 engagements visible in dashboard
  - [ ] Audit logs show recovery timestamp
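Steps 3-4 can be collapsed into one helper, useful for scripting the scenario and for the weekly restore test. This is a sketch; `restore_and_verify` is not an existing script in the repository.

```python
import gzip
import shutil
import sqlite3

def restore_and_verify(backup_gz, db_path):
    """Scenario A, steps 3-4: decompress the backup, then run PRAGMA integrity_check."""
    with gzip.open(backup_gz, "rb") as f_in, open(db_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    conn = sqlite3.connect(db_path)
    (result,) = conn.execute("PRAGMA integrity_check;").fetchone()
    conn.close()
    return result == "ok"  # "ok" means SQLite found no corruption
```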

3.2 Scenario B: Complete System Failure (RTO: 4 hours)

Symptoms:

  • Hardware failure (disk, motherboard)
  • Ransomware encryption
  • Catastrophic corruption

Steps:

  1. Provision New Server: Spin up replacement VM (15 min)
  2. Install Prerequisites: Docker, docker-compose, Python (15 min)
  3. Restore Application Code: git clone from GitHub (10 min)
  4. Restore Database: From latest off-site backup (15 min)
  5. Restore Reports: Untar reports directory (10 min)
  6. Restore Engagements: Untar engagement directory (20 min)
  7. Restore Configuration: Copy encryption keys from vault (5 min)
  8. Start Services: docker-compose up -d (5 min)
  9. Validation: Run test user login, access old engagement (10 min)
  10. Notify Customers: Downtime notification + estimated recovery time

Total Time: ~2 hours (RTO < 4 hours achieved)

Verification Checklist:

  - [ ] Server online, accessible from internet
  - [ ] HTTPS working (certificates valid)
  - [ ] Database contains latest engagement data
  - [ ] Reports directory restored with all files
  - [ ] Users can log in and access engagements
  - [ ] Audit logs show recovery
  - [ ] No data loss (or < 1 hour loss, documented)

3.3 Scenario C: Ransomware Encryption (RTO: 2 hours)

Detection:

  • Sudden file encryption noticed
  • Ransom note on server

Response (DO NOT PAY RANSOM):

  1. Isolate System: Disconnect from network immediately
  2. Preserve Evidence: Keep encrypted files for forensics
  3. Check Backups: Are local/off-site backups encrypted too?
     - If yes → ransomware likely in the backup pipeline; restore from the oldest clean backup (1+ weeks old)
     - If no → restore from the latest clean backup
  4. Restore from Backup: Follow the Scenario B procedure
  5. Check for Persistence: Scan the rebuilt system for backdoors (Rootkit Hunter, Lynis)
  6. Change Credentials: All passwords, API keys, TLS certificates
  7. Report to Law Enforcement: FBI/Europol/local police

Prevention:

  • Backups stored offline (not accessible from application server)
  • Immutable backup storage (WORM: Write Once, Read Many) if using object storage
  • Automated anomaly detection (excessive write activity → alert)

3.4 Scenario D: Accidental Data Deletion (RTO: 30 min)

Symptoms:

  • Critical database table accidentally dropped
  • Report files mass-deleted

Steps:

  1. Point-in-Time Recovery: Use transaction logs to recover to 1 hour ago

     # SQLite: Restore from backup + replay WAL
     cp data/db/dashboard.db.backup data/db/dashboard.db
     # WAL automatically replayed on next connection

  2. Verify: Check that the deleted records are present again
  3. Restart: Bring the application back online
  4. Root Cause: Investigate how the deletion happened (audit logs)

Prevention:

  • Backups immutable (no delete permission, even for admins)
  • Soft-deletes only (logical deletion; physical deletion after 30 days)
  • Deletion confirmation (require 2 admins for destructive operations)
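The soft-delete rule can be sketched as two small operations. The table and column names (`findings`, `deleted_at`) are illustrative, not the dashboard's actual schema.

```python
import sqlite3
import time

def soft_delete(conn, finding_id):
    """Logical deletion: stamp the row instead of removing it."""
    conn.execute("UPDATE findings SET deleted_at = ? WHERE id = ?",
                 (time.time(), finding_id))
    conn.commit()

def purge_soft_deleted(conn, retention_days=30):
    """Physical deletion only after the 30-day retention window has passed."""
    cutoff = time.time() - retention_days * 86400
    cur = conn.execute(
        "DELETE FROM findings WHERE deleted_at IS NOT NULL AND deleted_at < ?",
        (cutoff,))
    conn.commit()
    return cur.rowcount  # number of rows physically removed
```

Until the purge runs, a soft-deleted row is still recoverable by clearing its `deleted_at` stamp, which is what makes this a prevention control for Scenario D.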


4. Testing & Validation

4.1 Backup Verification (Weekly)

Automated test:

# Runs: backup-restore-test.sh
- Restore latest backup to test database
- Run integrity checks
- Verify record counts match production
- Delete test database

Success Criteria:

  - [ ] Backup extraction succeeds
  - [ ] Database integrity check passes
  - [ ] Record counts within 1% of production
  - [ ] No errors in extraction process
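The weekly test above can be sketched as one function; this is an illustration of the logic, not the actual backup-restore-test.sh, and `restore_test` is a hypothetical name.

```python
import gzip
import os
import shutil
import sqlite3
import tempfile

def restore_test(backup_gz, prod_db, tolerance=0.01):
    """Weekly drill: restore into a throwaway DB, check integrity, compare row counts."""
    fd, test_db = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        with gzip.open(backup_gz, "rb") as f_in, open(test_db, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        t, p = sqlite3.connect(test_db), sqlite3.connect(prod_db)
        if t.execute("PRAGMA integrity_check;").fetchone()[0] != "ok":
            return False
        tables = [r[0] for r in t.execute(
            "SELECT name FROM sqlite_master WHERE type='table'")]
        for tbl in tables:
            n_t = t.execute(f'SELECT COUNT(*) FROM "{tbl}"').fetchone()[0]
            n_p = p.execute(f'SELECT COUNT(*) FROM "{tbl}"').fetchone()[0]
            if abs(n_t - n_p) > tolerance * max(n_p, 1):
                return False  # record counts drifted beyond the 1% criterion
        t.close()
        p.close()
        return True
    finally:
        os.unlink(test_db)  # delete test database (last step of the drill)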

4.2 Disaster Recovery Drill (Quarterly)

Process:

  1. Schedule: Q1/Q2/Q3/Q4, off-peak hours
  2. Scenario: Rotate through A/B/C/D each quarter
  3. Participants: Full team (eng, ops, support)
  4. Timeline: Track time to restore (measure against RTO)
  5. Validation: Functional test of restored system
  6. Debrief: Identify improvements, update procedures

Q1 2026 Drill: Complete system failure recovery (Scenario B)

  • Target RTO: 4 hours
  • Actual RTO: [To be measured]
  • Issues Found: [TBD]
  • Improvements: [TBD]

4.3 Annual Full Test

Every March:

  • Restore complete system from backup
  • Run full test suite
  • Validate all engagements/reports accessible
  • Document results + any issues
  • Update runbooks based on findings

Success Criteria:

  • ✓ 100% of tests pass
  • ✓ RTO < 4 hours verified
  • ✓ Zero data loss (RPO met)
  • ✓ All staff trained on procedures


5. Monitoring & Alerting

5.1 Backup Health Alerts

| Alert | Threshold | Action |
|---|---|---|
| Backup Failed | Any failed backup | Immediate investigation; manual retry |
| Backup Late | Backup not completed by 03:00 UTC | Alert on-call; run manual backup |
| Backup Size Anomaly | >50% variance from average | Check for data explosion or corruption |
| Backup Storage Full | >90% disk capacity | Expand storage; clean old backups |
| Restore Test Failed | Any failed test | Investigate backup; restore to verify |

5.2 Monitoring Dashboard

  • Last Backup Time: Should be < 24 hours ago
  • Backup Size Trend: Should be relatively stable (±20%)
  • Storage Utilization: Should be < 70%
  • Restore Test Status: Latest test should show "PASSED"

5.3 Daily Health Check

# Automated: Runs 03:30 UTC daily
- Verify latest backup file exists
- Check file size is reasonable
- Verify restore test passed
- Email summary to ops team
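The first two checks of the daily job can be sketched as follows; `daily_health_check` and the thresholds are illustrative, and the email step is omitted.

```python
import time
from pathlib import Path

def daily_health_check(backup_dir, max_age_hours=24, min_size_bytes=1024):
    """Return a list of problems with the most recent backup; empty list means healthy."""
    problems = []
    backups = sorted(Path(backup_dir).glob("backup_*.db.gz"))
    if not backups:
        return ["no backup files found"]
    latest = backups[-1]
    # Check 1: latest backup must be recent enough.
    age_h = (time.time() - latest.stat().st_mtime) / 3600
    if age_h > max_age_hours:
        problems.append(f"latest backup is {age_h:.1f}h old")
    # Check 2: file size must be plausible (a near-empty archive signals a failed run).
    if latest.stat().st_size < min_size_bytes:
        problems.append("latest backup suspiciously small")
    return problems
```

A cron entry at 03:30 UTC would call this and email the returned list to the ops team.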

6. Retention & Deletion

6.1 Data Retention Schedule

| Backup Type | Retention | Deletion Method |
|---|---|---|
| Daily DB backup | 30 days | Automatic (cron job) |
| Daily reports backup | 30 days | Automatic |
| Engagement backup | 1 year + 30 days post-retention | Automatic |
| Encryption keys | Indefinite | Manual (vault only) |
| Audit logs | 2 years | Automated purge script |
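The automatic 30-day cleanup for daily backups amounts to a cutoff comparison on file age; a minimal sketch (the `purge_old_backups` helper and `backup_*` naming are assumptions):

```python
import time
from pathlib import Path

def purge_old_backups(backup_dir, retention_days=30):
    """Nightly cleanup: remove daily DB backups older than the retention window."""
    cutoff = time.time() - retention_days * 86400
    removed = []
    for f in sorted(Path(backup_dir).glob("backup_*.db.gz")):
        if f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f.name)  # log the deletion in the manifest afterwards
    return removed
```

Note that plain `unlink` only removes the directory entry; secure overwrite per NIST SP 800-88 (below) requires shredding the blocks or destroying the encryption key first.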

6.2 Deletion Process

  • Backups deleted via secure overwrite (NIST SP 800-88)
  • Deleted backups logged in manifest (date, size, method)
  • Encrypted files destroyed (keys first, then encrypted backups)

7. Disaster Recovery Plan Contacts

| Role | Name | Email | Phone |
|---|---|---|---|
| Primary Backup Admin | [NAME] | [EMAIL] | [PHONE] |
| Secondary Backup Admin | [NAME] | [EMAIL] | [PHONE] |
| CTO (Escalation) | [CTO NAME] | [EMAIL] | [PHONE] |
| Infrastructure | [LEAD NAME] | [EMAIL] | [PHONE] |

8. Documentation & Updates

  • Version: 1.0
  • Last Tested: [TBD — first drill scheduled Q1 2026]
  • Next Review: 2027-03-17
  • Change Log: Maintained in git (docs/operations/backup-recovery.md)

Updates Required When:

  • RTO/RPO changes
  • New backup system introduced
  • Incident lessons learned
  • Annual drill findings
  • Regulatory changes


Document Version: 1.0 | Effective: 2026-03-17 | Compliance: GDPR, HIPAA, ISO 27001, NIST-800-53