Database Disaster Recovery: RPO, RTO, Cross-Region Replication

Disaster recovery (DR) ensures your database can survive catastrophic events: region outages, data corruption, accidental deletions, or ransomware attacks. Unlike high availability, which handles component failures, DR addresses large-scale disasters.

RPO and RTO

Two metrics define DR requirements:

**Recovery Point Objective (RPO)**: The maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose at most 1 hour of data.

**Recovery Time Objective (RTO)**: The maximum acceptable downtime. An RTO of 4 hours means the database must be operational within 4 hours of the disaster.

| Scenario | RPO | RTO | Strategy |
|----------|-----|-----|----------|
| Internal tool | 24 hours | 24 hours | Daily backups, restore |
| E-commerce | 5 minutes | 1 hour | Cross-region replication |
| Financial trading | 0 (zero loss) | 5 minutes | Synchronous replication + DR site |
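To make the targets concrete, the arithmetic behind an RPO check can be sketched in shell. This is an illustration, not real tooling: `achieved_rpo_seconds` and `rpo_met` are hypothetical helper names, and the underlying point is that with backup-based DR the achievable RPO is bounded by the age of the newest recoverable backup or WAL file.

```shell
#!/bin/bash
# Hypothetical helpers (illustration only): the achieved RPO under backup-based
# DR is the age of the newest successfully archived artifact.

achieved_rpo_seconds() {   # args: now_epoch last_archived_epoch
  echo $(( $1 - $2 ))
}

rpo_met() {                # args: achieved_seconds target_seconds -> yes/no
  if [ "$1" -le "$2" ]; then echo yes; else echo no; fi
}

# Example: last WAL file archived 4 minutes ago, e-commerce target RPO = 5 min
# achieved=$(achieved_rpo_seconds "$(date +%s)" "$last_archived_epoch")
# rpo_met "$achieved" 300
```

The same comparison applies to RTO: time a real restore and compare it to the target, rather than assuming the budget is met.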

Cross-Region Replication

PostgreSQL Logical Replication Across Regions




```sql
-- On primary (us-east-1)
CREATE PUBLICATION dr_pub FOR ALL TABLES;

-- On standby (us-west-2)
CREATE SUBSCRIPTION dr_sub
    CONNECTION 'host=primary-us-east-1.example.com port=5432 dbname=proddb'
    PUBLICATION dr_pub
    WITH (copy_data = true, connect = true, create_slot = true);
```





Logical replication works across regions with asynchronous delivery. Monitor lag carefully:




```sql
SELECT pg_size_pretty(
         pg_wal_lsn_diff(
           pg_current_wal_lsn(),
           replay_lsn
         )
       ) AS replication_lag
FROM pg_stat_replication
WHERE application_name = 'dr_sub';
```
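Lag monitoring is only useful when wired to the RPO budget. A minimal sketch of the alerting side, assuming the query above feeds it the lag in bytes; the 64 MiB budget and the `check_lag` helper are illustrative, not standard tooling:

```shell
#!/bin/bash
# Hypothetical lag classifier: compare un-replayed WAL (bytes) to an RPO budget.

RPO_BYTES=$((64 * 1024 * 1024))   # example budget: 64 MiB of un-replayed WAL

check_lag() {
  local lag="$1"
  if [ -z "$lag" ]; then
    # No row for dr_sub usually means replication is down entirely
    echo "CRITICAL: dr_sub replication not running"
    return 2
  elif [ "$lag" -gt "$RPO_BYTES" ]; then
    echo "WARNING: replication lag ${lag} bytes exceeds budget"
    return 1
  fi
  echo "OK: lag ${lag} bytes"
}

# From cron, feed it the live value, e.g.:
#   check_lag "$(psql -At -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
#                             FROM pg_stat_replication WHERE application_name = 'dr_sub';")"
```

Translating bytes of lag into seconds of data loss depends on write volume, so a byte budget is a conservative proxy for a time-based RPO.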





AWS RDS Cross-Region Read Replicas




```bash
# Create a cross-region read replica. When the replica lives in a different
# region, the source must be referenced by its full ARN.
aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-dr \
    --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb \
    --region us-west-2 \
    --db-instance-class db.r6g.large

# Promote to standalone for DR
aws rds promote-read-replica \
    --db-instance-identifier mydb-dr \
    --region us-west-2
```





Multi-Region with Patroni

Patroni can manage clusters across regions with careful configuration:




```yaml
# DR site configuration
scope: myapp
namespace: /service/
name: pg-dr-node-1

consul:
  # Separate DCS cluster for DR isolation
  host: dr-consul.service.consul:8500

tags:
  nofailover: true  # DR site should not automatically become primary
```





Backup-Based DR

For cost-sensitive environments, backups plus WAL archiving to S3 provide DR:




```ini
# postgresql.conf: continuous WAL archiving to a cross-region S3 bucket
archive_command = 'aws s3 cp %p s3://myapp-wal-dr/region/us-east-1/%f'
```

```bash
# DR restore procedure: restore the latest full backup, then keep streaming
# WAL for replay (pg_receivewal captures WAL; it does not itself restore)
pg_restore --dbname=proddb /backups/dr/latest_full.dump
pg_receivewal --directory /backups/dr/wal
```
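Before trusting the archive for PITR, it is worth verifying that the WAL sequence in the bucket has no holes, since a single missing segment stops replay. A rough sketch, assuming the default 16 MiB segment size (256 segments per log file) and a single timeline; `wal_seq` and `check_wal_gaps` are hypothetical helpers:

```shell
#!/bin/bash
# Hypothetical WAL-archive continuity check.
# Assumes default 16 MiB segments (256 per log file) and one timeline.

wal_seq() {   # 24-hex-digit WAL file name -> absolute segment number
  local name="$1"
  local log=$((16#${name:8:8}))
  local seg=$((16#${name:16:8}))
  echo $(( log * 256 + seg ))
}

check_wal_gaps() {   # reads sorted WAL file names on stdin, reports holes
  local prev="" cur i
  while read -r name; do
    cur=$(wal_seq "$name")
    if [ -n "$prev" ] && [ "$cur" -gt $(( prev + 1 )) ]; then
      for (( i = prev + 1; i < cur; i++ )); do echo "missing segment $i"; done
    fi
    prev=$cur
  done
}

# Against the bucket used above:
#   aws s3 ls s3://myapp-wal-dr/region/us-east-1/ | awk '{print $4}' | sort | check_wal_gaps
```

A gap report here means the corresponding time range is unrecoverable, which directly worsens your effective RPO.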





Recovery Workflow




```bash
#!/bin/bash
# DR: restore to us-west-2

# 1. Restore latest full backup
pgbackrest --stanza=prod --db-path=/var/lib/postgresql/dr restore

# 2. Set recovery target
cat >> /var/lib/postgresql/dr/postgresql.conf << EOF
restore_command = 'aws s3 cp s3://myapp-wal-dr/region/us-east-1/%f %p'
recovery_target_time = '2026-05-12 10:00:00 UTC'
recovery_target_action = 'promote'
EOF

# Ensure recovery.signal exists: PostgreSQL 12+ only enters targeted
# recovery when this file is present (touch is harmless if it already is)
touch /var/lib/postgresql/dr/recovery.signal

# 3. Start and recover
pg_ctl start -D /var/lib/postgresql/dr

# 4. Verify data integrity
psql -c "SELECT count(*) FROM critical_table;"
psql -c "SELECT max(created_at) FROM orders;"
```





Backup Testing

Backups are worthless until proven restorable. Regular testing is mandatory.

Automated Restore Test




```bash
#!/bin/bash
# Weekly restore test
set -euo pipefail

TEST_DIR=/tmp/dr_test_$(date +%Y%m%d)
LOG_FILE=$TEST_DIR/restore.log

mkdir -p "$TEST_DIR"

echo "=== DR Restore Test $(date) ===" >> "$LOG_FILE"

# Full restore
pgbackrest --stanza=prod --db-path="$TEST_DIR/data" restore >> "$LOG_FILE" 2>&1

# Start on a non-default port so the test cannot clash with production
pg_ctl -D "$TEST_DIR/data" -o "-p 5433" -l "$TEST_DIR/pg.log" start >> "$LOG_FILE" 2>&1
sleep 5

# Verify
echo "Database size:"
psql -p 5433 -c "SELECT pg_size_pretty(pg_database_size('proddb'));"

echo "Row counts:"
psql -p 5433 -c "
SELECT 'users' AS tbl, count(*) FROM users
UNION ALL
SELECT 'orders', count(*) FROM orders
UNION ALL
SELECT 'payments', count(*) FROM payments;
"

echo "Max dates (data freshness):"
psql -p 5433 -c "
SELECT 'users' AS tbl, max(created_at) FROM users
UNION ALL
SELECT 'orders', max(created_at) FROM orders;
"

echo "=== Test Complete ===" >> "$LOG_FILE"

# Cleanup (removes the restored data and the log; write the final status first)
pg_ctl -D "$TEST_DIR/data" stop >> "$LOG_FILE" 2>&1
rm -rf "$TEST_DIR"
```





Schedule this via cron:




```
0 2 * * 0 /usr/local/bin/dr_restore_test.sh
```





DR Plan Components

A complete DR plan should document:


1. **Contact list**: Who to contact and escalation paths.
2. **RTO and RPO targets**: Specific to each data tier.
3. **Runbook**: Step-by-step recovery procedures.
4. **DR site details**: Region, connection strings, credentials.
5. **Validation steps**: How to verify the recovery succeeded.
6. **Communication plan**: Internal and external notifications.
7. **Post-mortem process**: How to document and improve.

Disaster Scenarios and Mitigations

| Scenario | Mitigation | RPO Impact |
|----------|------------|------------|
| Region outage | Cross-region replica promotion | RPO = replication lag |
| Accidental DROP TABLE | PITR to before the statement | RPO = time since last WAL backup |
| Ransomware | Immutable WAL backups | RPO depends on backup frequency |
| Data corruption | Replay WAL; keep multiple backups | Dependent on detection time |
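For the accidental-DROP row, the PITR target is the instant just before the bad statement, which you locate from server logs or an audit trail. A small illustrative helper for turning that timestamp into a `recovery_target_time` value; `pitr_target` is a hypothetical name and the sketch assumes GNU `date`:

```shell
#!/bin/bash
# Hypothetical: convert the epoch of a bad statement into a recovery target
# one second earlier. Requires GNU date (the -d "@epoch" form).

pitr_target() {
  date -u -d "@$(( $1 - 1 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# pitr_target "$drop_epoch"  ->  a timestamp like '2026-05-12 09:59:59 UTC',
# usable as recovery_target_time in the recovery workflow shown earlier.
```

Stopping one second early is a coarse margin; if other writes landed in that second, compare row counts after recovery and, if needed, re-run PITR against a `recovery_target_lsn` pinpointed from the WAL.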

Testing DR with Chaos Engineering




```bash
# Simulate region failure: drop traffic arriving from the primary region
# (dr-test-region stands in for the primary's resolvable address)
iptables -A INPUT -s dr-test-region -j DROP

# Trigger DR failover script
./dr_failover.sh --target us-west-2

# Verify applications work from DR region
curl -f https://dr-api.myapp.com/health

# Fail back
./dr_failback.sh --target us-east-1

# Clean up
iptables -D INPUT -s dr-test-region -j DROP
```





Run DR drills quarterly at minimum. Document every drill outcome and update the runbook with lessons learned. A DR plan that has never been tested is not a plan; it is a hope.