Database Disaster Recovery: RPO, RTO, Cross-Region Replication
Disaster recovery (DR) ensures your database can survive catastrophic events: region outages, data corruption, accidental deletions, or ransomware attacks. Unlike high availability, which handles component failures, DR addresses large-scale disasters.
RPO and RTO
Two metrics define DR requirements:
**Recovery Point Objective (RPO)**: The maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose at most 1 hour of data.
**Recovery Time Objective (RTO)**: The maximum acceptable downtime. An RTO of 4 hours means the database must be operational within 4 hours of the disaster.
| Scenario | RPO | RTO | Strategy |
|----------|-----|-----|----------|
| Internal tool | 24 hours | 24 hours | Daily backups, restore |
| E-commerce | 5 minutes | 1 hour | Cross-region replication |
| Financial trading | 0 (zero loss) | 5 minutes | Synchronous replication + DR site |
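To make the two metrics concrete, here is a minimal sketch (function names and the incident timeline are hypothetical) that computes the *achieved* RPO and RTO after an incident and checks them against the e-commerce targets above:

```python
from datetime import datetime, timedelta

def achieved_rpo(last_replicated: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: everything written after the last replicated write is lost."""
    return failure_time - last_replicated

def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime window: from the failure until the database accepts traffic again."""
    return service_restored - failure_time

# Hypothetical incident timeline
failure  = datetime(2026, 5, 12, 10, 0)
last_rep = datetime(2026, 5, 12, 9, 57)   # replica was 3 minutes behind
restored = datetime(2026, 5, 12, 10, 45)  # promotion + DNS cutover took 45 minutes

rpo = achieved_rpo(last_rep, failure)
rto = achieved_rto(failure, restored)
print(rpo <= timedelta(minutes=5), rto <= timedelta(hours=1))  # → True True
```

Tracking these numbers from real drills tells you whether the targets in the table are achievable or aspirational.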
Cross-Region Replication
PostgreSQL Logical Replication Across Regions
```sql
-- On primary (us-east-1)
CREATE PUBLICATION dr_pub FOR ALL TABLES;

-- On standby (us-west-2)
CREATE SUBSCRIPTION dr_sub
    CONNECTION 'host=primary-us-east-1.example.com port=5432 dbname=proddb'
    PUBLICATION dr_pub
    WITH (copy_data = true, connect = true, create_slot = true);
```
Logical replication works across regions with asynchronous delivery. Monitor lag carefully:
```sql
-- Run on the primary; by default the subscription's application_name
-- is the subscription name
SELECT pg_size_pretty(
         pg_wal_lsn_diff(
           pg_current_wal_lsn(),
           replay_lsn
         )
       ) AS replication_lag
FROM pg_stat_replication
WHERE application_name = 'dr_sub';
```
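The query above reports lag in bytes behind the primary. A monitoring job can turn that into an alert by estimating how many seconds of data would be lost at the current write rate. A minimal sketch (the function, the half-budget threshold, and the sample numbers are illustrative choices, not part of any library):

```python
def lag_alert(lag_bytes: int, rpo_seconds: int, write_rate_bytes_per_sec: int) -> bool:
    """Estimate whether current replication lag endangers the RPO target.

    Approximates lag-in-time as lag_bytes / write_rate, and alerts when the
    estimate exceeds half the RPO budget, leaving headroom to react.
    """
    if write_rate_bytes_per_sec <= 0:
        return False  # idle primary: byte lag is not growing
    estimated_lag_seconds = lag_bytes / write_rate_bytes_per_sec
    return estimated_lag_seconds > rpo_seconds / 2

# 64 MiB behind at ~1 MiB/s sustained writes ≈ 64 s of estimated lag
print(lag_alert(64 * 1024**2, rpo_seconds=300, write_rate_bytes_per_sec=1024**2))   # → False
print(lag_alert(512 * 1024**2, rpo_seconds=300, write_rate_bytes_per_sec=1024**2))  # → True
```

Alerting at a fraction of the RPO budget, rather than at the budget itself, gives operators time to act before the objective is actually breached.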
AWS RDS Cross-Region Read Replicas
```bash
# Create a cross-region read replica; when the replica lives in a different
# region, the source instance must be referenced by its ARN
aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-dr \
    --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:mydb \
    --region us-west-2 \
    --db-instance-class db.r6g.large

# Promote to a standalone instance during DR
aws rds promote-read-replica \
    --db-instance-identifier mydb-dr \
    --region us-west-2
```
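Promotion is not instantaneous: the instance passes through transitional statuses before settling back to `available`, and it is only safe to point writes at it after that. A sketch of the readiness decision as a pure function, assuming status strings as RDS reports them (the function itself is hypothetical; in practice you would poll `aws rds describe-db-instances` and feed the statuses in):

```python
from typing import Iterable

def promotion_complete(status_history: Iterable[str]) -> bool:
    """Return True once the instance has left a transitional state and is available.

    The initial 'available' (before promotion starts) must not count, so we
    require at least one transitional status before accepting 'available'.
    """
    transitional = {"modifying", "backing-up", "rebooting", "upgrading"}
    seen_transition = False
    for status in status_history:
        if status in transitional:
            seen_transition = True
        elif status == "available" and seen_transition:
            return True
    return False

print(promotion_complete(["available", "modifying", "backing-up", "available"]))  # → True
```

Wiring this into the failover runbook avoids the classic mistake of flipping DNS to a replica that is still mid-promotion.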
Multi-Region with Patroni
Patroni can manage clusters across regions with careful configuration:
```yaml
# DR site configuration
scope: myapp
namespace: /service/
name: pg-dr-node-1

consul:
  host: dr-consul.service.consul:8500  # separate DCS for DR isolation

tags:
  nofailover: true  # DR site should not automatically become primary
```
Backup-Based DR
For cost-sensitive environments, backups plus WAL archiving to S3 provide DR:
```ini
# postgresql.conf: continuous WAL archiving to a cross-region S3 bucket
archive_command = 'aws s3 cp %p s3://myapp-wal-dr/region/us-east-1/%f'
```

An alternative is streaming WAL off the primary with `pg_receivewal`, which can run alongside or instead of `archive_command`:

```bash
pg_receivewal --host primary-us-east-1.example.com --directory /backups/dr/wal
```

A logical dump can be restored at the DR site directly; note that archived WAL cannot be replayed on top of `pg_restore` output, so point-in-time recovery requires a physical base backup (see the recovery workflow below):

```bash
pg_restore --dbname=proddb /backups/dr/latest_full.dump
```
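Archived WAL is only useful if the sequence is unbroken: one missing segment stops recovery at that point. A minimal gap-check sketch over archived filenames, assuming a single timeline and the default 16 MB `wal_segment_size` (256 segments per log file); PostgreSQL WAL filenames are 24 hex characters: 8 for timeline, 8 for log number, 8 for segment number:

```python
def wal_sequence_number(filename: str, segments_per_log: int = 0x100) -> int:
    """Map a 24-hex-char WAL filename to a linear position within its timeline.

    segments_per_log = 0x100 assumes the default 16 MB wal_segment_size;
    the timeline prefix (first 8 chars) is ignored here.
    """
    log = int(filename[8:16], 16)
    seg = int(filename[16:24], 16)
    return log * segments_per_log + seg

def find_gaps(filenames: list[str]) -> list[int]:
    """Return missing linear positions between the first and last archived segment."""
    positions = sorted(wal_sequence_number(f) for f in filenames)
    expected = set(range(positions[0], positions[-1] + 1))
    return sorted(expected - set(positions))

archive = [
    "000000010000000A000000FE",
    "000000010000000A000000FF",
    "000000010000000B00000001",  # ...0B00000000 is missing
]
print(find_gaps(archive))  # → [2816]
```

Running a check like this against the S3 listing on a schedule catches archiving failures long before a disaster makes them fatal.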
Recovery Workflow
```bash
#!/bin/bash
# DR: restore to us-west-2
set -euo pipefail

PGDATA=/var/lib/postgresql/dr

# 1. Restore latest full backup
pgbackrest --stanza=prod --db-path="$PGDATA" restore

# 2. Set recovery target (PostgreSQL 12+: recovery settings live in
#    postgresql.conf, and a recovery.signal file triggers recovery mode)
cat >> "$PGDATA/postgresql.conf" << EOF
restore_command = 'aws s3 cp s3://myapp-wal-dr/region/us-east-1/%f %p'
recovery_target_time = '2026-05-12 10:00:00 UTC'
recovery_target_action = 'promote'
EOF
touch "$PGDATA/recovery.signal"

# 3. Start and recover
pg_ctl start -D "$PGDATA"

# 4. Verify data integrity
psql -c "SELECT count(*) FROM critical_table;"
psql -c "SELECT max(created_at) FROM orders;"
```
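Step 4 can be automated: compare the newest row timestamp the restored database reports against the disaster time, and fail the recovery check if the gap exceeds the RPO target. A minimal sketch (the table, timestamps, and helper name are illustrative):

```python
from datetime import datetime, timedelta

def within_rpo(max_created_at: datetime, disaster_time: datetime,
               rpo: timedelta) -> bool:
    """True when the restored data is no staler than the RPO allows."""
    return disaster_time - max_created_at <= rpo

disaster = datetime(2026, 5, 12, 10, 0)
newest_order = datetime(2026, 5, 12, 9, 58)  # e.g. from: SELECT max(created_at) FROM orders
print(within_rpo(newest_order, disaster, timedelta(minutes=5)))  # → True
```

Encoding the check this way turns "eyeball the max dates" into a pass/fail gate that a recovery script can act on.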
Backup Testing
Backups are worthless until proven restorable. Regular testing is mandatory.
Automated Restore Test
```bash
#!/bin/bash
# Weekly restore test
set -euo pipefail

TEST_DIR=/tmp/dr_test_$(date +%Y%m%d)
LOG_FILE=/var/log/dr_restore_test.log  # kept outside TEST_DIR so cleanup does not delete it
TEST_PORT=5433

mkdir -p "$TEST_DIR"
echo "=== DR Restore Test $(date) ===" >> "$LOG_FILE"

# Full restore
pgbackrest --stanza=prod --db-path="$TEST_DIR/data" restore >> "$LOG_FILE" 2>&1

# Start the database on a non-production port
pg_ctl -D "$TEST_DIR/data" -o "-p $TEST_PORT" -l "$TEST_DIR/pg.log" start >> "$LOG_FILE" 2>&1
until pg_isready -p "$TEST_PORT" -q; do sleep 1; done

# Verify
echo "Database size:"
psql -p "$TEST_PORT" -d proddb -c "SELECT pg_size_pretty(pg_database_size('proddb'));"

echo "Row counts:"
psql -p "$TEST_PORT" -d proddb -c "
  SELECT 'users' AS tbl, count(*) FROM users
  UNION ALL
  SELECT 'orders', count(*) FROM orders
  UNION ALL
  SELECT 'payments', count(*) FROM payments;
"

echo "Max dates (data freshness):"
psql -p "$TEST_PORT" -d proddb -c "
  SELECT 'users' AS tbl, max(created_at) FROM users
  UNION ALL
  SELECT 'orders', max(created_at) FROM orders;
"

# Cleanup
pg_ctl -D "$TEST_DIR/data" stop >> "$LOG_FILE" 2>&1
rm -rf "$TEST_DIR"
echo "=== Test Complete ===" >> "$LOG_FILE"
```
Schedule this via cron:
```
0 2 * * 0 /usr/local/bin/dr_restore_test.sh
```
DR Plan Components
A complete DR plan should document:
1. **Contact list**: Who to contact and escalation paths.
2. **RTO and RPO targets**: Specific to each data tier.
3. **Runbook**: Step-by-step recovery procedures.
4. **DR site details**: Region, connection strings, credentials.
5. **Validation steps**: How to verify the recovery succeeded.
6. **Communication plan**: Internal and external notifications.
7. **Post-mortem process**: How to document and improve.
Disaster Scenarios and Mitigations
| Scenario | Mitigation | RPO Impact |
|----------|------------|------------|
| Region outage | Cross-region replica promotion | RPO = replication lag |
| Accidental DROP TABLE | PITR to before the statement | RPO = time since last WAL backup |
| Ransomware | Immutable WAL backups | RPO depends on backup frequency |
| Data corruption | Replay WAL; keep multiple backups | Dependent on detection time |
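For the accidental-DROP row, the recovery target must land just before the destructive statement. A sketch of picking a `recovery_target_time` with a small safety margin (the helper and the one-second margin are judgment calls, not PostgreSQL requirements):

```python
from datetime import datetime, timedelta

def pitr_target(bad_statement_time: datetime,
                margin: timedelta = timedelta(seconds=1)) -> str:
    """Format a recovery_target_time safely before the destructive statement.

    By default PostgreSQL includes commits up to and including the target
    time, so aiming a margin earlier avoids replaying the DROP itself.
    """
    target = bad_statement_time - margin
    return target.strftime("%Y-%m-%d %H:%M:%S UTC")

# DROP TABLE ran at 10:00:00 UTC (e.g. found in the server log)
print(pitr_target(datetime(2026, 5, 12, 10, 0, 0)))  # → 2026-05-12 09:59:59 UTC
```

The statement timestamp usually comes from the server log or `pg_stat_activity` history, which is one more reason to ship logs off the database host.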
Testing DR with Chaos Engineering
```bash
# Simulate region failure: block traffic to the primary
# (dr-test-region is a placeholder for the primary's address or CIDR)
iptables -A INPUT -s dr-test-region -j DROP

# Trigger DR failover script
./dr_failover.sh --target us-west-2

# Verify applications work from the DR region
curl -f https://dr-api.myapp.com/health

# Fail back
./dr_failback.sh --target us-east-1

# Clean up
iptables -D INPUT -s dr-test-region -j DROP
```
Run DR drills quarterly at minimum. Document every drill outcome and update the runbook with lessons learned. A DR plan that has never been tested is not a plan; it is a hope.