Database High Availability: Failover, Standby Types, Health Checks

High availability (HA) ensures your database remains accessible despite hardware failures, network partitions, or software crashes. This article covers failover mechanisms, standby types, health monitoring, and automated recovery.

Availability Metrics

Availability is measured in "nines":

| Uptime | Downtime/Year | Classification |
|--------|---------------|----------------|
| 99% | 87.6 hours | Basic |
| 99.9% (three nines) | 8.76 hours | Standard |
| 99.99% (four nines) | 52.56 minutes | Enterprise |
| 99.999% (five nines) | 5.26 minutes | Mission-critical |

Achieving higher availability requires increasingly sophisticated (and expensive) infrastructure.
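These downtime budgets follow directly from the percentages; a quick sketch of the arithmetic (using a 365-day year, as the table does):

```python
def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime in minutes per 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten, which is why each step up demands qualitatively better tooling.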

Standby Types

Hot Standby

A hot standby accepts read queries while replicating from the primary. PostgreSQL's `hot_standby = on` enables this:




```ini
# standby postgresql.conf
hot_standby = on
hot_standby_feedback = on
```






* **Failover time**: Seconds (already running, just promote).

* **Resource usage**: Full database server resources.

* **Use case**: Production read replicas with fast failover.


Warm Standby

A warm standby is running but does not accept connections until promoted:




```ini
# warm standby: accept no connections
hot_standby = off
```






* **Failover time**: Tens of seconds (start postmaster + recovery).

* **Resource usage**: Moderate (receives WAL but no query processing).

* **Use case**: Cost-sensitive environments.


Cold Standby

A cold standby is an offline copy of the data. It must be started from scratch during failover:


* **Failover time**: Minutes to hours (restore from backup, start database).

* **Resource usage**: Minimal (only storage).

* **Use case**: Disaster recovery only.
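The wide "minutes to hours" range is dominated by restore throughput; a back-of-envelope sketch (the 500 GB size and 200 MB/s throughput are made-up example figures):

```python
def restore_time_minutes(backup_size_gb: float, throughput_mb_s: float) -> float:
    """Rough time to stream a base backup onto the cold standby, ignoring WAL replay."""
    return backup_size_gb * 1024 / throughput_mb_s / 60

# e.g. a 500 GB base backup restored at 200 MB/s sustained
print(f"{restore_time_minutes(500, 200):.0f} minutes")
```

WAL replay to catch up from the backup's snapshot to the failure point adds further time on top of the raw copy.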


Failover Strategies

Manual Failover

The operator promotes a standby and redirects applications:




```bash
# Promote standby
pg_ctl promote -D /var/lib/postgresql/data

# Update application DNS or connection string
# This may take minutes and requires human intervention
```





**Pros**: Simple; the operator has full control.

**Cons**: Slow (minutes), error-prone under pressure.

Automated Failover with Patroni

Patroni is one of the most widely used HA solutions for PostgreSQL. It relies on a distributed configuration store (etcd, Consul, or ZooKeeper) for consensus when electing a new leader:




```yaml
# patroni.yml
scope: myapp-db
namespace: /service/
name: pg-node-1

consul:
  host: consul.service.consul:8500
  register_service: true

postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg-node-1:5432
  data_dir: /var/lib/postgresql/data
  bin_dir: /usr/lib/postgresql/16/bin
  authentication:
    replication:
      username: replicator
      password: secure_password
    superuser:
      username: postgres
      password: secure_password
  parameters:
    max_connections: 200
  use_pg_rewind: true
  use_slots: true

watchdog:
  mode: required
  device: /dev/watchdog
```





Health Checks

Patroni performs regular health checks:


1. **pg_isready**: Checks whether PostgreSQL is accepting connections.
2. **Replication lag**: If lag exceeds a threshold, the standby is not eligible for promotion.
3. **Consensus health**: The node must be able to reach the DCS (etcd/Consul).




```bash
# Manual health check
pg_isready -h localhost -p 5432
# localhost:5432 - accepting connections

# Check replication status
psql -c "SELECT * FROM pg_stat_replication;"
```
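The positions reported by `pg_stat_replication` are LSNs such as `0/3000060`: two 32-bit hex words separated by a slash. Subtracting two LSNs (what `pg_wal_lsn_diff()` does server-side) gives lag in bytes; a minimal client-side sketch:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/3000060' to an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def replication_lag_bytes(primary_lsn: str, replica_replay_lsn: str) -> int:
    """How many bytes of WAL the replica still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_replay_lsn)

print(replication_lag_bytes("0/3000060", "0/3000000"))  # 96
```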





Failover Flow


1. Patroni detects the primary is down (missed health check + no DCS lock).
2. Patroni holds a leader election via the DCS.
3. The most advanced replica wins (most recent WAL position).
4. That replica is promoted to primary.
5. The old primary, when it recovers, is reconfigured as a replica.
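The election logic reduces to "filter by lag, then take the maximum WAL position". A toy sketch of that rule (function names, the threshold, and data shapes are illustrative assumptions, not Patroni's actual internals):

```python
MAX_LAG_BYTES = 16 * 1024 * 1024  # hypothetical threshold: one 16 MB WAL segment

def pick_promotion_candidate(replicas, last_known_primary_lsn):
    """replicas: list of (name, replayed_lsn) pairs, LSNs as integer byte offsets."""
    eligible = [
        (name, lsn) for name, lsn in replicas
        if last_known_primary_lsn - lsn <= MAX_LAG_BYTES
    ]
    if not eligible:
        return None  # no replica is close enough; require manual intervention
    # The most advanced replica (highest replayed WAL position) wins.
    return max(eligible, key=lambda r: r[1])[0]

replicas = [("pg-node-2", 1_000_000), ("pg-node-3", 900_000)]
print(pick_promotion_candidate(replicas, last_known_primary_lsn=1_000_500))  # pg-node-2
```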

Connection Draining and Routing

During failover, in-flight transactions are lost. Applications must handle this:




```python
import psycopg2
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)

@retry(
    retry=retry_if_exception_type((psycopg2.OperationalError,
                                   psycopg2.InterfaceError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1),
)
def execute_with_retry(dsn, query, params=None):
    # Open a fresh connection on each attempt: after a failover the old
    # connection is dead, and reconnecting reaches the new primary.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(query, params or ())
        conn.commit()
    finally:
        conn.close()
```





Smart DNS (Route53, CloudDNS)




```python
import boto3

def failover_to_standby(zone_id, standby_dns):
    r53 = boto3.client('route53')
    r53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'db.myapp.com',
                    'Type': 'CNAME',
                    'TTL': 30,
                    'ResourceRecords': [{'Value': standby_dns}]
                }
            }]
        }
    )
```





A low TTL (30 seconds) limits how long resolvers cache the old record, so clients converge on the new primary within roughly one TTL interval.

PostgreSQL High Availability Stack

A production HA setup combines:


1. **Streaming replication**: 2-3 synchronous standbys.
2. **Patroni**: Automated failover and cluster management.
3. **etcd/Consul**: Distributed consensus for leader election.
4. **HAProxy or PgBouncer**: Connection routing to the current primary.
5. **Watchdog**: Fencing to prevent split-brain.




```
# HAProxy configuration
frontend pg_frontend
    bind *:5000
    default_backend pg_backend

backend pg_backend
    option httpchk GET /master
    server pg-node-1 pg-node-1:5432 check port 8008 inter 2s
    server pg-node-2 pg-node-2:5432 check port 8008 inter 2s
    server pg-node-3 pg-node-3:5432 check port 8008 inter 2s
```





Testing HA

Regular failover testing is essential:




```bash
# Test: Kill the primary
ssh pg-node-1 "systemctl stop postgresql"

# Expected:
# - Patroni promotes pg-node-2 within 30 seconds
# - HAProxy routes to pg-node-2
# - Applications reconnect and continue

# Verify:
psql -h haproxy -p 5000 -c "SELECT inet_server_addr();"
# Should show pg-node-2's IP
```





Common Pitfalls


* **Split-brain**: Two nodes both think they are primary. Prevented by fencing (STONITH) and watchdog.

* **Replication lag**: A promoted standby may lose recent transactions if synchronous replication is not configured.

* **Connection timeouts**: Application connection pools must be configured to detect and recover from failover quickly.

* **Untested failover**: The first failover should not be a real incident. Test monthly.


High availability is not a product you buy; it is a property of your system's architecture and operations. Invest in automation, test regularly, and accept that no system is 100% available.