Database High Availability: Failover, Standby Types, Health Checks
High availability (HA) ensures your database remains accessible despite hardware failures, network partitions, or software crashes. This article covers failover mechanisms, standby types, health monitoring, and automated recovery.
Availability Metrics
Availability is measured in "nines":
| Uptime | Downtime/Year | Classification |
|--------|---------------|----------------|
| 99% | 87.6 hours | Basic |
| 99.9% (three nines) | 8.76 hours | Standard |
| 99.99% (four nines) | 52.56 minutes | Enterprise |
| 99.999% (five nines) | 5.26 minutes | Mission-critical |
Achieving higher availability requires increasingly sophisticated (and expensive) infrastructure.
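The downtime budgets above are just arithmetic on the uptime percentage; a quick sketch in Python:
# Downtime budget per year for a given availability target
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for target in (0.99, 0.999, 0.9999, 0.99999):
    downtime_min = MINUTES_PER_YEAR * (1 - target)
    print(f"{target:.5%} uptime -> {downtime_min:.1f} minutes/year "
          f"({downtime_min / 60:.2f} hours)")

# 99.00000% uptime -> 5256.0 minutes/year (87.60 hours)
# 99.90000% uptime -> 525.6 minutes/year (8.76 hours)
# 99.99000% uptime -> 52.6 minutes/year (0.88 hours)
# 99.99900% uptime -> 5.3 minutes/year (0.09 hours)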
Standby Types
Hot Standby
A hot standby accepts read queries while replicating from the primary. PostgreSQL's `hot_standby = on` enables this:
# standby postgresql.conf
hot_standby = on
hot_standby_feedback = on
* **Failover time**: Seconds (already running, just promote).
* **Resource usage**: Full database server resources.
* **Use case**: Production read replicas with fast failover.
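A quick way to confirm a standby is "hot" is to check that it reports being in recovery while still answering reads; the hostname and table below are illustrative:
# On the standby (hostname is an example)
psql -h standby-1 -c "SELECT pg_is_in_recovery();"
#  pg_is_in_recovery
# -------------------
#  t

psql -h standby-1 -c "SELECT count(*) FROM orders;"        # reads succeed
psql -h standby-1 -c "INSERT INTO orders DEFAULT VALUES;"
# ERROR:  cannot execute INSERT in a read-only transaction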
Warm Standby
A warm standby is running but does not accept connections until promoted:
# warm standby: accept no connections
hot_standby = off
* **Failover time**: Tens of seconds (start postmaster + recovery).
* **Resource usage**: Moderate (receives WAL but no query processing).
* **Use case**: Cost-sensitive environments.
Cold Standby
A cold standby is an offline copy of the data. It must be started from scratch during failover:
* **Failover time**: Minutes to hours (restore from backup, start database).
* **Resource usage**: Minimal (only storage).
* **Use case**: Disaster recovery only.
Failover Strategies
Manual Failover
The operator promotes a standby and redirects applications:
# Promote standby
pg_ctl promote -D /var/lib/postgresql/data
# Update application DNS or connection string
# This may take minutes and requires human intervention
* **Pros**: Simple; the operator has full control.
* **Cons**: Slow (minutes) and error-prone under pressure.
Automated Failover with Patroni
Patroni is one of the most widely used HA solutions for PostgreSQL. It relies on a distributed consensus store (etcd, Consul, or ZooKeeper) to elect a new leader:
# patroni.yml
scope: myapp-db
namespace: /service/
name: pg-node-1

consul:
  host: consul.service.consul:8500
  register_service: true

postgresql:
  listen: 0.0.0.0:5432
  connect_address: pg-node-1:5432
  data_dir: /var/lib/postgresql/data
  bin_dir: /usr/lib/postgresql/16/bin
  authentication:
    replication:
      username: replicator
      password: secure_password
    superuser:
      username: postgres
      password: secure_password
  parameters:
    max_connections: 200
  use_pg_rewind: true
  use_slots: true

watchdog:
  mode: required
  device: /dev/watchdog
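Once the cluster is running, Patroni's `patronictl` CLI shows member roles and can drive a planned role change; the config path below is illustrative:
# Cluster overview: member roles, state, and replication lag
patronictl -c /etc/patroni/patroni.yml list

# Planned switchover (e.g. before maintenance); prompts for the target member
patronictl -c /etc/patroni/patroni.yml switchover myapp-db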
Health Checks
Patroni performs regular health checks:
1. **pg_isready**: Checks whether PostgreSQL is accepting connections.
2. **Replication lag**: If lag exceeds a threshold, the standby is not eligible for promotion.
3. **Consensus health**: The node must be able to reach the DCS (etcd/Consul).
# Manual health check
pg_isready -h localhost -p 5432
# localhost:5432 - accepting connections
# Check replication status
psql -c "SELECT * FROM pg_stat_replication;"
Failover Flow
1. Patroni detects that the primary is down (missed health checks and no DCS leader lock).
2. Patroni holds a leader election via the DCS.
3. The most advanced replica wins (most recent WAL position).
4. The replica is promoted to primary.
5. The old primary, when it recovers, is reconfigured as a replica (a manual sketch of this step follows).
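With `use_pg_rewind: true`, Patroni handles step 5 automatically. Done by hand it looks roughly like this (hostnames match the example cluster; connection details are illustrative, and pg_rewind requires `wal_log_hints = on` or data checksums):
# On the failed old primary, after stopping PostgreSQL
pg_rewind \
  --target-pgdata=/var/lib/postgresql/data \
  --source-server="host=pg-node-2 port=5432 user=postgres dbname=postgres"
# Then reconfigure the node as a standby of pg-node-2 and start it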
Connection Draining and Routing
During failover, in-flight (uncommitted) transactions are lost and existing connections are dropped. Applications must handle this, for example with retry logic:
import psycopg2
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1))
def execute_with_retry(dsn, query, params=None):
    # Open a fresh connection on every attempt so that a retry after
    # failover does not reuse a socket pointing at the dead primary.
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(query, params or ())
        conn.commit()
    except (psycopg2.OperationalError, psycopg2.InterfaceError):
        # Connection-level failure: re-raise so tenacity retries
        raise
    finally:
        conn.close()
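Usage is then one call per statement; the DSN and table below are illustrative, with the DSN pointing at the HAProxy frontend described later:
execute_with_retry(
    "host=haproxy port=5000 dbname=myapp user=app",
    "UPDATE accounts SET balance = balance - %s WHERE id = %s",
    (100, 42),
)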
Smart DNS (Route53, CloudDNS)
import boto3

def failover_to_standby(standby_dns):
    # Route53 is a global service, so the client needs no region
    r53 = boto3.client('route53')
    r53.change_resource_record_sets(
        HostedZoneId='ZONE_ID',
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'db.myapp.com',
                    'Type': 'CNAME',
                    'TTL': 30,
                    'ResourceRecords': [{'Value': standby_dns}]
                }
            }]
        }
    )
A low TTL (30 seconds) speeds up propagation, though client-side DNS caching and long-lived connection pools can still delay the actual cutover.
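In practice the failover orchestration would invoke this once the standby is promoted; the hostname below is illustrative:
# Repoint db.myapp.com at the promoted standby
failover_to_standby('pg-node-2.myapp.internal')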
PostgreSQL High Availability Stack
A production HA setup combines:
1. **Streaming replication**: 2-3 synchronous standbys (example settings after this list).
2. **Patroni**: Automated failover and cluster management.
3. **etcd/Consul**: Distributed consensus for leader election.
4. **HAProxy or PgBouncer**: Connection routing to the current primary.
5. **Watchdog**: Fencing to prevent split-brain.
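For item 1, synchronous replication is configured on the primary; a minimal sketch using the standby names from this article (in a Patroni cluster these settings are normally managed by Patroni itself):
# primary postgresql.conf (managed by Patroni in practice)
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (pg-node-2, pg-node-3)'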
# HAProxy configuration
frontend pg_frontend
    bind *:5000
    mode tcp
    default_backend pg_backend

backend pg_backend
    mode tcp
    option httpchk GET /master
    http-check expect status 200
    server pg-node-1 pg-node-1:5432 check port 8008 inter 2s
    server pg-node-2 pg-node-2:5432 check port 8008 inter 2s
    server pg-node-3 pg-node-3:5432 check port 8008 inter 2s
Testing HA
Regular failover testing is essential:
# Test: Kill the primary
ssh pg-node-1 "systemctl stop postgresql"
# Expected:
# - Patroni promotes pg-node-2 within 30 seconds
# - HAProxy routes to pg-node-2
# - Applications reconnect and continue
# Verify:
psql -h haproxy -p 5000 -c "SELECT inet_server_addr();"
# Should show pg-node-2's IP
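A clean `systemctl stop` is the gentlest failure mode; it is worth rehearsing harder ones too. Two illustrative variants (the Consul port matches the example config):
# Hard crash instead of a clean shutdown
ssh pg-node-1 "kill -9 \$(head -1 /var/lib/postgresql/data/postmaster.pid)"

# Isolate the node from the DCS to simulate a network partition
ssh pg-node-1 "iptables -A OUTPUT -p tcp --dport 8500 -j DROP"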
Common Pitfalls
* **Split-brain**: Two nodes both think they are primary. Prevented by fencing (STONITH) and watchdog.
* **Replication lag**: A promoted standby may lose recent transactions if synchronous replication is not configured.
* **Connection timeouts**: Application connection pools must be configured to detect and recover from failover quickly (see the DSN example after this list).
* **Untested failover**: The first failover should not be a real incident. Test monthly.
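One way to address the connection-timeout pitfall at the driver level is through libpq connection parameters; a sketch with illustrative values, to be tuned per environment:
# Connect through HAProxy, fail fast, detect dead peers, refuse read-only nodes
psql "host=haproxy port=5000 dbname=myapp user=app \
      connect_timeout=5 \
      keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3 \
      target_session_attrs=read-write"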
High availability is not a product you buy; it is a property of your system's architecture and operations. Invest in automation, test regularly, and accept that no system is 100% available.