Database Monitoring and Performance Alerting

Why Monitor Databases?

Database monitoring catches problems before they become incidents. Track key metrics and alert on anomalies.

Key Metrics

Query Performance

-- PostgreSQL slow queries

SELECT query, mean_exec_time, calls

FROM pg_stat_statements

ORDER BY mean_exec_time DESC LIMIT 10;

-- Active queries

SELECT pid, state, query_start, query

FROM pg_stat_activity

WHERE state = 'active';

Connection Pools

Monitor active vs idle connections. Alert when connection count exceeds 80% of max_connections.

Disk and Memory

Track cache hit ratio (aim for 99%+), disk usage, and IOPS. Low cache hit ratio indicates the working set does not fit in memory.

Replication Lag

SELECT application_name,

pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,

now() - pg_last_xact_replay_timestamp() AS lag_time

FROM pg_stat_replication;

Prometheus Setup

# prometheus.yml

scrape_configs:

- job_name: 'postgres'

static_configs:

- targets: ['postgres_exporter:9187']

Alert Thresholds

| Metric | Warning | Critical | |--------|---------|----------| | Cache hit ratio | < 97% | < 95% | | Connections | > 80% | > 90% | | Replication lag | > 30s | > 300s | | Disk usage | > 80% | > 90% |

Conclusion

Track QPS, latency, connections, cache hit ratio, and replication lag. Use Prometheus and Grafana for collection and visualization. Set meaningful alert thresholds and avoid alert fatigue.

Database Monitoring and Performance Alerting

Database Monitoring and Performance Alerting

Related Articles