Database Monitoring and Performance Alerting
Why Monitor Databases?
Database monitoring catches problems before they become incidents. Track key metrics and alert on anomalies.
Key Metrics
Query Performance
-- PostgreSQL slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;
-- Active queries
SELECT pid, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active';
Connection Pools
Monitor active vs idle connections. Alert when connection count exceeds 80% of max_connections.
Disk and Memory
Track cache hit ratio (aim for 99%+), disk usage, and IOPS. Low cache hit ratio indicates the working set does not fit in memory.
Replication Lag
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
now() - pg_last_xact_replay_timestamp() AS lag_time
FROM pg_stat_replication;
Prometheus Setup
# prometheus.yml
scrape_configs:
- job_name: 'postgres'
static_configs:
- targets: ['postgres_exporter:9187']
Alert Thresholds
| Metric | Warning | Critical | |--------|---------|----------| | Cache hit ratio | < 97% | < 95% | | Connections | > 80% | > 90% | | Replication lag | > 30s | > 300s | | Disk usage | > 80% | > 90% |
Conclusion
Track QPS, latency, connections, cache hit ratio, and replication lag. Use Prometheus and Grafana for collection and visualization. Set meaningful alert thresholds and avoid alert fatigue.