Database monitoring tracks the health, performance, and availability of database systems. Effective monitoring catches problems before they become incidents. This article covers the key metrics to track, tools for collection and visualization, and strategies for setting meaningful alert thresholds.
Key Database Metrics
Queries Per Second (QPS)
QPS measures the throughput of your database. Monitoring QPS helps you understand traffic patterns and capacity requirements.
```sql
-- MySQL: Questions is a cumulative counter; sample it twice and
-- divide the difference by the sample interval to get QPS
SHOW GLOBAL STATUS LIKE 'Questions';

-- PostgreSQL: cumulative transaction counts (there is no built-in
-- per-second view; sample twice and diff, as with MySQL)
SELECT xact_commit + xact_rollback AS total_transactions
FROM pg_stat_database
WHERE datname = 'your_database';
```
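Because both servers expose cumulative counters rather than rates, QPS has to be derived by sampling. A minimal sketch, assuming a `get_counter()` callable (hypothetical) that returns the current counter value from one of the queries above:

```python
import time

def measure_qps(get_counter, interval_seconds=10):
    """Derive QPS from a cumulative counter by sampling twice.

    get_counter is assumed to return the current value of a
    monotonically increasing counter such as MySQL's Questions.
    """
    first = get_counter()
    time.sleep(interval_seconds)
    second = get_counter()
    return (second - first) / interval_seconds
```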
Query Latency
Latency measures how long queries take to execute. Latency spikes degrade user experience and are often the first visible symptom of a deeper problem.
**Percentile metrics matter more than averages**: p50 (the median) shows typical performance, while p99 shows the latency that the slowest 1% of queries exceed. A 10ms p50 with a 30-second p99 means roughly 1% of requests have a terrible experience, which an average would hide.
```sql
-- PostgreSQL: slowest queries (pg_stat_statements tracks mean, max,
-- and stddev per query, but not percentiles)
SELECT
    queryid,
    query,
    mean_exec_time AS avg_ms,
    max_exec_time AS max_ms,
    stddev_exec_time AS stddev_ms,
    calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```
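Since pg_stat_statements does not expose percentiles, p50 and p99 have to be computed from raw latency samples, for example from application-side timing. A minimal nearest-rank sketch (the sample values are hypothetical):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 9, 11, 15, 8, 10, 13, 9500]  # hypothetical samples
print(percentile(latencies_ms, 50))  # p50: 11 ms, the typical experience
print(percentile(latencies_ms, 99))  # p99: 9500 ms, dominated by the outlier
```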
Active Connections
Connection count indicates how many clients are simultaneously connected. Running out of connections causes application failures.
```sql
-- PostgreSQL: Active connection count by state
SELECT
    state,
    count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;

-- Check for long-lived idle-in-transaction sessions (potential blocking)
SELECT count(*)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '5 minutes';
```
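A simple saturation check compares the current connection count against the server's max_connections. A sketch using the psycopg2 driver (an assumption; any PostgreSQL driver works), with a hypothetical DSN:

```python
import psycopg2  # assumption: psycopg2 driver is installed

def connection_saturation(dsn):
    """Return the fraction of max_connections currently in use."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SHOW max_connections;")
            max_connections = int(cur.fetchone()[0])
            cur.execute(
                "SELECT count(*) FROM pg_stat_activity "
                "WHERE backend_type = 'client backend';"
            )
            current = cur.fetchone()[0]
    return current / max_connections

# Alert when usage crosses 80% of capacity (hypothetical DSN)
if connection_saturation("dbname=myapp user=monitor") > 0.80:
    print("WARN: connection count nearing max_connections")
```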
Cache Hit Ratio
Cache hit ratio measures how often the database finds data in memory versus reading from disk. Higher is better.
```sql
-- PostgreSQL: Buffer cache hit ratio (multiply by 100.0 first to
-- avoid integer division on the bigint counters)
SELECT
    100.0 * sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0) AS hit_ratio
FROM pg_stat_database
WHERE datname = 'your_database';

-- MySQL: InnoDB buffer pool counters
-- hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```
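The two MySQL counters have to be combined by hand. A minimal sketch of the arithmetic (the example counter values are hypothetical):

```python
def innodb_hit_ratio(read_requests, disk_reads):
    """Buffer pool hit ratio from the two Innodb_buffer_pool counters.

    read_requests: Innodb_buffer_pool_read_requests (logical reads)
    disk_reads:    Innodb_buffer_pool_reads (reads that missed the pool)
    """
    if read_requests == 0:
        return None  # no traffic yet; the ratio is undefined
    return 100.0 * (1 - disk_reads / read_requests)

print(innodb_hit_ratio(1_000_000, 12_000))  # 98.8% hit ratio
```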
Disk I/O
Disk I/O metrics reveal storage bottlenecks before they surface as query latency. Track read/write throughput, IOPS, and I/O wait time; sustained I/O saturation usually shows up alongside a falling cache hit ratio and rising query latency.
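Host-level I/O counters are usually collected by an agent such as node_exporter. As a standalone sketch, the third-party psutil package (an assumption, not part of the standard library) exposes the same cumulative counters, which are sampled twice just like QPS:

```python
import time

import psutil  # assumption: third-party psutil package is installed

def disk_throughput(interval_seconds=5):
    """Read/write bytes per second, derived from cumulative counters."""
    before = psutil.disk_io_counters()
    time.sleep(interval_seconds)
    after = psutil.disk_io_counters()
    read_bps = (after.read_bytes - before.read_bytes) / interval_seconds
    write_bps = (after.write_bytes - before.write_bytes) / interval_seconds
    return read_bps, write_bps
```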
Replication Lag
For databases with read replicas, replication lag measures how far behind replicas are from the primary.
```sql
-- PostgreSQL (run on the primary): byte lag per replica
SELECT
    application_name,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS lag_bytes
FROM pg_stat_replication;

-- PostgreSQL (run on the replica): time lag
SELECT now() - pg_last_xact_replay_timestamp() AS lag_time;

-- MySQL: Replication lag
SHOW SLAVE STATUS\G
-- Look at Seconds_Behind_Master (on MySQL 8.0.22+ use
-- SHOW REPLICA STATUS and Seconds_Behind_Source)
```
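A sketch of a poller that flags excessive time lag on a PostgreSQL replica, again assuming psycopg2 and a hypothetical replica DSN:

```python
import psycopg2  # assumption: psycopg2 driver is installed

LAG_THRESHOLD_SECONDS = 60

def replica_lag_seconds(replica_dsn):
    """Time lag on a PostgreSQL replica in seconds (None if not a replica)."""
    with psycopg2.connect(replica_dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT extract(epoch FROM now() - pg_last_xact_replay_timestamp());"
            )
            lag = cur.fetchone()[0]
    return lag

# Hypothetical DSN; wire the result into your alerting pipeline
lag = replica_lag_seconds("host=replica1 dbname=myapp user=monitor")
if lag is not None and lag > LAG_THRESHOLD_SECONDS:
    print(f"WARN: replica lag {lag:.0f}s exceeds {LAG_THRESHOLD_SECONDS}s")
```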
Prometheus Exporters
Prometheus is the dominant metrics collection system for infrastructure monitoring. Database exporters translate database metrics into Prometheus format.
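The community exporters below cover the standard server metrics. For anything application-specific, the official prometheus_client package (an assumption here; the metric name and port are hypothetical) makes it straightforward to expose your own gauge. A minimal sketch:

```python
import random  # stands in for a real database query
import time

from prometheus_client import Gauge, start_http_server  # assumption: prometheus_client installed

# Gauge for a custom metric the standard exporter does not cover
queue_depth = Gauge(
    "myapp_job_queue_depth",
    "Number of pending jobs in the application job table",
)

if __name__ == "__main__":
    start_http_server(9188)  # scrape target at :9188/metrics
    while True:
        # In practice, run a SQL count against the job table here
        queue_depth.set(random.randint(0, 50))
        time.sleep(15)
```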
PostgreSQL Exporter
```yaml
# docker-compose.yml for PostgreSQL monitoring
version: '3'
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: myapp
      POSTGRES_USER: user          # matches the exporter DSN below
      POSTGRES_PASSWORD: password  # required by the postgres image
    ports:
      - "5432:5432"
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/myapp?sslmode=disable"
    ports:
      - "9187:9187"
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
```

```yaml
# prometheus.yml scrape configuration
scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
    metrics_path: /metrics
```
MySQL Exporter
```yaml
mysql_exporter:
  image: prom/mysqld-exporter:v0.15.1
  environment:
    DATA_SOURCE_NAME: "exporter:password@(mysql:3306)/"
  ports:
    - "9104:9104"
```
Grafana Dashboards
Grafana visualizes Prometheus metrics in configurable dashboards. The community provides pre-built dashboards for most databases.
Dashboard Sections
A good database dashboard includes these panels:
1. **Query Performance**: Top N queries by duration, calls, and rows read.
2. **Resource Usage**: CPU, memory, disk I/O, network throughput.
3. **Slow Queries**: Count of queries exceeding a latency threshold over time.
4. **Error Rate**: Database errors (deadlocks, constraint violations, connection timeouts).
```json
// Grafana panel JSON (simplified)
{
  "title": "Query Latency (p99)",
  "type": "graph",
  "targets": [
    {
      "expr": "pg_stat_statements_p99_ms{datname='myapp'}",
      "legendFormat": "{{queryid}}"
    }
  ],
  "yaxes": [
    { "label": "milliseconds", "format": "ms" }
  ],
  "thresholds": [
    { "value": 100, "op": "gt", "color": "orange" },
    { "value": 1000, "op": "gt", "color": "red" }
  ]
}
```
Alert Thresholds
Setting the right alert thresholds is harder than it sounds. Too many alerts cause alert fatigue. Too few alerts miss critical issues.
Threshold Strategy
**Static thresholds**: Fixed values based on known capacity. Simple to reason about, but they do not adapt to changing usage patterns. Typical starting points (a minimal evaluator is sketched after the list):
Alert if:
- Active connections > 80% of max_connections
- Replication lag > 60 seconds
- Disk usage > 85%
- Cache hit ratio < 97%
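A minimal sketch of evaluating these static rules, assuming the raw values have already been collected (all metric values here are hypothetical placeholders):

```python
# Static alert rules: (name, current value, threshold, comparison)
STATIC_RULES = [
    ("connection_usage_pct", 84.0, 80.0, "gt"),
    ("replication_lag_seconds", 12.0, 60.0, "gt"),
    ("disk_usage_pct", 71.0, 85.0, "gt"),
    ("cache_hit_ratio_pct", 99.1, 97.0, "lt"),
]

def evaluate_static_rules(rules):
    """Yield the names of rules whose threshold is breached."""
    for name, value, threshold, op in rules:
        breached = value > threshold if op == "gt" else value < threshold
        if breached:
            yield name

for alert in evaluate_static_rules(STATIC_RULES):
    print(f"ALERT: {alert}")
```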
**Dynamic thresholds**: Alert when metrics deviate from their historical baseline. Better for bursty workloads.
```python
# Dynamic threshold calculation
import statistics

def calculate_dynamic_threshold(hourly_values, deviation_factor=2.0):
    """Alert bounds set N standard deviations from the historical mean.

    hourly_values: a sequence of historical samples for one metric,
    e.g. several weeks of QPS readings for this hour of day.
    """
    mean = statistics.mean(hourly_values)
    std = statistics.stdev(hourly_values)
    upper_threshold = mean + (std * deviation_factor)
    lower_threshold = max(0, mean - (std * deviation_factor))
    return upper_threshold, lower_threshold
```
**Rate-of-change thresholds**: Alert on rapid changes, not absolute values. Useful for detecting anomalies.
Alert if:
- QPS change > 300% compared to same time last week
- Replication lag increased by 10 seconds in 1 minute
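A sketch of the week-over-week comparison, assuming you can fetch the metric's value for an arbitrary timestamp (the `fetch_metric` callable is hypothetical, e.g. backed by a Prometheus range query):

```python
from datetime import datetime, timedelta

def qps_change_vs_last_week(fetch_metric, now=None):
    """Percent change in QPS versus the same time one week ago.

    fetch_metric(timestamp) is assumed to return the metric value
    recorded at that time.
    """
    now = now or datetime.utcnow()
    current = fetch_metric(now)
    baseline = fetch_metric(now - timedelta(weeks=1))
    if not baseline:
        return None  # no baseline to compare against
    return 100.0 * (current - baseline) / baseline

# Alert if QPS changed by more than 300% week over week:
# change = qps_change_vs_last_week(fetch_metric)
# if change is not None and abs(change) > 300: raise_alert()
```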
Alert Severity Levels
| Severity | Response Time | Example |
|----------|---------------|---------|
| Critical | Immediate (24x7) | Database down, replication broken |
| High | 1 hour | p99 latency > 1 second, disk > 90% |
| Medium | 1 day | Cache hit ratio trending down |
| Low | 1 week | Database version EOL approaching |
Reducing Alert Fatigue
1. **Implement hysteresis**: Require the condition to persist for N minutes before alerting (see the sketch after this list).
2. **Suppress during maintenance**: Automatically suppress alerts during scheduled windows.
3. **Group related alerts**: A single "database slow" alert is better than 50 individual slow-query alerts.
4. **Auto-resolve**: Clear alerts when the metric returns to normal.
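A minimal sketch of hysteresis: the breach must hold continuously for the whole window before the alert fires (class and function names are hypothetical):

```python
from datetime import datetime, timedelta

class HysteresisAlert:
    """Fire only after the condition has held for a sustained period."""

    def __init__(self, hold_for_minutes=5):
        self.hold_for = timedelta(minutes=hold_for_minutes)
        self.breach_started = None

    def observe(self, condition_breached, now=None):
        """Return True only when the breach has persisted long enough."""
        now = now or datetime.utcnow()
        if not condition_breached:
            self.breach_started = None  # condition cleared; reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now  # first breach sample; start the timer
        return now - self.breach_started >= self.hold_for

# Usage: feed one sample per scrape interval
# alert = HysteresisAlert(hold_for_minutes=5)
# if alert.observe(latency_p99_ms > 1000): page_oncall()
```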
Database-Specific Monitoring Tools
pg_stat_statements (PostgreSQL)
The pg_stat_statements extension tracks query execution statistics. It identifies which queries consume the most resources.
```sql
-- Enable the extension (requires pg_stat_statements in
-- shared_preload_libraries and a server restart)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the most expensive queries
SELECT
    query,
    mean_exec_time AS avg_ms,
    calls,
    rows,
    100.0 * shared_blks_hit
        / nullif(shared_blks_hit + shared_blks_read, 0) AS cache_hit_pct
FROM pg_stat_statements
WHERE query NOT LIKE '%pg_%'
ORDER BY total_exec_time DESC
LIMIT 10;
```
Performance Schema (MySQL)
MySQL Performance Schema provides detailed instrumentation of query execution.
```sql
-- Top queries by total latency (timer columns are in picoseconds,
-- so dividing by 10^9 yields milliseconds)
SELECT
    digest_text AS query,
    count_star AS execution_count,
    round(sum_timer_wait / 1000000000, 2) AS total_latency_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;
```
Conclusion
Database monitoring is a continuous practice, not a one-time setup. Track QPS, latency, connections, cache hit ratio, and replication lag. Collect these metrics with Prometheus exporters, visualize them in Grafana dashboards, and set intelligent alert thresholds. Start with static thresholds for critical metrics and move to dynamic thresholds as you collect more data. The goal is to detect problems before users do.