Database monitoring tracks the health, performance, and availability of database systems. Effective monitoring catches problems before they become incidents. This article covers the key metrics to track, tools for collection and visualization, and strategies for setting meaningful alert thresholds.


Key Database Metrics


Queries Per Second (QPS)


QPS measures the throughput of your database. Monitoring QPS helps you understand traffic patterns and capacity requirements.


  • **Spike detection**: Sudden QPS increases may indicate a traffic surge or a runaway query.
  • **Capacity planning**: If QPS consistently approaches your database's limit, it is time to scale.
  • **Baseline shifts**: A permanent QPS increase suggests organic growth that may require infrastructure changes.

    -- MySQL: cumulative query counter (QPS = delta between two samples / elapsed seconds)
    SHOW GLOBAL STATUS LIKE 'Questions';

    -- PostgreSQL: cumulative transaction counter (sample twice to derive QPS;
    -- there is no built-in queries_per_second() function)
    SELECT xact_commit + xact_rollback AS transactions
    FROM pg_stat_database
    WHERE datname = 'your_database';
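
Both statements return cumulative counters, so QPS is the difference between two samples divided by the elapsed time. A minimal sketch, assuming psycopg2 and the PostgreSQL database above:

    # Estimate QPS by sampling the cumulative transaction counter twice
    import time
    import psycopg2

    def sample_qps(conn, interval_s=10):
        """Return transactions per second over a short sampling window."""
        sql = """
            SELECT xact_commit + xact_rollback
            FROM pg_stat_database
            WHERE datname = current_database()
        """
        with conn.cursor() as cur:
            cur.execute(sql)
            first = cur.fetchone()[0]
            time.sleep(interval_s)
            cur.execute(sql)
            second = cur.fetchone()[0]
        return (second - first) / interval_s

    conn = psycopg2.connect("dbname=your_database")  # adjust to your setup
    print(f"{sample_qps(conn):.1f} tx/s")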
    
    

Query Latency


Latency measures how long queries take to execute. Latency spikes degrade user experience and are often the first visible sign of trouble.


**Percentile metrics matter more than averages**: p50 (median) shows typical performance; p99 shows the worst-performing queries. A 10ms p50 with a 30-second p99 means 1% of users have a terrible experience.


    
    -- PostgreSQL: slowest queries by worst-case execution time
    -- (pg_stat_statements has no percentile columns; max_exec_time is
    -- the closest built-in proxy for tail latency)
    SELECT
        queryid,
        query,
        mean_exec_time as avg_ms,
        max_exec_time as max_ms,
        calls
    FROM pg_stat_statements
    ORDER BY max_exec_time DESC
    LIMIT 10;
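
True percentiles usually come from client-side instrumentation or a Prometheus histogram rather than from the database itself. A PromQL sketch, assuming the application exports a hypothetical db_query_duration_seconds histogram:

    # p99 query latency over the last 5 minutes
    histogram_quantile(
      0.99,
      sum(rate(db_query_duration_seconds_bucket[5m])) by (le)
    )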
    
    

Active Connections


Connection count indicates how many clients are simultaneously connected. Running out of connections causes application failures.


  • **Connection pool saturation**: When connection count reaches the max_pool limit, new connections wait or fail.
  • **Active vs idle**: Many idle connections alongside few active ones indicate an oversized or misconfigured connection pool.
  • **Connection churn**: Rapid connect/disconnect cycles add overhead.

    -- PostgreSQL: Connection counts by state
    SELECT
        state,
        count(*) as connections
    FROM pg_stat_activity
    WHERE backend_type = 'client backend'
    GROUP BY state;

    -- Check for long-lived idle transactions (potential blocking)
    SELECT count(*) FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    AND xact_start < now() - interval '5 minutes';
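
To alert before the server hits its ceiling, compare the current count against max_connections. One way to express this in SQL:

    -- PostgreSQL: connections as a percentage of the configured maximum
    SELECT count(*) * 100.0 / current_setting('max_connections')::int AS pct_of_max
    FROM pg_stat_activity;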
    
    

Cache Hit Ratio


Cache hit ratio measures how often the database finds data in memory versus reading from disk. Higher is better.


  • **Buffer cache hit ratio**: 99%+ is healthy for most workloads. Below 95% usually means the working set does not fit in memory.
  • **Index cache hit ratio**: Especially important for databases with large indexes.

    -- PostgreSQL: Buffer cache hit ratio
    SELECT
        sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0) * 100 as hit_ratio
    FROM pg_stat_database
    WHERE datname = 'your_database';

    -- MySQL: InnoDB buffer pool counters
    -- hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
    
    

Disk I/O


Disk I/O metrics reveal storage performance issues.


  • **IOPS**: Input/output operations per second. Cloud databases have IOPS limits based on volume size and type.
  • **Throughput**: Data read/written per second.
  • **Latency**: Time per I/O operation. Under 10ms is healthy; sustained latency over 50ms indicates problems (see the PromQL sketch below).
  • **Queue depth**: Number of I/O operations waiting. High queue depth suggests the disk subsystem is overwhelmed.
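
Databases rarely expose device-level I/O latency themselves; node_exporter is the usual source. A sketch using node_exporter's standard disk metrics:

    # Average read latency per I/O operation over 5 minutes
    rate(node_disk_read_time_seconds_total[5m])
      / rate(node_disk_reads_completed_total[5m])
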

Replication Lag


For databases with read replicas, replication lag measures how far behind replicas are from the primary.


    
    -- PostgreSQL: Replication lag per replica (run on the primary)
    SELECT
        application_name,
        pg_size_pretty(pg_wal_lsn_diff(
            pg_current_wal_lsn(),
            replay_lsn
        )) as lag,
        replay_lag as lag_time
    FROM pg_stat_replication;

    -- PostgreSQL: time lag as seen from a replica
    SELECT now() - pg_last_xact_replay_timestamp() as lag_time;

    -- MySQL: Replication lag (SHOW SLAVE STATUS before 8.0.22)
    SHOW REPLICA STATUS\G
    -- Look at Seconds_Behind_Source (formerly Seconds_Behind_Master)
    
    

Prometheus Exporters


Prometheus is the dominant metrics collection system for infrastructure monitoring. Database exporters translate database metrics into Prometheus format.


PostgreSQL Exporter


    
    # docker-compose.yml for PostgreSQL monitoring
    version: '3'
    services:
      postgres:
        image: postgres:16
        environment:
          POSTGRES_DB: myapp
          POSTGRES_PASSWORD: password  # required by the official image
        ports:
          - "5432:5432"

      postgres_exporter:
        image: prometheuscommunity/postgres-exporter:v0.15.0
        environment:
          DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/myapp?sslmode=disable"
        ports:
          - "9187:9187"

      prometheus:
        image: prom/prometheus:v2.53.0
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
        ports:
          - "9090:9090"
    
    

    
    # prometheus.yml scrape configuration
    scrape_configs:
      - job_name: 'postgres'
        static_configs:
          - targets: ['postgres_exporter:9187']
        metrics_path: /metrics
    
    

MySQL Exporter


    
      mysql_exporter:
        image: prom/mysqld-exporter:v0.14.0
        # v0.15+ drops DATA_SOURCE_NAME in favor of command-line flags
        environment:
          DATA_SOURCE_NAME: "exporter:password@(mysql:3306)/"
        ports:
          - "9104:9104"
    
    

Grafana Dashboards


Grafana visualizes Prometheus metrics in configurable dashboards. The community provides pre-built dashboards for most databases.


Dashboard Sections


A good database dashboard includes these panels:


  1. **Overview**: QPS, active connections, cache hit ratio, replication lag.
  2. **Query Performance**: Top N queries by duration, calls, and rows read.
  3. **Resource Usage**: CPU, memory, disk I/O, network throughput.
  4. **Slow Queries**: Count of queries exceeding a latency threshold over time.
  5. **Error Rate**: Database errors (deadlocks, constraint violations, connection timeouts).


    
    // Grafana panel JSON (simplified; pg_stat_statements_p99_ms is an
    // illustrative metric name, actual names depend on your exporter)
    {
      "title": "Query Latency (p99)",
      "type": "graph",
      "targets": [
        {
          "expr": "pg_stat_statements_p99_ms{datname='myapp'}",
          "legendFormat": "{{queryid}}"
        }
      ],
      "yaxes": [
        { "label": "milliseconds", "format": "ms" }
      ],
      "thresholds": [
        { "value": 100, "op": "gt", "color": "orange" },
        { "value": 1000, "op": "gt", "color": "red" }
      ]
    }
    
    

Alert Thresholds


Setting the right alert thresholds is harder than it sounds. Too many alerts cause alert fatigue. Too few alerts miss critical issues.


Threshold Strategy


**Static thresholds**: Fixed values based on capacity. Simple but do not adapt to usage patterns.


    
    Alert if:
      - Active connections > 80% of max_connections
      - Replication lag > 60 seconds
      - Disk usage > 85%
      - Cache hit ratio < 97%
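
Expressed as a Prometheus alerting rule, a static threshold looks like this (a sketch; pg_replication_lag assumes the postgres-exporter metric name):

    # alert-rules.yml
    groups:
      - name: database
        rules:
          - alert: ReplicationLagHigh
            expr: pg_replication_lag > 60   # seconds
            for: 5m                         # must persist before firing
            labels:
              severity: high
            annotations:
              summary: "Replication lag above 60s on {{ $labels.instance }}"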
    
    

**Dynamic thresholds**: Alert when metrics deviate from their historical baseline. Better for bursty workloads.


    
    # Dynamic threshold calculation
    import statistics

    def calculate_dynamic_threshold(hourly_metric, deviation_factor=2.0):
        """Alert when the metric deviates by more than N standard deviations."""
        mean = statistics.mean(hourly_metric)
        std = statistics.stdev(hourly_metric)
        upper_threshold = mean + (std * deviation_factor)
        lower_threshold = max(0, mean - (std * deviation_factor))
        return upper_threshold, lower_threshold
    
    

**Rate-of-change thresholds**: Alert on rapid changes, not absolute values. Useful for detecting anomalies.


    
    Alert if:
      - QPS change > 300% compared to same time last week
      - Replication lag increased by 10 seconds in 1 minute
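
In PromQL, week-over-week comparison uses the offset modifier (a sketch; pg_stat_database_xact_commit assumes the postgres-exporter counter name):

    # Fire when QPS is more than 3x the same time last week
    sum(rate(pg_stat_database_xact_commit[5m]))
      > 3 * sum(rate(pg_stat_database_xact_commit[5m] offset 1w))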
    
    

Alert Severity Levels


| Severity | Response Time    | Example                            |
|----------|------------------|------------------------------------|
| Critical | Immediate (24/7) | Database down, replication broken  |
| High     | 1 hour           | P99 latency > 1 second, disk > 90% |
| Medium   | 1 day            | Cache hit ratio trending down      |
| Low      | 1 week           | Database version EOL approaching   |


Reducing Alert Fatigue


  1. **Use multi-condition alerts**: Alert only when A and B are both true. Example: latency > 1s AND QPS > baseline (see the sketch below).
  2. **Implement hysteresis**: Require the condition to persist for N minutes before alerting.
  3. **Suppress during maintenance**: Automatically suppress alerts during scheduled windows.
  4. **Group related alerts**: A single "database slow" alert is better than 50 individual slow-query alerts.
  5. **Auto-resolve**: Clear alerts when the metric returns to normal.
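
A multi-condition Prometheus rule combines expressions with the and operator (a sketch with hypothetical metric names):

    # Fire only when tail latency is high AND traffic exceeds baseline
    (db_query_latency_p99_seconds > 1)
      and on() (sum(rate(db_queries_total[5m])) > 500)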


Database-Specific Monitoring Tools


pg_stat_statements (PostgreSQL)


The pg_stat_statements extension tracks query execution statistics. It identifies which queries consume the most resources.


    
    -- Enable the extension (requires pg_stat_statements in
    -- shared_preload_libraries and a server restart)
    CREATE EXTENSION pg_stat_statements;

    -- Find the most expensive queries
    SELECT
        query,
        total_exec_time / calls as avg_ms,
        calls,
        rows,
        100.0 * shared_blks_hit
            / nullif(shared_blks_hit + shared_blks_read, 0) as cache_hit_pct
    FROM pg_stat_statements
    WHERE query NOT LIKE '%pg_%'
    ORDER BY total_exec_time DESC
    LIMIT 10;
    
    

Performance Schema (MySQL)


MySQL Performance Schema provides detailed instrumentation of query execution.


    
    -- Top queries by total latency
    -- (sum_timer_wait is in picoseconds; divide by 1e9 for milliseconds)
    SELECT
        digest_text as query,
        count_star as execution_count,
        round(sum_timer_wait / 1000000000, 2) as total_latency_ms
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY sum_timer_wait DESC
    LIMIT 10;
    
    

Conclusion


Database monitoring is a continuous practice, not a one-time setup. Track QPS, latency, connections, cache hit ratio, and replication lag. Collect these metrics with Prometheus exporters, visualize them in Grafana dashboards, and set intelligent alert thresholds. Start with static thresholds for critical metrics and move to dynamic thresholds as you collect more data. The goal is to detect problems before users do.