# Database Capacity Planning: Sizing, Growth Forecasting, and Scaling
Capacity planning ensures your database has enough resources to handle current and future workloads without over-provisioning. It is a continuous process that combines monitoring, forecasting, and proactive scaling.
## Key Capacity Metrics

### Storage
Storage is the most predictable resource to plan. Track:
```sql
-- Database sizes
SELECT datname,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Table sizes (top 10)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;

-- Growth by day
SELECT date(created_at) AS day,
       count(*) AS rows_added,
       count(*) * 200 AS estimated_bytes  -- rough per-row estimate
FROM orders
WHERE created_at > now() - interval '30 days'
GROUP BY day
ORDER BY day;
```
### Compute (CPU)
CPU usage correlates with query complexity and concurrency:
```sql
-- Queries with the highest total execution time (a proxy for CPU)
SELECT queryid,
       query,
       total_exec_time,
       calls,
       mean_exec_time,
       rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```
Monitor: CPU utilization %, replication CPU usage, autovacuum CPU usage.
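A common way to act on these metrics is sustained-utilization alerting rather than reacting to single spikes. A minimal sketch, assuming one CPU sample per minute and the illustrative 80%-for-5-minutes rule:

```python
from collections import deque

def make_cpu_alert(threshold_pct=80.0, window=5):
    """Return a recorder that flags CPU pressure only when *every*
    sample in the last `window` readings exceeds the threshold,
    filtering out short spikes. Assumes one sample per minute."""
    samples = deque(maxlen=window)

    def record(cpu_pct):
        samples.append(cpu_pct)
        return len(samples) == window and min(samples) > threshold_pct

    return record

alert = make_cpu_alert()
readings = [70, 85, 90, 88, 92, 95]  # hypothetical per-minute samples
results = [alert(r) for r in readings]
# Only the final reading completes a full window above 80%
```

Requiring the whole window above the threshold trades a few minutes of latency for far fewer false alarms from momentary bursts.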
### Memory
```sql
-- Memory-related settings
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size',
               'work_mem', 'maintenance_work_mem');
```
```sql
-- Cache hit ratio (should be >99%)
SELECT 'shared_buffers' AS area,
       sum(blks_hit)::float
         / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS hit_ratio
FROM pg_stat_database;
```
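The same check is easy to run in application code. A minimal sketch, where `blks_hit` and `blks_read` are the sums returned by the query above:

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from shared buffers.
    Returns 1.0 for an idle database to avoid dividing by zero."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0

# 995,000 buffer hits vs 5,000 disk reads -> 99.5%, above the 99% target
ratio = cache_hit_ratio(995_000, 5_000)
```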
### Connections
```sql
SELECT max_conn.setting AS max_connections,
       used_conn.count AS used_connections,
       round(100.0 * used_conn.count / max_conn.setting::int, 1) AS utilization_pct
FROM (SELECT setting FROM pg_settings WHERE name = 'max_connections') max_conn,
     (SELECT count(*) AS count FROM pg_stat_activity) used_conn;
```
## Growth Forecasting

### Simple Linear Model
```python
import numpy as np
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# Get daily row counts for the last 90 days
cur.execute("""
    SELECT date(created_at) AS day, count(*) AS rows
    FROM orders
    WHERE created_at > now() - interval '90 days'
    GROUP BY day
    ORDER BY day
""")
data = cur.fetchall()

days = np.array([(row[0] - data[0][0]).days for row in data])
rows = np.array([row[1] for row in data])

# Linear regression: the slope is the average rows added per day
coefficients = np.polyfit(days, rows, 1)
daily_growth = coefficients[0]

# Forecast: 90 days out
forecast_days = 90
current_total = sum(rows)  # rows added during the observed window
forecast_total = current_total + daily_growth * forecast_days

print(f"Daily growth: {daily_growth:.0f} rows")
print(f"Rows added in the last 90 days: {current_total}")
print(f"Projected 90 days out: {forecast_total:.0f} rows")
```
### Projecting Storage
```python
# Continues from the forecast above (uses daily_growth)
avg_row_size_bytes = 250  # From pgstattuple
bytes_per_day = daily_growth * avg_row_size_bytes
gb_per_month = bytes_per_day * 30 / (1024 ** 3)

current_gb = 10  # Current database size
months_to_full = (50 - current_gb) / gb_per_month  # Assuming a 50 GB limit

print(f"Growth: {gb_per_month:.1f} GB/month")
print(f"Time to 50 GB: {months_to_full:.0f} months")
```
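The same arithmetic is worth wrapping as a reusable helper so it can run for every table or database. A sketch, keeping the illustrative 250-byte row size and 50 GB ceiling from above:

```python
def months_until_full(current_gb: float, limit_gb: float,
                      daily_rows: float, avg_row_bytes: int = 250) -> float:
    """Months of headroom remaining, assuming linear row growth.
    Returns float('inf') if the database is not growing."""
    gb_per_month = daily_rows * avg_row_bytes * 30 / (1024 ** 3)
    if gb_per_month <= 0:
        return float("inf")
    return (limit_gb - current_gb) / gb_per_month

# ~430,000 new rows/day at 250 bytes each is roughly 3 GB/month,
# so a 10 GB database hits a 50 GB limit in about 13 months
headroom = months_until_full(current_gb=10, limit_gb=50, daily_rows=430_000)
```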
## Scaling Strategies

### Vertical Scaling (Scale Up)
Increase the size of the existing database instance:
```bash
# AWS RDS
aws rds modify-db-instance \
    --db-instance-identifier mydb \
    --db-instance-class db.r6g.xlarge \
    --apply-immediately
```

Or via Terraform:

```hcl
resource "aws_db_instance" "mydb" {
  instance_class    = "db.r6g.xlarge"  # Was db.r6g.large
  allocated_storage = 200              # Was 100
}
```
**Pros**: Simple, no application changes. **Cons**: Instance size caps, downtime for some changes, and scaling cannot continue indefinitely.
### Horizontal Scaling (Scale Out)
Add read replicas for read workloads:
```bash
# Add read replica
aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-replica-1 \
    --source-db-instance-identifier mydb \
    --db-instance-class db.r6g.large
```
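A replica only helps if the application sends reads to it. A minimal read/write-splitting sketch; the `primary` and `replicas` values stand in for real connection handles, and the SQL-prefix check is deliberately naive:

```python
import itertools

class ReadWriteRouter:
    """Send writes to the primary, round-robin reads across replicas.
    Replicas are eventually consistent, so reads that must observe a
    preceding write should still target the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql: str):
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        return self.primary

router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
router.route("SELECT * FROM orders")    # -> a replica
router.route("INSERT INTO orders ...")  # -> the primary
```

Production setups usually delegate this to a pooler or proxy (e.g. pgbouncer plus application-level routing), but the routing decision is the same.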
For write scaling, consider sharding (Citus):
```sql
-- Citus distributed table
SELECT create_distributed_table('orders', 'user_id');

-- Queries automatically route to the correct shard
SELECT * FROM orders WHERE user_id = 42;
```
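Citus handles the routing for you; conceptually it is deterministic hash-based placement, as in this sketch (the 32-shard count and SHA-256 choice are illustrative, not Citus internals):

```python
import hashlib

def shard_for(user_id: int, num_shards: int = 32) -> int:
    """Map a distribution key to a shard deterministically.
    A stable hash (not Python's per-process-randomized hash())
    keeps the mapping identical across processes and restarts."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every query for user 42 lands on the same shard
shard = shard_for(42)
```

Because all rows for one `user_id` live on one shard, single-user queries touch one node; queries without the distribution key must fan out to every shard.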
## Autoscaling Configuration

### Aurora Serverless
```yaml
# Automatic scaling based on load
Resources:
  MyDBCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      ServerlessV2ScalingConfiguration:
        MinCapacity: 0.5
        MaxCapacity: 64
```
### Storage Autoscaling
```bash
# Enable storage autoscaling on RDS
aws rds modify-db-instance \
    --db-instance-identifier mydb \
    --max-allocated-storage 1000 \
    --apply-immediately
```
## Monitoring Dashboard
Build a capacity monitoring dashboard with these key metrics:
| Metric | Alert Threshold | Action |
|--------|-----------------|--------|
| Storage utilization | >80% | Increase storage or archive data |
| CPU utilization | >80% for 5 minutes | Scale up or optimize queries |
| Connection utilization | >80% | Increase max_connections or add a pooler |
| Replication lag | >60 seconds | Investigate replica or network |
| IOPS utilization | >80% of provisioned | Increase provisioned IOPS |
| Cache hit ratio | <99% | Increase shared_buffers |
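These thresholds can be evaluated mechanically in an alerting job. A sketch; metric names and readings are illustrative, and note that the cache-hit rule fires when the value drops *below* its limit:

```python
# limit, alert_above: True = alert when the value exceeds the limit
THRESHOLDS = {
    "storage_utilization_pct": (80, True),
    "cpu_utilization_pct": (80, True),
    "connection_utilization_pct": (80, True),
    "replication_lag_s": (60, True),
    "cache_hit_ratio_pct": (99, False),  # alert when below
}

def evaluate(readings: dict) -> list:
    """Return one human-readable alert line per breached threshold."""
    alerts = []
    for metric, value in readings.items():
        limit, alert_above = THRESHOLDS[metric]
        breached = value > limit if alert_above else value < limit
        if breached:
            alerts.append(f"{metric}={value} breaches limit {limit}")
    return alerts

alerts = evaluate({"cpu_utilization_pct": 91, "cache_hit_ratio_pct": 99.4})
# -> only the CPU reading is flagged
```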
## Seasonal Capacity
Many workloads are not uniform. Plan for seasonal peaks:
```yaml
# Example: Black Friday capacity plan
# Normal: 4 vCPU, 16 GB RAM, 1000 connections
# Black Friday: 16 vCPU, 64 GB RAM, 5000 connections

# Pre-scale 2 weeks before
- date: 2026-11-10
  action: scale_up
  target: db.r6g.4xlarge

# Post-scale 1 week after
- date: 2026-12-07
  action: scale_down
  target: db.r6g.xlarge
```
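A scheduler consuming such a plan only needs to find the most recent action on or before today. A sketch; the dates and instance classes mirror the illustrative plan above:

```python
from datetime import date

# (effective date, target instance class), matching the plan above
PLAN = [
    (date(2026, 11, 10), "db.r6g.4xlarge"),  # pre-scale for Black Friday
    (date(2026, 12, 7), "db.r6g.xlarge"),    # scale back down
]

def target_class(today: date, default: str = "db.r6g.xlarge") -> str:
    """Instance class in effect on `today`: the latest plan entry
    whose date is not in the future, else the default."""
    target = default
    for when, instance_class in sorted(PLAN):
        if when <= today:
            target = instance_class
    return target

target_class(date(2026, 11, 27))  # peak period -> "db.r6g.4xlarge"
```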
## The Planning Cycle

1. **Monitor**: Continuously collect metrics.
2. **Analyze**: Identify trends and anomalies.
3. **Forecast**: Project future resource needs.
4. **Plan**: Schedule scaling or optimization.
5. **Execute**: Apply changes during maintenance windows.
6. **Review**: Validate that changes had the expected effect.
Capacity planning is not a one-time exercise. Review your database capacity monthly, adjust forecasts quarterly, and always maintain headroom for unexpected traffic spikes. Over-provision by 20-30% for production databases to absorb traffic surges without degradation.
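The 20-30% headroom rule translates directly into a sizing calculation. A sketch, assuming you size from observed peak usage and round up to whole vCPUs (real instance classes are coarser still):

```python
import math

def provisioned_capacity(peak_usage: float, headroom: float = 0.25) -> float:
    """Capacity to provision so the observed peak still leaves
    `headroom` (20-30% recommended) as a safety margin."""
    return peak_usage * (1 + headroom)

def vcpus_to_provision(peak_vcpus: float, headroom: float = 0.25) -> int:
    # Round up: you cannot provision a fraction of a vCPU
    return math.ceil(provisioned_capacity(peak_vcpus, headroom))

vcpus_to_provision(13)  # 13 vCPU peak with 25% headroom -> 17 vCPUs
```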