A robust monitoring and alerting system is the backbone of reliable production infrastructure. Without it, you are flying blind -- discovering outages only when users complain. This guide covers setting up a complete monitoring stack and designing effective alert rules.
## The Four Golden Signals

Google's SRE book defines four key metrics for user-facing systems:

1. **Latency** -- Time taken to serve a request, tracked separately for successful and failed requests.
2. **Traffic** -- Request rate (RPS, QPS) or throughput.
3. **Errors** -- Rate of failed requests (5xx, timeouts, explicit error responses).
4. **Saturation** -- How full the service is (CPU, memory, queue depth).

Every monitoring system should capture these four signals for each service.
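As a sketch, the four signals map to PromQL queries roughly like these (the metric names `http_request_duration_seconds` and `http_requests_total` match the instrumentation used elsewhere in this guide; adjust them to your own):

```promql
# Latency: p99 over the last 5 minutes
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: CPU usage derived from node_exporter
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
```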
## Metrics Collection Stack

The Prometheus ecosystem has become the standard for metrics collection:

```
Application ──→ Metrics Export ──→ Prometheus ──→ Grafana
                      ↑                 ↓
                Node Exporter      Alertmanager
                      ↑                 ↓
               System Metrics     Notification
                                    Channels
```

Install Prometheus and configure it to scrape targets:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:3000']
```
A `scrape_interval` of 15 seconds works for most metrics. For high-cardinality metrics (e.g., anything with per-request label values), use a longer interval or sample.
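One way to keep cardinality in check is to drop known-noisy series at scrape time with `metric_relabel_configs`. A sketch (the metric name pattern here is a hypothetical example):

```yaml
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['localhost:3000']
    metric_relabel_configs:
      # Drop a hypothetical per-request metric family that would explode cardinality
      - source_labels: [__name__]
        regex: 'http_request_trace_.*'
        action: drop
```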
## Application Instrumentation

Export application metrics in Prometheus format:
```javascript
// Node.js with prom-client
const client = require('prom-client');

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

// Record metrics in middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || 'unknown', status: res.statusCode });
  });
  next();
});
```
Use `Histogram` metrics for latency, `Counter` for request counts, and `Gauge` for current resource usage. Avoid unbounded label cardinality.
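It helps to understand that Prometheus histograms store *cumulative* bucket counts: each observation increments every bucket whose upper bound covers it, plus an implicit `+Inf` bucket, a running sum, and a total count. A minimal sketch of that mechanic in plain JavaScript (not using prom-client, just illustrating the data model):

```javascript
// Sketch of Prometheus-style cumulative histogram buckets.
function makeHistogram(buckets) {
  return {
    buckets,                                        // upper bounds, ascending
    counts: new Array(buckets.length + 1).fill(0),  // last slot is the +Inf bucket
    sum: 0,
    count: 0,
    observe(value) {
      this.sum += value;
      this.count += 1;
      // Cumulative: increment every bucket whose bound is >= value.
      for (let i = 0; i < this.buckets.length; i++) {
        if (value <= this.buckets[i]) this.counts[i] += 1;
      }
      this.counts[this.buckets.length] += 1;        // +Inf always matches
    },
  };
}

const h = makeHistogram([0.01, 0.05, 0.1, 0.5, 1, 5]);
[0.02, 0.07, 0.3, 2].forEach((v) => h.observe(v));
console.log(h.counts); // [0, 1, 2, 3, 3, 4, 4] -- le=0.5 counts everything <= 0.5
console.log(h.count);  // 4
```

This cumulative layout is why `histogram_quantile()` can estimate percentiles from bucket counts alone.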
## Centralized Logging
The ELK stack (Elasticsearch, Logstash, Kibana) remains popular, but the Grafana Loki stack is simpler and cheaper for log aggregation:
```yaml
# docker-compose for Loki + Promtail
services:
  loki:
    image: grafana/loki:3.0
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:3.0
    volumes:
      - /var/log:/var/log
      - ./promtail.yml:/etc/promtail/promtail.yml
    # Point promtail at the mounted config (the image defaults to a different path)
    command: -config.file=/etc/promtail/promtail.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```
Promtail tails log files, adds labels, and ships them to Loki. Grafana queries both Prometheus (metrics) and Loki (logs) in a unified dashboard.
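The `promtail.yml` mounted above might look roughly like this minimal sketch (the paths and label values are assumptions; adapt them to your hosts):

```yaml
# promtail.yml (minimal sketch)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml   # where promtail remembers read offsets
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log
```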
## Effective Alerting Rules
Design alerts that are actionable and meaningful:
```yaml
# prometheus-alerts.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency is {{ $value }}s"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
```
## Alert Design Principles

- **Alert on symptoms, not causes** -- page on user-visible error rate or latency, not on a single host's CPU spike.
- **Every alert must be actionable** -- if there is no response a human should take, it belongs on a dashboard, not in a pager.
- **Use `for` durations** so transient blips do not page anyone.
- **Reserve `critical` for pages**; use `warning` for issues that can wait until working hours.
## Notification Channels
Route alerts through Alertmanager:
```yaml
# alertmanager.yml
route:
  receiver: 'team-page'
  routes:
    - match:
        severity: warning
      receiver: 'team-slack'

receivers:
  - name: 'team-page'
    pagerduty_configs:
      - routing_key: '...'
  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
```
Critical alerts go to PagerDuty or Opsgenie for immediate attention. Warnings go to Slack for team awareness.
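To further reduce noise, the route can batch related alerts and throttle repeat notifications. A sketch of the common Alertmanager knobs (the `service` label is an assumption about your labeling scheme):

```yaml
route:
  receiver: 'team-page'
  group_by: ['alertname', 'service']  # batch related alerts into one notification
  group_wait: 30s                     # wait briefly for more alerts before the first notice
  group_interval: 5m                  # spacing between batches for the same group
  repeat_interval: 4h                 # re-notify if an alert is still firing
```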
## Dashboard Best Practices

Effective Grafana dashboards follow these principles:

- **Lead with the golden signals** -- latency, traffic, errors, and saturation belong at the top.
- **One dashboard per service**, plus a high-level overview dashboard.
- **Use consistent time ranges and units** across panels so graphs are directly comparable.
- **Link panels to logs and runbooks** so a spike leads straight to investigation.
## Synthetic Monitoring

Complement real-user monitoring with synthetic checks:

```yaml
# blackbox_exporter module config
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
      valid_status_codes: [200, 201, 204]
```
Run synthetic checks from multiple geographic locations to detect regional outages.
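On the Prometheus side, blackbox exporter probes are typically wired up with a relabeled scrape job. A sketch (the exporter address and probe target are placeholders):

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com/health']   # URL to probe (placeholder)
    relabel_configs:
      # Pass the target URL as the ?target= parameter ...
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # ... and scrape the exporter itself instead of the target
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'
```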
## Summary
A complete monitoring stack requires metrics (Prometheus), logs (Loki), dashboards (Grafana), and alerting (Alertmanager). Instrument your applications with the four golden signals, design alerts that fire on symptoms not causes, and ensure every alert has a clear path to resolution. Start simple with Prometheus and Grafana, then add log aggregation and synthetic monitoring as your infrastructure grows.