Alerting Strategies for Production Systems
Alerting is the mechanism that converts observability data into human action. A well-designed alerting strategy ensures that the right people are notified at the right time with sufficient context to take effective action. Poor alerting — too many alerts, too few, or poorly targeted ones — degrades operational effectiveness and drives engineer burnout. The goal is not to minimize alerts but to make every alert actionable.
Alert fatigue is the primary symptom of alerting dysfunction. When engineers receive alerts that require no action (stale alerts, self-resolving conditions, duplicate notifications), they learn to ignore them. Eventually, critical alerts get buried in noise. The cure is ruthless alert hygiene: every alert that pages someone must require human judgment or action. If an alert fires and the engineer consistently finds nothing to do, the alert should be deleted or downgraded to a warning.
Threshold tuning is a continuous process, not a one-time configuration. Static thresholds are rarely correct for long — traffic patterns shift, deployments change characteristics, and business cycles create different performance profiles. Dynamic thresholding using historical baselines accounts for predictable patterns (daily seasonality, weekly cycles). An alert should fire when the current metric deviates significantly from the expected value at this time of day, not when it exceeds an arbitrary absolute value.
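As a rough sketch of this approach, the snippet below compares the current sample to the same time-of-day slot over a historical baseline and flags only large deviations. The three-sigma band, the baseline window, and the function name are illustrative assumptions, not a prescription.

```python
from statistics import mean, stdev

def seasonal_threshold_breached(current_value, historical_values, sigma=3.0):
    """Compare the current metric sample to the same time-of-day slot over a
    baseline period (e.g. the last 28 days) and flag it only when it deviates
    well beyond the usual spread. The 3-sigma band is an assumption; tune it
    to your tolerance for false positives."""
    if len(historical_values) < 2:
        return False  # not enough history to build a baseline
    baseline = mean(historical_values)
    spread = stdev(historical_values)
    return abs(current_value - baseline) > sigma * max(spread, 1e-9)

# Example: p95 latency at 14:00 today vs. the 14:00 slot on the past 4 Mondays
# seasonal_threshold_breached(820.0, [410.0, 395.0, 430.0, 402.0])  # -> True
```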
The alert severity hierarchy should be clearly defined and consistently applied. P0 (Critical): customer-facing outage or data loss, requires immediate response, pages an engineer. P1 (High): significant degradation, may become critical if unresolved, pages during business hours. P2 (Medium): non-critical degradation, creates a ticket for next business day review. P3 (Low): informational, logged for trend analysis. Each severity level has different response SLAs, escalation paths, and notification channels.
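One way to keep the hierarchy consistently applied is to encode it as data that the alert router consumes rather than as tribal knowledge. The sketch below mirrors the levels described above; the channel names and field layout are placeholders, not an established schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeverityPolicy:
    ack_sla_minutes: int        # how quickly someone must acknowledge (0 = no paging SLA)
    pages: bool                 # does this level page, or only notify/ticket?
    business_hours_only: bool   # restrict paging to business hours?
    channel: str                # notification target (placeholder names)

# Mirrors the P0-P3 hierarchy described above; routing targets are illustrative.
SEVERITY_POLICIES = {
    "P0": SeverityPolicy(ack_sla_minutes=5,  pages=True,  business_hours_only=False, channel="pager-primary"),
    "P1": SeverityPolicy(ack_sla_minutes=15, pages=True,  business_hours_only=True,  channel="pager-primary"),
    "P2": SeverityPolicy(ack_sla_minutes=0,  pages=False, business_hours_only=True,  channel="ticket-queue"),
    "P3": SeverityPolicy(ack_sla_minutes=0,  pages=False, business_hours_only=True,  channel="trend-log"),
}
```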
Multi-condition alerts reduce false positives. Instead of alerting when CPU exceeds 90%, alert when CPU exceeds 90% AND the 5-minute average latency exceeds the p99 baseline by 50%. The AND condition ensures that the metric matters — the system is showing signs of actual performance degradation, not just a transient spike. Multi-condition alerts are more specific but introduce sensitivity to the condition evaluation window — both conditions must overlap in time.
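A minimal sketch of such a compound condition is shown below, assuming both metric series cover the same five-minute window so the conditions overlap in time; the 90% CPU and 50%-over-baseline thresholds follow the example above, and the function name is hypothetical.

```python
def should_alert(cpu_samples, latency_samples, p99_baseline_ms):
    """Fire only when BOTH conditions hold over the same evaluation window.

    cpu_samples / latency_samples: recent samples covering the same 5-minute
    window, so the two conditions are guaranteed to overlap in time."""
    if not cpu_samples or not latency_samples:
        return False
    cpu_hot = (sum(cpu_samples) / len(cpu_samples)) > 90.0
    latency_degraded = (sum(latency_samples) / len(latency_samples)) > 1.5 * p99_baseline_ms
    return cpu_hot and latency_degraded
```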
Alert duration and evaluation windows prevent flapping. An alert should fire only after the condition has persisted for a minimum duration (e.g., 5 minutes). Brief spikes that self-resolve are common in distributed systems and rarely merit page-outs. Similarly, after an alert fires, it should have a minimum duration before it can re-fire (cooldown). This prevents repeated page-outs for a flapping condition.
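The debouncing logic can be sketched as a small state machine: a "for" timer that must elapse before the alert first fires, and a cooldown that suppresses re-fires. The five-minute and fifteen-minute durations below are illustrative defaults, not recommendations.

```python
import time

class DebouncedAlert:
    """Fire only after the condition has persisted for `for_seconds`, and
    suppress re-fires within `cooldown_seconds` of the previous page."""

    def __init__(self, for_seconds=300, cooldown_seconds=900):
        self.for_seconds = for_seconds
        self.cooldown_seconds = cooldown_seconds
        self._breach_started = None
        self._last_fired = None

    def evaluate(self, condition_true, now=None):
        now = now if now is not None else time.time()
        if not condition_true:
            self._breach_started = None          # condition cleared: reset the "for" timer
            return False
        if self._breach_started is None:
            self._breach_started = now           # breach just started: begin the "for" timer
        persisted = (now - self._breach_started) >= self.for_seconds
        cooled_down = (self._last_fired is None or
                       (now - self._last_fired) >= self.cooldown_seconds)
        if persisted and cooled_down:
            self._last_fired = now
            return True
        return False
```

Because a cleared condition resets the timer, a flapping metric that briefly recovers never accumulates enough persistence to page.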
On-call rotations balance incident response capability with engineer well-being. The optimal rotation length is 7-14 days — long enough for context accumulation, short enough to prevent burnout. Follow-the-sun rotations across global teams provide 24-hour coverage without overnight pages. Secondary on-call provides backup for primary overflow. Incident commander and operations lead roles separate tactical incident management from technical debugging during major incidents.
The escalation path ensures that alerts are never ignored. If the primary on-call does not acknowledge within the acknowledgment timeout (5-10 minutes), the alert escalates to the secondary. If the secondary does not respond, it escalates to the engineering manager or incident commander. The escalation policy should be documented and tested regularly through scheduled drills. Automated escalation ensures that every P0 alert receives attention even when an individual responder is unavailable.
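A simple escalation walk might look like the sketch below; the tier names and the five-minute acknowledgment timeouts are assumptions drawn from the ranges above.

```python
from datetime import datetime, timedelta

# Escalation tiers and acknowledgment timeouts are illustrative assumptions.
ESCALATION_CHAIN = [
    ("primary-oncall",     timedelta(minutes=5)),
    ("secondary-oncall",   timedelta(minutes=5)),
    ("incident-commander", None),  # last tier: no further escalation
]

def current_responder(alert_fired_at, acked, now=None):
    """Walk the escalation chain until someone acknowledges or the chain ends."""
    now = now or datetime.utcnow()
    if acked:
        return None  # acknowledged; no escalation needed
    elapsed = now - alert_fired_at
    for responder, timeout in ESCALATION_CHAIN:
        if timeout is None or elapsed < timeout:
            return responder
        elapsed -= timeout  # this tier's window expired; move to the next
    return ESCALATION_CHAIN[-1][0]
```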
Runbooks accompany critical alerts. Each alert that can trigger a page must have an associated runbook containing: what the alert means, where to find the relevant dashboards and logs, common causes and diagnostic steps, remediation actions, and escalation criteria. Runbooks should be living documents — updated after every incident with lessons learned and after any infrastructure change that affects the alert.
Alert response SLAs define expectations. A P0 alert should be acknowledged within 5 minutes and have initial remediation action within 15 minutes. P1 alerts: acknowledge within 15 minutes, action within 60 minutes during business hours. These SLAs must be realistically achievable — if the team consistently misses SLA targets, either the SLAs or the alerting infrastructure needs adjustment. Alert response metrics should be tracked and reviewed in regular operations reviews.
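Tracking those response metrics can be as simple as computing acknowledgment-delay compliance per severity, as in the hypothetical sketch below; the field names and the p90 choice are assumptions.

```python
from statistics import quantiles

def ack_sla_compliance(ack_delays_minutes, sla_minutes):
    """Return the fraction of alerts acknowledged within the SLA plus the p90
    delay, given per-severity acknowledgment delays in minutes."""
    if not ack_delays_minutes:
        return None
    within = sum(1 for d in ack_delays_minutes if d <= sla_minutes)
    p90 = (quantiles(ack_delays_minutes, n=10)[-1]
           if len(ack_delays_minutes) > 1 else ack_delays_minutes[0])
    return {"compliance": within / len(ack_delays_minutes), "p90_minutes": p90}

# Example: last month's P0 acknowledgment delays against the 5-minute SLA
# ack_sla_compliance([2.5, 4.0, 7.5, 3.1], sla_minutes=5)
```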
Silencing rules provide controlled noise reduction. Scheduled maintenance windows suppress alerts for planned changes. Dependencies allow alert suppression based on upstream alerts — if the database is down, do not page for "product service connection timeout," because the root cause is already covered. Emergency override paths allow a silence to be bypassed when a critical situation demands it. Silencing rules should be temporary and require a documented reason.
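A suppression check combining maintenance windows with dependency-based silencing might look like the following sketch; the data structures (the dependency map, the window tuples, the alert dictionary) are assumptions made for illustration.

```python
from datetime import datetime

def is_silenced(alert, maintenance_windows, firing_alerts, dependency_map, now=None):
    """Suppress an alert during a documented maintenance window or when an
    upstream dependency is already firing.

    maintenance_windows: list of (service, start, end, reason) tuples
    firing_alerts:       set of service names with currently active alerts
    dependency_map:      e.g. {"product-service": ["orders-db"]}"""
    now = now or datetime.utcnow()
    for service, start, end, reason in maintenance_windows:
        if service == alert["service"] and start <= now <= end and reason:
            return True  # planned maintenance with a documented reason
    for upstream in dependency_map.get(alert["service"], []):
        if upstream in firing_alerts:
            return True  # root cause is already paged upstream
    return False
```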