Page on user pain; chart on suspect causes.
Alerting: Symptom vs Cause: page on user pain
Design for the day something breaks.
Burn-rate alerts catch slow-degradation events too.
Pages are expensive: they wake people up. Page only on user-visible symptoms tied to SLOs. Cause-based alerts belong on dashboards, where investigators consult them after a symptom alert fires.
Burn-rate alerts: fast burn (1h) for major incidents, slow burn (24h) for chronic budget consumption.
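As a rough illustration of what "burn rate" measures, the sketch below treats it as the observed error ratio divided by the error budget (1 minus the SLO target). The function names, the 99.9% target, and the 30-day SLO period are assumptions for the example, not values taken from these notes.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are accruing.

    A burn rate of 1.0 spends the whole error budget in exactly one SLO period.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_fraction_consumed(
    rate: float,
    window_hours: float,
    slo_period_hours: float = 30 * 24,  # assumed 30-day SLO period
) -> float:
    """Return the fraction of the period's error budget spent during the window."""
    return rate * window_hours / slo_period_hours


if __name__ == "__main__":
    # Fast burn: 1.44% of requests failing against an assumed 99.9% SLO.
    fast = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(f"fast burn rate: {fast:.1f}")
    print(f"budget spent in 1h: {budget_fraction_consumed(fast, 1):.1%}")

    # Slow burn: 0.2% of requests failing, sustained over a 24h window.
    slow = burn_rate(error_ratio=0.002, slo_target=0.999)
    print(f"slow burn rate: {slow:.1f}")
    print(f"budget spent in 24h: {budget_fraction_consumed(slow, 24):.1%}")
```

A high rate over the short window signals a major incident. A modest rate sustained over the long window signals chronic consumption that would otherwise go unnoticed.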
A two-window alert reduces flapping: fire only when both the 5m and 1h burn rates exceed their thresholds.
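A minimal sketch of the two-window condition, assuming the burn rates for each window have already been computed elsewhere. The function name and the default threshold of 14.4 are assumptions: these notes do not give thresholds, and 14.4 is simply the rate that spends 2% of a 30-day budget in one hour.

```python
def should_page(
    short_window_rate: float,  # burn rate over the last 5 minutes
    long_window_rate: float,   # burn rate over the last 1 hour
    threshold: float = 14.4,   # assumed threshold; tune per SLO
) -> bool:
    """Page only when both windows are burning budget faster than the threshold."""
    return short_window_rate > threshold and long_window_rate > threshold


if __name__ == "__main__":
    # Sustained incident: both windows are hot, so the alert fires.
    print(should_page(short_window_rate=20.0, long_window_rate=16.0))

    # Brief spike: the 5m window is hot but the 1h window is not, so no page.
    print(should_page(short_window_rate=20.0, long_window_rate=3.0))

    # Recovery: the 1h window is still elevated but the 5m window has cooled,
    # so the alert clears promptly instead of lingering.
    print(should_page(short_window_rate=1.0, long_window_rate=16.0))
```

The long window filters out short spikes, and the short window lets the alert reset quickly once the problem stops. Together these two effects cut down on flapping.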
Alert fatigue is a leading reason real outages get missed; tune ruthlessly.
Latency p99 alert flapping.