Module 9 · Protocols, Security, Observability · Day 089 · 25 min

Alerting: Symptom vs Cause

Page on user pain; chart on suspect causes.

Signal path

[Diagram: Symptom (SLO burn) → Page; Causes (CPU, GC, DB) → Dashboard]

Symptom alerts on top, causes underneath

Memory hook

Alerting: Symptom vs Cause: page on user pain

Mental model

design for the day something breaks

Design lens

Burn-rate alerts catch slow-degradation events too.

Recall anchors

Symptom (page) · Cause (chart) · Burn rate

Why it matters

Pages are expensive: they wake people up. Page only on user-visible symptoms tied to SLOs. Cause-based signals belong on dashboards, where investigators consult them after a symptom alert fires.

1. Alert on symptoms (SLO burn), not causes (CPU).
2. Use multi-window burn-rate alerts.
3. Avoid alert fatigue.

Deep dive

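Burn rate is how fast the error budget is being spent, relative to the rate that would use it up exactly at the end of the SLO period. Burn-rate alerts come in two speeds: fast burn (1h window) for major incidents, slow burn (24h window) for chronic budget consumption.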
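The arithmetic behind fast and slow burn can be sketched in a few lines. This is a minimal illustration, not a production rule: the function names `burn_rate` and `budget_consumed`, the 30-day SLO period, the 99.9% target, and the example error ratios are all assumptions chosen for the example.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for 99.9%
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, period_hours: float) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period

# Fast burn: 1.44% errors against a 99.9% SLO, sustained for 1 hour.
fast = burn_rate(error_ratio=0.0144, slo_target=0.999)
# Slow burn: 0.3% errors against the same SLO, sustained for 24 hours.
slow = burn_rate(error_ratio=0.003, slo_target=0.999)

print(round(fast, 1), round(budget_consumed(fast, 1, PERIOD_HOURS), 3))   # 14.4 0.02
print(round(slow, 1), round(budget_consumed(slow, 24, PERIOD_HOURS), 3))  # 3.0 0.1
```

Under these assumptions, an hour at burn rate 14.4 spends about 2% of the monthly budget, while a day at burn rate 3 spends about 10%. The slow case never looks dramatic at any single moment, which is why it needs its own, longer window.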

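A two-window alert reduces flapping: page only when both the 5m and the 1h burn rates exceed the threshold. The long window confirms the problem is significant; the short window confirms it is still happening.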
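A two-window condition might look like the sketch below. The `WindowStats` type, the `should_page` function, the 99.9% target, and the 14.4 threshold are hypothetical choices for illustration; a real system would read these counts from its metrics store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts observed over one lookback window."""
    errors: int
    total: int

    def error_ratio(self) -> float:
        return self.errors / self.total if self.total else 0.0


def should_page(short: WindowStats, long: WindowStats,
                slo_target: float, threshold: float) -> bool:
    """Page only if BOTH windows burn budget faster than `threshold`."""
    budget = 1.0 - slo_target
    short_burn = short.error_ratio() / budget
    long_burn = long.error_ratio() / budget
    return short_burn > threshold and long_burn > threshold


SLO, THRESHOLD = 0.999, 14.4  # assumed values

# Brief spike: the 5m window is hot, the 1h window is not. No page.
spike = should_page(WindowStats(60, 1000), WindowStats(60, 12000), SLO, THRESHOLD)
# Sustained incident: both windows are hot. Page.
sustained = should_page(WindowStats(30, 1000), WindowStats(300, 12000), SLO, THRESHOLD)
# Recovered: the 1h window still remembers the incident, the 5m window is clean. No page.
recovered = should_page(WindowStats(0, 1000), WindowStats(300, 12000), SLO, THRESHOLD)

print(spike, sustained, recovered)  # False True False
```

The third case shows the other benefit of the short window: the alert resets soon after the errors stop, instead of firing until the incident ages out of the long window.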

Alert fatigue is a major reason real outages get missed; tune ruthlessly.

Demo / scenario

Latency p99 alert flapping.

Replace with SLO burn-rate alert.
Tune two-window: 5m × 1h.
Pages drop 80%; real ones still page.
Cause-based dashboards remain for investigation.

Tradeoffs

Burn-rate alerts catch slow-degradation events too.
Multi-window adds complexity.
Tuning needs historical data.
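One way to use historical data for tuning is to replay it through candidate rules and count how many pages each would have produced. The sketch below does this on a synthetic history built for the example: four one-minute error blips followed by one sustained 40-minute incident. The function names, traffic volume, and thresholds are assumptions, not measurements from a real service.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[int, int]  # (errors, total requests) for one minute


def burn_over(samples: Sequence[Sample], end: int, minutes: int,
              slo_target: float) -> float:
    """Burn rate over the `minutes` ending at index `end` (inclusive)."""
    start = max(0, end - minutes + 1)
    window = samples[start:end + 1]
    errors = sum(e for e, _ in window)
    total = sum(t for _, t in window)
    ratio = errors / total if total else 0.0
    return ratio / (1.0 - slo_target)


def count_pages(samples: Sequence[Sample], windows: Sequence[int],
                slo_target: float, threshold: float) -> int:
    """Count firing episodes: transitions from not-firing to firing."""
    pages, firing = 0, False
    for i in range(len(samples)):
        now = all(burn_over(samples, i, w, slo_target) > threshold for w in windows)
        if now and not firing:
            pages += 1
        firing = now
    return pages


def synthetic_history() -> List[Sample]:
    """240 minutes at 1000 requests/min: four blips, then a 40-minute incident."""
    history = []
    for minute in range(240):
        if minute in (20, 50, 80, 110):
            errors = 100  # one-minute blip
        elif 150 <= minute < 190:
            errors = 30   # sustained incident (3% errors)
        else:
            errors = 0
        history.append((errors, 1000))
    return history


history = synthetic_history()
single = count_pages(history, [5], slo_target=0.999, threshold=14.4)
double = count_pages(history, [5, 60], slo_target=0.999, threshold=14.4)
print(single, double)  # 5 1
```

On this made-up history the 5m-only rule pages five times (once per blip plus the incident), while the 5m and 1h rule pages once, for the incident. The cost is detection delay: the two-window rule fires later in the incident because the 1h window needs time to fill, which is the kind of tradeoff a replay makes visible before a rule goes live.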

Sources & further reading

Google, The Site Reliability Workbook, "Alerting on SLOs"
