Module 1 · Foundations & Method · Day 004 · 25 min

SLA, SLO, SLI, and Error Budgets

How you express 'reliable enough' in numbers — and govern by them.

Signal path: Users (client) → SLI (service) → SLO (service) → Error Budget (datastore)

Memory hook

SLA, SLO, SLI, and Error Budgets: how you express 'reliable enough' in numbers

Mental model

Frame the problem before drawing the system.

Design lens

Stricter SLO = less velocity, more reliability work.

Recall anchors

SLI · SLO · SLA

Why it matters

An SLI is what you measure (success rate, latency). An SLO is the target you commit to internally (99.9% success). An SLA is the contract you sign externally with a customer, usually with refunds attached. The error budget is 1 − SLO; spend it on velocity until it runs out.

1. Distinguish SLI, SLO, SLA cleanly.
2. Compute an error budget and use it as a control variable.
3. Recognize over-provisioned reliability targets.

Deep dive

Pick SLIs that match user experience, not server health. 'Successful checkout in under 1s' beats 'CPU < 80%'. SLIs should be expressible as 'good events / valid events'.
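
The 'good events / valid events' form can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the CheckoutEvent fields, the exclusion of synthetic traffic from the valid set, and the 1-second threshold are all assumptions chosen to mirror the 'successful checkout in under 1s' example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CheckoutEvent:
    """One checkout attempt. All fields are hypothetical."""
    status_code: int     # HTTP-style result code
    latency_s: float     # end-to-end latency in seconds
    is_synthetic: bool   # e.g. health checks or load tests


def is_valid(event: CheckoutEvent) -> bool:
    # Assumption: traffic no real user experienced is not counted.
    return not event.is_synthetic


def is_good(event: CheckoutEvent, latency_threshold_s: float = 1.0) -> bool:
    # "Successful checkout in under 1s": successful AND fast enough.
    succeeded = 200 <= event.status_code < 300
    return succeeded and event.latency_s < latency_threshold_s


def compute_sli(events: Sequence[CheckoutEvent]) -> Optional[float]:
    valid = [e for e in events if is_valid(e)]
    if not valid:
        return None  # no valid traffic: the SLI is undefined, not 100%
    good = sum(1 for e in valid if is_good(e))
    return good / len(valid)


events = [
    CheckoutEvent(200, 0.40, False),  # good
    CheckoutEvent(200, 1.70, False),  # succeeded but too slow
    CheckoutEvent(500, 0.20, False),  # failed
    CheckoutEvent(200, 0.30, False),  # good
    CheckoutEvent(200, 0.05, True),   # synthetic probe, excluded
]
print(compute_sli(events))  # 2 good out of 4 valid events
```

Note that a CPU reading appears nowhere in the event: the SLI is defined entirely by what the user saw.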

SLO targets should be lower than user expectations, not higher. Aiming at 100% is a sign of inexperience: it's expensive, blocks experimentation, and trains users to expect what you cannot sustainably deliver.

Error budgets translate reliability into product velocity. If the SLO is 99.9% over 30 days, the budget is 0.1% of 43,200 minutes, so you can be fully down for about 43 minutes in that window. While budget remains: ship faster, take risks. Once the budget is spent: freeze risky changes and put reliability work first. This is the SRE control loop.
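
The ~43 minute figure follows directly from budget = 1 − SLO applied to the window length. A small sketch, assuming a time-based SLO where every minute of the window counts equally (the function names are illustrative):

```python
def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction of the window: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget permits over the window."""
    window_minutes = window_days * 24 * 60  # 43,200 for 30 days
    return error_budget_fraction(slo) * window_minutes


for slo in (0.99, 0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo:.2%} over 30 days -> {minutes:.1f} minutes of budget")
```

Each additional nine cuts the budget by a factor of ten, which is why stricter targets get expensive quickly.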

Demo / scenario

A service has a 99.95% checkout SLO. Over the last 30 days it achieved 99.91% success.

SLI: successful_checkouts / valid_checkouts.
SLO: 99.95% over 30 days → budget = 0.05% × ~1M reqs = 500 errors.
Actual: 0.09% errors = 900 errors.
Budget overspent by 400 → freeze risky deploys, prioritize root-cause fixes.
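
The same arithmetic as a short script, using the scenario's numbers (1M valid requests, 900 failures). The function name, the report fields, and the simple "freeze when remaining is negative" rule are assumptions for illustration; a real policy would typically be more graduated.

```python
def budget_report(slo: float, valid_requests: int, failed_requests: int) -> dict:
    # Allowed errors for the window, rounded to a whole number of requests.
    allowed = round((1.0 - slo) * valid_requests)
    remaining = allowed - failed_requests
    # Burn ratio > 1.0 means the budget has been overspent.
    burn_ratio = failed_requests / allowed if allowed else float("inf")
    action = "freeze risky deploys" if remaining < 0 else "keep shipping"
    return {
        "allowed_errors": allowed,
        "actual_errors": failed_requests,
        "remaining": remaining,
        "burn_ratio": burn_ratio,
        "action": action,
    }


report = budget_report(slo=0.9995, valid_requests=1_000_000, failed_requests=900)
for key, value in report.items():
    print(f"{key}: {value}")
```

For the scenario above this gives 500 allowed errors, a remaining budget of -400, and a burn ratio of 1.8, so the freeze branch is taken.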

Tradeoffs

Stricter SLO = less velocity, more reliability work.
Looser SLO = more shipping risk, may lose customers.
Per-region SLOs reveal failures hidden by global averages.
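
The last tradeoff is easy to see with numbers. In the sketch below, the region names and event counts are invented for illustration: a small region is badly broken, but its traffic is too small to move the global SLI below the target.

```python
SLO = 0.999

# Hypothetical 30-day counts per region: (good_events, valid_events).
regions = {
    "us-east": (899_500, 900_000),
    "eu-west": (94_950, 95_000),
    "ap-south": (4_900, 5_000),
}

global_good = sum(good for good, _ in regions.values())
global_valid = sum(valid for _, valid in regions.values())
global_sli = global_good / global_valid

# Regions whose own SLI misses the target.
breaching = [name for name, (good, valid) in regions.items() if good / valid < SLO]

print(f"global SLI: {global_sli:.5f} (meets SLO: {global_sli >= SLO})")
for name, (good, valid) in regions.items():
    print(f"{name}: {good / valid:.5f}")
print("regions breaching SLO:", breaching)
```

Here the global SLI meets 99.9% while ap-south sits at 98%, which is the kind of failure a single global average hides.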

Diagram

From SLI measurements to SLO target to error budget governance.

Sources & further reading

Google SRE Book — Service Level Objectives
