How you express 'reliable enough' in numbers — and govern by them.
SLA, SLO, SLI, and Error Budgets: how you express 'reliable enough' in numbers
frame the problem before drawing the system
Stricter SLO = less velocity, more reliability work.
An SLI is what you measure (success rate, latency). An SLO is the target you commit to internally (99.9% success). An SLA is the contract you sign externally with a customer, usually with refunds attached. The error budget is 1 − SLO; spend it on velocity until it runs out.
Pick SLIs that match user experience, not server health. 'Successful checkout in under 1s' beats 'CPU < 80%'. SLIs should be expressible as 'good events / valid events'.
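As a rough illustration of the 'good events / valid events' form, here is a minimal Python sketch. The `CheckoutEvent` record, its fields, and the rule that synthetic probes are not valid events are assumptions made for this example. The 1-second threshold comes from the 'successful checkout in under 1s' wording above.

```python
from dataclasses import dataclass
from typing import Iterable

LATENCY_THRESHOLD_MS = 1000.0  # "under 1s" from the example SLI


@dataclass
class CheckoutEvent:
    # Hypothetical record shape; real telemetry will differ.
    succeeded: bool      # did the checkout complete from the user's view?
    latency_ms: float    # end-to-end latency observed by the user
    synthetic: bool      # True for health checks / probes (assumed not valid)


def checkout_sli(events: Iterable[CheckoutEvent]) -> float:
    """Return good events / valid events for the checkout SLI."""
    valid = [e for e in events if not e.synthetic]
    if not valid:
        raise ValueError("no valid events in the window; SLI is undefined")
    good = sum(
        1 for e in valid
        if e.succeeded and e.latency_ms < LATENCY_THRESHOLD_MS
    )
    return good / len(valid)


sample = [
    CheckoutEvent(succeeded=True, latency_ms=300.0, synthetic=False),   # good
    CheckoutEvent(succeeded=True, latency_ms=1500.0, synthetic=False),  # too slow
    CheckoutEvent(succeeded=False, latency_ms=200.0, synthetic=False),  # failed
    CheckoutEvent(succeeded=True, latency_ms=100.0, synthetic=True),    # not valid
    CheckoutEvent(succeeded=True, latency_ms=900.0, synthetic=False),   # good
]
print(checkout_sli(sample))  # 2 good out of 4 valid
```

Note that CPU utilisation never appears here: the ratio is built only from what the user experienced.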
SLO targets should be no stricter than what users actually need. Aiming at 100% is a sign of inexperience: it's expensive, blocks experimentation, and trains users to expect what you cannot sustainably deliver.
Error budgets translate reliability into product velocity. If the SLO is 99.9% over 30 days, the budget is 0.1% of 43,200 minutes, so you can be down about 43 minutes in that window. Budget remaining: ship faster, take risks. Budget exhausted: freeze feature releases and do reliability work first. This is the SRE control loop.
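The downtime arithmetic and the ship/freeze rule can be sketched in a few lines of Python. This is a simplified, time-based view that treats the budget as minutes of full outage. The function names and the two-state policy are illustrative assumptions, not a standard API.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget (1 - SLO) expressed as minutes of full downtime."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def release_decision(slo: float, downtime_so_far_min: float,
                     window_days: int = 30) -> str:
    """Hypothetical two-state policy: ship while budget remains, else freeze."""
    remaining = allowed_downtime_minutes(slo, window_days) - downtime_so_far_min
    return "ship" if remaining > 0 else "freeze"


print(f"{allowed_downtime_minutes(0.999):.1f} minutes")  # 43.2 minutes
print(release_decision(0.999, downtime_so_far_min=10.0))  # ship
print(release_decision(0.999, downtime_so_far_min=50.0))  # freeze
```

Real policies are often more graded than this, for example slowing releases as the budget runs low, but the loop is the same: measure, compare against the budget, adjust velocity.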
A service has a 99.95% checkout SLO. Over the last 30 days it achieved 99.91% success.
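One way to read this scenario is to compare the observed failure rate with the budgeted one. The sketch below assumes a request-based SLI, and the helper name `budget_consumed` is hypothetical. It only restates the arithmetic implied by the two numbers above.

```python
def budget_consumed(slo: float, measured_sli: float) -> float:
    """Fraction of the error budget used: observed failures / allowed failures."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (1.0 - measured_sli) / budget


consumed = budget_consumed(slo=0.9995, measured_sli=0.9991)
print(f"budget consumed: {consumed:.0%}")  # 0.09% failures vs 0.05% allowed

if consumed >= 1.0:
    print("budget exhausted: freeze and prioritise reliability work")
else:
    print("budget remaining: keep shipping")
```

With 0.09% failures against a 0.05% allowance, the service has used roughly 1.8 times its budget, so under the freeze rule described above it would be in the 'out of budget' state.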