100 Days · Interactive Roadmap
A focused roadmap inspired by donnemartin/system-design-primer. Each day: a concise lesson, a hand-drawn diagram, a mind map, and a 5-question quiz. Progress lives in your browser.
The curriculum moves from framing and estimation to data, async work, operations, and full design drills. Each module leaves a visual cue you can recall in an interview.
How to think about, scope, and discuss large systems.
Why interviews and real engineering both demand the same skill.
A repeatable five-step framework that beats raw cleverness.
What the system does vs how well it does it.
How you express 'reliable enough' in numbers — and govern by them.
How to ballpark scale in 60 seconds and avoid being wrong by 100x.
How to reverse-engineer a deployed product into its design.
Pick a corner of the triangle and own it.
Boxes and arrows are not decoration — they are the load-bearing language.
Numbers every engineer should know — and how to estimate.
The physics ceiling on what your system can do in a millisecond.
Two different curves with different bottlenecks.
Bigger box vs more boxes.
Provision for peak — and prove it.
Where state lives is the most important architectural choice.
There is exactly one bottleneck at any moment — find it.
p99 is your real product.
Read-heavy, write-heavy, fan-out, fan-in — each picks a different stack.
Different units — each names a different bottleneck.
Every component on a path gets a slice of the latency budget.
CAP, PACELC, replication, and the cost of correctness.
Under network partition, you must pick consistency or availability.
Even without a partition, consistency costs latency.
Linearizable reads see the latest write — at a cost.
All replicas converge — eventually.
Useful 'middle' consistency models that match user expectations cheaply.
Single-leader, multi-leader, leaderless — pick a coordinator strategy.
Tune consistency vs availability per operation.
When concurrent writes diverge, you need a deterministic merge.
How nodes agree on a single value despite failures.
Distributed transactions: the strict way and the practical way.
How traffic finds your servers — and how to spread it.
How a name becomes an IP — and where each step can fail.
Use DNS to send users to the right region.
Move bytes close to users so origin only answers misses.
Cache headers are a contract; invalidation is its escape hatch.
One IP, many physical locations, the network picks the closest.
Same picture, different verbs.
Bytes and ports vs URLs and headers.
Rotate broken backends out before users notice.
Route by key without thrashing the map when nodes change.
Encrypt the wire, identify the parties.
Service boundaries, contracts, discovery, and the mesh.
Resources, verbs, and statelessness as a contract.
Ask for exactly what you need — and own the cost.
Strongly typed, binary, fast — and the default for service-to-service.
How to evolve a contract without breaking clients.
Cursors beat offsets at scale.
Retries are inevitable; design as if every call may run twice.
Cap the firehose before it floods the basement.
One entrypoint for cross-cutting concerns.
One deployable vs many — both are valid; pick consciously.
How services find each other — and how the mesh helps.
Schemas, indexes, replication, and sharding for SQL.
Why SQL still wins for most transactional work.
Normalize until it hurts; denormalize until it works.
B-trees: the workhorse of SQL.
Order matters; covering avoids the heap.
Read the plan; the plan is the truth.
Streaming and logical replication compared.
Cheap reads — at the cost of staleness.
Connections are expensive — share them.
When one DB isn't enough — split by key.
Add columns, change types, split tables — without downtime.
Picking the right datastore for the workload.
JSON-shaped data, flexible schemas, easy aggregations.
Fastest possible lookups when keys are everything.
Cassandra/HBase: write-optimized, partitioned, big.
Built for append-only timestamped data.
Inverted indexes turn 'find me X' into milliseconds.
How a search engine actually finds matches.
When relationships are the thing you query.
Cheap, durable, infinite — your bucket of bytes.
Cheap raw bytes vs fast structured queries — and the middle.
There's no 'best DB' — only best fit for the workload.
Hide latency, smooth spikes, and decouple services.
Cache at every layer that pays for itself.
Cache-aside, write-through, write-behind — different shapes, different tradeoffs.
The hardest problem in computer science (only half-jokingly).
When the cache is full, who loses?
Redis and Memcached at scale.
When the cache misses, everyone misses at once.
Decouple producers and consumers with durable queues.
Multiple subscribers vs one consumer per message.
Pick at-least-once + idempotent — the only honest answer.
Atomic publish-with-state, deduped consumption.
Signal slowness upstream — don't drown.
Treat data in motion as a first-class citizen.
What prod really runs on — and how you survive it.
How the web's transport got faster — and what it means for your design.
Push, not poll.
Climb the realtime ladder by problem, not preference.
Hash with a slow function; add a second factor.
Who can do what, decided where.
Delegated authorization vs identity — and why both matter.
Don't put secrets in env files; rotate them like passwords.
The three pillars of observability.
Page on user pain; chart on suspect causes.
Rehearse failures so you don't learn them at 3 AM.
Apply everything: ten classic system design problems.
Cheap to build, instructive at scale.
Fan-out on write vs read — and the celebrity problem.
Photo upload pipeline + feed + likes — visually heavy.
Realtime messaging with delivery guarantees and presence.
Geo-indexing + matching + pricing — at city scale.
Video ingest, encoding, CDN, recommendations.
Block-level sync with conflict resolution.
Politeness + breadth + dedup at internet scale.
Distributed token bucket with consistency hazards.
Build Dynamo from first principles.