Module 10 · End-to-End Design Drills · Day 098 · 45 min

Drill: Web Crawler

Politeness + breadth + dedup at internet scale.


Memory hook

Web Crawler: politeness + breadth + dedup at internet scale

Mental model

Compose the pieces into a design story.

Design lens

Politeness can leave you slow on tiny domains.

Recall anchors
Frontier · Politeness · Dedup

Why it matters

Crawlers fetch pages, extract links, and queue new ones. Scale demands a distributed URL frontier with politeness (per-domain rate limits), dedup, and durable extracted content storage.

Deep dive

Frontier: priority queue per domain.
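A minimal sketch of that idea: one priority queue per host, plus a per-host ready time so the frontier only hands out URLs whose crawl delay has elapsed. Class and method names here are illustrative, not from any particular crawler.

```python
import heapq
import time
from collections import defaultdict

class Frontier:
    """Per-domain frontier: a priority heap per host, plus a
    next-allowed-fetch time so each host honors its crawl delay."""

    def __init__(self, default_delay=1.0):
        self.queues = defaultdict(list)   # host -> heap of (priority, url)
        self.next_fetch = {}              # host -> earliest allowed fetch time
        self.delay = default_delay

    def add(self, host, url, priority=0):
        heapq.heappush(self.queues[host], (priority, url))

    def pop_ready(self, now=None):
        """Return (host, url) for some host whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for host, heap in self.queues.items():
            if heap and self.next_fetch.get(host, 0) <= now:
                self.next_fetch[host] = now + self.delay
                return host, heapq.heappop(heap)[1]
        return None
```

A real frontier would also persist its state and index hosts by ready time instead of scanning them, but the shape is the same: priority within a host, politeness across hosts.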

Dedup: URL-level bloom filter; content-level via hash.
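A toy version of the URL-level bloom filter, assuming k independent probes derived from SHA-256; the sizes are illustrative, not tuned for a billion URLs.

```python
import hashlib

class BloomFilter:
    """Toy URL-level bloom filter: k hash probes into a bit array.
    May report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _probes(self, url):
        # Derive k probe positions by salting the hash with the probe index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, url):
        for p in self._probes(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(url))
```

Content-level dedup is simpler still: hash the (normalized) page body and look the digest up in a key-value store before writing.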

Robots.txt and crawl-delay are mandatory.
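Python's standard library already parses robots.txt; a sketch using `urllib.robotparser` (in production you would fetch `https://host/robots.txt` once and cache the parsed result per host — the rules and user-agent below are made up):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("mybot", "http://example.com/page"))       # True
print(rp.can_fetch("mybot", "http://example.com/private/x"))  # False
print(rp.crawl_delay("mybot"))                                # 2
```

The returned crawl delay feeds directly into the frontier's per-host rate limit.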

Content store: HTML in object store; metadata in graph or warehouse.

Demo / scenario

Seed list of 100 domains; expand to 1B URLs.

  1. Frontier sharded by hostname.
  2. Workers fetch, rate-limited per host.
  3. Bloom filter avoids re-queueing.
  4. Extracted text → S3; links → frontier; metadata → DB.
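Step 1 — sharding the frontier by hostname — can be sketched as a hash of the host, so every URL for one host routes to the same shard and that shard can keep the politeness state (rate limits, robots.txt cache) locally. The shard count is an arbitrary example.

```python
import hashlib
from urllib.parse import urlsplit

NUM_SHARDS = 64  # illustrative

def frontier_shard(url, num_shards=NUM_SHARDS):
    """Route a URL to a frontier shard by hashing its hostname."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards
```

Hashing the hostname (not the full URL) is the key choice: it trades even load distribution for per-host locality, which politeness requires.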

Tradeoffs

  • Politeness can leave you slow on tiny domains.
  • Bloom filters have false positives.
  • Recrawl strategy (which/how often) is its own product.
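The false-positive tradeoff is quantifiable with the standard bloom-filter estimate, which helps size the filter before committing memory (the numbers below are just an example):

```python
import math

def bloom_fp_rate(n_items, m_bits, k_hashes):
    """Classic bloom-filter false-positive estimate: (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# e.g. 1B URLs at 10 bits per URL with 7 hashes -> roughly a 0.8% FP rate,
# i.e. about 1 in 120 genuinely new URLs gets wrongly skipped.
rate = bloom_fp_rate(1_000_000_000, 10_000_000_000, 7)
```

For a crawler the cost of a false positive is a missed page, not corruption, which is why bloom filters are acceptable here.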

Diagram

Seeds → Frontier → Workers → Extract → Store

Crawler pipeline.


