Module 7 · NoSQL, Search, Graph, Object · Day 064 · 25 min

Inverted Indexes Up Close

How a search engine actually finds matches.


Memory hook

Inverted Indexes Up Close: how a search engine actually finds matches

Mental model

Match the datastore to the access pattern.

Design lens

Stemming can over-collapse terms.

Recall anchors
Tokenize · Posting lists · Ranking

Why it matters

An inverted index tokenizes text, normalizes it (lowercasing, stemming), and stores term → list of (doc_id, freq, positions). Queries intersect the posting lists of their terms and rank the surviving documents with TF-IDF or BM25.
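As a sketch, using made-up documents and the doc IDs from the diagram, that mapping and a two-term query might look like:

```python
# Hypothetical posting structure: term -> list of (doc_id, freq, positions).
postings = {
    "cat": [("d2", 1, [0]), ("d5", 2, [3, 7]), ("d9", 1, [4])],
    "dog": [("d1", 1, [2]), ("d5", 1, [5])],
}

# A query for "cat dog" intersects the doc IDs of both posting lists.
cat_docs = {doc for doc, _, _ in postings["cat"]}
dog_docs = {doc for doc, _, _ in postings["dog"]}
print(cat_docs & dog_docs)  # only d5 contains both terms
```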

Deep dive

TF-IDF: term frequency × inverse document frequency. Common terms get less weight.
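A minimal sketch of that weighting (toy numbers; natural log assumed, and real engines use smoothed variants):

```python
import math

def tf_idf(tf, df, n_docs):
    # Term frequency times inverse document frequency;
    # terms appearing in many documents (high df) get lower weight.
    return tf * math.log(n_docs / df)

# In a 100-doc corpus, a rare term outweighs a common one
# even at the same in-document frequency.
print(tf_idf(tf=3, df=2, n_docs=100))   # rare term: high weight
print(tf_idf(tf=3, df=90, n_docs=100))  # common term: near zero
```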

Phrase queries use position info; proximity queries use windowing.
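A sketch of the phrase check, assuming each posting already carries position lists (the positions below are invented):

```python
def phrase_match(positions_a, positions_b):
    # The phrase "a b" occurs in a doc if some occurrence of term a
    # is immediately followed by an occurrence of term b.
    next_b = set(positions_b)
    return any(p + 1 in next_b for p in positions_a)

# Hypothetical positions of "cat" and "dog" inside one document:
print(phrase_match([3, 7], [4, 10]))  # True: "cat" at 3, "dog" at 4
print(phrase_match([3, 7], [10]))     # False: never adjacent
```

A proximity query relaxes `p + 1` to `p + k` for a window of size k.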

Index size grows with vocabulary × postings; compress posting lists with VByte or FOR (frame-of-reference) encoding, typically over gaps between doc IDs rather than raw IDs.
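A sketch of classic VByte: each integer is split into 7-bit groups, and the high bit marks the final byte of a value (one common convention; some implementations flip it to mean "continue"):

```python
def vbyte_encode(n):
    # Split n into 7-bit groups, most significant first;
    # set the high bit on the last byte as a terminator.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return out

def vbyte_decode(data):
    n, values = 0, []
    for b in data:
        if b < 128:
            n = n * 128 + b          # continuation byte
        else:
            values.append(n * 128 + (b - 128))  # final byte
            n = 0
    return values

# Posting lists store gaps, not raw IDs: [2, 5, 9] -> gaps [2, 3, 4],
# and each small gap compresses to a single byte.
gaps = [2, 3, 4]
encoded = [b for g in gaps for b in vbyte_encode(g)]
print(encoded)                # three one-byte codes
print(vbyte_decode(encoded))  # [2, 3, 4]
```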

Demo / scenario

Implement a tiny inverted index.

  1. Tokenize each doc; lowercase, stem.
  2. Build map term → list of doc IDs.
  3. Query: tokenize query, intersect posting lists.
  4. Rank by sum of TF-IDF.
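The steps above can be sketched as a minimal implementation (stemming omitted for brevity, tokenization is plain lowercase whitespace splitting, and the document set is made up):

```python
import math
from collections import defaultdict

def tokenize(text):
    # Step 1: lowercase and split; a real engine would also stem.
    return text.lower().split()

def build_index(docs):
    # Step 2: term -> {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return index

def search(index, query, n_docs):
    # Step 3: intersect posting lists (docs must contain every term).
    terms = tokenize(query)
    candidates = set(index.get(terms[0], {}))
    for t in terms[1:]:
        candidates &= set(index.get(t, {}))
    # Step 4: rank by summed TF-IDF.
    def score(doc):
        return sum(index[t][doc] * math.log(n_docs / len(index[t]))
                   for t in terms)
    return sorted(candidates, key=score, reverse=True)

docs = {
    "d1": "the dog barked",
    "d2": "a cat sat",
    "d5": "cat and dog together cat",
    "d9": "cat nap",
}
print(search(build_index(docs), "cat dog", len(docs)))  # ['d5']
```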

Tradeoffs

  • Stemming can over-collapse terms.
  • Stopwords reduce index size but hurt recall.
  • BM25 usually beats TF-IDF in modern engines.
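A sketch of why BM25 tends to win: its term-frequency contribution saturates (the k1 knob) and it normalizes by document length (the b knob), so a term repeated 10× does not score 10× higher. The per-term formula below follows the common Okapi form with default parameters:

```python
import math

def bm25_term(tf, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    # Okapi BM25 per-term score: saturating tf, length normalization.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# Diminishing returns: tf=10 scores far less than 10x the tf=1 score.
print(bm25_term(10, 5, 100, 100, 100))
print(10 * bm25_term(1, 5, 100, 100, 100))
```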

Diagram

term: cat → [d2, d5, d9]
term: dog → [d1, d5]
Posting lists per term.


