Module 7 · NoSQL, Search, Graph, Object · Day 064 · 25 min

Inverted Indexes Up Close

How a search engine actually finds matches.


Memory hook

Inverted Indexes Up Close: how a search engine actually finds matches

Mental model

Match the datastore to the access pattern.

Design lens

Stemming can over-collapse terms.

Recall anchors
Tokenize · Posting lists · Ranking

Why it matters

An inverted index tokenizes text, normalizes it (lowercasing, stemming), and stores term → list of (doc_id, freq, positions). Queries intersect the posting lists of their terms and rank the surviving documents with TF-IDF or BM25.
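As a sketch, using made-up documents and the doc IDs from the diagram, that mapping and a two-term query might look like:

```python
# Hypothetical posting structure: term -> list of (doc_id, freq, positions).
postings = {
    "cat": [("d2", 1, [0]), ("d5", 2, [3, 7]), ("d9", 1, [4])],
    "dog": [("d1", 1, [2]), ("d5", 1, [5])],
}

# A query for "cat dog" intersects the doc IDs of both posting lists.
cat_docs = {doc for doc, _, _ in postings["cat"]}
dog_docs = {doc for doc, _, _ in postings["dog"]}
print(cat_docs & dog_docs)  # only d5 contains both terms
```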

Deep dive

TF-IDF: term frequency × inverse document frequency. Common terms get less weight.
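A minimal sketch of that weighting (toy numbers; natural log assumed, and real engines use smoothed variants):

```python
import math

def tf_idf(tf, df, n_docs):
    # Term frequency times inverse document frequency;
    # terms appearing in many documents (high df) get lower weight.
    return tf * math.log(n_docs / df)

# In a 100-doc corpus, a rare term outweighs a common one
# even at the same in-document frequency.
print(tf_idf(tf=3, df=2, n_docs=100))   # rare term: high weight
print(tf_idf(tf=3, df=90, n_docs=100))  # common term: near zero
```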

Phrase queries use position info; proximity queries use windowing.
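A sketch of the phrase check, assuming each posting already carries position lists (the positions below are invented):

```python
def phrase_match(positions_a, positions_b):
    # The phrase "a b" occurs in a doc if some occurrence of term a
    # is immediately followed by an occurrence of term b.
    next_b = set(positions_b)
    return any(p + 1 in next_b for p in positions_a)

# Hypothetical positions of "cat" and "dog" inside one document:
print(phrase_match([3, 7], [4, 10]))  # True: "cat" at 3, "dog" at 4
print(phrase_match([3, 7], [10]))     # False: never adjacent
```

A proximity query relaxes `p + 1` to `p + k` for a window of size k.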

Index size grows with vocabulary × postings; compress posting lists with VByte or FOR (frame-of-reference) encoding, typically over gaps between doc IDs rather than raw IDs.
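A sketch of classic VByte: each integer is split into 7-bit groups, and the high bit marks the final byte of a value (one common convention; some implementations flip it to mean "continue"):

```python
def vbyte_encode(n):
    # Split n into 7-bit groups, most significant first;
    # set the high bit on the last byte as a terminator.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return out

def vbyte_decode(data):
    n, values = 0, []
    for b in data:
        if b < 128:
            n = n * 128 + b          # continuation byte
        else:
            values.append(n * 128 + (b - 128))  # final byte
            n = 0
    return values

# Posting lists store gaps, not raw IDs: [2, 5, 9] -> gaps [2, 3, 4],
# and each small gap compresses to a single byte.
gaps = [2, 3, 4]
encoded = [b for g in gaps for b in vbyte_encode(g)]
print(encoded)                # three one-byte codes
print(vbyte_decode(encoded))  # [2, 3, 4]
```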

Demo / scenario

Implement a tiny inverted index.

  1. Tokenize each doc; lowercase, stem.
  2. Build map term → list of doc IDs.
  3. Query: tokenize query, intersect posting lists.
  4. Rank by sum of TF-IDF.
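The steps above can be sketched as a minimal implementation (stemming omitted for brevity, tokenization is plain lowercase whitespace splitting, and the document set is made up):

```python
import math
from collections import defaultdict

def tokenize(text):
    # Step 1: lowercase and split; a real engine would also stem.
    return text.lower().split()

def build_index(docs):
    # Step 2: term -> {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return index

def search(index, query, n_docs):
    # Step 3: intersect posting lists (docs must contain every term).
    terms = tokenize(query)
    candidates = set(index.get(terms[0], {}))
    for t in terms[1:]:
        candidates &= set(index.get(t, {}))
    # Step 4: rank by summed TF-IDF.
    def score(doc):
        return sum(index[t][doc] * math.log(n_docs / len(index[t]))
                   for t in terms)
    return sorted(candidates, key=score, reverse=True)

docs = {
    "d1": "the dog barked",
    "d2": "a cat sat",
    "d5": "cat and dog together cat",
    "d9": "cat nap",
}
print(search(build_index(docs), "cat dog", len(docs)))  # ['d5']
```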

Tradeoffs

  • Stemming can over-collapse terms.
  • Stopwords reduce index size but hurt recall.
  • BM25 usually beats TF-IDF in modern engines.
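A sketch of why BM25 tends to win: its term-frequency contribution saturates (the k1 knob) and it normalizes by document length (the b knob), so a term repeated 10× does not score 10× higher. The per-term formula below follows the common Okapi form with default parameters:

```python
import math

def bm25_term(tf, df, n_docs, dl, avgdl, k1=1.2, b=0.75):
    # Okapi BM25 per-term score: saturating tf, length normalization.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))

# Diminishing returns: tf=10 scores far less than 10x the tf=1 score.
print(bm25_term(10, 5, 100, 100, 100))
print(10 * bm25_term(1, 5, 100, 100, 100))
```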

Diagram

term: cat → [d2, d5, d9]
term: dog → [d1, d5]
Posting lists per term.


