Throughput vs Latency: two different curves with different bottlenecks

The goal is to make the invisible limits visible: a bigger concurrency pool buys throughput, but it costs memory and pushes more load onto downstream services.
Latency is how long one request takes. Throughput is how many requests finish per second. They are not the same thing: a system can have low latency under light load and collapse under high throughput, or sustain high throughput while individual requests crawl.
Little's Law: average concurrent requests L = arrival rate λ × average latency W. If you can hold 100 requests in flight and each takes 50 ms, maximum throughput is L / W = 100 / 0.05 s = 2,000 requests/s. To raise throughput, tighten W or grow L.
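The arithmetic above can be checked directly; a minimal sketch (the function name is illustrative):

```python
def max_throughput(concurrency: float, latency_s: float) -> float:
    """Little's Law rearranged: lambda = L / W.

    concurrency -- average requests in flight (L)
    latency_s   -- average time per request in seconds (W)
    """
    return concurrency / latency_s

# 100 in-flight requests, 50 ms each -> 2,000 requests/s
print(max_throughput(100, 0.050))  # → 2000.0
```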
Throughput often improves with batching, queueing, and parallelism. Latency often improves with caching, locality, and smaller payloads. The tactics differ, and can pull in opposite directions: batching raises throughput precisely by making individual requests wait.
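One way to see the tension is a toy cost model in which every call pays a fixed overhead plus a per-item cost. The overhead and per-item numbers below are assumptions for illustration, not measurements:

```python
OVERHEAD_S = 0.005   # assumed fixed cost per call, e.g. syscall or network round trip
PER_ITEM_S = 0.001   # assumed real work per item

def stats(batch_size: int) -> tuple[float, float]:
    """Return (throughput items/s, latency s) for a given batch size."""
    call_time = OVERHEAD_S + PER_ITEM_S * batch_size
    throughput = batch_size / call_time
    latency = call_time  # every item in the batch waits for the whole call
    return throughput, latency

for b in (1, 8, 32):
    tput, lat = stats(b)
    print(f"batch={b:2d}  throughput={tput:7.1f}/s  latency={lat * 1000:5.1f} ms")
```

Larger batches amortize the fixed overhead, so throughput climbs, while per-item latency climbs with it.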
Capacity ceilings are hit before the raw QPS limit: as load pushes L up, W also grows nonlinearly. Queueing kicks in, and tail latency explodes well before the server is nominally saturated.
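A sketch of why W grows nonlinearly, assuming a textbook single-server M/M/1 queue; both the model and the service rate below are assumptions the text does not make:

```python
SERVICE_RATE = 100.0  # mu: requests/s the server can complete (assumed)

def mm1_time_in_system(arrival_rate: float) -> float:
    """Average time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    assert arrival_rate < SERVICE_RATE, "unstable: queue grows without bound"
    return 1.0 / (SERVICE_RATE - arrival_rate)

for util in (0.5, 0.8, 0.9, 0.95, 0.99):
    lam = util * SERVICE_RATE
    print(f"utilization={util:.2f}  avg latency={mm1_time_in_system(lam) * 1000:6.1f} ms")
```

Average latency is 20 ms at 50% utilization but 100 ms at 90% and 1,000 ms at 99%: the curve bends long before arrival rate reaches the raw service rate.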
Example: a service that takes 200 ms per request and has 50 concurrent slots tops out at 50 / 0.2 s = 250 requests/s, no matter how much raw capacity sits behind those slots.
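Plugging the example's numbers into Little's Law also shows the two levers side by side:

```python
CONCURRENT_SLOTS = 50  # L: in-flight slots from the example
LATENCY_S = 0.200      # W: 200 ms per request

# Little's Law rearranged: lambda = L / W
max_qps = CONCURRENT_SLOTS / LATENCY_S
print(max_qps)  # → 250.0

# Halving latency doubles the ceiling with the same pool...
print(CONCURRENT_SLOTS / (LATENCY_S / 2))  # → 500.0

# ...and so does doubling the pool, at the cost of memory and downstream pressure.
print((CONCURRENT_SLOTS * 2) / LATENCY_S)  # → 500.0
```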