← Back to Design & Development
Distributed Systems · Deep Dive

Caching Strategies

Every cache pattern, eviction policy, and failure mode that shows up in distributed systems — told as a story, with the gotchas every senior engineer learns the hard way.

Companion read

This page covers the patterns and failure modes — cache-aside, write-through, eviction, the five named failure scenarios. For the orthogonal view that designs the cache cluster itself (sharding, consistent hashing, replication, multi-region), pair it with the HLD deep-dive on Distributed Cache.

Distributed Cache HLD →
01 · Why caching exists

The 1 ms vs 200 ms problem

Sarah opens Instagram. Her feed appears in under a second — 30 posts, each with a photo, a username, a comment count. Now imagine each of those numbers required a fresh database query. Thirty round-trips, each maybe 5–20 ms inside the data center. At scale, with millions of Sarahs all scrolling at the same second, the database simply melts. Caching is the answer to that meltdown — but it is also the source of about half the production outages in your career. So we'd better get it right.

The one-liner: a cache is a smaller, faster store that sits in front of a slower, larger store, holding copies of the hot data. The trade you make is: extra memory + extra failure modes in exchange for lower latency + less load on the source of truth.

The numbers that drive every cache decision

Before any pattern makes sense, you have to feel these numbers in your bones. They're the reason caching is not optional at scale.

OperationLatency (typical)How it relates
L1 CPU cache~1 nsThe reference point.
Main memory (RAM)~100 ns100× slower than L1.
In-process cache hit~1 µsSame machine, same JVM/process — basically RAM.
Redis on the same data-center LAN~0.5–1 msNetwork + serialization.
SSD read~100 µsFaster than network for a colocated DB.
Database query (indexed)~5–20 msNetwork + parse + plan + disk.
Cross-region request~70–200 msSpeed-of-light tax across continents.
Cache miss + cold DB query~50–500 msWhat every cache strategy is trying to avoid.

Read that table twice. A Redis hit is roughly 20× faster than a database query. A process-local cache hit is 5000× faster. Caching is not a "nice to have" optimization — at scale it is the only thing keeping the response time inside the second budget your users will tolerate before they leave.

The two laws of caching

Law 1 — Locality is real

In almost every real system, a small fraction of the data gets a huge fraction of the traffic. Twitter found that ~20% of users generate ~80% of tweet reads; an e-commerce catalog sees the top 1% of SKUs account for over half the page views. This is the Pareto / Zipfian distribution, and it's why caches work at all — there is a "hot set" worth caching.

Law 2 — There are only two hard problems

Phil Karlton's line: "There are only two hard things in computer science: cache invalidation and naming things." He was half-joking, but the joke holds up. Every caching strategy you'll read about below is, in some way, a different answer to "how do I know when the cached copy has gone stale?"

02 · The cache layers

Where caches actually live

When Sarah's browser shows her feed, the bytes she's seeing may have passed through up to five caches in series before reaching her eyes. Each layer is a chance to short-circuit a request and never touch the database. Knowing where each layer sits — and what it's good at — is half the battle.

flowchart LR B([Sarah's browser]) CDN[CDN edge
e.g. Cloudflare] GW[API gateway
+ HTTP cache] APP[App server
in-process LRU] DC[Distributed cache
Redis cluster] DB[(Primary DB
buffer pool)] B -->|HTTP| CDN CDN -->|miss| GW GW -->|miss| APP APP -->|miss| DC DC -->|miss| DB style B fill:#171d27,stroke:#4dfeee,color:#d4dae5 style CDN fill:#e8743b,stroke:#e8743b,color:#fff style GW fill:#4a90d9,stroke:#4a90d9,color:#fff style APP fill:#38b265,stroke:#38b265,color:#fff style DC fill:#9b72cf,stroke:#9b72cf,color:#fff style DB fill:#d4a838,stroke:#d4a838,color:#fff

① Browser / client cache

Lives on Sarah's laptop. The browser keeps copies of HTML, JS, CSS, images keyed by URL, governed by HTTP headers — Cache-Control: max-age=3600, ETag, Last-Modified. Fastest possible cache because the request never even leaves her machine. Biggest risk: you can't invalidate it remotely. If you ship a bad CSS file and a user has cached it for 24 hours, you can't tell their browser "throw that away" — you have to wait it out, or change the URL (cache-busting style.abc123.css).

② CDN (edge cache)

A globally-distributed network of servers — Cloudflare, Fastly, Akamai, AWS CloudFront — that sits between the user and your origin. Used to be only for static assets, but modern CDNs cache dynamic API responses too via cache keys and short TTLs. Why it exists: if Sarah is in Mumbai and your servers are in Virginia, every cached response saves ~200 ms of trans-Atlantic round-trip. Trade-off: harder to invalidate (you have to call the CDN's purge API), and you pay per-GB egress.

③ Reverse-proxy / gateway cache

An HTTP cache living in front of your application servers — usually nginx, Varnish, or an API gateway like Kong/Envoy. It caches based on the URL + headers. Why it exists: protects your app from duplicate requests within the same data center, and gives you a place to do rate-limiting and TLS termination in the same hop. Without it, every public API request hits a Tomcat/Node worker, even for an identical request that arrived 2 ms earlier.

④ In-process / application cache

A HashMap (or Caffeine / Guava cache) inside the JVM or Node process. Fastest server-side cache — nanosecond access, no network. Why it exists: for data that's small, hot, and shared by many requests on the same box (config flags, currency-conversion rates, the user-roles table). Big catch: each app server has its own copy, so they drift. Invalidating "the cache" actually means invalidating N caches at once. Use only for things where eventual consistency on the order of seconds is fine.

⑤ Distributed cache

The classic "Redis cluster" or "Memcached fleet" — a shared, network-attached key-value store living on its own machines, separate from your app. Why it exists: gives every app server a single consistent view of the cached data, so when one server writes, all servers see the change. Trade-off: ~1 ms network hop per read (vs nanoseconds for in-process) and a new failure domain you have to monitor.

⑥ Database internal cache

Postgres has the buffer pool. MySQL has the InnoDB buffer pool. MongoDB has the WiredTiger cache. These are RAM regions inside the database itself that hold recently-used pages. Why it exists: even a perfectly-tuned external cache will miss sometimes — when it does, you want the DB to answer from RAM, not spin up the disk. Lesson: "I'm caching in Redis so the DB cache doesn't matter" is wrong. The DB cache is your safety net on a Redis miss. Size it generously.

So what: good distributed systems use multiple cache layers, not just one. Each layer protects the one behind it. A Redis miss should never reach an unhappy database — that's a sign your DB cache is undersized or your query is unindexed.
03 · The six patterns

The six caching patterns every system uses

"Caching strategy" is shorthand for two related decisions: how does the cache get filled? and what happens on a write? There are six classic patterns, and almost every cache in production is one of them — or a hybrid of two. We'll walk through each with a sequence diagram, the code you'd actually write, and the trap that catches engineers the first time they try it.

Pattern 1 — Cache-Aside (a.k.a. Lazy Loading)

The most common pattern in the wild. The application code talks to both the cache and the database — the cache is "off to the side", and the app is the conductor.

sequenceDiagram autonumber actor App participant Cache as Redis participant DB as Database Note over App,DB: READ path App->>Cache: GET user:42 alt cache hit Cache-->>App: { name: "Sarah", ... } else cache miss Cache-->>App: nil App->>DB: SELECT * FROM users WHERE id=42 DB-->>App: row App->>Cache: SET user:42 = {...} EX 300 end Note over App,DB: WRITE path App->>DB: UPDATE users SET ... WHERE id=42 App->>Cache: DEL user:42
Java · Cache-Aside
public User getUser(long id) {
  String key = "user:" + id;
  User cached = redis.get(key, User.class);
  if (cached != null) return cached;        // HIT

  User fromDb = userRepo.findById(id);          // MISS → DB
  if (fromDb != null) {
    redis.setex(key, 300, fromDb);            // fill cache with 5-min TTL
  }
  return fromDb;
}

public void updateUser(User u) {
  userRepo.save(u);                              // write DB first
  redis.del("user:" + u.getId());                // invalidate cache
}

✅ Why it's popular

  • Cache only fills with data that's actually requested — no wasted memory on cold data.
  • If Redis dies, the app still works (just slower). Failure mode is graceful.
  • App owns the contract — easy to reason about, easy to test.

⚠️ The traps

  • First miss is always slow — cold cache after deploy = brief storm of DB traffic.
  • Stale data window — if DB write succeeds but DEL fails (network blip), cache holds old value until TTL.
  • Race condition: reader A misses, queries DB; writer B updates DB and deletes cache; reader A then writes its (now stale) DB result back into cache. Mitigation: use SET with versioning, or use Redis NX + short TTLs.

Pattern 2 — Read-Through

Same shape as Cache-Aside but with the cache (or a cache library) doing the DB load itself instead of the app. The app talks only to the cache; the cache transparently fetches from DB on miss.

sequenceDiagram autonumber actor App participant Cache as Cache (read-through) participant DB as Database App->>Cache: GET user:42 alt hit Cache-->>App: data else miss Cache->>DB: SELECT * FROM users WHERE id=42 DB-->>Cache: row Cache->>Cache: store + set TTL Cache-->>App: data end

The "cache" here usually means a library that wraps both — Caffeine's LoadingCache, AWS DAX in front of DynamoDB, or Spring's @Cacheable. Raj (the new backend dev) often confuses Read-Through with Cache-Aside, but the distinction matters: in Read-Through, the cache knows how to load from the DB. In Cache-Aside, only the app does.

Java · Spring @Cacheable (read-through via Spring's caching abstraction)
@Cacheable(value = "users", key = "#id")
public User getUser(long id) {
  // Spring checks cache first. On miss, this body runs and the result is auto-stored.
  return userRepo.findById(id);
}
When to pick Read-Through over Cache-Aside: when you have many call sites for the same query and want to centralize the "miss → DB → store" logic in one place. The downside is you're now coupled to the cache library's behavior — debugging is harder, and not every cache supports it (Memcached, for example, does not).

Pattern 3 — Write-Through

On a write, the app updates the cache, which synchronously writes to the DB before returning. The cache and the DB are always consistent — but the write is as slow as the slower of the two.

sequenceDiagram autonumber actor App participant Cache participant DB App->>Cache: PUT user:42 = {...} Cache->>DB: UPDATE users ... DB-->>Cache: ok Cache-->>App: ok

✅ Strengths

  • Cache is never stale — every read after a write sees the new value.
  • No need to manually invalidate on writes (the write IS the update).
  • Great for read-heavy workloads where the latest value matters more than write latency (user profile updates, configuration).

⚠️ Weaknesses

  • Every write pays the DB latency — you don't save anything on the write path.
  • If the cache stores data the DB doesn't (cached derived values, joins), Write-Through doesn't fit — the cache and DB have different schemas.
  • Cold reads are still slow (write-through doesn't help reads of data that was never written through the cache).

Pattern 4 — Write-Behind (a.k.a. Write-Back)

The app writes only to the cache. The cache asynchronously flushes batched writes to the DB later — every 100 ms, every 1000 writes, whatever the policy says.

sequenceDiagram autonumber actor App participant Cache participant DB App->>Cache: PUT user:42 (1ms response) Cache-->>App: ok Note over Cache: buffers write in memory Cache->>DB: BATCH UPDATE (every 100 ms) DB-->>Cache: ok

This is what high-throughput write systems use — think Cassandra's commit log + memtable, or Kafka before it became its own thing. The user-facing write returns in 1 ms; the durable write to the source-of-truth happens out of band.

The danger: if the cache node crashes before the buffered writes are flushed, you lose data. Production Write-Behind systems mitigate this by writing to a replicated log first (Kafka, Raft journal) — so the "cache" isn't really volatile RAM, it's RAM plus a durable write-ahead log. Plain in-memory Write-Behind is fine for analytics counters but never for money or anything you can't reconstruct.

Pattern 5 — Write-Around

On a write, skip the cache entirely. Write straight to the DB. Reads still go through the cache (Cache-Aside style).

sequenceDiagram autonumber actor App participant Cache participant DB Note over App,DB: WRITE bypasses cache App->>DB: UPDATE users ... DB-->>App: ok Note over App,DB: Later READ fills cache App->>Cache: GET user:42 Cache-->>App: nil (miss) App->>DB: SELECT ... DB-->>App: row App->>Cache: SET user:42

Why would you ever not populate the cache on write? Because some writes are writes you'll never read. Imagine an audit-log table — billions of inserts, almost no one reads them. Filling the cache with audit-log rows just evicts the data people actually want. Write-Around says "let the cache stay focused on hot reads."

Pattern 6 — Refresh-Ahead

The cache proactively refreshes a key before it expires, based on predicted access. If a key has TTL 60 seconds and is being read every second, the cache might refresh it at the 50-second mark so users never see a miss.

sequenceDiagram autonumber actor App participant Cache participant DB loop hot key keeps being read App->>Cache: GET trending_posts Cache-->>App: data (always fresh) end Note over Cache: at TTL-10s, async refresh Cache->>DB: SELECT * FROM trending_posts DB-->>Cache: rows Cache->>Cache: update key, reset TTL
When this shines: latency-critical reads on predictable hot keys — the home-page leaderboard, the top-10 trending tags, the current FX rate. Users never hit a cold miss. Trade-off: the refresh might be wasted work if the key is no longer hot. Don't apply blindly to every key.

Side-by-side comparison

PatternRead pathWrite pathPick when
Cache-AsideApp checks cache → DB on missApp writes DB, deletes cacheGeneral-purpose; default choice
Read-ThroughApp asks cache; cache loads from DB(same as Cache-Aside or Write-Through)Many call sites; want to centralize load logic
Write-ThroughReads hit cacheApp writes cache → cache writes DB synchronouslyRead-heavy + want zero staleness
Write-BehindReads hit cacheApp writes cache only; cache flushes to DB asyncWrite-heavy; can tolerate small data-loss window
Write-AroundCache-Aside readWrites go straight to DB, skip cacheWrites are rarely re-read (logs, events)
Refresh-AheadCache refreshes hot keys before expiry(orthogonal — combine with any write pattern)Hot, predictable keys; latency budget is tight
04 · Eviction

Eviction policies — when the cache is full, what gets thrown out?

A cache is, by definition, smaller than the data set it's caching. So when it fills up, something has to go. The eviction policy is the rule that picks the victim. The right choice depends on the shape of your access pattern, not on which one sounds smartest.

LRU — Least Recently Used

Evict the item that hasn't been touched for the longest time. Implementation: a doubly-linked list ordered by recency, plus a HashMap for O(1) lookup.

Best for: workloads with temporal locality — if you used a key recently, you'll probably use it again soon (typical of web traffic).

Bad at: scan-heavy workloads. A single one-pass scan of a billion rows will completely poison an LRU cache, evicting all your hot data with cold scan data.

LFU — Least Frequently Used

Evict the item with the lowest access count. Implementation: a frequency counter per key, plus a min-heap or counting structure.

Best for: workloads where some items are evergreen-popular (a celebrity's profile page) and you don't want them displaced by a temporary burst.

Bad at: shifts in popularity — yesterday's popular item is sticky even after no one wants it. Pure LFU is rarely used; modern systems use aging LFU (decay the counts over time) or W-TinyLFU (a window into recent activity).

FIFO — First In, First Out

Evict the oldest inserted item, regardless of how often it's been read. Implementation: a plain queue.

Best for: simple cases where insertion order tracks usefulness — log buffers, event streams.

Bad at: most things. A hot, popular item inserted early gets evicted before a cold, rarely-touched item inserted later.

TTL — Time To Live

Every entry has an expiry. After its TTL passes, it's considered stale and evicted (or refreshed on the next read).

Best for: data with a known freshness budget — auth tokens (15 min), session data (24 h), rate-limit counters (1 min sliding).

Pairs with: any other policy. Most production caches are "TTL + LRU" — TTL handles staleness, LRU handles capacity pressure.

Random

Evict a randomly-chosen entry. Yes, really.

Why this isn't crazy: Redis defaults to allkeys-random as an option, and it's surprisingly competitive with LRU on Zipfian workloads — and it's much cheaper because there's no recency tracking to maintain.

Useful when: bookkeeping overhead of LRU is too high for your throughput target.

ARC / W-TinyLFU — the modern picks

ARC (Adaptive Replacement Cache): keeps two lists (recently used, frequently used) and adjusts the split based on which is producing more hits. Used by ZFS.

W-TinyLFU: what Caffeine (the Java cache library Spring uses under the hood) implements. Combines a frequency sketch with a small "window" LRU. Beats plain LRU on almost every real workload. If you're using Caffeine, you already get this for free.

Practical advice: for 90% of services, use TTL + LRU (Redis's volatile-lru or allkeys-lru). For in-process Java caches use Caffeine (W-TinyLFU). Reach for LFU only when you have a known popularity-distribution problem (a celebrity gravity-well). Don't pick an exotic policy because it sounds clever — pick the one that matches your workload's access pattern.
05 · Going distributed

One cache box isn't enough — going distributed

One Redis box can serve maybe ~100k ops/sec and hold maybe 100 GB of data. At Instagram scale (~500M daily users, billions of cache reads per minute), no single box could possibly cope. So we need to shard the cache across many boxes. And the way you shard determines whether adding a server is a quiet Tuesday or a 3 AM page.

Strategy A — Replication (every node has a copy)

Every cache node holds the entire dataset. On a read, ask any node. On a write, broadcast to all of them.

flowchart LR C[Client] --> N1[Node 1
full dataset] C --> N2[Node 2
full dataset] C --> N3[Node 3
full dataset] W[Writer] -.broadcast.-> N1 W -.broadcast.-> N2 W -.broadcast.-> N3 style N1 fill:#9b72cf,stroke:#9b72cf,color:#fff style N2 fill:#9b72cf,stroke:#9b72cf,color:#fff style N3 fill:#9b72cf,stroke:#9b72cf,color:#fff

Pros: any node can answer any query — great read scaling, simple failover. Cons: total capacity = capacity of a single node (replication doesn't add storage, only read throughput), and write amplification grows linearly with replicas. Used by: Redis read replicas, NGINX shared-cache.

Strategy B — Partitioning via consistent hashing

Split the keyspace across nodes. Each key lives on one node only. Lookups need a deterministic rule: "given key K, which node holds it?"

The naive answer — node_index = hash(K) % N — has a brutal failure mode. Add one node and almost every key remaps. Going from 4 to 5 nodes invalidates ~80% of cache entries. A cache that empties every time you scale isn't a cache, it's a heart attack.

The fix is consistent hashing. Imagine a ring labelled 0 to 2^32. Hash each node's name onto the ring; hash each key onto the ring; a key belongs to the first node you encounter walking clockwise from the key's position. Add a node? Only the keys between its position and the next node need to move — typically 1/N of the dataset. That's a controlled flinch instead of a catastrophe.

flowchart LR K1[key A
hash=12] --> R(((Ring))) K2[key B
hash=87] --> R K3[key C
hash=45] --> R R --> N1[Node 1
position 25] R --> N2[Node 2
position 60] R --> N3[Node 3
position 100] style R fill:#9b72cf,stroke:#9b72cf,color:#fff

One more refinement: in practice you don't put each node on the ring once — you put it on ~100–200 virtual nodes, spread evenly. Why? Because with 4 physical nodes you'd get unlucky hash positions and one node would get 40% of traffic while another gets 10%. With 400 virtual nodes the distribution smooths out. This is what Cassandra calls "vnodes" and what every modern hash-partitioned system does.

Hybrid — partitioning + replication

Production systems combine both. Each key lives on one primary shard (partitioned via consistent hashing for capacity), and that shard is replicated to 2 or 3 nodes (for read scaling and failover). Lose a node? The replica takes over. Need more capacity? Add a shard.

So what: for any cache cluster larger than 2 nodes, default to consistent hashing with virtual nodes + replication factor ≥ 2. Plain modulo-hashing is a footgun. Plain replication doesn't scale capacity. The combination is what Redis Cluster, Memcached (with a client-side consistent-hasher), and DAX all use under the hood.
06 · Invalidation

Invalidation — the hardest problem in caching

You wrote the new value to the database. Six caches across three data centers still have the old one. How do you tell every one of them? This is the question every senior engineer has tripped on at least once. Below are the five techniques in real use — most production systems combine several.

① TTL (the lazy invalidation)

Every entry has a short expiry — 30 seconds, 5 minutes, an hour. After it expires, the next reader gets a miss and re-fetches the fresh value. You never explicitly invalidate — you just let staleness time out.

Pros: dead simple, works across data centers without any coordination, no broken-pipe failure modes. Cons: readers see stale data for up to TTL seconds. Pick TTL based on what staleness your product can tolerate — a product price is "1 minute fine", a flash-sale discount is "1 second fine", an admin-set feature flag is "instant please".

② Explicit invalidate on write

The standard Cache-Aside move: after the DB write succeeds, the app calls DEL user:42. The next reader will miss and refill from the new DB row.

Pros: tight consistency — staleness window is just the time between DB-commit and cache-DEL (microseconds). Cons: if the DEL fails (network blip, Redis hiccup), the cache stays stale until the TTL fires. Belt-and-braces approach: always combine with a TTL, so the worst case is bounded.

③ Pub/sub invalidation

The writer publishes an event ("user:42 changed") onto a channel — Redis pub/sub, Kafka, or AWS SNS. Every app server with an in-process cache subscribes, and on receiving the message clears its local entry.

Why this exists: in-process caches don't have a single "DEL endpoint" — there are N servers, each with their own copy. Pub/sub fans the invalidation out automatically. Cons: at-least-once delivery means duplicate invalidations (cheap); message loss means stale data (expensive — so use a delivery system that doesn't drop, or combine with TTL).

④ Versioning / cache stamping

Don't update the value — add a version to the key. user:42:v1 becomes user:42:v2. The application reads through a small "current version" indirection. Old keys age out via TTL.

Best for: CDN caches (cache-bust by changing the URL). When you deploy a new CSS file, ship it at style.abc123.css instead of mutating style.css. Browser caches naturally evict the old version because nothing references it.

⑤ Stale-While-Revalidate (SWR)

On a stale read, serve the stale value immediately and refresh asynchronously. The user never waits for the DB; the cache repairs itself in the background.

Pros: consistently low p99 latency — no one ever pays the cold-miss cost on the request path. Cons: readers see slightly-stale data (usually fine for product listings, news feeds, etc.). Implemented in HTTP/1.1's stale-while-revalidate directive, Next.js's useSWR hook, and Caffeine's refreshAfterWrite.

⑥ Bonus — Read-Your-Writes consistency

A subtle problem: Sarah updates her name from "Sarah" to "Sara" and the next page reload still shows "Sarah". Even if the cache is mostly fresh, her own session sees the old value. Fix: after a user's own write, pin that user's next few reads to bypass the cache (or read from the DB primary, not the replica). Cheap to implement, makes the UX feel instant.

The rule: never rely on just one invalidation mechanism. Production combos: (a) TTL as the safety net (so nothing stays stale forever) + (b) explicit invalidation on write (so the common case is fast) + (c) pub/sub fan-out (when you have many in-process caches to keep in step). Each one covers the other's failure mode.
07 · Failure modes

The five famous cache failure modes

The patterns above all work fine on the happy path. The senior-engineer test is: what happens when traffic spikes, the cache cold-starts, or someone queries a non-existent ID a million times a second? These are the five failures every interviewer wants you to name.

① Thundering Herd (a.k.a. Cache Stampede)

The scene: a popular cache key — say, homepage:top-posts — expires at exactly 11:00:00. In the millisecond after expiry, 10,000 concurrent requests all miss, all hit the database, all compute the same expensive aggregation, and all write it back. The database falls over. The cache is now warm again — for the next 60 seconds, until the next expiry, when the herd thunders again.

The fix — single-flight via mutex
public List<Post> topPosts() {
  List<Post> cached = redis.get("top-posts");
  if (cached != null) return cached;

  // Only ONE thread per JVM gets through this lock per key
  synchronized (lockFor("top-posts")) {
    cached = redis.get("top-posts");            // double-check after lock
    if (cached != null) return cached;
    List<Post> fresh = expensiveQuery();
    redis.setex("top-posts", 60, fresh);
    return fresh;
  }
}

For multi-server protection use Redis SETNX as a distributed lock, or better — use a Probabilistic Early Expiration (PEE) scheme: each reader has a small probability of triggering a refresh before the TTL expires, proportional to how close to expiry the key is. By the time the TTL actually hits, the value has already been refreshed by one lucky reader.

② Cache Avalanche (mass expiry)

The scene: at 11:00 AM, after a deploy, the cache fills with thousands of keys all set to TTL = 1 hour. At 12:00 PM, every key expires within the same minute. Suddenly the database is hammered by every key in the system simultaneously.

The fix — jittered TTLs: never set every key to exactly the same TTL. Add a small random jitter — instead of EX 3600, use EX 3600 + rand(0..300). Now the expirations are spread across a 5-minute window instead of a 1-second window. Cheap, effective, and almost no one remembers to do it until they've been bitten.

③ Cache Penetration (queries for things that don't exist)

The scene: an attacker (or a buggy client) queries user:99999999 a million times a second. The cache misses (no such user). The DB returns "not found". The cache doesn't store a "negative" answer. Every single one of those million requests hits the DB.

Fix A — Negative caching

When the DB says "not found", cache that negative result too: SET user:99999999 = NULL EX 30. Now subsequent identical queries hit the cache and skip the DB entirely. Use a short TTL — you don't want to lock out a legit insertion.

Fix B — Bloom filter

Before checking the cache, check a Bloom filter (a compact probabilistic data structure that answers "does this key definitely not exist?"). If the Bloom filter says no, skip both cache and DB. Used at scale by Cassandra and the LinkedIn newsfeed.

④ Hot Key (one key, all the traffic)

The scene: Cristiano Ronaldo's Instagram profile gets 50% of all reads to your sharded cache. Whichever shard holds his key gets 50% of all traffic. That shard's CPU spikes to 100% while the other shards yawn at 5%.

The fixes:

  • Replicate the hot key — store ronaldo:0, ronaldo:1, …, ronaldo:9 on different shards. Reads pick a random suffix.
  • Pin the hot key in an in-process cache on every app server. Network-free reads, but reads are now "eventually consistent" for that key.
  • Identify hot keys automatically — Redis 7+ has --hotkeys stats. Tools like Twitter's Pelikan or Netflix's EVCache do this in production.

⑤ Big Key (one value too large)

The scene: someone caches a 100 MB JSON blob under one key. Every read serializes 100 MB across the network. Every write blocks the cache (Redis is single-threaded — a big write stalls every other request). Memory fragments. The eviction policy is forced to dump 99 small useful entries to make room for one giant one.

Fix: split big values. If you must cache a large object, break it into chunks (thread:42:posts:0, thread:42:posts:1, …) or use a Redis hash/list with per-field operations so you can read just the bit you need.

The pattern across all five: a cache is an optimization, not a contract. It can disappear, fill up wrong, get hammered, or contain the wrong thing. The DB should always survive a cache that misbehaves — add rate-limiting at the app, set query timeouts, alert on cache-miss spikes. If your DB falls over because the cache had a bad day, the cache was load-bearing — and that's an architecture smell.
08 · Real architecture

Putting it together — a real cache architecture

Let's walk through how a real production cache evolves, the same way a team would build it. We start from the simplest possible thing, watch it break, fix the break, and repeat until we land on a shape that actually survives a Black Friday.

Pass 1 — The naive design

One service, one Redis box, one Postgres. App talks to Redis on every read, falls back to Postgres on miss. This is what your first Cache-Aside looks like.

flowchart LR C[Web client] --> A[App server] A --> R[(Redis)] A --> D[(Postgres)] style A fill:#38b265,stroke:#38b265,color:#fff style R fill:#e8743b,stroke:#e8743b,color:#fff style D fill:#d4a838,stroke:#d4a838,color:#fff

Works beautifully at 100 RPS. At 10k RPS, three things break:

  • Redis becomes a single point of failure — if it crashes, every read falls through to Postgres, which immediately tips over.
  • Redis fills up. 100 GB box; the working set is 200 GB. Eviction thrash.
  • One hot key (Ronaldo) saturates the network card on the Redis box.

Pass 2 — The mental model: hot vs warm vs cold

Before scaling anything, name the split. We have three tiers of data, each with very different access shapes — and we should serve them from three different places.

🔥 Hot tier

~1% of keys, ~70% of reads. Lives in in-process Caffeine cache on each app server. Network-free reads, microsecond latency. Eventually consistent across servers, but that's fine for a 30-second window.

♨️ Warm tier

~20% of keys, ~25% of reads. Lives in a Redis cluster, partitioned via consistent hashing, replicated 2x. Sub-millisecond reads, shared across all app servers.

🧊 Cold tier

~79% of keys, ~5% of reads. Lives only in Postgres (or your DB of choice). We don't waste cache memory on data that's barely read. Cache penetration here is handled by a Bloom filter at the app layer.

Each tier has its own eviction policy, its own consistency story, and its own monitoring. Hot uses W-TinyLFU with short TTL + pub/sub invalidation. Warm uses LRU + TTL + jitter + explicit invalidation. Cold uses whatever your DB does (buffer pool LRU).

Pass 3 — The production shape

Now we wire it all together. Numbers in the diagram match the cards below.

flowchart LR CL([① Browser
HTTP cache]) CDN[② CDN edge
Cloudflare] LB[③ Load balancer
nginx / ALB] APP[④ App server
in-process Caffeine] BL[⑤ Bloom filter
per-app] RC[⑥ Redis cluster
consistent hashing
RF=2] HK[⑦ Hot-key replicator
ronaldo:0..9] PS[⑧ Pub/sub bus
Redis / Kafka] DB[(⑨ Postgres primary
+ read replicas)] M[⑩ Cache-metrics
Prometheus] CL --> CDN --> LB --> APP APP -->|step 1| BL APP -->|step 2 in-process| APP APP -->|step 3| RC RC -.hot key.-> HK APP -->|step 4 on miss| DB APP -.invalidate.-> PS PS -.fan out.-> APP APP -.metrics.-> M RC -.metrics.-> M style CL fill:#171d27,stroke:#4dfeee,color:#d4dae5 style CDN fill:#e8743b,stroke:#e8743b,color:#fff style LB fill:#4a90d9,stroke:#4a90d9,color:#fff style APP fill:#38b265,stroke:#38b265,color:#fff style BL fill:#9b72cf,stroke:#9b72cf,color:#fff style RC fill:#e8743b,stroke:#e8743b,color:#fff style HK fill:#3cbfbf,stroke:#3cbfbf,color:#fff style PS fill:#d4a838,stroke:#d4a838,color:#fff style DB fill:#e05252,stroke:#e05252,color:#fff style M fill:#9b72cf,stroke:#9b72cf,color:#fff

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card below. Each card answers four things: what it is, why it exists, what would break without it, and where it sits in the flow.

Browser cache

The user's own browser caching HTML/JS/CSS/images per HTTP headers. It's the cheapest cache in the system — bytes never leave the laptop. Controlled by Cache-Control, ETag, Last-Modified.

Solves: repeat visits to the same page would otherwise re-download megabytes of static assets every time. Without it, every Sarah loading her feed would burn fresh bandwidth on the same logo and CSS files.

CDN edge

A geographically-distributed cache provided by Cloudflare/Fastly/Akamai. Caches static assets and, with the right cache keys, also short-TTL dynamic API responses. Sarah in Mumbai hits the Mumbai PoP; Raj in São Paulo hits São Paulo.

Solves: the speed-of-light tax. Without a CDN, every cross-continent request adds 150–250 ms of network latency. With it, that's pre-paid on the edge near the user.

Load balancer / gateway

nginx, AWS ALB, or Envoy. Distributes incoming requests across app servers. Some LBs also do an HTTP cache here (Varnish-style) for identical requests within a few seconds.

Solves: the question "which app server gets this request?" Without it, you can't horizontally scale your app tier. The cache here, when present, absorbs micro-bursts of identical requests.

App server + in-process cache

The Java/Node/Go process that runs your business logic. Embeds a Caffeine (or similar) cache holding the hot 1% of keys. Sub-microsecond reads, but each server has its own copy.

Solves: the 70% of traffic that's all hitting the same handful of keys (Ronaldo, the trending tag, the system config). A Redis trip is ~1 ms; this is <1 µs. At 10k RPS that latency cut is the difference between a one-server and a five-server fleet.

Bloom filter (per-app)

A small probabilistic structure that answers "is this key definitely not in the dataset?" Checked before Redis and the DB.

Solves: cache penetration attacks — millions of queries for user:99999999 that don't exist. Without it, each one becomes a DB miss and brings down Postgres. With it, the app rejects them in microseconds without ever touching the DB.

Redis cluster

The shared warm tier. Partitioned via consistent hashing across N shards, each replicated 2× for failover. Holds ~20% of keys responsible for ~25% of reads. Sub-millisecond access from any app server.

Solves: the consistency problem of in-process caches — when Sarah updates her profile on app-server-A, app-server-B needs to know. Redis is the shared source of truth for cached data. Also handles the keys too large/cold for in-process but too hot for the DB.

Hot-key replicator

A small layer that mirrors known hot keys across multiple Redis shards under suffixed names (ronaldo:0, ronaldo:1, …, ronaldo:9). Reads pick a random suffix.

Solves: the hot-shard problem. Without this, one Redis box gets 50% of the cluster's network traffic and CPU. With this, the load spreads evenly. Typically driven by an auto-detector that watches access counts and auto-mirrors any key crossing a threshold.

Pub/sub invalidation bus

A Redis channel or Kafka topic carrying "this key changed" events. App servers subscribe and evict matching entries from their in-process Caffeine caches.

Solves: the "N copies of the same in-process cache drift apart" problem. Without it, the only way to clear an in-process cache entry on every server is to wait for its TTL to expire — and the bigger the fleet, the worse the staleness. Pub/sub fans the invalidation out in milliseconds.

Database (primary + read replicas)

The source of truth. Postgres or MySQL or whatever you're using. Read replicas absorb the read traffic that escapes the cache; the primary takes all writes. The DB's own buffer pool is also a cache — undersize it and "cache miss → cold DB query" lands a Sev1.

Solves: everything. The cache is an optimization; the DB is the contract. Tune it for cache-miss survival, not for cache-hit fastpath.

Metrics / observability

Prometheus + Grafana (or your stack) capturing: cache hit ratio, p99 latency per layer, key-count, eviction rate, hot-key counters, miss rate per shard.

Solves: "the system is slow, where do we look?" Without metrics per layer you can't tell whether the regression is in CDN miss-rate, Redis CPU, hot-key skew, or buffer-pool churn. Every senior engineer can name the cache hit ratio of their service to within 5% — if you can't, you're flying blind.

Concrete walkthrough — Sarah loads her profile page

Let's trace one request through all 10 components. Sarah opens example.com/sarah on her phone in Mumbai. Here's what happens, second by second:

  1. Her phone's ① browser cache serves the HTML shell + CSS instantly — those haven't changed.
  2. The page's JS calls /api/users/sarah. The request hits ② Cloudflare's Mumbai PoP, which has a 60-second cached copy. Cache hit, served in 5 ms.
  3. One second later, Sarah edits her bio. The PUT request bypasses the CDN (write methods aren't cached), goes through ③ the LB to ④ an app server.
  4. The app server checks ⑤ the Bloom filter — does user "sarah" definitely-not exist? No, she does, proceed.
  5. The app server writes to ⑨ Postgres primary (~10 ms).
  6. It then issues DEL user:sarah to ⑥ Redis (single shard via consistent hash).
  7. It publishes {"key":"user:sarah"} on ⑧ the pub/sub bus. Within ~50 ms, every other app server's ④ in-process cache evicts its copy of user:sarah.
  8. Sarah's next read of her own profile comes back. App server-A checks its in-process cache → miss (just evicted). Checks Redis → miss (we DEL'd it). Goes to ⑨ Postgres → gets the new bio. Stores back into Redis and into in-process cache. Total: ~20 ms.
  9. Meanwhile, in parallel, ⑦ the hot-key replicator notices that celebrity:ronaldo is being read 200k times/sec on shard-3 and starts mirroring it to shards 4 and 5 under suffixed keys. Shard-3 CPU drops from 92% to 34%.
  10. ⑩ Metrics records every hop's latency. The Grafana dashboard shows the cache hit ratio holding at 96.4% — exactly where it should be.
So what: a well-built cache architecture has defense in depth. The browser, CDN, in-process, and Redis layers each absorb a category of traffic, and the DB only sees the residue. When a layer hiccups, the layer behind it absorbs the impact. The whole stack is designed so that no single cache being slow or empty turns into a Sev1.
09 · Decision table

Picking the right strategy — a cheat sheet

The hardest part of cache design isn't knowing the patterns — it's knowing which to pick for the workload in front of you. The table below maps common scenarios to their best default. Use it as a starting point, not a law.

ScenarioBest default strategyEvictionWhy
User profile data (read-heavy, occasional writes) Cache-Aside + explicit invalidation LRU + 5 min TTL Reads dominate; staleness on writes is bounded.
Session tokens (must be fresh; bounded life) Write-Through TTL (= session length) Zero staleness; TTL handles cleanup naturally.
High-throughput counters (likes, views) Write-Behind to durable log None (always hot) 10k writes/sec to DB direct is impossible; batch them.
Audit / log inserts (rarely re-read) Write-Around Don't pollute cache with cold data.
Homepage / leaderboard (hot, predictable, latency-critical) Refresh-Ahead TTL with PEE Never let users hit a cold miss; refresh proactively.
Hot celebrity / trending entity Cache-Aside + hot-key replication LFU or pinned Distribute load across shards; favor stickiness.
"Does this entity exist?" lookups Bloom filter + negative caching Stop "user 99999999" attacks at the door.
Product price during a flash sale Cache-Aside + pub/sub invalidation 1 sec TTL Must update fast across all servers when admins change price.
Static assets (CSS, images) CDN + URL versioning Long TTL (1 year) Immutable artifacts; cache forever and bust by URL.
News feed / timeline Stale-While-Revalidate 30 sec staleness budget Snappy reads, freshness in the background.
10 · Interview Q&A

Interview Q&A — the 15 most-asked questions

These are the questions that come up in distributed-systems rounds — caching turns up in nearly every system-design interview. Each answer is short enough to deliver in 60 seconds while still showing depth.

What's the difference between Cache-Aside and Read-Through?
Cache-Aside: the application is responsible for both reading the cache and falling back to the DB on miss. The cache is a passive store. Read-Through: the application talks only to the cache; the cache itself (via a library or built-in feature) loads from the DB on miss. The functional behavior is the same; the difference is where the loading code lives. Read-Through is cleaner for many call sites; Cache-Aside is more flexible and easier to debug.
Why is consistent hashing better than hash(key) % N?
Plain modulo causes ~(N-1)/N of all keys to remap when N changes. Going from 4 → 5 nodes invalidates ~80% of cache entries — effectively wiping the cache. Consistent hashing places nodes and keys on a virtual ring; adding a node only moves keys in one arc of the ring (~1/N of the data). Combined with virtual nodes (~100 vnodes per physical node), it gives smooth, predictable scaling.
How do you avoid a thundering herd / cache stampede?
Three layered defenses: (1) Single-flight — when a key expires, only one thread per server (or one process per cluster, via Redis SETNX) reloads it; others wait for the result. (2) Probabilistic Early Expiration (PEE) — each reader probabilistically triggers a refresh before TTL based on how close to expiry the key is. (3) Stale-While-Revalidate — serve the stale value immediately and refresh asynchronously, so nobody ever waits for a cold miss.
What's cache penetration and how do you mitigate it?
Penetration is when an attacker (or buggy client) queries keys that don't exist. The cache misses, the DB returns "not found", but since the cache stores nothing, every request hits the DB. Mitigation: negative caching (cache the "not found" result with a short TTL) and/or a Bloom filter that answers "this key is definitely not in the dataset, skip everything" in microseconds.
When should you use Write-Behind?
When write throughput is too high to write straight to the source-of-truth DB synchronously — think a "like" counter that increments 10k times/sec. Buffer writes in the cache and flush in batches. Catch: you must use a durable buffer (replicated commit log, Kafka, Redis with AOF + replication) — plain in-memory Write-Behind loses data if the cache node crashes before flush. Never use pure-RAM Write-Behind for anything you can't reconstruct.
What's a hot key and how do you handle it?
A hot key is a single cache key receiving a disproportionate share of traffic (Ronaldo's profile getting 50% of reads). In a sharded cluster, this saturates one shard while the rest sit idle. Fixes: (1) replicate the key across shards under suffixed names and pick randomly on read; (2) pin it in an in-process cache on every app server; (3) detect hot keys automatically via the cache's stats (Redis 7+ has --hotkeys) and apply the above dynamically.
Why add jitter to TTLs?
If thousands of keys are set with the same TTL at the same moment (typical after a deploy or cold-start), they all expire at the same moment too. The moment they expire, every read for those keys misses, and the DB gets hammered by every key simultaneously — cache avalanche. Adding a small random jitter (EX 3600 + rand(0..300)) spreads the expirations over a window, smoothing the DB load.
What's the difference between Memcached and Redis?
Memcached is a pure in-memory key-value store. Multithreaded, very fast, but values are opaque strings/blobs. No persistence, no replication built in. Redis is a data-structure server — keys can be strings, lists, sets, sorted sets, hashes, streams, etc. Supports persistence (AOF/RDB), replication, cluster mode, pub/sub, Lua scripts, and transactions. Memcached wins on raw throughput per CPU; Redis wins on flexibility. For 90% of new projects, Redis is the right default.
What's the right cache hit ratio?
Depends on workload, but in most read-heavy services > 90% is the target. Below 80% and the cache isn't really earning its keep — most queries are missing through to the DB anyway. Below 60% and you should question whether to cache at all. Above 99% looks great but often means the cache is too tight to the workload — small access-pattern shifts can drop it fast. Aim for steady 90–98% with active monitoring.
Cache-Aside has a race condition — explain it.
Reader A misses, queries DB, gets value V1. Meanwhile, writer B updates DB to V2 and deletes cache. A then writes V1 (now stale) into the cache. The cache holds the stale value until TTL. Fix: versioned writes (include the row's version/updated_at in the cache and refuse to overwrite a newer version with an older one), or use Redis transactions/WATCH, or use the "delete-after-write + short TTL" combo so the staleness window is bounded even when the race lands.
How do you keep multiple in-process caches in sync?
You don't, exactly — by design, each is a separate copy. To bound the drift, use one of: (a) short TTL (caches converge within seconds), (b) pub/sub fan-out (writer publishes an invalidation, all servers listen and evict the local entry), or (c) versioned reads (every read also reads a tiny "current version" key from Redis; if mismatched, refresh).
Should you ever cache user-specific data?
Yes, but with care. Each user's data is its own key (e.g. user:42:feed), which means low hit rate per key and a lot of memory pressure. Make sure: (1) the TTL is short (data changes frequently), (2) you cache derived data (the rendered feed), not the underlying entities (which are cached separately), and (3) invalidation triggers on the user's own writes (read-your-writes). Otherwise you'll waste memory caching things one user reads once.
What's the CAP-like trade-off for caches?
A cache is fundamentally a freshness-vs-availability-vs-cost trade. Tight invalidation gives freshness but adds coupling (writer must reach all caches). Long TTLs give availability and low cost but worse freshness. Strong consistency with the DB defeats the cache's purpose. Most production caches accept "eventual consistency on the order of seconds" — a tiny window of staleness in exchange for huge latency wins.
What's the difference between a CDN and a reverse proxy?
Both cache HTTP responses. Difference is geography. A reverse proxy (nginx, Varnish) sits in your data center, between LB and app servers. A CDN (Cloudflare, Fastly) is a global network — your responses live at edge PoPs near users. CDNs trade some control (purge has latency, edge logic is limited) for huge latency wins on cross-region traffic. Both have a role: CDN for global, proxy for in-DC efficiency.
When should you NOT cache?
When the data changes faster than reads can amortize the cost (write > read ratio). When the data is sensitive and stale-by-a-second is unacceptable (financial transactions in flight, auth state checks). When the dataset is small enough to fit comfortably in the DB's own buffer pool — adding Redis in front of a DB that already serves from RAM is pure overhead. Caching is an answer, not a default. Measure first.
Bonus · Cheat sheet

The one-page summary

If you remember nothing else from this article, remember these.

  • Layers: Browser → CDN → Gateway → In-process → Distributed cache → DB. Each protects the one behind it.
  • Default pattern: Cache-Aside with TTL + explicit invalidation on write. Reach for others only when this doesn't fit.
  • Default eviction: TTL + LRU (or W-TinyLFU via Caffeine for in-process JVM caches).
  • Distributed: consistent hashing + virtual nodes + replication factor ≥ 2. Never plain modulo.
  • Invalidation: never rely on one mechanism. Combine TTL (safety net) + explicit DEL (common case) + pub/sub (in-process fan-out).
  • Always jitter TTLs. Mass simultaneous expiry is one query away from a Sev1.
  • Mitigate stampede via single-flight, PEE, or stale-while-revalidate.
  • Mitigate penetration via negative caching + Bloom filter.
  • Mitigate hot keys by replicating with suffixed keys, or pinning in-process.
  • The DB must always survive a cache that misbehaves. If it can't, your cache was load-bearing — an architecture smell.
  • Measure hit ratio per layer. Without numbers you're flying blind.
  • Caching is an optimization, not a contract. Treat it like one.