Every cache pattern, eviction policy, and failure mode that shows up in distributed systems — told as a story, with the gotchas every senior engineer learns the hard way.
This page covers the patterns and failure modes — cache-aside, write-through, eviction, the five named failure scenarios. For the orthogonal view that designs the cache cluster itself (sharding, consistent hashing, replication, multi-region), pair it with the HLD deep-dive on Distributed Cache.
Sarah opens Instagram. Her feed appears in under a second — 30 posts, each with a photo, a username, a comment count. Now imagine each of those numbers required a fresh database query. Thirty round-trips, each maybe 5–20 ms inside the data center. At scale, with millions of Sarahs all scrolling at the same second, the database simply melts. Caching is the answer to that meltdown — but it is also the source of about half the production outages in your career. So we'd better get it right.
Before any pattern makes sense, you have to feel these numbers in your bones. They're the reason caching is not optional at scale.
| Operation | Latency (typical) | How it relates |
|---|---|---|
| L1 CPU cache | ~1 ns | The reference point. |
| Main memory (RAM) | ~100 ns | 100× slower than L1. |
| In-process cache hit | ~1 µs | Same machine, same JVM/process — basically RAM. |
| Redis on the same data-center LAN | ~0.5–1 ms | Network + serialization. |
| SSD read | ~100 µs | Faster than network for a colocated DB. |
| Database query (indexed) | ~5–20 ms | Network + parse + plan + disk. |
| Cross-region request | ~70–200 ms | Speed-of-light tax across continents. |
| Cache miss + cold DB query | ~50–500 ms | What every cache strategy is trying to avoid. |
Read that table twice. A Redis hit is roughly 10–20× faster than an indexed database query. A process-local cache hit is at least 5,000× faster. Caching is not a "nice to have" optimization: at scale it is the only thing keeping the response time inside the one-second budget your users will tolerate before they leave.
In almost every real system, a small fraction of the data gets a huge fraction of the traffic. Twitter found that ~20% of users generate ~80% of tweet reads; an e-commerce catalog sees the top 1% of SKUs account for over half the page views. This is the Pareto / Zipfian distribution, and it's why caches work at all — there is a "hot set" worth caching.
Phil Karlton's line: "There are only two hard things in computer science: cache invalidation and naming things." He was half-joking, but the joke holds up. Every caching strategy you'll read about below is, in some way, a different answer to "how do I know when the cached copy has gone stale?"
When Sarah's browser shows her feed, the bytes she's seeing may have passed through up to five caches in series before reaching her eyes. Each layer is a chance to short-circuit a request and never touch the database. Knowing where each layer sits — and what it's good at — is half the battle.
Lives on Sarah's laptop. The browser keeps copies of HTML, JS, CSS, images keyed by URL, governed by HTTP headers — Cache-Control: max-age=3600, ETag, Last-Modified. Fastest possible cache because the request never even leaves her machine. Biggest risk: you can't invalidate it remotely. If you ship a bad CSS file and a user has cached it for 24 hours, you can't tell their browser "throw that away" — you have to wait it out, or change the URL (cache-busting style.abc123.css).
A globally-distributed network of servers (Cloudflare, Fastly, Akamai, AWS CloudFront) that sits between the user and your origin. Used to be only for static assets, but modern CDNs cache dynamic API responses too via cache keys and short TTLs. Why it exists: if Sarah is in Mumbai and your servers are in Virginia, every cached response saves ~200 ms of intercontinental round-trip. Trade-off: harder to invalidate (you have to call the CDN's purge API), and you pay per-GB egress.
An HTTP cache living in front of your application servers — usually nginx, Varnish, or an API gateway like Kong/Envoy. It caches based on the URL + headers. Why it exists: protects your app from duplicate requests within the same data center, and gives you a place to do rate-limiting and TLS termination in the same hop. Without it, every public API request hits a Tomcat/Node worker, even for an identical request that arrived 2 ms earlier.
A HashMap (or Caffeine / Guava cache) inside the JVM or Node process. Fastest server-side cache: sub-microsecond access, no network. Why it exists: for data that's small, hot, and shared by many requests on the same box (config flags, currency-conversion rates, the user-roles table). Big catch: each app server has its own copy, so they drift. Invalidating "the cache" actually means invalidating N caches at once. Use only for things where eventual consistency on the order of seconds is fine.
The classic "Redis cluster" or "Memcached fleet" — a shared, network-attached key-value store living on its own machines, separate from your app. Why it exists: gives every app server a single consistent view of the cached data, so when one server writes, all servers see the change. Trade-off: ~1 ms network hop per read (vs nanoseconds for in-process) and a new failure domain you have to monitor.
Postgres has shared buffers. MySQL has the InnoDB buffer pool. MongoDB has the WiredTiger cache. These are RAM regions inside the database itself that hold recently-used pages. Why it exists: even a perfectly-tuned external cache will miss sometimes, and when it does you want the DB to answer from RAM, not go to disk. Lesson: "I'm caching in Redis so the DB cache doesn't matter" is wrong. The DB cache is your safety net on a Redis miss. Size it generously.
"Caching strategy" is shorthand for two related decisions: how does the cache get filled? and what happens on a write? There are six classic patterns, and almost every cache in production is one of them — or a hybrid of two. We'll walk through each with a sequence diagram, the code you'd actually write, and the trap that catches engineers the first time they try it.
The most common pattern in the wild. The application code talks to both the cache and the database — the cache is "off to the side", and the app is the conductor.
```java
public User getUser(long id) {
    String key = "user:" + id;
    User cached = redis.get(key, User.class);
    if (cached != null) return cached;       // HIT

    User fromDb = userRepo.findById(id);     // MISS → DB
    if (fromDb != null) {
        redis.setex(key, 300, fromDb);       // fill cache with 5-min TTL
    }
    return fromDb;
}

public void updateUser(User u) {
    userRepo.save(u);                        // write DB first
    redis.del("user:" + u.getId());          // invalidate cache
}
```
Trap: if the DEL fails (network blip), the cache holds the old value until the TTL expires. Fix: SET with versioning, or use Redis NX + short TTLs.

Same shape as Cache-Aside but with the cache (or a cache library) doing the DB load itself instead of the app. The app talks only to the cache; the cache transparently fetches from DB on miss.
The "cache" here usually means a library that wraps both — Caffeine's LoadingCache, AWS DAX in front of DynamoDB, or Spring's @Cacheable. Raj (the new backend dev) often confuses Read-Through with Cache-Aside, but the distinction matters: in Read-Through, the cache knows how to load from the DB. In Cache-Aside, only the app does.
```java
@Cacheable(value = "users", key = "#id")
public User getUser(long id) {
    // Spring checks cache first. On miss, this body runs and the result is auto-stored.
    return userRepo.findById(id);
}
```
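To make the Read-Through distinction concrete, here is a minimal sketch using only the JDK. The class name `ReadThroughCache` and the `loader` function are hypothetical stand-ins for what a library such as Caffeine's LoadingCache provides; a production version would also need TTLs and a size bound.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Read-through sketch: the cache owns the loader, so callers never touch the DB. */
class ReadThroughCache<K, V> {
    private final ConcurrentHashMap<K, V> store = new ConcurrentHashMap<>();
    private final Function<K, V> loader;               // hypothetical DB lookup
    final AtomicInteger loads = new AtomicInteger();   // counts trips to the "DB"

    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        // computeIfAbsent runs the loader at most once per missing key,
        // even when several threads miss at the same time.
        return store.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    void invalidate(K key) {
        store.remove(key);
    }
}
```

The calling code only ever sees `get`; where the value came from is the cache's business, which is the whole point of the pattern.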
On a write, the app updates the cache, which synchronously writes to the DB before returning. The cache and the DB agree after every acknowledged write, but the write pays for both hops, so it is never faster than the DB write itself.
The app writes only to the cache. The cache asynchronously flushes batched writes to the DB later — every 100 ms, every 1000 writes, whatever the policy says.
This is what high-throughput write systems use. The shape resembles Cassandra's commit log + memtable, or a Kafka topic buffering writes ahead of the database. The user-facing write returns in 1 ms; the durable write to the source-of-truth happens out of band.
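A minimal Write-Behind sketch, assuming a hypothetical `batchWriter` callback that performs the bulk DB write. It is an illustration of the flow rather than a production design: a real implementation would run `flush` on a timer and handle failed batches.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Write-behind sketch: writes land in memory; a later flush persists them in one batch. */
class WriteBehindCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, V> dirty = new LinkedHashMap<>();   // pending writes, coalesced per key
    private final Consumer<Map<K, V>> batchWriter;            // hypothetical bulk DB write

    WriteBehindCache(Consumer<Map<K, V>> batchWriter) {
        this.batchWriter = batchWriter;
    }

    void put(K key, V value) {
        cache.put(key, value);
        synchronized (dirty) {
            dirty.put(key, value);   // a second write to the same key replaces the first
        }
    }

    V get(K key) {
        return cache.get(key);
    }

    /** Call from a timer (e.g. every 100 ms). Anything not yet flushed is lost on a crash. */
    int flush() {
        Map<K, V> batch;
        synchronized (dirty) {
            if (dirty.isEmpty()) return 0;
            batch = new LinkedHashMap<>(dirty);
            dirty.clear();
        }
        batchWriter.accept(batch);
        return batch.size();
    }
}
```

Coalescing per key is what makes the pattern cheap for counters: a thousand increments to one key between flushes become a single DB write.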
On a write, skip the cache entirely. Write straight to the DB. Reads still go through the cache (Cache-Aside style).
Why would you ever not populate the cache on write? Because some writes are writes you'll never read. Imagine an audit-log table — billions of inserts, almost no one reads them. Filling the cache with audit-log rows just evicts the data people actually want. Write-Around says "let the cache stay focused on hot reads."
The cache proactively refreshes a key before it expires, based on predicted access. If a key has TTL 60 seconds and is being read every second, the cache might refresh it at the 50-second mark so users never see a miss.
| Pattern | Read path | Write path | Pick when |
|---|---|---|---|
| Cache-Aside | App checks cache → DB on miss | App writes DB, deletes cache | General-purpose; default choice |
| Read-Through | App asks cache; cache loads from DB | (same as Cache-Aside or Write-Through) | Many call sites; want to centralize load logic |
| Write-Through | Reads hit cache | App writes cache → cache writes DB synchronously | Read-heavy + want zero staleness |
| Write-Behind | Reads hit cache | App writes cache only; cache flushes to DB async | Write-heavy; can tolerate small data-loss window |
| Write-Around | Cache-Aside read | Writes go straight to DB, skip cache | Writes are rarely re-read (logs, events) |
| Refresh-Ahead | Cache refreshes hot keys before expiry | (orthogonal — combine with any write pattern) | Hot, predictable keys; latency budget is tight |
A cache is, by definition, smaller than the data set it's caching. So when it fills up, something has to go. The eviction policy is the rule that picks the victim. The right choice depends on the shape of your access pattern, not on which one sounds smartest.
Evict the item that hasn't been touched for the longest time. Implementation: a doubly-linked list ordered by recency, plus a HashMap for O(1) lookup.
Best for: workloads with temporal locality — if you used a key recently, you'll probably use it again soon (typical of web traffic).
Bad at: scan-heavy workloads. A single one-pass scan of a billion rows will completely poison an LRU cache, evicting all your hot data with cold scan data.
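The linked-list-plus-HashMap structure described above is what the JDK's `LinkedHashMap` provides when constructed in access order, so a small LRU can be sketched in a few lines. The class name `LruCache` is illustrative, and this version is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** LRU sketch: LinkedHashMap in access order supplies the recency list and the hash lookup. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true: get() moves the entry to the tail
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;   // evict the least recently used entry on overflow
    }
}
```

With a capacity of 2, inserting `a` and `b`, reading `a`, then inserting `c` evicts `b`, because `a` was touched more recently.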
Evict the item with the lowest access count. Implementation: a frequency counter per key, plus a min-heap or counting structure.
Best for: workloads where some items are evergreen-popular (a celebrity's profile page) and you don't want them displaced by a temporary burst.
Bad at: shifts in popularity — yesterday's popular item is sticky even after no one wants it. Pure LFU is rarely used; modern systems use aging LFU (decay the counts over time) or W-TinyLFU (a window into recent activity).
Evict the oldest inserted item, regardless of how often it's been read. Implementation: a plain queue.
Best for: simple cases where insertion order tracks usefulness — log buffers, event streams.
Bad at: most things. A hot, popular item inserted early gets evicted before a cold, rarely-touched item inserted later.
Every entry has an expiry. After its TTL passes, it's considered stale and evicted (or refreshed on the next read).
Best for: data with a known freshness budget — auth tokens (15 min), session data (24 h), rate-limit counters (1 min sliding).
Pairs with: any other policy. Most production caches are "TTL + LRU" — TTL handles staleness, LRU handles capacity pressure.
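A minimal TTL sketch with lazy expiry, meaning stale entries are only removed when someone reads them. The `TtlCache` name and the injectable `clock` are assumptions made for illustration and testability; real caches usually add a background sweeper as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** TTL sketch: each entry carries an expiry time and is dropped lazily on the next read. */
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> store = new HashMap<>();
    private final LongSupplier clock;   // injectable so tests can move time forward

    TtlCache(LongSupplier clock) {
        this.clock = clock;
    }

    synchronized void put(K key, V value, long ttlMillis) {
        store.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    synchronized V get(K key) {
        Entry<V> e = store.get(key);
        if (e == null) return null;
        if (clock.getAsLong() >= e.expiresAtMillis) {
            store.remove(key);   // stale: evict and report a miss
            return null;
        }
        return e.value;
    }
}
```

In production the clock would be `System::currentTimeMillis`; passing it in keeps the expiry logic testable without sleeping.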
Evict a randomly-chosen entry. Yes, really.
Why this isn't crazy: Redis offers allkeys-random as one of its maxmemory-policy options, and random eviction can be surprisingly competitive with LRU on Zipfian workloads. It is also much cheaper because there's no recency tracking to maintain.
Useful when: bookkeeping overhead of LRU is too high for your throughput target.
ARC (Adaptive Replacement Cache): keeps two lists (recently used, frequently used) and adjusts the split based on which is producing more hits. Used by ZFS.
W-TinyLFU: what Caffeine (a Java cache library that Spring Boot can plug in as its cache provider) implements. Combines a frequency sketch with a small "window" LRU. Beats plain LRU on almost every real workload. If you're using Caffeine, you already get this for free.
Rule of thumb: for Redis, start with LRU (volatile-lru or allkeys-lru). For in-process Java caches use Caffeine (W-TinyLFU). Reach for LFU only when you have a known popularity-distribution problem (a celebrity gravity-well). Don't pick an exotic policy because it sounds clever; pick the one that matches your workload's access pattern.
One Redis box can serve maybe ~100k ops/sec and hold maybe 100 GB of data. At Instagram scale (~500M daily users, billions of cache reads per minute), no single box could possibly cope. So we need to shard the cache across many boxes. And the way you shard determines whether adding a server is a quiet Tuesday or a 3 AM page.
Every cache node holds the entire dataset. On a read, ask any node. On a write, broadcast to all of them.
Pros: any node can answer any query — great read scaling, simple failover. Cons: total capacity = capacity of a single node (replication doesn't add storage, only read throughput), and write amplification grows linearly with replicas. Used by: Redis read replicas, NGINX shared-cache.
Split the keyspace across nodes. Each key lives on one node only. Lookups need a deterministic rule: "given key K, which node holds it?"
The naive answer — node_index = hash(K) % N — has a brutal failure mode. Add one node and almost every key remaps. Going from 4 to 5 nodes invalidates ~80% of cache entries. A cache that empties every time you scale isn't a cache, it's a heart attack.
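The 80% figure can be checked by brute force. The sketch below treats the integers 0..sampleKeys-1 as stand-ins for hash values and counts how many change node index when the node count changes; the class and method names are illustrative.

```java
/** Counts how many keys change owner when a modulo-sharded cache grows. */
class ModuloRemap {
    static double remappedFraction(int oldNodes, int newNodes, int sampleKeys) {
        int moved = 0;
        for (int h = 0; h < sampleKeys; h++) {
            // h stands in for hash(K); a key moves if its node index changes
            if (h % oldNodes != h % newNodes) moved++;
        }
        return (double) moved / sampleKeys;
    }
}
```

`ModuloRemap.remappedFraction(4, 5, 1_000_000)` returns 0.8: a key keeps its node only when `h % 4 == h % 5`, which holds for 4 out of every 20 consecutive hash values.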
The fix is consistent hashing. Imagine a ring labelled 0 to 2^32 − 1. Hash each node's name onto the ring; hash each key onto the ring; a key belongs to the first node you encounter walking clockwise from the key's position. Add a node? Only the keys between the previous node and the new node's position need to move, typically about 1/N of the dataset. That's a controlled flinch instead of a catastrophe.
One more refinement: in practice you don't place each physical node on the ring once. You place it at ~100–200 positions, called virtual nodes, scattered around the ring by hashing. Why? Because with only one position each, 4 physical nodes could land on unlucky hash positions, and one node would get 40% of traffic while another gets 10%. With 400 virtual nodes the distribution smooths out. This is what Cassandra calls "vnodes", and it is common practice in hash-partitioned systems.
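The ring and its virtual nodes can be sketched with a sorted map. This is an illustration under stated assumptions: the `HashRing` class, the `node#i` naming of virtual nodes, and the choice of MD5 truncated to 32 bits are all hypothetical, and the rare case of two virtual nodes hashing to the same position is ignored.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Consistent-hash ring sketch with virtual nodes. */
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int vnodes;

    HashRing(int vnodes) {
        this.vnodes = vnodes;
    }

    void addNode(String node) {
        for (int i = 0; i < vnodes; i++) {
            ring.put(hash(node + "#" + i), node);   // one ring position per virtual node
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < vnodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** First node clockwise from the key's position, wrapping at the end of the ring. */
    String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /** First 4 bytes of MD5 as an unsigned 32-bit ring position (0 to 2^32 - 1). */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ((d[0] & 0xFFL) << 24) | ((d[1] & 0xFFL) << 16)
                 | ((d[2] & 0xFFL) << 8) | (d[3] & 0xFFL);
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException(ex);   // MD5 is required on every Java platform
        }
    }
}
```

Adding a fifth node to a four-node ring only reassigns keys whose clockwise successor is now one of the new node's positions, so every key that moves lands on the new node and the rest stay put.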
Production systems combine both. Each key lives on one primary shard (partitioned via consistent hashing for capacity), and that shard is replicated to 2 or 3 nodes (for read scaling and failover). Lose a node? The replica takes over. Need more capacity? Add a shard.
You wrote the new value to the database. Six caches across three data centers still have the old one. How do you tell every one of them? This is the question every senior engineer has tripped on at least once. Below are the five techniques in real use — most production systems combine several.
Every entry has a short expiry — 30 seconds, 5 minutes, an hour. After it expires, the next reader gets a miss and re-fetches the fresh value. You never explicitly invalidate — you just let staleness time out.
Pros: dead simple, works across data centers without any coordination, no broken-pipe failure modes. Cons: readers see stale data for up to TTL seconds. Pick TTL based on what staleness your product can tolerate — a product price is "1 minute fine", a flash-sale discount is "1 second fine", an admin-set feature flag is "instant please".
The standard Cache-Aside move: after the DB write succeeds, the app calls DEL user:42. The next reader will miss and refill from the new DB row.
Pros: tight consistency; the staleness window is just the time between DB commit and cache DEL (about one Redis round-trip, roughly a millisecond). Cons: if the DEL fails (network blip, Redis hiccup), the cache stays stale until the TTL fires. Belt-and-braces approach: always combine with a TTL, so the worst case is bounded.
The writer publishes an event ("user:42 changed") onto a channel — Redis pub/sub, Kafka, or AWS SNS. Every app server with an in-process cache subscribes, and on receiving the message clears its local entry.
Why this exists: in-process caches don't have a single "DEL endpoint" — there are N servers, each with their own copy. Pub/sub fans the invalidation out automatically. Cons: at-least-once delivery means duplicate invalidations (cheap); message loss means stale data (expensive — so use a delivery system that doesn't drop, or combine with TTL).
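The fan-out can be sketched with an in-memory bus standing in for Redis pub/sub or Kafka. `InvalidationBus` and `LocalCache` are hypothetical names; delivery here is synchronous and lossless, which a real network channel is not.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** In-memory stand-in for a pub/sub channel (Redis pub/sub, Kafka, SNS in production). */
class InvalidationBus {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<String> onKeyChanged) {
        subscribers.add(onKeyChanged);
    }

    void publish(String key) {
        for (Consumer<String> s : subscribers) s.accept(key);
    }
}

/** One app server's in-process cache; it evicts a key whenever the bus announces a change. */
class LocalCache {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    LocalCache(InvalidationBus bus) {
        // Removing an absent key is a no-op, so duplicate messages are harmless.
        bus.subscribe(entries::remove);
    }

    void put(String key, String value) {
        entries.put(key, value);
    }

    String get(String key) {
        return entries.get(key);
    }
}
```

Because eviction is idempotent, at-least-once delivery costs nothing extra; a lost message is the case that still needs a TTL as backstop.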
Don't update the value — add a version to the key. user:42:v1 becomes user:42:v2. The application reads through a small "current version" indirection. Old keys age out via TTL.
Best for: CDN caches (cache-bust by changing the URL). When you deploy a new CSS file, ship it at style.abc123.css instead of mutating style.css. Browser caches naturally evict the old version because nothing references it.
On a stale read, serve the stale value immediately and refresh asynchronously. The user never waits for the DB; the cache repairs itself in the background.
Pros: consistently low p99 latency, because no one ever pays the cold-miss cost on the request path. Cons: readers see slightly-stale data (usually fine for product listings, news feeds, etc.). Implemented in the HTTP Cache-Control stale-while-revalidate extension (RFC 5861), the SWR library's useSWR React hook (from Vercel, the team behind Next.js), and Caffeine's refreshAfterWrite.
A subtle problem: Sarah updates her name from "Sarah" to "Sara" and the next page reload still shows "Sarah". Even if the cache is mostly fresh, her own session sees the old value. Fix: after a user's own write, pin that user's next few reads to bypass the cache (or read from the DB primary, not the replica). Cheap to implement, makes the UX feel instant.
The patterns above all work fine on the happy path. The senior-engineer test is: what happens when traffic spikes, the cache cold-starts, or someone queries a non-existent ID a million times a second? These are the five failures every interviewer wants you to name.
The scene: a popular cache key — say, homepage:top-posts — expires at exactly 11:00:00. In the millisecond after expiry, 10,000 concurrent requests all miss, all hit the database, all compute the same expensive aggregation, and all write it back. The database falls over. The cache is now warm again — for the next 60 seconds, until the next expiry, when the herd thunders again.
```java
public List<Post> topPosts() {
    List<Post> cached = redis.get("top-posts");
    if (cached != null) return cached;

    // Only ONE thread per JVM gets through this lock per key
    synchronized (lockFor("top-posts")) {
        cached = redis.get("top-posts");     // double-check after lock
        if (cached != null) return cached;

        List<Post> fresh = expensiveQuery();
        redis.setex("top-posts", 60, fresh);
        return fresh;
    }
}
```
For multi-server protection use Redis SETNX as a distributed lock, or better — use a Probabilistic Early Expiration (PEE) scheme: each reader has a small probability of triggering a refresh before the TTL expires, proportional to how close to expiry the key is. By the time the TTL actually hits, the value has already been refreshed by one lucky reader.
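One published formulation of probabilistic early expiration is known as XFetch. The sketch below follows that formulation, which is an assumption about the scheme intended here: a reader refreshes early when `now - delta * beta * ln(rand)` reaches the expiry time, where `delta` is how long the value takes to recompute. The class and parameter names are illustrative.

```java
import java.util.function.DoubleSupplier;

/** Probabilistic early expiration, following the XFetch formulation. */
class EarlyExpiry {
    /**
     * @param nowMillis       current time
     * @param expiryMillis    when the cached value's TTL runs out
     * @param recomputeMillis how long the last recomputation took (delta)
     * @param beta            above 1 refreshes earlier, below 1 later; 1.0 is the usual default
     * @param uniform         source of random numbers in [0, 1)
     */
    static boolean shouldRefresh(long nowMillis, long expiryMillis,
                                 long recomputeMillis, double beta, DoubleSupplier uniform) {
        double u = 1.0 - uniform.getAsDouble();                    // shift to (0, 1] so log() is finite
        double headStart = -recomputeMillis * beta * Math.log(u);  // never negative
        return nowMillis + headStart >= expiryMillis;
    }
}
```

Far from expiry almost no reader draws a large enough head start, so refreshes are rare; as expiry approaches the chance rises, and at expiry every reader refreshes. In production `uniform` could be `() -> ThreadLocalRandom.current().nextDouble()`.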
The scene: at 11:00 AM, after a deploy, the cache fills with thousands of keys all set to TTL = 1 hour. At 12:00 PM, every key expires within the same minute. Suddenly the database is hammered by every key in the system simultaneously.
The fix — jittered TTLs: never set every key to exactly the same TTL. Add a small random jitter — instead of EX 3600, use EX 3600 + rand(0..300). Now the expirations are spread across a 5-minute window instead of a 1-second window. Cheap, effective, and almost no one remembers to do it until they've been bitten.
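The jitter rule is a one-liner. A sketch, with `TtlJitter` and its method name chosen for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Adds random jitter to a base TTL so keys written together do not expire together. */
class TtlJitter {
    static int jitteredTtlSeconds(int baseSeconds, int maxJitterSeconds) {
        // nextInt(bound) excludes bound, so +1 makes the jitter range 0..maxJitterSeconds
        return baseSeconds + ThreadLocalRandom.current().nextInt(maxJitterSeconds + 1);
    }
}
```

`TtlJitter.jitteredTtlSeconds(3600, 300)` yields a value between 3600 and 3900 inclusive, matching the `EX 3600 + rand(0..300)` example.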
The scene: an attacker (or a buggy client) queries user:99999999 a million times a second. The cache misses (no such user). The DB returns "not found". The cache doesn't store a "negative" answer. Every single one of those million requests hits the DB.
When the DB says "not found", cache that negative result too: SET user:99999999 = NULL EX 30. Now subsequent identical queries hit the cache and skip the DB entirely. Use a short TTL — you don't want to lock out a legit insertion.
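Negative caching can be sketched by storing an empty `Optional` as the "not found" marker. The `NegativeCache` class and the `dbLookup` function are hypothetical, and the TTL is omitted to keep the sketch short; as noted above, a real negative entry needs a short expiry.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Negative-caching sketch: "not found" is stored as Optional.empty() so repeats skip the DB. */
class NegativeCache<K, V> {
    private final Map<K, Optional<V>> store = new ConcurrentHashMap<>();
    private final Function<K, V> dbLookup;            // hypothetical; returns null when absent
    final AtomicInteger dbCalls = new AtomicInteger();

    NegativeCache(Function<K, V> dbLookup) {
        this.dbLookup = dbLookup;
    }

    Optional<V> get(K key) {
        Optional<V> cached = store.get(key);
        if (cached != null) return cached;            // hit, positive or negative
        dbCalls.incrementAndGet();
        Optional<V> loaded = Optional.ofNullable(dbLookup.apply(key));
        store.put(key, loaded);                        // production code would set a short TTL here
        return loaded;
    }
}
```

A thousand lookups for a missing ID reach the database once; the other 999 are answered by the cached empty marker.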
Before checking the cache, check a Bloom filter (a compact probabilistic data structure that answers "does this key definitely not exist?"). If the Bloom filter says no, skip both cache and DB. Used at scale by Cassandra and the LinkedIn newsfeed.
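A Bloom filter fits in a few dozen lines. The sketch below is illustrative only: the sizing is left to the caller and the two base hashes are derived from `String.hashCode()`, which is weaker than what a production library would use.

```java
import java.util.BitSet;

/** Bloom filter sketch: no false negatives, a tunable rate of false positives. */
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) bits.set(index(key, i));
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) return false;
        }
        return true;
    }

    // Double hashing: the i-th bit position is derived from two base hashes of the key.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;   // forced odd so probes differ
        return Math.floorMod(h1 + i * h2, size);
    }
}
```

The request path calls `mightContain` first and rejects the lookup on `false`; a `true` still has to go through the cache and the DB, because it may be a false positive.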
The scene: Cristiano Ronaldo's Instagram profile gets 50% of all reads to your sharded cache. Whichever shard holds his key gets 50% of all traffic. That shard's CPU spikes to 100% while the other shards yawn at 5%.
The fixes:
- Replicate the hot key as ronaldo:0, ronaldo:1, …, ronaldo:9 on different shards. Reads pick a random suffix.
- Detect hot keys automatically, for example from --hotkeys stats. Tools like Twitter's Pelikan or Netflix's EVCache do this in production.

The scene: someone caches a 100 MB JSON blob under one key. Every read serializes 100 MB across the network. Every write blocks the cache (Redis is single-threaded, so a big write stalls every other request). Memory fragments. The eviction policy is forced to dump 99 small useful entries to make room for one giant one.
Fix: split big values. If you must cache a large object, break it into chunks (thread:42:posts:0, thread:42:posts:1, …) or use a Redis hash/list with per-field operations so you can read just the bit you need.
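Chunking can be sketched as two small helpers, one to split the list and one to build the suffixed key for each piece. The `Chunker` class is hypothetical, and the key format follows the `thread:42:posts:0` example.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a large list into fixed-size chunks, each stored under its own suffixed key. */
class Chunker {
    static String chunkKey(String baseKey, int chunkIndex) {
        return baseKey + ":" + chunkIndex;
    }

    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < items.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(from, to)));   // copy, not a view
        }
        return chunks;
    }
}
```

A reader that wants posts 100 to 199 of a 250-post thread fetches only `thread:42:posts:1` instead of the whole blob.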
Let's walk through how a real production cache evolves, the same way a team would build it. We start from the simplest possible thing, watch it break, fix the break, and repeat until we land on a shape that actually survives a Black Friday.
One service, one Redis box, one Postgres. App talks to Redis on every read, falls back to Postgres on miss. This is what your first Cache-Aside looks like.
Works beautifully at 100 RPS. At 10k RPS, three things break:
Before scaling anything, name the split. We have three tiers of data, each with very different access shapes — and we should serve them from three different places.
Hot tier: ~1% of keys, ~70% of reads. Lives in in-process Caffeine cache on each app server. Network-free reads, microsecond latency. Eventually consistent across servers, but that's fine for a 30-second window.
Warm tier: ~20% of keys, ~25% of reads. Lives in a Redis cluster, partitioned via consistent hashing, replicated 2×. Sub-millisecond reads, shared across all app servers.
Cold tier: ~79% of keys, ~5% of reads. Lives only in Postgres (or your DB of choice). We don't waste cache memory on data that's barely read. Cache penetration here is handled by a Bloom filter at the app layer.
Each tier has its own eviction policy, its own consistency story, and its own monitoring. Hot uses W-TinyLFU with short TTL + pub/sub invalidation. Warm uses LRU + TTL + jitter + explicit invalidation. Cold uses whatever your DB does (buffer pool LRU).
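Now we wire it all together. The final architecture has ten numbered components: ① browser cache, ② CDN edge, ③ load balancer, ④ app server with its in-process cache, ⑤ Bloom filter, ⑥ Redis cluster, ⑦ hot-key replicator, ⑧ invalidation pub/sub bus, ⑨ database, ⑩ metrics. The cards below follow that order. Each card answers four things: what it is, why it exists, what would break without it, and where it sits in the flow.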
The user's own browser caching HTML/JS/CSS/images per HTTP headers. It's the cheapest cache in the system — bytes never leave the laptop. Controlled by Cache-Control, ETag, Last-Modified.
Solves: repeat visits to the same page would otherwise re-download megabytes of static assets every time. Without it, every Sarah loading her feed would burn fresh bandwidth on the same logo and CSS files.
A geographically-distributed cache provided by Cloudflare/Fastly/Akamai. Caches static assets and, with the right cache keys, also short-TTL dynamic API responses. Sarah in Mumbai hits the Mumbai PoP; Raj in São Paulo hits São Paulo.
Solves: the speed-of-light tax. Without a CDN, every cross-continent request adds 150–250 ms of network latency. With it, the cached response is served from an edge location near the user.
nginx, AWS ALB, or Envoy. Distributes incoming requests across app servers. Some LBs also do an HTTP cache here (Varnish-style) for identical requests within a few seconds.
Solves: the question "which app server gets this request?" Without it, you can't horizontally scale your app tier. The cache here, when present, absorbs micro-bursts of identical requests.
The Java/Node/Go process that runs your business logic. Embeds a Caffeine (or similar) cache holding the hot 1% of keys. Sub-microsecond reads, but each server has its own copy.
Solves: the 70% of traffic that's all hitting the same handful of keys (Ronaldo, the trending tag, the system config). A Redis trip is ~1 ms; this is <1 µs. At 10k RPS that latency cut is the difference between a one-server and a five-server fleet.
A small probabilistic structure that answers "is this key definitely not in the dataset?" Checked before Redis and the DB.
Solves: cache penetration attacks — millions of queries for user:99999999 that don't exist. Without it, each one becomes a DB miss and brings down Postgres. With it, the app rejects them in microseconds without ever touching the DB.
The shared warm tier. Partitioned via consistent hashing across N shards, each replicated 2× for failover. Holds ~20% of keys responsible for ~25% of reads. Sub-millisecond access from any app server.
Solves: the consistency problem of in-process caches — when Sarah updates her profile on app-server-A, app-server-B needs to know. Redis is the shared source of truth for cached data. Also handles the keys too large/cold for in-process but too hot for the DB.
A small layer that mirrors known hot keys across multiple Redis shards under suffixed names (ronaldo:0, ronaldo:1, …, ronaldo:9). Reads pick a random suffix.
Solves: the hot-shard problem. Without this, one Redis box gets 50% of the cluster's network traffic and CPU. With this, the load spreads evenly. Typically driven by an auto-detector that watches access counts and auto-mirrors any key crossing a threshold.
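The suffix scheme can be sketched as a pair of key helpers. `HotKeyReplicas` is a hypothetical name, and whether the suffixed keys actually land on different shards depends on the cluster's hashing, which this sketch does not model.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hot-key replication sketch: N copies under suffixed names, reads pick one at random. */
class HotKeyReplicas {
    private final int replicas;

    HotKeyReplicas(int replicas) {
        this.replicas = replicas;
    }

    /** Every key a writer must update (or delete) to keep the copies in sync. */
    String[] writeKeys(String hotKey) {
        String[] keys = new String[replicas];
        for (int i = 0; i < replicas; i++) keys[i] = hotKey + ":" + i;
        return keys;
    }

    /** The single copy this read goes to; the random suffix spreads reads across copies. */
    String readKey(String hotKey) {
        return hotKey + ":" + ThreadLocalRandom.current().nextInt(replicas);
    }
}
```

The cost moves to the write path: one logical update now touches every copy, which is acceptable for a key that is read far more often than it is written.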
A Redis channel or Kafka topic carrying "this key changed" events. App servers subscribe and evict matching entries from their in-process Caffeine caches.
Solves: the "N copies of the same in-process cache drift apart" problem. Without it, the only way to clear an in-process cache entry on every server is to wait for its TTL to expire — and the bigger the fleet, the worse the staleness. Pub/sub fans the invalidation out in milliseconds.
The source of truth. Postgres or MySQL or whatever you're using. Read replicas absorb the read traffic that escapes the cache; the primary takes all writes. The DB's own buffer pool is also a cache — undersize it and "cache miss → cold DB query" lands a Sev1.
Solves: everything. The cache is an optimization; the DB is the contract. Tune it for cache-miss survival, not for cache-hit fastpath.
Prometheus + Grafana (or your stack) capturing: cache hit ratio, p99 latency per layer, key-count, eviction rate, hot-key counters, miss rate per shard.
Solves: "the system is slow, where do we look?" Without metrics per layer you can't tell whether the regression is in CDN miss-rate, Redis CPU, hot-key skew, or buffer-pool churn. Every senior engineer can name the cache hit ratio of their service to within 5% — if you can't, you're flying blind.
Let's trace one request through all 10 components. Sarah opens example.com/sarah on her phone in Mumbai. Here's what happens, second by second:
- Sarah's phone requests /api/users/sarah. The request hits ② Cloudflare's Mumbai PoP, which has a 60-second cached copy. Cache hit, served in 5 ms.
- Sarah edits her profile. The PUT request bypasses the CDN (write methods aren't cached) and goes through ③ the LB to ④ an app server.
- The app server sends DEL user:sarah to ⑥ Redis (single shard via consistent hash).
- The app server publishes {"key":"user:sarah"} on ⑧ the pub/sub bus. Within ~50 ms, every other app server's ④ in-process cache evicts its copy of user:sarah.
- Meanwhile, the hot-key detector notices that celebrity:ronaldo is being read 200k times/sec on shard-3 and starts mirroring it to shards 4 and 5 under suffixed keys. Shard-3 CPU drops from 92% to 34%.

The hardest part of cache design isn't knowing the patterns; it's knowing which to pick for the workload in front of you. The table below maps common scenarios to their best default. Use it as a starting point, not a law.
| Scenario | Best default strategy | Eviction | Why |
|---|---|---|---|
| User profile data (read-heavy, occasional writes) | Cache-Aside + explicit invalidation | LRU + 5 min TTL | Reads dominate; staleness on writes is bounded. |
| Session tokens (must be fresh; bounded life) | Write-Through | TTL (= session length) | Zero staleness; TTL handles cleanup naturally. |
| High-throughput counters (likes, views) | Write-Behind to durable log | None (always hot) | 10k writes/sec straight to the DB is hard to sustain; batch them. |
| Audit / log inserts (rarely re-read) | Write-Around | — | Don't pollute cache with cold data. |
| Homepage / leaderboard (hot, predictable, latency-critical) | Refresh-Ahead | TTL with PEE | Never let users hit a cold miss; refresh proactively. |
| Hot celebrity / trending entity | Cache-Aside + hot-key replication | LFU or pinned | Distribute load across shards; favor stickiness. |
| "Does this entity exist?" lookups | Bloom filter + negative caching | — | Stop "user 99999999" attacks at the door. |
| Product price during a flash sale | Cache-Aside + pub/sub invalidation | 1 sec TTL | Must update fast across all servers when admins change price. |
| Static assets (CSS, images) | CDN + URL versioning | Long TTL (1 year) | Immutable artifacts; cache forever and bust by URL. |
| News feed / timeline | Stale-While-Revalidate | 30 sec staleness budget | Snappy reads, freshness in the background. |
These are the questions that come up in distributed-systems rounds — caching turns up in nearly every system-design interview. Each answer is short enough to deliver in 60 seconds while still showing depth.
Why consistent hashing instead of hash(key) % N? Modulo hashing forces roughly (N-1)/N of all keys to remap when N changes. Going from 4 → 5 nodes invalidates ~80% of cache entries, effectively wiping the cache. Consistent hashing places nodes and keys on a virtual ring; adding a node only moves keys in one arc of the ring (~1/N of the data). Combined with virtual nodes (~100 vnodes per physical node), it gives smooth, predictable scaling.

How do you stop a thundering herd when a hot key expires? (1) Locking: only one request (the one that wins a lock, e.g. SETNX) reloads it; others wait for the result. (2) Probabilistic Early Expiration (PEE): each reader probabilistically triggers a refresh before TTL based on how close to expiry the key is. (3) Stale-While-Revalidate: serve the stale value immediately and refresh asynchronously, so nobody ever waits for a cold miss.

How do you handle a hot key? Replicate it under suffixed keys on different shards and let reads pick a random suffix; detect hot keys automatically (e.g. --hotkeys) and apply the above dynamically.

How do you prevent a cache avalanche? Jittered TTLs (e.g. EX 3600 + rand(0..300)) spread the expirations over a window, smoothing the DB load.

How do you handle the race between a cache fill and a concurrent write? Use versioned writes (store a version/updated_at in the cache and refuse to overwrite a newer version with an older one), or use Redis transactions/WATCH, or use the "delete-after-write + short TTL" combo so the staleness window is bounded even when the race lands.

Should you cache a per-user feed? Per-user keys are read by only one user (e.g. user:42:feed), which means low hit rate per key and a lot of memory pressure. Make sure: (1) the TTL is short (data changes frequently), (2) you cache derived data (the rendered feed), not the underlying entities (which are cached separately), and (3) invalidation triggers on the user's own writes (read-your-writes). Otherwise you'll waste memory caching things one user reads once.

If you remember nothing else from this article, remember these.