RDBMS, Document, Wide-Column, Key-Value, Search Engine, Time-Series, Graph, Vector, Object Store, NewSQL — the storage vocabulary every system-design interview expects you to wield like a tool, not a toy.
Imagine Raj walks into the system-design loop at a FAANG-style company. The interviewer says "design Twitter". Raj draws a load balancer, an app server, and writes "DB" in a box. The interviewer taps the box: "which DB?" Raj says "Postgres". The interviewer asks why. Raj says "it's what I know." The interview is effectively over — not because Postgres is wrong, but because Raj treated the DB as a black box. The interviewer just learned that Raj has never actually felt the consequences of that choice in production.
Here is the secret most candidates miss: the database family is the single most consequential decision in any HLD. Every other choice — sharding, caching, replication, indexing, the API shape, even the team's on-call rotation — falls out of that one box. Pick wrong and you spend the rest of the interview (and the rest of the system's life) papering over the mismatch. Pick right and the rest of the design almost writes itself.
Knowing a database is not "I've used MongoDB". It's being able to say, in one breath, four things about it:
By the end of this page you should be able to look at any interview prompt — "design a chat app", "design a search service", "design a leaderboard" — and within thirty seconds name the family you'd start from and what you'd reach for next when the first one's limits show up. That is the muscle this page builds.
Before we look at any product names, let's build the mental framework. Every database — every single one — can be placed on four axes. Internalize these and you can derive which family fits a workload, instead of memorizing a list. This is the difference between recognizing a database and understanding it.
Imagine you're storing a customer. Is the customer one fixed-shape row (id, name, email, plan)? Or a nested document with a varying number of addresses, payment methods, and metadata blobs? Or just a string blob indexed by id? Or a node with edges to other customers? The natural shape of your record is the strongest signal of which family fits.
Fixed columns, typed, the same shape for every row. Joins across tables compose new shapes. → RDBMS.
Self-contained JSON-like tree per record, varying shape allowed. → Document DB.
Lookup by exact key, value is opaque (string, hash, list, set). → Key-Value.
One partition key + millions of column-value pairs underneath. → Wide-column.
The storage engine is the single biggest predictor of "is this DB fast at reads or writes?". Two engines dominate. B-tree: every write seeks into a sorted on-disk tree and updates in place — great for point reads and range scans, slower for high write volume because each write may rewrite a page. LSM-tree (Log-Structured Merge): every write goes to a sequential append-only log, then to an in-memory sorted buffer, then flushes to immutable files on disk that get compacted in the background — phenomenal write throughput, slightly slower reads because a key may live in several files. Inverted index: builds a "word → document list" map so full-text queries are a hash lookup, not a scan. Columnar: stores all values of one column together so analytical aggregations skip the columns they don't need.
Every distributed DB makes a trade-off when the network breaks: do you reject writes to stay consistent (CP), or accept writes and resolve conflicts later (AP)? Single-node RDBMS sidesteps this entirely by being… single-node — strong consistency by default, but you scale vertically and hope you never max out the box. Cassandra picks AP — writes always succeed, reads may see stale data for a moment. Spanner picks CP with linearizable reads via atomic clocks — at the cost of bounded latency. Knowing where your DB sits on this axis is what lets you answer "what happens during a network partition?" — the question interviewers love to drop in deep-dives.
Some databases grow by getting a bigger box (vertical scaling: Postgres, MySQL). Some shard horizontally but you manage the shards (manual sharding). Some shard automatically and you never see it (DynamoDB, Spanner, Cassandra). Some are peer-to-peer with no master at all (Cassandra, Riak). This axis decides what your on-call life looks like at 10x growth — whether you're upgrading the box at 2am or just provisioning more capacity in a console.
| Family | Shape | Engine | Consistency Default | Scale Axis |
|---|---|---|---|---|
| RDBMS | Tabular + joins | B-tree | Strong (ACID) | Vertical, then manual shard |
| Document | JSON tree | B-tree or LSM | Strong (single doc) | Horizontal (sharded clusters) |
| Wide-column | Partition + clustering | LSM-tree | Tunable, AP-default | Peer-to-peer horizontal |
| Key-value | Key → blob | Hash / LSM | Varies (Redis: per-node) | Shard by key hash |
| Search engine | Document + tokens | Inverted index | Eventually consistent (≈1s refresh) | Horizontal sharded indices |
| Time-series | (metric, ts, value) | LSM / columnar | Eventually consistent | Time-partitioned shards |
| Graph | Nodes + edges | Adjacency list + index | Strong (single instance) | Hard — graphs don't shard cleanly |
| Vector | Embedding + payload | HNSW / IVF index | Eventually consistent | Horizontal sharded indices |
| Object store | Path → blob | Distributed FS | Strong read-after-write | Effectively infinite (S3) |
| NewSQL | Tabular + joins | Distributed B-tree + Raft/Paxos | Linearizable (CP) | Auto-shard horizontal |
Before we zoom into each family, here is the whole landscape on one canvas. Read it like a periodic table: the two big buckets are SQL (left, blue) and NoSQL (right, orange), with specialized stores hanging off each (purple = search/analytics, green = blob/file, cyan = niche shapes). Numbers ①–⑩ map to the deep-dives that follow.
Sarah is building an e-commerce checkout. A customer adds two items, applies a coupon, hits "buy". Three rows must change atomically — inventory drops by 1 in each item row, the order row gets inserted, the customer's loyalty points get bumped. If any one of those fails, all of them must roll back. If two customers race for the last unit of an item, exactly one of them must win and the other must see "out of stock". This is the textbook RDBMS workload — and it's why Postgres and MySQL still power roughly every business-critical system on Earth.
The storyBorn in the 1970s out of Edgar Codd's relational model. The mental model: data lives in tables (fixed columns, typed). Tables reference each other via foreign keys. The query language (SQL) lets you compose new shapes on the fly via JOINs. The killer feature is ACID transactions — Atomicity (all-or-nothing), Consistency (constraints hold), Isolation (concurrent transactions don't see each other's half-states), Durability (committed data survives a crash). When the interviewer says "money is involved", this is the default answer.
When to pickAnything that touches money, inventory, identity, or correctness-critical state. Multi-table relationships where the join matters. Complex ad-hoc queries (analysts will write SQL you didn't anticipate). Strong consistency required. Schema is well-known and stable.
FeaturesB-tree indexes (and hash, GIN, GiST, BRIN in Postgres). MVCC (Multi-Version Concurrency Control) — readers never block writers. Foreign-key constraints. CHECK constraints. Views and materialized views. Full SQL with window functions, CTEs, recursive queries. JSONB column in Postgres makes it a half-decent document store too. Logical replication for streaming changes.
PatternsMaster + read replicas for read scaling. Connection pooling via PgBouncer (Postgres maxes out around 500-1000 simultaneous connections without it). SELECT ... FOR UPDATE for pessimistic row locks (the inventory case). Optimistic concurrency via a version column for low-contention writes. Debezium + Kafka for change-data-capture into downstream systems.
Trade-offsVertical-scale ceiling — a single Postgres box maxes out around 5K–10K complex writes/sec on commodity hardware. Sharding is genuinely hard (no native multi-shard transactions; you build the routing layer yourself). Schema migrations on huge tables can lock writes for hours without careful tooling (pg_repack, gh-ost). Joins across a sharded cluster are expensive.
Interview tipDrop these terms casually: isolation levels ("READ COMMITTED is the default; I'd bump to SERIALIZABLE for the inventory check"), SELECT FOR UPDATE for pessimistic row locks, partial indexes on WHERE deleted_at IS NULL for soft-deletes, GIN index for full-text or JSONB, logical replication for read replicas, connection pooler for >1K concurrent clients. The senior tell: explaining when you'd use SERIALIZABLE vs SELECT FOR UPDATE vs an optimistic version column.
Imagine you're building a CMS where every "article" has a wildly different shape — some have a video block, some have polls, some have nested comments three levels deep. You don't want to invent a new table for every variation. You want to drop the whole article as a JSON tree into one place, then query it back the same way. That's the document model: the record is a self-describing JSON-like document, and the DB indexes the fields inside.
The storyBorn around 2009 in the heyday of "the relational model doesn't fit the web". MongoDB stores BSON (binary JSON) documents inside collections (the document equivalent of tables). Documents in the same collection can have different fields — schema is enforced by the application, not the DB. You query with a JSON-shaped query language, and the aggregation pipeline composes operations like $match → $group → $sort. Sharded clusters with config servers handle horizontal scale.
When to pickSchema is genuinely fluid and changing rapidly (early-stage product). Data is naturally hierarchical and always read as a whole (a blog post with its embedded comments). The team has more JavaScript than SQL muscle. Increasingly rare as a default in 2026 — Postgres JSONB has eaten most of Mongo's niche and made the choice tougher to defend.
FeaturesBSON document model, aggregation pipeline, secondary indexes (single, compound, multikey, text, geo, wildcard, hashed), change streams (CDC out of the box), sharded clusters, multi-document ACID transactions (since 4.0), MongoDB Atlas as the managed offering.
PatternsEmbed related data when always read together (a user with their addresses inline). Reference (foreign-key style) when the related data is large, shared, or changes independently. Aggregation pipelines for analytics-style queries on operational data. Shard key chosen to spread writes — picking _id (monotonic ObjectIds) creates a hot shard, a classic interview gotcha.
Trade-offsWas famously eventually-consistent for years before tightening defaults. Hot documents (e.g., a global counter) suffer write conflicts because the unit of atomicity is the document. Index size grows fast — every secondary index is its own B-tree. Operationally heavier than Postgres for equivalent workloads in most teams' hands.
Interview tipBe ready to defend why you'd pick Mongo over Postgres in 2026 — usually the answer is "I wouldn't, unless the team already has Mongo expertise or the hierarchy is so deep that JSONB queries get awkward." Naming Mongo without that defense is a junior signal. Strong signals when you do pick it: discussing shard key choice ("not _id — monotonic insert makes the latest shard hot"), embed-vs-reference decision per relationship, and the read concern / write concern levels.
Imagine you're building WhatsApp's message inbox. Every user has a stream of messages — could be 50, could be 50 million. The query you always run is "give me Sarah's last 100 messages, newest first." You don't join, you don't aggregate across users, you don't need ad-hoc questions. You just want to grab one user's slice fast, even at 2 billion users. This is exactly the workload Cassandra was built for.
The storyBorn from Google's Bigtable paper (2006) and Amazon's Dynamo paper (2007). The mental model: every row has a partition key (decides which node owns it) and a clustering key (decides the sort order within that partition). A "row" can have millions of columns, and adjacent rows within the same partition are stored together on disk — so reading "the last 100 messages for user 42" is a single seek + scan. The storage engine is an LSM-tree, which means writes are append-only and blindingly fast. The cluster is peer-to-peer — no master, every node knows the ring.
When to pickWrite-heavy workloads. Massive scale (multi-TB to PB). Access pattern is "give me everything for partition X, optionally within range Y". Time-series-shaped data (sensor readings, event logs, user feeds, message inboxes, audit trails). You can design tables around queries up front. You're willing to sacrifice flexible queries for horizontal scale.
FeaturesTunable consistency per query (R + W > N for strong reads; R=1, W=1 for fastest). Wide columns — a partition can hold millions of clustered rows. LSM-tree storage (write-optimized). Peer-to-peer (no master, no SPOF). Multi-datacenter replication built in. Hinted handoff for transient node failures. DynamoDB adds managed sharding and global tables for multi-region active-active.
PatternsDesign one table per query — denormalize aggressively, because joins don't exist. Partition key = the thing you always filter by (user_id). Clustering key = sort within partition (created_at DESC). For "Sarah's last 50 messages": PRIMARY KEY ((user_id), created_at). Replication factor of 3 across 3 AZs is the standard production setup.
Trade-offsNo joins, ever. Eventually consistent by default — you ask for stronger guarantees per query at the cost of latency. Query patterns must be designed up front; ad-hoc queries without the partition key trigger scatter-gather across the whole cluster and you'll cancel them before they finish. Compaction (merging SSTables) is the operational headache at scale. Counters and lightweight transactions are second-class citizens.
Interview tip"Design the partition key around the access pattern; never query without the partition key — otherwise it's scatter-gather and that's a query you cancel before it finishes." Mention R + W > N quorum for strong consistency, hinted handoff for transient failures, SSTable compaction as the ops headache, tombstones as the deletion gotcha. For DynamoDB specifically: "single-table design" with PK + SK and GSIs for secondary access patterns.
Imagine a user logs into Netflix. Before the homepage can render, the server needs to know who this user is — name, plan, language preference, A/B test bucket — and it needs that in under a millisecond. The session_id is the key, the user blob is the value, and the question is the simplest possible: "give me the value for this key." No joins, no filters, no aggregations. Just a hash lookup at network speed. That's the key-value workload, and Redis serves billions of these per second across every modern product you use.
The storyThe simplest data model imaginable: a giant distributed hash map. Key in, value out. The value can be a string, a hash (field-value pairs), a list, a set, a sorted set (Redis only) — anything that fits in memory. Redis is the dominant in-memory choice: data lives in RAM, optionally persisted to disk via AOF/RDB snapshots, sub-millisecond latency at 100K+ ops/sec per node. Memcached is the stripped-down cousin: no data structures, no persistence, just blazing-fast in-memory key→string.
When to pickCaching (the #1 use). Session stores. Rate limiting (atomic INCR with TTL). Leaderboards (sorted sets). Real-time counters. Pub/sub (Redis Streams). Distributed locks (Redlock). Anywhere "by-key lookup" is the only access pattern and you need it in <1ms.
FeaturesRedis data structures: STRING, HASH, LIST, SET, ZSET (sorted set), STREAM, HyperLogLog, BITMAP, geo indexes. TTLs per key. Atomic operations (INCR, ZADD). Lua scripting for multi-step atomicity. Pub/sub. Redis Cluster for sharding (16384 hash slots). Persistence via AOF (append-only file) and RDB (snapshot). Sentinel for HA failover.
PatternsCache-aside (most common): app checks Redis, on miss reads DB and writes back to Redis. Write-through: writes go to Redis and DB synchronously. Write-behind: writes go to Redis, async flushed to DB. Sorted set for leaderboards: ZADD scores 1500 alice; ZREVRANGE scores 0 10. Atomic counters with TTL for rate limiting. SETNX for distributed locks (with Redlock for safer multi-node).
Trade-offsData lives in RAM — capacity is expensive. Eviction policy (LRU, LFU, TTL) matters under pressure. Persistence is best-effort, not the same guarantee as Postgres' fsync. Multi-key atomic ops need same shard (hash tags {user:42}:profile). Memcached lacks data structures, persistence, and replication — pure ephemeral cache.
Interview tipNever just say "I'll use Redis". Say: "I'll store user:{id} as a Redis HASH with 24h TTL, shard the cluster across 6 nodes via consistent hashing, set LRU eviction at 64GB/node, and use HSET for partial updates." For leaderboards: ZSET. For rate limit: INCR + EXPIRE in a Lua script. For pub/sub or queues: Redis Streams. The senior tell: knowing when Memcached suffices over Redis (pure read-through cache, no data structures, no persistence needed).
Imagine a user types "blue runnng shoes" (typo intentional) into Amazon's search box. They expect results in 150ms, ranked by relevance, with typo tolerance, with the brand and price filters on the left, and aggregated counts per category. Postgres' WHERE title ILIKE '%running shoes%' would scan every row and miss the typo. You need an inverted index — a data structure that maps every token ("blue", "running", "shoes") to the list of documents that contain it. That's Elasticsearch.
The storyAll built on Apache Lucene's inverted-index engine. The mental model: when you index a document, the engine tokenizes the text ("running shoes" → ["run", "shoe"] after stemming + stopword removal), then maps each token to a posting list of document IDs. When you query, the engine intersects the posting lists, scores each match with BM25 (a relevance formula based on term frequency and inverse document frequency), and returns the top-K. Elasticsearch wraps Lucene in a distributed cluster — sharded indices, replica shards for HA, JSON query DSL, REST API.
When to pickFull-text search with relevance ranking. Typo tolerance / fuzzy matching. Faceted search ("filter by category, see counts"). Geo queries (geo_point, geo_shape). Log aggregation and observability (the "E" in ELK / EFK stack). Aggregations across millions of docs (date histograms, percentiles, terms). Anywhere Postgres' ILIKE stops being good enough.
FeaturesInverted index (the magic). Tokenization + stemming + stopword removal + synonyms. Fuzzy matching by edit distance. BM25 relevance scoring (and now learning-to-rank). Aggregations (terms, date histograms, percentiles, cardinality). Geo queries. Index aliases for zero-downtime reindex. Hot / warm / cold tiers for log workloads. Vector search via dense_vector and ANN since 8.x.
PatternsPair with Kafka for the indexing pipeline — app writes to source-of-truth DB (Postgres), CDC stream emits to Kafka, an indexer consumer updates ES. Time-based indices for log data: one index per day (logs-2026-05-07) so dropping old logs is O(1) and recent data gets more shards. Aliases let you swap the live index without app changes. Search-side: use match for text, term for exact, bool to combine, aggregations for facets.
Trade-offsNot a source of truth — eventually consistent (1s default refresh before docs are searchable). Schema rigidity via mappings (changing a field type requires reindex). Operationally heavy: shard sizing, JVM heap tuning, hot/warm/cold tier design. Memory hungry. Lucene merges (segment compaction) are expensive at scale. License changes (Elastic vs OpenSearch fork in 2021) muddied the product picture.
Interview tip"Postgres is the source of truth; Elasticsearch is the search-side view, fed by CDC through Kafka." Name the indexing pipeline explicitly — it's the part juniors hand-wave. Drop these: inverted index, BM25, tokenizer + analyzer, shard count is forever (you can't change it without reindex), refresh interval (the 1s lag), index aliases for zero-downtime reindex.
Imagine you're storing CPU usage from 100,000 servers, sampled every 10 seconds. That's 10 million data points per second, all of shape (server_id, timestamp, metric_name, value). The questions you ask: "show me CPU on web-server-42 for the last hour", "average CPU across the fleet by minute for the last 24h", "alert me when p99 latency crosses 500ms over a 5-minute window". This is a workload no general-purpose DB likes — but specialized time-series DBs make it trivial.
The storyTime-series data has unique shape: append-only, never updated, almost always queried by a time range plus a label filter, and aggregated in time buckets. Specialized DBs exploit this — they partition by time (one chunk per hour or day), apply heavy compression on per-column runs (timestamps and values both compress beautifully when sorted by time), and ship built-in aggregation operators. Prometheus dominates monitoring: pull-based scraping, PromQL query language, single-node by default. InfluxDB is the long-term storage classic. TimescaleDB is a Postgres extension that gives you SQL on top of time-series performance. Druid is the analytics-scale variant for billion-row ad-hoc OLAP on event streams.
When to pickMetrics and observability (Prometheus is the default). IoT sensor data. Financial tick data. Application performance monitoring. Anything where the record shape is (entity, timestamp, value), writes vastly outnumber reads, and queries are "range over time + aggregate". If you'd be tempted to store it in Cassandra with a time-based clustering key, a time-series DB is probably the better answer.
FeaturesTime-partitioned chunks, automatic retention/downsampling (raw data for 7 days, 5-min averages for 90 days, hourly for 1 year). Columnar compression — 10x to 100x smaller than row-oriented storage for the same data. Built-in aggregation operators (rate, derivative, moving average, percentile). Tag-based indexing (server=web-42, region=us-east). High write throughput per node (millions of points/sec).
PatternsPair with a metrics-emission library (Prometheus client libs) that exposes a /metrics endpoint scraped every 15s. Use recording rules for precomputed aggregations on hot dashboards. Use alerting rules on PromQL expressions. For long-term retention: remote write from Prometheus to Thanos / Cortex / Mimir. For business analytics on event streams: Druid behind a Kafka consumer.
Trade-offsCardinality is the silent killer — too many unique tag combinations explode the index (a metric tagged with user_id across 10M users will OOM Prometheus). Not suited for arbitrary ad-hoc joins. Limited update / delete semantics (the data model assumes append-only). Single-node Prometheus has no built-in HA — you run two scraping the same targets and dedupe at the query layer.
Interview tip"For metrics I'd use Prometheus with a 15s scrape, recording rules for hot dashboards, and Thanos for long-term retention. The cardinality cap is the thing I'd watch — I won't tag metrics with high-cardinality fields like user_id or request_id." For business events: "Kafka → Druid for ad-hoc OLAP, vs Kafka → Cassandra for known queries." That distinction is the senior signal.
Imagine LinkedIn's "you are 3 connections away from this hiring manager" feature. To answer that in Postgres, you'd join the connections table to itself three times, scanning millions of rows. To answer it in Cassandra, you'd reshape the data into pre-computed friend-of-friend tables and pray. In a graph DB, it's literally one line of Cypher: MATCH (me)-[:KNOWS*1..3]-(target) RETURN target. When the relationships are the data — not metadata around it — graph DBs collapse queries that other families choke on.
The storyData is stored as nodes (vertices) and edges (relationships), both of which can have properties. The storage engine keeps an adjacency list per node — every edge for that node is a direct pointer, not an index lookup. So traversing "friends of friends" is following pointers, not joining tables. Cypher (Neo4j) and Gremlin (TinkerPop) are the dominant query languages — they let you express "find all paths from A to B of length up to N where each hop satisfies a predicate" in one statement.
When to pickSocial networks (friend-of-friend, shortest path). Fraud detection (anomalous paths through transaction graphs). Recommendation engines ("users who bought X also bought Y" via graph walks). Knowledge graphs (entity-relationship semantics). Identity / access control (resolving role inheritance). Anywhere a relational query needs 4+ self-joins to express what you mean.
FeaturesNative graph storage — edges are first-class pointers. Cypher / Gremlin / SPARQL query languages. Variable-length path queries ([:KNOWS*1..5]). Built-in graph algorithms (shortest path, PageRank, community detection, centrality). Property graph model — nodes and edges both carry typed properties.
PatternsUse a graph DB alongside your primary store, not instead of it. Postgres or Cassandra owns the source of truth; the graph DB stores the relationship layer derived from it via CDC. For LinkedIn-style "people you may know", the graph is rebuilt nightly and queried for serving. Don't try to model an entire schema in the graph — limit it to the relationships where graph queries pay off.
Trade-offsGraphs don't shard cleanly — the whole point is that edges cross arbitrary boundaries, so any cut splits the queries you care about. Most graph DBs are single-machine at heart, with limited horizontal scale. Slower than RDBMS for non-graph queries (simple lookups). Smaller ecosystem and tooling than SQL. Operationally exotic for most teams.
Interview tipDon't reach for a graph DB unless the query is "variable-depth traversal" or "find all paths". For "fetch my immediate friends" — that's just a Postgres join. Senior signal: knowing the line. "Up to 2 hops, Postgres handles it. Beyond that, graph DB earns its place." Drop terms: adjacency list, variable-length path, property graph, Cypher.
Imagine ChatGPT-style retrieval-augmented generation. The user asks "how does our 401k matching work?" To answer, the system needs to find the 5 most semantically similar chunks across 100,000 internal documents — not "documents containing the word 401k" (that's Elasticsearch's job) but "documents whose meaning is closest to this question". To do that, every chunk is first converted into a 768-dimensional vector (an embedding) by a model like OpenAI's text-embedding-3-large. Finding the nearest vectors in 768-dimensional space is what a vector DB does — billions of vectors, sub-100ms query time.
The storyA vector is just an array of N floats. "Nearest neighbor in N-dimensional space" is the cosine-similarity (or Euclidean / dot product) of two vectors. The hard part: doing it across billions of vectors in milliseconds, where exact brute-force search is O(N) and a death sentence. The solution: approximate nearest neighbor (ANN) indexes — HNSW (graph-based), IVF (clustering-based), or LSH (hashing-based). They trade tiny accuracy (95%+ recall) for million-fold speedup. Pinecone is the dominant managed product; pgvector is the Postgres extension that brings the workload back into the relational fold for modest scale.
When to pickSemantic search over text/images/audio. Retrieval-augmented generation (RAG) for LLM apps. Recommendation by embedding similarity ("users who liked similar items"). Image / face recognition. Anomaly detection in embedding space. Any workload where "give me the 10 most similar items to this query" is the question.
FeaturesHNSW (Hierarchical Navigable Small World) index — fast, accurate, memory-hungry. IVF (Inverted File Index) — cheaper memory, slightly slower. Filtered ANN search (vector similarity + metadata predicate like "language=en"). Hybrid search combining vector similarity with BM25 keyword scoring. Multi-tenant namespaces. Sharded indices for horizontal scale.
PatternsEmbedding pipeline: text/image → model API → vector → store. Chunking strategy matters more than the DB choice — bad chunks ruin retrieval no matter how fast your ANN is. Pair with a re-ranker (a smaller cross-encoder model that re-scores the top 50 ANN hits to improve precision). For modest scale (<10M vectors), pgvector inside Postgres avoids a separate system.
Trade-offs"Approximate" — you can't get exact answers cheaply. Embedding cost (the model call) often dwarfs the DB cost. Re-embedding when the model changes is a full reindex. Vector size × count = memory bill; 768-dim float32 vectors are 3KB each, 1B vectors = 3TB RAM. Quality of retrieval is dominated by embedding model quality, not the DB.
Interview tip"For RAG I'd embed chunks of ~500 tokens with text-embedding-3-large, store them in pgvector for <10M scale or Pinecone beyond that, use HNSW with M=16/ef=200, and add a cross-encoder re-ranker on the top 50 results before passing to the LLM." Knowing HNSW parameters, chunking strategy, and re-rankers is the senior signal.
Imagine Sarah uploads a 4K video to YouTube. That's gigabytes of bytes that no one will query inside — the questions are all "by id, give me the bytes" and "by id, stream the bytes from offset 1M to 2M". Stuffing those bytes into Postgres' BYTEA column or Cassandra's blob column is wasteful and slow. The right home is an object store: cheap, durable, infinitely scalable, designed exactly for "key → big binary blob". S3 popularized the model in 2006 and it's now the bedrock of every cloud system.
The storyAn object store is conceptually a giant key-value store where the values are large opaque blobs (kilobytes to terabytes) and the key is a path-like string (users/42/profile.jpg). Underneath: erasure-coded data spread across multiple storage devices and availability zones for 11 nines (99.999999999%) of durability. S3 also gives you metadata, versioning, lifecycle rules (auto-tier to cheaper storage classes after N days), event notifications (S3 → Lambda / SQS), and signed URLs for direct client uploads / downloads.
When to pickAnything bigger than ~1MB. User-uploaded images, videos, documents. Database backups. ML training datasets. Logs and analytics data lakes (Parquet on S3, queried by Athena / Spark). Static website assets (with CloudFront in front). Pretty much any blob in any modern system.
FeaturesStrong read-after-write consistency since 2020 (you can read the bytes immediately after PUT). Versioning. Storage tiers (Standard → Standard-IA → Glacier → Glacier Deep Archive). Lifecycle policies. Server-side encryption (SSE-S3, SSE-KMS). Cross-region replication. Multipart uploads for large files. Pre-signed URLs (direct client → S3, bypassing your servers). Event notifications. S3 Select (push SQL filtering down into the object store).
PatternsDirect client upload — your API returns a pre-signed PUT URL, the browser uploads directly to S3, your server records only metadata in Postgres. CDN in front (CloudFront / Cloudflare) for read-heavy content. Lifecycle to Glacier for cold archives (cents per TB-month). S3 event → SQS → worker for "process this upload" pipelines. Use the Postgres row to hold metadata (size, mime, owner, expiry), keep the bytes in S3.
Trade-offsLatency is ~100ms per object op — not for hot-path low-latency reads. List operations are eventually consistent and slow. You pay for requests too, not just storage. No partial updates — you re-upload the whole object. Vendor lock-in is real (S3 API has been copied but features diverge).
Interview tip"Any blob >1MB goes in S3, not the database. The DB row holds the metadata and the S3 key; clients upload directly via pre-signed URLs to bypass our servers; CloudFront caches reads at the edge." Drop these: pre-signed URL, multipart upload, lifecycle policy, S3 event → Lambda / SQS, 11 nines durability.
The painful trade-off of the 2010s: keep Postgres' ACID and live with its single-node ceiling, or move to Cassandra/DynamoDB and rewrite your app around eventual consistency. NewSQL is the bet that you shouldn't have to choose. Google's Spanner (2012) was the proof — a globally distributed SQL database with linearizable transactions across continents, made possible by atomic clocks and a clever protocol. CockroachDB, TiDB, and YugabyteDB are the open-source descendants. The pitch is intoxicating: SQL semantics, ACID transactions, horizontal scale, no sharding pain.
The storyData lives in ranges (CockroachDB) or splits (Spanner) automatically sharded across nodes. Each range is replicated 3x via the Raft (or Paxos) consensus protocol, so every write is committed only when a quorum of replicas agrees. The result: ACID transactions across rows on different machines, with linearizable consistency. Spanner uses TrueTime (atomic clocks + GPS) to bound clock skew and enable global transactions. CockroachDB and TiDB use logical clocks (HLC — hybrid logical clock) for the same effect with slightly weaker guarantees.
When to pickYou need PostgreSQL semantics but the workload exceeds a single Postgres box's headroom. Multi-region active-active with strong consistency. Compliance / regulatory workloads where eventual consistency is unacceptable. Financial systems at scale. Greenfield projects where you'd want to start on Postgres but know you'll need horizontal scale soon.
FeaturesStandard SQL (Postgres-wire-compatible in CockroachDB, MySQL-compatible in TiDB). ACID transactions across shards. Automatic range-based sharding and rebalancing. Multi-region replication with per-row geo-fencing in Spanner. Online schema changes. Time-travel queries (AS OF SYSTEM TIME). Built-in change-data-capture.
PatternsTreat it like Postgres until proven otherwise. The hot pitfall: cross-region transactions are slow (every commit waits for quorum across regions, easily 100ms+). Design schemas so most transactions live within one region; use Spanner's interleaved tables or CockroachDB's REGIONAL BY ROW to pin data to its home region.
Trade-offsLatency tax — every cross-shard transaction pays a Raft round trip (typically 5-20ms intra-region, 100ms+ cross-region). Operationally complex compared to single-node Postgres. Newer products = thinner ecosystem (fewer ORM optimizations, fewer monitoring tools). Cost is often higher per TB than Postgres or DynamoDB.
Interview tip"I'd start on Postgres. If we hit the vertical ceiling and the workload still needs ACID transactions across rows, I'd move to CockroachDB or Spanner — same SQL surface, horizontal scale, but I'd accept ~10-20ms latency for cross-shard commits." Naming Spanner without that latency caveat is a junior signal. Senior tell: knowing the latency budget for global transactions and designing the schema to minimize cross-region writes.
This is the muscle you need to develop. Walk through the tree in your head on any interview prompt and you'll land on a family within thirty seconds. Then defend the choice with workload numbers.
Here's the truth juniors miss: production systems almost never use one database. They use four or five, each owning the slice of the workload it's built for. This is called polyglot persistence and it's the single most senior-sounding concept in HLD interviews. Below are four real-world stacks — memorize the pattern, not the products.
RDBMSMySQL (Gizzard-sharded): user accounts, social graph metadata. Source of truth for identity.
Wide-ColumnManhattan (Twitter's Cassandra-like): tweets, timelines, fan-out home feeds. Write-heavy, range-scan by user.
Key-ValueRedis: hot timeline caches, rate limits, session data. Sub-millisecond reads on hot users.
SearchEarlybird (Lucene-based): real-time tweet search with relevance ranking.
Object StoreBlobstore: images, videos, media uploads.
RDBMSPostgres / MySQL via Schemaless: trip records, payments, ride history. ACID needed.
Wide-ColumnCassandra: driver location history, trip events stream.
Key-ValueRedis: "drivers near this geohash" — live driver locations in geo-sorted sets, evicted on disconnect.
SearchElasticsearch: address geocoding and search, support-team queries on past trips.
Time-SeriesM3DB / Prometheus: operational metrics across fleet of services.
Wide-ColumnCassandra: viewing history per user, A/B test assignments. Write-heavy, partitioned by user.
Key-ValueEVCache (Memcached fork): personalized homepage caches across regions. Millions of ops/sec.
SearchElasticsearch: title search, autocomplete.
Object StoreS3: video files, sub-segments, transcoded variants. CloudFront / Open Connect edges in front.
Time-SeriesAtlas (in-house): 2.5 billion metrics/min across the fleet.
RDBMSPostgres: user accounts, conversation metadata, document upload records.
VectorPinecone / pgvector: 768-dim embeddings of every document chunk. Sub-100ms ANN search.
Object StoreS3: the raw uploaded PDFs, images, audio.
Key-ValueRedis: session cache, conversation history during active chat.
SearchElasticsearch: hybrid retrieval — combine BM25 keyword match with vector similarity.
Memorize this card. If the interviewer pauses on the DB box, deliver the matching line in one breath. Each line is the seniority signal — workload + family + storage engine + key feature you'd actually use.
| Family | Engine | Reach for it when… | Killer feature to name-drop |
|---|---|---|---|
| RDBMS Postgres · MySQL | B-tree | ACID multi-table; money, identity, inventory | SERIALIZABLE isolation, SELECT FOR UPDATE, GIN/JSONB |
| Document MongoDB | B-tree on BSON | Fluid schema, nested hierarchy, when JSONB feels awkward | Aggregation pipeline, shard key choice, change streams |
| Wide-Column Cassandra · DynamoDB | LSM-tree | Write-heavy, partition-key access, time-ordered streams | Tunable R+W>N quorum, peer-to-peer, single-table design |
| Key-Value Redis · Memcached | RAM hash table | Sub-ms lookups, caching, leaderboards, rate limits | HASH / ZSET / Streams, LRU eviction, Redlock |
| Search Engine Elasticsearch | Inverted index (Lucene) | Full-text, ranking, facets, log analytics | BM25, tokenizer chain, index aliases, time-based indices |
| Time-Series Prometheus · Influx | Columnar + time-partitioned | Metrics, IoT, append-only timestamped data | PromQL, recording rules, downsampling, cardinality cap |
| Graph Neo4j · Neptune | Adjacency list | Variable-depth traversal, social/fraud/recommendation | Cypher, variable-length path, property graph |
| Vector Pinecone · pgvector | HNSW / IVF ANN | Semantic search, RAG, embedding similarity | HNSW M/ef tuning, hybrid search, re-ranker stage |
| Object Store S3 · GCS | Distributed FS + erasure coding | Anything >1MB; media, backups, data lake | Pre-signed URLs, lifecycle to Glacier, S3 events |
| NewSQL Spanner · CockroachDB | Sharded B-tree + Raft | ACID at planet scale, multi-region active-active | TrueTime / HLC, range-based sharding, regional rows |