Distributed SQL Deep Dive — CockroachDB, YugabyteDB, TiDB Under the Hood

Step 1

Why Classical RDBMS Hits a Wall

Picture Priya, the lead engineer at a five-year-old SaaS company. Her stack started simple: one PostgreSQL box, one app server, one Redis cache. For two years that box happily served 5,000 paying customers. Then the company landed an enterprise contract — fifty thousand seats, a global rollout, a multi-region SLA of 99.99%, and a compliance clause that says customer data in the EU must stay in the EU. Priya looks at her trusty Postgres primary in us-east-1 and realises three things at once: it's a single point of failure, it can't put data in two continents at the same time, and the box is already 80% CPU at peak. She has a problem that no amount of "buy a bigger machine" will fix.

This is the moment distributed SQL was invented for. The old answer was "shard the database manually" — pick a shard key, split customers across a fleet of Postgres boxes, write routing logic in the app, kiss cross-shard joins goodbye, and re-shard every two years when the split goes uneven. That answer works, but it costs an engineering team a year of feature work and a permanent on-call burden. The new answer — Spanner, CockroachDB, YugabyteDB, TiDB — is "give me back the single-database illusion, but spread it across fifty machines in three continents, and never let me notice."

The premise of this page: distributed SQL is not "Postgres but faster" — it's "Postgres-shaped SQL that horizontally scales, survives node loss without manual failover, and lets you pin data to a region — at the cost of higher per-transaction latency and a much less mature query planner." Knowing where that cost lives is the difference between picking the right tool and rewriting your stack twice.

The three forces that break a single-box RDBMS

Before we even open a vendor doc, let's name the three pressures that drive teams off vanilla Postgres or MySQL. Every feature we're going to discuss — Raft, range splits, MVCC across nodes, hybrid logical clocks — is a direct answer to one of these three. Hold these in your head as the "why" behind everything that follows.

① Vertical scaling runs out

A single Postgres box tops out at around 64 vCPUs and ~1 TB of hot working set before you start fighting WAL bottlenecks, vacuum storms, and connection-pool exhaustion. You can pay AWS for a 96-core r6i.24xlarge — but the next step (a 192-core box) doesn't exist, and the box you already have is a single point of failure.

What breaks: writes-per-second ceiling, single-node CPU, one disk's IOPS budget.

② Multi-region is fundamentally awkward

Postgres has streaming replication and logical replication, but every replica is asynchronous — you can't safely read from EU and write in US without "the order of operations is undefined" caveats. Regulators in the EU, India, and Australia now demand "data stays in the region it was generated", which a primary-in-one-region architecture literally cannot satisfy.

What breaks: tail latency for far-away users, compliance, RPO/RTO during region outage.

③ HA failover is operator-flavored

Even with Patroni, Stolon, or RDS Multi-AZ, failover is detect → promote → fence → re-point → wait. It takes 30–120 seconds, drops in-flight transactions, and depends on a control plane that itself can be partitioned. For an SLA that says "99.99%" — that's 52 minutes of downtime per year, total — every failover eats your budget.

What breaks: automation needs human-in-the-loop, MTTR, the SLA promised to the customer.

If your workload doesn't feel any of these three pressures — congratulations, you don't need this page. Stay on Postgres or MySQL, save yourself a year of operational work. The rest of this article is for teams that feel one or more of these pressures growing every quarter and have already noticed that "just shard it" is a euphemism for "rewrite your app."

So in short — single-box RDBMS hits a wall when the workload outgrows one machine, when users live in regions a long way from that machine, or when the SLA outgrows what a manual failover can deliver. Distributed SQL exists to make the database survive all three of those without surrendering the SQL interface, joins, transactions, or referential integrity you already wrote your app against.

Step 2

The Naive Scaling Paths (and Why They Hurt)

Before we look at any product, let's walk through what teams typically try before they end up on distributed SQL. Every distributed-SQL feature is a fix for one of these failure modes. Knowing the failure modes is how you justify the cost of moving — to yourself, your CTO, and the interviewer who will ask "why not just add read replicas?"

Path A — Add more read replicas

The first thing most teams try. You stand up two or three streaming-replication followers, point read-only queries at them, and breathe a sigh of relief as the primary's CPU drops. For a read-heavy workload this buys you 6–18 months — until two cracks show up. Replication lag: a user posts a comment, the read serving their next page-load lands on a 200 ms-stale replica, and the comment is missing — so engineers start "session pinning" the just-wrote user back to the primary, undoing the scale gain. Writes are still single-master: every INSERT goes to the same one box. The day your write rate doubles, replicas can't help you.

Path B — Application-level sharding

You bite the bullet and split customers across N Postgres clusters. Customer 1–1M on shard-0, 1M–2M on shard-1, and so on. A shard router in the app picks the right cluster per request. This does scale writes — linearly with shard count — but you pay a brutal tax. Cross-shard joins are dead: SELECT ... FROM orders JOIN users ON ... only works inside one shard, so half your queries become two round-trips and an app-side merge. Cross-shard transactions are dead: there is no BEGIN; UPDATE shard0; UPDATE shard1; COMMIT; with ACID guarantees, so you bolt on Sagas or two-phase-commit and pray. Re-sharding is a year-long project: when shard-0 fills up, splitting it requires an online data migration, dual-writes, cutover, and a war room.

What hurts Manual sharding — the app now owns four problems the database used to: routing, cross-shard transactions, re-sharding, and topology awareness. Every new engineer has to learn "which shard does this query hit?" before they can ship a feature. Schema changes need a fleet-wide rollout. Onboarding takes a week longer.

What distributed SQL fixes Auto-sharded, single-database illusion — the DB owns routing, range splits, cross-range transactions, and re-balancing. The app sees one connection string and writes plain SQL with joins and multi-row transactions, just like Postgres. Schema changes are one online DDL command.

Path C — Switch to NoSQL (and regret the joins)

The third popular detour. "Postgres can't scale? Let's use DynamoDB / Cassandra / MongoDB." It works — write throughput is fantastic — but six months in, the app is drowning in app-side joins, denormalised data is drifting out of sync, secondary indexes are eventually consistent in surprising ways, and the analytics team is begging for a relational warehouse because they can't write the report they need. The lesson many teams learn the hard way: you didn't dislike SQL, you disliked single-box SQL. NoSQL solved the wrong half of the problem.

So what: read replicas hit a write ceiling, manual sharding moves the database's job into your app, and NoSQL costs you joins and ACID. Distributed SQL exists to thread the needle — keep the SQL surface and the ACID guarantees, but distribute the storage and the transaction coordinator across as many machines as you need.

Step 3

What "Distributed SQL" Actually Means — The Mental Model

Now we have the right question: "how do we keep SQL semantics while spreading the data across many nodes?" Every distributed-SQL system — CockroachDB, YugabyteDB, TiDB, and their ancestor Google Spanner — answers it with the same three architectural moves. If you internalise these three moves, every vendor's docs read like dialects of the same language.

Move 1 — Separate the SQL layer from the storage layer

A single-box Postgres bundles two jobs into one process: parsing & planning the SQL, and storing the bytes. Distributed SQL splits them. The SQL layer is a stateless tier of nodes that any client can connect to — it parses SQL, builds a plan, decides which storage nodes hold which rows, and ships sub-plans down. The storage layer is a separate fleet of nodes that holds the actual data, replicated and sharded. Because the SQL layer is stateless, you can scale it independently of storage — more query CPU? Add SQL nodes. More disk or write throughput? Add storage nodes. This split is what makes TiDB look like "TiDB nodes talk to TiKV nodes", and YugabyteDB look like "YSQL talks to DocDB". CockroachDB is unusual in that it co-locates SQL and storage on the same node (every node does both jobs) — but the logical split is still there inside one process.

Move 2 — Cut the table into ranges, then replicate each range with Raft

Every distributed-SQL system divides a table into many ranges (CockroachDB calls them "ranges", YugabyteDB and TiDB call them "tablets"/"regions"). A range is a contiguous slice of the table's primary-key space — say "orders with id between 0 and 100,000". When a range gets too big (typically 64 MB or 96 MB), it auto-splits. When a range gets too hot (too many reads/writes hitting it), it can also split. Each range lives on three (or five) storage nodes, with one acting as the leader chosen by the Raft consensus protocol. Every write to a range goes to the leader, which replicates it to the followers and only acknowledges the client once a majority have written it durably. This is the core trick: each range is its own little replicated state machine, and the cluster is millions of them stitched together.

The aha moment: the cluster isn't one big "shared database" — it's tens of thousands of tiny Raft groups, one per range, distributed across the nodes. There is no "primary node" to fail over. If a node dies, the ranges where it was the leader simply elect new leaders in another node, and the cluster keeps serving traffic. That's how failover takes seconds instead of minutes, and requires zero human intervention.

Move 3 — Coordinate cross-range transactions with MVCC + a hybrid clock

The hard part is what happens when one transaction touches rows in two different ranges that live on two different nodes. You need ACID across that boundary. Distributed SQL borrows two ideas from Spanner: multi-version concurrency control (MVCC) — every write creates a new versioned value tagged with a timestamp, never overwriting the old one — and a hybrid logical clock (HLC) — a timestamp that combines wall-clock time with a logical counter, so two transactions on different nodes can be totally-ordered without needing perfectly synchronized clocks. A coordinator picks a transaction timestamp, writes provisional records ("intents") to every involved range, and on commit flips them to permanent. Conflict detection happens via the timestamps. If you want the canonical reference design, this is exactly what Spanner does with TrueTime — the three open-source players approximate it without atomic clocks.

① Stateless SQL tier

Parse, plan, route. Scale by adding nodes. Connection pooling, retries, query optimization live here. A failed SQL node just removes a connection target — no data lost.

② Sharded, replicated storage

The table is auto-split into ranges, each replicated 3× via Raft. A range is the unit of placement, replication, and failover. The number of ranges can grow into the millions in a large cluster.

③ Distributed MVCC + HLC

Cross-range transactions get a global timestamp from the hybrid clock, write provisional intents to every range they touch, then atomically commit. Readers see a consistent snapshot at any past timestamp.

So in short — distributed SQL = "stateless SQL frontends + auto-sharded Raft-replicated storage + MVCC over a hybrid clock". Every difference between CockroachDB, YugabyteDB, and TiDB is a different choice inside one of these three boxes. Hold this mental model and the vendor docs become much easier to skim.

Step 4

The Production Shape — One Architecture, Three Vendors

Here is the unified architecture every distributed-SQL system implements. The vendors differ in where they put the boundaries and what protocol the SQL layer speaks, but the boxes are the same. The numbered diagram below is the picture to keep in your head — every CockroachDB, YugabyteDB, or TiDB doc you read can be back-mapped to one of these ten boxes.

flowchart TB CLIENT([① Application
libpq · JDBC · MySQL driver]) CLIENT -->|SQL query| LB[② Load Balancer
HAProxy · cloud LB] LB --> SQL1[③ SQL Node A
Parse · Plan · Route] LB --> SQL2[③ SQL Node B] LB --> SQL3[③ SQL Node C] SQL1 --> PD[④ Cluster Metadata
PD · Master · Gossip] SQL2 --> PD SQL3 --> PD SQL1 -->|read/write KV ops| TXN[⑤ Transaction Coordinator
HLC timestamp · 2PC intents] SQL2 --> TXN SQL3 --> TXN TXN --> R1[⑥ Range / Tablet 1
orders id 0–100k] TXN --> R2[⑦ Range / Tablet 2
orders id 100k–200k] TXN --> R3[⑧ Range / Tablet N
users id 0–500k] R1 --> KV1[⑨ Storage Engine
RocksDB · Pebble · LSM] R2 --> KV1 R3 --> KV1 KV1 --> RAFT[⑩ Raft Replication
Leader + 2 Followers per range] style CLIENT fill:#171d27,stroke:#7b8599,color:#d4dae5 style LB fill:#4a90d9,stroke:#4a90d9,color:#fff style SQL1 fill:#e8743b,stroke:#e8743b,color:#fff style SQL2 fill:#e8743b,stroke:#e8743b,color:#fff style SQL3 fill:#e8743b,stroke:#e8743b,color:#fff style PD fill:#9b72cf,stroke:#9b72cf,color:#fff style TXN fill:#d4a838,stroke:#d4a838,color:#0b0f14 style R1 fill:#38b265,stroke:#38b265,color:#fff style R2 fill:#38b265,stroke:#38b265,color:#fff style R3 fill:#38b265,stroke:#38b265,color:#fff style KV1 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style RAFT fill:#e05252,stroke:#e05252,color:#fff

Component-by-component — what each numbered box does and what would break without it

Use the numbers in the diagram above to find the matching card below. The colored badge on each card matches the box fill in the diagram, so you can cross-reference at a glance.

① Application client

Your service. It speaks whichever wire protocol the vendor chose to emulate — PostgreSQL wire for CockroachDB and YugabyteDB, MySQL wire for TiDB. Crucially, your existing ORM, migration tool, and Postgres-trained engineers all work day-one.

Solves: the "do I have to rewrite my app?" question. The answer is almost always "no, just point at a new connection string."

② Load balancer

Routes connections to any of the stateless SQL nodes. Since every SQL node can handle any query (it'll forward to the right storage range internally), the LB only needs to do round-robin or health-check based steering. Cloud LBs, HAProxy, or DNS round-robin all work.

Solves: client doesn't need to know cluster topology. Add or remove SQL nodes without telling the app.

③ SQL nodes (stateless tier)

Parse the SQL, build a query plan, figure out which ranges hold the rows the query needs, ship sub-plans to those ranges, and merge the results. In CockroachDB, every node is also a storage node. In TiDB and YugabyteDB, SQL is a separate fleet (TiDB-server / YB-TServer with the YSQL process).

Solves: SQL-layer scaling is independent of storage scaling. Lots of complex queries? Add SQL nodes. Lots of data? Add storage nodes.

④ Cluster metadata service

The "brain" that knows which range lives where, which node is alive, and where to put new ranges. TiDB calls it PD (Placement Driver), YugabyteDB calls it the YB-Master, CockroachDB has no separate service — it gossips this state across all nodes. Holds the routing table: range → leader node.

Solves: without this, SQL nodes wouldn't know where to send a write. It's the address book of the cluster.

⑤ Transaction coordinator

For any transaction that touches more than one range, this picks a global timestamp from the hybrid logical clock, writes "provisional intents" to every involved range, waits for majority acks via Raft, then atomically flips them to committed. If anything fails midway, it rolls back the intents.

Solves: ACID across nodes. Without this you'd have eventual consistency and the dreaded "money disappeared between accounts" bug.

⑥⑦⑧ Ranges / tablets

The unit of replication and movement. A contiguous slice of one table's primary-key space — for example, orders rows with id 0–100k. Auto-splits when it crosses a size threshold (default ~64–96 MB), auto-merges when it shrinks. Each range is replicated on 3 (or 5) nodes via Raft.

Solves: data is rebalanceable in small chunks instead of having to migrate whole shards. When a node fills up, the cluster moves a few hot ranges elsewhere — transparently to the app.

⑨ Local storage engine

Inside each storage node, the bytes are written to a single-node LSM-tree. CockroachDB uses Pebble (a Go fork of RocksDB), TiDB uses RocksDB directly, YugabyteDB uses a customized RocksDB inside DocDB. The KV layer is just "key → value", where the key encodes (table, primary_key, column, version_timestamp) and the value is the cell payload.

Solves: turns SQL rows into a key-value abstraction so the upper layers don't care about disk format. LSM-trees give the write throughput a distributed cluster needs.

⑩ Raft replication

The consensus protocol that keeps the 3 (or 5) copies of each range in agreement. A write isn't acknowledged until a majority of replicas have it durably on disk. If the leader dies, the followers elect a new leader in milliseconds. This is what gives distributed SQL its "no manual failover" property.

Solves: survives single-node and (with 5 replicas) single-AZ failure with no data loss and no human intervention. The entire HA story of these systems rests on Raft.

A concrete walkthrough — Priya's order-placement query

Let's trace a real transaction end-to-end. Priya's app issues INSERT INTO orders (...) VALUES (...) RETURNING id; UPDATE inventory SET stock = stock - 1 WHERE sku = 'A123'; — two writes to two different tables that need to be atomic. Here's what happens:

The app library opens a Postgres-wire connection to the load balancer ②, which forwards to SQL node B ③.
SQL node B parses the SQL, asks the cluster metadata service ④ "which ranges hold orders primary key X and inventory.sku='A123'?" — and gets back "range R7 and range R23, leaders on nodes 14 and 22".
SQL node B spins up a transaction coordinator ⑤. It picks an HLC timestamp T = 1748528193.000001 and sends provisional writes ("intents") to ranges R7 and R23.
Each range's leader replicates the intent via Raft ⑩ to its 2 followers. Once a majority (2 of 3) have it on disk, the leader ACKs the coordinator. This is the cost of distribution — instead of a local fsync, you pay a network round-trip.
Both intents acked? The coordinator writes a single "commit record" for transaction T, then asynchronously flips the intents from "provisional at T" to "committed at T" — the writes are now visible to any reader at timestamp ≥ T.
If either intent had failed (conflict with another transaction, node down, timeout), the coordinator would have rolled back by writing "abort" records, and the SQL node would surface a serialization error to the app — which retries via the driver, transparently.

So what: the path uses every numbered component. Steps 1–2 are the SQL plane (③ + ④), step 3 hands off to the transaction plane (⑤), steps 4–5 are the storage plane (⑥/⑦/⑧ via ⑨ + ⑩). Each component "earns its keep" — pull any one out and the system breaks: no metadata service means SQL nodes can't find the data, no transaction coordinator means cross-range writes can't be atomic, no Raft means a single node failure loses data.

Step 5

Under the Hood — Raft per Range

"Raft" is the single most important word in this article. If you only learn one thing about distributed SQL, learn how Raft works at the range level — every other property of these systems (HA, no manual failover, no split-brain, replication, locality) is a direct consequence. Let's build it up from scratch.

What problem Raft solves, in one sentence

Imagine three replicas of one range — leader on node A, followers on B and C. The app sends a write. How do all three end up agreeing on the order of writes, even if one of them is slow, crashes, or is on the other side of a network partition? Raft is the protocol that says: "the leader proposes an entry to the followers; once a majority (2 of 3) acknowledge they have it durably, the entry is committed; readers and the leader can act on it; if the leader dies, the followers vote to promote a new one — and the new one is guaranteed to have all the committed entries the old one had." That's it. Everything else is engineering details.

Why "per range" matters

A naive approach would run one Raft group for the entire cluster — every write goes through one leader. That's a bottleneck the size of one machine. Distributed SQL instead runs one Raft group per range. With ranges of ~64 MB and a 1 TB cluster, that's roughly 16,000 Raft groups, each independently led by a different node in the cluster. Writes for unrelated ranges proceed in parallel, with no coordination between them. This is how the cluster scales linearly in write throughput — each new node brings new leadership capacity, not just more disk.

What happens when a node dies

Suppose node 14 — the leader of range R7 — crashes. Within ~1 second, the followers on nodes 22 and 31 notice they haven't received a heartbeat. One of them, say 22, increments its Raft term and asks 31 to vote for it. 31 votes yes, 22 becomes the new leader, and the cluster keeps serving range R7. No human, no DNS change, no application reconfiguration, no data loss. Meanwhile, the metadata service (PD/master/gossip) notices node 14 is down, picks a healthy node, copies the lost replica of R7 onto it, and brings R7 back to 3-replica health — typically within minutes for a few-GB range. This is what people mean when they say "distributed SQL self-heals."

The before/after that sells distributed SQL to ops teams: with vanilla Postgres + Patroni, an AZ failure means 30–120 seconds of unavailability, an on-call page, and an audit-log entry. With Raft-replicated distributed SQL, an AZ failure means a brief latency spike on a few ranges while leaders re-elect, no pager, and a Grafana panel that goes yellow for 90 seconds. The MTTR difference is what justifies the cost of moving for HA-critical workloads.

5-replica mode for multi-region resilience

If you spread a cluster across three regions, 3-replica Raft has a subtle weakness — if one region goes down, you're left with one replica and can't form a 2-of-3 majority anymore for ranges led there. The fix is 5-replica mode: every range has 5 copies, requiring 3 of 5 to acknowledge. With 5 copies distributed across 3 regions (2 + 2 + 1, for instance), one region can vanish entirely and you still have 3 of 5 — quorum holds, the cluster keeps serving. The cost is higher write latency (you wait for more nodes) and more storage (5× instead of 3×). All three vendors support this; CockroachDB makes it a one-line config.

Step 6

Under the Hood — Distributed ACID (MVCC + Hybrid Clock)

Raft per range handles HA. But it does not handle multi-row, multi-range transactions on its own — for that you need a second layer. This is where distributed SQL borrows the deepest idea from Google Spanner: combine multi-version concurrency control with a hybrid logical clock to get serializable transactions across the whole cluster without atomic clocks.

MVCC, in plain words

Vanilla Postgres already does MVCC inside one box — every write creates a new tuple version, and readers see a snapshot consistent with their transaction's start time. Distributed SQL does the same thing, but stretches it across the cluster. Every key-value entry in the storage layer is actually (table, primary_key, column, timestamp) → value. A read at timestamp T scans the storage and picks the latest version whose timestamp is ≤ T. Writes never destroy old versions immediately; a background process ("garbage collection") cleans them up after the retention window expires. This is what gives you snapshot isolation (or stronger) across the entire cluster.

The hybrid logical clock — Spanner without atomic clocks

The hard question MVCC raises is: how do we pick a transaction timestamp that's globally consistent, across nodes whose wall-clocks drift by tens of milliseconds? Google Spanner solves this with TrueTime — GPS receivers and atomic clocks in every datacenter, giving each timestamp a known uncertainty interval. The three open-source players can't afford atomic clocks, so they use a Hybrid Logical Clock (HLC) — a 64-bit number that combines a wall-clock component (so timestamps roughly track real time) with a logical counter (so two events on different nodes can be totally ordered even if their clocks are slightly off). Every message between nodes carries an HLC; receivers bump their own HLC to max(local, received) + 1. This gives "close enough to real time" timestamps that are guaranteed to be monotonically increasing per-node and approximately ordered globally — close enough to get serializable transactions, as long as wall-clock skew is bounded (NTP keeps clocks within ~50ms; if a node's clock drifts further, it shuts itself down).

The commit protocol — 2PC over Raft

Putting Raft and MVCC together, here's what a cross-range transaction actually looks like inside one of these systems. It's essentially distributed two-phase commit, with each "phase" replicated via the underlying Raft groups so the protocol itself is fault-tolerant.

sequenceDiagram participant C as Client participant S as SQL node participant TXN as Txn Coordinator participant R1 as Range A leader participant R2 as Range B leader C->>S: BEGIN -- UPDATE acct1 -- UPDATE acct2 -- COMMIT S->>TXN: start txn, pick HLC timestamp T TXN->>R1: write intent for acct1 at T TXN->>R2: write intent for acct2 at T R1->>R1: Raft-replicate intent to followers R2->>R2: Raft-replicate intent to followers R1-->>TXN: intent durable, majority ack R2-->>TXN: intent durable, majority ack TXN->>TXN: write commit record at T TXN-->>S: success S-->>C: COMMIT OK Note over TXN,R2: Async step -- flip intents to committed

What "isolation level" actually means here

The three vendors differ slightly in their default isolation level. CockroachDB defaults to SERIALIZABLE — the strongest level, where transactions appear as if executed one at a time. YugabyteDB defaults to snapshot isolation (Postgres's default), with serializable available. TiDB defaults to snapshot isolation. In all cases, the level is enforced by the timestamp the transaction is assigned and by conflict detection at intent time — if two transactions try to write the same row, one gets a serialization error and retries. The application-level effect: your existing Postgres app's BEGIN; ... COMMIT; blocks just work, but with a higher chance of seeing serialization-error retries under contention than on single-box Postgres.

The hidden tax: a single-row INSERT on single-box Postgres is one fsync — call it 1 ms. The same INSERT on distributed SQL is a Raft round-trip across 3 nodes (~1 ms intra-AZ, ~5 ms cross-AZ, ~70 ms cross-region). For a cross-region 5-replica deployment, every write is ~tens of milliseconds. This is not a bug; it is physics. Distributed SQL trades latency for availability and scale. The cure is to pin data to the region it's used in — which is exactly what the next section's "follower reads", "leader preferences", and "geo-partitioning" features do.

Step 7

CockroachDB — Deep Dive

Cockroach Labs was founded in 2015 by ex-Googlers who had worked on F1/Spanner and wanted to bring Spanner-grade semantics to anyone without Google's atomic clocks. The product is written in Go, open-source (with an enterprise tier), and uses the PostgreSQL wire protocol. Its trademark is geographic locality features and the cleanest developer experience among the three.

Architecture at a glance

The big architectural choice CockroachDB makes is every node is symmetric. There is no separate SQL tier, no separate metadata service, no separate storage tier. Every CockroachDB node runs the SQL layer, the transaction coordinator, the KV storage layer, and participates in Raft groups. Cluster state is gossiped peer-to-peer instead of held in a dedicated service. This makes the cluster simple to operate — there's only one binary to deploy — at the cost of slightly less independent scaling.

flowchart TB CL([App via libpq]) CL --> N1 CL --> N2 CL --> N3 subgraph cluster["CockroachDB cluster (symmetric nodes)"] N1[Node 1
SQL · KV · Raft · Gossip] N2[Node 2
SQL · KV · Raft · Gossip] N3[Node 3
SQL · KV · Raft · Gossip] N1 <--> N2 N2 <--> N3 N1 <--> N3 end N1 --> PEB1[Pebble LSM] N2 --> PEB2[Pebble LSM] N3 --> PEB3[Pebble LSM] style N1 fill:#e8743b,stroke:#e8743b,color:#fff style N2 fill:#e8743b,stroke:#e8743b,color:#fff style N3 fill:#e8743b,stroke:#e8743b,color:#fff style PEB1 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style PEB2 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style PEB3 fill:#3cbfbf,stroke:#3cbfbf,color:#fff

What makes CockroachDB distinctive

Range-level geo-partitioning

You can tell the cluster, with one SQL statement: "rows where region='eu-west' must have all their replicas in EU datacenters." The cluster automatically moves matching ranges to EU nodes. This is the single best multi-region story of the three — no manual sharding, no app-side routing. ALTER TABLE users CONFIGURE ZONE USING constraints = '[+region=eu]' and you're compliant with GDPR data-residency.

Follower reads

For read-your-own-writes you must read from the Raft leader. For slightly-stale reads you can read from any follower with AS OF SYSTEM TIME '-5s'. This lets you serve "yesterday's reports" from any region with no cross-region round-trip — huge latency win for analytics dashboards.

Serializable by default

Stronger than Postgres's default. You'll see more retries under contention but you don't have to reason about anomalies. Cockroach's drivers know to retry serialization errors transparently with exponential backoff, so most apps don't notice.

Pebble storage engine

Written in Go, a fork of RocksDB built for tight integration with the rest of the codebase. Same LSM-tree internals — write-optimised, with periodic compaction.

Where CockroachDB hurts

PostgreSQL compatibility is partial. The wire protocol works, but stored procedures, some triggers, and some array/JSON operators behave differently. Tools that introspect pg_catalog deeply (some ORMs, some BI tools) sometimes need adapters.
Single-row INSERT is slower than vanilla Postgres. 1 ms becomes 3–5 ms intra-AZ. For a low-volume CRUD app this is irrelevant; for a tight loop doing 50k INSERTs/sec from one client, the round-trips show up.
Licensing tightened in 2024. Cockroach Labs moved core features behind a commercial license. Self-hosting at scale now realistically requires an enterprise contract — factor this into the build-vs-buy math.

Step 8

YugabyteDB — Deep Dive

YugabyteDB, started in 2016 by ex-Facebook engineers who built and ran Facebook's Cassandra fleet, takes a different stance: instead of inventing a new SQL layer, reuse PostgreSQL's actual source code on top of a distributed storage engine. The result is the most "Postgres-compatible" of the three — by a wide margin — because the SQL parser, planner, types, and stored procedures literally are Postgres's.

Architecture — YSQL on DocDB

YugabyteDB splits the architecture into two clear tiers. The SQL tier, called YSQL, runs an actual fork of PostgreSQL — same parser, planner, type system. The storage tier, called DocDB, is a Cassandra-style sharded, Raft-replicated, RocksDB-backed key-value store with document-aware encoding. The Postgres source is patched to translate table reads into DocDB key-range queries instead of local disk reads. Because they kept Postgres's SQL frontend intact, an enormous amount of compatibility (extensions, stored procedures, even some logical replication semantics) falls out for free.

flowchart TB CL([App via libpq]) CL --> LB[Smart driver/LB] LB --> TS1 LB --> TS2 subgraph yb["YugabyteDB cluster"] direction TB M[YB-Master
cluster metadata, placement] TS1[YB-TServer 1
YSQL Postgres process
+ DocDB tablet peer] TS2[YB-TServer 2
YSQL Postgres process
+ DocDB tablet peer] TS3[YB-TServer 3
YSQL Postgres process
+ DocDB tablet peer] M -.- TS1 M -.- TS2 M -.- TS3 TS1 <-- Raft per tablet --> TS2 TS2 <-- Raft per tablet --> TS3 TS1 <-- Raft per tablet --> TS3 end TS1 --> RDB1[RocksDB] TS2 --> RDB2[RocksDB] TS3 --> RDB3[RocksDB] style M fill:#9b72cf,stroke:#9b72cf,color:#fff style TS1 fill:#4a90d9,stroke:#4a90d9,color:#fff style TS2 fill:#4a90d9,stroke:#4a90d9,color:#fff style TS3 fill:#4a90d9,stroke:#4a90d9,color:#fff style RDB1 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style RDB2 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style RDB3 fill:#3cbfbf,stroke:#3cbfbf,color:#fff

Two APIs in one cluster

A quirk of YugabyteDB's design: because DocDB is a shared substrate, the same cluster exposes both a SQL API (YSQL, Postgres-compatible) and a Cassandra-compatible API (YCQL). Many teams use this transitionally — migrate a Cassandra workload into a YugabyteDB cluster, then write new features against the SQL side, with both reading from the same underlying storage.

What makes YugabyteDB distinctive

Highest Postgres compatibility

If your app uses pg_partman, postgis, complex stored procedures, or PL/pgSQL, YugabyteDB is the least-rewriting path. The SQL layer is a Postgres fork, so most extensions compile and run.

Tablet-level placement, like Cockroach's geo-partitioning

You can configure tablespaces that pin certain tablets to certain regions for data-residency. The API is more Postgres-flavored (tablespace + placement policy) than Cockroach's ZONE syntax.

Apache 2.0 license

The core is genuinely open-source, no enterprise gating on core features. For teams burnt by the SSPL/BSL drift in MongoDB and Cockroach, this is a major selling point.

Lag in query optimisation under contention

Because YSQL inherits Postgres's planner, it's mature for single-node thinking but can produce non-optimal distributed plans for big joins. Cost-based optimisations specific to distributed execution are being added but lag CockroachDB and TiDB.

When YugabyteDB is the right call

If you have an existing PostgreSQL application — especially one that leans on Postgres-only features — and you need horizontal scale + HA + multi-region without rewriting your ORM, your stored procedures, or your extensions, YugabyteDB is almost always the lowest-friction landing. Teams running it in production typically describe migration as "change connection string, run our migration suite, fix a small number of incompatibilities."

Step 9

TiDB — Deep Dive

TiDB, by PingCAP (founded 2015 in Beijing), is the MySQL-native sibling of the family. It speaks the MySQL wire protocol and was designed from day one as a separated compute + storage architecture — TiDB nodes for SQL, TiKV nodes for storage, TiFlash columns for analytics. It is the most mature of the three for HTAP (Hybrid Transactional / Analytical Processing) workloads.

Architecture — three independent fleets

TiDB makes the cleanest separation of the three. TiDB-server is the stateless SQL layer. TiKV is the distributed key-value storage with Raft per region (TiDB calls ranges "regions"). PD (Placement Driver) is the cluster brain — metadata, timestamp oracle, scheduling. And optionally, TiFlash nodes hold a columnar replica of the same data via Raft Learner, so analytical queries run on a column store without ETL.

flowchart TB CL([App via MySQL driver]) CL --> LB[HAProxy / Cloud LB] subgraph compute["Compute tier"] LB --> T1[TiDB-server 1] LB --> T2[TiDB-server 2] LB --> T3[TiDB-server 3] end subgraph control["Control plane"] PD1[PD 1] PD2[PD 2] PD3[PD 3] PD1 <--> PD2 PD2 <--> PD3 end T1 -.metadata.- PD1 T2 -.metadata.- PD1 subgraph storage["Storage tier - row store"] K1[TiKV 1] K2[TiKV 2] K3[TiKV 3] K1 <-- Raft per region --> K2 K2 <-- Raft per region --> K3 K1 <-- Raft per region --> K3 end subgraph analytics["Storage tier - column store"] F1[TiFlash 1] F2[TiFlash 2] end T1 --> K1 T2 --> K2 T3 --> F1 K1 -.Raft learner.- F1 K2 -.Raft learner.- F2 style T1 fill:#e8743b,stroke:#e8743b,color:#fff style T2 fill:#e8743b,stroke:#e8743b,color:#fff style T3 fill:#e8743b,stroke:#e8743b,color:#fff style PD1 fill:#9b72cf,stroke:#9b72cf,color:#fff style PD2 fill:#9b72cf,stroke:#9b72cf,color:#fff style PD3 fill:#9b72cf,stroke:#9b72cf,color:#fff style K1 fill:#38b265,stroke:#38b265,color:#fff style K2 fill:#38b265,stroke:#38b265,color:#fff style K3 fill:#38b265,stroke:#38b265,color:#fff style F1 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style F2 fill:#3cbfbf,stroke:#3cbfbf,color:#fff

What makes TiDB distinctive

MySQL wire protocol

If your existing app is MySQL, TiDB is the most natural drop-in. JDBC drivers, ORMs (Hibernate, MyBatis), and BI tools all work. The two big MySQL-only databases in the family — TiDB and Vitess — split along "scale via TiDB's distributed SQL" vs "scale via Vitess's sharded MySQL" lines.

HTAP via TiFlash columnar replicas

You write to TiKV (rows). TiFlash maintains a column-store replica via Raft Learner. The same SQL can run on either side — TiDB's optimizer picks rows for point queries and columns for big aggregations. This collapses the typical "OLTP + Kafka + warehouse" pipeline into one cluster.

Percolator-inspired transactions

TiDB's transaction model is a derivative of Google's Percolator (the system Google used for indexing web pages). It uses a timestamp oracle in PD to issue strictly monotonic timestamps, simplifying the HLC trade-off — at the cost of having PD on the hot path of every transaction.

Strong production track record at huge scale

Used by ByteDance, Shopee, Pinterest, Square for clusters in the multi-hundred-TB to petabyte range. The "we scale linearly past 100 TB without re-sharding" story is well-documented.

Where TiDB hurts

Multi-region story is weaker than CockroachDB. The geo-aware placement features exist but feel newer.
PD is a single logical service — replicated for HA, but every transaction asks it for a timestamp, so its scaling matters. PingCAP has improved this dramatically, but in petabyte clusters PD tuning becomes a job.
MySQL-only. If you're a Postgres shop, TiDB is the wrong cousin — your stored procedures and types won't translate cleanly.

Step 10

Side-by-Side Comparison

Pulled into one table — the differences that matter when you're picking between the three. There is no universally "best" choice; the right pick depends on which SQL dialect your app already speaks, where your team's expertise sits, and whether multi-region or HTAP is the bigger driver.

Dimension	CockroachDB	YugabyteDB	TiDB
Wire protocol	PostgreSQL	PostgreSQL (YSQL) + Cassandra (YCQL)	MySQL
SQL compatibility	Postgres-like, reimplemented	Postgres fork — highest fidelity	MySQL-like, reimplemented
Architecture	Symmetric — every node does SQL + KV	YSQL + DocDB; YB-Master separate	TiDB-server + TiKV + PD (3 tiers)
Storage engine	Pebble (Go fork of RocksDB)	Custom RocksDB inside DocDB	RocksDB
Default isolation	SERIALIZABLE	Snapshot (SERIALIZABLE available)	Snapshot (SERIALIZABLE available)
Transaction model	Spanner-style HLC	HLC + per-tablet Raft	Percolator + PD timestamp oracle
HTAP	Limited — basic columnar support	Limited	First-class via TiFlash
Multi-region geo-partitioning	Best-in-class	Good (tablespace-based)	Adequate (placement rules)
License	BSL → enterprise for core features	Apache 2.0	Apache 2.0
Origin / backers	Cockroach Labs (NYC, ex-Google)	Yugabyte Inc. (CA, ex-Facebook)	PingCAP (Beijing)
Notable users	DoorDash, Bose, Cisco	Kroger, GM, Fidelity	ByteDance, Shopee, Pinterest, Square
Best fit for	Global SaaS, multi-region OLTP, data-residency	Postgres apps scaling out without rewrite	MySQL apps + HTAP, China/APAC scale

30-second decision summary: Postgres app and need multi-region first → CockroachDB. Postgres app and need maximum Postgres compatibility → YugabyteDB. MySQL app or need real-time analytics in the same DB → TiDB.

Step 11

Migrating from Postgres / MySQL — What Actually Changes

The marketing promise is "change the connection string and you're done." The reality is closer to "change the connection string, run your test suite, and prepare for 5–15% of your queries needing edits." Here's a realistic punch list teams hit when moving an established Postgres or MySQL app to one of these systems.

Things that just work

DDL — tables, indexes, constraints

Schemas migrate cleanly. CREATE TABLE, CREATE INDEX, foreign keys, check constraints, NOT NULL — all standard. Online schema changes are first-class (you can ADD COLUMN on a huge table without downtime).

Most SELECTs, INSERTs, UPDATEs

Joins, subqueries, CTEs, window functions, aggregates — all work. Postgres-side and MySQL-side dialect features are reproduced. ORM-generated SQL almost always runs unmodified.

Transactions

BEGIN; ... COMMIT; blocks behave the same. Isolation levels are configurable. The main behavioral change is: expect more serialization errors under contention — code your driver to retry.

Standard drivers

libpq, JDBC, ADO.NET, Go's pgx, Python's psycopg2 — all connect over the standard wire protocol. SSL, auth methods, connection pooling — same.

Things that bite

Sequences and auto-increment

A Postgres SERIAL column allocates IDs from a centralized counter. In a distributed cluster, a single counter is a bottleneck and a hot range. All three vendors recommend UUIDs or composite IDs with a hash prefix to avoid hot spots. Migrating means changing primary key strategies — non-trivial for legacy schemas.

Single-row hot keys

A pattern like "one row tracks the global counter for X" works fine on Postgres but melts in a distributed cluster — every write goes to the one leader holding that row. You either denormalize (counter shards) or rethink the design.

Stored procedures & triggers

YugabyteDB is best here (Postgres fork). CockroachDB has limited PL/pgSQL. TiDB doesn't support stored procedures at all. Heavy reliance on triggers is the #1 reason migrations stall.

Extensions

PostGIS, pg_partman, pg_trgm — YugabyteDB inherits many for free; Cockroach implements a subset; TiDB is MySQL-flavored, so this doesn't apply. Audit your extension usage early.

Large analytical queries

A 30-second analytical query on Postgres might take 90 seconds on distributed SQL because the planner ships data across the cluster. TiDB+TiFlash is the only product designed for this; for the others, plan to keep a separate warehouse for analytics.

Connection-per-request anti-patterns

Distributed SQL nodes are sensitive to connection storms. PgBouncer or a smart driver pool is mandatory at scale — more so than for vanilla Postgres.

A realistic migration path

Pilot a single new service on the new DB rather than lifting the monolith. New services don't carry legacy ID schemes or triggers — they're a clean test of the operational model.
Run a compatibility audit using the vendor's tooling. CockroachDB has a cockroach demo import tester; YugabyteDB has the Voyager migration toolkit; TiDB has DM (Data Migration) for MySQL.
Dual-write for a quarter. Write to both old and new DB, read from old. Compare row counts and aggregates daily. This is your safety net against "we found a query that returns different results" surprises.
Switch reads to the new DB on non-critical pages first (settings, dashboards), then on hot paths after a couple of weeks of stable behavior.
Cut writes over, leave the old DB read-only for a month as a panic-rollback option, then decommission.

Step 12

When To Pick Distributed SQL — and When To Stay on Postgres

The honest answer most vendors won't give you: most teams should not move to distributed SQL. A single Postgres box on AWS Aurora or GCP CloudSQL with read replicas serves 95% of internet workloads happily for years. Distributed SQL has real costs — operational complexity, higher per-transaction latency, smaller talent pool, partial SQL compatibility. Pick it only when one of these forces is genuinely material to your business.

Good fits — pick distributed SQL

Global multi-tenant SaaS with data-residency rules

EU customer data must stay in EU, India data in India, US in US — and each customer wants low local latency. CockroachDB or YugabyteDB's geo-partitioning is the single cleanest answer; sharding-per-region by hand is a permanent engineering tax.

Financial / inventory systems where ACID is non-negotiable

You need cross-row atomic transactions ("debit one account, credit another") and you need them to keep working even when a region goes down. NoSQL is out (no joins/ACID), and single-box RDBMS is out (HA is too soft). This is the original Spanner use case at Google — Ads spend can't be eventually consistent.

Workloads outgrowing a single Postgres box (>5 TB hot, >50k writes/sec)

You've maxed the largest practical RDS instance, you've added replicas, and writes are the bottleneck. Manual sharding would take a year. Distributed SQL is the "buy instead of build" answer.

HTAP — real-time analytics on transactional data

You want to run BI dashboards against fresh OLTP data without a Kafka + warehouse pipeline. TiDB + TiFlash is purpose-built for this; the alternative is operating a streaming ETL pipeline you didn't really want.

HA SLAs measured in five nines (99.999%)

Five nines is ~5 minutes/year of downtime. Patroni-style Postgres HA can't deliver this — every failover burns minutes. Raft-replicated distributed SQL routinely measures node-loss recovery in single-digit seconds.

Existing Cassandra/Mongo teams who keep wanting joins

Teams that picked NoSQL years ago, now drowning in app-side joins and eventually-consistent surprises. Distributed SQL gives them back relational semantics without surrendering horizontal scale.

Bad fits — stay on Postgres / MySQL

Single-region apps under 1 TB

A vanilla Postgres on RDS will serve millions of users from one region happily. You'll pay the latency tax and operational complexity of distributed SQL for benefits you don't need.

Latency-critical low-volume OLTP

If your SLA is "p99 commit under 2 ms" on a low-write-volume app, distributed SQL's 3–5 ms minimum is a non-starter. Vanilla Postgres on local NVMe is faster.

Heavy stored-procedure / trigger logic

Unless you're going to YugabyteDB and your procedures are PL/pgSQL flavor, expect to rewrite a lot of logic. The migration cost may exceed the benefit.

Analytical / OLAP workloads

If your primary read pattern is full table scans for aggregates, you want a columnar warehouse (Snowflake, BigQuery, Redshift, ClickHouse) — not a row-oriented distributed OLTP store. TiFlash narrows this gap but doesn't close it for petabyte analytics.

The honest red flag: if your justification for moving is "Postgres can't scale" but you haven't actually maxed out a 96-core RDS box, you're solving a future problem with present complexity. Buy more vertical first. Move horizontally only when vertical is genuinely exhausted.

Step 13

Trade-offs & Gotchas Everyone Hits

Every team that runs distributed SQL in production runs into the same handful of surprises. Knowing them up-front saves a quarter of debugging. Here is the punch list.

Latency floor — physics, not bugs

A write committing on distributed SQL must reach a majority of replicas durably before the client gets OK. In the same AZ that's ~1–3 ms; cross-AZ within one region ~5 ms; cross-region 5-replica deployments push individual writes into the 50–100 ms range. This is the cost of consistency — you cannot have CP plus zero-latency writes across continents; the speed of light is the limit. Mitigation: pin leaders to the region where writes are issued (CockroachDB's "leader preferences"), or pin data to that region entirely (geo-partitioning).

Hot spots

The most common production outage on distributed SQL is a single hot range. A monotonically increasing primary key (timestamp, autoincrement) puts every new insert on the same range's leader — that one node bottlenecks the whole cluster. Mitigation: hash-prefix primary keys, use UUIDs, or use the vendor's "hash-sharded index" feature. Audit your schema for any "global counter" pattern before launch.

Connection pool sizing

Distributed SQL nodes maintain expensive per-connection state. Opening 10k connections from a Lambda fleet will melt them. Always front the cluster with PgBouncer (or similar) in transaction-pooling mode — and size the pool well below the cluster's per-node limit.

Schema changes during peak

Yes, online DDL works. But a CREATE INDEX on a 5 TB table still consumes CPU and IO on every range that holds matching rows. Run heavy schema changes off-peak; CockroachDB and TiDB support throttling them.

Backup & restore — not the same shape as Postgres

You don't pg_dump a distributed SQL cluster the same way. Vendors provide their own backup tooling (CockroachDB's BACKUP/RESTORE to S3, TiDB's BR tool, YugabyteDB's distributed backup). Set up scheduled backups and test restores from day one — many teams find out their backups don't actually round-trip on the day they need them.

Monitoring is different

You're now monitoring 16,000 Raft groups, not one Postgres process. The metrics that matter — range count per node, leadership distribution, raft-log latency, slow ranges, hot ranges — are different from pg_stat_*. Use the vendor's built-in dashboards; don't try to re-derive everything from Prometheus exporters alone.

Talent pool is smaller

Every engineer has used Postgres. Far fewer have run a 50-node CockroachDB or TiDB cluster. Budget for vendor support contracts or hire someone who has done this before — the operational knowledge curve is real, especially during the first incident.

Step 14

Decision Tree — Which One (If Any)?

Here is the picture I'd put on a whiteboard if a team asked me "should we move to distributed SQL, and which one?" Walk it top-down. The honest answer for most teams is "stay where you are".

flowchart TD START([Need to scale
your RDBMS?]) START --> Q1{Workload outgrowing
single Postgres/MySQL
OR multi-region required
OR 99.99%+ SLA?} Q1 -->|No| STAY[Stay on Postgres/MySQL
+ read replicas + Aurora
Cheapest, simplest, fast enough] Q1 -->|Yes| Q2{What's the primary driver?} Q2 -->|Multi-region data residency| Q3{What's the SQL dialect?} Q2 -->|Petabyte-scale OLTP| Q4{Postgres or MySQL?} Q2 -->|HTAP — analytics on live data| TIDB1[TiDB + TiFlash] Q2 -->|Just HA & horizontal scale| Q5{Heavy stored procs
or PG extensions?} Q3 -->|Postgres| COCKROACH1[CockroachDB
Best geo-partitioning] Q3 -->|MySQL| TIDB2[TiDB
geo features adequate] Q4 -->|Postgres| YB1[YugabyteDB
highest PG fidelity] Q4 -->|MySQL| TIDB3[TiDB] Q5 -->|Yes| YB2[YugabyteDB
PG fork has them] Q5 -->|No, Postgres-flavored| COCKROACH2[CockroachDB] Q5 -->|No, MySQL-flavored| TIDB4[TiDB] style STAY fill:#38b265,stroke:#38b265,color:#fff style COCKROACH1 fill:#e8743b,stroke:#e8743b,color:#fff style COCKROACH2 fill:#e8743b,stroke:#e8743b,color:#fff style YB1 fill:#4a90d9,stroke:#4a90d9,color:#fff style YB2 fill:#4a90d9,stroke:#4a90d9,color:#fff style TIDB1 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style TIDB2 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style TIDB3 fill:#3cbfbf,stroke:#3cbfbf,color:#fff style TIDB4 fill:#3cbfbf,stroke:#3cbfbf,color:#fff

So in short — distributed SQL is one of the highest-impact additions to the modern infrastructure toolbox, but it solves a very specific problem: "my workload has outgrown a single RDBMS box, but I still want SQL, joins, and ACID." If that sentence describes you, the three products in this article are mature, well-supported, and used by serious companies. If it doesn't, the most valuable thing this page can give you is permission to stay on Postgres a few years longer.

One closing thought — the "buy" axis you can't ignore

All three vendors offer fully-managed cloud versions: CockroachDB Cloud, YugabyteDB Managed, and TiDB Cloud. The operational lift of running a 50-node distributed SQL cluster yourself is real and ongoing — backups, version upgrades, capacity planning, node replacement. For most teams the right move is to use the managed offering unless you have a strong "no managed services" stance or a regulatory boundary that forbids it. The vendor-priced premium is almost always less than the cost of one experienced DBRE.

If you got this far: the next things to read are the Database Families Compared page (where distributed SQL sits in the broader landscape), and the Key Technologies Cheat Sheet for how all of this fits next to Redis, Kafka, and the rest of the modern stack.

Distributed SQL Under the Hood