The Family Tree — Why "All SQL Databases Are the Same" Is a Lie
Every relational database speaks SQL, supports ACID, and stores tables of rows and columns. From the outside they look interchangeable. Then you actually use two of them, hit your first UPSERT, your first JSON column, your first replication setup — and discover they disagree about almost everything underneath.
SELECT ... FOR UPDATE SKIP LOCKED, the way she always did. It crashes on MySQL 5.7. She checks the version: SKIP LOCKED arrived in 8.0. She upgrades, retries, and now the query runs — but inserts use AUTO_INCREMENT instead of SERIAL, JSON queries need a different operator, and her favorite RETURNING clause doesn't exist. Same SQL standard, two different worlds.They all started from the same relational model (Codd, 1970s) but evolved under wildly different incentives:
- Oracle (1979) — built for enterprises that pay six-figure license fees. Optimized for the hardest workloads (mixed OLTP/OLAP), every feature in the book, infamous price tag.
- MySQL (1995) — built to be the database of the LAMP web. Fast for simple reads, pluggable storage engines, "good enough" semantics. Now owned by Oracle.
- PostgreSQL (1996) — built by academics who took correctness seriously. Extensible types, MVCC done right, modern features arrive here first. The "Linux of databases".
- SQL Server (1989) — Microsoft's enterprise database, tightly integrated with .NET, Active Directory, and the Windows ecosystem. Now runs on Linux too.
- SQLite (2000) — not a server. A C library that puts the whole database in one file. Ships in every phone, browser, and IoT device on Earth.
- MariaDB (2009) — a fork of MySQL by the original author after Oracle bought MySQL. Mostly compatible, faster moving on features.
| DB | First release | Maintainer | License | Default port |
|---|---|---|---|---|
| Oracle | 1979 | Oracle Corp. | Proprietary | 1521 |
| SQL Server | 1989 | Microsoft | Proprietary (free Express) | 1433 |
| MySQL | 1995 | Oracle Corp. | GPL / Commercial dual | 3306 |
| PostgreSQL | 1996 | PG Global Dev Group | PostgreSQL License (BSD-like) | 5432 |
| SQLite | 2000 | D. Richard Hipp | Public domain | N/A (file) |
| MariaDB | 2009 | MariaDB Foundation | GPL | 3306 |
Storage Engines — How Each One Actually Writes to Disk
The biggest mental model gap between MySQL and Postgres is at the storage layer. Postgres has one storage model. MySQL has a marketplace of them. Oracle has its own block format that's been refined for 40+ years. Once you understand this, half the "why does X behave differently?" questions answer themselves.
A storage engine is the part of the database that decides how rows actually live on disk: the file format, the index layout, how writes are made durable, how concurrent readers see consistent data. The SQL parser and query planner sit above it. The engine is what gives you (or denies you) crash safety, MVCC, foreign keys, and transactions.
MySQL — Pluggable engines
MySQL uniquely lets you pick an engine per table. This was revolutionary in 2003 and is the source of a lot of MySQL's quirks.
- InnoDB — the default since MySQL 5.5. Row-level locking, MVCC, foreign keys, crash recovery via redo log. Use this. Always.
- MyISAM — older, table-level locks, no transactions, no FKs, but very fast for read-only workloads. Used in legacy MediaWiki/WordPress setups.
- MEMORY — RAM-only, lost on restart. Useful for temp tables.
- Archive — write-only compressed log table.
- NDB — MySQL Cluster's distributed engine. Different beast entirely.
- MyRocks — RocksDB (LSM-tree) packaged as a MySQL engine, originally by Facebook. Better for write-heavy SSD workloads.
PostgreSQL — One engine, no choice
Postgres has exactly one storage format: heap tables with MVCC tuples. Every row gets xmin (transaction that inserted it) and xmax (transaction that deleted/updated it) hidden columns. Updates are never in-place — they write a new tuple and mark the old one dead. Background VACUUM reclaims the dead ones.
Oracle — Blocks, segments, tablespaces
Oracle organizes data into blocks (typically 8KB) grouped into extents, which form segments stored in datafiles belonging to tablespaces. Heap-organized tables are default; Index-Organized Tables (IOT) cluster data inside the primary key index. Undo segments power read-consistency (Oracle's MVCC).
SQL Server — Pages and extents too
SQL Server stores data in 8KB pages, grouped into 64KB extents. Clustered indexes are the table — like InnoDB. Heap tables exist but are uncommon; almost everyone has a clustered index, usually on the primary key.
| Engine | On-disk shape | Updates | Reclamation |
|---|---|---|---|
| InnoDB (MySQL) | Clustered B-tree on PK; rows live in the leaf | In-place when possible; otherwise pointer to new location | Purge thread cleans old MVCC versions from rollback segments |
| Postgres heap | Unordered heap + separate index pages | Always insert new tuple, mark old dead | VACUUM (manual or auto) reclaims dead tuples |
| Oracle | Blocks within tablespaces; heap or IOT | In-place when row fits; chains/migrates otherwise | Undo segments age out; no global vacuum |
| SQL Server | 8KB pages, clustered B-tree on PK by default | In-place; row-overflow pages if needed | Tempdb + ghost cleanup task |
UPDATE users SET last_seen = now() for every user, every minute, creates millions of dead tuples. Autovacuum eventually catches up — until it doesn't, and you wake up to a 200GB table that should be 20GB.MVCC — Three Different Implementations of "Readers Don't Block Writers"
Every modern relational database promises Multi-Version Concurrency Control: readers see a consistent snapshot, writers don't have to wait for them, writers don't have to lock the world. How each implements it is wildly different — and that difference shows up in your monitoring.
Imagine you're reading page 50 of a book. Someone else wants to rewrite page 50. Without MVCC, you'd either block them ("don't touch the book while I'm reading") or get a mangled page. With MVCC, the writer creates a fresh copy of page 50, and you keep reading the original until you're done. Old copies are cleaned up later, once no one needs them.
That metaphor maps almost perfectly to the database: every row exists as multiple versions, each tagged with the transaction ID that created it. Your SELECT sees only versions visible to your transaction's snapshot. Writers create new versions. Old versions are reclaimed when no live transaction needs them.
Postgres — Version chains inside the heap
Every tuple carries (xmin, xmax). A new UPDATE writes a fresh tuple with xmin = current_txn and updates the old one's xmax. The old tuple lives on until VACUUM. Visibility: "I can see this row if my snapshot includes xmin and excludes xmax."
-- Initial insert (txn 100): ROW: { id=1, name='Alice', xmin=100, xmax=0 } -- UPDATE users SET name='Alicia' WHERE id=1; (txn 105) -- Now the heap contains TWO physical rows: ROW1: { id=1, name='Alice', xmin=100, xmax=105 } -- dead to txns > 105 ROW2: { id=1, name='Alicia', xmin=105, xmax=0 } -- the live version -- VACUUM eventually deletes ROW1 once no snapshot needs it.
InnoDB (MySQL) — Versions live in the undo log
InnoDB stores the current row in-place, and pushes the old version into the undo log (also used for rollback). Each row carries a 6-byte transaction ID and a 7-byte rollback pointer. When you SELECT under REPEATABLE READ, InnoDB walks the undo chain backward until it finds a version your snapshot can see.
Oracle — Undo segments, same idea
Oracle invented this approach in the 1980s. Old row versions live in undo segments. Long-running reports get the famous ORA-01555: snapshot too old error when the undo they need has been overwritten.
SQL Server — Optional MVCC
By default, SQL Server uses lock-based concurrency — readers block writers and vice versa. You opt into MVCC by enabling READ_COMMITTED_SNAPSHOT or ALLOW_SNAPSHOT_ISOLATION, after which old versions go to tempdb. Tempdb sizing is a whole career.
| DB | Where old versions live | How they're reclaimed | Failure mode |
|---|---|---|---|
| Postgres | Inline in the heap as dead tuples | VACUUM (autovacuum process) | Table bloat, transaction-ID wraparound |
| InnoDB | Undo log (system tablespace or per-table) | Purge thread once trx no longer needed | Undo grows unboundedly under long transactions |
| Oracle | UNDO tablespace | Aged out by undo retention policy | ORA-01555 snapshot too old |
| SQL Server | tempdb (when snapshot isolation enabled) | Version-store cleanup task | tempdb fills, queries spill, server stalls |
SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'users' tells you. VACUUM FULL rewrites it, but locks the table; pg_repack does it online.Indexes — B-Tree Everywhere, But the Side Dishes Differ
Every relational DB uses a B-Tree (or B+Tree) as the workhorse index. What separates them is the specialized index types — and on Postgres, the variety is staggering. If you've only ever used B-Trees, you've been leaving free performance on the table.
What everyone has
- B-Tree — ordered, supports equality + range + ORDER BY. The default.
- Unique — B-Tree with a uniqueness constraint.
- Composite — multi-column B-Tree. Order of columns matters (the "leftmost prefix" rule).
- Covering / Included columns — extra columns stored in the index leaf so the table is never read.
What Postgres uniquely has
| Type | What it's for | Example |
|---|---|---|
| Hash | Equality-only lookups (rarely used — B-Tree usually wins) | CREATE INDEX ... USING HASH (col) |
| GiST | Generalized search tree — geospatial, ranges, full-text, kNN | PostGIS uses it; range types use it |
| GIN | Inverted index — JSONB, arrays, full-text, trigrams | CREATE INDEX ... USING GIN (jsonb_col) |
| BRIN | Block-range index — tiny, perfect for huge sorted tables (logs, time-series) | CREATE INDEX ... USING BRIN (created_at) |
| SP-GiST | Space-partitioned GiST — quad-trees, kd-trees, IP ranges | Rarely used directly |
| Bloom | Multi-column equality with low false-positive rate (extension) | CREATE EXTENSION bloom |
| pgvector / HNSW | Vector similarity for AI/embeddings | CREATE INDEX ... USING hnsw (embedding vector_cosine_ops) |
-- BRIN over the timestamp column: ~5 MB instead of ~15 GB B-tree CREATE INDEX idx_logs_ts ON logs USING BRIN (created_at); -- GIN over JSONB metadata for "find rows where meta contains tag X" CREATE INDEX idx_logs_meta ON logs USING GIN (meta jsonb_path_ops); -- Partial index: only "error" rows, which are 0.5% of the table CREATE INDEX idx_logs_errors ON logs (created_at) WHERE level = 'ERROR';
MySQL InnoDB — Clustered index = the table
The PRIMARY KEY index is the table. Rows live in the leaf nodes of the PK B-tree. Secondary indexes store the PK value (not a row pointer), so every secondary lookup is two B-tree traversals. This is why a sequential BIGINT PK ("autoincrement") is so important in MySQL — it keeps inserts at the right edge of the tree.
Oracle — Index-Organized Tables (IOT)
Standard tables are heap-organized (rows scattered, row IDs reference physical location). IOTs cluster the table inside the PK index (like InnoDB). Oracle also has bitmap indexes — brilliant for low-cardinality columns in warehouses, terrible under concurrent writes.
SQL Server — Columnstore for OLAP
For analytical workloads, SQL Server's columnstore indexes store data column-by-column with compression, giving 10× compression and 100× query speed for star-schema queries. Postgres has cstore_fdw and the citus extension for similar; MySQL has no native columnstore.
Isolation Levels — Same Words, Different Behavior
The SQL standard defines four isolation levels — READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE. Every relational database claims to support them. None of them implement them the same way. Knowing the gaps is the difference between code that works in dev and code that mysteriously double-charges your customers in production.
The standard's four levels and what they forbid
| Level | Dirty read | Non-repeatable read | Phantom read | Lost update |
|---|---|---|---|---|
| READ UNCOMMITTED | Allowed | Allowed | Allowed | Allowed |
| READ COMMITTED | Prevented | Allowed | Allowed | Allowed |
| REPEATABLE READ | Prevented | Prevented | Allowed | Prevented* |
| SERIALIZABLE | Prevented | Prevented | Prevented | Prevented |
What actually happens, by DB
| DB | Default | READ COMMITTED | REPEATABLE READ | SERIALIZABLE |
|---|---|---|---|---|
| Postgres | READ COMMITTED | Statement-level snapshot | Transaction-level snapshot; phantoms blocked by side effect | True serializable via Serializable Snapshot Isolation (SSI) |
| MySQL InnoDB | REPEATABLE READ | Statement snapshot | Transaction snapshot + next-key locks block phantoms | Full table-level S-locks; rarely used |
| Oracle | READ COMMITTED | Statement snapshot | Not supported (skipped to Serializable) | True snapshot isolation |
| SQL Server | READ COMMITTED (lock-based) | Locks unless snapshot enabled | Range locks held to commit | Strict 2PL with key-range locks |
-- Session A: BEGIN; SELECT balance FROM accounts WHERE id = 1; -- reads 100 -- Session B (concurrently): BEGIN; UPDATE accounts SET balance = balance - 30 WHERE id = 1; COMMIT; -- balance now 70 -- Session A (still in transaction, naive update): UPDATE accounts SET balance = 100 - 50 WHERE id = 1; -- writes 50 COMMIT; -- Final balance: 50 — but should be 20! Session B's deduction is LOST.
The fix in every DB is the same: use SELECT ... FOR UPDATE, or use a write that doesn't depend on the read (UPDATE ... SET balance = balance - 50), or use SERIALIZABLE. Knowing which one your isolation level forces is the trick.
Replication — Statement, Row, WAL, Redo — Pick Your Poison
Every production database has a replica. How that replica is kept in sync defines what you can do (read scaling, failover, point-in-time recovery, logical CDC) and what can break (replication lag, conflicts, data loss).
The two families
- Physical replication — ship the literal bytes of the change log. Replicas are byte-for-byte identical. Fast, simple, but no cross-version or cross-platform replication and you can only replicate the entire cluster.
- Logical replication — ship a logical description of the change ("row X changed column Y from A to B"). Replicas can be different versions, can subscribe to a subset of tables, can transform data. More flexible, more failure modes.
By database
| DB | Native physical | Native logical | Common tooling |
|---|---|---|---|
| Postgres | Streaming replication (WAL shipping) | Logical replication (since 10), pglogical | Patroni, Stolon, repmgr, pg_basebackup |
| MySQL | Binary log replication (statement, row, or mixed format) | Same binlog in ROW format is effectively logical | MHA, Orchestrator, Group Replication, ProxySQL |
| Oracle | Data Guard (redo apply) | GoldenGate | RAC + Data Guard |
| SQL Server | Always On Availability Groups | Transactional / merge replication, CDC | Always On listener, log shipping |
MySQL binlog formats — the source of subtle bugs
| Format | What's shipped | Pros | Cons |
|---|---|---|---|
| STATEMENT | The SQL text | Compact, readable | Non-deterministic functions (NOW(), UUID()) drift between primary and replica |
| ROW | Before/after row images | Always correct, replication-safe | Bigger logs; harder to debug |
| MIXED | Statement, fall back to row when unsafe | Best of both | Subtle when you check what was logged |
# On the primary's postgresql.conf: wal_level = replica # or 'logical' for logical replication max_wal_senders = 10 hot_standby = on # replicas accept read queries # On the replica, after pg_basebackup: primary_conninfo = 'host=primary.db port=5432 user=replicator' # Streaming replication is now live. Replica catches up by reading WAL.
pg_current_wal_lsn() handoff).Data Types — Beyond INT and VARCHAR
Every database has the same basic types. The interesting differences live in the extras — and Postgres is the clear winner here, with built-in support for things that need third-party libraries everywhere else.
| Type | Postgres | MySQL | Oracle | SQL Server |
|---|---|---|---|---|
| Auto-increment PK | BIGSERIAL, GENERATED AS IDENTITY | AUTO_INCREMENT | SEQUENCE + trigger or IDENTITY (12c+) | IDENTITY(1,1) |
| Boolean | Native BOOLEAN | TINYINT(1) (no real bool) | No native bool; NUMBER(1) or CHAR(1) | BIT |
| UUID | Native UUID type | BINARY(16) or CHAR(36) | RAW(16) or VARCHAR2(36) | Native UNIQUEIDENTIFIER |
| JSON | JSONB (binary, indexed) | JSON (binary) | JSON (21c) or CLOB | NVARCHAR(MAX) with JSON functions |
| Array | Native (any type) | JSON-encoded only | VARRAY / nested tables | Table types, JSON, XML |
| Range types | Native (int4range, tstzrange, ...) | — | — | — |
| Network address | Native INET, CIDR | — | — | — |
| Geospatial | PostGIS (de facto standard) | Spatial types (basic) | Oracle Spatial | geometry / geography |
| Money | NUMERIC (recommended) / MONEY | DECIMAL | NUMBER | MONEY / DECIMAL |
| Enum | Native CREATE TYPE ... AS ENUM | ENUM('a','b') column | — | — |
| Vector (embeddings) | pgvector extension | — | 23ai native | — |
-- Range type: a booking that occupies a time interval CREATE TABLE bookings ( id BIGSERIAL PRIMARY KEY, room INT, span TSTZRANGE, -- exclusion constraint: no two bookings for the same room can overlap EXCLUDE USING GIST (room WITH =, span WITH &&) ); -- Array column: store tags inline without a join table CREATE TABLE posts ( id BIGSERIAL PRIMARY KEY, tags TEXT[] ); SELECT * FROM posts WHERE tags @> ARRAY['java', 'kafka']; -- INET: store IPs natively, query by subnet SELECT * FROM requests WHERE client_ip <<= '10.0.0.0/8'::INET;
TINYINT(1) is just an integer that conventionally stores 0/1. WHERE active = TRUE compares against the integer 1; WHERE active = 'TRUE' coerces 'TRUE' to 0 and silently returns rows where active = 0. Use explicit integers and you'll thank yourself.JSON Support — Same Word, Very Different Capabilities
Every modern relational DB stores JSON. None of them are equivalent. Knowing the differences before you bet a project on "we'll just use JSON columns" is worth its weight in incident postmortems.
Postgres — JSONB is the gold standard
Two types: JSON (stored as text, preserves whitespace and key order) and JSONB (binary, parsed once, indexed, fast). Almost everyone uses JSONB. You get path queries, containment operators, indexing via GIN, and full JSONPath since version 12.
CREATE TABLE events (id BIGSERIAL, payload JSONB); CREATE INDEX idx_payload ON events USING GIN (payload jsonb_path_ops); -- Query: events where payload.type = 'login' AND payload.region = 'IN' SELECT * FROM events WHERE payload @> '{"type":"login","region":"IN"}'; -- Extract a field SELECT payload->>'user_id' AS user_id FROM events; -- Update a nested value (returns a new JSONB; you write it back) UPDATE events SET payload = jsonb_set(payload, '{status}', '"ok"') WHERE id = 42;
MySQL — Decent JSON, fewer index tricks
MySQL 5.7+ has a native JSON type stored as binary. You query with JSON_EXTRACT() or the ->/->> operators. Indexing requires generated columns — you create a virtual column from a JSON expression and index that.
CREATE TABLE events ( id BIGINT AUTO_INCREMENT PRIMARY KEY, payload JSON, user_id BIGINT GENERATED ALWAYS AS (payload->>'$.user_id') STORED, INDEX idx_user (user_id) ); SELECT * FROM events WHERE JSON_EXTRACT(payload, '$.type') = 'login';
Oracle — JSON inside CLOB / BLOB, or native in 21c+
Pre-21c, JSON was a constraint on a CLOB column. From 21c, JSON is a first-class native type with binary storage and indexing. Query syntax uses JSON path ($.field) similar to MySQL.
SQL Server — JSON as text
SQL Server has JSON functions (JSON_VALUE, OPENJSON, FOR JSON PATH) but no native JSON type. You store it in NVARCHAR(MAX) and index via computed columns. Functional, not luxurious.
| Capability | Postgres JSONB | MySQL JSON | Oracle 21c JSON | SQL Server |
|---|---|---|---|---|
| Native binary | Yes | Yes | Yes (21c+) | No (text) |
| Generic index | GIN | — | JSON search index | — |
| Indexable via computed col | Yes | Yes | Yes | Yes |
Containment @> | Yes | JSON_CONTAINS() | JSON_EXISTS | OPENJSON |
| Partial in-place update | Hot updates via JSONB ops | Yes (5.7+) | Yes | No |
Stored Procedures & SQL Dialects — Where Vendor Lock-in Lives
SQL itself is mostly portable. The procedural language built around it isn't. PL/SQL (Oracle), T-SQL (SQL Server), PL/pgSQL (Postgres), and MySQL's stored procedure dialect are all different. Code that lives in the database is the hardest code to migrate.
| DB | Procedural language | External languages allowed |
|---|---|---|
| Oracle | PL/SQL — the most mature, can do almost anything | Java (stored Java procedures) |
| SQL Server | T-SQL — close cousin to PL/SQL | CLR (C#, VB) procedures |
| Postgres | PL/pgSQL (default) + PL/Python, PL/Perl, PL/V8 (JS), PL/Java, PL/R | Anything — extensible |
| MySQL | SQL stored procedures (limited, deprecated by Oracle's stewardship) | No external langs |
Same logic, four dialects — auto-incremented sequence
CREATE SEQUENCE order_seq START WITH 1; CREATE OR REPLACE TRIGGER orders_bi BEFORE INSERT ON orders FOR EACH ROW BEGIN :NEW.id := order_seq.NEXTVAL; END;
CREATE TABLE orders ( id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, ... ); -- Or simpler: BIGSERIAL is shorthand for "INT + sequence + default"
CREATE TABLE orders ( id BIGINT AUTO_INCREMENT PRIMARY KEY, ... );
Returning the inserted row
This is a perfect example of "looks portable, isn't":
INSERT INTO users (email) VALUES ('p@x.com') RETURNING id, created_at;
INSERT INTO users (email) VALUES ('p@x.com'); SELECT LAST_INSERT_ID();
INSERT INTO users (email) VALUES ('p@x.com') RETURNING id INTO :v_id;
UPSERT — every DB's favorite party trick
INSERT INTO counters (key, n) VALUES ('a', 1) ON CONFLICT (key) DO UPDATE SET n = counters.n + 1;
INSERT INTO counters (`key`, n) VALUES ('a', 1) ON DUPLICATE KEY UPDATE n = n + 1;
MERGE INTO counters c USING (SELECT 'a' AS k, 1 AS v FROM dual) src ON (c.key = src.k) WHEN MATCHED THEN UPDATE SET c.n = c.n + 1 WHEN NOT MATCHED THEN INSERT (key, n) VALUES (src.k, src.v);
Full-Text Search & Vector Search — The Modern Asks
"Search" used to mean Elasticsearch. Now relational DBs have caught up enough that for many products, you can skip the extra service.
Full-text search
| DB | Native FTS | Notes |
|---|---|---|
| Postgres | tsvector + tsquery, ranking via ts_rank | GIN-indexed. Multilingual dictionaries. Production-grade. |
| MySQL | FULLTEXT indexes (InnoDB and MyISAM) | Boolean + natural-language modes; weaker stemming than PG |
| Oracle | Oracle Text — extremely powerful | Enterprise feature, full linguistic suite |
| SQL Server | Full-text indexes with stemming & thesaurus | Decent but quirky setup |
| SQLite | FTS5 module — built-in | Surprisingly good for what it is |
ALTER TABLE articles ADD COLUMN tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED; CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv); SELECT id, title, ts_rank(tsv, query) AS rank FROM articles, plainto_tsquery('english', 'replication lag') query WHERE tsv @@ query ORDER BY rank DESC LIMIT 10;
Vector search — the AI/embeddings era
| DB | Vector support | Index type |
|---|---|---|
| Postgres | pgvector extension (now installed by AWS RDS, GCP CloudSQL, Azure) | HNSW, IVFFlat |
| Oracle 23ai | Native VECTOR type and similarity functions | Native HNSW |
| SQL Server 2025+ | Native vector type and similarity search (preview) | Limited |
| MySQL HeatWave | Vector support in HeatWave (cloud only) | Limited |
| SQLite | sqlite-vec extension | Brute-force + IVF |
CREATE EXTENSION vector; CREATE TABLE docs ( id BIGSERIAL, content TEXT, embedding VECTOR(1536) -- e.g. OpenAI ada ); CREATE INDEX idx_docs_embedding ON docs USING hnsw (embedding vector_cosine_ops); -- Find the 5 docs closest to a query embedding SELECT id, content FROM docs ORDER BY embedding <=> '[0.012, -0.004, ...]'::VECTOR LIMIT 5;
Concurrency & Locking — Where Deadlocks Are Born
Two transactions modifying the same data is the source of every interesting database story. The locking strategies differ in granularity (row, page, table), in scope (data + index keys?), and in whether reads block writes (lock-based) or coexist (MVCC).
Lock granularity
| DB | Default lock granularity | Escalation |
|---|---|---|
| Postgres | Row-level for writes (no page locks). Table-level for DDL. | None — Postgres tracks locks per-row in shared memory; you can run out of max_locks_per_transaction |
| MySQL InnoDB | Row-level + next-key (gap) locks under RR | None — but gap locks can lock more than you expect |
| Oracle | Row-level, no lock escalation, no lock manager (locks stored in data blocks) | None — Oracle's "locks are free" boast |
| SQL Server | Row → page → table lock escalation (default ~5000 row locks) | Lock escalation can surprise; ALTER TABLE ... SET (LOCK_ESCALATION = DISABLE) |
Lock modes you'll encounter
- Shared (S) — multiple readers can hold it.
- Exclusive (X) — only one holder, blocks all others.
- Update (U) (SQL Server, MySQL) — intent-to-write, prevents lock upgrade deadlocks.
- Intent (IS, IX) — at the parent level (table) declaring that some child (row) is locked.
- Gap / next-key (InnoDB) — locks the range between index keys to block phantom inserts under RR.
Explicit row locking — almost portable
-- Postgres / MySQL 8.0 / Oracle / SQL Server (with READPAST) SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1 FOR UPDATE SKIP LOCKED; -- The row is yours; other workers skip past it.
SKIP LOCKED in 8.0. Postgres has had it since 9.5. SQL Server uses WITH (UPDLOCK, READPAST) as the hint syntax. Worker-queue code written on one is rarely portable to another without rewrites.Deadlocks
Every DB detects deadlocks (cycles in the lock graph) and aborts one transaction. The differences:
- Postgres & MySQL InnoDB — detect after a short timeout, pick the cheapest victim (least work done) and roll it back. Your client gets a deadlock error and must retry.
- Oracle — detects via a cycle-finder thread; rolls back the transaction that closed the cycle.
- SQL Server — fastest detector; you can set
DEADLOCK_PRIORITYto influence the victim choice.
Partitioning & Sharding — Living Past a Single Box
Every database eventually outgrows one machine. The relational world splits into two answers: partition the table (still one DB, split internally) and shard the database (multiple DBs, one logical table). Built-in support for each varies wildly.
Partitioning — built-in
| DB | Partition types | Notes |
|---|---|---|
| Postgres | Range, list, hash (since 10) | Native declarative since 10; mature since 12. Foreign keys to partitioned tables since 12. |
| MySQL | Range, list, hash, key, columns variants | InnoDB only (since 5.7). Limit ~8K partitions per table. |
| Oracle | Range, list, hash, composite, interval, reference, system | Most flexible by far; partition advisor and online operations |
| SQL Server | Range only (via partition functions) | Up to 15K partitions; integrates with partition switching for fast loads |
CREATE TABLE events ( id BIGINT, ts TIMESTAMPTZ NOT NULL, body JSONB ) PARTITION BY RANGE (ts); CREATE TABLE events_2026_01 PARTITION OF events FOR VALUES FROM ('2026-01-01') TO ('2026-02-01'); CREATE TABLE events_2026_02 PARTITION OF events FOR VALUES FROM ('2026-02-01') TO ('2026-03-01'); -- Inserts route automatically; queries with ts predicate prune partitions SELECT * FROM events WHERE ts >= '2026-02-01';
Sharding — bring your own
Built-in horizontal scale across machines is where the open-source relational DBs don't shine. The common answers:
| DB | Sharding option | What it gives you |
|---|---|---|
| Postgres | Citus extension (now part of Microsoft Azure) | Distributed Postgres — transparent sharding, distributed JOINs |
| MySQL | Vitess (used by YouTube, Slack, Pinterest) | Sharding middleware with online resharding |
| Oracle | Oracle Sharding (12.2+) | Native — system-managed or user-defined |
| SQL Server | Elastic Database / sharded pools on Azure | Cloud-managed sharding |
| NewSQL | CockroachDB, Spanner, TiDB, YugabyteDB | Sharding built into the storage layer from day one |
High Availability & Failover — Who Becomes Primary When?
Replication gives you copies. HA decides what happens when the primary dies — who promotes a replica, when, and how do you stop two primaries from existing at once (split-brain).
| DB | Native HA | How it elects a new primary |
|---|---|---|
| Postgres | None native — Patroni, repmgr, Stolon | Patroni uses a DCS (etcd/Consul/Zookeeper) for leader election; promotes the most caught-up replica |
| MySQL | Group Replication, InnoDB Cluster (with MySQL Router) | Group Replication uses Paxos-style consensus; usually quorum of 3 or 5 |
| Oracle | Data Guard (replicas) + RAC (active-active on shared storage) | Fast Start Failover via Data Guard observer; RAC has no failover — all nodes are active |
| SQL Server | Always On Availability Groups | Windows Server Failover Cluster + AG listener |
Synchronous vs asynchronous replication
The core trade-off is durability vs latency:
- Async — primary commits, ships WAL to replicas later. Fast. If primary dies after commit but before ship, those commits are lost.
- Sync — primary waits for at least one replica to acknowledge before returning success. Slower. No data loss on primary failure.
- Semi-sync (MySQL) — wait for at least one replica's receipt, not full apply. Middle ground.
- Quorum (Group Replication, Spanner, Postgres with quorum committee) — wait for majority. Tolerates one failure with no data loss.
# In postgresql.conf: synchronous_commit = on # default — fsync to local WAL synchronous_commit = remote_write # wait for replica receipt synchronous_commit = remote_apply # wait for replica replay synchronous_commit = off # fastest, may lose ~3 commits on crash synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)' # Commit succeeds when any 2 of these 3 replicas ack.
ANY N or have explicit failover for the sync replica too.Performance Profile — Who's Faster, and at What?
"X is faster than Y" is meaningless without the workload. Here's the workload-by-workload truth — each row of the table below is a deep story about what the database is actually doing under the hood. Skim the table; read the subsection that matches your real workload.
| Workload | Winner | One-line why |
|---|---|---|
| Single-row PK lookup, very high QPS | MySQL InnoDB | Clustered PK + sequential autoinc + adaptive hash index |
| Complex analytical queries (joins, window funcs) | Postgres, SQL Server, Oracle | Mature cost-based planners + parallel query + JIT |
| Write-heavy with random keys | MyRocks (LSM), Postgres tuned | LSM trees absorb random writes; B-trees thrash |
| Mixed OLTP + reporting (HTAP) | Oracle, SQL Server | Resource governors + columnstore prevent noisy-neighbor |
| JSON-heavy with queries | Postgres JSONB | Binary format + GIN indexes = indexed path queries |
| Time-series | TimescaleDB / BRIN | Hypertables + tiny BRIN indexes on naturally-sorted data |
| Embedded / on-device | SQLite | No server, no IPC; function calls into a C library |
| Geospatial | Postgres + PostGIS | 20+ years of mature R-tree / GiST spatial indexing |
① Single-row PK lookup at very high QPS — MySQL InnoDB
This is the workload InnoDB was built for, and the design decisions compound. Each one is small; together they explain the 30–40% edge MySQL holds on point reads.
The clustered index IS the table. In InnoDB, the primary key B+Tree's leaf nodes contain the actual row data — not pointers to it. A lookup like SELECT * FROM users WHERE id = 42 is one B+Tree traversal and the row is right there. Postgres splits this into two structures: a separate heap (unordered rows) and a B-Tree index that points into it. Index lookup → heap fetch = two cache lines touched instead of one.
Sequential AUTO_INCREMENT keeps the hot region tiny. Monotonic keys (1, 2, 3, …) mean every insert lands at the rightmost leaf of the B+Tree. The next ~1000 rows live in one 16KB page, that page stays pinned in the buffer pool, and new rows are written sequentially to disk — no page splits, no fragmentation. Switch the PK to a UUID and every insert lands at a random spot in the tree; throughput drops 3–5×. This is why "use BIGINT AUTO_INCREMENT, never UUID v4 as your PK" is the loudest community rule in MySQL.
The Adaptive Hash Index — the silent weapon. InnoDB watches access patterns. When certain PK values get hit repeatedly, it builds an in-memory hash index on top of the B+Tree, transparently. Subsequent lookups become O(1) hash hits, skipping the B-tree entirely. You don't configure it. You don't see it in EXPLAIN. It just sits there making hot reads faster. Postgres has no equivalent.
INSERT BUFFER AND ADAPTIVE HASH INDEX ------------------------------------- Hash table size 4425293, node heap has 1234 buffer(s) 139.93 hash searches/s, 12.41 non-hash searches/s
That ratio (~92% hash hits) is the AHI catching most PK lookups before they ever touch the B-tree.
Tiny per-row overhead. InnoDB rows carry 6 bytes of transaction ID + 7 bytes of rollback pointer = 13 bytes of MVCC header. Postgres tuples carry a 24-byte header (xmin, xmax, cmin, cmax, ctid, infomask). For a small row, that's 30% more on-disk size in Postgres — fewer rows per page, more I/O per lookup.
No version-chain walk for the live row. At READ COMMITTED, InnoDB reads the current row in-place; only if your snapshot needs an older version does it walk the undo log. Postgres always reads the heap tuple, then checks visibility, then potentially walks HOT chains — extra bookkeeping per read.
WHERE email = ? is two B+Tree traversals (email index → PK → row). If your PK is a 36-byte UUID string, every secondary index is bloated by carrying it in every leaf. Yet another reason narrow integer PKs matter in MySQL even more than in Postgres.Numbers, single 32-core / NVMe / all-in-RAM box: InnoDB ~250K point-read QPS, Postgres ~180K. Sequential inserts: InnoDB ~150K/s, Postgres ~110K/s. The edge is real and consistent — and disappears the moment your query is more complex than WHERE id = ?.
② Complex analytical queries — Postgres, SQL Server, Oracle
The moment a query has 5+ joins, window functions, CTEs, or aggregations across millions of rows, the bottleneck stops being storage and becomes the query planner. This is where Postgres, SQL Server, and Oracle pull ahead — sometimes by 10× or more — over MySQL.
Cost-based optimization with deep statistics. All four DBs are cost-based, but the depth of statistics differs. Postgres tracks column histograms, n-distinct estimates, most-common-value lists, and (since 10) extended statistics over correlated columns. Oracle's optimizer has 40 years of accumulated hints, hashes, and heuristics. SQL Server has cardinality-estimation versions you can choose between. MySQL 8's optimizer is improving — hash joins arrived in 8.0.18, after Postgres had them for two decades — but the planner still chokes on queries that the other three handle without breaking a sweat.
Join algorithms. Postgres, SQL Server, and Oracle implement nested-loop, hash-join, and merge-join, and the planner picks dynamically based on row estimates. MySQL was nested-loop-only until 8.0.18 (yes, you read that right). A query like "join 50M orders to 100K customers on a non-indexed column" was effectively O(N×M) in MySQL — minutes — while Postgres did a hash join in seconds.
Parallel query. Postgres added parallel sequential scan in 9.6 and now parallelizes joins, aggregates, and index scans. Oracle and SQL Server have parallelized everything for decades. MySQL's parallel execution arrived in 8.0 but only for specific operations (CHECK TABLE, parallel index reads in 8.0.14). For a 1B-row aggregate, Postgres with max_parallel_workers_per_gather = 8 finishes in 1/6th the wall time of single-threaded execution.
JIT compilation. Postgres 11+ compiles query expressions to native code via LLVM for queries above a cost threshold — a 30M-row aggregation that interprets one tuple at a time can become a tight machine-code loop. SQL Server and Oracle have their own equivalent (batch-mode execution + adaptive memory grants). MySQL has none.
Window functions, recursive CTEs, lateral joins. All standard since SQL:2003. Postgres has had window functions since 8.4 (2009); SQL Server and Oracle have had them longer. MySQL added them in 8.0 (2018). Code written against PG/SQL Server/Oracle often won't even parse on older MySQL.
-- "Find each customer's top 3 orders by value, plus a running total" SELECT customer_id, order_id, amount, RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rk, SUM(amount) OVER (PARTITION BY customer_id ORDER BY placed_at) AS running_total FROM orders WHERE placed_at >= NOW() - INTERVAL '90 days';
Numbers: on a 100M-row analytical workload (TPC-H Q5-style), Postgres 16 with parallel query routinely runs 5–10× faster than MySQL 8 on the same hardware. Oracle and SQL Server typically beat both — at the cost of license fees.
FORCE INDEX hints are a common smell. Same query on Postgres rarely needs hints because the planner has more options to consider. If your roadmap has growing analytical complexity and you're on MySQL, plan the Postgres migration before the query latency becomes a crisis.③ Write-heavy with random keys — MyRocks, Postgres (tuned)
The default storage in every relational DB is a B-Tree. B-Trees are wonderful for ordered access and range scans — and a disaster for sustained random writes at scale. This is the workload that LSM trees (Log-Structured Merge) were invented to handle.
Why B-Trees suffer. Every insert finds its position in the tree (an O(log n) lookup), reads the page, modifies it, and writes it back. If your keys are random (UUIDs, hashed IDs), every insert lands on a different page. The buffer pool churns — pages get pulled in, dirtied, evicted, pulled in again. Page splits cascade. SSDs feel some of this pain less than spinning disks but they still pay the write amplification cost (10–20× more bytes written than logical data).
How LSM trees fix it. An LSM tree accepts writes into an in-memory sorted structure (memtable). When that fills, it's flushed to disk as a sorted file (SSTable). New SSTables stack on top of older ones. A background process compacts them — merging and discarding overwrites. Writes are always sequential. Random updates become append-only.
WRITE arrives: 1. Append to WAL on disk // sequential — fast 2. Insert into memtable // in-memory sorted Memtable fills (~64MB): 3. Flush to SSTable on disk // sequential write of sorted data Background: 4. Compact older SSTables // merge + dedupe + drop tombstones
MyRocks packages RocksDB as a MySQL engine. Originally built by Facebook to shrink their UserDB MySQL footprint; they reported ~50% space savings and ~50% write reduction vs InnoDB. For SSD-heavy write workloads (1M+ writes/sec), MyRocks does 3–5× the sustained throughput of InnoDB with random keys.
Postgres's middle ground. Postgres is B-Tree-based, but it has tricks that soften random-write pain: HOT (Heap-Only Tuples) updates avoid index updates when no indexed column changed; fillfactor < 100 leaves slack in pages so updates stay in-place; partitioning lets you drop entire chunks instead of deleting row-by-row. With these dials tuned, Postgres can handle very heavy writes — just not at the scale where MyRocks shines.
| Engine | Type | Random-write strength | Read penalty |
|---|---|---|---|
| InnoDB | B+Tree | Moderate (page churn under random keys) | None |
| Postgres heap | Heap + B-Tree index | Moderate (HOT updates help) | None |
| MyRocks | LSM (RocksDB) | Excellent — sequential always | Slight — reads may check multiple SSTables |
| Cassandra / ScyllaDB | LSM | Excellent (write-optimized DB) | Higher for ranges |
The catch: LSM reads are slower because a key may live in the memtable, in any of several SSTables, or have been deleted (tombstoned). Bloom filters mitigate this but don't eliminate it. If your workload is 80% writes / 20% reads, LSM wins. If it's 80% reads / 20% writes, B-Tree wins. For 50/50, benchmark both — there's no shortcut.
④ Mixed OLTP + reporting (HTAP) — Oracle, SQL Server
"Real-time dashboards on the production database" is the dream that haunts every architect. The problem is the noisy-neighbor effect: a reporting query that scans 50M rows starves the buffer pool of OLTP-relevant pages, holds shared locks under lock-based isolation, and saturates I/O bandwidth. OLTP latency spikes from 2ms to 200ms while the report is running.
Oracle's answer: Resource Manager + Materialized Views with Query Rewrite. Resource Manager lets you guarantee CPU, I/O, and parallelism per consumer group — analytics gets capped at 20% CPU, OLTP gets the rest. Materialized views with automatic refresh pre-compute reporting answers; the optimizer can transparently rewrite SELECT SUM(...) queries to hit the MV instead of the base table. Oracle's In-Memory Column Store (12c+) caches the same data in both row format (for OLTP) and columnar format (for analytics) — same buffer pool, two representations.
SQL Server's answer: Resource Governor + Columnstore indexes. Resource Governor is similar to Oracle's. Columnstore indexes are the killer feature — same table, both rowstore (for OLTP) and columnstore (for analytics) indexes. Analytical queries hit the columnstore and get 10–100× speedup with 5–10× compression; OLTP keeps using the clustered rowstore. SQL Server tracks which rows have been recently modified and merges deltas into the columnstore lazily.
-- Clustered rowstore for OLTP point reads CREATE TABLE orders ( id BIGINT IDENTITY PRIMARY KEY, customer_id BIGINT, amount DECIMAL(19,4), placed_at DATETIME2 ); -- Nonclustered columnstore for analytics on same table CREATE NONCLUSTERED COLUMNSTORE INDEX idx_orders_cs ON orders (customer_id, amount, placed_at); -- Point read uses the clustered rowstore (~1ms) SELECT * FROM orders WHERE id = 42; -- Aggregate uses the columnstore (~100ms instead of 30s) SELECT customer_id, SUM(amount) FROM orders WHERE placed_at >= '2026-01-01' GROUP BY customer_id;
Why Postgres and MySQL struggle here. Postgres has parallel query (great) but no resource governor (a runaway analytical query can exhaust connections and slow OLTP). Native columnstore is missing — extensions like citus columnar and hydra are still maturing. MySQL has none of these, plus a weaker planner — running heavy analytics on the OLTP master is genuinely dangerous.
The modern alternatives. If HTAP is the requirement, the realistic options in 2026 are: SQL Server with columnstore, Oracle with In-Memory, SingleStore (purpose-built HTAP), or TiDB with its TiKV (rowstore) + TiFlash (columnstore) split. Postgres shops usually solve it by streaming CDC to a separate analytics warehouse (ClickHouse, BigQuery, Snowflake) — works well but adds operational complexity.
⑤ JSON-heavy with queries — Postgres JSONB
Storing JSON is easy in every DB. Querying it with index speed is where Postgres pulls ahead by a margin that's actually hard to overstate.
JSONB is a parsed binary tree. When you INSERT '{"user":"alice","tags":["a","b"]}' as JSONB, Postgres parses it once, validates it, and stores it as a binary structure where keys are interned and values are typed. Every subsequent read avoids re-parsing. MySQL JSON does the same. SQL Server's NVARCHAR(MAX) + JSON functions does not — every query re-parses the text.
GIN indexes — the real superpower. A GIN (Generalized Inverted iNdex) on a JSONB column indexes every key, every value, and every path inside the document. The containment operator @> can then satisfy queries like "find all docs whose type is 'login' AND region is 'IN'" in milliseconds against 100M-row tables.
CREATE INDEX idx_events_payload ON events USING GIN (payload jsonb_path_ops); -- Indexed lookup on nested JSON keys — milliseconds even at 100M rows SELECT id, payload->>'user_id' AS uid FROM events WHERE payload @> '{"type":"login","region":"IN"}' AND created_at >= NOW() - INTERVAL '1 day';
What the others can do. MySQL JSON requires generated virtual columns to be indexable: extract the field into a column, index that column. Works, but you need to know up front which fields you'll query. SQL Server is in the same boat. Oracle 21c added a proper JSON search index (similar to GIN); pre-21c it was CLOB + function-based indexes.
| Capability | Postgres JSONB | MySQL JSON | Oracle 21c+ | SQL Server |
|---|---|---|---|---|
| Re-parse on each query | No (parsed once) | No | No | Yes (text-based) |
| Indexed without knowing field | Yes (GIN over everything) | No | Yes (search index) | No |
| Indexed for specific field | Yes (expression index) | Yes (generated column) | Yes | Yes (computed column) |
| Partial in-place update | Limited (rewrites whole value) | Yes (5.7.22+) | Yes | No |
Numbers: a containment query (WHERE payload @> '{...}') against a 50M-row JSONB table with a GIN index typically returns in 5–30ms. Without the index it's a 30-second sequential scan. Same query against MySQL JSON without a generated column = 30+ seconds; with the right generated column + index = ~10ms (but only for that specific extracted field).
views counter inside a 50KB JSONB blob and you increment it 1000×/sec, you're writing 50MB/sec of WAL. Lift hot counters into their own columns or Redis. JSONB is for occasionally-mutated, frequently-read shapes — not for hot path partial updates.⑥ Time-series — TimescaleDB & BRIN indexes
Time-series workloads have a very specific shape: data arrives in time order, queries are almost always range scans by time, you aggregate over windows, and old data is rarely updated. Every part of a general-purpose DB can be specialized around that shape.
BRIN indexes — tiny indexes for sorted data. A B-Tree index on a 100GB time-series table is typically 15–20GB. A BRIN (Block Range INdex) on the same column is ~5MB. BRIN works by storing, for each contiguous block range, just the min/max value of the indexed column. When the column is naturally sorted (like created_at in an append-only logs table), BRIN can skip 99% of pages with a tiny memory footprint. The trade-off: if data isn't sorted, BRIN is useless.
CREATE INDEX idx_logs_ts ON logs USING BRIN (created_at) WITH (pages_per_range = 128); -- Query: "last 1 hour of logs" scans < 1% of the table SELECT * FROM logs WHERE created_at >= NOW() - INTERVAL '1 hour';
TimescaleDB — Postgres turned into a time-series DB. A Postgres extension that introduces hypertables: virtual tables that automatically partition by time into many child tables (chunks). The optimizer routes queries to relevant chunks only, parallelizes across them, and supports continuous aggregates (materialized rollups refreshed incrementally). Add native columnar compression (8–15× compression ratio) for old chunks and you have a serious time-series engine wearing Postgres clothing.
Why purpose-built time-series DBs still win at extreme scale. ClickHouse, InfluxDB, and QuestDB are built ground-up for this workload. ClickHouse routinely ingests millions of rows/sec on a single node and aggregates billions of rows in seconds via vectorized execution + columnar storage. For pure metrics at extreme scale, they outperform TimescaleDB. The trade-off: SQL dialect quirks, weaker transactional semantics, separate operational stack.
| Option | Strength | Weakness |
|---|---|---|
| Postgres + BRIN | Zero new dependencies, "just an index" | No automatic partitioning, no columnar compression |
| TimescaleDB | SQL-compatible, transactional, hypertables, compression | Adds an extension to operate |
| ClickHouse | Insane aggregation speed, columnar, vectorized | SQL dialect, weak updates/deletes |
| InfluxDB | Purpose-built API, retention policies, downsampling | Storage engine has churned across versions |
Heuristic: if you're under ~1B rows / 1TB / 100K writes-per-sec, BRIN or TimescaleDB inside Postgres is the right answer — one database to operate. Beyond that, a dedicated TS engine pays for itself in ingestion throughput and query latency.
⑦ Embedded / on-device — SQLite
SQLite isn't competing with the others — it's a different category of database. Every characteristic that "limits" it (no server, no network, one writer at a time) is actually an asset for the workloads it owns.
No IPC means microsecond latency. A query in Postgres or MySQL is a network round-trip (even on localhost: serialize → kernel socket → deserialize → parse → execute → serialize back). A SQLite query is a function call into a C library — nanoseconds of overhead. For "open file, read 1000 rows, close" patterns (browser history, app preferences, IoT sensor log), SQLite is 10–100× faster than client/server DBs on the same hardware.
WAL mode + memory-mapped I/O. Default journal mode (rollback journal) serializes everything. WAL mode lets multiple readers proceed concurrently with one writer — and writers no longer block readers. Add PRAGMA mmap_size = 30GB and SQLite memory-maps the database file directly, bypassing kernel buffers. Reads become pointer dereferences. Insert throughput on a local NVMe with WAL + synchronous = NORMAL routinely hits 100K rows/sec.
PRAGMA journal_mode = WAL; PRAGMA synchronous = NORMAL; -- still crash-safe, fsync less often PRAGMA cache_size = -100000; -- 100 MB of page cache PRAGMA temp_store = MEMORY; PRAGMA mmap_size = 30000000000; -- 30 GB mmap window PRAGMA foreign_keys = ON; PRAGMA wal_autocheckpoint = 1000;
The one-writer ceiling. SQLite serializes writes globally — even in WAL mode. For a single-user app or a low-write web service, this never matters. For a multi-tenant SaaS doing 10K writes/sec, it's a hard wall. Litestream and LiteFS work around it for some patterns (single-writer + replicated readers), and "one SQLite database per tenant" is a legitimate architecture used by Notion-style products — but at that point you're sharding manually.
Numbers: on a MacBook Pro NVMe with WAL: ~100K simple inserts/sec, ~500K simple reads/sec, query latency consistently under 10µs. No setup, no daemon, no port, no auth. Just a file.
⑧ Geospatial — Postgres + PostGIS
If you're doing anything with maps, locations, polygons, routing, or "find everything within 5km of here", PostGIS is the answer. It's the only choice with the depth of features and ecosystem maturity for serious geospatial work — and it's been the answer for 20+ years.
R-tree indexing via GiST. A B-Tree can't index 2D points or polygons (there's no total ordering of shapes in 2D space). PostGIS uses Postgres's GiST (Generalized Search Tree) infrastructure to build R-tree indexes that bound geometries in rectangles, then recursively bound those rectangles. A "find everything within this polygon" query becomes a tree pruning operation that visits a tiny fraction of the table.
The function library is exhaustive. PostGIS implements the full OGC Simple Features SQL standard (300+ ST_* functions): distance, intersection, union, buffer, convex hull, nearest-neighbor, raster operations, network analysis (via the pgRouting extension). Anything you can express on a 2D plane or a sphere, there's a function for it.
CREATE EXTENSION postgis; CREATE TABLE restaurants ( id BIGSERIAL PRIMARY KEY, name TEXT, geom GEOGRAPHY(POINT, 4326) -- WGS84 lat/lng ); CREATE INDEX idx_rest_geom ON restaurants USING GIST (geom); -- Find restaurants within 1km of Bangalore, sorted by distance SELECT id, name, ST_Distance(geom, user_loc) AS meters FROM restaurants, (SELECT ST_MakePoint(77.5946, 12.9716)::GEOGRAPHY AS user_loc) u WHERE ST_DWithin(geom, user_loc, 1000) ORDER BY geom <-> user_loc -- kNN operator — uses GIST index LIMIT 20;
Alternatives, briefly:
- Oracle Spatial — comparable depth, proprietary, expensive. Used in defense/government.
- SQL Server geometry/geography — solid but smaller function library; well integrated with Power BI.
- MySQL spatial types — basic; lacks the depth of PostGIS function library; geographic (vs Cartesian) computations are limited.
- SQLite + Spatialite — bring PostGIS-like spatial functions to SQLite; great for offline maps in apps.
- MongoDB 2dsphere — adequate for point-in-radius queries; not a real geospatial DB beyond that.
Numbers: PostGIS routinely handles 100M+ polygon tables (parcel data, country boundaries, road networks) on commodity hardware. Spatial joins between two such tables complete in seconds with proper GiST indexes. Uber, Airbnb, Strava, and every mapping startup of the last 15 years has had PostGIS somewhere in the stack.
Numbers people quote (rough, single-node, well-tuned)
- Postgres 16 — 50K–100K TPS for OLTP; pgbench routinely shows 100K+ TPS on modern hardware. Complex analytics 5–10× MySQL.
- MySQL 8 — 200K+ QPS for simple PK reads; clustered index + AHI give it the edge for that pattern. Loses ground as queries get complex.
- SQLite — 100K+ inserts/sec on local NVMe with WAL mode; single-writer ceiling.
- MyRocks — 3–5× InnoDB write throughput on random-key workloads with SSDs; ~50% space saving.
- Oracle / SQL Server — depend entirely on hardware and edition. They scale further vertically (NUMA-aware, parallel everything) than open-source counterparts; In-Memory / Columnstore tilt HTAP workloads decisively in their favor.
- TimescaleDB — 1M+ time-series rows/sec ingest on a single node; queries on compressed chunks 10–100× the equivalent on vanilla Postgres tables.
SQLite — The Relational DB You Use Every Day Without Noticing
SQLite is the most widely deployed database engine in the world. It runs in every iPhone, every Android, every Chrome and Firefox install, on every airplane (literally — it's in Airbus avionics). Not a server. Not a "lightweight version" of anything. A different category of relational DB entirely.
What makes it different
- No server process. A C library you link in. The "database" is a single file. To connect, you
fopen()that file. - Serverless (the original meaning). No network protocol, no auth, no port. Your process talks to the file directly via the library.
- Whole-file locking — only one writer at a time. Multiple readers under WAL mode.
- Dynamic typing — columns have "type affinity" not strict types. You can put a string into an INTEGER column (it'll store as text). This is unique and divisive.
- Public domain. Use anywhere, commercial or not, no attribution. Hipp does not believe in licenses.
When SQLite is the right answer
- Embedded apps — phones, IoT, browser local state, desktop apps.
- Low-concurrency websites — Litestream/LiteFS now make SQLite viable for production web apps (Tailscale, Pieter Levels' projects).
- Test fixtures — spin up an in-memory DB in 5ms, run tests, throw it away.
- File format — Apple, Adobe, and dozens of others use SQLite as a structured file format. Better than inventing your own.
- Analytics on local files — modern SQLite + DuckDB-style analytics is great for desktop data crunching.
When it's not
- High write concurrency — only one writer at a time.
- Network access — you can't connect across machines (technically you can via NFS, but you shouldn't).
- Fine-grained auth — there's no user system.
PRAGMA journal_mode = WAL; -- writers don't block readers PRAGMA synchronous = NORMAL; -- fsync less often, still durable on crash PRAGMA foreign_keys = ON; -- off by default, surprisingly PRAGMA temp_store = MEMORY; PRAGMA mmap_size = 30000000000; -- memory-mapped I/O
CREATE TABLE t (n INTEGER) happily accepts INSERT INTO t VALUES ('hello'). The 'hello' is stored as TEXT in that column. Use the STRICT table flag (added 3.37) to enforce real types.MariaDB vs MySQL — The Fork That Stuck
Oracle bought Sun in 2010, inheriting MySQL. Monty Widenius (MySQL's original author) didn't trust Oracle and forked MySQL into MariaDB. Fifteen years later, MariaDB is everywhere — but the projects have diverged enough that "they're the same" is no longer true.
| Feature | MySQL 8 | MariaDB 11 |
|---|---|---|
| Default storage engine | InnoDB | InnoDB (their own fork) + Aria for system tables |
| JSON | Binary native JSON type | JSON is an alias for LONGTEXT with helper functions |
| UUID type | Function only (8.0.13+) | Native UUID column type (10.7+) |
| Sequences | — | Yes (Oracle-compatible CREATE SEQUENCE) |
| System-versioned tables | — | Yes — full SQL:2011 temporal tables |
| Oracle compatibility mode | — | Yes (PL/SQL subset) |
| Group replication / clustering | Group Replication, InnoDB Cluster | Galera Cluster (sync multi-master) |
| License | GPL / commercial dual | GPL only |
| Roadmap driver | Oracle Corporation | MariaDB Foundation (non-profit) |
SQL Server & Oracle — Where Enterprise Workloads Live
Open-source DBs dominate web. Proprietary enterprise DBs dominate banks, telcos, ERPs, governments. The features that justify their license fees are less about raw speed and more about completeness — every analytics tool, every compliance feature, every migration path is bundled.
SQL Server — Microsoft's Swiss-army knife
- T-SQL — the procedural language is genuinely powerful (window functions, recursive CTEs, table-valued parameters).
- SSAS / SSIS / SSRS — bundled OLAP cubes, ETL, and reporting.
- Columnstore indexes — game-changing for data-warehouse workloads.
- Always On AGs — HA story is one of the smoothest.
- Linux support since 2017 — used to be Windows-only, now happily on Linux containers.
- Editions — Express (free, 10GB), Developer (free, full features for non-prod), Standard, Enterprise (very expensive).
Oracle — The big iron
- RAC (Real Application Clusters) — multiple instances share one storage layer. Active-active. The only way to "scale up" a single Oracle DB across machines without rewriting your app.
- PL/SQL — most powerful procedural language of any RDBMS.
- Partitioning — best-in-class.
- Data Guard + GoldenGate — physical and logical replication respectively, both mature.
- Exadata — Oracle's purpose-built hardware/software stack. Pricey, fast, all-in-one.
- 23ai — adds native vector search, property graph, JSON-relational duality. Oracle catching up on modern features.
- License complexity — per-core, per-named-user, per-processor with multipliers; LMS audits are legendary.
Why anyone still pays
- 30-year-old codebases written in PL/SQL or T-SQL. Re-platforming is a $10M+ project.
- Compliance certifications (HIPAA, FedRAMP at the highest levels) come pre-built.
- Vendor support contracts — when production is down, having a paid escalation path matters.
- Specific features no open-source DB matches yet (RAC for true shared-storage multi-master, Exadata Smart Scan, Oracle Text linguistic features).
NewSQL — CockroachDB, Spanner, TiDB, YugabyteDB
Postgres scales vertically. Sharding scales horizontally but breaks transactions. NewSQL is the answer to "can we have SQL + ACID + horizontal scale, please?" — and as of 2026, the answer is "yes, with some caveats."
The contenders
| DB | Wire protocol | Consensus | Notes |
|---|---|---|---|
| Google Spanner | Custom SQL dialect | Paxos + TrueTime atomic clocks | Multi-region external consistency. Cloud-only. |
| CockroachDB | Postgres wire protocol | Raft per range | "Survive anything." Open source until BSL change in 2024. |
| TiDB | MySQL wire protocol | Raft + Percolator | HTAP — separate TiKV (OLTP) + TiFlash (columnar OLAP). |
| YugabyteDB | Postgres-compatible (YSQL) + Cassandra-compatible (YCQL) | Raft | Open source. Postgres-derived query layer. |
| Vitess | MySQL wire protocol | Custom (uses MySQL replicas) | Sharding layer, not a new storage engine. Powers YouTube. |
| Citus (PG ext) | Postgres | Standard PG replication | Distributed Postgres tables via co-location. Microsoft Azure. |
What you give up
- Single-node latency. A simple insert that's 0.1ms in single-node Postgres becomes 2–5ms in CockroachDB (consensus round-trip).
- Full feature parity. Stored procedures, exotic data types, some indexes — usually a subset of what their wire-compatible parent supports.
- Operational simplicity. Now you're running a distributed system. Backups, restores, schema changes are different beasts.
What you gain
- Geographic distribution with strong consistency. A row in Mumbai stays consistent with reads in San Francisco.
- Linear scale-out. Add a node, get more capacity. No sharding decisions.
- Survive failures. Lose a node (or a whole region) — no downtime.
- One database, many regions. Compliance scenarios where data residency matters.
Migration Pitfalls — Moving From One to Another
"Just port the SQL" is the optimist's plan. Real migrations break on the dialect differences, the data-type coercions, the procedural code, and the operational habits.
The classic landmines
- Quoting and case sensitivity. Postgres folds unquoted identifiers to lowercase; MySQL is case-sensitive on Linux, case-insensitive on Mac/Windows. Code that "works locally" breaks in prod.
- Auto-increment semantics. MySQL's
AUTO_INCREMENTnever reuses values; Postgres sequences can have gaps. Code that depends on contiguous IDs breaks. - Empty string vs NULL. Oracle treats
''as NULL; Postgres and MySQL treat them as distinct. Migrations Oracle → anything else surface this immediately. - Date/time semantics. MySQL stores
TIMESTAMPin UTC and converts on read; Postgres'sTIMESTAMPTZis similar but the conversion rules differ.DATETIMEin MySQL has no timezone at all. - Boolean type. Postgres has it; MySQL fakes it with
TINYINT; Oracle has none. Cross-DB ORMs handle this differently. - NULL ordering. Postgres: NULLs LAST in ASC by default. MySQL: NULLs FIRST.
ORDER BY x ASC NULLS LASTworks on Postgres only. - Concurrency hot-spots. A pattern that survives in MySQL InnoDB (auto-increment hot-spot) may underperform in Postgres (sequence cache contention) — and vice versa.
- Stored procedures. See section 9 — basically rewrite from scratch.
What a real migration looks like
- Schema port — types, constraints, indexes (1–2 days for small, weeks for large).
- SQL port — replace dialect-specific syntax. Tools: pgloader, Ora2Pg, AWS SCT.
- Procedural code rewrite — the long pole. Often easier to move to application code.
- Data move — bulk load + ongoing CDC until cutover.
- Cutover with rollback plan — usually a freeze window.
- Operational re-learning — monitoring, backups, alerting all need re-instrumentation.
Choosing Cheat Sheet & Rapid-Fire Q&A
Distilled. If you remember nothing else from this page, remember this.
Picking the database — by what you're building
| What you're building | Default pick | Why |
|---|---|---|
| Startup MVP, anything in 2026 | Postgres | Features, JSON, vectors, FTS, modern SQL — out of the box |
| High-volume read-heavy web app | MySQL 8 or Postgres | Both handle it; pick the one your team knows |
| Mobile / desktop / embedded | SQLite | No server, single file, ubiquitous |
| Enterprise with existing PL/SQL | Oracle | Migrating off is a years-long project |
| Shop standardized on .NET / Microsoft | SQL Server | Tooling and integration win it |
| Need multi-region strong consistency | CockroachDB / Spanner | NewSQL solves this; nothing else does |
| Time-series at scale | TimescaleDB or InfluxDB | TS-specific compression and partitioning |
| Geospatial-heavy | Postgres + PostGIS | The reference implementation |
| Existing MySQL but want better governance | MariaDB or stay on MySQL 8 | Both are fine; differences are subtle |
Default to Postgres unless you have a specific reason not to — it has caught up on read performance, surpassed on features (JSON, FTS, vectors, types, transactional DDL), and the operational tooling is now first-class on every cloud.
Existing code and existing contracts. PL/SQL codebases, certified compliance bundles, mission-critical apps with seven-figure replacement costs. New greenfield projects rarely pick Oracle unless they're inside an Oracle-shop already.
Yes. It powers a huge slice of the internet (Booking, Uber, Slack via Vitess, GitHub, WordPress sites). It's faster than Postgres at the simplest read patterns and the ecosystem is enormous. The "Postgres is universally better" narrative is overstated — both are excellent.
Default isolation. MySQL = REPEATABLE READ (with phantom prevention). Postgres = READ COMMITTED (looser). Code that quietly relied on RR semantics will race on Postgres unless you also set SET TRANSACTION ISOLATION LEVEL REPEATABLE READ.
Multi-region, strong-consistency requirement; or you genuinely outgrew a single beefy Postgres (rare — modern hardware pushes that ceiling very high). For most apps, vertical-scaled Postgres + read replicas beats CockroachDB.
Rarely. They lock you to one DB, are hard to test and version, and modern app servers are fast enough that the round-trip cost is negligible. Use them only when truly latency-critical (set-based operations on millions of rows) or for security boundaries.
Yes, with caveats. Combined with WAL mode + Litestream/LiteFS for replication, SQLite powers production sites at moderate scale (Tailscale's coordination plane, Pieter Levels' projects, Expensify's internal services). One-writer-at-a-time is the ceiling.
"Schemaless" / no-migration-needed. Every long-lived schema converges on a contract — the only question is whether the DB enforces it or your app does. Migrations are not the problem; pretending you don't need them is.
CHECK constraints and partial indexes. A CHECK (status IN ('open','closed','cancelled')) or a CREATE INDEX ... WHERE status = 'open' can replace whole classes of application bugs and queries. Free correctness/performance. Use them.
Postgres. It's the database with the most feature breadth, the most active development, the best modern ecosystem (pgvector, TimescaleDB, Citus, PostGIS, pgRouting, FDW), and the lowest cost-of-knowledge — your skills transfer to managed cloud services on AWS, GCP, Azure, and dozens of hosted providers.