High-Level Design

Ticketmaster / BookMyShow

From "50,000 fans clicking the same seat at 09:00:00.001" to a sharded, ACID-backed, fairness-queued booking system — the architecture that earns every box

Read this with the framework in mind

This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is Online Ticket Booking?

It's 08:59 on a Friday morning. The new Marvel film opens at midnight. Sarah, sitting in Bangalore, has her finger hovering over the "buy" button on BookMyShow. So does Raj, two cubicles down. So do 50,000 other fans across the country, all targeting the same theater — PVR Forum, Screen 1 — which has exactly 200 seats. At 09:00:00 sharp the booking window opens. Within the next 0.4 seconds, every one of those 50,000 fans hits "select seat" — and many of them tap the same seat (the legendary J-12, dead center).

The system has to do four things, all at once, all correctly: browse (show movies, theaters, showtimes), select seats (let users pick from a live seat-map), pay (collect money via Stripe/Razorpay), and issue tickets (email a confirmation with a QR code). And it must do all of this with zero double-bookings — because if two people show up at PVR Forum holding tickets for J-12, the company gets sued and the brand dies. That's the system we're designing.

The two questions that drive every design decision below: (1) When 50K users tap the same seat in the same millisecond, how do we pick exactly one winner without double-booking? (2) How do we keep that winner's seat "held" for 5 minutes while they enter their card details, but instantly free it if they walk away — and let waiting users grab it fairly?
Step 2

Requirements & Goals

Before drawing a single box, pin down what the system must do. In an interview, asking these questions out loud signals you're building from first principles, not pattern-matching a memorized solution.

✅ Functional Requirements

  • List cities where the service operates
  • List movies currently running in a given city
  • List cinemas showing a given movie in that city
  • List shows (date + time + screen) for a chosen cinema
  • Render the seat-map of a chosen show with live availability
  • Let users select and hold seats for 5 minutes while paying
  • Atomic multi-seat orders — all seats book together or none do
  • Fair waiting queue when seats are sold out and may free up

⚙️ Non-Functional Requirements

  • Highly concurrent — many users bidding for the same seat
  • ACID-compliant — no double-bookings, no lost payments, no partially-paid orders
  • Highly available, with headroom for release-day spikes of roughly 40–50× normal traffic
  • Low latency on browse — the seat-map should render in under 200ms

🚫 Out of Scope

  • User authentication & identity (assume an existing auth service)
  • Cinema/theater management portals
  • Recommendations & personalization
The non-functional requirements are the harder ones. Listing movies is a database query. Letting 50K users contend for one seat without double-booking — and doing it fairly — is the part that actually requires architecture.
Step 3

Design Considerations

Four hard constraints shape this whole system. Skip any one of them and the design falls apart in production.

🧩 Atomic multi-seat orders

If Sarah wants 4 seats together for her family, the system must book all 4 or none. Booking 3 of 4 and then failing on the fourth leaves her with three useless tickets and a missing family member. This is a textbook ACID transaction — the kind a relational DB does effortlessly and a NoSQL store fights you on.

⚖️ Fairness during sellouts

When the show is full and someone abandons their hold, the freed seat shouldn't go to whoever happens to refresh first — it should go to whoever has been waiting longest. Without explicit fairness, the experience devolves into a refresh-button arms race that rewards bots and punishes patient users.

🚫 Limit seats per booking

Cap orders at 10 seats. Higher caps invite scalpers who buy 200 seats and resell them at 5× face value. The cap, combined with rate-limiting by user/payment-card, makes large-scale scalping operationally painful.

📈 Surge during popular releases

Average traffic might be 1K req/s; release-day traffic for a Marvel/Avengers premiere can be 50K req/s for the first 60 seconds. The system must scale horizontally on demand, and the booking path must not melt under burst load.

Step 4

Capacity Estimation

Numbers drive every architectural choice, so say them out loud, even if rough. The system is read-heavy on browse (millions of people checking showtimes) but write-coordinated on book (a smaller number of high-stakes transactions).

Traffic estimates

Assume 3 billion page views per month across browse paths (city → movie → cinema → show → seat-map). Of those, roughly 10 million tickets sold per month — about a 300:1 browse-to-book ratio.

  • Browse: ~1.2K req/s average (3B / (30 × 86,400))
  • Bookings: ~4 req/s average (10M / (30 × 86,400))
  • Peak browse: ~50K req/s (about a 40× spike on release)
  • Peak bookings: ~200 req/s (seat-contention burst)

Storage estimate

Per day across the catalog: 500 cities × 10 cinemas/city × 2,000 seats/cinema × 2 shows/day × 100 bytes/seat-row ≈ 2 GB/day. Over 5 years (2 GB × 365 × 5), with bookings, payments, and audit trails folded into the rounding: ~3.6 TB total. Sized so the cluster runs at about 70% utilization: 3.6 / 0.7 ≈ 5 TB provisioned.

| Metric | Value | Why it matters |
| --- | --- | --- |
| Browse req/s (peak) | 50K/s | Drives cache size and read-replica fan-out |
| Booking req/s (peak) | 200/s | Drives DB write tier and isolation strategy |
| 5-yr storage | 3.6 TB | Fits a single MySQL cluster with read replicas |
| Tickets sold / month | 10M | Drives notification, payment, audit volume |
| Seats per booking | ≤ 10 | Anti-scalping cap; cap shapes lock granularity |
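The arithmetic behind these estimates is easy to re-run. The following is a minimal sketch that uses only the figures quoted in this step; the variable names are illustrative, and reading the provisioning figure as "run the disks about 70% full" is an assumption.

```python
# Back-of-envelope check of the traffic and storage figures quoted above.
SECONDS_PER_MONTH = 30 * 86_400  # 30-day month, as in the estimates

page_views_per_month = 3_000_000_000
tickets_per_month = 10_000_000

browse_avg_rps = page_views_per_month / SECONDS_PER_MONTH   # about 1.2K req/s
booking_avg_rps = tickets_per_month / SECONDS_PER_MONTH     # about 4 req/s
browse_to_book = page_views_per_month / tickets_per_month   # 300:1

# 500 cities x 10 cinemas x 2,000 seats x 2 shows/day x 100 bytes per seat-row
bytes_per_day = 500 * 10 * 2_000 * 2 * 100
five_year_tb = bytes_per_day * 365 * 5 / 1e12

# Assumption: provisioning targets roughly 70% disk utilization.
provisioned_tb = five_year_tb / 0.70

print(f"browse avg    {browse_avg_rps:,.0f} req/s")
print(f"booking avg   {booking_avg_rps:.1f} req/s")
print(f"browse:book   {browse_to_book:.0f}:1")
print(f"5-year store  {five_year_tb:.2f} TB, provisioned {provisioned_tb:.1f} TB")
```

Changing any single input (say, shows per day) shows quickly which assumptions the 5 TB figure is most sensitive to.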
Step 5

System APIs

Two endpoints carry the interesting load: search (find what you want to watch) and reserve (claim seats while paying). A third endpoint completes the booking after payment success. Defining the contract early locks down the architecture before the first box is drawn.

REST API surface
// Search — read path, high QPS
GET /api/v1/search
{
  "api_key":     "abc123...",
  "keyword":     "Marvel",          // optional movie name fragment
  "city":        "Bangalore",       // filter by city
  "lat_long":    "12.97,77.59",     // optional, for nearby cinemas
  "radius_km":   10,                // search radius from lat_long
  "datetime":    "2026-05-08T18:00",
  "postal_code": "560001",          // alternative to lat_long
  "sort":        "showtime"         // showtime | rating | distance
}
→ 200 OK  { "results": [{ movie, cinema, show_id, showtime, available_seats }, ...] }

// Reserve — write path, the contention hot spot
POST /api/v1/reserve
{
  "api_key":    "abc123...",
  "session_id": "sess-9d4f...",     // sticky session for checkout
  "movie_id":   "mov-1234",
  "show_id":    "show-7891",
  "seats":      ["J-12", "J-13"]    // 1..10 seats
}
→ 200 OK   { "reservation_id": "res-...", "expires_at": "...+5min", "amount": 480 }
→ 409 Conflict  { "error": "seats_taken", "alternatives": ["J-14","J-15"] }
→ 429 Queued    { "error": "show_full", "queue_position": 47, "etr_seconds": 180 }

// Confirm — called after payment succeeds
POST /api/v1/confirm
{ "reservation_id": "res-...", "payment_token": "tok_..." }
→ 201 Created  { "booking_id": "bk-...", "tickets": [{seat, qr_code}, ...] }
Why three calls instead of one big "buy"? Because reserving a seat (claiming it for 5 minutes) is a fundamentally different operation from paying for it (charging a card). Reserving is fast, frequent, and reversible. Paying is slow (Stripe takes 2-3 seconds), unreliable (cards decline, networks drop), and irreversible. Splitting them lets the contended write — the seat lock — finish in milliseconds while the slow payment runs separately, with the lock auto-expiring if payment fails.
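Because the reserve call is the contended one, it pays to reject malformed requests before they reach the lock layer. The sketch below is one possible server-side check of the reserve payload; `ReserveRequest` and `validate_reserve` are hypothetical names, and the only rules encoded are the ones stated in this document (required session and show, 1 to 10 seats) plus a no-duplicates check that is an assumption.

```python
from dataclasses import dataclass

MAX_SEATS_PER_BOOKING = 10  # the anti-scalping cap from the design considerations


@dataclass(frozen=True)
class ReserveRequest:
    session_id: str
    show_id: str
    seats: tuple[str, ...]


def validate_reserve(req: ReserveRequest) -> list[str]:
    """Return a list of problems; an empty list means the request may proceed."""
    errors = []
    if not req.session_id:
        errors.append("session_id is required")
    if not req.show_id:
        errors.append("show_id is required")
    if not 1 <= len(req.seats) <= MAX_SEATS_PER_BOOKING:
        errors.append(
            f"seats must contain between 1 and {MAX_SEATS_PER_BOOKING} entries"
        )
    # Assumption: asking for the same seat twice is treated as a client bug.
    if len(set(req.seats)) != len(req.seats):
        errors.append("seats must not repeat")
    return errors


print(validate_reserve(ReserveRequest("sess-9d4f", "show-7891", ("J-12", "J-13"))))
```

A request that fails validation can be answered immediately, without spending a Redis round trip on it.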
Step 6

Database Schema

The data model is wide — about 10 entities — but every interesting interaction touches the same handful: Show, Show_Seat, Booking, Payment. Three observations: (1) we have strong relational ties (a Show belongs to a Cinema_Hall, a Show_Seat belongs to a Show and a Cinema_Seat), (2) we need multi-row atomic transactions for multi-seat orders, and (3) the UNIQUE constraint on a Show_Seat row is what physically prevents double-booking. All three points push us toward a relational store like MySQL/PostgreSQL.

erDiagram
  CITY {
    string city_id PK
    string name
    string state
    string zipcode
  }
  CINEMA {
    string cinema_id PK
    string name
    int total_halls
    string city_id FK
  }
  CINEMA_HALL {
    string hall_id PK
    string name
    int total_seats
    string cinema_id FK
  }
  CINEMA_SEAT {
    string seat_id PK
    int seat_number
    string seat_type
    string hall_id FK
  }
  MOVIE {
    string movie_id PK
    string title
    string language
    string genre
    int duration_minutes
  }
  SHOW {
    string show_id PK
    timestamp start_time
    string movie_id FK
    string hall_id FK
  }
  SHOW_SEAT {
    string show_seat_id PK
    string show_id FK
    string cinema_seat_id FK
    string status
    decimal price
    string booking_id FK
  }
  BOOKING {
    string booking_id PK
    string user_id FK
    string show_id FK
    int num_seats
    string status
    timestamp created_at
    timestamp expires_at
  }
  PAYMENT {
    string payment_id PK
    string booking_id FK
    decimal amount
    string status
    string provider_ref
  }
  USER {
    string user_id PK
    string name
    string email
  }
  CITY ||--o{ CINEMA : "has"
  CINEMA ||--o{ CINEMA_HALL : "contains"
  CINEMA_HALL ||--o{ CINEMA_SEAT : "has"
  CINEMA_HALL ||--o{ SHOW : "hosts"
  MOVIE ||--o{ SHOW : "scheduled as"
  SHOW ||--o{ SHOW_SEAT : "has"
  CINEMA_SEAT ||--o{ SHOW_SEAT : "instantiated as"
  USER ||--o{ BOOKING : "places"
  SHOW ||--o{ BOOKING : "for"
  BOOKING ||--o| PAYMENT : "settled by"
  BOOKING ||--o{ SHOW_SEAT : "claims"

The SHOW_SEAT row is the heart of the ledger. Its status column is a small state machine (AVAILABLE, HELD, or BOOKED), and a UNIQUE constraint that allows at most one BOOKED row per (show_id, cinema_seat_id) means the database itself rejects any attempt to commit two bookings for the same seat. This is the final safety net: the actual contention lock lives in Redis (much faster, see §7), and MySQL's job is to durably record truth and refuse any double-INSERT that slips through if Redis ever loses a lock on failover.

Why relational for the ledger: we need (a) durable records of every payment for audit and refunds, (b) a UNIQUE constraint enforced at the storage layer that serves as the final defense against double-bookings, and (c) ACID transactions for "INSERT booking + INSERT show_seat + INSERT payment" as one atomic commit. NoSQL stores either give you per-row atomicity (not enough for the three-table commit) or eventual consistency (catastrophic when "did the user pay?" must be definitive). Our 3.6TB also fits comfortably in a sharded MySQL cluster.
Step 7 · CORE

High-Level Architecture — From Naive to Production

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart at scale, and the production shape where every box justifies itself.

Pass 1 — The naive design (and why it breaks)

Sketch the simplest possible system: a few stateless app servers behind a load balancer talking to one MySQL. To book a seat, the app does INSERT INTO booking .... Done.

flowchart LR
  C["Client"] --> LB["Load Balancer"]
  LB --> APP["App Server"]
  APP --> DB[("MySQL")]

Three failures emerge the moment a real movie release hits this design:

💥 Two clicks, same seat — both succeed

Sarah and Raj both tap seat J-12 at 09:00:00.001. App server A handles Sarah, app server B handles Raj. Both run INSERT INTO booking at the same instant. Without row-level locking and a UNIQUE constraint, both inserts succeed — and now two physical humans hold a ticket for the same chair. The brand is dead.

💥 30-minute holds block everyone

A user clicks "select seats" then walks away to make tea. If the seat sits in HELD state for 30 minutes with no automatic expiry, that's 30 minutes of every other fan seeing it as taken. With 200 seats and 50K interested fans, the show looks "sold out" within seconds even though most "holds" never become bookings.

💥 50K req/s on release morning melts MySQL

A single MySQL handles ~5K simple writes/sec. A flash-sale spike of 50K req/s on the booking path pegs the DB CPU at 100%, p99 latency goes from 5ms to 5 seconds, and connection pools start dropping requests. The whole site stops responding — not just the popular movie.

Pass 2 — The mental model: split the lock from the ledger

Here's the central insight that reshapes the whole design: a seat reservation is two things in two different stores — (a) a fast atomic claim in an in-memory key-value store (Redis) that auto-expires after 5 minutes, and (b) a durable record in a relational ledger (MySQL) whose UNIQUE constraint is the final guarantee that no two humans ever hold the same chair. The lock and the ledger have different jobs; jamming them into one store forces the wrong trade-off on both.

Think of it like a museum cloakroom. When you check your coat, the attendant clips a numbered tag to it and writes your name in the register. The tag is the fast claim — anyone who walks up to grab that coat sees the tag and is instantly turned away. The register is the durable record — it survives the attendant's shift change, the building closing, even the tag falling off. Booking systems need both, and trying to make one piece play both roles is what causes either double-bookings (no register) or unbearable contention on the register (no tag).

📖 Browse Plane

~50K req/s peak. Read-only fan-out — list cities, movies, cinemas, shows, seat-maps. Tolerates stale data (a 2-second-old seat-map is fine). Cacheable to the moon. No locks, no transactions.

⚡ Lock Layer (Redis)

Absorbs 50K-req/s contention. One key per seat, claimed with SET seat:<show_id>:<seat> <reservation_id> NX PX 300000. Redis runs commands one at a time, so the atomic SET NX picks one winner; everyone else gets nil in ~1ms. The TTL is the 5-minute timer: Redis itself expires abandoned holds, so no separate timer service is needed.

📒 Ledger (MySQL)

~200 writes/s peak. The durable record of every reservation and booking. A UNIQUE constraint on (show_id, cinema_seat_id) for BOOKED rows is the last line of defense: even if Redis loses a lock during failover, the second commit hits the constraint and errors out cleanly. MySQL never sees lock contention; it only sees writes from lock winners (the HELD audit row and the post-payment BOOKED commit).

An async plane sits behind these three — WaitingUsersService listening on Redis keyspace notifications for "seat key expired" events to wake the longest-waiting fan, Payment Service isolating Stripe's flaky latency from the lock path, and Notification Service fanning out emails and QR codes via Kafka. The split keeps the lock layer narrow, the ledger uncontended, and the slow external dependencies asynchronous.

The two-layer lock analogy: Redis is the cloakroom tag — instant, atomic, auto-falls-off after 5 minutes. MySQL is the register — slower, durable, with a UNIQUE constraint that quietly catches the edge cases where Redis's tag system fails (a TTL race during slow payment, a Sentinel failover that loses recent writes). You always check the tag first (cheap, fast); you only write to the register when payment succeeds (correctness-critical). That layering is what lets the system absorb 50K req/s without ever putting 50K row-locks on MySQL.

Pass 3 — The production shape

Now the full picture with Redis SETNX as the contention absorber and MySQL as the ledger of truth. Every node is numbered — find its matching card below to see what it does and crucially what would break without it.

flowchart TB
  CL["① Client: Web · Mobile App"]
  subgraph EDGE["Edge Tier"]
    LB["② Load Balancer: sticky sessions"]
  end
  subgraph BROWSE["Browse Plane"]
    SS["③ Search Service"]
    CDB[("④ Show Catalog DB: Cassandra")]
    CACHE["⑪ Cache Redis"]
  end
  subgraph BOOK["Booking Plane"]
    BS["⑤ Booking Service"]
    RL[("⑥ Redis Lock Cluster: SETNX + TTL")]
    BDB[("⑦ Bookings DB: MySQL Ledger")]
  end
  subgraph ASYNC["Async Plane"]
    WU["⑧ WaitingUsersService"]
    PS["⑨ Payment Service"]
    NS["⑩ Notification Service"]
    SMR["⑫ Seatmap Refresher"]
  end
  CL --> LB
  LB -->|"GET browse"| SS
  LB -->|"POST reserve / confirm"| BS
  SS --> CACHE
  CACHE -.miss.-> CDB
  BS -->|"SETNX seat:show:S"| RL
  BS -->|"INSERT booking"| BDB
  BS --> PS
  RL -.expired-event.-> WU
  WU -.notify next.-> NS
  PS -.success.-> BS
  BS -.confirm.-> NS
  BDB -.binlog tail.-> SMR
  RL -.keyspace tail.-> SMR
  SMR -.refresh bitmap every 1-2s.-> CACHE
  style CL fill:#e8743b,stroke:#e8743b,color:#fff
  style LB fill:#171d27,stroke:#9b72cf,color:#d4dae5
  style SS fill:#171d27,stroke:#4a90d9,color:#d4dae5
  style CDB fill:#171d27,stroke:#4a90d9,color:#d4dae5
  style CACHE fill:#171d27,stroke:#3cbfbf,color:#d4dae5
  style BS fill:#171d27,stroke:#e8743b,color:#d4dae5
  style RL fill:#171d27,stroke:#e05252,color:#d4dae5
  style BDB fill:#171d27,stroke:#38b265,color:#d4dae5
  style WU fill:#171d27,stroke:#9b72cf,color:#d4dae5
  style PS fill:#171d27,stroke:#d4a838,color:#d4dae5
  style NS fill:#171d27,stroke:#d4a838,color:#d4dae5
  style SMR fill:#171d27,stroke:#3cbfbf,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.

① Client (Web / Mobile)

The browser tab or mobile app a user is staring at. It walks the user through the funnel: pick a city, pick a movie, pick a showtime, watch the seat-map render, tap seats, hit "pay", enter card details, see the QR code. From the client's view the entire system is one HTTPS endpoint — but every interesting concern (latency, fairness, idempotency) is shaped by what the client does next.

Solves: nothing on its own — but every architectural decision flows backward from "what does the user see and how does it feel?" Sticky sessions, idempotency keys, and reservation timers are all about not breaking the client's mental model of "I clicked a seat, it's mine."

② Load Balancer

The front door. Sits in front of all stateless services, distributes incoming HTTPS, terminates TLS, and yanks unhealthy backends out via 5-second health checks. Crucially we use sticky sessions during checkout — once a user has a reservation in flight, the LB pins them to the same Booking Service node so the in-memory session state (selected seats, payment intent) doesn't have to be re-fetched on every click.

Solves: single-point-of-failure on app servers, plus the "lost cart" problem if requests bounce between nodes mid-checkout. Without the LB, a single pod crash takes down the site. Without sticky sessions, a user who refreshes mid-checkout might land on a fresh node that has no idea they're holding 4 seats.

③ Search Service

Stateless service that answers all the browse queries: "what cities?", "what's playing in Bangalore tonight?", "what are the showtimes for Marvel at PVR Forum?". It hits the cache first, falls through to the catalog DB on miss. Read-only, so it scales horizontally — add pods until the LB stops complaining.

Solves: isolating the read-heavy browse workload from the lock path. Without a dedicated search service, the same pods running SETNX claims would also be serving 50K browse req/s, and the locks would starve under read pressure.
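The "cache first, fall through on miss" read path can be sketched in a few lines. This is an illustration only: a plain dict stands in for Cache Redis, `load_from_db` is a hypothetical loader standing in for the catalog DB query, and TTLs and invalidation are left out.

```python
from typing import Callable, Optional


class ReadThroughCache:
    """Cache-aside lookup: serve from memory, load from the DB only on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[dict]]):
        self._store: dict[str, dict] = {}   # stand-in for Cache Redis
        self._load = load_from_db           # stand-in for the catalog DB query
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[dict]:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._load(key)
        if value is not None:
            # Only found rows are cached; unknown keys hit the DB every time.
            self._store[key] = value
        return value


catalog = {"show-7891": {"movie": "Marvel", "cinema": "PVR Forum"}}
cache = ReadThroughCache(catalog.get)
cache.get("show-7891")   # miss, loads from the catalog
cache.get("show-7891")   # hit, the catalog is not touched
print(cache.hits, cache.misses)
```

Not caching misses keeps the sketch short; a production cache would usually store short-lived negative entries too, so repeated lookups of unknown keys cannot hammer the database.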

④ Show Catalog DB (Cassandra)

The denormalized read-store for the browse plane. Holds movies, cinemas, halls, showtimes, and a snapshot of seat-availability counts per show (not the live lock state). Cassandra is right here because the data is read-heavy, eventually-consistent, and replicates naturally across regions — a fan in Mumbai shouldn't pay a 200ms round-trip to a primary in Delhi just to see what's playing tonight.

Solves: the 50K-req/s browse spike. Without a denormalized read store, every "list movies in city" would join across movie, show, cinema, cinema_hall on the bookings DB — exactly the joins the bookings DB doesn't need to be doing.

⑤ Booking Service

Stateless service that handles POST /reserve and POST /confirm. The reserve flow per request: validate input → run a Lua script on the Redis Lock Cluster ⑥ that atomically SETNXs all requested seat keys with a 5-minute TTL → on success, write a HELD reservation row to MySQL Ledger ⑦ for audit → return reservation ID. Total budget: under 20ms (vs ~100ms when the lock was a SERIALIZABLE row-lock on MySQL).

Solves: orchestrating the lock-then-ledger sequence. Without a dedicated booking service, you'd have lock-acquisition logic, Lua scripts, idempotency keys, and rollback paths sprinkled across whatever node happens to handle the request.

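⑥ Redis Lock Cluster (SETNX + TTL)

The contention absorber. One Redis key per seat: seat:show-7891:S (S is the seat label, e.g. J-12) with value = reservation_id, claimed via SET ... NX PX 300000. The claim is atomic: when 50,000 fans tap the same seat in the same millisecond, Redis executes commands one at a time per shard, so exactly one SET NX succeeds and all others get nil back in ~1ms. The TTL is the 5-minute reservation timer: Redis itself expires abandoned holds, so no separate timer service is needed. On expiry, Redis publishes an event on the __keyevent@<db>__:expired channel, which WaitingUsersService ⑧ subscribes to (keyspace notifications are off by default and must be enabled via notify-keyspace-events). Sharded by show_id via Redis Cluster so one hot show pegs one shard, not the whole cluster. Redis Cluster hashes the whole key unless it contains a hash tag, so in practice the show id goes in braces (seat:{show-7891}:J-12) to keep a show's seats in one slot, which the multi-key Lua claim also requires.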

Solves: absorbing 50K-req/s contention without touching MySQL. Without this, every double-tap on the popular seat would land as a SERIALIZABLE row-lock on the bookings DB — and MySQL's ~1K writes/s ceiling at SERIALIZABLE would crater under the load.
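From the Booking Service's side, the claim is a single client call. The sketch below assumes a redis-py style client, whose `set(name, value, nx=True, px=...)` returns a truthy value on success and `None` when the key already exists. `FakeRedis` is a tiny in-memory stand-in so the example runs without a server (it ignores the TTL), and `seat_key` and `claim_seat` are hypothetical helper names. The braces in the key are a Redis Cluster hash tag, an addition to the key format shown above, so every key for one show hashes to the same slot.

```python
def seat_key(show_id: str, seat: str) -> str:
    # {show_id} is a hash tag: Redis Cluster hashes only the text inside the braces.
    return f"seat:{{{show_id}}}:{seat}"


def claim_seat(client, show_id: str, seat: str, reservation_id: str,
               ttl_ms: int = 300_000) -> bool:
    """Try to take the seat lock; True means this reservation now holds it."""
    # Equivalent to: SET <key> <reservation_id> NX PX <ttl_ms>
    return bool(client.set(seat_key(show_id, seat), reservation_id,
                           nx=True, px=ttl_ms))


class FakeRedis:
    """In-memory stand-in for the client, for illustration only (no expiry)."""

    def __init__(self):
        self.data = {}

    def set(self, name, value, nx=False, px=None):
        if nx and name in self.data:
            return None          # key exists: the NX condition fails
        self.data[name] = value
        return True


r = FakeRedis()
print(claim_seat(r, "show-7891", "J-12", "res-sarah"))  # Sarah's claim
print(claim_seat(r, "show-7891", "J-12", "res-raj"))    # Raj's claim, same seat
```

With a real client the function body stays the same; only the object passed in as `client` changes.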

⑦ Bookings DB (MySQL Ledger)

The durable record of every reservation, booking, and payment — and the last line of defense against double-booking. After Redis grants a lock, BS writes a HELD reservation row for audit. After payment succeeds, BS commits a BOOKED row guarded by a UNIQUE constraint on (show_id, cinema_seat_id) for status='BOOKED'. If Redis ever loses a lock (Sentinel failover, TTL race during slow payment), two BS pods might both think they own the seat — but only one INSERT can survive the UNIQUE constraint. The other gets a constraint violation, BS refunds the just-charged card, and apologises. Ugly but correct.

Solves: durability + final correctness guard. Redis is fast but loses state on async failover; MySQL is slower but transactionally durable. The constraint is what makes the whole architecture safe to operate.

⑧ WaitingUsersService

Per-show FIFO queue of users waiting for seats to free up after a sellout. Stored as a Redis list (queue:show-X) so it's sharded with the lock cluster; fans join at the tail with RPUSH and are served from the head with LPOP. Triggered by Redis keyspace notifications: WU runs PSUBSCRIBE __keyevent@*__:expired and wakes up whenever a seat:* key expires. On wake: LPOP queue:show-X to get the next waiter, call Notification Service ⑩ to push them a deep link, and start their 5-minute claim window. Keyspace notifications are at-most-once (a disconnected subscriber simply misses them), so a fallback poller scans every 10s for shows with free seats + a non-empty queue to catch missed events.

Solves: fairness during sellouts. Without an explicit queue, the next user to SETNX after a TTL fires wins — which favors bots polling at 100 Hz, not the patient fan who's been waiting 20 minutes.
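The wake-up logic is small enough to sketch in memory. In the example below a `deque` per show stands in for the Redis list, `notify` is a hypothetical callback standing in for the Notification Service, and the key parser assumes the seat:<show>:<seat> format used in this document (with or without hash-tag braces around the show id).

```python
from collections import defaultdict, deque
from typing import Callable, Optional


class WaitingQueues:
    """Per-show FIFO queues; a stand-in for the queue:show-X Redis lists."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def join(self, show_id: str, user_id: str) -> None:
        self._queues[show_id].append(user_id)       # like RPUSH

    def next_waiter(self, show_id: str) -> Optional[str]:
        queue = self._queues[show_id]
        return queue.popleft() if queue else None   # like LPOP


def show_id_from_seat_key(key: str) -> Optional[str]:
    parts = key.split(":")
    if len(parts) != 3 or parts[0] != "seat":
        return None              # not a seat lock key, ignore the event
    return parts[1].strip("{}")


def on_seat_expired(queues: WaitingQueues, expired_key: str,
                    notify: Callable[[str, str], None]) -> Optional[str]:
    """Handle one expired-key event: wake the longest-waiting fan, if any."""
    show_id = show_id_from_seat_key(expired_key)
    if show_id is None:
        return None
    user = queues.next_waiter(show_id)
    if user is not None:
        notify(user, expired_key)
    return user


queues = WaitingQueues()
queues.join("show-7891", "priya")
queues.join("show-7891", "amit")
woken = on_seat_expired(queues, "seat:show-7891:J-12",
                        lambda user, key: print(f"notify {user}: {key} is free"))
print(woken)
```

The fallback poller described above would call the same `next_waiter` path on a timer, so missed events and delivered events end up in one code path.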

⑨ Payment Service

Wraps Stripe / Razorpay / native UPI integrations. Booking Service calls it with the reservation ID, the amount, and a payment token from the client. Returns success/failure asynchronously via webhook. Holds an idempotency key so retries don't double-charge a card. On success it calls back into Booking Service, which writes the BOOKED row to MySQL (UNIQUE-guarded) and releases the Redis lock via a Lua "delete-if-owner" script.

Solves: isolating the slow, flaky external dependency from the fast lock-acquisition path. Without a separate service, Stripe's 3-second p99 latency would balloon every reservation request to 3 seconds — and Stripe's occasional outages would directly take down booking.
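The idempotency key mentioned above is what makes retries safe. A minimal single-process sketch follows: `charge_card` is a hypothetical gateway call, and the result store is an in-memory dict, whereas a real service would need a durable, shared store and would have to handle two concurrent requests carrying the same key.

```python
from typing import Callable


class PaymentService:
    """Replays the stored result when the same idempotency key is seen again."""

    def __init__(self, charge_card: Callable[[int], dict]):
        self._charge_card = charge_card      # hypothetical provider call
        self._results: dict[str, dict] = {}  # idempotency_key -> first result

    def charge(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._results:
            # A retry: return what happened the first time, charge nothing.
            return self._results[idempotency_key]
        result = self._charge_card(amount)
        self._results[idempotency_key] = result
        return result


gateway_calls = []

def fake_gateway(amount: int) -> dict:
    gateway_calls.append(amount)
    return {"status": "succeeded", "amount": amount}

payments = PaymentService(fake_gateway)
payments.charge("res-9d4f", 480)
payments.charge("res-9d4f", 480)   # client retry after a timeout
print(len(gateway_calls))          # the card was charged once
```

Using the reservation ID as the key, as here, is one natural choice because each reservation should be paid for at most once.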

⑩ Notification Service

Async fan-out to email, SMS, and push. Sends the "you reserved 4 seats — pay in the next 5 minutes" warning, the "your seats are available" wake-up to waiting users, and the final "here's your QR code" confirmation. Subscribes to events from Booking Service via Kafka — never inline on the request path, so a flaky SMS provider can't slow checkout.

Solves: communication without coupling. Without an async notification service, every booking would block on "send SMS" — and SMS gateways have multi-second tail latencies.

⑪ Cache Redis

A separate Redis cluster from the lock cluster (different SLA, different failure tolerance) holding (a) show metadata that almost never changes (movie title, cinema name, showtime), and (b) seat-map bitmaps: a compact seatmap:show-X key per show with two bits per seat to encode AVAILABLE/HELD/BOOKED (one bit is enough only if the map just shows free vs. taken), kept fresh by Seatmap Refresher ⑫. Search service hits this first, falls through to Cassandra on miss.

Solves: the 50K browse req/s spike. Without a cache, the seat-map render alone (every fan opens 5 seat-maps before settling on one) would push the catalog DB past its read ceiling. With it, 95%+ of seat-map renders never reach the DB. Keeping it separate from the lock Redis means a stale browse cache can never accidentally affect lock correctness.

⑫ Seatmap Refresher

A background worker that keeps the seatmap:show-X bitmaps in Cache Redis ⑪ fresh, without forcing the browse path to touch the real lock-and-ledger stores. It runs two tails in parallel: (a) a MySQL binlog tail on the Bookings DB ⑦ watching every show_seat row transition (HELD→BOOKED, anything→AVAILABLE), and (b) a Redis keyspace tail on the Lock Cluster ⑥ subscribing to __keyevent@*__:set and __keyevent@*__:expired events on seat:* keys. On every event it flips the right bit in the right bitmap and writes it back to Cache Redis. Refresh cadence: ~1–2 seconds end-to-end.

Solves: the "live seat-map for browsers without locking the lock store" problem. Without it, the seat-map render would have to either (a) hit the Lock Redis directly on every browse — flooding the correctness-critical store with 50K req/s of reads, or (b) hit MySQL — defeating the whole point of the cache. The refresher decouples the read shape from the write shape so each can scale on its own. Slight staleness is the price (up to ~2s) but the SETNX in the booking path catches any resulting "looked free, actually taken" race.
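To make the bitmap idea concrete, here is a minimal sketch that packs three seat states into two bits per seat and applies lock events to it. The numeric encoding, the class and function names, and the rule that a BOOKED seat is never overwritten by a lock event are all assumptions for illustration; in the design above the BOOKED state would arrive from the binlog tail.

```python
AVAILABLE, HELD, BOOKED = 0, 1, 2   # assumed 2-bit encoding


class SeatBitmap:
    """Two bits per seat, packed four seats to a byte."""

    def __init__(self, num_seats: int):
        self.num_seats = num_seats
        self._bits = bytearray((num_seats * 2 + 7) // 8)

    def _locate(self, index: int) -> tuple[int, int]:
        if not 0 <= index < self.num_seats:
            raise IndexError(index)
        return divmod(index * 2, 8)          # (byte offset, bit shift)

    def get(self, index: int) -> int:
        byte, shift = self._locate(index)
        return (self._bits[byte] >> shift) & 0b11

    def set(self, index: int, state: int) -> None:
        if state not in (AVAILABLE, HELD, BOOKED):
            raise ValueError(state)
        byte, shift = self._locate(index)
        cleared = self._bits[byte] & ~(0b11 << shift) & 0xFF
        self._bits[byte] = cleared | (state << shift)

    def size_bytes(self) -> int:
        return len(self._bits)


def apply_lock_event(bitmap: SeatBitmap, event: str, seat_index: int) -> None:
    """Apply one lock-store event; BOOKED seats are left untouched."""
    current = bitmap.get(seat_index)
    if event == "set" and current == AVAILABLE:
        bitmap.set(seat_index, HELD)
    elif event in ("expired", "del") and current == HELD:
        bitmap.set(seat_index, AVAILABLE)


seatmap = SeatBitmap(200)
apply_lock_event(seatmap, "set", 11)
print(seatmap.size_bytes(), seatmap.get(11))
```

The guard on `current` matters because the lock key is also deleted after a successful booking; without it, that delete would flip a sold seat back to available on the map.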

Concrete walkthrough — Sarah and Raj race for J-12

Two real flows, mapped to the numbered components above. The first shows concurrency resolution via Redis SETNX; the second shows TTL expiry and fairness via keyspace notifications.

⚔️ Concurrent click — Sarah vs. Raj at 09:00:00.001

  1. Both Clients ① POST /reserve with seat J-12. Both hit the Load Balancer ②, which pins each user to a Booking Service ⑤ pod (sticky session).
  2. Both pods invoke the same Lua script on Redis Lock Cluster ⑥: SET seat:show-7891:J-12 <res-id> NX PX 300000.
  3. Redis is single-threaded per shard — it serializes the two SETs and grants the lock to whichever physically arrived first, say Sarah's by 0.3ms. Sarah's SET returns OK; Raj's returns nil. End-to-end Redis latency: ~1ms.
  4. Sarah's pod writes a HELD reservation row to MySQL Ledger ⑦ for audit (no lock contention — MySQL never sees the 49,999 losers).
  5. Raj's pod sees the nil response and returns 409 Conflict — alternatives: J-11, J-13, J-14.
  6. Sarah's UI shows "5:00 to complete payment". Raj's UI shows "J-12 just got taken — try one of these".

⏰ Sarah doesn't pay — fairness kicks in at 09:05:01

  1. Sarah got distracted. The Redis key seat:show-7891:J-12 hits its 5-minute TTL and Redis auto-deletes it. No service code runs the expiry — Redis handles it internally.
  2. Redis Lock Cluster ⑥ emits a keyspace notification: __keyevent@0__:expired seat:show-7891:J-12.
  3. WaitingUsersService ⑧, subscribed via PSUBSCRIBE, receives the event and runs LPOP queue:show-7891 — that's Priya, who's been waiting since 09:01:30.
  4. WU calls Notification Service ⑩ to push Priya: "your seat is available — complete checkout in 5 min", with a deep link.
  5. Priya taps the link; her client POSTs /reserve to Booking Service ⑤; BS SETNXs the seat in Redis. Fresh 5-minute TTL starts for her.
  6. If Priya also abandons, the cycle repeats — Redis TTL fires, keyspace event, WU pops the next fan. First-come-first-served, enforced.
So what: the architecture is built around three insights — (1) browse and book are different shapes so they get different planes (read replicas vs. write-coordinated path); (2) the lock and the ledger belong in different stores — Redis SETNX absorbs 50K-req/s of contention in 1ms each, MySQL's UNIQUE constraint is the final guarantee that no two humans hold the same chair; (3) fairness during sellouts is a queue problem driven by TTL events, not a refresh-button race. Every box in the diagram earns its place by killing a specific failure mode from Pass 1.
Step 8

Concurrency & Isolation — The Two-Layer Lock

Interviewers probe this section hardest. With Redis as the contention absorber, the question splits into two: (1) what stops two users from claiming the same seat in the same millisecond? Answer: Redis SETNX, single-threaded per shard, atomic. (2) What if Redis ever loses a lock — a Sentinel failover during a slow payment, a TTL race? Answer: a UNIQUE constraint on the MySQL ledger that catches the resulting double-INSERT at commit time. Two layers. Both indispensable.

Layer 1 — Redis SETNX (the fast claim)

A single Redis command does the entire reservation in 1ms:

// Atomic claim: succeeds only if the key doesn't already exist
SET seat:show-7891:J-12  "res-9d4f"  NX  PX  300000
"OK"        // you got it, 5-min TTL started
(nil)       // someone else owns it, return 409

For multi-seat orders (Sarah wants 4 seats together), one SETNX per seat is wrong — if seats 1–3 succeed and seat 4 fails, you've leaked three locks. Wrap it in a Lua script so the whole claim is all-or-nothing:

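-- claim_seats.lua: atomically claim all keys or release what we got
-- KEYS = one seat key per requested seat; ARGV[1] = reservation_id; ARGV[2] = TTL in ms
for i = 1, #KEYS do
  -- SET ... NX replies nil when the key already exists, which Lua sees as false
  if redis.call('SET', KEYS[i], ARGV[1], 'NX', 'PX', ARGV[2]) == false then
    -- one failed → release the ones we got, return failure
    for j = 1, i-1 do redis.call('DEL', KEYS[j]) end
    return 0
  end
end
return 1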

Lua scripts run single-threaded in Redis, so the whole multi-seat claim is one atomic step from any concurrent SETNX's point of view. To release a lock (on payment success or user cancel) you must check ownership first — otherwise you might delete someone else's freshly-claimed lock if the TTL fired between your acquire and your release:

-- release_if_owner.lua — only DEL if the value is still our reservation_id
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
else
  return 0  -- someone else owns it now, leave it alone
end
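For unit tests it can help to have a plain-Python model of what the two scripts guarantee. The sketch below is such a model, not a Redis client: `SeatLocks` is a hypothetical class, the clock is injectable so expiry can be tested without waiting, and `claim_all` checks every key before writing rather than writing and rolling back, which gives the same all-or-nothing outcome. It is single-threaded, mirroring the way Redis runs one script at a time.

```python
import time
from typing import Callable, Optional


class SeatLocks:
    """In-memory model of the claim and release-if-owner semantics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._held: dict[str, tuple[str, float]] = {}  # key -> (owner, expires_at)

    def _owner(self, key: str) -> Optional[str]:
        entry = self._held.get(key)
        if entry is None:
            return None
        owner, expires_at = entry
        if self._clock() >= expires_at:   # TTL passed: behave as if deleted
            del self._held[key]
            return None
        return owner

    def claim_all(self, keys: list[str], owner: str, ttl_s: float = 300.0) -> bool:
        """Claim every key or none of them."""
        if any(self._owner(k) is not None for k in keys):
            return False
        expires_at = self._clock() + ttl_s
        for k in keys:
            self._held[k] = (owner, expires_at)
        return True

    def release_if_owner(self, key: str, owner: str) -> bool:
        """Delete the lock only if this owner still holds it."""
        if self._owner(key) != owner:
            return False
        del self._held[key]
        return True


now = [0.0]
locks = SeatLocks(clock=lambda: now[0])
print(locks.claim_all(["J-12", "J-13"], "res-sarah"))   # both seats claimed
print(locks.claim_all(["J-13", "J-14"], "res-raj"))     # J-13 is taken, nothing claimed
now[0] = 301.0                                          # Sarah's hold has expired
print(locks.release_if_owner("J-12", "res-sarah"))      # no longer hers to release
```

The last call is the case the ownership check exists for: after expiry the seat may already belong to someone else, so a late release must be a no-op.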

Layer 2 — MySQL UNIQUE constraint (the safety net)

Redis is fast but loses lock state on async failover, and the TTL can fire while payment is mid-flight. So the final commit goes through MySQL with a UNIQUE constraint that physically blocks duplicate bookings:

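CREATE TABLE show_seat (
  show_id        VARCHAR(32) NOT NULL,
  cinema_seat_id VARCHAR(32) NOT NULL,
  booking_id     VARCHAR(32),
  status         ENUM('AVAILABLE','HELD','BOOKED') NOT NULL,
  -- NOT NULL matters: MySQL lets rows with a NULL in a UNIQUE column bypass the check.
  -- One row per (show, seat, status), so at most one BOOKED row per (show, seat).
  UNIQUE KEY uniq_booked_seat (show_id, cinema_seat_id, status)
);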

On payment success, Booking Service does INSERT → on duplicate key, fail. If two pods both think they hold seat J-12 (because Redis lost the lock), the first INSERT lands, the second hits the UNIQUE violation, and BS gracefully refunds the second user's charged card. Embarrassing, rare, but never silently wrong.
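The "second INSERT fails" behaviour can be demonstrated with any database that enforces UNIQUE constraints. The sketch below uses Python's built-in sqlite3 purely as a stand-in for MySQL, with a simplified, hypothetical `booked_seat` table that holds only BOOKED rows, so the constraint is just (show_id, cinema_seat_id).

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE booked_seat (
            show_id        TEXT NOT NULL,
            cinema_seat_id TEXT NOT NULL,
            booking_id     TEXT NOT NULL,
            UNIQUE (show_id, cinema_seat_id)
        )
        """
    )
    return conn


def commit_booking(conn: sqlite3.Connection, show_id: str,
                   seat_id: str, booking_id: str) -> bool:
    """Return True if the seat was booked, False if someone already holds it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO booked_seat VALUES (?, ?, ?)",
                (show_id, seat_id, booking_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: the caller should refund this payment.
        return False


ledger = open_ledger()
print(commit_booking(ledger, "show-7891", "J-12", "bk-x"))  # first commit lands
print(commit_booking(ledger, "show-7891", "J-12", "bk-y"))  # duplicate is rejected
```

The important property is that the rejection happens inside the database, so it holds even when two service pods each believe they own the lock.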

The TTL race — the bug you must name out loud

Scenario: User X's reservation TTL fires at t=300s while Stripe is still processing their card. Redis deletes the key. User Y, refreshing the seat-map, sees J-12 marked AVAILABLE and SETNXs it successfully. At t=302s both User X and User Y get payment-success webhooks. Both BS pods try to INSERT a BOOKED row for J-12. The UNIQUE constraint catches the second one: one user gets a ticket, the other gets a refund and an apology. Pure Redis has no fix for this without Layer 2: there is no safe way to "extend the lock atomically with payment", because Redis can't see Stripe's state.
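The race is easier to reason about as a tiny simulation. The sketch below is a toy model under stated assumptions: one seat, times in seconds, User X's payment webhook is processed before User Y's, and a single variable plays the role of the UNIQUE-guarded ledger row. `run_race` is a hypothetical name.

```python
TTL_S = 300  # the 5-minute hold


def run_race(y_claims_at: float) -> dict:
    """Replay the scenario for one seat and report what each user ends up with."""
    lock_owner, lock_expires = None, 0.0
    booked_by = None  # stands in for the UNIQUE-guarded BOOKED row

    def claim(user: str, now: float) -> bool:
        nonlocal lock_owner, lock_expires
        if lock_owner is not None and now < lock_expires:
            return False                      # lock still held: SET NX fails
        lock_owner, lock_expires = user, now + TTL_S
        return True

    def confirm(user: str) -> str:
        nonlocal booked_by
        if booked_by is None:
            booked_by = user                  # first INSERT lands
            return "ticket"
        return "refund"                       # second INSERT hits the constraint

    claim("X", 0)
    y_holds_lock = claim("Y", y_claims_at)
    outcome = {"X": confirm("X")}             # X's slow payment finally succeeds
    outcome["Y"] = confirm("Y") if y_holds_lock else "conflict"
    return outcome


print(run_race(y_claims_at=301))  # Y claims just after X's TTL fired
print(run_race(y_claims_at=120))  # Y tries while X's hold is still live
```

In the first run both users hold a lock at some point, and only the ledger decides who gets the seat; in the second, the lock alone is enough and Y never reaches payment.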
The trade-off you must say out loud: Redis SETNX gives us ~100K lock-ops/s per shard (vs ~1K writes/s for MySQL at SERIALIZABLE) — a 100× contention-absorption win. The cost is we now have two stores to reason about, async-replication concerns on the Redis cluster, and must accept that ~0.01% of bookings will hit the UNIQUE constraint and need refund handling. For ticket booking where the contention pattern is "50K fans burst on one seat", the trade is overwhelmingly worth it. For a system where contention is mild, plain MySQL SERIALIZABLE would be simpler and safer.
Step 9

The Reservation Lifecycle

A single seat moves through a small, strict state machine. With Redis as the lock and MySQL as the ledger, each transition lives in a specific store and is owned by a specific component.

stateDiagram-v2
    [*] --> AVAILABLE : show created
    AVAILABLE --> HELD : SETNX (5-min TTL)
    HELD --> BOOKED : payment success
    HELD --> AVAILABLE : TTL expiry / cancel
    BOOKED --> [*] : show ends
    AVAILABLE --> [*] : show ends

AVAILABLE → HELD

Owned by Booking Service. Triggered by POST /reserve. The transition is a single Lua script on Redis that runs SET ... NX PX 300000 for each requested seat (plain SETNX takes no TTL argument, so the NX and PX options of SET do the work). On success, BS also writes a HELD reservation row to MySQL for audit (not for locking; the lock lives entirely in Redis).

HELD → BOOKED

Owned by Booking Service, triggered by a payment-success webhook from Payment Service. INSERT into the booking table and INSERT into show_seat with status=BOOKED — the UNIQUE constraint here is what makes the architecture safe. After the MySQL commit, BS releases the Redis lock via the "delete-if-owner" Lua script.

HELD → AVAILABLE (expiry)

Owned by Redis itself. When the 5-minute TTL fires, Redis deletes the key and emits a __keyevent@*__:expired notification. No service code runs the expiry — Redis handles it internally. WaitingUsersService ⑧ subscribes to the notification stream and wakes the next queued fan.

HELD → AVAILABLE (cancel)

User clicks "give up these seats" before the TTL fires. BS runs the release-if-owner Lua script to atomically DEL the Redis key (only if the value still matches the reservation_id, defending against the cancel-vs-expiry race). The HELD audit row in MySQL is marked CANCELED.

Why the state machine matters: with three states and four transitions, every code path that touches a seat goes through exactly one of these arrows, which keeps bugs (and audits) tractable. The most subtle bug is the race between TTL-expiry-on-Redis and explicit-cancel-via-BS. Two safeguards: (a) the release-if-owner Lua script means "cancel after expiry" is a safe no-op (the key is either gone or already holds another reservation_id), and (b) MySQL's UNIQUE constraint catches any double-BOOKED attempt that somehow slips through.
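One way to keep every code path on those arrows is to encode the diagram as a lookup table and reject everything else. This is a minimal sketch: the event names (`reserve`, `payment_success`, `ttl_expiry`, `cancel`) and the `next_state` helper are illustrative assumptions, not names from the system.

```python
from enum import Enum


class SeatState(Enum):
    AVAILABLE = "AVAILABLE"
    HELD = "HELD"
    BOOKED = "BOOKED"


# (from_state, event) -> to_state; anything missing is an illegal transition.
TRANSITIONS = {
    (SeatState.AVAILABLE, "reserve"): SeatState.HELD,
    (SeatState.HELD, "payment_success"): SeatState.BOOKED,
    (SeatState.HELD, "ttl_expiry"): SeatState.AVAILABLE,
    (SeatState.HELD, "cancel"): SeatState.AVAILABLE,
}


def next_state(current: SeatState, event: str) -> SeatState:
    """Return the state after `event`, or raise if the diagram has no such arrow."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {current.name} on {event!r}") from None


print(next_state(SeatState.HELD, "payment_success").name)  # BOOKED
```

With the table as the single source of truth, a request such as "cancel a BOOKED seat" fails loudly instead of quietly mutating state.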
Step 10

WaitingUsersService — The Fairness Queue

When a show goes "full", the next 100 fans who try to reserve don't see "sorry, sold out" — they see "you're number 47 in the queue, estimated wait 3 minutes". The queue exists because, on average, 5-10% of HELD reservations expire without payment — meaning even sold-out shows have a steady drip of seats coming back, and someone has to be next in line.

sequenceDiagram
    participant U as User
    participant BS as Booking Service
    participant R as Redis Lock Cluster
    participant WU as WaitingUsers
    participant NS as Notification
    Note over U,R: Show is sold out, all 200 seats held in Redis
    U->>BS: POST /reserve · 2 seats
    BS->>R: SETNX seat:show:S (all locked? yes)
    R-->>BS: nil
    BS->>R: RPUSH queue:show · user_id
    R-->>BS: queue_position=47
    BS-->>U: 429 Queued · ETR 3min
    Note over R: Sarah's TTL fires
    R->>R: DEL seat:show:S (auto-expiry)
    R-->>WU: __keyevent@*__:expired (PSUBSCRIBE)
    WU->>R: LPOP queue:show
    R-->>WU: user47
    WU->>NS: notify user47 · 5-min window opens
    NS->>U: push: your seats are available!
    U->>BS: POST /reserve · clicks deep link
    BS->>R: SETNX seat:show:S
    R-->>BS: OK
    BS-->>U: 200 OK · reservation_id

Per-show queue in Redis (LIST)

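One queue per show, stored as a Redis list keyed queue:show-X. RPUSH to enqueue at the tail, LPOP to dequeue from the head, which gives first-in-first-out order (LPUSH plus LPOP would behave as a stack and serve the newest waiter first). Lives in the same Redis cluster as the locks (sharded by show_id via hashtag) so all events for one show land on one shard: no cross-node coordination, no race conditions.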

Triggered by keyspace notifications

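WU runs PSUBSCRIBE __keyevent@*__:expired. This requires keyspace notifications to be enabled on the lock cluster (notify-keyspace-events with the E and x flags; they are off by default). When any seat:* key expires, WU pops the head of the matching queue:show-X and notifies that user. The 5-minute claim window for the woken user is itself a fresh SET ... NX PX 300000.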

At-most-once delivery — fallback poller

Redis keyspace notifications are fire-and-forget — a momentary subscriber disconnect drops the event. WU runs a fallback poller every 10 seconds that scans for shows where free seats exist but the queue is non-empty — catches any missed notifications.

Multi-seat priority

If a queued user wants 4 seats but only 1 freed up, we don't dequeue them — we keep them in queue and look for the next user wanting 1 seat. Optional: a separate "would-accept-fewer" queue, but this is a UX choice with trade-offs.
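The queue behaviour described in this step can be sketched with a plain `deque` standing in for the Redis list. `Waiter`, `ShowQueue`, and `wake_next` are hypothetical names, and the skip rule is one possible reading of the multi-seat policy: serve the longest-waiting user whose request fits the seats that just freed up.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class Waiter:
    user_id: str
    seats_wanted: int


class ShowQueue:
    """FIFO waiting list for one show (in-memory stand-in for queue:show-X)."""

    def __init__(self) -> None:
        self._waiters: Deque[Waiter] = deque()

    def enqueue(self, waiter: Waiter) -> int:
        self._waiters.append(waiter)  # join at the tail
        return len(self._waiters)     # 1-based queue position

    def wake_next(self, seats_free: int) -> Optional[Waiter]:
        """Pop the longest-waiting user whose request fits the freed seats."""
        for index, waiter in enumerate(self._waiters):
            if waiter.seats_wanted <= seats_free:
                del self._waiters[index]  # safe: we return right after mutating
                return waiter
        return None  # nobody fits; everyone keeps their place


queue = ShowQueue()
queue.enqueue(Waiter("u1", seats_wanted=4))
queue.enqueue(Waiter("u2", seats_wanted=1))
position = queue.enqueue(Waiter("u3", seats_wanted=1))
woken = queue.wake_next(seats_free=1)
print(position, woken.user_id)  # 3 u2
```

Note the fairness cost: u1 arrived first but is passed over until four seats free up together, which is the trade-off the "would-accept-fewer" option tries to soften.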

Step 11

Fault Tolerance

The Redis lock cluster is the most failure-sensitive piece — async replication and TTL semantics make it the weakest correctness link. The MySQL UNIQUE constraint exists precisely because Redis failures cannot be fully prevented.

🔁 Redis Lock Cluster — replicated, with a known weak point

Redis Cluster with primary + 2 replicas per shard, replicated asynchronously. On primary failure, Sentinel/Cluster promotes a replica in 5–15 seconds. Known issue: recently-acquired locks that hadn't replicated yet are lost on failover. Two pods could then both think they own the same seat.

Mitigation: the MySQL UNIQUE constraint at commit time catches the resulting double-INSERT. The loser gets a refund + apology, never a duplicate ticket. Expected rate: ~0.01% of bookings during a failover event.

🌬️ WaitingUsersService — accept some loss

The queue itself is in Redis (durable to RDB/AOF snapshots), but the active PSUBSCRIBE to keyspace notifications is in-memory in WU. If WU restarts, in-flight expiry events fired during the restart window are dropped — but the fallback 10-second poller scans for "free seats, non-empty queue" and re-fires the missed wake-ups.

Failure mode: a restart means waiters get notified 5–15 seconds late. Annoying, never wrong.
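The poller's core check is small enough to sketch directly. The function name and the two input dictionaries are assumptions for illustration; in the real service the free-seat counts and queue contents would be read from Redis rather than passed in.

```python
from typing import Dict, List, Tuple


def find_missed_wakeups(
    free_seats: Dict[str, int],
    queues: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (show_id, user_id) pairs for shows that have free seats AND waiters."""
    wakeups = []
    for show_id, free in sorted(free_seats.items()):
        waiting = queues.get(show_id, [])
        if free > 0 and waiting:
            wakeups.append((show_id, waiting[0]))  # only the head of each queue
    return wakeups


missed = find_missed_wakeups(
    free_seats={"show-A": 2, "show-B": 0, "show-C": 1},
    queues={"show-A": ["u7", "u9"], "show-B": ["u3"], "show-C": []},
)
print(missed)  # [('show-A', 'u7')]
```

Running this every 10 seconds bounds the extra delay a dropped notification can cause, at the price of a periodic scan.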

💾 MySQL Ledger — synchronous replicas + automated failover

The Bookings DB is a primary with two synchronous replicas in different availability zones. Writes commit only after at least one replica acknowledges. Automated failover (e.g., via Orchestrator or Aurora's built-in failover) promotes a replica in 10-30 seconds on primary failure. The UNIQUE constraint survives failover because it's part of the durable schema.

RPO=0, RTO≈30s — no data loss, brief unavailability.

📥 Payment idempotency

Every /confirm call carries an idempotency key. If the network drops mid-call and the client retries, the Payment Service recognizes the same key and returns the already-charged response — never double-charging the card. The Booking Service does the same when receiving the webhook callback, and the UNIQUE constraint catches any duplicate booking row at commit time.
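A minimal sketch of the idempotency check, assuming an in-memory dictionary in place of the durable store a real Payment Service would need. `PaymentService`, `ChargeResult`, and the `ch_` id format are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChargeResult:
    charge_id: str
    amount_cents: int


class PaymentService:
    """Toy payment service that replays the stored result for a repeated key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, ChargeResult] = {}
        self.charges_made = 0

    def confirm(self, idempotency_key: str, amount_cents: int) -> ChargeResult:
        cached = self._by_key.get(idempotency_key)
        if cached is not None:
            return cached  # retry: replay the stored response, no new charge
        self.charges_made += 1
        result = ChargeResult(f"ch_{self.charges_made}", amount_cents)
        self._by_key[idempotency_key] = result
        return result


payments = PaymentService()
first_try = payments.confirm("res-123-confirm", 45_000)
retry = payments.confirm("res-123-confirm", 45_000)  # client retried after a timeout
print(first_try == retry, payments.charges_made)  # True 1
```

In production the lookup and the insert would have to be atomic (for example via a unique index on the key), otherwise two concurrent retries could both miss the cache.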

Step 12

Data Partitioning

3.6TB fits on one big MySQL box, but a single box can't survive its own failure and can't absorb a release-day spike on its own. We shard the Bookings DB. Choosing the shard key is the most consequential decision in this section.

❌ Shard by movie_id

Tempting because a query like "show me all bookings for Marvel" stays on one shard. But: when Avengers releases, every booking in the country flows to one shard. That shard's CPU pegs at 100% while the others sit at 5%. Hot-shard hell.

✅ Shard by show_id (with consistent hashing)

A specific show is "Avengers · 21:00 · PVR Forum". There are millions of shows across the catalog and any given show has at most ~500 seats — bounded write volume. Hashing show_id distributes load uniformly even on release day.

Why show_id is the right key — three reinforcing reasons

🌡️ Bounded heat

The hottest possible thing is a single sold-out show: say 500 seats × 50 attempts/seat = 25K writes total. At 200 req/s that is 25,000 ÷ 200 = 125 seconds, roughly 2 minutes of work for a single shard. Nothing melts.

🎯 Co-located fairness

The Redis Lock Cluster and its queue:show-X lists are sharded by show_id too (Redis Cluster hashtag {show_id} forces all keys for one show onto one slot). All operations for one show — SETNX, TTL expiry, keyspace-notification dequeue — happen on a single Redis shard. No cross-node coordination, no race conditions.

📈 Cheap rebalancing

Use consistent hashing, not hash % N. Adding a shard to an N-shard cluster relocates only about 1/(N+1) of shows instead of nearly all of them: a few hours of rebalance instead of a multi-day migration with the cluster degraded throughout.
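The rebalancing claim is easy to check empirically. The sketch below builds a small hash ring with virtual nodes and compares it against `hash % N` when a fifth shard joins four; the `HashRing` class, the MD5-based hash, and the 100 virtual nodes per shard are illustrative choices, not details from the design.

```python
import bisect
import hashlib
from typing import List


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, shards: List[str], vnodes: int = 100) -> None:
        points = sorted((_hash(f"{s}#{v}"), s) for s in shards for v in range(vnodes))
        self._keys = [p[0] for p in points]
        self._shards = [p[1] for p in points]

    def shard_for(self, show_id: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        i = bisect.bisect_right(self._keys, _hash(show_id)) % len(self._keys)
        return self._shards[i]


shows = [f"show-{i}" for i in range(10_000)]
old_shards = [f"shard-{i}" for i in range(4)]
old_ring = HashRing(old_shards)
new_ring = HashRing(old_shards + ["shard-4"])

moved = [s for s in shows if old_ring.shard_for(s) != new_ring.shard_for(s)]
moved_ring = len(moved) / len(shows)
moved_mod = sum(_hash(s) % 4 != _hash(s) % 5 for s in shows) / len(shows)
print(f"consistent hashing moved {moved_ring:.0%}, hash % N moved {moved_mod:.0%}")
```

On the ring, only shows that land on the new shard move, so the fraction should sit near one fifth; with modulo hashing most shows change shards.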

Replication on top of partitioning: each shard is replicated 3× across availability zones with synchronous replication on the primary write. Reads from the booking DB (rare — most reads go through the catalog DB) can hit replicas. The browse plane's Cassandra catalog already handles its own multi-region replication.
Step 13

Cache & Load Balancer

Two infrastructure pieces with disproportionate impact on whether release day feels fast or feels broken. Note that the cache Redis is a separate cluster from the lock Redis — different SLAs, different failure tolerances, their keys never share a namespace.

Two Redis clusters, two jobs

🔒 Lock Redis (⑥)

Holds seat:show-X:S keys with SETNX semantics + TTL, plus queue:show-X lists for waiting users. Correctness-critical — a lost key can cause a double-INSERT attempt (caught by MySQL UNIQUE but still ugly). Replication: primary + 2 replicas, RDB snapshots every 60s, AOF appendfsync everysec.

🗂️ Cache Redis (⑪)

Holds show metadata and seat-map bitmaps. Tolerates stale data — a 2-second-old seat-map is fine, the SETNX on Lock Redis catches any resulting conflict. Replication is relaxed (RDB-only), failures are non-correctness events.

Cache — what to put in Redis, what to leave out

✅ Cache: show metadata

Movie title, cinema name, hall name, showtime, total seats, base price. Effectively immutable for the life of the show. TTL: hours. Hit rate: 99%+. Lives in Cache Redis as JSON blobs keyed by show:<id>.

✅ Cache: seat-map snapshots

A compact bitmap per show showing FREE/HELD/BOOKED for each seat. Refreshed every 1-2 seconds by tailing the bookings DB binlog. Slightly stale is fine — when a fan clicks a stale-FREE seat, the SETNX on Lock Redis catches the conflict and returns 409.

❌ Don't cache: the lock value itself

The active lock on a seat lives only in Lock Redis. Mirroring it into Cache Redis would invite stale-read race conditions — not worth it.
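For the seat-map snapshot, one plausible compact encoding is 2 bits per seat, so a 200-seat hall fits in 50 bytes. The layout below (4 seats per byte, lowest bits first) and the function names are assumptions for illustration; the source does not specify the bitmap format.

```python
from typing import Sequence

FREE, HELD, BOOKED = 0, 1, 2  # 2 bits per seat; the value 3 is unused


def pack_seat_map(states: Sequence[int]) -> bytes:
    """Pack one 2-bit state per seat, 4 seats per byte, lowest bits first."""
    packed = bytearray((len(states) + 3) // 4)
    for i, state in enumerate(states):
        packed[i // 4] |= state << ((i % 4) * 2)
    return bytes(packed)


def seat_state(packed: bytes, index: int) -> int:
    """Read back the 2-bit state of the seat at `index`."""
    return (packed[index // 4] >> ((index % 4) * 2)) & 0b11


states = [FREE, HELD, BOOKED, FREE, BOOKED]
packed = pack_seat_map(states)
print(len(pack_seat_map([FREE] * 200)))  # 50
```

Because the snapshot is advisory only, a reader that sees a stale FREE bit still has to win the lock on Lock Redis before the seat is theirs.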

Load Balancer — sticky sessions during checkout

The LB does normal round-robin for browse traffic. But the moment a user starts /reserve, we set a sticky cookie pinning that user to the same Booking Service pod for the rest of their checkout (up to 5 minutes). This isn't strictly required for correctness — Redis and MySQL are the sources of truth — but it lets the Booking Service keep small in-memory state (selected seats, payment intent, idempotency key) instead of refetching on every click.

Algorithm choice: start with round robin, upgrade to least-connections once one pod's CPU consistently runs hotter than others. Health checks every 5 seconds; evict unhealthy pods within 15 seconds. The LB itself is an active-active pair behind a single virtual IP — single LB failure is invisible to clients.
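The routing rule (sticky during checkout, least-connections otherwise) can be sketched as a single function. `pick_pod` and its arguments are hypothetical; a real load balancer would express this in its own configuration rather than in application code.

```python
from typing import Dict, Optional


def pick_pod(
    active_connections: Dict[str, int],
    healthy: Dict[str, bool],
    sticky_pod: Optional[str] = None,
) -> str:
    """Honor the sticky cookie if its pod is healthy, else use least-connections."""
    if sticky_pod is not None and healthy.get(sticky_pod, False):
        return sticky_pod
    candidates = [pod for pod in active_connections if healthy.get(pod, False)]
    if not candidates:
        raise RuntimeError("no healthy Booking Service pods")
    # Tie-break on pod name so the choice is deterministic.
    return min(candidates, key=lambda pod: (active_connections[pod], pod))


connections = {"bs-1": 40, "bs-2": 12, "bs-3": 25}
health = {"bs-1": True, "bs-2": True, "bs-3": False}
print(pick_pod(connections, health, sticky_pod="bs-1"))  # bs-1
print(pick_pod(connections, health, sticky_pod="bs-3"))  # bs-2
```

The second call shows why stickiness is safe to break: when the pinned pod is unhealthy the user simply lands elsewhere, and Redis and MySQL still hold the truth.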
Step 14

Interview Q&A

How do you prevent two users from booking the same seat?
Two-layer lock: Redis for speed, MySQL for safety. (1) SET seat:show-X:S "res-id" NX PX 300000 on the Redis Lock Cluster: single-threaded per shard, atomic, returns nil to losers in ~1ms. (2) UNIQUE constraint on (show_id, cinema_seat_id, status) in MySQL, which permits at most one BOOKED row per seat (MySQL does not support partial indexes, so status is a key column rather than a filter). Even if Redis loses a lock on failover, the second INSERT errors out and we refund the second user's card. The losing user gets 409 Conflict with alternative seats; the rare double-INSERT edge case (~0.01% during a Redis failover) is caught at commit time.
Why Redis SETNX instead of MySQL SERIALIZABLE?
Throughput. A single Redis shard handles ~100K SETNX/s; MySQL at SERIALIZABLE handles ~1K writes/s. On a 50K-req/s flash sale, the 49,800 losers bounce off Redis in ~1ms each and never touch MySQL — which is now reserved for the ~200 winners actually committing payments. The cost: two stores to reason about, async-replication concerns on Redis, and we must accept that ~0.01% of bookings trigger the MySQL UNIQUE catch-and-refund flow. For ticket booking where contention is the dominant pattern, the trade is overwhelmingly worth it.
How do you handle a flash sale at 50K req/s?
Three pressure-relief mechanisms. (1) Browse plane absorbs most of it — 95%+ of those 50K are GETs on seat-maps, served by Cache Redis at memory speed. (2) Lock Redis absorbs the booking burst — ~100K SETNX/s per shard, sharded by show_id with consistent hashing. (3) WaitingUsersService throttles writes — once a show is full, new reservers go straight into a Redis LIST queue (no MySQL write, no lock contention). MySQL only sees the ~200 req/s of "actually committing a successful booking" requests.
What if a user clicks reserve but their network drops before payment?
The Redis TTL takes care of it. The seat key was set with PX 300000 (5-minute TTL). After 5 minutes with no /confirm, Redis auto-deletes the key and emits a __keyevent@*__:expired notification. WaitingUsersService picks up the event and dequeues the next waiter. The user who lost their network sees an "expired" message on their next page-load. No double-charge (payment was never started), no stuck seat.
Why split the lock (Redis) from the ledger (MySQL)?
They have different shapes. Locks need to be claimed and released at memory speed under massive contention — that's Redis. The ledger needs to be durable, transactionally consistent, and serve as the regulatory record of every payment — that's MySQL. Jamming both into MySQL means 50K-req/s contention pegs the bookings DB. Jamming both into Redis means losing payment data on a Sentinel failover. Splitting them lets each store specialize, and the MySQL UNIQUE constraint is the bridge that catches the cracks where Redis loses a lock.
How is fairness enforced when 100 users want the same 10 seats?
Per-show Redis LIST queue + keyspace-notification wake-up. The first 10 users to SETNX get the locks. Users 11–100 are RPUSHed to the tail of queue:show-X and told their position. When any of the 10 Redis locks TTL-expires, the keyspace notification wakes WaitingUsersService, which LPOPs the head of the queue (longest-waiting user) and gives them a 5-minute claim window. If they don't claim, we wake the next in line. First-come-first-served, enforced by a Redis LIST and a TTL, not by who refreshes fastest.
Why MySQL for the ledger instead of Cassandra or DynamoDB?
The UNIQUE constraint. MySQL InnoDB enforces unique indexes at the storage layer — even concurrent INSERTs on the same key are serialized and the second one errors cleanly. Cassandra has no such concept (lightweight transactions are per-partition only and 10× slower). DynamoDB's conditional writes work but are comparatively expensive at this scale. The whole architecture rests on "Redis is fast but loses locks sometimes; MySQL has the final word at commit time" — and MySQL's UNIQUE constraint is what makes that final word safe.
How would you scale this for global launch (multi-region)?
Browse plane goes global, lock plane stays regional. The Cassandra catalog DB and Cache Redis replicate to all regions — a fan in Mumbai gets seat-maps from a Mumbai replica with ~10ms latency. The Lock Redis cluster and Bookings MySQL stay single-region per market (e.g., one cluster for India, one for US) because cross-region SETNX or strict-consistency commits would have intolerable latency (100ms+ round trips). Each market's shows live in that market's stores; users only book in their region. For genuine global shows (concert tours), shard by show_id and pin each show to the region with the most local demand.
The one-line summary the interviewer remembers: "It's a two-layer lock — Redis SETNX absorbs 50K-req/s contention in ~1ms each, MySQL's UNIQUE constraint catches the rare cases where Redis loses a lock — fronted by a Cassandra/Redis browse plane for read traffic and backed by a Redis-LIST + keyspace-notification fairness queue that turns a 'whoever refreshes first wins' free-for-all into a deli-counter ticket system."