Dream11 — Fantasy Sports HLD Design

The point of this page

Every box on the architecture diagram carries a reason. Picking Postgres over MySQL, ALB over NLB, Cassandra over MongoDB — these aren't taste calls, they're forced by the load you've signed up to handle. As you read, watch for the "Why this, not that" blocks — that's where the actual interview-grade design lives.

Framework → DB Families → Tech Cheat Sheet →

Step 1

Clarify Requirements

Imagine Raj in Mumbai opens Dream11 at 6:55 PM. The IPL match starts at 7:30. He needs to build a team of 11 cricketers, pay ₹49 to enter a "Mega Contest" with 12 lakh other people, and once that match starts, watch his rank update live as players hit fours and take wickets. Multiply Raj by 10 million people doing the same thing in the same 35-minute window — that's the system we're designing. Pin the contract before reaching for tech.

✅ Functional Requirements

Browse upcoming matches and discover contests (free, paid, mega, head-to-head)
Build a team of 11 from a match's player pool under salary-cap + role-quota constraints
Join one or more contests for a given match — pay entry fee from in-app wallet
Hard lock all teams at match start time — no edits after the toss
Live score → live points → live leaderboard during the match (sub-5 s lag)
Distribute prize money to winners after the match ends & results are reviewed
Wallet: deposit (UPI/cards), withdraw to bank, transaction history

⚡ Non-Functional Requirements

Concurrency: 10 M concurrent users during IPL final, peak join-rate 500 K/sec in the last 5 min
Latency: team save < 300 ms; leaderboard refresh < 1 s; score → leaderboard < 5 s
Availability: 99.99% on the hot path (join, lock, score, leaderboard) — downtime during a live match is unforgivable
Consistency: wallet & contest-entry must be strongly consistent (money + fairness); leaderboard can be eventually consistent (a 1-second-stale rank is fine)
Fairness: no edits after deadline — this is a regulatory + trust requirement, not a perf one

Out of scope (confirm with interviewer): KYC flow, anti-fraud / collusion detection (separate ML system), tax/TDS computation, push-notification platform internals, chat / social features, news & analytics articles, sport-specific scoring logic (we'll treat scoring as a pluggable strategy — cricket today, football tomorrow).

Step 2

Capacity & Scale Estimates

The numbers below force every tech choice that follows. A 10 M-concurrent burst over 5 minutes isn't a "use a bigger box" problem — it's a "your DB will die if you don't pre-shard" problem. Anchor the design in these numbers and the choices fall out.

Metric	Assumption	Result / Implication
Total registered users	200 M (India)	—
Daily active users on a match day	50 M	—
Peak concurrent users (IPL final)	10 M	Forces stateless app tier + sticky-LB on user_id
Peak team-save rate (last 5 min)	500 K writes/sec	Forces async write path + Kafka buffer in front of DB
Peak contest-join rate	200 K joins/sec	Forces wallet on Aerospike (1 ms p99), not RDBMS
Contests per match	~10 K contests	—
Largest single contest	12 M entries (Mega Contest)	Leaderboard must rank 12 M rows in <1 s — Redis ZSET, not SQL ORDER BY
Total teams across all contests / match	~150 M teams	Score event must fan-out to 150 M score updates within 5 s
Score events / over of cricket	~10 ball events × 22 score deltas = ~200	Per match: ~12 K events; with 5-second SLA → stream processor (Flink)
Wallet operations / sec at peak	200 K debits/sec on contest join	Single Postgres can't sustain this — wallet moves to Aerospike
Storage for historical matches + teams	~5 PB over 5 years	Hot data in Postgres/Cassandra; cold archive in S3

The three numbers that hurt: (1) 500 K team-saves/sec — no single SQL primary survives this, so writes go through Kafka and land in a sharded Cassandra. (2) 200 K wallet debits/sec — Postgres tops out around 20 K writes/sec per primary, so wallet lives in Aerospike. (3) 12 M-entry leaderboards refreshed every 5 s — SQL ORDER BY points DESC LIMIT 100 over 12 M rows is a non-starter; Redis ZSET ranks in O(log N) and serves the top-100 in microseconds.

Step 3

Actors & Use Cases

Four kinds of actor drive the system. The colour each gets here is the colour you'll see them in on the architecture diagram — that visual link is intentional.

flowchart LR U([User / Player]) S([Score Provider
e.g. Sportradar]) A([Admin / Ops]) P([Payment Gateway]) U --> BR[Browse Matches & Contests] U --> BT[Build Team] U --> JC[Join Contest] U --> VL[View Live Leaderboard] U --> WD[Wallet Deposit / Withdraw] S --> SE[Push Ball-by-Ball Events] A --> CM[Create Match & Player Pool] A --> CR[Cancel / Resolve Disputes] A --> RP[Release Prize Money] P --> SC[Settlement Callbacks] style U fill:#e8743b,stroke:#e8743b,color:#fff style S fill:#4a90d9,stroke:#4a90d9,color:#fff style A fill:#9b72cf,stroke:#9b72cf,color:#fff style P fill:#38b265,stroke:#38b265,color:#fff

🏏 Browse

Fetch upcoming matches and contest list. Read-heavy, cacheable for hours — perfect CDN territory.

👕 Build Team

Pick 11 players within ₹100 cr salary cap, 3–5 batsmen, 1 WK, etc. Validated on server (clients lie).

💸 Join Contest

Atomic: debit wallet → insert into contest_entries → write to leaderboard. Idempotency-key protects against retries.

📈 Live Leaderboard

WebSocket subscription per user × contest. Server pushes top-N + your rank every 5 s during a match.

📡 Score Push

Sportradar / official feed POSTs ball events into our ingest API. We compute fantasy points and fan out.

🏆 Prize Distribution

End-of-match job: read final leaderboard → compute payouts → credit winners' wallets in a single ledger transaction.

Step 4

High-Level Architecture

This is where most candidates draw eight boxes and call it a day. We're going to do it properly — three passes. Pass 1: the simplest thing that could work, and where it falls apart. Pass 2: the central architectural idea (a three-plane split). Pass 3: the production shape, with every component justified.

Pass 1 — the naive design (and why it dies at toss time)

Raj opens Dream11 and posts his team. The simplest design: one app server validates the team, debits his wallet, inserts into a contest row, and serves the leaderboard by running SELECT * FROM teams WHERE contest_id = ? ORDER BY points DESC LIMIT 100 on every refresh. One Postgres, one app box.

flowchart LR C([Mobile App]) -- save / join / refresh --> A[App Server] A -- everything --> DB[(Postgres
users, teams, contests,
wallet, scores, leaderboard)] style C fill:#e8743b,stroke:#e8743b,color:#fff style A fill:#4a90d9,stroke:#4a90d9,color:#fff style DB fill:#9b72cf,stroke:#9b72cf,color:#fff

💥 The toss-time cliff

At 7:25 PM, 5 million users mash "Join Contest" in 60 seconds. That's 80 K writes/sec on a single Postgres rated for ~20 K. The DB collapses; users see "payment failed" while their wallet got debited. Trust gone.

💥 The leaderboard query of doom

A Mega Contest has 12 M entries. Every refresh runs ORDER BY points DESC LIMIT 100 over 12 M rows. Even with an index, that's a sort across 100s of MB of memory per query. Multiply by 12 M users hitting refresh every 5 s → 2.4 M sorts/sec. No machine on earth.

💥 The score fan-out cliff

Virat Kohli hits a six. That single event must update points for ~80 M teams that picked him (40%+ ownership). One ball ≠ one DB update. It's tens of millions of dependent writes. Inline = death.

The lesson: Dream11 is not one workload — it's three workloads in a trench coat. Transactional stuff (wallet, contest entry) needs ACID. Streaming stuff (score fan-out) needs Kafka + Flink. Read-heavy ranked stuff (leaderboard) needs an in-memory sorted set. Trying to make one Postgres do all three is the source of every page above.

Pass 2 — three planes, three storage philosophies

The decision that organises everything: split the system into a transaction plane, a stream plane, and a real-time plane. Each has different consistency needs, different traffic patterns, and so they get different storage tech. They communicate through a Kafka backbone — so a slowdown in one plane can't bring down the others.

🏦 Transaction plane

Things involving money and fairness. Wallet debits, contest entries, prize payouts. Strong consistency, ACID, idempotency. Slow is OK; wrong is not.

Postgres for users, contests, contest entries
Aerospike for the wallet (200 K ops/sec, ms-latency)
Double-entry ledger pattern, sum-to-zero invariant

📡 Stream plane

Score events arrive from the feed, get translated into fantasy point deltas, then fan out to every team that picked the affected player. High throughput, eventual consistency, idempotent processing.

Kafka as the event backbone (12 K events / match, but bursty)
Flink jobs for score → points translation & fan-out
Cassandra for the wide team-points table (write-optimised)

⚡ Real-time plane

Live leaderboard + push-to-device. Read-heavy, ranking is the only operation, staleness of a second is acceptable, latency is everything.

Redis ZSET as the canonical leaderboard for each contest
WebSocket gateway holding 100 K connections each
Push diffs (your rank, top-10) every 3–5 seconds

Why this split is the whole game: if a flash-crash drops the leaderboard cache, money is still safe — the wallet ledger in Aerospike + Postgres is the source of truth and continues working. If Kafka backs up briefly, leaderboard is stale by a few seconds but wallet debits keep flowing. Each plane fails independently because they share no hot storage. The contract between them is event topics, not shared tables.

Pass 3 — the production shape

Now expand each plane into its concrete components. Read the diagram twice: once tracing what happens when Raj joins a contest (orange → blue → purple), once tracing what happens when Kohli hits a six (cyan → green → pink → orange).

flowchart LR CL([① Mobile / Web Client]) CDN[② CDN
CloudFront] ALB[③ ALB L7] GW[④ API Gateway
auth, rate-limit] TS[⑤ Team Service] CS[⑥ Contest Service] WL[⑦ Wallet Service] PG[(⑧ Postgres
users, contests)] AS[(⑨ Aerospike
wallet ledger)] KF[⑩ Kafka
events backbone] IG[⑪ Score Ingestion] FL[⑫ Flink
points, fan-out] CA[(⑬ Cassandra
team_points)] RZ[(⑭ Redis ZSET
leaderboards)] NB[⑮ WebSocket Gateway] PUSH[⑯ Push Notifs
FCM / APNs] CL -->|static| CDN CL -->|API| ALB ALB --> GW GW --> TS GW --> CS GW --> WL TS --> KF CS --> PG CS --> WL WL --> AS CS --> KF IG --> KF KF --> FL FL --> CA FL --> RZ RZ --> NB NB ==>|push| CL FL --> PUSH style CL fill:#e8743b,stroke:#e8743b,color:#fff style CDN fill:#d4a838,stroke:#d4a838,color:#000 style ALB fill:#4a90d9,stroke:#4a90d9,color:#fff style GW fill:#4a90d9,stroke:#4a90d9,color:#fff style TS fill:#38b265,stroke:#38b265,color:#fff style CS fill:#38b265,stroke:#38b265,color:#fff style WL fill:#38b265,stroke:#38b265,color:#fff style PG fill:#9b72cf,stroke:#9b72cf,color:#fff style AS fill:#9b72cf,stroke:#9b72cf,color:#fff style KF fill:#3cbfbf,stroke:#3cbfbf,color:#000 style IG fill:#3cbfbf,stroke:#3cbfbf,color:#000 style FL fill:#3cbfbf,stroke:#3cbfbf,color:#000 style CA fill:#ec5d8a,stroke:#ec5d8a,color:#fff style RZ fill:#e05252,stroke:#e05252,color:#fff style NB fill:#d4a838,stroke:#d4a838,color:#000 style PUSH fill:#d4a838,stroke:#d4a838,color:#000

Component-by-component — what each numbered box does

Find the number in the diagram, read the matching card. Each card answers: what is this thing, why is it here, and what would break if we removed it tomorrow?

① Mobile / Web Client

The Android / iOS / web app every user runs. Does enough work to feel snappy: caches the player pool locally, pre-computes preview points, runs the salary-cap and role-quota checks before hitting the server. But it never owns truth — every team save, join, and wallet operation is re-validated server-side because a malicious client can lie.

Solves: latency & offline tolerance. If the client did nothing locally, every keystroke during team build would round-trip to the server — unusable on patchy 4G in a stadium parking lot.

② CDN (CloudFront / Akamai)

Caches static assets & semi-static data: app bundle, player photos, match thumbnails, team-of-the-week posters, and the player pool JSON (which only changes when the captain announces the playing XI ~30 min before toss). Cache TTL is short for the player pool (60 s) and long for images (days).

Why CloudFront over an in-region CDN? India-first user base means we want POPs in Mumbai, Delhi, Chennai, Bengaluru, Hyderabad — CloudFront has all five. Akamai is comparable but pricier; an in-region CDN like Jio CDN doesn't have global edges for NRI users. Solves: origin offload — without a CDN, every "open match page" hits our API for assets that haven't changed in hours.

③ Load Balancer — AWS ALB (L7)

Terminates TLS, routes by path (/team/*, /contest/*, /ws/*), sticky-hashes on user_id so the same user's requests warm the same caches.

Why ALB and not NLB? ALB understands HTTP — it can route by URL path, terminate TLS, do health checks at the HTTP level, integrate with WAF, and stick connections on a cookie or header. NLB is L4 (TCP only) — faster (no packet inspection) but blind to URLs. For the leaderboard WebSocket tier we actually do use NLB (see §6) because WS upgrade is long-lived and L7 inspection adds no value once the upgrade completes. Solves: single ingress for the HTTP API, integrated security (WAF + Shield), and the ability to spray traffic across an auto-scaling fleet.

④ API Gateway

The first stateful thing every request hits after the LB. Does authentication (JWT verify), rate limiting (10 requests/sec/user for /team, 1 req/sec for /wallet/deposit), request validation against an OpenAPI schema, and routes to the right downstream service. Also where we run shadow traffic during canary deploys.

Why a gateway vs hitting services directly? Centralising auth + rate-limit means a new service inherits both for free; if we removed the gateway, every service would re-implement JWT verification and we'd have 8 versions of "is this rate-limited?". Solves: consistency across services, single chokepoint for security policy.

⑤ Team Service

Owns the team-build workflow. Validates salary cap, role quotas (1 WK, 3–5 batsmen, 1–4 all-rounders, 3–5 bowlers), captain/vice-captain selection. Writes team-save events to Kafka rather than directly to a DB — the durable record of "team v3 at 18:54:21" goes onto a Kafka topic, and a downstream consumer materialises it into Cassandra. This is the only way 500 K saves/sec become survivable.

Solves: write absorption during the toss-time spike. The Team Service itself is stateless and elastic; Kafka is the shock absorber between users frantically editing teams and our storage layer.

⑥ Contest Service

Owns the contest catalog and the join workflow. Knows what contests exist, their capacity, their entry fee, their prize structure. The join API is the most critical transaction in the whole system: it must atomically (a) check the contest isn't full, (b) check the user hasn't already joined, (c) debit the wallet, (d) insert the contest entry, (e) seed the leaderboard. We use the saga pattern with idempotency keys — each step is reversible and the whole thing keys on (user_id, contest_id, team_id) so retries are safe.

Solves: "join contest" correctness under retry + concurrent load. Without saga + idempotency, a flaky mobile network double-debits Raj's wallet.

⑦ Wallet Service

Thin service in front of the Aerospike-backed ledger. Two operations: debit(user_id, amount, txn_id) and credit(user_id, amount, txn_id). Both are idempotent — the same txn_id applied twice has the same effect as once. Implemented as a double-entry ledger: every debit on a user account corresponds to a credit on a contest-escrow account; sums must balance to zero across the system.

Solves: the 200 K wallet-ops/sec bottleneck. A vanilla Postgres can't sustain this; Aerospike's hybrid memory architecture (indexes in RAM, data on SSD) does sub-ms reads & writes at this rate.

⑧ Postgres — Users & Contests

The system of record for users, KYC, contests, contest entries, and prize structures. Read-mostly, write-rarely after a contest is created. Replicated primary + 3 read replicas; reads from replicas, writes to primary. Sharded by user_id (consistent hashing across 16 shards) so a user's data colocates.

Why Postgres specifically — read the "Why this, not that" block in §5. Short version: Postgres gives us strong consistency, real ACID, JSONB columns for flexible contest configs, partial indexes, and a vibrant operational ecosystem (logical replication, pg_partman, pg_repack). MySQL would also work; Oracle is overkill for this scale; CockroachDB would be a sensible alternative if we needed multi-region writes.

⑨ Aerospike — Wallet Ledger

An in-memory NoSQL DB optimised for sub-ms read/write at huge throughput. Holds the wallet balances and the ledger of debits/credits, keyed by user_id. Built-in cross-data-center replication (XDR), strong consistency mode (CP), and the ability to write to SSD while keeping indexes in RAM — so durability survives node loss but reads stay fast.

Why Aerospike and not Redis or Postgres? Redis is in-memory but durability is best-effort (RDB snapshots, AOF) — a snapshot lag during a node crash can lose recent debits, which is unacceptable for money. Postgres is durable but caps at ~20 K writes/sec per primary. Aerospike fills the gap: durable like Postgres, fast like Redis, scales linearly. Solves: 200 K wallet ops/sec without losing rupees.

⑩ Kafka — Event Backbone

The central nervous system of the streaming plane. Topics: team-saves (partitioned by user_id), contest-joins (partitioned by contest_id), score-events (partitioned by match_id), fantasy-point-deltas (partitioned by team_id). Retention 7 days — long enough for any consumer to replay and recover.

Why Kafka over SQS or RabbitMQ? SQS is a queue, not a log — once a message is consumed it's gone, so you can't have multiple independent consumers (leaderboard + analytics + audit). Also SQS caps at ~3 K msg/sec per queue (FIFO) — not enough. RabbitMQ would need careful per-queue sharding and lacks Kafka's at-least-once + idempotent-producer guarantees. Kafka handles 1 M+ msg/sec per broker, exposes a replayable log, and lets us add new consumer groups without affecting existing ones. Solves: decoupling + replayability + throughput.

⑪ Score Ingestion

The bridge between the official cricket scoring feed (Sportradar, Opta, or BCCI feed) and our system. Receives ball-by-ball events via webhook, validates the payload signature, deduplicates (same ball can be retransmitted), and writes the raw event onto the score-events Kafka topic. Crucial — keeps the upstream provider isolated from our internal pipeline.

Solves: isolation + idempotency. The feed providers occasionally retry the same ball event 3–4 times; dedup here means downstream Flink jobs don't double-count Kohli's six.

⑫ Flink — Stream Processing

Two Flink jobs power the live experience. Job A: Scoring consumes score-events, applies the cricket scoring rules (6 = 6 fantasy pts, wicket = 25 pts, etc.), emits player-point-deltas. Job B: Fan-Out consumes player-point-deltas, joins against the team-roster table to find every team that picked that player, emits team-point-deltas — which then flow to Cassandra (durable record) and Redis (live leaderboard).

Why Flink over Spark Streaming? Flink is true event-at-a-time (latency ms); Spark Streaming is micro-batched (latency seconds). For a 5-second-end-to-end score SLA, the seconds budget is too tight to spend on batching. Flink also has exactly-once semantics via two-phase commits to Kafka, which we need so a restart never double-credits points. Solves: low-latency stateful stream processing with exactly-once guarantees.

⑬ Cassandra — team_points

A wide-column store for the durable record of "what does every team currently score?" Schema: PRIMARY KEY ((contest_id), team_id) — partitioning by contest_id means all teams in a contest live on the same node, so the post-match leaderboard reconstruction is a single partition scan. We write hard (every point delta), read soft (only at match end + recovery). This is exactly Cassandra's sweet spot.

Why Cassandra and not MongoDB or DynamoDB? MongoDB's write throughput drops with secondary indexes and large collections — and we're inserting tens of millions of updates per match. DynamoDB would work but ties us to AWS and costs more per write at this volume. Cassandra is open source, linearly write-scalable, and we control the partitioning to colocate by contest_id. Solves: durable, append-mostly storage of fantasy-point deltas at fan-out scale (potentially 1 M writes/sec during a wicket).

⑭ Redis ZSET — Per-Contest Leaderboard

One Redis sorted set per contest, keyed lb:{contest_id}, member = team_id, score = total fantasy points. Flink updates it on every point delta with ZINCRBY. Reads use ZREVRANGE 0 99 WITHSCORES for top-100 (O(log N + K)) and ZREVRANK key team_id for a user's own rank (O(log N)). At 12 M members, top-100 returns in ~1 ms.

Why Redis ZSET and not a DB ORDER BY? A SELECT … ORDER BY points DESC LIMIT 100 over 12 M rows is a sort across hundreds of MB — even with an index, it's tens of milliseconds at best and the index itself is constantly being updated. ZSET is a skip-list internally; insert/update is O(log N), and the sorted order is maintained continuously, so reads are free. Cluster mode shards by contest_id across the Redis ring. Solves: ranked-read latency at fan-out scale.

⑮ WebSocket Gateway

A tier of Node.js (or Netty) servers, each holding ~100 K open WebSocket connections. When the user opens the live leaderboard, the client opens a WS to this gateway and subscribes to contest:{contest_id}. Every 3–5 s, a tick job reads top-N + the user's rank from Redis ZSET and pushes the diff over the open socket. We use NLB (L4) in front of this tier, not ALB — see §6 for why.

Solves: push-based real-time updates. Polling every 5 s from 10 M users = 2 M req/sec hitting the API. Push-over-WebSocket collapses that to ~100 K open sockets per gateway node.

⑯ Push Notifications (FCM / APNs)

For users not actively in the app: "Your team is in the top 10!", "Contest about to lock in 10 min", "Match started — Kohli is in your team". Triggered by Flink on milestone events and by a scheduled service for deadline reminders.

Solves: re-engagement. A user who doesn't open the app during the match is a user who isn't retained for the next match.

Walkthrough 1 — Raj builds his team

Setting: It's 7:05 PM. The IPL match — Mumbai vs Bangalore — starts at 8:00 PM. Raj opens the Dream11 app, taps the match card, and starts building his fantasy team. He has a virtual ₹100 salary cap and must pick exactly 11 players from a pool of ~22 (the announced squads) under strict role quotas. This is the user-data plane of the system — high-frequency, bursty writes that absolutely cannot block, because at toss time we'll be doing 500K team saves per second. Here is what fires under the hood.

Step 1 — The match page loads

Raj taps the match card. The app fires GET /v1/matches/{id}/squad. Goes via CDN ② for the player headshots / static metadata, and via ALB ③ → API Gateway ④ → Team Service ⑤ for the dynamic bits: the live player pool, their credit values (Kohli ₹10.5, Bumrah ₹9.5, …), their recent-form stats.

Team Service reads the squad from a hot Redis cache (key squad:{match_id}, TTL 5 min, populated from Postgres at squad-announcement time). Why cache here? Because the squad is read by every user opening the match — 10 M reads at peak — but it changes maybe twice (squad announce, then any last-minute injury swap). A read-through cache absorbs 99.99% of the traffic.

Step 2 — Raj picks his 11 (client-side)

For the next 15 minutes Raj is fiddling. He taps Kohli (₹10.5 cr), then Rohit (₹10.0), then Bumrah, then Cummins… he reaches ₹98 cr and realises he has no wicket-keeper. He drops Cummins (₹9.5), adds Pant (₹10.0), now at ₹98.5 cr — fits the cap. He sets Kohli as captain (2× points) and Bumrah as vice-captain (1.5× points).

All of this is purely client-side. No server round-trip per pick. The app holds the squad + credits in memory and re-runs validation locally on every tap: salary check, role-quota check, captain/VC chosen. Server hits would (a) be wasteful at 200 M users × dozens of taps each, and (b) feel laggy.

Step 3 — Raj taps "Save Team"

Now the team goes to the server. The phone fires:

POST /v1/teams
{
  "match_id":  "mi-vs-rcb-2026-05-29",
  "players":   ["kohli", "rohit", "bumrah", "pant", ...],
  "captain":   "kohli",
  "vice_cap":  "bumrah",
  "client_ts": "2026-05-29T19:20:11Z"
}

→ ALB ③ → API Gateway ④ (JWT verified once, route to /teams/* target group) → Team Service ⑤. Note the client_ts — purely advisory; server time is the authority for the deadline check that's about to happen.

Step 4 — Validation (the gatekeeper)

Team Service runs four checks before it'll accept the save. Any fail returns 400 immediately — no Kafka, no Redis, no downstream noise:

Count — exactly 11 players.
Salary cap — sum of player credits ≤ ₹100. Looks up credits from the cached squad data.
Role quotas — 1 wicket-keeper, 3–5 batsmen, 1–4 all-rounders, 3–5 bowlers, at most 7 from any one team.
Captain / VC — both present, both in the 11, both distinct.

Why re-validate server-side when the client already did? Because the client can be hacked, reverse-engineered, or just buggy. A hostile user could POST a team worth ₹500 cr or with two captains. Server validation is the only authority — the client check is a UX optimisation, not a security control.

Step 5 — The deadline check (fairness, not just UX)

Now the most important check: is the match still pre-toss? Team Service holds match.start_time in a tiny in-process cache (refreshed every 30 s from Postgres). If server_now() ≥ start_time, the response is:

HTTP 423 LOCKED
{"error": "match_started", "message": "Edits are not allowed after the toss."}

Why this matters: if a user could edit their team after the first over, they'd already know who's bowling well, who's looking out of form, who's about to come on to bowl the death overs. That isn't fantasy — that's gambling fraud. The deadline check is the line between game and crime, and we enforce it strictly server-side. (§7 deep-dives the three-layer defense in depth: this check, the consumer-side drop, and the downstream timestamp guard.)

Step 6 — Publish to Kafka (the shock absorber)

Validation passed. Team Service does NOT write to Cassandra directly. It publishes to Kafka ⑩ topic team-saves, partitioned by user_id so a single user's edits stay ordered:

# Topic: team-saves · Partition key: user_id
{
  "user_id":   "raj",
  "team_id":   "team-987",
  "match_id":  "mi-vs-rcb-2026-05-29",
  "version":   7,
  "players":   ["kohli", "rohit", ...],
  "captain":   "kohli",
  "vice_cap":  "bumrah",
  "server_ts": "2026-05-29T19:20:11.243Z"
}

Why Kafka and not Cassandra directly? Because at toss-time we hit 500K saves/sec. Cassandra can sustain ~50K writes/sec per node — we'd need a 10-node cluster running flat out just for team saves, and any GC pause would back-pressure the API. Kafka can ingest 500K/sec on a 5-broker cluster without breaking a sweat — it's a sequential append to a log file. The durable Cassandra write happens asynchronously a few hundred ms later via a consumer.

Why partition by user_id? So a user's own edits stay in order. If Raj saves version 6 then version 7 in quick succession, both land on the same partition and get consumed in order. Two different users' saves can be processed in parallel because they're on different partitions.

Step 7 — Cache the current team in Redis

Right after the Kafka publish, Team Service writes the team to Redis:

SET current_team:raj:mi-vs-rcb-2026-05-29 {team_json} EX 86400

TTL = 24 hours (long enough to cover the match + post-match scoring). Why this cache? Because the moment the match starts, the scoring fan-out (Walkthrough 3) needs to know which players each user has — and reading that from Cassandra for 200 M users every match would be a thundering herd. Redis serves the lookup in < 1 ms. The Cassandra row is the source of truth; the Redis entry is the hot-path read cache.

Step 8 — Return 200 to Raj (optimistic ack)

Team Service returns 200 OK as soon as Kafka has acknowledged the publish — typically under 50 ms from tap to ack. The Cassandra write hasn't happened yet, but we don't wait for it.

Is this safe? Yes, because Kafka is durable — once the broker acks (with acks=all, written to all 3 in-sync replicas), the event won't be lost even if the entire Cassandra consumer pipeline dies. We can rebuild Cassandra from Kafka by replaying the topic. The user's experience is "save was instant"; the durable persistence catches up within a second.

Step 9 — Async: Cassandra Consumer materialises the row

A separate consumer service reads from team-saves and writes to Cassandra ⑬ teams table:

INSERT INTO teams (user_id, match_id, team_id, version, players, captain, vice_cap, server_ts)
VALUES (...);
-- Partition key: (user_id, match_id) · Clustering: version DESC
-- Latest version is always the first row in a partition scan

The consumer also does a second deadline check: if server_ts ≥ match.start_time, it drops the event silently. This catches the rare race where Team Service let an event through right at the deadline boundary due to clock skew (typical: 50–100 ms via NTP). Defense in depth — the same fairness rule enforced twice.

Step 10 — Raj edits, edits, edits

Over the next 50 minutes Raj swaps Pant for KL Rahul, then back to Pant, then changes Kohli's role from captain to vice-captain (and back). Each "Save" repeats steps 3–9. Each save bumps the version field and lands as a new row in Cassandra — we keep the full edit history for audit and dispute resolution.

At 7:59:30 PM Raj makes one last tweak. At 8:00:00 PM sharp the deadline ticks. At 8:00:01 PM Raj realises Bumrah is injured and tries to swap him out → Step 5 returns 423 LOCKED. His team is frozen at version 12, captain Kohli, vice-cap Bumrah. Now the team is locked for the entire match.

The "optimistic ack via Kafka" trick is the single most important pattern in this flow. Synchronous DB writes can't survive 500K/sec; Kafka can. By acking on the Kafka publish and materialising to Cassandra asynchronously, Team Service gets sub-50 ms response times during the toss-time spike while still guaranteeing durability. Redis carries the hot-path read; Cassandra carries the audit trail. Three datastores doing three different jobs — and none of them is on the critical synchronous path.

Walkthrough 2 — Raj joins a contest

Setting: It's 7:24 PM. The IPL match starts at 8:00 PM. Raj has built his fantasy team (Kohli as captain, Bumrah as vice-captain) and taps "Join" on a ₹49 Mega Contest with 12 million entries and a top prize of ₹1 crore. He has exactly ₹49 in his Dream11 wallet from a previous win. This is the transaction plane of the system — money is moving, so every step has to be ACID-safe. Here is what fires under the hood, step by step.

Step 1 — The tap leaves Raj's phone

The tap first goes to the CDN ② for any static asset the join screen needs (icons, contest banner). Static stuff never hits our origin — it is served from an edge POP in Mumbai about 15 ms away. The actual API call — POST /v1/contests/{id}/join — bypasses the CDN and goes straight to the ALB ③. ALB does TLS termination, picks a healthy API Gateway ④ node, and forwards.

Why ALB and not NLB here? This is a normal short-lived HTTP request, not a long-lived WebSocket. ALB gives us per-path routing (/wallet/* vs /contest/* → different target groups) and request-level metrics. NLB would work but throw away that L7 visibility.

Step 2 — Gateway: the bouncer

The API Gateway ④ does four jobs before it lets the request anywhere near the Contest Service:

JWT verification — checks Raj's signed token (issued at login). If expired, return 401 immediately. This protects every downstream service from doing auth work.
Rate limiting — Raj is allowed N joins per minute. Not about cost; about a buggy client retrying in a tight loop and double-charging him. The limit is keyed by user_id in Redis with a sliding window.
Schema validation — body must contain a valid team_id. Reject malformed requests early.
Routes to Contest Service ⑥ (gRPC internally for lower latency).

Step 3 — Idempotency: the "tap twice" problem

Raj's phone has a flaky 4G connection. He taps "Join", sees a spinner, gets impatient, taps again. Without protection, we would double-debit his wallet and try to enter him twice.

Contest Service ⑥ generates an idempotency key: SHA256(user_id + contest_id + team_id). Before doing anything, it checks Redis for this key:

First time → Redis says "not seen". Proceed. Cache the request fingerprint with TTL = 5 minutes.
Replay → Redis says "already processed, here is the result". Return that result without doing the work again.

This is why duplicate retries are safe — and it is the single most important pattern for any payment-adjacent API.

Step 4 — The wallet debit (the money-moving moment)

Now Contest Service calls Wallet Service ⑦: debit(user_id=raj, amount=49, txn_id=abc123). Wallet talks to Aerospike ⑨, which was picked specifically because it does strong-consistency single-record transactions with sub-millisecond latency at the volumes we hit. The operation is atomic on the record's primary node:

BEGIN
  READ balance for raj         → returns 49
  IF balance < 49 THEN FAIL "insufficient"
  WRITE balance = 0
  APPEND ledger {txn_id, type:"debit", amount:49, contest_id, ts}
COMMIT

Why not Postgres for the wallet? Postgres would also work — but at 100K concurrent debits during a contest deadline (8:00 PM, everyone joining at the last minute), Postgres row locks would queue up. Aerospike's CP mode handles ~200K ops/sec per node with single-digit-ms p99.

The ledger entry is critical. The balance is derivable from the ledger (sum of all credits minus debits), so the ledger is the source of truth. The balance field is just a denormalized cache for fast reads. If we ever suspect corruption, we re-derive from the ledger.

Step 5 — The contest entry write (Postgres, the relational anchor)

Wallet debit succeeded. Now Contest Service inserts into Postgres ⑧:

INSERT INTO contest_entries (user_id, contest_id, team_id, paid_amount, txn_id, joined_at)
VALUES ('raj', 'mega-49', 'team-987', 49, 'abc123', now());

A UNIQUE constraint on (user_id, contest_id, team_id) means Raj cannot enter the same contest twice with the same team. If a race condition somehow slipped past the idempotency check (different node, clock skew), the DB rejects the second insert at the storage layer.

On unique-constraint violation (Raj already joined): Contest Service has to undo the wallet debit. It calls Wallet: credit(user_id=raj, amount=49, txn_id=abc123-refund). The refund txn references the original — so the ledger now has a debit + a credit that net to zero. This is the saga pattern: distributed rollback when you cannot use a 2-phase commit across Aerospike and Postgres.

Step 6 — Fire the Kafka event (handoff to the stream plane)

With money debited and entry recorded, Contest Service emits to Kafka ⑩:

# Topic: contest-join · Partition key: contest_id
{
  "user_id":     "raj",
  "contest_id":  "mega-49",
  "team_id":     "team-987",
  "roster":      ["Kohli", "Bumrah", "Rohit", ...],
  "captain":      "Kohli",
  "vice_captain": "Bumrah",
  "joined_at":    "2026-05-28T19:24:11Z"
}

Notice the response to Raj has not gone back yet. But this Kafka publish is async and takes < 5 ms — it is not on the critical path.

Why Kafka here? Three reasons:

Decoupling — Contest Service should not know about leaderboards, push notifications, or analytics. It just announces "Raj joined" and walks away.
Replay — if our leaderboard Redis gets corrupted, we can rewind Kafka to the start of the contest and rebuild from scratch.
Multi-consumer fan-out — Flink reads it for the leaderboard, Push Notifications reads it to send "you're in!" pushes, Analytics reads it for dashboards. One topic, many consumers.

Step 7 — Flink seeds the leaderboard

Flink ⑫ consumes the contest-join event and runs a tiny job: "for every new join, add the team to the contest's sorted set at score 0":

ZADD lb:mega-49 0 team-987

That single Redis command puts Raj's team on the leaderboard at zero points, ranked equal-last alongside the other ~12 M teams. Redis's sorted set (ZSET) is a skip-list internally, so this O(log N) insert is microseconds.

Why not just have Contest Service write to Redis directly and skip Kafka + Flink? Because if Redis is down or slow during the join, you would either fail the join (bad UX — user already paid) or have an inconsistency (paid but not on board). By going through Kafka, the join always succeeds as long as Postgres + Aerospike are alive. The leaderboard seeding happens eventually — typically within 100 ms, but if Redis is degraded, it catches up when Redis recovers.

Step 8 — The HTTP response returns

Contest Service returns success up the chain: → API Gateway → ALB → Raj's phone. The full round trip is typically 250 ms end-to-end (~50 ms network + 150 ms processing + 50 ms response render). Raj sees: "You're in! Match starts in 36 mins."

Trace the planes (the key insight): Steps 4 + 5 (Aerospike + Postgres) are the transaction plane — synchronous, ACID, money-safe. Step 6 (Kafka) is the stream plane — async fire-and-forget. Step 7 (Redis ZSET) is the real-time plane — derived state, eventually consistent. If Kafka were down for 60 s, Raj's join still succeeds — the transaction plane is untouched. His name just shows up on the leaderboard 60 s later when Kafka recovers and Flink catches up. This is the entire point of separating the planes.

Walkthrough 3 — Kohli hits a six

Setting: It is 8:15 PM. Kohli is on strike, ball 12.3 in a death over. He smashes a six over deep midwicket. The stadium roars — and within 5 seconds, every team across India that picked Kohli sees their score jump. This is the most thrilling moment in the system because one event fans out to update ~60 million team scores. If you tried to do that synchronously, you would melt the database. Here is how we do not.

Step 1 — The ball-by-ball feed arrives

The official scoring partner (e.g. Sportradar or the BCCI feed) POSTs to our Score Ingestion ⑪ endpoint:

POST /v1/scores/match-{id}/event
Signature: hmac-sha256=...
{
  "ball":      "12.3",
  "batsman":   "Kohli",
  "bowler":    "Cummins",
  "runs":      6,
  "extras":    0,
  "wicket":    null,
  "timestamp": "2026-05-28T20:15:32Z"
}

The HMAC signature is verified against a shared secret — only the official partner can write to this endpoint. Anything else gets 401.

Step 2 — Deduplication (because feeds re-send)

Cricket feeds famously double-fire. The partner might re-send ball 12.3 if their network blipped, or send a correction after a third-umpire review. Ingestion keeps a 5-minute LRU cache of (match_id, ball_id, version) it has already seen.

New ball → process and remember.
Duplicate of an already-seen ball → drop silently, return 200.
Correction (same ball_id, higher version) → process, and mark the previous one as superseded so downstream Flink can issue a compensating delta.

After dedup, ingestion publishes to Kafka topic score-events, partitioned by match_id — all events for one match go to one partition, so Flink processes them strictly in order.

Step 3 — Flink Job A: scoring rules

Flink ⑫ Job A (Scoring) reads the event. It loads the cricket scoring strategy (a versioned rules config from a control-plane DB) and applies it:

# Kohli scored 6 runs
+6 fantasy points for the runs
+1 boundary bonus (any 4 or 6 gets a bonus)
if this is his 50th run total: +8 milestone bonus
if this is his 100th run total: +16 milestone bonus
Total delta: 7 points

It emits to topic player-point-deltas:

{"match_id":"...", "player":"Kohli", "delta":7, "reason":"six+bonus", "ball":"12.3"}

Why a separate job for scoring? Because the rules change — Dream11 tweaks scoring across formats (T20 vs ODI vs Test). By having a pure "event → point delta" job, the rules live in one place. If we mis-scored a ball, we can replay this job alone, not the whole fan-out.

Step 4 — Flink Job B: the fan-out (the magic)

This is the step that makes the system possible. Flink Job B (Fan-Out) reads the delta and asks: which teams picked Kohli, and at what multiplier?

It holds a broadcast state — a copy of "player → list of teams that picked them" — refreshed every 15 minutes from Cassandra ⑬. At match time, this broadcast is in every Flink subtask's memory. So when the Kohli delta arrives, the lookup is O(1) in-memory; no DB hit at runtime.

# In-memory lookup (broadcast state)
Kohli → [team-101 (C), team-102 (VC), team-987 (C=Raj), ... 60M entries]

For each team, Flink computes the multiplier and emits a per-team delta:

team-101  {captain: Kohli}      → delta = 7 × 2   = 14
team-102  {vice_captain: Kohli} → delta = 7 × 1.5 = 10.5
team-987  {captain: Kohli}      → delta = 7 × 2   = 14   # Raj
team-203  {regular pick}        → delta = 7 × 1   = 7
... 60 M entries

Why broadcast state and not query Cassandra per event? A single Cassandra query is ~5 ms. 60 M queries serially = 83 hours. Even at 10K QPS in parallel = 100 minutes. Broadcast state collapses this to a single in-memory hash lookup per delta — done in microseconds across 200 parallel Flink subtasks.

Flink's parallelism is the trick: Job B runs with parallelism 200. Each subtask owns a slice of teams (sharded by hash(team_id) % 200). For the Kohli event, the broadcast lets each subtask independently find its 300K teams and emit deltas. 60 M deltas ÷ 200 subtasks = 300K per subtask = sub-second.

Step 5 — Sharding the deltas to Kafka

The 60 M output deltas are written to Kafka topic team-point-deltas, partitioned by contest_id. This is important: it means all deltas affecting one contest's leaderboard land on the same partition — preserving ordering for that contest. Two consumers downstream pick this up in parallel.

Step 6 — Durable write to Cassandra

Sink 1: Cassandra ⑬. Each delta is appended:

INSERT INTO team_points (contest_id, team_id, ball_id, delta, total_after, ts)
VALUES (...);

The partition key is contest_id, so all of a contest's deltas live on one node. This is append-only — no updates, no deletes — Cassandra's perfect workload. We write every point delta and never overwrite.

Why durable Cassandra at all if Redis has the leaderboard? Three reasons:

Redis is a cache. If a Redis node dies, we restore the ZSET by replaying the Cassandra rows for that contest (single-partition scan, fast).
Audit trail. Players dispute scores. "I should have got 14 from Kohli's six, but I see 7." We can prove every point with the ledger.
Post-match reconstruction. The final leaderboard is computed by reading all rows for the contest and summing — Redis is just the live view.

Step 7 — Redis ZSET update (the leaderboard moves)

Sink 2: Redis ⑭. For each delta:

ZINCRBY lb:{contest_id} {delta} {team_id}

ZINCRBY atomically adds the delta to the team's score AND re-sorts the ZSET. Both happen in O(log N) — for a 12 M-member ZSET, that is ~24 comparisons. Redis Cluster shards the ZSETs across nodes by contest_id, so even at 60 M total deltas, they distribute across maybe 50 contests × the Redis cluster.

After this sink, the leaderboard is live and correct. Top-100 reads (ZREVRANGE 0 99 WITHSCORES) reflect Kohli's six immediately.

Step 8 — WebSocket Gateway pushes to clients

Meanwhile, Raj's phone has had a WebSocket open since 7:24 PM, subscribed to contest:mega-49. The WebSocket Gateway ⑮ runs a tick loop every 3 seconds:

for every active contest:
  top100 = ZREVRANGE lb:{contest_id} 0 99 WITHSCORES
  for every connected user in this contest:
    my_rank  = ZREVRANK lb:{contest_id} {user.team_id}
    my_score = ZSCORE   lb:{contest_id} {user.team_id}
    send WS frame: {top100, my_rank, my_score}

Why a tick loop and not push-per-delta? Because 60 M deltas in 1 second × 10 M live viewers = trillions of pushes. Instead, we batch: every 3 seconds we send each viewer one frame containing all the changes since the last tick. The user perceives "live"; the system survives.

Step 9 — Raj sees the magic

Raj's phone receives the WS frame at 8:15:35 PM — about 3 seconds after the ball was bowled. His UI animates:

Rank: 89,412 → 38,210 (up 51,202 places)
Score: 247 → 261 (+14, Kohli × 2 captain bonus)
A little confetti burst.

He cheers along with the stadium. The end-to-end latency from "leather hits willow" to "Raj's screen updates" was about 5 seconds — feed latency (1–2 s) + Kafka (~10 ms) + Flink processing (~500 ms) + Redis writes (~100 ms) + tick interval (up to 3 s).

Why fan-out at the stream layer, not at write time: if we tried UPDATE teams SET points = points + 14 WHERE player_includes('Kohli') synchronously on every six, the DB would melt — a 60 M-row write on a single statement, with row locks and index updates everywhere. Score lag would be minutes, not seconds, and one big match would knock the system over. By doing fan-out in Flink with a broadcast join + parallel sinks, we exploit horizontal compute: 60 M deltas distributed across 200 parallel Flink subtasks = 300 K deltas per subtask = sub-second. The DB and Redis just absorb the parallel writes from many sinks. Synchronous transaction plane for money, asynchronous stream plane for fan-out — the single most important architectural decision in the whole design.

Step 5

Database Per Service — Why This DB, Not That

This is the section the user asked for explicitly: "if you use Postgres, tell me why not MySQL or Oracle." Each datastore in the design solves a different problem; here is the reasoning for each pick, with the alternatives compared side-by-side.

5.1 Users, Contests, Entries → Postgres

The transactional, structured, relationship-heavy data — users, KYC, contests, contest entries, prize tiers — lives here. Reads dominate writes (~10:1) and the schema is highly relational (every contest entry FK-joins to a user, a contest, and a team).

Option	What it gives us	What it costs	Verdict
Postgres 16	True MVCC, real ACID, JSONB for flexible contest config, partial indexes, advanced index types (GIN for prize-tier JSON), logical replication, pg_partman for time-partitioning, pg_stat_statements for prod debugging, hot-standby replicas for read scaling.	Single-primary by default — multi-region writes need extensions (BDR) or Citus. Vacuum tuning required at scale.	PICKED
MySQL 8	Faster simple reads via clustered InnoDB primary-key layout. Bigger ops talent pool in India. Group Replication for multi-primary.	Weaker transactional isolation in default config (repeatable-read has phantom-read quirks). JSON column type is slower than Postgres JSONB. No partial indexes. Online schema changes require gh-ost or pt-online-schema-change.	VIABLE
Oracle	Mature, battle-tested, RAC for clustering.	Licensing cost is enormous at this scale (~$47K/core/year). Vendor lock-in. Slower ops cycle. Cloud-native tooling weaker.	SKIP
CockroachDB	Postgres wire protocol + multi-region SERIALIZABLE writes out of the box. Horizontal scaling without sharding logic in app.	Higher write latency (consensus per write). Smaller community. We don't yet need multi-region writes — single region (Mumbai) + DR replica is enough.	FUTURE
MongoDB	Schema flexibility, good aggregation framework.	Multi-document transactions are slow & have caveats. No real foreign keys. We have relational data — forcing it into documents adds work, not removes it.	SKIP

The deciding factor: Postgres's JSONB + partial indexes let us store the prize-tier config as flexible JSON (top 1% gets X, top 10% gets Y) and index into it. MySQL's JSON is a second-class citizen here. We also get logical replication for change-data-capture into our analytics warehouse without an extra Debezium connector. The cost is vacuum tuning — solvable with autovacuum + pg_repack runs during off-peak.

5.2 Wallet Ledger → Aerospike

200 K wallet ops/sec at p99 < 5 ms, durable to disk, strongly consistent. This is the hardest single requirement in the entire system.

Option	What it gives us	What it costs	Verdict
Aerospike	Hybrid memory architecture — indexes in RAM, data on NVMe SSD. Linear scaling, sub-ms ops, strong consistency (CP) mode, XDR (cross-datacenter replication). Battle-tested at this exact use case at PayPal, Snap, Adobe.	Niche skillset compared to Redis/Postgres. Commercial licence above 4 nodes. Operational tooling less polished than Postgres.	PICKED
Redis (with AOF)	Familiar, ms-latency, in-memory.	AOF fsync-every-write is slow; fsync-every-second can lose up to 1 s of writes on crash — unacceptable for money. Cluster mode doesn't support multi-key transactions across shards.	SKIP
DynamoDB	Managed, scales horizontally, transactional API.	Cost at 200 K ops/sec sustained is brutal — ~$60 K/month for the throughput alone. Multi-item transactions are 2× the cost. Vendor lock-in.	EXPENSIVE
Postgres (sharded)	Strong consistency, familiar.	Caps around 20 K writes/sec per primary. To get 200 K we'd need 10+ shards just for the wallet, and cross-shard transactions for the contest-join saga become painful.	SKIP
FoundationDB	True ACID across shards, sub-ms latency.	Schemaless KV — we'd build the ledger layer ourselves. Smaller community in India.	VIABLE

The deciding factor: Aerospike's combination of in-memory speed + on-disk durability + strong consistency is unique. Redis is fast but loses data on crash; Postgres is durable but slow; DynamoDB is both but expensive and AWS-locked. Aerospike is what PayU and PayPal use for their wallet workloads at India scale — and Dream11 publicly talks about using it for wallets too.

5.3 team_points Wide Table → Cassandra

An append-mostly log of point deltas for every team in every contest. Write-heavy (~1 M writes/sec at peak), read-light (only at match end + occasional cache rebuild). Per-contest queries.

Option	What it gives us	What it costs	Verdict
Cassandra	Write-optimised log-structured merge tree (LSM). Linear horizontal scaling. Tunable consistency (we use QUORUM on writes, ONE on reads). Partitioning by `contest_id` means contest-end queries hit a single partition.	Read-modify-write is awkward (no real transactions). Compaction tuning is an ops burden. Eventual consistency across replicas.	PICKED
MongoDB	Document model, secondary indexes, ad-hoc queries.	Write throughput collapses with secondary indexes & large collections. Sharding requires careful shard-key picking. Not built for append-mostly workloads.	SKIP
DynamoDB	Managed, auto-scales, similar wide-column model.	Cost at 1 M writes/sec is >$100 K/month. AWS lock-in.	EXPENSIVE
ScyllaDB	Cassandra-compatible API, C++ implementation, 2–10× higher throughput per node.	Smaller ecosystem. Some Cassandra-specific tooling doesn't work yet.	FUTURE
HBase	Wide-column, Hadoop ecosystem.	Operational complexity (HDFS + Zookeeper + RegionServer); slower than Cassandra for our pattern; weaker tunable consistency.	SKIP

The deciding factor: Cassandra's LSM-tree storage is purpose-built for write-heavy append workloads. Our access pattern — "lots of writes during the match, one big read per contest at match end" — fits its strengths perfectly. The fact that we control partitioning means the post-match leaderboard reconstruction for a 12 M-entry contest is a single-partition scan, not a fan-out.

5.4 Live Leaderboard → Redis (Sorted Sets)

The headline use case — see §9 for the deep dive. The decision here is short because the alternative ("SQL ORDER BY at scale") is simply not viable.

Option	What it gives us	What it costs	Verdict
Redis ZSET (Cluster)	Skip-list backed sorted set, O(log N) insert/update, O(log N + K) range read. ZINCRBY / ZRANGE / ZREVRANK / ZSCORE are all sub-ms. Cluster mode shards by hash slot of the key, so different contests live on different nodes naturally.	In-memory only — losing a node loses that contest's leaderboard until rebuild from Cassandra. Mitigation: 2× replication + periodic snapshot + Cassandra as recovery source.	PICKED
Memcached	Faster GET/SET than Redis.	No sorted-set primitive. We'd have to materialise top-N ourselves on every write — defeats the purpose.	SKIP
Elasticsearch	Sortable, has full leaderboard semantics.	Indexing latency is seconds, not ms. Heavy resource footprint. Built for search, not ranked counters.	SKIP
DragonflyDB	Redis-compatible API, much higher per-node throughput.	Newer; ZSET implementation slightly different in edge cases.	FUTURE

The deciding factor: Redis ZSET is a leaderboard data structure. No translation needed — it's a 1-to-1 fit. Every other choice would mean building a leaderboard on top of a primitive that wasn't designed for it.

Step 6

Load Balancer Choice — ALB vs NLB vs HAProxy

The user asked: "if you use a load balancer, tell me which one — ALB, ELB — and where it fits." Short answer: we use two different load balancers for two different traffic patterns. Here is the breakdown.

LB	Layer	Best for	Where we use it	Verdict
AWS ALB	L7 (HTTP/HTTPS)	Path-based routing, host-based routing, TLS termination, sticky cookies, WAF integration, gRPC support, content-aware health checks.	In front of the REST API tier — Team / Contest / Wallet services. We need path routing (`/team/*` → Team Svc) and HTTP-aware health checks.	PICKED for REST
AWS NLB	L4 (TCP/UDP)	Extreme throughput (millions of req/sec), preserves source IP, static IPs, ultra-low latency, ideal for long-lived TCP / WebSocket connections.	In front of the WebSocket gateway — millions of long-lived connections for live leaderboards. We don't need L7 inspection once WS is established, and NLB scales further with less overhead.	PICKED for WS
Classic ELB	L4/L7 (legacy)	Backwards compatibility.	Deprecated by AWS. Lower throughput than NLB, fewer features than ALB.	SKIP — legacy
HAProxy / Envoy (self-hosted)	L4 + L7	Full control, advanced traffic shaping, on-prem hybrid.	We'd run this if we were not all-in on AWS, or needed Envoy's xDS / service-mesh integration. For our case, the managed ALB+NLB save ops time without losing capability.	VIABLE OFF-CLOUD
Global Accelerator	Anycast routing	Sub-50ms TCP path optimisation globally.	We'd put this in front of both LBs for NRI users to reduce round-trip-time to Mumbai region. Optional add-on, not a replacement.	ADD-ON

Why two LBs and not one ALB everywhere? ALB does support WebSocket — but each WS upgrade keeps a connection open for the duration of the match. ALB charges per LCU (Load Balancer Capacity Unit), and active connections are a primary axis of that cost. At 10 M concurrent WebSockets, ALB costs roughly 3× NLB while offering no extra value after the HTTP upgrade completes. NLB is L4, charges per active flow at a flatter rate, supports millions of concurrent connections per LB, and preserves the source IP — useful for our rate-limiter to see the real client.

Routing rule on the ALB:

# ALB listener rules — evaluated top-down
if path starts /auth/*      → target-group: auth-svc      # priority 100
if path starts /wallet/*    → target-group: wallet-svc    # priority 110
if path starts /contest/*   → target-group: contest-svc   # priority 120
if path starts /team/*      → target-group: team-svc      # priority 130
if path starts /leaderboard/* → target-group: leaderboard-svc # priority 140
default                          → target-group: gateway-svc

Step 7

Team Creation & the Deadline Lock

The hardest fairness requirement: once the match starts, no team can be edited. This is the line between a game and gambling fraud — if a user could edit their team after seeing the first over, they'd already know who's in form. We enforce it strictly in the data path, not as a UI hint.

7.1 The write path

Team save is high-frequency (users tweak their teams dozens of times before locking in) and bursty (peak in the last 5 min before toss). We don't write directly to a DB — that would die. Instead the write path is:

sequenceDiagram actor U as User participant TS as Team Service participant V as Validator participant K as Kafka (team-saves) participant CON as Cassandra Consumer participant CA as Cassandra participant R as Redis (current_team:user_id) U->>TS: POST /team {match_id, player_ids[11], cap, vc} TS->>V: validate(salary, roles, count) alt valid & before deadline V-->>TS: OK TS->>K: publish TeamSaved {user_id, match_id, version, ts} TS->>R: SET current_team:{user_id}:{match_id} (TTL until deadline) TS-->>U: 200 OK (optimistic) K-->>CON: deliver event CON->>CA: INSERT INTO teams (durable record) else after deadline V-->>TS: 423 LOCKED TS-->>U: "Match has started — no edits allowed" end

Notice the trick: the response goes back to the user as soon as Kafka acknowledges the event. The actual write to Cassandra happens asynchronously through the consumer. This is the only way we survive a 500 K writes/sec toss-time spike — Kafka is way faster than Cassandra at ingestion, and the durable record catches up within a second.

7.2 Enforcing the lock — two layers

🛡️ Layer 1: server-side deadline check

Every team save reads match.start_time from a small in-memory cache (refreshed every 30 s from Postgres). If now() ≥ start_time, the validator returns 423 LOCKED and the event is rejected before it touches Kafka.

Why this layer: the client can't be trusted. It can lie about the time, modify the request payload, or replay an old save. Server time is the only authority.

🛡️ Layer 2: idempotent late-arrival filter

If an event somehow lands in Kafka past the deadline (e.g., it was in flight when the deadline ticked), the Cassandra consumer drops it. Cassandra's primary key includes the team version; only versions written before the deadline ever land.

Why two layers: defense in depth. A clock skew between server boxes (typical: 50–100 ms via NTP) could let a save sneak through layer 1 — the consumer's idempotent check catches it.

One more guard: the live leaderboard / score-fanout system reads each team's last-known-version stamped with a timestamp from Cassandra. Any team write with ts ≥ deadline is treated as if it never happened. Triple insurance — server-side check, consumer filter, downstream timestamp guard. Money + fairness don't trust one wall.

Step 8

Score Ingestion & Fan-Out

This is the most interesting pipeline in the system. A single ball event can mean updating points for 60 million teams in 5 seconds. Done naively, it's impossible. Done with Flink + broadcast joins + sharded sinks, it's routine.

flowchart LR P([Sportradar Feed]) IG[Score Ingestion] K1[Kafka
score-events] FA[Flink Job A
Scoring] K2[Kafka
player-point-deltas] FB[Flink Job B
Fan-Out join] K3[Kafka
team-point-deltas] CS1[Sink 1
Cassandra writer] CS2[Sink 2
Redis ZSET updater] CA[(Cassandra
team_points)] RZ[(Redis ZSET
leaderboards)] P -- webhook --> IG IG --> K1 K1 --> FA FA --> K2 K2 --> FB FB --> K3 K3 --> CS1 --> CA K3 --> CS2 --> RZ style P fill:#4a90d9,stroke:#4a90d9,color:#fff style IG fill:#3cbfbf,stroke:#3cbfbf,color:#000 style FA fill:#3cbfbf,stroke:#3cbfbf,color:#000 style FB fill:#3cbfbf,stroke:#3cbfbf,color:#000 style K1 fill:#3cbfbf,stroke:#3cbfbf,color:#000 style K2 fill:#3cbfbf,stroke:#3cbfbf,color:#000 style K3 fill:#3cbfbf,stroke:#3cbfbf,color:#000 style CS1 fill:#38b265,stroke:#38b265,color:#fff style CS2 fill:#38b265,stroke:#38b265,color:#fff style CA fill:#ec5d8a,stroke:#ec5d8a,color:#fff style RZ fill:#e05252,stroke:#e05252,color:#fff

8.1 The broadcast join — the magic trick

Job B (Fan-Out) needs to answer: "given that player X just scored Y points, which teams have player X?" The naive way is to query a DB on every event — but that's 12 K queries/match against 150 M rows. The Flink way is a broadcast state:

Once per 15 minutes, a side job dumps "for each match, for each player, which team_ids picked them" from Cassandra and publishes it onto a Kafka topic player-team-index.
Flink Job B subscribes to player-team-index as a broadcast stream. Every Flink subtask keeps a local copy of the full player→team_ids map in RAM.
Job B's main stream is player-point-deltas. For each delta, it looks up the team_ids in the local broadcast state — O(1), no network call — and emits one delta per team.
Output deltas are keyed by contest_id, partitioned across Kafka, and consumed by the sinks in parallel.

Why broadcast state and not a Cassandra lookup per event? A Cassandra lookup is ~5 ms. 12 K events × 60 M teams average lookup volume = unworkable. The broadcast state pre-loads the data so the lookup is in-process memory. Trade-off: the map is up to 15 min stale, so team edits in the last 15 min are processed off a slightly outdated index — but team edits are locked at toss time anyway, so the index is fresh enough by definition.

8.2 Captain & vice-captain multipliers

Each team picks one captain (2× points) and one vice-captain (1.5× points). The multiplier is applied during fan-out, not at the score event — because the multiplier is per-team, not per-player. Job B's broadcast map contains player_id → [(team_id, role), …] where role ∈ {regular, vc, c}, and emits the appropriate delta per team.

8.3 Idempotency & exactly-once

Flink + Kafka with the transactional producer gives us exactly-once across Kafka topics. But the sinks (Cassandra, Redis) need to be idempotent. Each delta carries a unique (match_id, ball_id, team_id) tuple. Cassandra writes use this as a clustering key — duplicate insert just overwrites. Redis ZINCRBY isn't naturally idempotent, so we wrap it with a Lua script that records the processed (ball_id, team_id) in a small per-contest set; duplicate deltas are no-ops.

Step 9

Live Leaderboard — Redis ZSET Deep Dive

The leaderboard is the user-facing payoff of the whole system. It must rank a 12 M-entry contest in real-time, show the user their own rank, and stream updates every few seconds. Every byte of every choice here is forced by Big-O.

9.1 The data structure

# One ZSET per contest. Sharded across Redis Cluster by contest_id.
ZADD     lb:c_8421  0    t_91123   # seed at join
ZINCRBY  lb:c_8421  7    t_91123   # Kohli six → +7
ZREVRANGE lb:c_8421 0 99 WITHSCORES   # top-100 → ~1ms at 12M members
ZREVRANK lb:c_8421 t_91123             # my rank → O(log N) ~100µs
ZSCORE   lb:c_8421 t_91123             # my score → O(1)

Internally, Redis ZSET is a skip-list keyed by score (with a hash side-index for O(1) ZSCORE). Insert/update is O(log N); the top-K read is O(log N + K). At N = 12 M, log₂(N) ≈ 24 — fast.

9.2 Sharding strategy

Each ZSET lives on one shard (Redis Cluster doesn't split a single key across nodes). So the question is: does a single contest's ZSET fit on one node?

A 12 M-entry ZSET ≈ 12 M × (8-byte member + 8-byte score + skip-list overhead) ≈ 1.5 GB. A 64 GB Redis node holds ~40 of these. We have ~10 K contests/match — across the cluster we need ~250 nodes worth. Sharded by CRC16(contest_id) mod 16384, this distributes naturally across the cluster.

Hot contest hazard: the IPL final Mega Contest has 12 M entries — it's a hot shard. We mitigate by (a) putting the largest 5 contests on their own dedicated shards (manual placement via hash tags), (b) keeping read replicas of those shards for ZRANGE queries.

9.3 Read patterns the user actually sees

🏆 Top-100

Shown when you scroll to top. ZREVRANGE lb:c_X 0 99 WITHSCORES — single round-trip, ~1 ms, served from a read replica during heavy load.

📍 My rank + neighbours

Show me my rank ±5. ZREVRANK then ZREVRANGE rank-5 rank+5. ~200 µs total.

📈 Rank history

Show how my rank moved over the match. Not from Redis — sampled every 30 s into a small time-series in Cassandra. Cheap and recovered on demand.

9.4 The WebSocket tick loop

Each WebSocket Gateway node runs a tick every 3 seconds. For every active subscription (user_id, contest_id) on that node, it reads (top-N + user's rank) from Redis in pipelined batches (1 batch = 100 subscriptions = 200 ZSET ops), computes a diff against the last frame sent, and pushes the diff over the socket. A typical frame is < 200 bytes — sustaining 10 M concurrent WS at 3-s ticks is ~7 MB/s aggregate per gateway node.

Why 3 seconds, not real-time? A six can produce ~60 M Redis updates. Pushing the new top-N to every interested socket on every six would saturate the gateway. 3-second batching trades a tiny visible lag for tractable network usage — users perceive it as "live" because cricket events are bursty (one event every few seconds anyway, not continuous).

Step 10

Contest Engine & Joining

The most failure-mode-heavy operation in the system. A "join" must be atomic across two different storage systems (Aerospike wallet + Postgres entries) under retry-heavy mobile networks. We use the saga pattern with idempotency keys.

10.1 The saga

sequenceDiagram actor U as User participant CS as Contest Service participant ID as Idempotency Store (Redis) participant W as Wallet Service participant AS as Aerospike participant PG as Postgres (contest_entries) participant K as Kafka U->>CS: POST /contest/join {contest_id, team_id, idempotency_key} CS->>ID: GETSET key=ik:user_X result=PENDING NX alt key already had result ID-->>CS: existing result CS-->>U: return cached result (idempotent retry) else fresh request CS->>W: debit(user, fee, txn_id=ik) W->>AS: balance -= fee, write ledger entry AS-->>W: OK W-->>CS: OK CS->>PG: INSERT contest_entries (txn_id=ik UNIQUE) alt INSERT conflict (e.g., contest full) PG-->>CS: 409 CS->>W: credit(user, fee, txn_id=ik+"-refund") CS->>ID: SET ik:user_X = FAILED CS-->>U: 409 Conflict else success PG-->>CS: OK CS->>K: publish ContestJoined (for leaderboard seeding) CS->>ID: SET ik:user_X = OK CS-->>U: 200 OK end end

10.2 Why this design

› Idempotency key comes from the client (UUID generated at tap time). The same key replayed (network retry, app crash mid-request) returns the same result without re-debiting.
› Compensating action — if the Postgres insert fails (contest full, duplicate join), we credit the wallet back with the same txn_id suffixed -refund. Aerospike's idempotency dedups this if it's retried.
› No distributed transaction — we avoid 2PC because the participants (Aerospike + Postgres + Kafka) don't share a transaction manager and the latency cost of XA would be punitive at 200 K joins/sec.
› Capacity check — contest size limit (max_entries) is enforced in Postgres with a CHECK-like constraint via a partial unique index + a row-count counter cached in Redis (precomputed, decremented atomically before INSERT, restored on rollback).

What about cross-contest atomicity? A user can join multiple contests for the same match. Each join is its own saga — they don't need to be atomic with each other. If one fails the wallet only reflects the successful ones, which is exactly right behaviour.

Step 11

Wallet & Payments

The wallet is two things: (a) a high-throughput counter for in-app moves (contest entry, prize credit), (b) a regulated payment system for external moves (deposit via UPI / cards, withdraw to bank). The split between Aerospike (a) and a more careful flow (b) is intentional.

⚡ Internal moves — Aerospike

Contest entry debits, prize credits. 200 K ops/sec at peak. Implemented as a double-entry ledger: every user-account debit has a matching contest-escrow credit. Sums across the system must equal zero (auditable invariant).

Aerospike record per user: {balance, version, last_txn_id}
Atomic update via Aerospike CAS (compare-and-swap on version)
Ledger entries stored in Cassandra for audit (append-only)

🏦 External moves — Saga + Razorpay / UPI

Deposits go through Razorpay (or BillDesk, PayU); withdrawals settle to user's bank via IMPS/NEFT. These flow through the payment-system saga (see our Payment System HLD) — they're slower (seconds, not ms) but require KYC + RBI compliance.

Deposit: user → Razorpay → webhook → wallet credit
Withdraw: KYC check → request to bank → settlement callback → wallet debit
Reconciliation: nightly job matches our ledger against Razorpay's settlement file

Why a separate wallet store at all? Mixing wallet rows into Postgres alongside contest data creates a hot table — every contest join hits both. Separating wallet into Aerospike lets each store scale on its own access pattern, and lets the wallet team operate independently (different on-call, different SLAs).

Step 12

Caching Strategy — Five Tiers

Caching is what turns a system that would melt under load into one that hums. We have five distinct cache layers, each with a different invalidation strategy.

Tier	What it caches	Tech	TTL / Invalidation
1. CDN edge	Player photos, app bundle, match thumbnails	CloudFront	1 day–7 days; purge on rare content change
2. CDN dynamic	Player pool JSON, upcoming matches list	CloudFront w/ short TTL	60 s; or push-purged when admin updates
3. Service local	Player metadata (Caffeine in-JVM)	Caffeine / in-process LRU	5 min; bounded size (10 K entries)
4. Redis hot row	User profile, contest config, current team	Redis (separate cluster from leaderboard)	Write-through; invalidate on update
5. Read replica	Postgres reads (browse contests, team history)	Postgres standby replicas (3×)	Async replication; ~100 ms lag

Cache stampede protection: when a popular contest cache expires, we don't want 10 000 servers all hitting the DB at once. Each service uses request coalescing (singleflight) — only one request goes to the DB; the others wait on the same in-flight future. Combined with a jittered TTL (e.g. 300 s ± 30 s) so different cache keys don't expire at exactly the same moment.

Step 13

Trade-offs & Interview Talking Points

Decision	Alternative	Why this choice
Three-plane split (txn / stream / real-time)	One monolith with one DB	Each plane has different consistency & latency targets — forcing them onto one store either crushes the DB or makes everything strongly consistent at the cost of latency.
Aerospike for wallet	Postgres + heavy sharding	200 K wallet-ops/sec at sub-ms with durability — Postgres can do durability but not the throughput; Redis can do throughput but loses writes on crash.
Redis ZSET for leaderboards	SQL `ORDER BY` over a contest table	SQL sort over 12 M rows is hundreds of ms even with index; ZSET reads top-100 in 1 ms by construction. Trade-off: leaderboard not durable — Cassandra is the source of truth for recovery.
Cassandra for team_points	DynamoDB	1 M writes/sec at DynamoDB's per-write pricing is >$100 K/month. Cassandra at the same rate is operational cost only.
ALB for REST, NLB for WebSocket	One ALB for everything	ALB charges per active LCU which makes 10 M long-lived WS connections ~3× the cost of NLB; NLB does L4-only which is exactly what we need post-upgrade.
Flink (event-at-a-time) over Spark Streaming (micro-batch)	Spark Streaming	End-to-end 5-s SLA can't afford the multi-second micro-batch latency Spark introduces. Flink also has cleaner exactly-once via Kafka 2PC.
Kafka for event backbone	SQS / RabbitMQ / Kinesis	SQS lacks multi-consumer replay; RabbitMQ caps lower per-broker; Kinesis is AWS-locked and shard-rebalance is painful. Kafka's broker throughput + replayable log fits exactly.
Synchronous response on team save (before Cassandra write)	Wait for Cassandra durability	Cassandra at peak ingest is too slow for sub-300ms response. Kafka durability is enough — even if Cassandra is briefly down, the event is safe on Kafka and replays.
15-min refresh for player→teams broadcast state	Real-time DB lookup per event	Real-time lookup is impossible at 12K events × 60M teams. 15-min staleness is fine because team edits are locked at toss anyway.
Push leaderboard updates every 3 s	Push on every score event	A six produces 60M ZSET updates — pushing them all live would saturate the gateway. 3-s tick batches updates and gives users a "live" feel without melting the network.

Step 14

Interview Q&A — The Questions That Get Asked

How do you guarantee a user can't edit their team after the match starts?

Three layers of enforcement. (1) Server-side check in the Team Service against match.start_time on every save; rejection returns 423 LOCKED. (2) The Cassandra consumer drops any event with timestamp ≥ deadline — handles clock skew between app servers. (3) The score-fanout pipeline reads each team's last version at or before the deadline timestamp — so even if a write somehow lands, it's never used for scoring. Defense in depth, because cheating here is a regulatory + trust failure.

A user reports their wallet was debited but they aren't in the contest. What happened & how do you fix?

Almost certainly the saga's compensating refund didn't fire — likely because Aerospike was briefly unreachable when the Postgres insert failed. Solution: (1) The Contest Service writes a SagaState row to Postgres before kicking off each step, so on restart a recovery worker can detect "wallet debited but no entry & no refund" and replay the compensating action. (2) Idempotency keys make the replay safe — same txn_id+"-refund" hits Aerospike's dedup and credits exactly once.

What happens if Redis loses a contest's leaderboard ZSET (node crash)?

Two-layer recovery. (1) Redis Cluster replicates each master to 2 slaves; a slave is promoted in seconds. Most node failures are invisible. (2) If we lose master + slaves simultaneously, we rebuild the ZSET from Cassandra — a single-partition scan of team_points WHERE contest_id = ? returns all team_id+score pairs, which we ZADD back. For a 12 M-entry contest this takes ~30 seconds with parallel chunks. During rebuild, leaderboard reads return "loading" to clients.

Why not use a single Postgres with read replicas for everything?

Throughput math. Wallet alone is 200 K ops/sec. A Postgres primary tops out around 20 K writes/sec. Read replicas don't help with writes. Even with 16 shards we'd hit limits on cross-shard transactions for the join saga. Aerospike is purpose-built for the throughput; Postgres for the relational invariants. Use each where it earns its place.

How do you handle Sportradar sending the same ball event twice?

Score Ingestion maintains a 5-minute LRU of (match_id, ball_id, sequence). Duplicates are dropped before they hit Kafka. Even if a duplicate sneaks through, Flink's exactly-once via Kafka 2PC ensures each event is processed once into the output topics, and the sinks themselves are idempotent (Cassandra primary key dedup, Redis Lua-wrapped ZINCRBY with a "processed" set).

Why ALB for REST but NLB for WebSocket — and why not just use ALB everywhere?

ALB does support WebSockets. The issue is cost & value at our connection scale. ALB charges per LCU and "active connections" is one of the LCU dimensions — 10 M long-lived WS connections drive LCU sky-high. NLB charges per active flow at a flatter rate and is L4 (no per-packet HTTP inspection), so it scales further at lower cost. The L7 features ALB offers (path routing, header inspection) are useless after the WebSocket upgrade completes — the connection is just a TCP stream from then on.

How would you scale this 10× — say 100M concurrent users?

Three moves. (1) Region-shard by user geography — Mumbai region for west India, Bengaluru for south, etc. Each region runs an independent stack with its own DB shards. (2) Move Postgres to CockroachDB or AlloyDB for multi-region writes without app-level sharding logic. (3) Add a second Flink job tier specifically for the top-3 hot contests, with their own Kafka partitions and dedicated Redis shards — isolation prevents one viral contest from starving others.

Walk me through what happens if Kafka has a 60-second outage during a live match.

Score events get backlogged at the producer (Score Ingestion) — Kafka producers buffer up to a configurable limit. Team saves return success to the user but the Cassandra-write consumer is paused. Leaderboard updates pause — users see stale ranks. Money is unaffected — wallet ops go through Aerospike + Postgres directly, not Kafka. When Kafka recovers, consumers catch up in seconds (each Flink subtask processes thousands of events/sec). The hot-path that matters most (wallet, contest joins) keeps working; the experiential part (live leaderboard) gracefully degrades and recovers. This is exactly why the three-plane split exists.

Step 15

Production Checklist

Three-plane split (transaction / stream / real-time) — each with its own storage
Aerospike wallet with idempotency keys + double-entry ledger sum-to-zero check (alerts on drift)
Postgres for users / contests with logical replication into the analytics warehouse
Cassandra for team_points partitioned by contest_id (single-partition recovery for leaderboard rebuild)
Kafka topics partitioned by the right key per topic (user_id for team-saves, contest_id for contest-joins, match_id for score-events)
Flink jobs with exactly-once via Kafka transactional producer + idempotent sinks
Redis ZSET per contest, sharded by contest_id, with replication + Cassandra-backed rebuild path
ALB (L7) for REST API tier with path-based routing; NLB (L4) for WebSocket gateway
API Gateway in front of services for JWT auth + per-endpoint rate limits
CDN (CloudFront) for static assets + short-TTL dynamic data (player pool JSON)
Five-tier caching with stampede protection (singleflight + jittered TTL)
Saga + idempotency for contest join; SagaState table for recovery worker
Triple-layer deadline lock (server check + consumer filter + downstream timestamp guard)
Broadcast state in Flink for player→teams index (15-min refresh) to avoid per-event DB lookups
3-second tick loop on WebSocket Gateway pushing diffs, not full snapshots
Per-region failover path (active-passive across Mumbai & Hyderabad)
Reconciliation jobs nightly: wallet ledger sum-to-zero, contest entries vs leaderboard count, wallet vs Razorpay settlement file

One last test: for every box in your diagram, ask "what would break tomorrow if I removed this?" If you can't answer in a sentence, it doesn't belong. Every component on this page passed that test.

Dream11 — Fantasy Sports at IPL Scale