Payment System (Razorpay / PayU)

Read this with the framework in mind

This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.

Step 1

What is a Payment System?

Priya is checking out at 14:02:06 IST on a Tuesday. The cart says one pair of running shoes, ₹6,499. She taps "Place order". Within a few seconds, four things must happen and they must happen together: her HDFC credit card gets charged ₹6,499, the merchant's escrow balance goes up by ₹6,499, the inventory drops by one pair, and a confirmation SMS lands on her phone. If any of those four steps fails — and the rest succeed — somebody loses money. If we charge her card but never tell the merchant, Priya paid for shoes that won't ship. If we tell the merchant but the charge fails, the merchant ships shoes we never collected for. If Priya retries because her Jio fibre blinked, we must not charge her twice.

That's what a payment system is: a piece of software whose entire purpose is moving money between three parties — the buyer's bank, the merchant's bank, and the platform's escrow account — without ever creating, destroying, or duplicating a single paisa. Razorpay, PayU, Cashfree, Paytm all solve this same problem at scale; Stripe and Adyen do it globally.

The two questions that drive every design decision below: (1) How do we make a multi-step money transfer atomic across systems we don't control (card networks, banks, our own DB)? (2) How do we guarantee that retries — from network drops, server crashes, double-clicks — never result in a double-charge or a leak?

Why this is harder than a normal CRUD app: in a normal app, a failure means "user sees an error, retries, no harm done". In a payment system, a failure between the first step (card charged) and the second (merchant credited) means real money has left Priya's bank and is sitting in nobody's account. That is not just a bug; it is an RBI-visible incident. So the architecture is built so that every such in-between state is either driven to completion or undone, never left stranded.

Step 2

Requirements & Goals

Before drawing a single box, pin down what the system must do — and explicitly what it does not. In an interview, asking these questions is half the score.

✅ Functional Requirements

Charge cards — accept a payment from a customer's card / wallet / bank
Refunds — partial or full reversal of a previous charge
Split payments — one charge, money goes to multiple recipients (marketplace use case)
Recurring subscriptions — auto-charge on a schedule
Payouts to merchants — move escrow money to merchants' bank accounts on a schedule
Multi-currency — accept INR, USD, EUR, etc.; FX conversion when needed (RBI's LRS / FEMA rules apply for cross-border)

⚙️ Non-Functional Requirements

ACID-strict — no money ever lost or duplicated, period
Idempotent — every mutation safe to retry, same result every time
Highly available — payment downtime is revenue lost forever
Sub-second response — Priya cannot wait 5 seconds at the checkout page (plus OTP time for > ₹5,000 txns)
Audit-ready — every paisa traceable to a state change with a timestamp (RBI inspectors can demand the trail on 24-hr notice)
PCI-DSS & RBI compliant — card data handled per industry regulations, payment data localised in India per RBI's 2018 directive

🚫 Out of Scope

Card-network internals — we integrate with Razorpay / Visa / Mastercard / RuPay; we do not reimplement them
UPI / NPCI rails — covered in a separate UPI HLD; this page focuses on the card-payment flow
Fraud-model training pipeline — we consume the model, not build it

The non-functional requirements are the hard part. Charging a card via Razorpay's API is a 5-line problem. Making the charge survive a network drop mid-flight, never double-charge on retry, balance to the paisa against Razorpay's daily settlement file, and stay PCI-DSS + RBI-compliant — that's the architecture.

Step 3

Capacity Estimation & Constraints

Numbers are not optional in HLD. They drive sharding, ledger sizing, and how big our database needs to be. Let's pick a Razorpay-like mid-size scale for a card-payments platform serving Indian merchants.

Traffic estimates

Assume 1 million transactions per day, average ticket size ₹4,000, peak load is Flipkart's Big Billion Days / Amazon's Great Indian Festival at roughly 40× the daily average concentrated into a few hours.

Metric	Value	Basis
Avg TPS	~12 TPS	1M / 86,400 s
Peak TPS	~500 TPS	Big Billion Days spike
GMV / day	~₹400 Cr	1M × ₹4,000 avg
API latency	p99 < 1s	End-to-end (excl. OTP)

Storage estimate (5-year regulatory retention)

Each transaction record (transaction + payment + ledger entries + audit log) is roughly 1 KB across the relevant tables. 1M × 1KB = 1 GB/day. Over 5 years (the typical regulatory retention window for financial records): ~2 TB total. With indices, replicas, and audit logs: ~6 TB provisioned.

Ledger entry volume

Every transaction writes at least 2 ledger entries (debit + credit), often more for split payments and FX. Realistic average is 3 entries per transaction. 1M × 3 = 3M ledger rows/day = ~1B rows/year. This is the dominant table by row count and drives the partitioning strategy in §14.

Metric	Value	Why it matters
Avg TPS	`12/s`	Drives base sizing — small box can do this
Peak TPS	`500/s`	Drives autoscale ceilings & rate-limit budgets
GMV / day	`₹400 Cr`	Defines the cost-of-error — every minute of downtime is ~₹28 lakh of failed checkouts
5-yr storage	`2-6 TB`	Forces partitioning of the ledger by account_id
Ledger rows / yr	`~1 B`	Append-only — no UPDATE, ever; only INSERT
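The estimates above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch using only the inputs stated in this section (1M transactions/day, ₹4,000 average ticket, 40× peak, ~1 KB per transaction, 3 ledger rows per transaction, 5-year retention); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 24 * 60

tx_per_day = 1_000_000
avg_ticket_rupees = 4_000
peak_multiplier = 40
bytes_per_tx = 1_000          # ~1 KB across all tables
ledger_rows_per_tx = 3
retention_years = 5

avg_tps = tx_per_day / SECONDS_PER_DAY                  # ~11.6, rounded to 12 in the table
peak_tps = avg_tps * peak_multiplier                    # ~463, rounded up to 500
gmv_rupees = tx_per_day * avg_ticket_rupees
gmv_crore = gmv_rupees / 1e7                            # 1 crore = 10^7 rupees
downtime_lakh_per_min = gmv_rupees / MINUTES_PER_DAY / 1e5   # 1 lakh = 10^5 rupees
raw_storage_tb = tx_per_day * bytes_per_tx * 365 * retention_years / 1e12
ledger_rows_per_year = tx_per_day * ledger_rows_per_tx * 365

print(f"avg TPS ~{avg_tps:.1f}, peak TPS ~{peak_tps:.0f}")
print(f"GMV/day ~Rs {gmv_crore:.0f} Cr, downtime ~Rs {downtime_lakh_per_min:.1f} lakh/min")
print(f"raw 5-yr storage ~{raw_storage_tb:.2f} TB, ledger rows/yr ~{ledger_rows_per_year:,}")
```

Note that the per-minute downtime figure is an average over the day; during a sale-day peak the cost per minute would be proportionally higher.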

Step 4

System APIs

Three mutating endpoints carry the bulk of the value, alongside one read endpoint and webhooks for the async events the merchant cares about. Note that the Idempotency-Key header is mandatory on every mutation: this is the contract that lets clients retry safely.

REST API surface

// Create a payment — mutation, requires Idempotency-Key
POST /v1/payments
Headers: { "Idempotency-Key": "1f3b9c2a-..." }
{
  "amount":            649900,         // in paise (₹6,499)
  "currency":          "INR",
  "payment_method_id": "token_abc123", // Razorpay/RBI-mandated token, no PAN
  "customer_id":       "cus_priya42",
  "merchant_id":       "mer_myntra99",
  "description":       "Order #5512 — running shoes"
}
→ 201 Created  { "id": "pay_...", "status": "succeeded", ... }

// Refund — mutation, requires Idempotency-Key
POST /v1/refunds
Headers: { "Idempotency-Key": "..." }
{ "payment_id": "pay_...", "amount": 649900, "reason": "requested_by_customer" }
→ 201 Created

// Read a payment — safe, no key needed
GET /v1/payments/:id
→ 200 OK { ... }

// Payout to a merchant's bank — IMPS/NEFT/RTGS, requires Idempotency-Key
POST /v1/payouts
Headers: { "Idempotency-Key": "..." }
{ "merchant_id": "mer_myntra99", "amount": 5000000, "currency": "INR",
  "mode": "IMPS",      // IMPS < ₹5L instant, NEFT batched, RTGS > ₹2L same-day
  "vpa":  "myntra@hdfc" // or beneficiary ifsc + account_no
}
→ 201 Created

// Webhooks — async events fired to merchant's HTTPS endpoint
POST → merchant.example.com/webhook
{ "type": "payment.succeeded", "data": { ... } }
   // Retried with exponential backoff for up to 3 days on non-2xx

Why Idempotency-Key is mandatory on every mutation: Priya's phone might lose signal right after sending the request — Jio's 4G in her area is patchy. Her browser auto-retries. Without an idempotency key, the second request is indistinguishable from a fresh charge, and Priya pays ₹12,998 for one pair of shoes. With the key, the second request is recognized as a replay, the original result is returned, and she pays ₹6,499 once. The key is a UUID generated by the client and included in the request header — analogous to a receipt number on a paper invoice.

Tokens, not PANs: the API never accepts raw card numbers (the PAN, or Primary Account Number, is the 16-digit number on the front of the card). RBI's 2022 tokenization mandate makes this a legal requirement rather than just best practice: Indian merchants are forbidden from storing PANs at all. The client SDK posts the card directly to Razorpay's hosted iframe and gets back an opaque token_... reference. Our backend only ever sees the token. This is the single biggest PCI-DSS scope reducer (more in §12).

Step 5

Database Schema

A payment system has four core tables, and the relationships matter enormously. Account represents any party that holds a balance — customer, merchant, the platform's own escrow, the platform's revenue. Transaction is a single business event ("Priya pays ₹6,499 to Myntra"). Payment attaches the payment-method specifics. LedgerEntry is the append-only double-entry record — the source of truth for every paisa. We will use Postgres with strict serializability here; the trade-offs against NoSQL are covered in §15.

erDiagram
    ACCOUNT {
        string id PK
        bigint balance_paise
        string currency
        string customer_id FK
        string type
    }
    TRANSACTION {
        string id PK
        string idempotency_key UK
        string source_account_id FK
        string dest_account_id FK
        bigint amount_paise
        string status
        timestamp created_at
    }
    PAYMENT {
        string id PK
        string transaction_id FK
        string payment_method_id
        string status
        string error_code
    }
    LEDGER_ENTRY {
        string id PK
        string account_id FK
        string transaction_id FK
        bigint amount_paise
        string entry_type
        timestamp created_at
    }
    ACCOUNT ||--o{ TRANSACTION : "source"
    ACCOUNT ||--o{ TRANSACTION : "dest"
    TRANSACTION ||--|| PAYMENT : "has"
    TRANSACTION ||--o{ LEDGER_ENTRY : "produces"
    ACCOUNT ||--o{ LEDGER_ENTRY : "owns"

Why each table looks the way it does

🔑 `idempotency_key UNIQUE` on Transaction

The single most important constraint in the schema. Even if the Redis idempotency cache is down or evicted, the database will reject a duplicate insertion with a unique-constraint violation. This is your last line of defense against double-charges, and it is enforced at the storage layer, not the application, meaning even a buggy application server cannot bypass it.
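To see the storage-layer guarantee in action, here is a minimal sketch that uses Python's built-in SQLite as a stand-in for Postgres. The table name `txn` and the helper `insert_txn` are illustrative, not part of the schema above; the point is only that a UNIQUE column rejects the second insert regardless of what the application does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE txn (
        id              TEXT PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        amount_paise    INTEGER NOT NULL
    )
""")

def insert_txn(tx_id, idempotency_key, amount_paise):
    """Return True if the row was inserted, False if the key was already used."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO txn (id, idempotency_key, amount_paise) VALUES (?, ?, ?)",
                (tx_id, idempotency_key, amount_paise),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this is a replay, not a new charge.
        return False

first = insert_txn("tx_001", "key-1", 649900)
replay = insert_txn("tx_002", "key-1", 649900)   # same key, different row id
row_count = conn.execute("SELECT COUNT(*) FROM txn").fetchone()[0]
```

On a replay, a real handler would look up the existing row by its key and return the original result rather than simply reporting failure.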

📒 `LedgerEntry` is append-only

No UPDATE, no DELETE. Ever. Reversing a transaction does not delete the original entries — it adds new entries that compensate. This gives us a perfect audit trail: the entire history of every account is reconstructable by replaying the ledger from the first day. Regulators love this; engineers learn to love it after their first incident.

💰 `balance_paise` as a derived snapshot

The balance on Account is not the source of truth. It is a cached snapshot derived from SELECT SUM(amount_paise) FROM ledger_entry WHERE account_id = ?. The ledger is truth; the balance is convenience. This is what makes the system reconcilable.

💳 Payment ≠ Transaction

A transaction is a business event ("₹6,499 from Priya to Myntra"); a payment is the mechanism ("HDFC Visa card ending 4242, auth code XYZ, captured at 14:02:06 IST"). One transaction has exactly one payment, but separating them lets us swap the payment mechanism (card → UPI → wallet → netbanking) without changing the business semantics — important in India where ~70% of online checkouts happen on UPI, not card.

The audit invariant — write it on the wall: for every transaction, the sum of its ledger entries must equal zero (debits and credits balance). Globally, SUM(amount_paise) FROM ledger_entry across all accounts must equal zero too. If that sum ever diverges from zero, money has been created or destroyed in our system — that is a P0 incident that wakes up the on-call engineer.
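The invariant can be checked with two aggregate queries. The sketch below again uses SQLite in place of Postgres and assumes a sign convention the text does not spell out (negative amounts are debits, positive are credits); the account names and the 2% platform fee in the second transaction are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ledger_entry (
        id             INTEGER PRIMARY KEY,
        account_id     TEXT NOT NULL,
        transaction_id TEXT NOT NULL,
        amount_paise   INTEGER NOT NULL  -- assumed: negative = debit, positive = credit
    )
""")

def post(transaction_id, legs):
    """Append all legs of one transaction atomically; refuse unbalanced sets."""
    if sum(amount for _, amount in legs) != 0:
        raise ValueError("unbalanced transaction: legs must sum to zero")
    with conn:  # one DB transaction: every leg lands or none does
        conn.executemany(
            "INSERT INTO ledger_entry (account_id, transaction_id, amount_paise) "
            "VALUES (?, ?, ?)",
            [(account, transaction_id, amount) for account, amount in legs],
        )

# Priya pays into escrow, then escrow pays out to the merchant minus a fee.
post("tx_001", [("acct_priya", -649900), ("acct_escrow", 649900)])
post("tx_002", [("acct_escrow", -649900), ("acct_myntra", 636902),
                ("acct_platform_fee", 12998)])

global_sum = conn.execute(
    "SELECT COALESCE(SUM(amount_paise), 0) FROM ledger_entry"
).fetchone()[0]
unbalanced = conn.execute(
    "SELECT transaction_id FROM ledger_entry "
    "GROUP BY transaction_id HAVING SUM(amount_paise) <> 0"
).fetchall()
```

Running the `HAVING SUM(...) <> 0` query on a schedule is one plausible way to turn the invariant into an alert: any row it returns names a transaction that created or destroyed money.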

Step 6 · CORE

High-Level Architecture — From Naive to Production

This is the section that wins or loses the interview. We'll build the architecture in three passes: the simplest thing that could plausibly work, why it shatters the moment real money flows through it, and the production shape where every box exists to plug a specific failure mode.

Pass 1 — The naive design (and why it breaks)

One app server. It receives the checkout POST, charges the card via Razorpay's API, then in the same request handler updates the merchant's balance in the DB, then sends a confirmation SMS/email, then returns 200 to the browser. Three calls in series, one happy path.

Four catastrophic failure modes show up the moment this hits production:

💥 Network drop after step 1, before step 2

Razorpay charged Priya's HDFC card for ₹6,499. Our app server crashed before writing the merchant balance. Priya's HDFC statement shows the charge. Our DB shows nothing. The merchant has no idea Priya paid and never ships shoes. Priya is angry, Myntra is angry, and nobody can find the ₹6,499 — it's sitting in our Razorpay nodal account with no internal record. Money has effectively leaked out of the system.

💥 Priya retries — double charge

Browser timed out after step 1, Priya hits "Place order" again. Our app receives a fresh request that looks identical. We run the whole flow again — Razorpay charges Priya twice. Now ₹12,998 has left her account for one pair of shoes. She files a chargeback, the bank yanks back ₹6,499 plus a ~₹500 dispute fee, and we still owe Myntra for the shoes they shipped.

💥 Hot-row contention on merchant balance

Big Billion Days: Myntra receives 500 orders per second. Every order does UPDATE merchant SET balance = balance + amount WHERE id = 'myntra'. Postgres serializes these updates on the same row. Lock queue grows, p99 latency climbs from 50ms to 5 seconds, and Priya sees a spinner on her checkout page just when she's most price-sensitive — she abandons and goes to Amazon.

💥 No audit trail, no recovery path

A single UPDATE balance overwrites history. We can never answer "what was Priya's balance at 14:02:05?" or "did this charge actually happen?" RBI inspectors want a ledger of every state transition. Without one, we cannot pass an inspection and we cannot debug incidents — we can only guess.

Pass 2 — The mental model: Idempotency + Saga + Double-Entry Ledger

The production design is built on three ideas. Each one solves exactly one of the failure modes above. Get these three right and the architecture writes itself.

🎟️ Idempotency Key

Every mutation request carries a UUID generated by the client — a receipt number. The server keeps a record of every key it has seen. If the same key shows up twice, the server returns the original result instead of executing the operation again. Same as a hotel handing back the same room key when you re-show the same booking confirmation — they don't re-book the room.

Solves: double-charge on retry. Priya can mash the Place order button 100 times — the first request runs, the next 99 return the same result, no extra money moves.

🔄 Saga Pattern

A payment is not one operation — it is a workflow: authorize, then capture, then ledger update, then notify. Each step has a paired compensating action (auth → reverse-auth, capture → refund, ledger → reversing entry). An orchestrator drives the workflow, retrying transient failures, and on permanent failure runs the compensating actions in reverse to undo the partial work. Like a recipe with explicit "if you've already cracked the eggs but ran out of flour, throw the eggs out" instructions.

Solves: partial-failure leaks (Pass 1 problem #1). No state where the card was charged but the merchant wasn't credited.
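The core of the saga idea fits in a short loop: run each step, remember its compensation, and on a permanent failure run the remembered compensations in reverse. The sketch below is a toy, in-memory illustration with hypothetical step functions; it omits the durable state and transient-failure retries that a workflow engine would provide.

```python
class PermanentFailure(Exception):
    """A step failed in a way that retrying will not fix."""

def run_saga(steps, ctx):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except PermanentFailure:
            log.append(f"fail:{name}")
            # Undo already-completed steps, most recent first.
            for done_name, undo in reversed(completed):
                undo(ctx)
                log.append(f"undo:{done_name}")
            return False, log
        log.append(f"do:{name}")
        completed.append((name, compensate))
    return True, log

# Hypothetical steps that just flip flags in a dict.
def authorize(ctx): ctx["authorized"] = True
def reverse_auth(ctx): ctx["authorized"] = False
def capture(ctx): ctx["captured"] = True
def refund(ctx): ctx["captured"] = False
def write_ledger(ctx): raise PermanentFailure("ledger rejected the entries")
def reverse_ledger(ctx): ctx["ledger_reversed"] = True

steps = [
    ("authorize", authorize, reverse_auth),
    ("capture", capture, refund),
    ("ledger", write_ledger, reverse_ledger),
]
ctx = {}
ok, log = run_saga(steps, ctx)
```

In this run the ledger step fails, so the capture is refunded and the authorization reversed, leaving no state in which the card is charged without a ledger record.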

📒 Double-Entry Ledger

Every transaction writes at least two entries that sum to zero: a debit on one account and a credit on another, with extra legs for split payments and FX. Money is never created or destroyed; only moved. The system's invariant is mathematical: SUM(all ledger entries) = 0. Borrowed directly from 700-year-old accounting practice, because accountants solved this problem long before computers existed. Reconciliation becomes a single SQL query.

Solves: hot-row contention (Pass 1 problem #3) and audit gaps (Pass 1 problem #4). Append-only, no UPDATE locks; balances computed on demand from append history.

Crucially, these three ideas compose. Idempotency makes individual saga steps safe to retry. The ledger gives the saga's compensating actions a clean way to record reversals. The saga orchestrator commits ledger writes in transactions to guarantee step atomicity. Take any one out and the other two break — that is why all three appear in every serious payment system on the planet.

Pass 3 — The production shape

Now the full picture. Every node is numbered ①–⑬ — find its matching card below for what it does and what would break without it. The architecture is split into four planes by responsibility: Ingest (accept the request), Orchestration (drive the workflow), Ledger (record the truth), and Risk (decide whether to allow the transaction).

flowchart TB
    CL["① Client SDK — web / mobile checkout"]
    subgraph INGEST["Ingest Plane"]
        GW["② API Gateway"]
        API["③ Payment API Server"]
        IDC[("④ Idempotency Cache — Redis")]
    end
    subgraph ORCH["Orchestration Plane"]
        ORC["⑤ Payment Orchestrator — Temporal"]
        GA["⑥ Payment Gateway Adapter"]
    end
    subgraph LEDGER["Ledger Plane"]
        LS["⑦ Ledger Service"]
        LDB[("⑧ Ledger DB — Postgres serializable")]
        SNAP[("⑨ Account Snapshot Cache")]
    end
    subgraph RISK["Risk Plane"]
        FRAUD["⑩ Fraud Detection"]
        NOTIF["⑪ Notification Service"]
        REC["⑫ Reconciliation Service"]
        AUD[("⑬ Audit Log — S3 append-only")]
    end
    CL --> GW
    GW --> API
    API --> IDC
    API --> ORC
    ORC --> FRAUD
    ORC --> GA
    GA --> STRIPE["Razorpay / Visa / RuPay"]
    ORC --> LS
    LS --> LDB
    LS --> SNAP
    ORC --> NOTIF
    ORC -.events.-> AUD
    REC -.daily.-> LDB
    REC -.daily.-> STRIPE
    style CL fill:#e8743b,stroke:#e8743b,color:#fff
    style GW fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style API fill:#171d27,stroke:#e8743b,color:#d4dae5
    style IDC fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style ORC fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style GA fill:#171d27,stroke:#4a90d9,color:#d4dae5
    style LS fill:#171d27,stroke:#38b265,color:#d4dae5
    style LDB fill:#171d27,stroke:#38b265,color:#d4dae5
    style SNAP fill:#171d27,stroke:#3cbfbf,color:#d4dae5
    style FRAUD fill:#171d27,stroke:#e05252,color:#d4dae5
    style NOTIF fill:#171d27,stroke:#d4a838,color:#d4dae5
    style REC fill:#171d27,stroke:#d4a838,color:#d4dae5
    style AUD fill:#171d27,stroke:#9b72cf,color:#d4dae5
    style STRIPE fill:#0f1520,stroke:#7b8599,color:#d4dae5

Component-by-component — what each numbered box does

Use the numbers in the diagram to find the matching card below. Each one answers what is it, why is it here, and what would break without it.

① Client SDK

The Razorpay Checkout SDK (or PayU's BOLT SDK) loaded in Priya's browser or mobile app. It does two critical jobs: (1) it tokenizes her card by posting the raw PAN directly to Razorpay's hosted iframe, never to our servers — getting back a token like token_abc123 that our backend can use without ever touching real card data (per RBI's 2022 tokenisation mandate). (2) It generates a fresh UUID for the Idempotency-Key header, locking in the receipt number before the user even clicks "Place order".

Solves: two huge problems at once — keeps PCI scope off our servers and complies with RBI's no-card-storage rule (§12), and gives us the idempotency primitive we depend on for safe retries.

② API Gateway

The first thing inbound traffic hits. Terminates TLS, enforces rate limits per API key (e.g., a customer can't fire 1000 charges/sec from the same key), validates auth tokens, and forwards clean requests to the payment API server. AWS API Gateway, Kong, or Envoy all fit.

Solves: a misbehaving or malicious client hammering the API, for example to brute-force a fraud attempt. Without the gateway, every bad actor's request reaches our application logic, wasting CPU and risking DB connection exhaustion. With it, the bulk of that abuse is rejected at the edge.
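Per-key rate limiting is commonly implemented as a token bucket. The sketch below is one minimal way to do it, not a description of how any particular gateway product works; the class, the `allow_request` helper and the tiny limits (5 requests/sec, burst of 5) are illustrative. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Refills at `rate_per_sec` tokens per second, up to `capacity`."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)   # start full, so a short burst is allowed
        self.last = now

    def allow(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0          # spend one token for this request
            return True
        return False                    # bucket empty: reject at the edge

buckets = {}

def allow_request(api_key, now, rate_per_sec=5, capacity=5):
    """One bucket per API key, created on first use."""
    if api_key not in buckets:
        buckets[api_key] = TokenBucket(rate_per_sec, capacity, now)
    return buckets[api_key].allow(now)

burst = [allow_request("key_abusive", now=0.0) for _ in range(8)]
after_refill = allow_request("key_abusive", now=1.0)
other_key = allow_request("key_polite", now=0.0)
```

Here the first five requests in the burst pass and the next three are rejected, while a different key is unaffected. A production gateway would keep the buckets in shared storage so that all gateway nodes see the same counts.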

③ Payment API Server

Stateless service. Validates the request body, looks up the customer and merchant, checks the idempotency key in Redis ④, and, if the key is new, kicks off a workflow on the Orchestrator ⑤. The client still gets a synchronous answer: the server holds the HTTP connection open until the workflow reports its first decisive result (succeeded or declined), typically well under a second on the happy path excluding OTP time, while non-critical steps such as notifications continue asynchronously.

Solves: isolating the synchronous request/response contract from the multi-step workflow. The API server's job is "answer the client cleanly"; the saga's job is "finish the work durably". Splitting them lets each scale on its own dimension.

④ Idempotency Cache (Redis)

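Before starting any workflow, the API server runs SET idempotency:<key> "in-flight" NX EX 86400: set the key only if it does not already exist, with a 24-hour TTL. (Plain SETNX takes no expiry argument, so the NX and EX options go on a single SET to make the lock and its TTL atomic.) The first time the key is seen, the server gets the lock and proceeds. The second time, Redis reports that the key already exists and the server returns the cached response, or an in-progress status if the first attempt has not finished. The cache is the fast path; the database UNIQUE constraint on Transaction.idempotency_key is the slow but bulletproof backup.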

Solves: double-charge on retry. Without this, every retry is a fresh charge. With it, retries are safe — you can build clients that retry aggressively without ever fearing a duplicate.
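The fast-path logic can be sketched without a real Redis server. Below, `FakeRedis` is an in-memory stand-in whose `set_nx` mirrors the set-if-absent behaviour described above (TTL handling is omitted), and `charge_card` is a hypothetical placeholder for the whole payment workflow.

```python
class FakeRedis:
    """In-memory stand-in for the few Redis operations this flow needs."""

    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        # Write only if the key is absent; report whether we won the race.
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value

charges = []  # every real charge appends here, so duplicates are visible

def charge_card(amount_paise):
    charges.append(amount_paise)
    return {"id": f"pay_{len(charges):03d}", "status": "succeeded",
            "amount": amount_paise}

def handle_payment(redis, idempotency_key, amount_paise):
    slot = f"idempotency:{idempotency_key}"
    if not redis.set_nx(slot, "in-flight"):
        cached = redis.get(slot)
        if cached == "in-flight":
            return {"status": "processing"}   # first attempt still running
        return cached                          # replay: return original result
    result = charge_card(amount_paise)
    redis.set(slot, result)                    # swap the lock for the final result
    return result

redis = FakeRedis()
first = handle_payment(redis, "key-1", 649900)
replay = handle_payment(redis, "key-1", 649900)
fresh = handle_payment(redis, "key-2", 649900)
```

The replay returns the same payment as the first call and no second charge is made; only the request with a different key produces a new charge.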

⑤ Payment Orchestrator (Temporal)

The brain. A Temporal workflow that codifies the payment saga as a sequence of steps: fraud-check → authorize → capture → write-ledger → notify-merchant. Temporal persists the workflow state at every step boundary, automatically retries transient failures, and runs compensating actions in reverse if a step permanently fails. If the orchestrator pod dies mid-workflow, another picks up exactly where it left off — no work is lost, no work is duplicated.

Solves: the partial-failure leak from Pass 1. Without an orchestrator, an app-server crash between "card charged" and "ledger updated" leaves money in limbo. With Temporal, the workflow resumes after the crash and either finishes the work or undoes it.

⑥ Payment Gateway Adapter

A thin wrapper around Razorpay, PayU, Cashfree, or whichever upstream processor we use. It exposes a uniform internal API (authorize, capture, refund) so the orchestrator does not need to know which gateway it's talking to. This is also where we do gateway-specific retries with exponential backoff and circuit-breaker logic — if Razorpay is timing out (their last big outage was Jan 2023, 4 hours), we fail fast instead of pile-driving more requests.

Solves: vendor lock-in and gateway outages. Without an adapter layer, Razorpay-specific code is sprinkled through the codebase, making "fail over to PayU on Razorpay outage" a multi-month project. With it, we change one adapter implementation. This matters more in India than abroad because no single payment aggregator (PA) holds > 50% market share, so most large merchants run 2–3 PAs in parallel.
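A minimal sketch of the adapter-plus-circuit-breaker idea follows. All names here (`GatewayAdapter`, `CircuitBreaker`, `authorize_with_failover`, the threshold of 3 failures and 30-second cool-off) are illustrative assumptions, and the two gateway functions are fakes. Time is passed in explicitly to keep the example deterministic.

```python
class GatewayTimeout(Exception):
    """Raised by a (hypothetical) gateway client when the upstream times out."""

class CircuitOpen(Exception):
    """Raised when the breaker refuses to call the upstream at all."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_sec=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_sec = reset_after_sec
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, now):
        if self.opened_at is not None and now - self.opened_at < self.reset_after_sec:
            raise CircuitOpen("failing fast")
        try:
            result = fn()       # closed, or a single trial call after the cool-off
        except GatewayTimeout:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now
            raise
        self.failures = 0
        self.opened_at = None
        return result

class GatewayAdapter:
    """Uniform internal interface over one upstream processor."""
    def __init__(self, name, authorize_fn):
        self.name = name
        self._authorize_fn = authorize_fn
        self._breaker = CircuitBreaker()

    def authorize(self, amount_paise, now):
        return self._breaker.call(lambda: self._authorize_fn(amount_paise), now)

def authorize_with_failover(adapters, amount_paise, now):
    for adapter in adapters:
        try:
            return adapter.name, adapter.authorize(amount_paise, now)
        except (GatewayTimeout, CircuitOpen):
            continue            # try the next gateway in priority order
    raise GatewayTimeout("all gateways unavailable")

primary_calls = []
def flaky_primary(amount_paise):
    primary_calls.append(amount_paise)
    raise GatewayTimeout("primary timed out")

def healthy_secondary(amount_paise):
    return {"status": "authorized", "amount": amount_paise}

adapters = [GatewayAdapter("primary", flaky_primary),
            GatewayAdapter("secondary", healthy_secondary)]
results = [authorize_with_failover(adapters, 649900, now=float(t)) for t in range(5)]
```

After three timeouts the breaker opens and the primary is no longer called at all, so the last two requests go straight to the secondary. One caveat this sketch ignores: a timed-out authorize may have succeeded upstream, so a real failover path would need to check or void the first attempt before trying another gateway.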

⑦ Ledger Service

The most carefully-guarded service in the system. Its only job is to accept a transaction and write the corresponding double-entry ledger rows in a strict-serializable Postgres transaction. Validates the audit invariant (sum of entries = 0) before committing. Refuses to write anything that violates double-entry. Single-writer per partition for max consistency.

Solves: the source of truth for all money. Without a dedicated ledger service, ledger writes happen scattered across business-logic code paths — meaning a single bug in any caller can violate the audit invariant. With a dedicated service, every write goes through one validated path.

⑧ Ledger DB (Postgres serializable)

Postgres in SERIALIZABLE isolation level, with synchronous replication to a hot standby in another availability zone. Append-only LedgerEntry table partitioned by account_id (§14). Every transaction is a single Postgres TX that writes 2+ rows atomically — either all entries land or none do, and the database guarantees this even under crash.

Solves: ACID for money. A NoSQL store with eventual consistency would let us briefly observe a state where the debit landed but the credit hadn't — and a balance read in that window would lie. Postgres serializable says: no, you cannot ever observe such a state.

⑨ Account Snapshot Cache

Computing SUM(amount_paise) FROM ledger_entry WHERE account_id = ? is slow once an account has a million entries. We materialize a periodic snapshot — every N minutes a job rolls up the ledger into a per-account balance row, and reads hit the snapshot first. If a snapshot is stale, we read the snapshot plus the small delta of entries since the snapshot timestamp. Truth is still the ledger; this is just a fast cached read.

Solves: read latency on hot accounts. Without snapshots, fetching the platform escrow balance — a single account that touches every transaction — would scan billions of rows. With snapshots, it's an O(1) lookup plus a tiny tail.
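The snapshot-plus-delta read can be sketched as follows. The text above keys the delta on the snapshot timestamp; this sketch assumes a monotonically increasing ledger sequence number instead, which plays the same role. The `Entry` and `Snapshot` types and both helper functions are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    seq: int            # assumed: monotonically increasing ledger sequence number
    account_id: str
    amount_paise: int

@dataclass(frozen=True)
class Snapshot:
    balance_paise: int
    up_to_seq: int      # last ledger seq already folded into balance_paise

def take_snapshot(ledger, account_id):
    """The periodic roll-up job: a full scan, done off the request path."""
    mine = [e for e in ledger if e.account_id == account_id]
    return Snapshot(
        balance_paise=sum(e.amount_paise for e in mine),
        up_to_seq=max((e.seq for e in mine), default=0),
    )

def read_balance(ledger, account_id, snapshot):
    """Fast read: snapshot plus only the entries written after it."""
    tail = sum(
        e.amount_paise
        for e in ledger
        if e.account_id == account_id and e.seq > snapshot.up_to_seq
    )
    return snapshot.balance_paise + tail

ledger = [
    Entry(1, "acct_escrow", 649900),
    Entry(2, "acct_priya", -649900),
    Entry(3, "acct_escrow", 120000),
]
snap = take_snapshot(ledger, "acct_escrow")
ledger.append(Entry(4, "acct_escrow", -649900))   # written after the snapshot
ledger.append(Entry(5, "acct_myntra", 649900))
balance = read_balance(ledger, "acct_escrow", snap)
full_scan = sum(e.amount_paise for e in ledger if e.account_id == "acct_escrow")
```

The in-memory filter stands in for what would be an indexed range scan on (account_id, seq) in a real database; the result always equals the full-ledger sum, which is what keeps the ledger the source of truth.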

⑩ Fraud Detection

Two-stage gate the orchestrator runs before authorizing the card. Stage 1: real-time deterministic rules — velocity checks (this card just tried 10 charges in 60 seconds), IP blacklists, BIN risk. Sub-100ms, blocks the obvious. Stage 2: an ML model (gradient-boosted trees scoring features like amount-vs-customer-history, geo mismatch, device fingerprint) runs in ~100ms and either approves, declines, or flags for manual review.

Solves: chargebacks. A chargeback costs us the disputed amount plus a ~₹500 dispute fee and counts against our gateway's risk score. Razorpay (and the underlying Visa/Mastercard rails) will throttle or terminate accounts whose chargeback rate exceeds 1%. Catching even 50% of fraud before authorization pays for the fraud team many times over.
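A Stage 1 velocity check is essentially a sliding-window counter per card. The sketch below is one simple way to write it; the class name and the limit of 10 attempts per 60 seconds are illustrative, and state is kept in memory where a real deployment would use a shared store.

```python
from collections import defaultdict, deque

class VelocityRule:
    """Allow at most `max_attempts` charge attempts per card in any window."""

    def __init__(self, max_attempts=10, window_sec=60):
        self.max_attempts = max_attempts
        self.window_sec = window_sec
        self.attempts = defaultdict(deque)   # card token -> attempt timestamps

    def check(self, card_token, now):
        """Record this attempt; return True if allowed, False if the rule trips."""
        window = self.attempts[card_token]
        # Drop attempts that have aged out of the sliding window.
        while window and now - window[0] >= self.window_sec:
            window.popleft()
        window.append(now)
        return len(window) <= self.max_attempts

rule = VelocityRule(max_attempts=10, window_sec=60)
first_ten = [rule.check("token_abc123", now=float(t)) for t in range(10)]
eleventh = rule.check("token_abc123", now=10.0)
much_later = rule.check("token_abc123", now=200.0)
different_card = rule.check("token_other", now=10.0)
```

Blocked attempts are still recorded, so a card that keeps hammering stays blocked until its attempts age out of the window.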

⑪ Notification Service

The fan-out for downstream effects. Sends Priya's confirmation email and SMS (via MSG91 with TRAI-mandated DLT templates), fires the merchant webhook (POST merchant.example.com/webhook with the payment payload), pushes a real-time event to the merchant dashboard. Critically, this is async — the orchestrator queues notifications and moves on; if Myntra's webhook endpoint is slow, Priya's checkout doesn't wait. Webhooks retry with exponential backoff for 3 days.

Solves: coupling latency. Without async notifications, every checkout's p99 includes the slowest merchant webhook in the system. With them, the orchestrator commits the ledger and returns success in ~500ms regardless of how slow the merchant's server is.
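The retry policy for webhooks can be expressed as a schedule of offsets. The text only specifies exponential backoff for up to 3 days; the base delay of 60 seconds, the doubling factor and the 6-hour cap in this sketch are assumptions chosen for illustration.

```python
THREE_DAYS_SEC = 3 * 24 * 3600

def webhook_retry_schedule(base_sec=60, factor=2, max_delay_sec=6 * 3600,
                           give_up_after_sec=THREE_DAYS_SEC):
    """Return offsets (seconds after the first failure) at which to retry."""
    offsets = []
    elapsed = 0
    delay = base_sec
    while elapsed + delay <= give_up_after_sec:
        elapsed += delay
        offsets.append(elapsed)
        # Grow the delay exponentially, but never beyond the cap.
        delay = min(delay * factor, max_delay_sec)
    return offsets

schedule = webhook_retry_schedule()
gaps = [b - a for a, b in zip([0] + schedule, schedule)]
print(f"{len(schedule)} retries, last one {schedule[-1] / 3600:.1f} h after first failure")
```

Capping the delay keeps retries flowing throughout the 3-day window instead of leaving a single very long final gap. Real senders usually add random jitter to each delay so that many failed webhooks do not retry in lockstep.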

⑫ Reconciliation Service

A daily batch job. Pulls Razorpay's settlement file (every charge they processed for us yesterday, delivered as a CSV/S3 object) and joins it against our internal ledger. Every Razorpay row should match exactly one of our captured Payment rows (and, through its Transaction, a balanced set of LedgerEntry rows); every one of our captured payments should match exactly one Razorpay row. Discrepancies get flagged for a human and must be resolved within 24 hours per RBI expectations.

Solves: the "did we actually move the money we think we moved" question. Bugs happen. Network glitches happen. Reconciliation is how we catch them within a day instead of discovering at the next RBI inspection that we've been off by ₹50 lakh for six months.
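At its heart the reconciliation join is a set comparison keyed on the gateway's payment id. The sketch below assumes both sides have already been reduced to lists of (payment id, amount in paise) pairs; the function name and the shape of the report are illustrative.

```python
from collections import Counter

def reconcile(settlement_rows, internal_payments):
    """Compare two lists of (gateway_payment_id, amount_paise) pairs."""
    theirs = dict(settlement_rows)
    ours = dict(internal_payments)
    # Ids that appear more than once in the settlement file.
    settlement_counts = Counter(pid for pid, _ in settlement_rows)
    return {
        "missing_internally": sorted(theirs.keys() - ours.keys()),
        "missing_at_gateway": sorted(ours.keys() - theirs.keys()),
        "amount_mismatch": sorted(
            pid for pid in theirs.keys() & ours.keys() if theirs[pid] != ours[pid]
        ),
        "duplicated_at_gateway": sorted(
            pid for pid, n in settlement_counts.items() if n > 1
        ),
    }

settlement = [("pay_1", 649900), ("pay_2", 120000), ("pay_4", 5000), ("pay_4", 5000)]
internal = [("pay_1", 649900), ("pay_2", 125000), ("pay_3", 99900)]
report = reconcile(settlement, internal)
```

In this example `pay_4` was charged by the gateway (twice) with no internal record, `pay_3` exists internally but never settled, and `pay_2` settled for a different amount; each of those would be routed to a human for resolution.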

⑬ Audit Log (S3 append-only)

Every state transition in the system (workflow started, fraud checked, card authorized, 3DS completed, ledger committed, notification sent) emits an immutable event record to S3 (ap-south-1 bucket per RBI data-localisation rules) with Object Lock turned on in compliance mode, so that even an admin with root credentials cannot delete or modify written objects before their retention period expires. Mirrored to an on-premise audit store as well, because RBI inspections expect records to be accessible even if AWS is unavailable.

Solves: regulatory and forensic requirements. After any incident, we need to reconstruct exactly what happened. After any RBI inspection, the regulator needs proof that records cannot have been tampered with. S3 object-lock plus append-only is the cheapest, most defensible answer.

Concrete walkthrough — Priya buys ₹6,499 running shoes at 14:02:06 IST

It's 14:02:06 IST on a Tuesday afternoon in Bengaluru. Priya is on Myntra in Chrome on her MacBook from her apartment in Koramangala. The cart shows one pair of Asics Gel-Nimbus 25, UK 6, ₹6,499 even. Her finger is hovering over the pink "Place order" button. The next ~700 milliseconds will pass through thirteen components, two databases, two external networks (Visa and the issuing bank's 3-D Secure server), and four state changes that must either all happen or all undo. Let's slow time down and watch them in order.

We'll do this twice: once on the happy path — every server up, every network call clean — and once on a nightmare path where the orchestrator pod loses power between charging her card and writing the ledger entry. The second scenario is the one that earns the architecture its keep.

One quick note on the India context: RBI mandates 3-D Secure (the OTP step) on every domestic card transaction above ₹5,000 and on every recurring mandate, regardless of amount. So unlike the US flow where auth can run silently, Priya will see her HDFC OTP screen pop up mid-checkout. The architecture treats this as a sub-step of authorization handled by the gateway adapter: the orchestrator just sees "auth succeeded" or "auth failed", the same shape it would see without the OTP step.

✅ Happy path — every component does its job (~700ms end-to-end, plus OTP)

T = 0 ms — Priya's finger hits "Place order". The Myntra checkout page has loaded the Client SDK ① (in our case, the Razorpay Checkout SDK) as a hidden iframe pointed at checkout.razorpay.com. Two things happen in Priya's browser before any request leaves her laptop:

Card tokenization. The Razorpay SDK takes the raw PAN 4111 1111 1111 1111 she typed earlier and POSTs it directly to api.razorpay.com/v1/tokens, not to Myntra and not to us. Razorpay responds with an opaque token token_NkXyZ.... This token is the only thing our backend will ever see; because raw card data never touches our servers, our PCI-DSS scope shrinks to the bare minimum (§12). It also keeps us compliant with RBI's 2022 tokenization mandate, which forbids merchants from storing PANs at all.
Idempotency key generation. The SDK calls crypto.randomUUID() and gets a UUID, abbreviated here as XYZ-abc-123. This key is now locked in: every retry of this same checkout, even if Priya's Jio fibre flickers and her browser auto-retries 5 times, will carry the same key. The receipt number is fixed before the first request leaves her browser.

The browser then fires:

POST https://api.payments.example/v1/payments
Idempotency-Key: XYZ-abc-123
Authorization: Bearer <Myntra's API key>
Content-Type: application/json

{
  "amount": 649900,
  "currency": "INR",
  "payment_method_id": "token_NkXyZ...",
  "merchant_id": "mer_myntra99",
  "customer_id": "cus_priya42",
  "description": "Asics Gel-Nimbus 25, UK 6"
}

Two India-specific details in the body: amounts are in paise (1 ₹ = 100 paise) stored as integers — never floats, because floats can't represent ₹0.01 exactly and a rounding error of a paisa per transaction at 1M tx/day adds up to a regulator-visible drift. And currency is INR, which determines which acquiring bank Razorpay will route through and which RBI rules apply.
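The float problem is easy to demonstrate. The sketch below shows the classic binary-float drift and one way to convert a rupee string to integer paise exactly using Python's `decimal` module; the helper name `rupees_to_paise` is illustrative.

```python
from decimal import Decimal

def rupees_to_paise(text):
    """Parse a rupee amount such as '6499.00' into integer paise, exactly."""
    paise = Decimal(text) * 100
    if paise != paise.to_integral_value():
        raise ValueError(f"{text!r} has a fraction of a paisa")
    return int(paise)

# Binary floats cannot represent most decimal fractions exactly.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Integer paise add exactly, no matter how many rows are summed.
paise_sum = rupees_to_paise("0.10") + rupees_to_paise("0.20")

shoe_price = rupees_to_paise("6499.00")
```

Going through a string and `Decimal` avoids ever holding the amount as a float, so no rounding happens on the way in; from then on every calculation stays in integer paise.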

T = 30 ms — Request hits the API Gateway ②. The TLS handshake finished at T=20ms; AWS API Gateway in the ap-south-1 (Mumbai) region now does its three jobs in ~5ms: (a) validates that the bearer token belongs to a known merchant (Myntra checks out), (b) checks the per-key rate limit (Myntra's quota is 2,000 req/sec to absorb Big Billion Days spikes, and we've seen 340 requests from them this second, so fine), and (c) forwards the cleaned request to the next hop. If Priya's request had come from a fraudster firing 10,000 attempts/sec, it would have died right here, never reaching application code and never wasting a database connection.

T = 45 ms — Payment API Server ③ receives the request. This is a stateless Go service running 50 pods behind an internal load balancer, also in ap-south-1 because RBI's data-localisation rule requires Indian payment data to be stored and processed within India. The pod that gets Priya's request does five things in fast succession:

Schema validation. Does the amount parse as a positive integer number of paise? Is currency in our supported list? Does merchant_id exist and is it active? All yes.
Idempotency lookup against Redis ④. The pod runs SET idempotency:XYZ-abc-123 "in-flight" NX EX 86400. The NX flag means "only set if it doesn't exist"; the EX 86400 sets a 24-hour TTL. Redis returns OK — the key is new. We have the lock. (If Priya's browser were retrying a 2nd attempt right now, Redis would return nil and we'd jump straight to returning the in-progress or cached response.)
Database row insert — defensive second layer. The pod INSERTs into the transaction table with idempotency_key='XYZ-abc-123' as a UNIQUE constraint. Even if Redis is wiped between attempts, Postgres will reject the duplicate. Both layers must fail for a double-charge — astronomically unlikely.
Kick off the workflow on the Orchestrator ⑤. The pod calls temporalClient.startWorkflow("ProcessPayment", { ...request, internalTxId: "tx_001" }) and gets back a workflow handle.
Block on the first decisive result — the pod holds Priya's HTTP connection open and awaits a signal from the workflow. It will return either when the workflow reaches a terminal state or when 90 seconds elapse (longer than the US flow because the 3DS OTP step can take Priya 30+ seconds to read her SMS and type the code).

T = 80 ms — Payment Orchestrator ⑤ (Temporal) starts the saga. Temporal persists the workflow's initial state to its own Postgres cluster before running a single line of business logic. This persistence is the trick: if any worker pod dies at any point, another picks up exactly here. The workflow code reads like straight-line Java:

@WorkflowMethod
public PaymentResult processPayment(PaymentRequest req) {
    FraudScore fs   = fraudActivity.score(req);          // step 1
    if (fs.declined()) return PaymentResult.declined(fs);
    AuthResult  auth = gatewayActivity.authorize(req);   // step 2 (includes 3DS for INR > ₹5K)
    CaptureResult cap = gatewayActivity.capture(auth);   // step 3
    ledgerActivity.writeDoubleEntry(req, cap);           // step 4
    notifyActivity.fanout(req, cap);                     // step 5
    return PaymentResult.succeeded(cap);
}

T = 120 ms — Step 1: Fraud Detection ⑩. Temporal schedules the fraud activity on a fraud-worker pod. Stage 1 — deterministic rules — runs in 8ms: Priya's card has been on file 2 years, this is her 14th purchase, IP 49.207.x.x in Bengaluru matches her shipping pincode 560034, BIN 411111 is an HDFC Visa (low risk). No rule triggers. Stage 2 — the gradient-boosted model — scores in 85ms using features like amount_vs_avg_for_customer = 1.3 (close to her normal ₹5,000 average), geo_mismatch = false, device_seen_before = true, hour_of_day_z_score = 0.3. Final score: 0.05 (where 1.0 is certainly-fraud). Approved. Without this step, Priya's checkout flow looks identical — but if a stolen card from a Pakistan IP tried to buy 5× ₹6,499 shoes in 30 seconds, the velocity rule would catch it before any money moves.

T = 220 ms — Step 2: authorize the card via Gateway Adapter ⑥. Temporal hands off to a gateway-worker pod. The adapter — a thin Razorpay-specific wrapper — calls:

POST https://api.razorpay.com/v1/payments/create
Authorization: Basic <rzp_live_key>
Idempotency-Key: tx_001-authorize        // derived from our internal tx ID
amount=649900&currency=INR&token=token_NkXyZ...&capture=manual

Razorpay routes through the Visa network to HDFC (Priya's issuing bank). Because the amount is over ₹5,000, HDFC triggers 3-D Secure: it sends Priya an OTP via SMS to her registered mobile number and returns a redirect URL to its OTP page. The orchestrator passes this URL back to the API server, which streams it down to Priya's browser. The Razorpay iframe expands to show the HDFC OTP screen.

T = 4 s to T = 35 s — Priya types her OTP. Most users take 15–30 seconds; some take longer if their SMS is delayed. HDFC's OTP page submits directly back to Razorpay (not through us). Razorpay completes the 3DS verification with Visa, gets a Cardholder Authentication Verification Value (CAVV), and returns to the orchestrator: { id: "pay_xyz", status: "authorized", auth_id: "auth_99" }. HDFC has now put a hold on ₹6,499 of Priya's credit line. No money has actually moved yet — we've just reserved the amount. If anything fails between here and capture, we can just walk away and HDFC auto-releases the hold within 5 days per Visa rules. Note the idempotency key we passed to Razorpay — derived deterministically from our transaction ID — so if our adapter retries this call after a timeout, Razorpay will return the existing authorization rather than creating a second hold.

T = 35.2 s — Step 3: capture the authorization. The adapter immediately follows up:

POST https://api.razorpay.com/v1/payments/auth_99/capture
Idempotency-Key: tx_001-capture
amount=649900&currency=INR

Razorpay tells HDFC "yes, actually take the ₹6,499". HDFC debits Priya's available credit and queues a transfer to Razorpay's nodal account (held with ICICI). The actual settlement to our merchant account happens via NEFT in T+1 working day per RBI's settlement cycle, not the T+2 ACH cycle the US uses. Razorpay returns { status: "captured", balance_transaction: "rcpt_99" } in 110ms. This is the moment real money has moved. ₹6,499 has left Priya's HDFC credit line; it is now Razorpay's, owed to us, owed to Myntra. If we crash now and never write the ledger, we owe Myntra ₹6,499 and our internal records won't show it — a money-leak bug.

T = 35.4 s — Step 4: write the double-entry ledger via Ledger Service ⑦. The orchestrator now does the most important step. It calls the ledger service with:

POST /internal/ledger/write
{
  "transaction_id": "tx_001",
  "entries": [
    { "account_id": "acct_priya_hdfc",    "amount_paise": -649900 },
    { "account_id": "acct_myntra_escrow", "amount_paise": +649900 }
  ]
}

The ledger service first validates the audit invariant: -649900 + 649900 = 0 ✓. Any request where the sum is non-zero is rejected outright: you cannot create or destroy money through this service. Then it opens a SERIALIZABLE Postgres transaction on the Ledger DB ⑧, INSERTs both rows, and COMMITs. The COMMIT does not return until the rows are also durable on the synchronous standby in another availability zone (the primary runs in ap-south-1a and the standby in ap-south-1b, both inside the Mumbai region), so if our primary catches fire in the next 50ms, the records survive. Total time: 65ms. The Account Snapshot Cache ⑨ has a background job that will roll these new entries into the per-account balance snapshots within the next minute; until then, reads against Priya's account compute "last snapshot + small delta".

T = 35.5 s — Step 5: fan out notifications via Notification Service ⑪. The orchestrator enqueues four messages on Kafka — but does not wait for any of them to be delivered. This is critical: Priya's checkout doesn't have to wait for Myntra's webhook server (which lives somewhere in their Bengaluru data center and has a 2-second response time on bad days). The four messages are:

Email to priya@gmail.com with the receipt — including the mandatory GST breakdown (CGST + SGST on intra-state, IGST on inter-state).
SMS to +91 98xxx xxxxx via the merchant's DLT-registered template (Indian SMS regulator TRAI requires every commercial SMS to use a pre-registered template).
POST to https://myntra.example/webhooks/payments with the full payment payload, signed with our webhook secret so Myntra can verify it's really us.
Realtime push to the merchant dashboard so Myntra's ops team sees the sale within seconds.

If Myntra's webhook endpoint is down, the notification service retries with exponential backoff for 3 days — Priya's checkout doesn't care.
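The signing and retry behaviour described above can be sketched as follows. This is an illustration under assumptions, not the platform's actual scheme: the text only says the webhook is "signed with our webhook secret", so HMAC-SHA256 over the raw body is assumed, and the backoff parameters (1 s base, 1 h cap) are invented; only the 3-day horizon comes from the text.

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, body: bytes) -> str:
    """Sender side: hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = sign_webhook(secret, body)
    return hmac.compare_digest(expected, signature)

def backoff_delays(base_s: float = 1.0, cap_s: float = 3600.0,
                   horizon_s: float = 3 * 24 * 3600):
    """Yield retry delays: doubling from base_s, capped at cap_s, until horizon_s is used up."""
    elapsed, attempt = 0.0, 0
    while True:
        delay = min(cap_s, base_s * 2 ** attempt)
        if elapsed + delay > horizon_s:
            return
        yield delay
        elapsed += delay
        attempt += 1
```

The merchant verifies against the raw bytes it received, before any JSON parsing, because re-serialising the payload can change whitespace and break the signature.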

T = 35.6 s — Workflow returns success to the API server. The API server caches the final response in Redis (under the same XYZ-abc-123 key, replacing the "in-flight" marker), then returns to Priya's browser:

HTTP/1.1 200 OK
Content-Type: application/json

{
  "id": "pay_tx_001",
  "status": "succeeded",
  "amount_paise": 649900,
  "currency": "INR",
  "created_at": "2026-05-26T08:32:41.600Z"
}

Note the timestamp is in UTC (08:32 UTC = 14:02 IST) — internal records always store UTC, and the display layer converts to IST when rendering to Priya.

T = 35.7 s — Priya sees the confirmation page. The Myntra frontend swaps to "Order placed! Delivery by Friday." with her order number. Meanwhile, every state transition along the way — workflow.started, fraud.scored, gateway.authorized, 3ds.completed, gateway.captured, ledger.committed, notifications.enqueued, workflow.completed — has been written to the Audit Log ⑬ on S3 (ap-south-1 bucket with object-lock) and mirrored to our on-premise audit store, because RBI inspections expect access to payment audit records even if AWS is unavailable. Twelve hours from now, the Reconciliation Service ⑫ will pull Razorpay's settlement file, find this ₹6,499 transaction in it, and cross-check it against our ledger row. Match. Green tick. Priya's running shoes ship the next morning.

End-to-end wall-clock time was ~35 seconds, but only ~700 ms of actual server work — the remaining ~34 seconds was Priya reading her HDFC OTP SMS and typing it in. The architecture's compute time is essentially identical to the US flow; the latency you see is dominated by the regulator-mandated OTP step.

⚠️ Failure path — the orchestrator pod dies between capture and ledger write

Same Priya, same shoes, same 14:02:06 IST. Steps 1 through 3 above all execute identically — fraud check passed, 3DS OTP entered successfully, ₹6,499 captured at HDFC. At T = 35.4 s — the instant after Razorpay confirmed the ₹6,499 capture but before the orchestrator could call the ledger service — the worker pod running Priya's workflow gets evicted by Kubernetes because the node it's on starts failing health checks (maybe an EBS volume in ap-south-1 is degraded; Mumbai's monsoon is hard on data centers). ₹6,499 of real money has moved out of Priya's HDFC credit line, into Razorpay's pipeline, owed to Myntra — and our system has lost the pod that was tracking it. Without Temporal, this is exactly the money-leak we feared in Pass 1.

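T = 35.4 s — pod dies. The pod stops sending heartbeats, and everything it held in memory is gone. The capture result itself is safe, because the capture activity had already reported completion to Temporal and that result sits in the persisted workflow history. (Had the pod died before reporting it, Temporal would simply retry the capture activity, and the tx_001-capture idempotency key would make Razorpay return the existing capture rather than charge again.)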

T = 35.5 s — Priya's browser is still waiting. The HTTP connection to the API server is still open; the API server is still blocked waiting on Temporal to signal a result. So far, Priya sees a spinner.

T = 45 s — Temporal notices. Temporal's service detects the worker hasn't checked in for 10 seconds and marks the workflow as available for reassignment. It looks at the persisted workflow history — which it saved at every step boundary — and sees: step 1 completed (fraud approved), step 2 completed (auth auth_99, 3DS verified), step 3 completed (capture rcpt_99). Step 4 has not started. Temporal does not re-run steps 1–3. Critically, this means Priya is not asked to enter her OTP a second time — Temporal knows the 3DS step already succeeded. It schedules step 4 on a fresh worker pod, with the captured state from step 3 already in hand.

T = 45.2 s — new pod resumes at step 4. It calls the ledger service with the exact same parameters the dead pod would have used: tx_001, -₹6,499 Priya, +₹6,499 Myntra. The ledger service writes the rows, COMMITs in Postgres. Step 5 (notifications) runs. The workflow completes. The books are balanced.
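The resume-without-re-running behaviour can be modelled in a few lines. This is a toy model of replay from a persisted history, not Temporal's API; run_workflow, WorkerCrash and the dict standing in for the durable store are all hypothetical names.

```python
class WorkerCrash(Exception):
    """Simulates the worker pod being evicted mid-workflow."""

def run_workflow(history, activities, crash_before=None):
    """Run activities in order, skipping any whose result is already in history."""
    for name, fn in activities:
        if name in history:
            continue                  # completed earlier: reuse the recorded result
        if name == crash_before:
            raise WorkerCrash(name)
        history[name] = fn(history)   # persist the result before moving on
    return history

calls = []   # records every real execution of an activity

def record(name, result):
    def activity(history):
        calls.append(name)
        return result
    return activity

activities = [
    ("fraud", record("fraud", {"score": 0.05})),
    ("authorize", record("authorize", {"auth_id": "auth_99"})),
    ("capture", record("capture", {"receipt": "rcpt_99"})),
    ("ledger", record("ledger", {"committed": True})),
]

history = {}   # stands in for the orchestrator's durable store
try:
    run_workflow(history, activities, crash_before="ledger")   # first worker dies before step 4
except WorkerCrash:
    pass
run_workflow(history, activities)                               # a fresh worker resumes
```

After both runs, each activity has executed exactly once: the second worker reads the recorded authorize and capture results instead of calling the gateway again, which is why Priya never sees a second OTP prompt.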

What about Priya's browser? If her HTTP request timed out at the 90-second mark with no response, the Myntra checkout page either auto-retries or shows her a "we're confirming your payment, give us a moment" UI. Either way, when she eventually clicks "Place order" again, the request carries the same idempotency key XYZ-abc-123, because the Razorpay SDK generates the key once per cart session.

T = ~90 s — Priya's retry hits the API. The API server runs the same SET NX against Redis. This time Redis returns nil — the key exists. The API server reads the existing value: at this point either still "in-flight" (if the workflow is still resuming) in which case the server polls Temporal for the workflow's status, or the final success payload from the original workflow. Either way, no second workflow is started, no second authorization, no second OTP, no second capture, no second ledger write. Priya gets back the same confirmation as the original request. One charge. One ledger entry. One pair of shoes shipped.

The really bad case — what if the ledger DB itself is down for an hour? Temporal's retry policy keeps re-running step 4 with exponential backoff. After ~5 minutes of failures, the workflow code triggers a compensating action: it calls the gateway adapter again with refund(rcpt_99). Razorpay issues a ₹6,499 refund back to Priya's HDFC Visa — refunds in India settle in 5–7 working days per RBI rules, slower than the original capture but unavoidable. The audit log records: capture executed, ledger write permanently failed, capture refunded — auditors (and RBI inspectors, who can demand transaction-level audit trails on 24-hour notice) can reconstruct exactly what happened. Priya eventually sees a "payment failed, please try again" page. Her HDFC statement shows a charge today and a matching refund 5 working days later. No money lost. No money duplicated. No money in limbo.

So what: the architecture exists because money is allergic to "almost". Every box in the diagram earns its keep in one of those ~700 milliseconds of compute (or in the recovery from one of them not going to plan). Idempotency ④ means Priya's retry — whether triggered by a Jio outage at T=200ms or a 90-second timeout — is free; no duplicate charge possible. The orchestrator ⑤ means the pod dying at T=35.4s doesn't lose her money, and crucially doesn't force her to re-do her OTP; another pod resumes the exact next step. The double-entry ledger ⑦⑧ means the books are always balanced — if step 4 ever fails permanently, the compensating refund at Razorpay brings the world back to zero. The audit log ⑬ means we can prove every paisa to an RBI inspector. Take any one of those four out and you ship a system that, on its first bad Tuesday, charges someone twice or loses ₹6,499 into the void — and you find out about it a month later from an angry customer or, worse, a regulator.

Step 7

Idempotency — The Foundational Property

If you take only one idea from this page, take this one. Every other guarantee in the system depends on idempotency working correctly.

An operation is idempotent if executing it twice has the same effect as executing it once. SET balance = 100 is idempotent (no matter how many times you run it, balance = 100). balance = balance + 6499 is not idempotent — running it twice charges ₹12,998. The entire point of the idempotency-key contract is to make non-idempotent operations look idempotent to the caller, so retries are safe.

How idempotency keys work end-to-end

sequenceDiagram
  participant CL as Client SDK
  participant API as Payment API
  participant R as Redis
  participant DB as Postgres
  participant ORC as Orchestrator
  Note over CL: Generate UUID once, before request
  CL->>API: POST /v1/payments<br/>Idempotency-Key: XYZ-abc-123
  API->>R: SETNX idempotency:XYZ-abc-123<br/>TTL=24h
  alt Key is new
    R-->>API: OK (lock acquired)
    API->>DB: INSERT Transaction(idem_key=XYZ-abc-123)<br/>UNIQUE constraint
    DB-->>API: OK
    API->>ORC: Start workflow
    ORC-->>API: Result
    API->>R: SET idempotency:XYZ-abc-123 = result
    API-->>CL: 200 + result
  else Key already seen
    R-->>API: EXISTS, return cached result
    API-->>CL: 200 + cached result (no re-execute)
  end

Two layers of defense: Redis is the fast path (1ms lookup), Postgres UNIQUE constraint on Transaction.idempotency_key is the bulletproof backup. Even if Redis is wiped, the database INSERT will fail with a unique-violation error on the duplicate, and our error handler reads the existing row and returns the original result. There is no race condition where a duplicate gets through both layers.
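A minimal sketch of the two layers working together, under stated assumptions: FakeCache is an in-memory stand-in for Redis (set_nx mimics SET ... NX), SQLite stands in for Postgres, and handle_payment is a hypothetical handler name. TTLs and concurrency are left out.

```python
import sqlite3

class FakeCache:
    """In-memory stand-in for Redis. set_nx mimics SET key value NX."""
    def __init__(self):
        self.data = {}
    def set_nx(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE txn (idempotency_key TEXT UNIQUE NOT NULL, response TEXT)")
    return db

def handle_payment(cache, db, idem_key, charge):
    cache_key = f"idempotency:{idem_key}"
    if not cache.set_nx(cache_key, "in-flight"):       # layer 1: fast path
        return cache.get(cache_key)                    # cached result, or "in-flight"
    try:
        db.execute("INSERT INTO txn (idempotency_key) VALUES (?)", (idem_key,))
    except sqlite3.IntegrityError:                     # layer 2: UNIQUE constraint
        row = db.execute(
            "SELECT response FROM txn WHERE idempotency_key = ?", (idem_key,)
        ).fetchone()
        response = row[0] or "in-flight"
        cache.set(cache_key, response)                 # repopulate the fast path
        return response
    response = charge()                                # the only place money moves
    db.execute("UPDATE txn SET response = ? WHERE idempotency_key = ?", (response, idem_key))
    db.commit()
    cache.set(cache_key, response)
    return response
```

Calling handle_payment three times with the same key, and wiping the cache before the third call, still invokes charge only once: the third call trips the UNIQUE constraint and returns the stored response.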

Rules every idempotency-aware service must follow

✅ Do

Generate the key on the client side before the first attempt — so all retries reuse it
Use a UUID v4 — sufficient entropy, no coordination needed
Store the full response, not just "yes done" — so retries see the same payload as the original
Tie the key to the exact request body: if a retry arrives with the same key but a different body, reject it instead of executing it
TTL the key (24h is typical) — keys aren't kept forever

❌ Don't

Generate the key on the server — defeats the purpose; client's retry would generate a new key
Use the request body hash as the key — if the user genuinely wants to charge twice for the same amount, you've blocked them
Allow same key with different bodies — that's a bug, not an idempotent retry; return 422
Forget to make every step in the workflow itself idempotent — Razorpay's API also takes idempotency keys; pass them through

End-to-end idempotency is a chain. The client passes a key to our API; our API passes a derived key to Razorpay (tx_001-authorize, tx_001-capture); the orchestrator's ledger-write step uses a third derived key. If any link in the chain is non-idempotent, retries can produce duplicates somewhere in the system. Audit each integration explicitly: the Razorpay calls carry an Idempotency-Key header, Postgres enforces UNIQUE constraints, and internal services take UUIDs in the request body.

Step 8

Double-Entry Ledger

The ledger is the single most important piece of the system, and the idea behind it is roughly 600 years old. Italian merchants in the 1400s figured out that if you record every transaction as two equal-and-opposite entries, errors become detectable and money becomes traceable. Modern payment systems are doing the same thing, just with Postgres instead of leather-bound books.

The mechanic

Every transaction generates at least two LedgerEntry rows. One DEBITs an account, one CREDITs another, and the amounts sum to zero. Priya's ₹6,499 running-shoe purchase looks like this:

flowchart LR
  TX["Transaction tx_001<br/>Priya buys shoes ₹6,499"]
  TX --> E1["Entry A: DEBIT<br/>account: priya_hdfc<br/>amount: −₹6,499"]
  TX --> E2["Entry B: CREDIT<br/>account: myntra_escrow<br/>amount: +₹6,499"]
  E1 --> SUM["SUM = ₹0 ✓<br/>audit invariant holds"]
  E2 --> SUM
  style TX fill:#171d27,stroke:#e8743b,color:#d4dae5
  style E1 fill:#171d27,stroke:#e05252,color:#d4dae5
  style E2 fill:#171d27,stroke:#38b265,color:#d4dae5
  style SUM fill:#171d27,stroke:#38b265,color:#d4dae5

The two entries land in the same Postgres transaction, so either both commit or neither does. After the commit: Priya's payment-method account is ₹6,499 lower, Myntra's escrow is ₹6,499 higher, and the system as a whole has the same total amount of money it had before.
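The mechanic above can be sketched with SQLite standing in for Postgres. The table layout and the write_entries helper are assumptions for illustration; the real service additionally runs at SERIALIZABLE isolation and waits for the synchronous standby.

```python
import sqlite3

def open_ledger():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE ledger_entry ("
        "transaction_id TEXT NOT NULL, account_id TEXT NOT NULL, amount_paise INTEGER NOT NULL)"
    )
    return db

def write_entries(db, transaction_id, entries):
    """entries is a list of (account_id, amount_paise); all rows commit together or not at all."""
    if len(entries) < 2 or sum(amount for _, amount in entries) != 0:
        raise ValueError("entries must balance to zero")
    with db:   # sqlite3 connection as context manager: commit on success, rollback on error
        db.executemany(
            "INSERT INTO ledger_entry VALUES (?, ?, ?)",
            [(transaction_id, account, amount) for account, amount in entries],
        )

db = open_ledger()
write_entries(db, "tx_001", [("acct_priya_hdfc", -649900), ("acct_myntra_escrow", 649900)])
```

The zero-sum check runs before the database is touched, so an unbalanced request never opens a transaction at all.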

Why this beats a simple `balance` column

📊 Auditability

Every paisa ever moved is a row. You can answer "what was Myntra's balance at 14:02:05" by summing entries up to that timestamp. With a single balance column, that history is gone the moment the next transaction overwrites it.

🔐 The mathematical invariant

SELECT account_id, SUM(amount_paise) FROM ledger_entry GROUP BY account_id gives every account's balance. SELECT SUM(amount_paise) FROM ledger_entry across all accounts must equal zero. If it doesn't, money was created or destroyed, and you have a P0 incident regardless of which row caused it.

⚡ No hot-row contention

Updating Myntra's balance with UPDATE merchant SET balance = balance + 649900 serializes every Myntra transaction on one row. Inserting a new ledger entry serializes nothing — Postgres can append in parallel. Big Billion Days goes from "spinner of death" to "actually responsive".

🔄 Reversals are clean

Refunding Priya doesn't UPDATE or DELETE the original entries. It writes new entries: +₹6,499 to her account, −₹6,499 from Myntra's escrow, with a reference to the original transaction. The original history is preserved; the reversal is a separate audit event.
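Two of the properties above, point-in-time balances and append-only reversals, can be shown with a plain list as the ledger. The Entry shape, the refund helper and the refund timestamp are hypothetical; this is a sketch of the idea rather than the service's schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    transaction_id: str
    account_id: str
    amount_paise: int
    at: str             # ISO-8601 UTC; strings in one fixed format sort chronologically
    reverses: str = ""  # transaction being reversed, if any

ledger = [
    Entry("tx_001", "priya_hdfc", -649900, "2026-05-26T08:32:41Z"),
    Entry("tx_001", "myntra_escrow", +649900, "2026-05-26T08:32:41Z"),
]

def balance(ledger, account_id, as_of=None):
    """Sum an account's entries, optionally only those at or before as_of."""
    return sum(e.amount_paise for e in ledger
               if e.account_id == account_id and (as_of is None or e.at <= as_of))

def refund(ledger, original_tx, refund_tx, at):
    """Append mirror-image entries; the original rows are never updated or deleted."""
    originals = [e for e in ledger if e.transaction_id == original_tx]
    ledger.extend(Entry(refund_tx, e.account_id, -e.amount_paise, at, reverses=original_tx)
                  for e in originals)

refund(ledger, "tx_001", "tx_002", "2026-05-27T05:00:00Z")
```

After the refund the ledger holds four rows: Myntra's balance as of the end of 26 May is still 649900 paise, its current balance is zero, and the original sale remains visible in history.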

More complex example — split payment with platform fee

Priya pays ₹10,000 for a service. The platform takes a 3% fee (with 18% GST on top of the fee, payable to government). Four ledger entries, sum still zero:

Entry	Account	Type	Amount
1	priya_hdfc	DEBIT	`−₹10,000.00`
2	provider_escrow	CREDIT	`+₹9,646.00`
3	platform_revenue	CREDIT	`+₹300.00`
4	gst_payable	CREDIT	`+₹54.00`
Sum			`₹0.00 ✓`

The gst_payable account is settled monthly to the government via the GSTR-3B filing. Treating GST as its own ledger account makes month-end statutory filings trivial — one SQL query gives the total liability.
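The arithmetic behind the four entries, as a small sketch in integer paise. The function name and the basis-point parameters are hypothetical, and rounding down with the remainder going to the provider is an assumption; the source does not state a rounding rule.

```python
def split_payment(amount_paise: int, fee_bps: int = 300, gst_bps: int = 1800):
    """Return ledger entries for a payment with a platform fee and GST charged on that fee."""
    fee = amount_paise * fee_bps // 10_000     # 3% platform fee
    gst = fee * gst_bps // 10_000              # 18% GST on the fee
    provider = amount_paise - fee - gst        # remainder, so rounding can never leak a paisa
    return [
        ("priya_hdfc", -amount_paise),
        ("provider_escrow", provider),
        ("platform_revenue", fee),
        ("gst_payable", gst),
    ]

entries = split_payment(1_000_000)             # ₹10,000.00 expressed in paise
```

For ₹10,000 this yields 964600, 30000 and 5400 paise on the credit side, matching the table. Computing the provider share as a remainder keeps the sum at zero even when the percentages do not divide evenly.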

The on-call sanity check: a payment system is healthy when SELECT SUM(amount_paise) FROM ledger_entry returns zero. We run that query as a Datadog metric every minute. If it ever drifts, an alarm fires before the next transaction even completes — because something is fundamentally broken and every additional transaction makes it worse.

Step 9

Saga Pattern with Temporal

A payment is not an atomic operation — it spans multiple systems we don't control. We can't take a global lock across Razorpay, our DB, and the merchant's webhook endpoint. The saga pattern is how we get atomicity-like guarantees without distributed transactions.

The saga as code (Temporal workflow)

payment workflow — pseudocode

workflow processPayment(req):
  // each step is an "activity": auto-retried, persistent
  fraudResult = checkFraud(req)
  if fraudResult.declined: return failure("fraud")

  authId = authorize(req.amount, req.payment_method_id)   // step 1

  try:
    captureId = capture(authId)                           // step 2
  catch e:
    reverseAuth(authId)                                   // compensate step 1: release the hold
    throw

  try:
    writeLedger(req.amount, req.source, req.dest)         // step 3
  catch e:
    refund(captureId)                                     // compensate step 2; the auth is already
    throw                                                 // consumed, so reverseAuth must NOT run

  try:
    notifyMerchant(req)                                   // step 4: async, best-effort
  catch:
    // step 4 failure does NOT roll back; webhooks are retried separately
    log("webhook will retry")

  return success

What Temporal gives us for free

💾 Durable state

Every step's input and output is persisted before the next step runs. If the orchestrator pod dies between step 2 and step 3, a different pod resumes at step 3 — never re-running step 2 (which already moved real money).

🔁 Auto-retry with backoff

Transient failures (Razorpay 502, network blip, DB timeout) are retried with exponential backoff up to a configured max. Permanent failures (declined card, validation error) escalate immediately to compensating actions.

🧯 Compensating actions

Each step has an inverse. If we capture a charge but can't write the ledger, the orchestrator runs refund against Razorpay to undo the capture. Priya sees an error; her card ends up with a charge-and-refund pair within 5–7 working days (RBI refund timeline). No half-completed state survives.
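The retry and compensation rules above can be sketched as plain Python. This is not Temporal's SDK: with_retry, process_payment, the error classes and the fakes are hypothetical, and real backoff sleeps are omitted.

```python
class TransientError(Exception):
    """Worth retrying: gateway 502, network blip, DB timeout."""

class PermanentError(Exception):
    """Not worth retrying: declined card, validation error, retries exhausted."""

def with_retry(activity, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return activity()
        except TransientError:
            if attempt == max_attempts:
                raise PermanentError("retries exhausted")
            # a real worker would sleep with exponential backoff here

def process_payment(gateway, ledger):
    auth_id = with_retry(gateway.authorize)
    try:
        capture_id = with_retry(lambda: gateway.capture(auth_id))
    except PermanentError:
        gateway.reverse_auth(auth_id)      # compensate the authorization
        return "failed"
    try:
        with_retry(lambda: ledger.write(capture_id))
    except PermanentError:
        gateway.refund(capture_id)         # compensate the capture
        return "failed"
    return "succeeded"

class FakeGateway:
    def __init__(self):
        self.calls = []
    def authorize(self):
        self.calls.append("authorize")
        return "auth_99"
    def capture(self, auth_id):
        self.calls.append("capture")
        return "rcpt_99"
    def reverse_auth(self, auth_id):
        self.calls.append("reverse_auth")
    def refund(self, capture_id):
        self.calls.append("refund")

class DownLedger:
    """A ledger whose database is unavailable on every attempt."""
    def __init__(self):
        self.attempts = 0
    def write(self, capture_id):
        self.attempts += 1
        raise TransientError("ledger DB unavailable")

gateway, ledger = FakeGateway(), DownLedger()
outcome = process_payment(gateway, ledger)
```

With the ledger down, the write is attempted three times, then the capture is refunded and the workflow reports failure. reverse_auth is never called in that path, because a captured authorization is undone by a refund, not by releasing a hold.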

Saga vs. 2-Phase Commit (2PC)

2PC is the classical answer to multi-system atomicity: a coordinator asks every participant "can you commit?", then if all say yes, says "commit". It does not work for us for two reasons:

🚫 Razorpay doesn't support 2PC

Card networks have no "prepare-to-commit" stage. Once we tell Razorpay to capture, the money moves. You cannot ask the world's payment networks to please pause their transaction until our other systems are ready.

🚫 2PC blocks under coordinator failure

If the 2PC coordinator dies after sending "prepare" but before "commit", participants hold locks indefinitely waiting for the verdict. In a payment context, that means the merchant's account row is locked for hours. Saga has no global lock, so failures degrade gracefully.

Saga vs. 2PC in one line: 2PC tries to give you atomicity by holding everything hostage until everyone agrees; saga gives you atomicity by being willing to undo what's already been done. The first works only when every participant supports a prepare phase (in practice, databases you run yourself); the second is the only thing that works across systems you don't own.

Step 10

Fraud Detection

Fraud is the single biggest non-engineering risk to a payment platform. A 1% chargeback rate gets you throttled by Razorpay (and the underlying Visa/Mastercard rails); a 2% chargeback rate gets your merchant account terminated and you lose your business overnight. Catching fraud before the charge is therefore worth a lot of latency budget. The good news in India: RBI's mandatory 3-D Secure step on every > ₹5,000 card transaction already kills a big chunk of card-not-present fraud at the network level — but it doesn't catch everything (account-takeover fraud where the attacker also intercepts the OTP, friendly fraud, etc.).

Two-stage gate

Stage 1 — Real-time deterministic rules

Sub-100ms. Cheap, fast, blocks the obvious. Examples:

Velocity check — same card seen 10 times in 60 seconds across our system → block
IP blacklist — known fraudster IP / Tor exit node / data-center IP
BIN risk — the card-issuer prefix tells us if it's a prepaid card from a high-fraud country
Geo mismatch — billing address in Bengaluru, IP in Lagos, shipping address in Moscow
Amount threshold — first-ever charge on a brand-new account for ₹4,99,000 (just under the ₹5 lakh CTR threshold) → manual review

Implemented as a Redis-backed counter set + hot rule list, evaluated in <50ms.

Stage 2 — ML scoring

~100ms. Catches the subtle. A gradient-boosted-tree model trained on years of historical fraud data scores every transaction on a 0-1 risk scale based on ~200 features:

Customer history — average transaction size, time-since-signup, prior chargebacks
Merchant profile — chargeback rate, vertical, whether this is a typical purchase
Device signals — fingerprint, browser, OS, language
Transaction shape — amount-vs-typical, time-of-day, items in cart

Score > 0.9 → auto-decline. Score 0.5-0.9 → manual review queue. Score < 0.5 → approve.
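A compact sketch of the two-stage gate, assuming an in-memory sliding window where production would use the Redis-backed counters mentioned above. VelocityRule, route and fraud_gate are hypothetical names; the thresholds are the ones stated in the text, and the model itself is passed in as a callable.

```python
from collections import defaultdict, deque

class VelocityRule:
    """Stage 1 example: block a card once it has been seen `limit` times inside the window."""
    def __init__(self, limit=10, window_s=60):
        self.limit, self.window_s = limit, window_s
        self.seen = defaultdict(deque)     # card_id -> timestamps of recent attempts

    def allow(self, card_id, now_s):
        q = self.seen[card_id]
        while q and now_s - q[0] >= self.window_s:
            q.popleft()                    # drop attempts that fell out of the window
        q.append(now_s)
        return len(q) < self.limit

def route(score):
    """Stage 2: map a 0-1 risk score to an action."""
    if score > 0.9:
        return "decline"
    if score >= 0.5:
        return "review"
    return "approve"

def fraud_gate(rule, card_id, now_s, score_fn):
    if not rule.allow(card_id, now_s):
        return "decline"                   # cheap rule fired; the model is never invoked
    return route(score_fn())
```

Running the rules first means the slower model call is skipped for traffic that is already obviously bad, which keeps the latency budget for the transactions that need it.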

The latency-vs-precision trade-off: we pay roughly 150ms of fraud-check latency on every transaction, which gets us a chargeback rate near 0.1% (industry-leading). Skipping fraud entirely would let us return responses in 400ms instead of 550ms — and would also kill the business inside a year. There is no useful version of "skip fraud to go faster".

Step 11

Reconciliation

"Trust but verify" is the entire job. Every day, a batch process compares our internal ledger against the source-of-truth settlement files from Razorpay, the card networks, and the banks we settle with. The two views must match to the paisa. If they don't, an engineer is paged.

The daily reconciliation pipeline

02:00 IST — pull yesterday's settlement file from Razorpay's S3 bucket (or via their API). Lists every charge they processed for us, with their internal IDs (pay_xxx) and our merchant order IDs.
02:15 IST — query our LedgerEntry table for every successful payment from yesterday (IST day boundary, 18:30 to 18:30 UTC).
02:30 IST — full outer join on (payment_id, amount, currency). Three buckets: (a) match on both sides — green; (b) in Razorpay but not in our ledger — red, money received but unrecorded; (c) in our ledger but not in Razorpay — red, we think we got paid but Razorpay disagrees.
02:45 IST — non-empty red buckets page the on-call. Issue must be triaged within 24 hours per RBI's payment-system operator (PSO) expectations.
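The join step of the pipeline can be sketched with sets; in production this would be a SQL full outer join over much larger tables. The reconcile function and the sample rows are hypothetical.

```python
def reconcile(ledger_rows, gateway_rows):
    """Each input is an iterable of (payment_id, amount_paise, currency) tuples."""
    ours, theirs = set(ledger_rows), set(gateway_rows)
    return {
        "matched": ours & theirs,
        "gateway_only": theirs - ours,   # money received but unrecorded
        "ledger_only": ours - theirs,    # we think we got paid; the gateway disagrees
    }

ledger_rows = [("pay_1", 649900, "INR"), ("pay_2", 10000, "INR"), ("pay_4", 5000, "INR")]
gateway_rows = [("pay_1", 649900, "INR"), ("pay_3", 20000, "INR"), ("pay_4", 5100, "INR")]
buckets = reconcile(ledger_rows, gateway_rows)
needs_page = bool(buckets["gateway_only"] or buckets["ledger_only"])
```

Because the amount is part of the join key, a payment recorded with a different amount on each side (pay_4 here) lands in both red buckets, which is the desired outcome: it needs a human either way.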

Common discrepancy patterns

⏱️ Timing skew

A transaction captured at 23:59 IST might land in our ledger today and Razorpay's settlement file tomorrow. Resolved by widening the window: an entry counts as matched if it appears in the Razorpay file for its own day or for an adjacent day.

🔄 Async webhook lag

Razorpay webhooks for capture-success can arrive after the next-day file. We use the workflow's own state, not just webhooks, as the truth — webhooks are a hint, not the source.

🐞 Real bug

Rare but real: an idempotency-key collision, a saga that didn't compensate properly, a manual UPDATE someone ran in production. This calls for an all-hands investigation; once the root cause is found, the ledger gets corrective entries, and existing rows are never modified.

The recon dashboard is the heartbeat of trust. Investors, auditors, and RBI inspectors all look at the same number: yesterday's reconciliation green-rate. 100% green every day is not a luxury — it is the whole reason a payment platform is allowed to operate under an RBI PA/PG licence.

Step 12

PCI-DSS Compliance

PCI-DSS is the payment-card industry's global data-security standard. RBI layers an Indian-specific regime on top of it — most notably the 2018 data-localisation directive (all payment data must be stored in India) and the 2022 tokenisation mandate (merchants cannot store PANs at all). It is non-optional for anyone handling card data, and the cost of compliance scales sharply with how much of our infrastructure touches card numbers. The goal is therefore to never see a real card number anywhere on our servers.

Tokenization — the single most important PCI scope reducer

sequenceDiagram
  participant U as Priya's Browser
  participant SDK as Razorpay iframe
  participant ST as Razorpay Vault
  participant API as Our API
  participant DB as Our DB
  U->>SDK: Types card 4111-1111-1111-1111
  Note over SDK: Card data NEVER leaves the iframe<br/>Posted directly to Razorpay
  SDK->>ST: Vault the card (network token, RBI-compliant)
  ST-->>SDK: token = "token_abc123"
  SDK->>U: Page receives only token
  U->>API: POST /v1/payments<br/>{ payment_method_id: "token_abc123" }
  Note over API: Backend never sees PAN,<br/>only the opaque network token
  API->>DB: INSERT (..., payment_method_id="token_abc123")
  API->>ST: charge using token_abc123

The Razorpay iframe is hosted on Razorpay's domain, so it doesn't even share the same browsing context as our checkout page. Card data is captured by Razorpay, vaulted by Razorpay (as a network token issued by Visa/Mastercard's token-service-provider — the only legal way to store a card reference in India since the RBI 2022 mandate), and we receive an opaque token like token_abc123 that we can use to charge but cannot reverse-engineer back into a real card number. This single design decision moves our PCI scope from "we are a processor" (PCI Level 1, hundreds of pages of audit) to "we are a tokenized merchant" (PCI SAQ-A, a checklist).

The other PCI + RBI guardrails

🔐 Network isolation

Servers that handle payment tokens live in a tightly-firewalled VPC in AWS ap-south-1 (Mumbai) with no inbound internet access. Egress is whitelisted to Razorpay's IPs and our other services only, and all egress is logged for audit review. Data never leaves Indian soil, per the RBI 2018 directive.

📜 Logged everything, redacted forever

Application logs run through a redaction pipeline that strips anything matching a card-number pattern (Luhn-checkable digit sequences) before writing to the log store. Even if a developer accidentally logs req.body, the PAN never lands on disk.
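A minimal sketch of such a redaction filter. The regular expression, the placeholder text and the function names are assumptions; a production pipeline would also need to handle other separators and encodings.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact(line: str) -> str:
    """Replace Luhn-valid digit runs with a placeholder; leave other numbers alone."""
    def repl(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED-PAN]" if luhn_ok(digits) else match.group()
    return CANDIDATE.sub(repl, line)

print(redact("card=4111 1111 1111 1111 order=1234567812345678"))
# card=[REDACTED-PAN] order=1234567812345678
```

The Luhn check is what keeps false positives down: the 16-digit order number fails the checksum and stays readable, while the test card number is stripped.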

🔑 Secrets via vault, rotated quarterly

Razorpay API keys, signing secrets, and DB credentials are held in HashiCorp Vault or AWS Secrets Manager (Mumbai region). Apps fetch at boot, no secrets in env files or git. Rotated every 90 days minimum.

📋 Annual audit + RBI system audit

An external QSA (Qualified Security Assessor) audits us yearly for PCI-DSS. RBI also requires an annual System Audit Report (SAR) from a CERT-In empanelled auditor, plus quarterly ASV scans. Findings are tracked to closure within 30–90 days depending on severity.

The one-line summary an auditor wants to hear: "We don't store, process, or transmit primary account numbers — we tokenize at the client and only handle opaque network tokens server-side, hosted entirely in our Mumbai region." If that sentence is true and provable, you've reduced ~80% of the PCI burden and you're aligned with RBI's localisation + tokenisation rules in one move.

Step 13

Multi-Currency

A US tourist pays ₹6,499 worth of items but their card is denominated in USD; the merchant Myntra settles in INR. Or the reverse: an NRI in Dubai uses an INR-card on a USD-priced overseas merchant via Razorpay's international corridor. Naïvely you might think "convert the amount, save the converted number". That breaks the audit invariant — the FX spread has to live somewhere too, and the conversion rate at transaction time has to be locked or you can't reconcile.

The rule — entries are always in the account's native currency

An account has a fixed currency. tourist_card_usd is USD. myntra_escrow_inr is INR. A ledger entry's amount is always denominated in that account's currency. FX conversion is itself a transaction with multiple legs — and the spread is credited to a platform revenue account. (Cross-border flows in India also need to comply with FEMA reporting; the FX revenue account makes that filing a simple SQL query.)

US tourist pays $78 USD, Myntra settles INR at a locked rate of 1 USD ≈ ₹83.32 (₹6,499 ÷ $78, rounded to two decimals for display)

#	Account	Currency	Amount	Note
1	tourist_card_usd	USD	`−$78.00`	Card debited
2	platform_fx_pool_usd	USD	`+$78.00`	USD enters platform
3	platform_fx_pool_inr	INR	`−₹6,499.00`	INR leaves platform pool
4	myntra_escrow_inr	INR	`+₹6,499.00`	Merchant credited at the locked rate (≈ ₹83.32 per USD)
5	platform_fx_revenue_inr	INR	`+₹0.00`	Spread / margin (if applicable)

Each currency's entries sum to zero on their own — USD entries (1, 2) sum to zero, INR entries (3, 4, 5) sum to zero. The platform takes on the FX risk; we hedge by holding currency pools and rebalancing them periodically with our AD-Category-I banking partners (the only banks RBI-authorised to handle FX).
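The per-currency invariant can be checked mechanically. The sketch below copies the five entries from the worked example and groups them by currency. The tuple layout and function names are assumptions made for illustration, not the real LedgerEntry schema:

```python
from collections import defaultdict
from decimal import Decimal

# (account, currency, amount) rows copied from the worked example.
entries = [
    ("tourist_card_usd", "USD", Decimal("-78.00")),
    ("platform_fx_pool_usd", "USD", Decimal("78.00")),
    ("platform_fx_pool_inr", "INR", Decimal("-6499.00")),
    ("myntra_escrow_inr", "INR", Decimal("6499.00")),
    ("platform_fx_revenue_inr", "INR", Decimal("0.00")),
]


def currency_totals(rows):
    """Sum the entry amounts separately for each currency."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for _account, currency, amount in rows:
        totals[currency] += amount
    return dict(totals)


def assert_balanced(rows):
    """Raise ValueError if any currency's entries do not sum to zero."""
    unbalanced = {c: t for c, t in currency_totals(rows).items() if t != 0}
    if unbalanced:
        raise ValueError(f"unbalanced currencies: {unbalanced}")
```

Running `assert_balanced(entries)` before committing a transaction would reject any FX transaction where a leg was dropped or an amount was written in the wrong currency.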

The locked-rate rule: the FX rate used for entries 3 and 4 is captured at transaction time, written into the Transaction row, and never recalculated. If the rate later moves, that's our P&L — but the ledger never changes. This is what makes multi-currency reconcilable and what makes monthly FEMA filings auditable.

Step 14

Data Partitioning

1 billion ledger rows per year does not fit on one box once we factor in indices, replicas, and operational headroom. We partition the LedgerEntry table by account_id.

Why account_id, not transaction_id or time

✅ Account_id (chosen)

Most queries are "give me all entries for account X" — balance lookup, statement generation, audit. Sharding by account_id keeps an account's full history co-located on one shard, so SUM queries are local and fast.

❌ Transaction_id

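Keeps each transaction's debit and credit entries on one shard, so the commit itself is a single-shard write. But it scatters every account's history across all shards: each balance lookup, statement, or audit becomes a scatter-gather query, and those are the queries we run most.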

❌ Time

Today's shard is hot; yesterday's is cold. Hot shard becomes the bottleneck. Better used as a secondary partition (sub-partition by month within each account-shard) for archival.
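To make the chosen option concrete, here is a sketch of hash-based routing on account_id. The shard count and the use of SHA-256 are assumptions; the point is only that the mapping is stable, so every entry for one account lands on the same shard:

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count, for illustration only


def shard_for(account_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an account_id to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because Python
    randomises str hashes per process, which would break routing.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

A plain modulo like this reshuffles most accounts when the shard count changes; a lookup table or consistent hashing is the usual way around that, at the cost of more moving parts.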

Hot accounts — the platform escrow problem

Most accounts are tiny — a customer might have a few transactions per year. But the platform escrow account touches every single transaction we process. At 1M tx/day, escrow has 2M+ entries/day on a single shard. That shard becomes a write bottleneck.

Solution: sub-shard hot accounts. The platform escrow is virtually represented as N "shards" (escrow_001, escrow_002, …, escrow_032). Writes are randomly assigned to one of the N. Reads roll up across all N. The balance is SUM over the sub-shards. Net effect: we trade a tiny amount of read complexity for 32× write throughput on the hot account.
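A small in-memory sketch of the sub-sharding idea. `InMemoryLedger` is a hypothetical stand-in for the sharded LedgerEntry table, and the sub-account naming follows the escrow_001 to escrow_032 scheme above:

```python
import random
from collections import defaultdict

N_SUBSHARDS = 32


def pick_escrow_subaccount(rng=random):
    """Choose one of escrow_001..escrow_032 uniformly at random for a write."""
    return f"escrow_{rng.randint(1, N_SUBSHARDS):03d}"


class InMemoryLedger:
    """Hypothetical stand-in for the sharded LedgerEntry table."""

    def __init__(self):
        self._entries = defaultdict(list)  # account_id -> amounts in paise

    def append(self, account_id, amount_paise):
        self._entries[account_id].append(amount_paise)

    def balance(self, account_id):
        return sum(self._entries[account_id])

    def escrow_balance(self):
        # Read path: roll up across all N sub-accounts.
        return sum(
            self.balance(f"escrow_{i:03d}") for i in range(1, N_SUBSHARDS + 1)
        )
```

Because each sub-account has its own account_id, the normal account_id partitioning spreads them across shards without any special casing in the write path.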

Partition by access pattern, not data structure. Account_id is the right key because that's how we query. If our access pattern were "give me yesterday's transactions across all accounts" (e.g., for reconciliation), time-partitioning would win. Pick the partition key by looking at the queries, not the schema.

Step 15

Fault Tolerance

Every component in the system can fail. The interesting question is: does that failure cause money to be lost, money to be duplicated, or just a temporary outage? The first two are unacceptable; the third is recoverable. The architecture is built so that every plausible failure mode lands in bucket three.

What fails	What happens	How we recover
Payment API pod crashes mid-request	Client times out, retries with same idempotency key	Retry hits Redis or DB unique-constraint, returns original result
Orchestrator pod dies between steps	Workflow paused	Temporal reschedules workflow on a healthy pod, resumes at next step
Razorpay gateway has 30-min outage	New payments fail on authorize step	Circuit breaker fails fast; client sees error; no money moved; saga not started; merchant traffic optionally failed over to PayU adapter
Razorpay drops mid-capture	Workflow doesn't know if capture succeeded	Idempotent retry to Razorpay; if still uncertain, query Razorpay API for capture status
Ledger DB primary loses a disk	Sync replica promoted; ~30s of write blockage	Postgres synchronous replication; orchestrator retries blocked writes
Redis idempotency cache wiped	First retry would re-execute	DB UNIQUE constraint on Transaction.idempotency_key catches it; original row read and returned
Notification service down	Webhooks not delivered	Async retry with exponential backoff for 3 days; payment itself unaffected
Whole AZ goes down	~30% of capacity lost	Multi-AZ deployment: traffic shifted to remaining AZs; sync replicas promoted; degraded for <5min

The mental model: failures are loud, not silent. Every failure surface is either (a) auto-retried by Temporal, (b) blocked by an idempotency key, (c) caught by reconciliation the next morning, or (d) raised as an alert. There is no quiet failure path where money disappears and nobody knows for a week.
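As one example of the auto-retry paths in the table, the webhook row ("exponential backoff for 3 days") could be scheduled roughly as follows. The 30-second base delay and 6-hour cap are assumed values; only the 3-day window comes from the table:

```python
from datetime import timedelta


def backoff_schedule(
    base=timedelta(seconds=30),  # assumed first delay
    cap=timedelta(hours=6),      # assumed maximum delay between attempts
    window=timedelta(days=3),    # retry window from the table
):
    """Return the list of delays between webhook retries.

    Each delay doubles until it reaches the cap; retries stop once the
    next delay would push total elapsed time past the window.
    """
    delays = []
    elapsed = timedelta(0)
    delay = base
    while elapsed + delay <= window:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * 2, cap)
    return delays
```

Real deployments usually add random jitter to each delay so that many failed webhooks do not all retry at the same instant.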

Step 16

Interview Q&A

How do you ensure the same payment isn't charged twice if the user double-clicks?

Idempotency keys, two layers of defense. The client SDK generates a UUID before the first request and reuses it on every retry. Server-side: Redis SETNX idempotency:<key> on the fast path, plus a UNIQUE constraint on Transaction.idempotency_key in Postgres as the bulletproof backup. Even if Redis is wiped, the DB rejects the duplicate insert, and we read the original Transaction row and return its stored result. The client can mash "Buy" 100 times: only the first request runs, and the rest return the same result.

What happens if we capture the card but our DB write fails?

Temporal saga compensating action. The orchestrator persisted "capture succeeded" before attempting the ledger write. When the ledger write fails permanently (after retries), the saga runs the compensating action, refund(captureId) against Razorpay, which undoes the capture. Priya sees a payment-failed error; her HDFC card shows the charge, followed by a matching refund within 5–7 working days (RBI refund timeline). The audit log records all five states the payment passed through: authorized → captured → ledger-failed → refund-issued → refund-confirmed. Money is never lost or duplicated; it is only briefly in flight.

Why double-entry ledger over a simple balance column?

Auditability, the math invariant, and zero hot-row contention. A balance column overwrites history (you can't answer "what was the balance at 2pm yesterday?"). Double-entry is append-only: every paisa's movement is a row, queryable forever. The invariant that SELECT SUM(amount_paise) FROM ledger_entry returns 0 (checked per currency once multi-currency accounts exist) is a free correctness check; if it ever drifts from zero, money has been created or destroyed and we can detect it within a minute. And appending rows in parallel scales out across shards, whereas updating one balance row serializes every writer on a single Postgres row lock.
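Both claims, the point-in-time balance and the zero-sum check, are plain queries over an append-only table. The sketch below uses SQLite with an assumed `ledger_entry` layout; the account names, timestamps, and the second (reversal) transaction are made-up illustration data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger_entry (
        id           INTEGER PRIMARY KEY,
        txn_id       TEXT    NOT NULL,
        account_id   TEXT    NOT NULL,
        amount_paise INTEGER NOT NULL,
        created_at   TEXT    NOT NULL  -- ISO-8601, so string order = time order
    )
    """
)
conn.executemany(
    "INSERT INTO ledger_entry (txn_id, account_id, amount_paise, created_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("t1", "priya_card", -649900, "2024-01-09T14:02:06"),
        ("t1", "myntra_escrow", 649900, "2024-01-09T14:02:06"),
        ("t2", "myntra_escrow", -649900, "2024-01-10T09:00:00"),  # later reversal
        ("t2", "priya_card", 649900, "2024-01-10T09:00:00"),
    ],
)
conn.commit()


def balance_at(account_id: str, as_of: str) -> int:
    """Balance of one account as of a timestamp, rebuilt from history."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_paise), 0) FROM ledger_entry "
        "WHERE account_id = ? AND created_at <= ?",
        (account_id, as_of),
    ).fetchone()
    return row[0]


def global_sum() -> int:
    """The ledger-wide invariant: this should always be 0."""
    return conn.execute(
        "SELECT COALESCE(SUM(amount_paise), 0) FROM ledger_entry"
    ).fetchone()[0]
```

With a mutable balance column, the first query has no answer at all once the row has been overwritten.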

How would you build the refund flow?

It's a brand-new transaction, not a mutation of the original. A refund request creates a new Transaction with its own idempotency key, a reference to the original payment, and writes new ledger entries that are equal-and-opposite to the original (debit Myntra's escrow, credit Priya's payment-method-account, both for ₹6,499). The orchestrator then calls Razorpay's refund API. Original transaction stays in the ledger forever — refund is recorded as a separate event with full audit trail. Partial refunds work the same way with a smaller amount. RBI mandates refunds reach the customer within 5–7 working days; we monitor refund-aging dashboards to catch breaches.

How does a saga differ from a 2-phase commit?

2PC takes locks; saga takes responsibility. 2PC asks every participant "can you commit?", holds locks, then says "commit". This requires every participant to support the protocol, and Razorpay / Visa / Mastercard / RuPay do not. 2PC also blocks indefinitely if the coordinator dies after "prepare". A saga gives up on atomicity and instead defines a compensating action for every step. When something fails, the saga undoes the completed steps in reverse order. Saga works across systems we don't own; 2PC works only among participants that all implement the protocol, which in practice means databases and brokers we run ourselves.
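The compensate-in-reverse behaviour fits in a few lines. This is an in-memory sketch only: it does not persist state the way Temporal does, and the step names and no-op compensations in `run_checkout` are hypothetical placeholders for calls such as refund(captureId):

```python
class Saga:
    """Runs steps in order; on failure, undoes completed steps in reverse."""

    def __init__(self):
        self.log = []
        self._compensations = []  # stack of (step name, undo callable)

    def step(self, name, action, compensate):
        try:
            result = action()
        except Exception:
            self.log.append(f"{name}:failed")
            self._unwind()
            raise
        self.log.append(f"{name}:ok")
        self._compensations.append((name, compensate))
        return result

    def _unwind(self):
        # Pop from the end so the most recent step is compensated first.
        while self._compensations:
            name, compensate = self._compensations.pop()
            compensate()
            self.log.append(f"{name}:compensated")


def failing_ledger_write():
    raise RuntimeError("ledger write failed after retries")


def run_checkout(saga, ledger_write):
    # Each lambda stands in for a real call and its compensating action.
    saga.step("reserve_inventory", lambda: "res_1", lambda: None)
    saga.step("capture", lambda: "cap_1", lambda: None)  # undo = refund
    saga.step("ledger_write", ledger_write, lambda: None)
```

Calling `run_checkout(Saga(), failing_ledger_write)` raises the RuntimeError after compensating `capture` and then `reserve_inventory`. No step ever holds a lock on another system, which is the contrast with 2PC.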

How do you keep PCI scope small?

Tokenize at the client, never see the PAN server-side. The Razorpay Checkout SDK (a Razorpay-hosted iframe) collects the card number directly in the browser and ships it to Razorpay's vault, which mints a network token via the Visa/Mastercard TSP and returns it as an opaque token_.... Our backend only ever stores tokens. We don't store, process, or transmit primary account numbers, which moves us from the full PCI-DSS assessment (an audit against every requirement) to the SAQ-A scope (a short checklist), and also aligns us with RBI's 2022 tokenisation mandate that flat-out forbids merchants from storing PANs. Backed up by network isolation, secret rotation, log redaction, quarterly ASV scans, annual QSA audit, and an annual RBI System Audit Report.

How do you handle a Razorpay outage mid-transaction?

It depends where in the saga we were. If we hadn't called Razorpay yet (before authorize), the circuit breaker fails fast — client sees error, no money moved, no compensation needed; large merchants optionally fail over to a secondary PA (PayU/Cashfree) at this point. If we'd authorized but not captured, Temporal retries capture with exponential backoff; if Razorpay stays down for hours we run the reverse-auth compensating action and tell the client we can't fulfill. If we'd captured but couldn't confirm, we use Razorpay's idempotency-key feature to safely re-query — never re-charge. After Razorpay recovers, the reconciliation job verifies our internal ledger matches Razorpay's settlement file to the paisa.

How do you scale the ledger to a billion rows per year?

Partition by account_id; sub-shard hot accounts. Most accounts have tiny histories, so partitioning by account_id keeps each account's full record co-located for fast SUM queries. The platform escrow account, which touches every single transaction, gets sub-sharded into 32 virtual sub-accounts (escrow_001..escrow_032) so writes don't bottleneck on one shard. Reads roll up by summing across the sub-accounts. Snapshot caches materialize per-account balances every few minutes so balance reads are O(1) instead of O(N) over the account's entire history.

The one-line summary the interviewer remembers: "It's a saga-orchestrated workflow with strict idempotency keys writing into a double-entry ledger — the saga handles partial-failure recovery, the keys make retries safe, and the ledger guarantees that SUM(all entries) = 0 always holds. Every other component exists to support those three properties."
