Card Tokenization System — 50K TPS

What is card tokenization, and why does it exist?

Imagine Raj buys a coffee in the Starbucks app. He types his 16-digit Visa card number, 4111 2222 3333 4444, into the checkout screen. That string of digits is called the PAN (Primary Account Number, the number printed on the front of every credit card). For Starbucks to charge Raj, the PAN has to travel from his phone to a payment processor like Stripe, then to Visa, then to Raj's bank. So far so good: that's just a payment.

Now imagine Raj uses Starbucks every morning for two years. Starbucks would love to never ask him for his card again — they want to remember it. The naive solution is "store the PAN in the database." That instantly creates three nightmares:

Nightmare 1 — Breach

A database dump (SQL injection, a stolen backup, a rogue admin) leaks millions of card numbers. Each PAN is worth $5–$50 on the dark web. The 2013 Target breach exposed 40 million PANs and cost Target ~$292M.

Nightmare 2 — PCI-DSS

PCI-DSS (Payment Card Industry Data Security Standard) is the security standard that the card networks (Visa, Mastercard, and others) contractually impose on anyone storing PANs. The moment your DB has even one PAN, every machine that DB lives on falls "in scope": quarterly scans, annual assessments, network segmentation, and a $250K+/year compliance bill.

Nightmare 3 — Sprawl

The PAN gets cached in logs, copied into analytics, replicated to a backup region, printed in a debug trace. Now you have to find and scrub every copy. Most breaches happen because of the copies, not the original.

Tokenization is the way out. It says: "We'll keep the PAN in exactly one vault, locked behind hardware encryption. Everywhere else — your application DB, your analytics, your logs, your CRM — we'll store a meaningless lookup string called a token that looks like a card number (tok_8392_1947_2055_8821) but maps to the real PAN only inside the vault." Now if your application DB leaks, the attacker walks away with tokens that are worthless without access to the vault. Your PCI scope shrinks from "everything" to "just the vault."

        The one-line definition: tokenization replaces sensitive data with a surrogate value (optionally format-preserving) that is reversible only by a tightly controlled service. It's not encryption (the token doesn't contain the PAN); it's a swap, with the real value held elsewhere.
      

This page designs a tokenization service that handles 50,000 tokenize/detokenize calls per second — roughly the volume of every major Indian payment gateway combined. The bulk of the page is the high-level design (capacity, architecture, network topology, deployment, multi-region, observability). The low-level Java code lives at the bottom as an appendix.

Clarify requirements — functional & non-functional

Before drawing boxes, pin down what the system must do, what it must guarantee, and — just as important — what it will not do. This list is what you'd hand the interviewer in the first 5 minutes.

Functional requirements (what)

Tokenize: accept a PAN, return a token. Same PAN → same token (deterministic) or a fresh token every time (random) — both options must be supported.
Detokenize: accept a token, return the PAN. Only callers with the right scope/role get the full PAN; others may get the last-4 digits only.
Card metadata lookup: given a token, return BIN (first 6 digits, identifies the bank), last-4, network (Visa/Mastercard), and expiry — without revealing the full PAN. This is the most common read.
Revoke / rotate: mark a token as dead (card lost) or re-tokenize with a new key (key rotation).
Multi-tenancy: separate merchants must have isolated tokens — Merchant A's token must not detokenize for Merchant B even if they share infra.
Audit log: every detokenize call is recorded — who, when, why, from where. Required by PCI-DSS § 10.
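
To make the metadata fields in the list above concrete, here is a minimal Java sketch of how BIN, last-4, and a masked display form could be derived from a PAN. The class and method names (CardMetadata, bin, last4, masked, network) are assumptions made for illustration, and the network lookup is deliberately simplified to the first digit; a real service would consult a proper BIN table.

```java
/** Sketch: derive the safe-to-display metadata fields from a PAN. */
final class CardMetadata {
    private CardMetadata() {}

    /** Strips spaces and dashes, then checks the result is 12 to 19 digits. */
    static String normalize(String pan) {
        String digits = pan.replaceAll("[\\s-]", "");
        if (!digits.matches("\\d{12,19}")) {
            throw new IllegalArgumentException("PAN must be 12-19 digits");
        }
        return digits;
    }

    /** BIN as this page defines it: the first 6 digits. */
    static String bin(String pan) {
        return normalize(pan).substring(0, 6);
    }

    static String last4(String pan) {
        String d = normalize(pan);
        return d.substring(d.length() - 4);
    }

    /** Masked display form: BIN, asterisks for the middle digits, last-4. */
    static String masked(String pan) {
        String d = normalize(pan);
        return bin(d) + "*".repeat(d.length() - 10) + last4(d);
    }

    /** Simplified assumption: first digit only (4 = Visa, 5 = Mastercard). */
    static String network(String pan) {
        switch (normalize(pan).charAt(0)) {
            case '4': return "VISA";
            case '5': return "MASTERCARD";
            default:  return "UNKNOWN";
        }
    }

    public static void main(String[] args) {
        String pan = "4111 2222 3333 4444"; // Raj's example card from this page
        System.out.println(bin(pan) + " " + last4(pan) + " " + network(pan));
        System.out.println(masked(pan));
    }
}
```

Only the outputs of these helpers would ever be written to the metadata side; the raw PAN argument stays on the vault side.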

Non-functional requirements (how well)

Throughput: 50,000 TPS sustained, 100K TPS peak. Read:write ratio ~ 10:1 (most calls look up metadata; far fewer create new tokens).
Latency: p50 ≤ 5 ms, p99 ≤ 25 ms end-to-end. Payments time out fast — anything over 50 ms breaks checkout flows.
Availability: 99.99% (≈ 52 minutes of downtime/year). A vault outage kills every merchant integration simultaneously.
Durability: zero token loss. Losing a token means losing a customer's card-on-file forever.
Security: PCI-DSS Level 1 compliant. PAN is never logged and never sent over an unencrypted channel; it exists in plaintext only for the milliseconds it takes to encrypt or decrypt it, and the keys never leave the HSM (Hardware Security Module).

        Out of scope (state this explicitly): we are not building the card-acquiring rails (Visa/Mastercard switch), the fraud-scoring engine, the chargeback flow, or the 3DS step-up authentication. We're building only the vault — the thing that holds PANs and hands out tokens. Saying "out of scope" early prevents the interviewer from steering you into a 3-hour discussion.
      

Back-of-the-envelope — does 50K TPS even fit?

Numbers ground every design decision. Before choosing Redis or Cassandra or Postgres, work out how much data we're storing, how much we're moving across the network, and how many machines we'll need. The interviewer is watching for whether you can do this without a calculator.

Traffic

Metric	Calculation	Result
Sustained throughput	given	50,000 req/s
Peak throughput (2× headroom)	50K × 2	100,000 req/s
Daily requests	50K × 86,400 s	~4.3 billion/day
Reads (detokenize + metadata)	91% of total	~45,500 req/s
Writes (new tokenize)	9% of total	~4,500 req/s

Storage

Each row in the vault is roughly: token (24 B) + encrypted PAN (32 B) + merchant_id (16 B) + BIN (6 B) + last4 (4 B) + expiry (4 B) + metadata (50 B) + timestamps (16 B) = 152 bytes, call it ~200 bytes per record (with indices and overhead, budget 500 B).

Horizon	Calculation	Result
Active card-on-file count (industry)	500M unique cards globally	~500M rows
Steady-state storage	500M × 500 B	~250 GB (fits 1 Postgres node easily; we shard anyway)
New tokens/day	4,500/s × 86,400 s (× 0.5 if half are repeats)	~390M tokenize calls/day with no dedup (~195M at a 0.5 repeat factor); ~10M/day new rows once repeat cards are deduplicated
Audit-log writes (every call)	50K/s × 200 B	~10 MB/s ≈ ~860 GB/day — needs cheap blob storage (S3) + a query tier (ClickHouse)

Network & CPU

Resource	Calculation	Result
Inbound bandwidth (50K × 500 B request)	50K × 0.5 KB	~25 MB/s (200 Mbps) per region — trivial
HSM operations/sec	4,500 tokenize writes/s (metadata reads never touch the HSM; detokenize decrypts add to this)	Need cluster of 4–6 HSMs (each handles ~2K ops/s)
Cache hit ratio target	hot tokens (BIN/last4 reads)	≥ 95% — keeps Postgres reads at ~2,500/s, well within one shard

        So what: 250 GB of data is small. The hard parts are the HSM bottleneck on writes, the p99 latency budget (25 ms), and the audit log firehose (860 GB/day). Most of the architecture exists to solve those three, not to "scale storage."
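
The arithmetic in the tables above can be re-derived in a few lines. This is a rough sketch, assuming the inputs stated on this page (50K TPS, a 10:1 read:write ratio, 500 B rows, 200 B audit events, a 95% cache hit ratio); the class and method names are hypothetical. It splits traffic as exactly 10/11 reads, so it lands on 45,455 and 4,545 rather than the rounded 45,500 and 4,500 used in the table.

```java
/** Sketch: back-of-the-envelope numbers for the 50K TPS vault. */
final class CapacityEstimate {
    static final long SUSTAINED_TPS = 50_000L;
    static final long SECONDS_PER_DAY = 86_400L;

    static long dailyRequests() {
        return SUSTAINED_TPS * SECONDS_PER_DAY;
    }

    /** 10:1 read:write means writes are 1/11 of traffic (integer division). */
    static long writesPerSecond() {
        return SUSTAINED_TPS / 11;
    }

    static long readsPerSecond() {
        return SUSTAINED_TPS - writesPerSecond();
    }

    static long steadyStateBytes(long rows, long bytesPerRow) {
        return rows * bytesPerRow;
    }

    /** Every call emits one audit event of the given size. */
    static long auditBytesPerDay(long bytesPerEvent) {
        return SUSTAINED_TPS * bytesPerEvent * SECONDS_PER_DAY;
    }

    /** Reads that miss the cache and fall through to the database. */
    static long dbReadsPerSecond(double cacheHitRatio) {
        return Math.round(readsPerSecond() * (1.0 - cacheHitRatio));
    }

    public static void main(String[] args) {
        System.out.println("requests/day   = " + dailyRequests());
        System.out.println("reads/s        = " + readsPerSecond());
        System.out.println("writes/s       = " + writesPerSecond());
        System.out.println("storage bytes  = " + steadyStateBytes(500_000_000L, 500L));
        System.out.println("audit bytes/day= " + auditBytesPerDay(200L));
        System.out.println("DB reads/s     = " + dbReadsPerSecond(0.95));
    }
}
```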
      

Actors & use cases — who calls this thing, and why?

The vault is an internal service; humans never touch it directly. The "actors" are other services inside our own payments platform. Naming them in a flowchart prevents the common mistake of designing the vault in isolation from the apps it serves.

flowchart LR
    M([Merchant App<br/>e.g. Starbucks])
    P([Payment Gateway<br/>e.g. Stripe internal])
    F([Fraud Service])
    A([Analytics / BI])
    O([Ops Console])
    M --> T1[Tokenize PAN at checkout]
    M --> T2[Charge with token]
    P --> T3[Detokenize before card-network call]
    F --> T4[Read BIN + last4 for risk scoring]
    A --> T5[Read masked metadata only]
    O --> T6[Revoke token / key rotation]
    style M fill:#e8743b,stroke:#e8743b,color:#fff
    style P fill:#4a90d9,stroke:#4a90d9,color:#fff
    style F fill:#9b72cf,stroke:#9b72cf,color:#fff
    style A fill:#38b265,stroke:#38b265,color:#fff
    style O fill:#d4a838,stroke:#d4a838,color:#fff

Notice three things from this diagram. First, the Merchant App never sees the PAN after the initial tokenize call — it only holds tokens. Second, the Payment Gateway is the only caller allowed to fetch the full PAN (and only when it's about to send the transaction to Visa). Third, Analytics can read masked fields (BIN, last4) without unlocking the vault — this is where most "useful queries" actually happen and we want them to be cheap and safe.

API design — five endpoints, no more

A good vault has a deliberately small API. Every extra endpoint is another attack surface. Five endpoints cover 100% of the use cases above.

Endpoint	Purpose	Auth	Idempotent?
`POST /v1/tokens`	Tokenize a PAN. Body: `{ pan, expiry, merchant_id, idempotency_key }`. Returns: `{ token, bin, last4, network }`.	mTLS + service JWT	Yes (via idempotency_key — replays return same token)
`GET /v1/tokens/{token}`	Detokenize. Returns full PAN. Caller must have `scope=pan:read`.	mTLS + JWT + scope	Yes
`GET /v1/tokens/{token}/metadata`	Cheap read — BIN, last4, network, expiry. Used by Fraud, Analytics, UI.	mTLS + JWT	Yes
`DELETE /v1/tokens/{token}`	Revoke. Future detokenize returns 410 Gone. Metadata still readable for forensics.	mTLS + JWT + scope	Yes
`POST /v1/tokens/{token}/rotate`	Re-encrypt PAN under a new KEK (Key Encryption Key). Token stays the same; the ciphertext underneath changes. Run during quarterly key rotation.	mTLS + ops scope	Yes

Why no PATCH /tokens/{token}? Cards are immutable from the vault's perspective — if the customer's card changes, the merchant tokenizes the new PAN and gets a new token. Mutability would create audit-trail nightmares.

        Idempotency matters. Payments retry aggressively (network blips are constant). If the merchant retries POST /v1/tokens three times, all three must return the same token. We pass an idempotency_key (UUID generated by the merchant) and store it in Redis for 24h. Without this, you create duplicate tokens for the same card on every retry — a silent data-quality bug.
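
The retry behaviour described above can be sketched in Java with an in-memory map standing in for Redis. This is an illustration only: IdempotentTokenizer is a hypothetical name, the 24h TTL is not modelled, the key is scoped per merchant as an assumption, and the token is a random UUID rather than the page's UUIDv7 format.

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: replays of the same idempotency key return the same token. */
final class IdempotentTokenizer {
    // Stand-in for Redis: "merchant:idempotency_key" -> token (TTL omitted).
    private final ConcurrentHashMap<String, String> seen = new ConcurrentHashMap<>();

    /** Mints a token on first sight of the key; returns the stored one on retry. */
    String tokenize(String merchantId, String idempotencyKey) {
        String scopedKey = merchantId + ":" + idempotencyKey;
        // computeIfAbsent is atomic, so concurrent retries mint only once.
        return seen.computeIfAbsent(scopedKey, k -> mintToken());
    }

    private static String mintToken() {
        return "tok_" + UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        IdempotentTokenizer t = new IdempotentTokenizer();
        String first = t.tokenize("merchant-42", "key-1");
        String retry = t.tokenize("merchant-42", "key-1");
        System.out.println(first.equals(retry)); // same key, same token
    }
}
```

In a real deployment the atomic check-and-set would be a single Redis round-trip, which is what keeps the lookup cheap on the write path.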
      

High-Level Architecture — built up in three passes

Rather than dumping the final diagram and labelling boxes, let's discover the architecture the way you'd reason through it on a whiteboard. Three passes: naive → mental model → production. At each step a real failure forces the next piece into existence.

Pass 1 — The naive design (and exactly where it breaks)

The simplest thing that could work: one Postgres database with a table (token, pan, merchant_id, ...). Stripe → Postgres → done. Two endpoints, one table.

flowchart LR
    C[Merchant App] --> S[Tokenize Service]
    S --> DB[(Postgres<br/>token, pan, ...)]
    style DB fill:#e05252,stroke:#e05252,color:#fff

Now list what breaks. This is the most important sentence of any architecture interview: "This naive design fails because…"

Breaks at security

The PAN is stored in plaintext (or with a key that lives on the same box as the data). One pg_dump and the attacker has 500 million cards. PCI-DSS would refuse to certify this design on day one.

Breaks at throughput

One Postgres node handles ~10K writes/s with fsync on. At 50K TPS sustained you'd hit lock contention, WAL backpressure, and 100% disk utilization within minutes. There's no horizontal scaling story here.

Breaks at latency

Every metadata lookup hits disk. A cold Postgres read is 5–10 ms by itself. Add network + auth + serialization and you're already past the 5 ms p50 budget. No room for the payment gateway, fraud check, etc.

Each of these three failures will drive one specific component into the production design — a Vault Service with HSM for security, sharded storage for throughput, and a read-through cache for latency.

Pass 2 — The mental model: split the "answer" from the "secret"

Here's the central architectural idea, and it's the one most candidates miss: the PAN and the token are two completely different objects with completely different access patterns.

The PAN side — "the secret"

Touched only by the Vault Service. Stored encrypted, key inside an HSM. Read by one caller (the Payment Gateway), and only at the moment of card-network submission. Very few reads. Locked-down network. PCI-DSS scope.

Optimize for: security, not speed.

The Token side — "the answer"

Touched by everyone — merchant apps, fraud, analytics, UI, BI dashboards. Mostly reads. Mostly metadata (BIN, last4) not the PAN itself. Cache-friendly. No PCI scope as long as it never holds the PAN.

Optimize for: speed, scale, cheap reads.

Once you internalize this split, every other choice falls out naturally: encrypted PAN goes in a separate datastore from the metadata, the cache holds only metadata (never PAN), the HSM sits in front of the PAN-store and nothing else. The architecture is just the careful drawing of this fence.

        Analogy: think of a hotel safe. The contents of the safe (your passport, cash) come out maybe once a week. But the front desk looks up "is room 304 occupied? what name? what time did they check in?" hundreds of times a day. You don't open the safe to answer those questions — you ask the front desk. Tokenization works the same way: PAN = safe contents, metadata = front-desk register, vault service = the only key to the safe.
      

Pass 3 — The production shape (numbered)

Now we draw the real architecture. Every numbered box is a component we added to solve a specific failure mode from Pass 1 or to honor the split from Pass 2. After the diagram, each number gets its own card with "what it does" and "what problem it solves."

flowchart TB
    subgraph CLIENT [Client Tier]
        M([① Merchant SDK<br/>browser / mobile])
    end
    subgraph EDGE [Edge / Auth]
        LB[② API Gateway + mTLS]
        AZ[③ Auth Service<br/>JWT · scopes · rate-limit]
    end
    subgraph CORE [Tokenization Core]
        TS[④ Tokenize Service<br/>stateless · auto-scaled]
        CA[⑤ Redis Cache<br/>token → metadata]
        DT[⑥ Detok Service<br/>PAN-read path · isolated]
    end
    subgraph VAULT [Vault Zone · PCI-DSS in-scope]
        HSM[(⑦ HSM Cluster<br/>FIPS 140-2 L3)]
        VDB[(⑧ Vault DB<br/>encrypted PAN<br/>sharded by token-hash)]
    end
    subgraph META [Metadata Tier · out of PCI scope]
        MDB[(⑨ Metadata DB<br/>BIN · last4 · expiry)]
    end
    subgraph AUDIT [Audit + Async]
        K[⑩ Kafka<br/>audit events]
        CH[(⑪ ClickHouse<br/>audit queries)]
        S3[(⑫ S3 Glacier<br/>7-year retention)]
    end
    M -->|HTTPS + idempotency-key| LB
    LB --> AZ
    AZ --> TS
    AZ --> DT
    TS -->|write| VDB
    TS -->|encrypt PAN| HSM
    TS -->|write metadata| MDB
    TS -->|invalidate| CA
    DT -->|cache hit ~95%| CA
    CA -.miss.-> MDB
    DT -->|PAN read| VDB
    DT -->|decrypt PAN| HSM
    TS --> K
    DT --> K
    K --> CH
    K --> S3
    style M fill:#e8743b,stroke:#e8743b,color:#fff
    style LB fill:#4a90d9,stroke:#4a90d9,color:#fff
    style AZ fill:#4a90d9,stroke:#4a90d9,color:#fff
    style TS fill:#9b72cf,stroke:#9b72cf,color:#fff
    style CA fill:#38b265,stroke:#38b265,color:#fff
    style DT fill:#9b72cf,stroke:#9b72cf,color:#fff
    style HSM fill:#e05252,stroke:#e05252,color:#fff
    style VDB fill:#e05252,stroke:#e05252,color:#fff
    style MDB fill:#38b265,stroke:#38b265,color:#fff
    style K fill:#d4a838,stroke:#d4a838,color:#fff
    style CH fill:#d4a838,stroke:#d4a838,color:#fff
    style S3 fill:#3cbfbf,stroke:#3cbfbf,color:#fff

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card below. Every card has the same shape: what it does in plain language, then what would break if we removed it.

① Merchant SDK

A small JavaScript / iOS / Android library that the merchant embeds in their checkout page. When Raj types his card, the SDK collects the PAN in the browser, opens a TLS connection straight to our API gateway, and hands back a token to the merchant's own backend. The merchant's server never sees the raw PAN — that's the magic that keeps the merchant out of PCI scope.

Solves: merchant PCI scope (the "iframe trick"). If the SDK didn't exist and the merchant POSTed the PAN from their own backend, the merchant's web server, load balancer, and logs would all be in PCI scope. The SDK shrinks scope to just our infra.

② API Gateway + mTLS

The front door. Terminates TLS 1.3, enforces mutual TLS (the client presents a certificate proving "I am Merchant 42's checkout SDK"), drops malformed traffic, rate-limits by client cert. Built on Envoy or NGINX. Stateless — easy to scale horizontally to hundreds of pods behind an L4 load balancer.

Solves: raw exposure of the Tokenize Service. Without a gateway you'd put auth, rate-limiting, and TLS termination inside the app — every change forces a redeploy of business code. The gateway lets us update certificates and rate-limit rules with zero impact on the core service.

③ Auth Service

Mints and validates short-lived JWT tokens that carry the caller's identity (merchant_id, service_name) and scopes (pan:read, token:create, token:revoke). A merchant SDK gets token:create only — never pan:read. The Payment Gateway gets both. This separation is what stops a compromised merchant key from unlocking PANs.

Solves: "anyone with the right URL gets PAN" — without scope-aware auth, every authenticated caller could detokenize. PCI-DSS § 7 explicitly mandates least-privilege access to PANs.
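
As a rough illustration of the scope separation described for the Auth Service, the sketch below maps scope sets to an allow/deny decision. The scope strings come from this page; the class name, the constant names, and the choice of 403 for a missing scope are assumptions for the example, and JWT parsing and signature checks are left out entirely.

```java
import java.util.Set;

/** Sketch: least-privilege scope check, with JWT handling omitted. */
final class ScopePolicy {
    static final String PAN_READ = "pan:read";
    static final String TOKEN_CREATE = "token:create";
    static final String TOKEN_REVOKE = "token:revoke";

    // Scope sets as described on this page: the SDK can only create tokens.
    static final Set<String> MERCHANT_SDK = Set.of(TOKEN_CREATE);
    static final Set<String> PAYMENT_GATEWAY = Set.of(TOKEN_CREATE, PAN_READ);

    /** Returns 200 when the caller holds the required scope, otherwise 403. */
    static int authorize(Set<String> callerScopes, String requiredScope) {
        return callerScopes.contains(requiredScope) ? 200 : 403;
    }

    public static void main(String[] args) {
        System.out.println(authorize(MERCHANT_SDK, TOKEN_CREATE));  // allowed
        System.out.println(authorize(MERCHANT_SDK, PAN_READ));      // denied
        System.out.println(authorize(PAYMENT_GATEWAY, PAN_READ));   // allowed
    }
}
```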

④ Tokenize Service (write path)

The hot write path. Stateless, auto-scaled, written in Go or Java. For every POST /tokens: (1) check idempotency key in Redis; (2) call HSM to encrypt PAN with the active DEK (Data Encryption Key); (3) generate a token (UUIDv7 or HMAC-based, see §8); (4) write (token, ciphertext, KEK_id, merchant_id) to Vault DB; (5) write (token, BIN, last4, expiry) to Metadata DB; (6) publish audit event to Kafka. Returns the token to the caller.

Solves: hand-rolling all of these steps in the merchant or gateway code. Centralizing them here means there's exactly one place that knows how to mint a token correctly — and one place to audit.

⑤ Redis Cache (metadata)

A 6-node Redis cluster holding the hot token → {BIN, last4, expiry, network} mapping. TTL 1 hour, evicted LRU. Holds no PAN ever — only the safe-to-display fields. Hit ratio ≥ 95% in steady state because the same cards are charged repeatedly (subscriptions, recurring buyers).

Solves: the 5 ms p50 latency budget. Without the cache, every metadata read hits Postgres and you're at 10–15 ms p50. With the cache, hot reads are ~0.5 ms and the DB sees only 5% of traffic.

⑥ Detokenize Service (PAN-read path)

A separate service from Tokenize, on a separate network segment, with stricter IAM. Only callers with scope=pan:read reach it. For every call: verify scope, fetch ciphertext from Vault DB, ask HSM to decrypt, return PAN. Every call writes an audit event before the response leaves the service.

Solves: blast radius. If the Tokenize Service is compromised, the attacker can mint tokens but cannot read PANs. The two paths share zero code and zero credentials.

⑦ HSM Cluster (FIPS 140-2 Level 3)

Hardware Security Modules: physical, tamper-resistant boxes from Thales or AWS CloudHSM. They hold the KEK (Key Encryption Key) that wraps every DEK (Data Encryption Key). The DEK encrypts the PAN. The KEK never leaves the HSM and never enters our application memory; the PAN's plaintext exists in service memory only for the duration of a request and is zeroized afterward. Active-active cluster of 4–6 boxes for failover and throughput (~2K ops/s each).

Solves: the moral hazard of "where is the master key?" If the master key lives in a config file or environment variable, anyone who reads it owns every PAN. An HSM physically prevents key extraction — even root on the box can't dump it.
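
The KEK-wraps-DEK, DEK-encrypts-PAN chain can be sketched with the JDK's own crypto classes. This is a software stand-in, not an HSM integration: here the KEK is an ordinary in-memory key, whereas in the design above it would never leave the hardware. The class and method names are hypothetical, and generating a fresh DEK per PAN is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: envelope encryption with AES-256-GCM (software KEK, no HSM). */
final class EnvelopeSketch {
    private static final SecureRandom RNG = new SecureRandom();
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    static SecretKey newAesKey() {
        try {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(256);
            return kg.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts with a fresh random IV; returns IV followed by ciphertext and tag. */
    static byte[] seal(SecretKey key, byte[] plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RNG.nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = c.doFinal(plaintext);
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reverses seal(): reads the IV prefix, then decrypts and verifies the tag. */
    static byte[] open(SecretKey key, byte[] sealed) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, key,
                   new GCMParameterSpec(TAG_BITS, sealed, 0, IV_BYTES));
            return c.doFinal(sealed, IV_BYTES, sealed.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns { ciphertextPan, wrappedDek }, the two byte columns of a vault row. */
    static byte[][] encryptPan(SecretKey kek, String pan) {
        SecretKey dek = newAesKey();
        byte[] ciphertextPan = seal(dek, pan.getBytes(StandardCharsets.UTF_8));
        byte[] wrappedDek = seal(kek, dek.getEncoded());
        return new byte[][] { ciphertextPan, wrappedDek };
    }

    static String decryptPan(SecretKey kek, byte[] ciphertextPan, byte[] wrappedDek) {
        SecretKey dek = new SecretKeySpec(open(kek, wrappedDek), "AES");
        return new String(open(dek, ciphertextPan), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SecretKey kek = newAesKey();
        byte[][] row = encryptPan(kek, "4111222233334444");
        System.out.println(decryptPan(kek, row[0], row[1]));
    }
}
```

Rotating the KEK in this shape only means re-wrapping the small DEK blob; the PAN ciphertext itself does not have to change.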

⑧ Vault DB (encrypted PAN store)

Postgres or DynamoDB, sharded by hash(token) across 16 shards. Stores (token, ciphertext, kek_id, merchant_id, status, created_at). PAN is encrypted with AES-256-GCM using a DEK that's itself encrypted by the HSM's KEK (envelope encryption). PCI-DSS in-scope — full disk encryption, network-isolated, no analytics access.

Solves: a single-DB hotspot. With 16 shards, write load is spread evenly. Sharding by hash(token) — not by merchant_id — avoids one big merchant becoming a hot shard.
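
A shard lookup by hash(token) might look like the sketch below. The 16-shard count is from this page; using CRC32 as the hash and plain modulo routing are assumptions chosen for brevity, and ShardRouter is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch: route a token to one of 16 vault shards by hashing the token. */
final class ShardRouter {
    static final int SHARD_COUNT = 16;

    /** Same token always maps to the same shard in the range 0..15. */
    static int shardFor(String token) {
        CRC32 crc = new CRC32();
        crc.update(token.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is non-negative too.
        return (int) (crc.getValue() % SHARD_COUNT);
    }

    public static void main(String[] args) {
        int[] perShard = new int[SHARD_COUNT];
        for (int i = 0; i < 16_000; i++) {
            perShard[shardFor("tok_" + i)]++;
        }
        for (int s = 0; s < SHARD_COUNT; s++) {
            System.out.println("shard " + s + ": " + perShard[s]);
        }
    }
}
```

Plain modulo routing is the simplest choice but makes changing the shard count expensive, since most tokens would map to a different shard afterwards.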

⑨ Metadata DB (out of PCI scope)

A separate Postgres (or DynamoDB) with (token, BIN, last4, expiry, network, merchant_id, status). No PAN, no ciphertext. Because it never holds PAN data, it's not in PCI scope — analytics, fraud, BI can query it freely. Replicated to a read replica for analytics.

Solves: "every analyst is a PCI risk." When metadata lives on a non-PCI box, your data team can join it with order tables without escalating scope.

⑩ Kafka (audit event bus)

Every tokenize / detokenize / revoke call writes a structured event to a Kafka topic: {event_id, ts, actor, action, token, merchant_id, ip, scope_used}. Producer waits for acks=all on the write path because we treat the audit record as part of the operation: PCI-DSS § 10 requires an audit trail for every access to cardholder data, so if the log write fails, the operation fails too.

Solves: "did you actually log every PAN read?" Synchronous Kafka write with acks=all gives durability without coupling your latency to a slow log DB.

⑪ ClickHouse (audit query tier)

Kafka events are streamed into ClickHouse via Kafka Engine tables. Auditors query: "show me every detokenize on token X in the last 90 days" — ClickHouse answers in < 100 ms over billions of rows. Compaction + columnar layout keeps storage cheap.

Solves: running ad-hoc queries on Postgres would slow down the write path. A columnar OLAP store is built exactly for this audit pattern.

⑫ S3 Glacier (long-term retention)

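PCI-DSS itself requires at least 12 months of audit-log history, with the most recent three months immediately available; this design keeps 7 years, a longer window that other record-retention obligations can demand. Storing all that in ClickHouse would cost a fortune. After 90 days, events tier down to S3 Glacier: pennies per GB, retrieval takes hours, but auditors don't need millisecond access for 5-year-old events.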

Solves: the long-tail-storage bill. ClickHouse handles the last 90 days hot, Glacier handles the next 6 years cold, deletion happens at year 7.

Concrete walkthrough — Raj's first Starbucks purchase

Let's trace one real request through every numbered component. Watch how the secret / answer split from Pass 2 keeps the PAN in one place.

Raj types 4111 2222 3333 4444 into Starbucks checkout. The Merchant SDK ① running in his browser collects the PAN and POSTs it to our API Gateway ② over TLS 1.3, presenting Starbucks' client cert.
Gateway validates the cert, forwards to Auth Service ③, which checks the scope is token:create only (no pan:read), then routes the request to Tokenize Service ④.
Tokenize Service checks Redis for the idempotency key — miss. It calls HSM ⑦ to encrypt the PAN under the current DEK, getting back ciphertext + DEK-wrapped-by-KEK. It generates a UUIDv7 token tok_018f....
Tokenize Service writes the ciphertext row to Vault DB ⑧ and the metadata row (BIN 411122, last4 4444) to Metadata DB ⑨ in a two-phase commit.
Tokenize Service publishes an audit event to Kafka ⑩ with acks=all, then returns the token to Starbucks. Total latency: ~7 ms.
Five months later, Raj clicks "Buy" on his daily latte. Starbucks' backend sends { token: tok_018f..., amount: 250 } to its payment gateway.
The payment gateway calls Detokenize Service ⑥ with scope=pan:read. Detok service hits Cache ⑤ first for the BIN routing (95% hit), then fetches ciphertext from Vault DB ⑧, asks HSM ⑦ to decrypt, gets the PAN back in about 3 ms, and immediately submits to Visa.
Both calls are written to Kafka → ClickHouse ⑪ in real time. 90 days later they tier down to Glacier ⑫.

        So what: Raj's PAN entered our system once, lived inside the Vault Zone (⑦⑧) only, and was in plaintext exactly twice in this walkthrough: once when it was encrypted at tokenize, once when it was decrypted at charge time. Everyone else (merchant, fraud, analytics) operated on the token or the masked metadata. That's the entire job of a tokenization platform in one sentence.
      

Latency Budget — every millisecond accounted for

"p99 ≤ 25 ms" sounds easy until you add up where the milliseconds actually go. The 25 ms budget has to cover network round-trips, mTLS handshakes (mostly cached), auth, the HSM RPC, two DB writes, a Kafka acks=all, and the response serialization. If any one of these blows its share, the budget is gone. Let's pull each flow apart by component so you can defend it on a whiteboard.

Tokenize (write path) — target p99 = 18 ms

Tokenize is the most expensive call because it hits the HSM and writes to two databases. Here's the millisecond-by-millisecond breakdown a senior engineer would walk through:

Gateway + mTLS 0.5 ms · Auth 0.4 ms · Idem-key lookup 0.3 ms · HSM encrypt 4–6 ms · Vault DB write 1.5 ms · Metadata DB write 1.5 ms · Kafka acks=all 1.5 ms · Response 0.3 ms

Sum p50 ≈ 10 ms · p99 ≈ 18 ms (HSM tail dominates)

Hop	p50	p99	What's happening
TLS handshake (amortized)	0.05 ms	0.5 ms	TLS 1.3 0-RTT or session resumption; only first call on a connection pays the full 1-RTT cost.
API Gateway routing	0.2 ms	0.6 ms	Envoy decode, route match, header rewrite. Tail is GC pause inside Envoy.
Auth Service (JWT verify)	0.3 ms	0.8 ms	Public-key verify (RS256) — CPU bound. Cached JWKs avoid network calls.
Idempotency lookup (Redis)	0.3 ms	1 ms	`SET NX EX` single round-trip in same AZ.
HSM encrypt (the bottleneck)	4 ms	10 ms	Wrap DEK + AES-GCM encrypt PAN. CloudHSM RPC + crypto = ~3 ms median; tail is HSM queue depth.
Vault DB write	1 ms	3 ms	One `INSERT` with sync commit on Postgres primary in same AZ.
Metadata DB write	1 ms	3 ms	Second INSERT, parallel with vault write under best implementation.
Kafka `acks=all`	1.5 ms	4 ms	Replicate to 2 ISRs before ack. We accept this cost so the audit record that PCI § 10 calls for is durable before we respond.
Response serialization	0.2 ms	0.5 ms	JSON encode + send.
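
As a quick sanity check on the table above, the per-hop figures can be added up in code. This is only a planning heuristic: percentiles do not add, so the summed p99 column is a rough pessimistic figure rather than a true end-to-end p99, and the sum assumes the two DB writes run one after the other. The class name and array layout are assumptions for the example.

```java
/** Sketch: add up the per-hop latency figures from the tokenize table. */
final class LatencyBudget {
    // Each row is {p50 ms, p99 ms}, in the same order as the table's hops.
    static final double[][] TOKENIZE_HOPS = {
        {0.05, 0.5},  // TLS handshake (amortized)
        {0.2, 0.6},   // API Gateway routing
        {0.3, 0.8},   // Auth Service (JWT verify)
        {0.3, 1.0},   // Idempotency lookup (Redis)
        {4.0, 10.0},  // HSM encrypt
        {1.0, 3.0},   // Vault DB write
        {1.0, 3.0},   // Metadata DB write
        {1.5, 4.0},   // Kafka acks=all
        {0.2, 0.5},   // Response serialization
    };

    /** Sums one column: 0 for the p50 figures, 1 for the p99 figures. */
    static double sum(int column) {
        double total = 0.0;
        for (double[] hop : TOKENIZE_HOPS) {
            total += hop[column];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("sum of p50 figures: %.2f ms%n", sum(0));
        System.out.printf("sum of p99 figures: %.2f ms%n", sum(1));
        System.out.printf("HSM share of p99 sum: %.0f%%%n", 100.0 * 10.0 / sum(1));
    }
}
```

Even the summed p99 column stays under the 25 ms end-to-end budget, and the HSM row alone accounts for over 40% of it.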

Detokenize (PAN-read) — target p99 = 15 ms

Detokenize skips the metadata write but still touches the HSM. The hot read is from Vault DB (which is small enough to fit fully in page cache) rather than disk.

Gateway 0.5 ms · Auth + scope 0.5 ms · Vault DB read 1.5 ms · HSM decrypt 3–5 ms · Audit Kafka 1.5 ms · Response 0.3 ms

Sum p50 ≈ 8 ms · p99 ≈ 15 ms

Metadata read (the hot 95%) — target p99 = 5 ms

The cheap path. No HSM, no Vault DB. Just gateway + auth + Redis. This is where most of the 50K TPS goes.

Gateway 0.5 ms · Auth 0.4 ms · Redis GET 0.5 ms · Response 0.2 ms

Sum p50 ≈ 1.6 ms · p99 ≈ 5 ms

Where the tail (p99) actually comes from

HSM queue depth

Single HSMs handle ~2K ops/sec. When all 4 boxes hit 80% utilization, queuing latency adds 5–10 ms to p99. Mitigation: keep utilization under 60% via aggressive autoscaling on HSM-pool depth metric.

JVM/Go GC pauses

Go's STW pause is < 1 ms with current GC; older Java apps can pause 50+ ms during full GCs. Pin the Tokenize Service to G1GC with low pause targets, or use Shenandoah/ZGC.

Kafka leader election

If a Kafka broker fails mid-request, the producer waits for a new leader to be elected (typically < 6 s). During this window every audit-blocking write spikes — we use very short retry budgets to fail fast instead of waiting.

        Interview gold: when asked "why is your p99 25 ms?" — point to the HSM line. Hardware crypto modules are the dominant tail in every real tokenization system. The fix is not "use a faster algorithm" — it's "pool more HSMs and keep them lightly loaded."
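
The "pool more HSMs and keep them lightly loaded" rule can be turned into a small sizing calculation. This sketch uses the figures quoted on this page (about 4,500 HSM operations per second from tokenize writes, about 2K ops/s per HSM, a 60% utilization ceiling); the class and method names are hypothetical, and detokenize load would have to be added to the first argument.

```java
/** Sketch: size the HSM pool from load, per-box capacity, and a utilization cap. */
final class HsmPoolSizing {
    /** Smallest pool that keeps average utilization at or below the cap. */
    static int minHsms(double opsPerSecond, double opsPerHsm, double maxUtilization) {
        return (int) Math.ceil(opsPerSecond / (opsPerHsm * maxUtilization));
    }

    /** Average utilization of a pool of the given size (1.0 means saturated). */
    static double utilization(double opsPerSecond, double opsPerHsm, int hsms) {
        return opsPerSecond / (opsPerHsm * hsms);
    }

    public static void main(String[] args) {
        int pool = minHsms(4_500, 2_000, 0.60);
        System.out.println("HSMs needed: " + pool);
        System.out.println("utilization: " + utilization(4_500, 2_000, pool));
    }
}
```

With those inputs the answer is 4 boxes at about 56% utilization, which matches the low end of the 4–6 range; the extra boxes cover failover and detokenize traffic.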
      

Tokenization strategies — how do you actually generate the token?

"Generate a token" sounds trivial — give it a UUID and call it a day. In practice there are three families of approaches, each with very different consequences for the database, the cache, and the security posture.

A. Random / vault-based

Generate a cryptographically random ID (UUIDv4 or v7), store the row (token → PAN) in the vault DB. Token has zero relation to the PAN — knowing the token tells the attacker nothing.

Pros: Strongest security. Format flexible. Easy to revoke (delete row).

Cons: Requires DB lookup every detokenize (mitigated by cache). Same PAN tokenized twice gives different tokens unless we dedupe.

Use when: default choice for 90% of vaults. This is what we use.

B. Deterministic / HMAC-based

Token = HMAC-SHA256(PAN, secret_key) truncated to 16 bytes. Same PAN always gives the same token. No DB lookup needed for tokenize.

Pros: Idempotent for free. Useful when merchants want "same card = same token" for analytics across sessions.

Cons: If the secret leaks, an attacker can tokenize a guessed PAN and check if it matches your stored tokens — known-PAN attack. Mitigation: per-merchant secret. We still need a DB to detokenize.

Use when: the merchant explicitly needs cross-session card recognition without storing PANs themselves.

C. Format-Preserving Encryption (FPE)

Use NIST FF1/FF3-1 algorithms to encrypt the PAN into a string that also looks like a PAN (16 digits; the Luhn check digit has to be recomputed separately, since FPE alone does not guarantee a valid checksum). The "token" is the encrypted PAN itself, so no vault lookup is needed.

Pros: Token fits in existing card-number columns in legacy systems. No DB hop for detokenize — just decrypt.

Cons: Compromise of the FPE key compromises every token. Tokens are not revocable independently — you'd rotate the key for everyone. Slower than HMAC.

Use when: retrofitting legacy mainframe systems that cannot hold a non-numeric token.

        Our choice — random vault tokens with a deterministic option: by default we mint a UUIDv7 (time-ordered for index locality). If the merchant sends "strategy":"deterministic" in the tokenize call, we instead compute HMAC-SHA256(PAN, merchant_specific_pepper). The merchant-specific pepper means a leak in one merchant's keys doesn't blow up every merchant.
      

Token format we emit: tok_<base32(uuidv7)>, e.g. tok_018f7a2b9c4d6e8f9a1b2c3d (shortened for illustration; a full 128-bit UUID encodes to 26 base32 characters). Prefix lets logs/dashboards tell tokens apart from other IDs at a glance. Base32 (not base64) avoids URL-encoding issues.
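
The random and deterministic strategies can be compared side by side in a short Java sketch. Two substitutions are made for simplicity and should be read as assumptions: the random token uses the JDK's UUID.randomUUID() (a version 4 UUID, since the JDK has no built-in UUIDv7), and both tokens are hex-encoded rather than base32. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.UUID;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Sketch: random vault token vs. deterministic HMAC token. */
final class TokenStrategies {
    /** Strategy A: a random ID with no relation to the PAN. */
    static String randomToken() {
        return "tok_" + UUID.randomUUID().toString().replace("-", "");
    }

    /** Strategy B: HMAC-SHA256 keyed by a per-merchant pepper, truncated to 16 bytes. */
    static String deterministicToken(String pan, byte[] merchantPepper) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(merchantPepper, "HmacSHA256"));
            byte[] digest = mac.doFinal(pan.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder("tok_");
            for (int i = 0; i < 16; i++) {
                sb.append(String.format("%02x", digest[i] & 0xff));
            }
            return sb.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pepperA = "pepper-for-merchant-a".getBytes(StandardCharsets.UTF_8);
        byte[] pepperB = "pepper-for-merchant-b".getBytes(StandardCharsets.UTF_8);
        String pan = "4111222233334444";
        System.out.println(randomToken());
        System.out.println(deterministicToken(pan, pepperA));
        System.out.println(deterministicToken(pan, pepperB)); // differs per merchant
    }
}
```

The same PAN yields the same deterministic token within one merchant and a different one across merchants, which is the isolation property the per-merchant pepper is meant to give.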

Data model — ER diagram

We deliberately keep the PAN-bearing table in a different schema (and on a different DB) from the metadata. The ER diagram below shows both schemas as if they were one logical model — but in production they live behind different network boundaries.

erDiagram
    TOKEN {
        string token PK
        string merchant_id FK
        string kek_id FK
        bytes ciphertext_pan
        bytes wrapped_dek
        string status
        datetime created_at
        datetime revoked_at
    }
    METADATA {
        string token PK,FK
        string bin
        string last4
        string network
        string expiry_yyyymm
        string merchant_id FK
        string status
        datetime created_at
    }
    MERCHANT {
        string merchant_id PK
        string name
        string pci_tier
        string pepper_id FK
        datetime onboarded_at
    }
    KEK {
        string kek_id PK
        string hsm_serial
        int version
        datetime created_at
        datetime rotated_at
    }
    AUDIT_EVENT {
        string event_id PK
        string token FK
        string actor
        string action
        string scope
        string ip
        datetime ts
    }
    MERCHANT ||--o{ TOKEN : "owns"
    MERCHANT ||--o{ METADATA : "owns"
    TOKEN ||--|| METADATA : "shares token"
    KEK ||--o{ TOKEN : "wraps DEK"
    TOKEN ||--o{ AUDIT_EVENT : "logged in"

Three things to note:

TOKEN and METADATA share the same PK (token) but live in different physical databases. The join happens only inside the Vault Service — the Metadata DB never knows the ciphertext exists, and the Vault DB never holds BIN/last4 (so a vault leak doesn't reveal which cards belong to whom).
KEK has versions. When we rotate the master key (quarterly in this design; PCI-DSS requires a defined cryptoperiod rather than a fixed interval), we don't decrypt and re-encrypt every row immediately. We mark kek_id_v2 active for new tokens, and rotate old rows in a background job. The token PK never changes.
AUDIT_EVENT is append-only. It lives in Kafka + ClickHouse, not in the relational vault. Never updated, never deleted before the 7-year retention window expires.

Sequence diagrams — the three hot flows

Tokenize (write path)

sequenceDiagram
    actor User as Raj (browser)
    participant SDK as Merchant SDK
    participant GW as API Gateway
    participant TS as Tokenize Service
    participant ID as Redis (idempotency)
    participant HSM as HSM
    participant VDB as Vault DB
    participant MDB as Metadata DB
    participant K as Kafka
    User->>SDK: enters PAN
    SDK->>GW: POST /v1/tokens {pan, idem_key, merchant_id}
    GW->>TS: mTLS-validated request
    TS->>ID: GET idem_key
    alt cache miss (new request)
        TS->>HSM: wrap(DEK) + encrypt PAN
        HSM-->>TS: ciphertext + wrappedDEK
        TS->>VDB: INSERT vault row
        TS->>MDB: INSERT metadata row
        TS->>K: publish AUDIT event (acks=all)
        K-->>TS: ack
        TS->>ID: SET idem_key = token TTL 24h
        TS-->>SDK: 201 {token, bin, last4}
    else cache hit (retry)
        ID-->>TS: existing token
        TS->>MDB: GET metadata
        TS-->>SDK: 200 {token, bin, last4}
    end

Detokenize (PAN-read path) — separate service, stricter auth

sequenceDiagram
    participant PG as Payment Gateway
    participant GW as API Gateway
    participant AZ as Auth Service
    participant DT as Detokenize Service
    participant CA as Redis Cache
    participant VDB as Vault DB
    participant HSM as HSM
    participant K as Kafka
    participant VN as Visa Network
    PG->>GW: GET /v1/tokens/{token} (scope=pan:read)
    GW->>AZ: verify JWT + scope
    AZ-->>GW: ok, scope confirmed
    GW->>DT: forward
    DT->>VDB: SELECT ciphertext WHERE token=?
    VDB-->>DT: row
    alt status != ACTIVE
        DT-->>PG: 410 Gone
    else status == ACTIVE
        DT->>HSM: unwrap(DEK) + decrypt(PAN)
        HSM-->>DT: PAN (plaintext)
        DT->>K: AUDIT {actor=PG, action=DETOK}
        K-->>DT: ack
        DT-->>PG: 200 {pan, expiry}
        PG->>VN: charge
        Note over DT: PAN bytes zeroized after return
    end

Metadata-only read (hot path, 95% of traffic)

sequenceDiagram participant Caller as Fraud / Analytics participant DT as Tokenize Service participant CA as Redis Cache participant MDB as Metadata DB Caller->>DT: GET /v1/tokens/{token}/metadata DT->>CA: GET token:meta alt cache hit (95%) CA-->>DT: {bin, last4, network, expiry} else cache miss (5%) CA-->>DT: nil DT->>MDB: SELECT bin, last4, network, expiry MDB-->>DT: row DT->>CA: SETEX token:meta 3600 end DT-->>Caller: 200 {bin, last4, network, expiry} Note over DT: HSM and Vault DB NOT touched

        Notice the asymmetry: the hot metadata read never touches the HSM, the Vault DB, or even the Detokenize Service. It's the cheap path because that's where the volume is. The expensive path (real PAN) only runs ~4,500 times/sec instead of 50,000.
      

Token lifecycle — state diagram

Every token is in exactly one of these states at any time. State transitions are the only writes that mutate an existing row.

stateDiagram-v2 [*] --> ACTIVE : tokenize ACTIVE --> REVOKED : DELETE /tokens/{t} ACTIVE --> EXPIRED : card expiry_yyyymm passed ACTIVE --> ROTATING : quarterly KEK rotation begins ROTATING --> ACTIVE : re-encrypt complete REVOKED --> [*] : after 7-year audit window EXPIRED --> [*] : after 7-year audit window

State	Reads return	Writes allowed?
`ACTIVE`	PAN (with scope) + metadata	only `revoke`, `expire`, `rotate`
`REVOKED`	metadata only (last4 + network for forensics); detokenize returns 410 Gone	none — terminal until purge
`EXPIRED`	metadata only; PAN refused	none — terminal until purge
`ROTATING`	PAN still readable through new KEK; old KEK kept until job completes	only background re-encryption worker
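The lifecycle above is small enough to encode as data. The sketch below is one possible shape, not the vault's actual code: the class name TokenLifecycle and its methods are hypothetical, and it uses its own nested Status enum (mirroring the four states) so that it compiles on its own.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: encodes the state diagram and table above as a lookup.
final class TokenLifecycle {
  enum Status { ACTIVE, REVOKED, EXPIRED, ROTATING }

  private static final Map<Status, Set<Status>> ALLOWED = new EnumMap<>(Status.class);
  static {
    ALLOWED.put(Status.ACTIVE, EnumSet.of(Status.REVOKED, Status.EXPIRED, Status.ROTATING));
    ALLOWED.put(Status.ROTATING, EnumSet.of(Status.ACTIVE));       // re-encrypt complete
    ALLOWED.put(Status.REVOKED, EnumSet.noneOf(Status.class));     // terminal until purge
    ALLOWED.put(Status.EXPIRED, EnumSet.noneOf(Status.class));     // terminal until purge
  }

  static boolean canTransition(Status from, Status to) {
    return ALLOWED.get(from).contains(to);
  }

  // Per the table: the PAN is readable only while ACTIVE or ROTATING.
  static boolean panReadable(Status s) {
    return s == Status.ACTIVE || s == Status.ROTATING;
  }

  // Returns the new state, or throws if the diagram has no such edge.
  static Status transition(Status from, Status to) {
    if (!canTransition(from, to)) {
      throw new IllegalStateException(from + " -> " + to + " is not allowed");
    }
    return to;
  }
}
```

Keeping the edges in one map means the revoke endpoint, the expiry sweeper, and the rotation worker all consult the same rule set instead of each re-implementing its own status check.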

Network Topology & PCI Zones — drawing the fence

PCI scope is defined by network reachability, not by what your code does. If a machine can open a TCP connection to a machine that holds PANs, the auditor considers it in-scope. So the network diagram is literally how you keep the auditor away from 95% of your fleet. Think of it as five zones: four concentric rings plus an append-only audit zone off to the side, with firewall rules that only allow traffic going inward, never sideways or outward.

flowchart TB subgraph INET [Public Internet] USER([Merchant SDKs / Customers]) end subgraph DMZ [Zone 1 · DMZ · public subnet] ALB[AWS ALB / CloudFront] GW[API Gateway · Envoy] end subgraph APP [Zone 2 · App tier · private subnet] AUTH[Auth Service] TOK[Tokenize Service] META[Metadata API] REDIS[(Redis Cluster)] MDB[(Metadata DB)] end subgraph PCI [Zone 3 · PCI-DSS in-scope · isolated subnet] DETOK[Detokenize Service] VDB[(Vault DB · encrypted PAN)] end subgraph SECURE [Zone 4 · HSM subnet · air-gap admin] HSM[(HSM Cluster · FIPS L3)] end subgraph AUDIT [Zone 5 · Audit · append-only] KAFKA[Kafka cluster] CH[(ClickHouse)] S3[(S3 Glacier)] end USER --> ALB --> GW GW --> AUTH GW --> TOK GW --> META GW --> DETOK TOK --> REDIS TOK --> MDB TOK --> VDB TOK --> HSM DETOK --> VDB DETOK --> HSM TOK --> KAFKA DETOK --> KAFKA KAFKA --> CH KAFKA --> S3 style DMZ fill:#171d27,stroke:#4a90d9 style APP fill:#171d27,stroke:#38b265 style PCI fill:#171d27,stroke:#e05252 style SECURE fill:#171d27,stroke:#9b72cf style AUDIT fill:#171d27,stroke:#d4a838 style HSM fill:#e05252,stroke:#e05252,color:#fff style VDB fill:#e05252,stroke:#e05252,color:#fff

Zone	What lives here	Who can talk in	PCI scope?
Zone 1 · DMZ	ALB, API Gateway (Envoy). TLS termination, mTLS, WAF.	Public internet (port 443 only)	Connected — minimally in scope
Zone 2 · App	Auth Service, Tokenize, Metadata API, Redis, Metadata DB	Only DMZ; outbound to Zone 3 + 5 allowed, plus Zone 4 for Tokenize only	Tokenize is in-scope (touches PAN); rest is connected-only
Zone 3 · PCI-in-scope	Detokenize Service, Vault DB (encrypted PAN)	Only Zone 2 services with specific IAM role; nothing else	Fully in scope
Zone 4 · HSM	HSM cluster, key-rotation tooling	Only Tokenize + Detokenize via dedicated VPC endpoint; admin via separate jump host	In scope — physical + logical isolation
Zone 5 · Audit	Kafka, ClickHouse, S3 archive	Inbound only — no outbound back to Zones 2/3	Connected (holds audit data, not PAN)

Why the segmentation matters — the "blast radius" view

If an attacker compromises a pod in Zone 2 (say, a vulnerable image in the Metadata API), here's what they can and cannot reach:

Can reach (within Zone 2)

Other Zone 2 pods over the cluster network
Metadata DB (BIN + last4 only — useful but not PAN)
Redis cache (metadata only — no PAN)

Cannot reach

Vault DB: the firewall blocks the port from Zone 2 IP ranges, except for the Tokenize Service's dedicated IAM role
HSM — same; HSM API gateway only accepts requests from specific service accounts
Audit Kafka history: Zone 5 only accepts inbound appends and never connects back into Zones 2/3, so a compromised pod cannot read, rewrite, or replay past audit events

        Defense-in-depth recap: the PAN is protected by (1) HSM-managed encryption, (2) IAM scopes on the API, (3) JWT-level least privilege, and (4) raw network segmentation. An attacker has to defeat all four to exfiltrate PANs. The network fence is the cheapest of the four and probably the most effective.
      

PCI-DSS & security — the rules you cannot violate

PCI-DSS (Payment Card Industry Data Security Standard) is the rulebook every system touching PANs must follow. It has 12 top-level requirements; the ones that actively shape this design are the ones below. Treat them as hard constraints, not nice-to-haves: a failed assessment can cost you the right to process cards.

PCI § 3 — Protect stored PAN

PAN at rest must be unreadable. We use envelope encryption: the DEK (per-record AES-256 key) encrypts the PAN; the KEK (master key) encrypts the DEK; the KEK lives inside an HSM and never appears in our application memory. A dump of the Vault DB yields ciphertext that needs the HSM to decrypt — and the HSM refuses anyone without the right IAM role.

PCI § 4 — Encrypt PAN in transit

TLS 1.2+ everywhere. mTLS between services. Old TLS, SSL, and any cipher with known weaknesses (RC4, 3DES) are disabled at the gateway. Within our VPC we still use TLS rather than "trust the network", because lateral movement is a common next step after an initial compromise.

PCI § 7 — Least privilege

JWT scopes split token:create from pan:read. The Merchant SDK can only mint tokens; only the Payment Gateway gets pan:read. Even within the team, on-call engineers cannot read PANs — only incident-time elevated access through a separate break-glass workflow that pages the security team.

PCI § 10 — Audit everything

Every tokenize, detokenize, revoke, and admin action writes an immutable Kafka event with acks=all before the response leaves the service. Logs are retained for 7 years (90 days hot in ClickHouse, then S3 Glacier). Failed audit writes block the operation — yes, this can cause an outage; that's by design.

PCI § 3.6 — Key management

KEKs rotate quarterly. Rotation is online: new tokens get the new KEK immediately; old tokens are re-wrapped in the background. Old KEKs are retained (read-only) until the last token using them is rotated or purged. Key generation, retirement, and destruction are all logged.
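To make "re-wrapped in the background" concrete, here is a minimal sketch of why rotation never touches the PAN ciphertext: only the wrapped DEK changes. Everything in it is an assumption for illustration. FakeHsm is an in-memory stand-in for the real HSM (a real HSM would perform the unwrap and wrap inside the device, so the DEK never reaches application memory during rotation), and the names KekRotationSketch, Row, seal, open, and rotate are hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class KekRotationSketch {
  private static final SecureRandom RNG = new SecureRandom();

  // AES-256-GCM; output layout is iv (12 bytes) || ciphertext+tag.
  static byte[] seal(byte[] key, byte[] plaintext) {
    try {
      byte[] iv = new byte[12];
      RNG.nextBytes(iv);
      Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
      c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
      byte[] ct = c.doFinal(plaintext);
      byte[] out = Arrays.copyOf(iv, iv.length + ct.length);
      System.arraycopy(ct, 0, out, iv.length, ct.length);
      return out;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  static byte[] open(byte[] key, byte[] sealed) {
    try {
      Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
      c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, sealed, 0, 12));
      return c.doFinal(sealed, 12, sealed.length - 12);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  // Stand-in for the HSM: KEKs are addressed by id and never returned to the caller.
  static final class FakeHsm {
    private final Map<String, byte[]> keks = new HashMap<>();

    void createKek(String kekId) {
      byte[] k = new byte[32];
      RNG.nextBytes(k);
      keks.put(kekId, k);
    }
    byte[] wrap(String kekId, byte[] dek) { return seal(keks.get(kekId), dek); }
    byte[] unwrap(String kekId, byte[] wrapped) { return open(keks.get(kekId), wrapped); }

    // Unwrap under the old KEK, wrap under the new one, wipe the transient DEK.
    byte[] rewrap(String oldKekId, String newKekId, byte[] wrapped) {
      byte[] dek = unwrap(oldKekId, wrapped);
      try {
        return wrap(newKekId, dek);
      } finally {
        Arrays.fill(dek, (byte) 0);
      }
    }
  }

  record Row(byte[] panCiphertext, byte[] wrappedDek, String kekId) {}

  // One step of the background job: same PAN ciphertext, new wrapped DEK, new KEK id.
  static Row rotate(FakeHsm hsm, Row row, String newKekId) {
    return new Row(row.panCiphertext(), hsm.rewrap(row.kekId(), newKekId, row.wrappedDek()), newKekId);
  }
}
```

Because each row records which KEK wraps its DEK, old and new rows can coexist for as long as the background job needs, which is what lets rotation stay online.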

PCI § 9 — Physical security

HSMs are FIPS 140-2 Level 3 — tamper-resistant, alarmed, audited. On AWS we use CloudHSM (managed) or self-host Thales boxes in a co-located facility. The HSM management network is air-gapped from the application network.

Defense-in-depth — what we add beyond PCI minimums

Token format reveals nothing. A token tok_018f... contains no embedded BIN, no merchant_id, no expiry — leaking a token gives the attacker an opaque blob.
Detokenize rate-limit per actor. A compromised payment-gateway credential should not be able to bulk-exfiltrate PANs. We rate-limit detokenize to ~50 req/s per actor in non-charge contexts; bulk reads require an out-of-band approval workflow.
Anomaly alerts. ClickHouse runs continuous queries: "actor X did 5× their 30-day average detokenize rate in the last 5 min" → page on-call.
Per-merchant pepper. Even in deterministic mode, the HMAC key is per-merchant. Compromise of one merchant's pepper doesn't allow guessing tokens across merchants.
No PAN in logs, ever. Logger formatters strip anything that looks like a 13–19 digit number passing the Luhn check. We unit-test this; CI fails if a print statement contains pan.
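The log-scrubbing rule in the last bullet can be sketched in a few lines. This is an illustrative version, not the actual formatter: the class name PanScrubber and the [PAN-REDACTED] marker are assumptions, and it only catches unbroken digit runs (a PAN printed with spaces or dashes would need an extra normalisation pass).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log scrubber: redacts 13-19 digit runs that pass the Luhn check.
final class PanScrubber {
  // A run of 13-19 digits that is not part of a longer digit run.
  private static final Pattern CANDIDATE = Pattern.compile("(?<!\\d)\\d{13,19}(?!\\d)");

  // Standard Luhn checksum: double every second digit from the right.
  static boolean passesLuhn(String digits) {
    int sum = 0;
    boolean doubleIt = false;
    for (int i = digits.length() - 1; i >= 0; i--) {
      int d = digits.charAt(i) - '0';
      if (doubleIt) {
        d *= 2;
        if (d > 9) d -= 9;
      }
      sum += d;
      doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
  }

  // Replaces Luhn-valid candidates and leaves other numbers (order IDs, timestamps) alone.
  static String scrub(String line) {
    Matcher m = CANDIDATE.matcher(line);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      String replacement = passesLuhn(m.group()) ? "[PAN-REDACTED]" : m.group();
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

The Luhn filter is what keeps the scrubber from mangling every long number in the logs; roughly one in ten random digit runs will still pass it, which is an acceptable false-positive rate for a redaction rule.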

Deployment Topology — Kubernetes, HSM cluster, and the autoscaler

The architecture diagram tells you what services exist. The deployment topology tells you where they run — which K8s namespace, how many replicas, how they autoscale, and how the HSM (which is not a Pod — it's literally a physical box) plugs in. This is where interviewers test whether you've actually run a production system, or just drawn boxes.

flowchart TB subgraph REG [Region us-east-1] subgraph AZ1 [AZ us-east-1a] subgraph K8S1 [EKS cluster · same VPC] NS1[ns: tokenize<br/>Tokenize Pods × 30] NS2[ns: detokenize<br/>Detok Pods × 20] NS3[ns: gateway<br/>Envoy × 12] REDIS1[(Redis primary × 6)] end HSM1[HSM box × 2<br/>baremetal] VDB1[(Vault DB primary<br/>16 shards)] MDB1[(Metadata DB primary)] end subgraph AZ2 [AZ us-east-1b] subgraph K8S2 [EKS · same cluster, different AZ] NS1B[Tokenize × 30] NS2B[Detok × 20] NS3B[Envoy × 12] REDIS2[(Redis replica × 6)] end HSM2[HSM box × 2] VDB2[(Vault DB sync replica)] MDB2[(Metadata DB async replica)] end end NS1 --> HSM1 NS2 --> HSM1 NS1B --> HSM2 NS2B --> HSM2 VDB1 -. sync replication .- VDB2 MDB1 -. async replication .- MDB2 style HSM1 fill:#e05252,stroke:#e05252,color:#fff style HSM2 fill:#e05252,stroke:#e05252,color:#fff style VDB1 fill:#e05252,stroke:#e05252,color:#fff style VDB2 fill:#e05252,stroke:#e05252,color:#fff

Pod sizing — back of the envelope

Service	Per-pod capacity	Pods at 50K TPS	Resources
API Gateway (Envoy)	~8K req/s	~14 (with headroom 24)	2 vCPU · 2 GB · no PV
Tokenize Service (Java/Go)	~400 req/s (HSM-bound)	~50 (with headroom 60)	2 vCPU · 1 GB
Detokenize Service	~400 req/s (HSM-bound)	~30 (with headroom 40)	2 vCPU · 1 GB
Metadata Read Service	~3K req/s (Redis-bound)	~20 (with headroom 30)	1 vCPU · 512 MB
Auth Service	~5K req/s	~12	1 vCPU · 512 MB

Autoscaling — what the HPA actually watches

Most teams default to CPU-based HPA. For a tokenization service that's wrong because the bottleneck is the HSM, not CPU. We use custom metrics via Prometheus Adapter:

Tokenize Service HPA

Scale up if hsm_pool_depth > 100 OR p95_latency_ms > 15. Scale down if both metrics have been below 50% threshold for 10 min.

Gateway HPA

Scale on requests_per_pod > 6000 (= 75% of capacity). Drop a pod only if RPS < 3000 for 5 min — slow scale-down avoids thrash.

Metadata Read HPA

Scale on redis_connection_pool_pct > 70%. Redis is the gating resource; CPU rarely matters here.

HSM cluster — outside Kubernetes

HSMs are not Pods. They're physical (or AWS-managed) appliances reached over a VPC endpoint. We run 4 HSMs per region (2 per AZ for HA). The app reaches them via an internal HSM proxy that:

Pools connections: the HSM SDK has a limited connection budget, so we share connections across pods via a proxy, deployed either as a sidecar or as a standalone service (HashiCorp Boundary or custom).
Hides the vendor SDK — application code talks gRPC; vendor swap (Thales → CloudHSM) only changes the proxy.
Load-balances across HSMs — round-robin with health checks. A dead HSM is taken out of rotation in < 5 s.
Implements circuit breaker — if 50% of HSM calls fail within 30 s, the breaker opens and tokenize returns 503 instead of queuing forever.
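The circuit-breaker rule in the last bullet can be sketched as a sliding window over recent call outcomes. This is a simplified illustration under stated assumptions: the class name HsmCircuitBreaker, the minimum-call threshold, and the cool-down period are all hypothetical additions (the text only specifies "50% within 30 s"), and there is no half-open probing state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Hypothetical breaker: opens when the failure ratio in the window reaches the threshold.
final class HsmCircuitBreaker {
  private record Outcome(long atMillis, boolean failed) {}

  private final Deque<Outcome> window = new ArrayDeque<>();
  private final LongSupplier clock;      // injectable so tests can control time
  private final long windowMillis;       // 30_000 for the rule in the text
  private final double failureRatio;     // 0.5 for the rule in the text
  private final int minCalls;            // assumption: ignore tiny samples
  private final long coolDownMillis;     // assumption: how long to stay open
  private long openedAt = -1;            // -1 means closed

  HsmCircuitBreaker(LongSupplier clock, long windowMillis, double failureRatio,
                    int minCalls, long coolDownMillis) {
    this.clock = clock;
    this.windowMillis = windowMillis;
    this.failureRatio = failureRatio;
    this.minCalls = minCalls;
    this.coolDownMillis = coolDownMillis;
  }

  // Callers check this first and return 503 immediately when it is false.
  synchronized boolean allowRequest() {
    if (openedAt < 0) return true;
    if (clock.getAsLong() - openedAt >= coolDownMillis) {
      openedAt = -1;          // cool-down over: close and start with a clean window
      window.clear();
      return true;
    }
    return false;
  }

  // Record the outcome of one HSM call and re-evaluate the window.
  synchronized void record(boolean failed) {
    long now = clock.getAsLong();
    window.addLast(new Outcome(now, failed));
    while (!window.isEmpty() && now - window.peekFirst().atMillis() > windowMillis) {
      window.removeFirst();   // drop outcomes older than the window
    }
    long failures = window.stream().filter(Outcome::failed).count();
    if (window.size() >= minCalls && failures >= failureRatio * window.size()) {
      openedAt = now;
    }
  }
}
```

A production breaker would usually add a half-open state that lets a single probe through before fully closing; the version above simply closes after the cool-down.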

        Interview prompt: if the interviewer asks "why not run HSM crypto inside the application?", the answer is FIPS 140-2 Level 3 compliance. Keys held in app memory wouldn't pass the audit: you'd have to prove they cannot be extracted, audit-log every operation, and demonstrate tamper evidence, all in hardware. An HSM does all that out of the box.
      

Multi-Region & Disaster Recovery — how to survive a region outage

One region is enough for capacity. Two regions are required for survival. The hard question is which kind of two regions: active-active (both serve writes) or active-passive (one serves, one waits). For a payment vault, the answer is active-passive with regional pinning, and we'll walk through why active-active is a trap here.

The active-active trap

Active-active feels obviously better — more throughput, faster regional failover, lower latency for users. But for tokenization, it creates a correctness problem you can't paper over:

The idempotency split-brain

Imagine the merchant retries a tokenize call. The first hits us-east, the retry (after a 1-second timeout) hits us-west. Both regions check the idempotency key — both see "not found" because cross-region replication takes 30 s. Both encrypt the PAN under different DEKs, write to their own Vault DB, return different tokens. Now the merchant has two tokens for the same card, and nobody knows.

The token-uniqueness violation

Two regions, both minting UUIDv7s. A UUIDv7 is a millisecond timestamp plus up to 74 random bits, so two regions minting in the same millisecond can in theory collide (vanishingly unlikely, but possible). More importantly, the audit log now lives in two places with no single source of truth for PCI § 10.

The chosen design — active-passive with merchant pinning

flowchart LR subgraph US_E [us-east-1 · PRIMARY] GW_E[Gateway] APP_E[Services] VDB_E[(Vault DB)] HSM_E[(HSMs)] K_E[Kafka] end subgraph US_W [us-west-2 · WARM STANDBY] GW_W[Gateway · standby] APP_W[Services · 25% capacity] VDB_W[(Vault DB replica)] HSM_W[(HSMs · separate keys)] K_W[Kafka mirror] end subgraph EU [eu-west-1 · LOCAL for EU merchants] GW_EU[Gateway] APP_EU[Services] VDB_EU[(Vault DB · EU PANs only)] HSM_EU[(HSMs)] end R53[Route 53 · health-checked] R53 -->|US merchants| GW_E R53 -.failover.-> GW_W R53 -->|EU merchants| GW_EU VDB_E -- async logical repl 30s --> VDB_W K_E -- MirrorMaker 2 --> K_W style US_E fill:#171d27,stroke:#38b265 style US_W fill:#171d27,stroke:#d4a838 style EU fill:#171d27,stroke:#4a90d9

What replicates, and how

Data	Mechanism	Lag	Why this way
Vault DB (encrypted PAN)	Postgres logical replication, async	~30 s RPO	Sync would tie us-east p99 to cross-region latency (60+ ms). We accept 30 s of potential data loss for healthy latency.
Metadata DB	Aurora Global async replica	~5 s	Read-from-replica for analytics in west region; failover RTO ~ 1 minute.
Kafka audit topic	MirrorMaker 2	~10 s	Audit logs replicate to both regions so a region loss does not erase compliance trail.
Redis cache	Don't replicate	n/a	Cache is rebuildable. After failover, west's Redis is cold for ~5 min; we accept the cache-miss storm and pre-warm top tokens during failover drills.
HSM keys	Manual sync (operator-driven)	quarterly	HSM key material never traverses the network. Operators export wrapped KEK from primary HSM, import to DR HSM, audit both sides. Done as part of the quarterly rotation drill.

Failover playbook

Detection (≤ 30 s). Route 53 healthchecks fail on us-east gateway. PagerDuty fires.
Decision (≤ 2 min). On-call confirms primary-region failure (not just a flapping check). Pages incident commander.
Promote (≤ 3 min). Promote the us-west Vault DB replica to primary. For the Aurora Global metadata cluster, aws rds failover-global-cluster handles the promotion. Repoint app config via consul.
Scale up (≤ 5 min). West region was running at 25% capacity (cost saving). HPA + cluster autoscaler bring it to 100% within minutes.
Cut DNS (≤ 1 min). Route 53 weighted-routing shifts all traffic to us-west. TTL is 30 s, so worldwide propagation completes in ~1 min.
Verify audit trail (continuous). Check that MirrorMaker 2 has caught up; if Kafka audit lag > 60 s, alert security on a possible compliance gap.

        RTO ≈ 10 min, RPO ≈ 30 s. For a payments vault this is acceptable — losing 30 s of fresh tokenize calls means a handful of merchants must retry their checkout flow. We do not lose the PAN, because the PAN is already at Visa once charged. The only thing at risk is brand-new card-on-file additions in the 30 s window before failover.
      

Why a separate EU region

GDPR restricts transfers of EU cardholder data outside the EU, and keeping that data in-region is the simplest way to satisfy it alongside PSD2. EU merchants are pinned to eu-west-1 at the Route 53 layer; their PANs never touch us-east. The EU region is its own isolated tokenization vault with its own HSM keys, so a US-region breach doesn't expose EU cards. The cost is operating two parallel vaults, but it keeps the compliance story simple.

Scaling to 50K TPS — where the bottlenecks actually live

You almost never scale "the database" first. The three real bottlenecks in a tokenization vault, in order, are: HSM throughput, audit-log durability, and cache miss storms. Solving them is most of the scaling work.

A. HSM throughput

A single HSM does ~2,000 wrap/unwrap operations per second. At 4,500 tokenize writes/sec we need at least 3 HSMs; with headroom, run 6 in active-active and load-balance via an HSM proxy. Critically: the 45K metadata reads/sec do NOT hit the HSM — only PAN encrypt/decrypt does. That's why the cache exists.
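The HSM count above is just a ceiling division, sketched here so the numbers can be re-run when the load estimate changes. The class and method names are hypothetical; the inputs (4,500 ops/sec, ~2,000 ops/sec per HSM, 6 HSMs) are the figures from the paragraph.

```java
// Hypothetical sizing helper for the HSM tier.
final class HsmSizing {
  // Smallest number of HSMs that can absorb the load with zero headroom.
  static int minHsms(int cryptoOpsPerSec, int opsPerHsmPerSec) {
    return (cryptoOpsPerSec + opsPerHsmPerSec - 1) / opsPerHsmPerSec;  // ceiling division
  }

  // Fraction of total HSM capacity in use when the load is spread evenly.
  static double utilization(int cryptoOpsPerSec, int opsPerHsmPerSec, int hsms) {
    return (double) cryptoOpsPerSec / ((double) opsPerHsmPerSec * hsms);
  }

  // How many HSMs can fail before the remaining ones can no longer carry the load.
  static int survivableFailures(int cryptoOpsPerSec, int opsPerHsmPerSec, int hsms) {
    return hsms - minHsms(cryptoOpsPerSec, opsPerHsmPerSec);
  }
}
```

With the paragraph's figures, minHsms(4500, 2000) is 3, and a 6-HSM pool runs at 37.5% utilization and tolerates three failures before the tokenize path saturates.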

B. Audit log durability

Synchronous Kafka writes with acks=all + min.insync.replicas=2 on a 6-broker cluster. 50K events/sec × 200 B ≈ 10 MB/s, which Kafka handles with one hand tied behind its back. The trick is keeping producer batching tight enough that acks=all doesn't add > 3 ms p99.

C. Cache miss storm

If Redis loses its hot keys (failover, eviction storm), suddenly all 50K reads hit Postgres. Mitigations: (1) Redis cluster with replicas + automatic failover; (2) "request coalescing" — when N concurrent requests miss on the same key, only one hits the DB; (3) warm-cache job on cold start.
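Request coalescing (mitigation 2) is the least obvious of the three, so here is a minimal sketch of the idea. The class name CoalescingLoader is hypothetical, and the loader function stands in for the Postgres metadata query; the first caller for a key becomes the leader and runs the load, while concurrent callers for the same key wait on the leader's future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical single-flight wrapper: at most one in-flight load per key.
final class CoalescingLoader<K, V> {
  private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
  private final Function<K, V> loader;   // e.g. the DB lookup behind the cache

  CoalescingLoader(Function<K, V> loader) {
    this.loader = loader;
  }

  V get(K key) {
    CompletableFuture<V> mine = new CompletableFuture<>();
    CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
    if (existing != null) {
      return existing.join();            // follower: reuse the leader's result
    }
    try {
      V value = loader.apply(key);       // leader: the only caller that hits the DB
      mine.complete(value);
      return value;
    } catch (RuntimeException e) {
      mine.completeExceptionally(e);     // followers see the failure via join()
      throw e;
    } finally {
      inFlight.remove(key, mine);        // later misses start a fresh load
    }
  }
}
```

This only deduplicates loads that overlap in time; it is not a cache. In the real path the leader would also write the loaded row back to Redis so that later requests hit the cache again.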

Sharding the Vault DB

500M tokens × 500 B = 250 GB. Fits one node, but we shard for write throughput and blast radius:

Shard key: hash(token) with consistent hashing across 16 shards. Why not merchant_id? Because one huge merchant (Amazon) would be a hot shard. Hashing the token gives a near-uniform distribution regardless of merchant size.
Per-shard: primary + 2 replicas (sync to one, async to the other for cross-region DR). Postgres logical replication on AWS RDS Multi-AZ, or Aurora Global if going cross-region.
Routing: the Vault Service holds the shard map in memory, refreshed from a coordination service (etcd) every 30 s.
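The routing step can be sketched as a consistent-hash ring with virtual nodes. This is one common way to implement "hash(token) with consistent hashing", not necessarily the one used here: the class name ShardRing, the choice of SHA-256, and the virtual-node count are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical shard map: each shard owns many points on a hash ring.
final class ShardRing {
  private final TreeMap<Long, Integer> ring = new TreeMap<>();

  ShardRing(int shards, int virtualNodesPerShard) {
    for (int s = 0; s < shards; s++) {
      for (int v = 0; v < virtualNodesPerShard; v++) {
        ring.put(hash("shard-" + s + "#" + v), s);
      }
    }
  }

  // Walk clockwise from the token's hash to the next virtual node; wrap at the end.
  int shardFor(String token) {
    Map.Entry<Long, Integer> e = ring.ceilingEntry(hash(token));
    return (e != null ? e : ring.firstEntry()).getValue();
  }

  // First 8 bytes of SHA-256, read as a signed long.
  static long hash(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
      return ByteBuffer.wrap(digest).getLong();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);  // SHA-256 is required on every Java platform
    }
  }
}
```

A call such as new ShardRing(16, 256).shardFor(token) returns a shard index from 0 to 15. The benefit over a plain hash(token) % 16 shows up when the shard count changes: with a ring, only the keys adjacent to the added or removed virtual nodes move, instead of nearly all of them.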

Service tier scaling

Both Tokenize Service and Detokenize Service are stateless — scale by adding pods behind an L4 load balancer. Per-pod budget: ~500 req/s comfortably (Go) or ~300 req/s (Java with JIT warm). For 50K TPS plan ~150 pods total across regions. CPU scales linearly; memory is mostly Redis client pools and connection pools.

Observability — SLOs, golden signals, and the alerts you can't sleep through

A payment vault that's "down" but not alerted is worse than no vault at all — every payment in flight times out, merchants disable cards, customer support floods. So observability isn't optional plumbing; it's part of the product. Three layers: SLOs set the contract, metrics + traces tell you whether you're meeting it, and alerts wake the on-call when you're not.

Service Level Objectives (SLOs)

SLI (what we measure)	SLO (the target)	Error budget
Availability (any 5xx ratio per minute)	99.99% over 28 days	~4 min downtime / 28 d
Tokenize p99 latency	< 25 ms over 28 days	1% of requests may exceed
Detokenize p99 latency	< 30 ms over 28 days	1%
Audit log completeness	100% — no missing events	0 (hard requirement)
Failed-tokenize PAN-not-stored guarantee	100% — never partial state	0

Once the error budget is burned, all new feature deploys are frozen until the budget resets. This is the SRE contract — it forces the team to invest in reliability, not just feature velocity.
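The error-budget column is plain arithmetic, sketched below so the freeze rule can be checked against real counts. The class and method names are hypothetical; the SLO and window are the ones from the table.

```java
// Hypothetical error-budget helper for the SLO table above.
final class ErrorBudget {
  // Minutes of full downtime the SLO tolerates over the window.
  static double allowedDowntimeMinutes(double slo, int windowDays) {
    return (1.0 - slo) * windowDays * 24 * 60;
  }

  // Fraction of the budget still unspent; zero or negative means deploys are frozen.
  static double budgetRemaining(double slo, long totalRequests, long failedRequests) {
    double allowedFailures = (1.0 - slo) * totalRequests;
    return 1.0 - failedRequests / allowedFailures;
  }
}
```

For 99.99% over 28 days this gives about 4.03 minutes, matching the table. In request terms, one million requests allow roughly 100 failures, so 25 failed requests leave about 75% of the budget.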

Golden signals — what every dashboard shows

Latency

p50 / p95 / p99 per endpoint, per region. Broken out by tokenize vs detok vs metadata. Histograms with 1-min granularity; 7-day retention.

Traffic

RPS per endpoint per region. Per-merchant top-10 to spot a single tenant blasting the API.

Errors

4xx + 5xx ratios. 4xx by reason (auth fail, scope mismatch, idempotency conflict). 5xx triggers paging.

Saturation

HSM queue depth, Redis connection pool utilization, DB connection pool, Kafka producer queue depth, pod CPU.

Distributed tracing

Every request gets an OpenTelemetry trace from gateway → service → HSM → DB → Kafka. Sampled at 1% in steady state, 100% for any request flagged as anomalous (e.g., 5xx, p99 outlier). Traces let an engineer answer "where did this one slow request spend its time?" without log-grepping.

Security-specific telemetry — the alerts that matter for tokenization

Detokenize anomaly

ClickHouse continuous query — actor X's detok rate is 5× their 30-day rolling average over a 5-min window. Possible compromised credential or bulk exfil attempt. Page security team, not just on-call SRE.
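The rule itself is a rate comparison. The production version runs as a ClickHouse query; the sketch below only shows the logic in Java, with hypothetical names (DetokAnomalyRule, baselinePerMinute, isAnomalous) and one loudly assumed policy: an actor with no history is flagged on its first detokenize call.

```java
// Hypothetical in-process version of the "5x the 30-day average" rule.
final class DetokAnomalyRule {
  // Average calls per minute over the baseline period, given one count per day.
  static double baselinePerMinute(long[] dailyCounts) {
    long total = 0;
    for (long c : dailyCounts) total += c;
    return total / (dailyCounts.length * 1440.0);   // 1,440 minutes per day
  }

  // True when the recent rate is at least `multiplier` times the baseline rate.
  static boolean isAnomalous(long callsInWindow, double windowMinutes,
                             double baselinePerMinute, double multiplier) {
    if (baselinePerMinute <= 0) {
      return callsInWindow > 0;   // assumption: no history means any call is suspicious
    }
    double currentPerMinute = callsInWindow / windowMinutes;
    return currentPerMinute >= multiplier * baselinePerMinute;
  }
}
```

For an actor that averages 10 detokenize calls per minute, the alert fires at 250 calls in a 5-minute window (50 per minute) and stays quiet at 249.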

Audit-log lag

If Kafka MirrorMaker 2 lag > 60 s OR ClickHouse ingestion lag > 5 min, compliance gap. Audit trail must always be present.

HSM error rate

Any single HSM > 1% error rate over 5 min → quarantine. > 2 HSMs degraded → page; we're running on reduced capacity.

Key rotation overdue

A KEK older than 95 days triggers a warning; 100 days triggers a page. PCI requires keys to be rotated at the end of their defined cryptoperiod, and ours is one quarter.

PAN-in-log canary

A synthetic test logs a fake-but-valid-Luhn number every hour; if our log scrubber misses it, page security. This catches the day a developer adds a debug print.

Mismatched Vault/Metadata

Reconciler job: for each new token, both rows must exist within 5 s. Mismatch > 0.01% → investigate. Catches dual-write bugs.

        Interview point: when asked "how do you know your tokenization service is working?" — the strongest answer is "we have an SLO-based contract with consumers and an error budget; when the budget burns we freeze feature work." Not "we log everything" — that's noise, not signal.
      

Failure modes — what breaks, and what we do about it

For each failure, the question is the same: does the system corrupt data, return wrong data, or just fail loudly? Tokenization can never afford the first two — better to fail loud than to silently mint a token whose PAN cannot be retrieved.

Failure	Detection	Mitigation
HSM unreachable	Healthcheck on HSM client; circuit breaker opens after 3 timeouts	Tokenize returns 503 (fail closed). Never "encrypt later" — the PAN must not enter the DB unencrypted.
Vault DB shard down	Driver-level errors; per-shard health	Replica auto-promoted within ~10 s. Calls to that shard fail 503 during gap.
Kafka unavailable	Producer cannot get acks	Block the operation (audit log is mandatory). Optional WAL-style local disk buffer as last-resort fallback, with strict TTL and replay on Kafka recovery.
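Vault write succeeds, metadata write fails	Outbox / CDC reconciler detects mismatch	Async reconciliation job replays the metadata write from the outbox record, which must carry bin, last4, network, and expiry to rebuild the row. The slim AuditEvent in the LLD appendix lacks those fields, so it cannot serve as the replay source on its own.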
Redis cache cluster failover	Increased miss rate, p99 spike	Request coalescing prevents thundering herd; new primary takes over < 30 s; warm-up job preloads top 1M tokens.
Compromised JWT signing key	Out-of-band alert; anomalous detok rate	Rotate signing key, invalidate all live JWTs (short TTLs make this cheap), audit-log all detok in the affected window.
Compromised KEK (worst case)	HSM tamper alarm or insider report	Mass re-encryption job with new KEK; mark all rows ROTATING. The old KEK is destroyed only after the last row finishes rotation. PCI-required incident report within 24h.
Replay attack on tokenize	Idempotency key collision	Idempotency store returns the original token — replay is a no-op by design.
Token enumeration	Detok 404 rate spike from one actor	A UUIDv7 token carries up to 74 random bits beyond its timestamp, so enumeration is infeasible. Rate-limit + IP block on the actor.

LLD appendix — Class diagram

The HLD is the system view. The LLD here zooms into one box from the architecture diagram — the Tokenize Service (④) — and shows the classes inside it. Reserved for the last 15 minutes of an interview if asked "now show me the code."

classDiagram direction LR class TokenizationService { -TokenGenerator tokenGenerator -EncryptionService encryptionService -VaultRepository vaultRepo -MetadataRepository metadataRepo -AuditPublisher auditPublisher -IdempotencyStore idempotencyStore +tokenize(req) TokenizeResponse +detokenize(token, actor) DetokenizeResponse +metadata(token) MetadataResponse +revoke(token, actor) void } class TokenGenerator { +generate(pan, merchantId, strategy) String } class UuidV7TokenGenerator { +generate(pan, merchantId, strategy) String } class HmacTokenGenerator { -PepperStore peppers +generate(pan, merchantId, strategy) String } class EncryptionService { +encrypt(plaintext) Ciphertext +decrypt(ciphertext) bytes } class HsmEncryptionService { -HsmClient hsm -String currentKekId +encrypt(plaintext) Ciphertext +decrypt(ciphertext) bytes } class VaultRepository { +save(record) void +findByToken(token) VaultRecord +updateStatus(token, status) void } class PostgresVaultRepository { -DataSource ds +save(record) void +findByToken(token) VaultRecord +updateStatus(token, status) void } class MetadataRepository { +save(record) void +findByToken(token) MetadataRecord } class AuditPublisher { +publish(event) void } class KafkaAuditPublisher { -KafkaProducer producer +publish(event) void } class IdempotencyStore { -RedisClient redis +putIfAbsent(key, token, ttl) bool +get(key) String } TokenizationService --> TokenGenerator : strategy TokenizationService --> EncryptionService : encrypts via TokenizationService --> VaultRepository : writes PAN TokenizationService --> MetadataRepository : writes meta TokenizationService --> AuditPublisher : logs TokenizationService --> IdempotencyStore : dedupes TokenGenerator <|-- UuidV7TokenGenerator TokenGenerator <|-- HmacTokenGenerator EncryptionService <|-- HsmEncryptionService VaultRepository <|-- PostgresVaultRepository AuditPublisher <|-- KafkaAuditPublisher

Design patterns at work

Pattern	Where	Why
Strategy	`TokenGenerator` (UUIDv7 vs HMAC)	The choice between random and deterministic tokens is a runtime decision per merchant.
Repository	`VaultRepository`, `MetadataRepository`	Abstracts away whether the storage is Postgres, DynamoDB, or a mock — critical for unit tests with no DB.
Facade	`TokenizationService`	Hides the 6-step orchestration (idempotency check → mint → encrypt → store → audit → memoize) behind one method.
Adapter	`HsmEncryptionService`	The HSM vendor SDK (Thales, AWS CloudHSM) has its own bespoke API — adapter normalizes it to our `EncryptionService` interface so we can swap vendors.
Builder	`TokenizeRequest`, `AuditEvent`	Both have > 5 fields, several optional. Builder avoids 10-arg constructors.
Observer	`AuditPublisher` consumers	ClickHouse, S3, and fraud all subscribe to the same Kafka topic without the Tokenize Service knowing about them.

LLD appendix — Java implementation

Java 17+, no frameworks. The interesting parts are the orchestration in TokenizationService.tokenize and the envelope-encryption sequence in HsmEncryptionService.encrypt. Everything else is plumbing.

Enums & records

import java.time.Instant;

public enum TokenStrategy { RANDOM, DETERMINISTIC }
public enum TokenStatus   { ACTIVE, REVOKED, EXPIRED, ROTATING }
public enum CardNetwork   { VISA, MASTERCARD, AMEX, RUPAY, DISCOVER }

public record Ciphertext(byte[] cipher, byte[] wrappedDek, String kekId) {}
public record VaultRecord(String token, String merchantId, Ciphertext payload,
                          TokenStatus status, Instant createdAt, Instant revokedAt) {}
public record MetadataRecord(String token, String bin, String last4, CardNetwork network,
                             String expiryYYYYMM, String merchantId, TokenStatus status) {}
public record AuditEvent(String eventId, Instant ts, String actor, String action,
                         String token, String merchantId, String ip, String scopeUsed) {}
public record TokenizeResponse(String token, String bin, String last4, CardNetwork network) {}
public record DetokenizeResponse(String pan, String expiryYYYYMM) {}

Request with Builder

public final class TokenizeRequest {
  private final String pan, expiryYYYYMM, merchantId, idempotencyKey, ip;
  private final TokenStrategy strategy;

  private TokenizeRequest(Builder b) {
    this.pan = b.pan; this.expiryYYYYMM = b.expiry;
    this.merchantId = b.merchantId; this.idempotencyKey = b.idem;
    this.ip = b.ip; this.strategy = b.strategy;
  }
  public String pan(){ return pan; }
  public String expiry(){ return expiryYYYYMM; }
  public String merchantId(){ return merchantId; }
  public String idempotencyKey(){ return idempotencyKey; }
  public String ip(){ return ip; }
  public TokenStrategy strategy(){ return strategy; }

  public static Builder builder(){ return new Builder(); }
  public static final class Builder {
    private String pan, expiry, merchantId, idem, ip;
    private TokenStrategy strategy = TokenStrategy.RANDOM;
    public Builder pan(String v){ this.pan = v; return this; }
    public Builder expiry(String v){ this.expiry = v; return this; }
    public Builder merchantId(String v){ this.merchantId = v; return this; }
    public Builder idempotencyKey(String v){ this.idem = v; return this; }
    public Builder ip(String v){ this.ip = v; return this; }
    public Builder strategy(TokenStrategy v){ this.strategy = v; return this; }
    public TokenizeRequest build(){ return new TokenizeRequest(this); }
  }
}

HSM envelope encryption (the security-critical bit)

import java.security.SecureRandom;
import java.util.Arrays;

// HsmClient, AesGcm, and concat(...) are project helpers not shown here.
public final class HsmEncryptionService implements EncryptionService {
  private static final SecureRandom RNG = new SecureRandom();
  private static final int IV_LEN = 12;   // 96-bit nonce, the recommended size for GCM

  private final HsmClient hsm;            // vendor SDK, thread-safe
  private final String   currentKekId;

  public HsmEncryptionService(HsmClient hsm, String currentKekId) {
    this.hsm = hsm;
    this.currentKekId = currentKekId;
  }

  @Override
  public Ciphertext encrypt(byte[] plaintext) {
    byte[] dek = new byte[32]; RNG.nextBytes(dek);         // fresh DEK per record
    byte[] iv  = new byte[IV_LEN]; RNG.nextBytes(iv);
    try {
      byte[] cipher  = AesGcm.encrypt(plaintext, dek, iv);  // local AES-256-GCM
      byte[] wrapped = hsm.wrap(currentKekId, dek);         // KEK never leaves HSM
      return new Ciphertext(concat(iv, cipher), wrapped, currentKekId);
    } finally {
      Arrays.fill(dek, (byte)0);                            // best-effort zeroize, even if wrap fails
    }
  }

  @Override
  public byte[] decrypt(Ciphertext c) {
    byte[] dek = hsm.unwrap(c.kekId(), c.wrappedDek());    // use the KEK version recorded on the row
    try {
      byte[] iv      = Arrays.copyOfRange(c.cipher(), 0, IV_LEN);
      byte[] payload = Arrays.copyOfRange(c.cipher(), IV_LEN, c.cipher().length);
      return AesGcm.decrypt(payload, dek, iv);
    } finally {
      Arrays.fill(dek, (byte)0);
    }
  }
}

The orchestrating service

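import java.nio.charset.StandardCharsets;
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

public final class TokenizationService {
  private final Map<TokenStrategy, TokenGenerator> generators;
  private final EncryptionService encryption;
  private final VaultRepository    vault;
  private final MetadataRepository metadata;
  private final AuditPublisher     audit;
  private final IdempotencyStore   idem;
  private final Clock              clock;

  // All collaborators are injected, which keeps the service stateless and testable.
  public TokenizationService(Map<TokenStrategy, TokenGenerator> generators,
                             EncryptionService encryption, VaultRepository vault,
                             MetadataRepository metadata, AuditPublisher audit,
                             IdempotencyStore idem, Clock clock) {
    this.generators = generators; this.encryption = encryption;
    this.vault = vault; this.metadata = metadata;
    this.audit = audit; this.idem = idem; this.clock = clock;
  }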

  public TokenizeResponse tokenize(TokenizeRequest req) {
    // 1) idempotency
    Optional<String> existing = idem.get(req.idempotencyKey());
    if (existing.isPresent()) {
      MetadataRecord m = metadata.findByToken(existing.get()).orElseThrow();
      return new TokenizeResponse(m.token(), m.bin(), m.last4(), m.network());
    }
    // 2) mint
    String token = generators.get(req.strategy())
        .generate(req.pan(), req.merchantId(), req.strategy());
    // 3) encrypt
    Ciphertext payload = encryption.encrypt(req.pan().getBytes(StandardCharsets.UTF_8));
    // 4) two writes
    Instant now = clock.instant();
    vault.save(new VaultRecord(token, req.merchantId(), payload, TokenStatus.ACTIVE, now, null));
    String bin = req.pan().substring(0, 6);
    String last4 = req.pan().substring(req.pan().length() - 4);
    CardNetwork network = NetworkResolver.fromBin(bin);
    metadata.save(new MetadataRecord(token, bin, last4, network,
        req.expiryYYYYMM(), req.merchantId(), TokenStatus.ACTIVE));
    // 5) audit (sync)
    audit.publish(new AuditEvent(UUID.randomUUID().toString(), now,
        req.merchantId(), "TOKENIZE", token, req.merchantId(), req.ip(), "token:create"));
    // 6) memoize idempotency
    idem.putIfAbsent(req.idempotencyKey(), token, Duration.ofHours(24));
    return new TokenizeResponse(token, bin, last4, network);
  }

  public DetokenizeResponse detokenize(String token, String actor, String ip) {
    VaultRecord r = vault.findByToken(token)
        .orElseThrow(() -> new NotFoundException(token));
    if (r.status() != TokenStatus.ACTIVE) throw new GoneException("token revoked");
    byte[] pan = encryption.decrypt(r.payload());
    try {
      // Everything that can throw after decryption sits inside the try,
      // so the PAN bytes are zeroized on every exit path.
      audit.publish(new AuditEvent(UUID.randomUUID().toString(), clock.instant(),
          actor, "DETOKENIZE", token, r.merchantId(), ip, "pan:read"));
      MetadataRecord m = metadata.findByToken(token).orElseThrow();
      return new DetokenizeResponse(new String(pan, StandardCharsets.UTF_8), m.expiryYYYYMM());
    } finally {
      Arrays.fill(pan, (byte)0);     // best-effort: the response String still holds a PAN copy until GC
    }
  }
}

Thread-safety in one sentence

TokenizationService is stateless once constructed; all collaborators are themselves thread-safe (HSM client connection pool, Redis client, JDBC pool, Kafka producer). The only mutable secrets on heap are the DEK and decrypted-PAN byte arrays, which we zeroize in finally blocks. That is best-effort: the GC may already have copied the arrays, and the PAN String built for the response cannot be zeroized at all. The HSM is the real boundary.

Trade-offs & interview talking points

The "right answer" in an interview isn't a design — it's recognising what you traded away. Walk through these out loud and you signal seniority.

Decision	Alternative we rejected	Why
Random tokens with vault lookup	Format-Preserving Encryption (FPE) for stateless detokenize	Vault lookup is cheap with the cache; FPE makes revocation impossible at the per-token level and ties every detokenize to one master key's lifetime.
Separate Tokenize & Detokenize services	Single service handles both	Blast-radius reduction: a Tokenize-side RCE shouldn't be able to read existing PANs. Two services, two IAM roles, two network segments.
Envelope encryption (DEK per record + KEK in HSM)	Single global key for all PANs	Per-record DEKs let us rotate keys without re-encrypting every row immediately, and limit the blast radius of a memory-disclosure bug.
Sync Kafka audit write before response	Async fire-and-forget audit	PCI-DSS Requirement 10 requires logging all access to cardholder data. We accept ~1–2 ms of p99 cost for the audit write; a PAN read with no durable audit record is a compliance failure.
Separate Metadata DB (no PAN)	Single DB with PAN + metadata	Letting analytics query a non-PCI DB is a major win for the data team. The cost is dual-write complexity, which we solve with outbox reconciliation.
Shard by hash(token)	Shard by merchant_id	Hash-sharding gives even write distribution. Merchant sharding makes a huge merchant a hot shard.
Active-passive multi-region	Active-active writes	Cross-region idempotency on the same key is a correctness nightmare. We accept ~10 min failover RTO instead.
UUIDv7 (time-ordered) over UUIDv4	UUIDv4 random	UUIDv7 keeps inserts append-only on the B-tree, dramatically improving index locality and write throughput. Still 74 random bits — unguessable.
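
The "74 random bits" figure in the last row follows from the UUIDv7 bit layout: 48 bits of millisecond timestamp, 4 version bits, 12 random bits, 2 variant bits, then 62 more random bits. The sketch below builds one by hand to make that arithmetic visible. The class name UuidV7Sketch is hypothetical and this is not presented as the system's actual generator; a production service would more likely use a maintained library.

```java
import java.security.SecureRandom;
import java.util.UUID;

/** Illustrative UUIDv7 builder following the RFC 9562 field layout. */
final class UuidV7Sketch {
  private static final SecureRandom RNG = new SecureRandom();

  static UUID next(long epochMillis) {
    byte[] rnd = new byte[10];
    RNG.nextBytes(rnd);

    // High 64 bits: 48-bit timestamp | 4-bit version (7) | 12 random bits.
    long randA = ((rnd[0] & 0x0FL) << 8) | (rnd[1] & 0xFFL);
    long msb = ((epochMillis & 0xFFFFFFFFFFFFL) << 16) | (0x7L << 12) | randA;

    // Low 64 bits: 2-bit variant (binary 10) | 62 random bits.
    long randB = 0;
    for (int i = 2; i < 10; i++) {
      randB = (randB << 8) | (rnd[i] & 0xFFL);
    }
    long lsb = (randB & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;

    return new UUID(msb, lsb);   // 12 + 62 = 74 random bits in total
  }
}
```

Because the timestamp occupies the most significant bits, two IDs minted in different milliseconds sort in creation order, which is what keeps B-tree inserts near the right-hand edge of the index.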

Extension points (Open/Closed)

New encryption backend. Implement EncryptionService with AWS KMS, GCP CloudKMS, or HashiCorp Vault — no change in TokenizationService.
New token strategy. Add an FpeTokenGenerator implementing TokenGenerator and register it in the strategy map. Existing code is untouched.
New audit sink. Add another consumer to the Kafka topic — say, a SIEM (Splunk, Datadog) — without changing the producer side.
Card-on-file network token rails. Visa and Mastercard now offer "network tokens" (issuer-side tokens). Plug them in as an alternate VaultRepository implementation that calls Visa Token Service instead of writing to Postgres.
Multi-region read replicas. Wrap VaultRepository + MetadataRepository in routing decorators that pick the closest replica on reads.
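
As a rough illustration of the "new token strategy" extension point, the sketch below registers generators in an enum-keyed map and dispatches on the strategy. StrategySketch, GeneratorSketch, RandomDigitsGenerator and GeneratorRegistry are hypothetical stand-ins for TokenStrategy, TokenGenerator and the strategy map; the real interfaces may differ. The added strategy is a plain random-digit generator, not FPE, since a real FPE implementation is out of scope for a short example.

```java
import java.security.SecureRandom;
import java.util.EnumMap;
import java.util.Map;

enum StrategySketch { RANDOM_UUID, RANDOM_DIGITS }   // hypothetical stand-in for TokenStrategy

interface GeneratorSketch {                           // hypothetical stand-in for TokenGenerator
  String generate(String pan, String merchantId);
}

/** New strategy: a token of random digits with the same length as the PAN. */
final class RandomDigitsGenerator implements GeneratorSketch {
  private final SecureRandom rng = new SecureRandom();

  @Override
  public String generate(String pan, String merchantId) {
    StringBuilder sb = new StringBuilder("tok_");
    for (int i = 0; i < pan.length(); i++) {
      sb.append(rng.nextInt(10));
    }
    return sb.toString();
  }
}

/** The strategy map: adding a strategy is one register() call, no edits elsewhere. */
final class GeneratorRegistry {
  private final Map<StrategySketch, GeneratorSketch> generators =
      new EnumMap<>(StrategySketch.class);

  GeneratorRegistry register(StrategySketch strategy, GeneratorSketch generator) {
    generators.put(strategy, generator);
    return this;
  }

  String mint(StrategySketch strategy, String pan, String merchantId) {
    GeneratorSketch g = generators.get(strategy);
    if (g == null) {
      throw new IllegalArgumentException("no generator registered for " + strategy);
    }
    return g.generate(pan, merchantId);
  }
}
```

The orchestrating code only ever calls mint(), so it stays closed to modification while the map stays open to new entries.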

Likely interviewer follow-ups

"What if the HSM goes down mid-tokenize?"

Circuit breaker opens, service returns 503, retry budget on the merchant SDK kicks in (max 3 retries with backoff). Critically — we never store PAN without HSM-backed encryption, even temporarily. Fail closed is the only safe option.
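
A minimal sketch of the fail-closed behaviour described above, assuming a simple consecutive-failure breaker. FailClosedBreaker and HsmUnavailableException are hypothetical names, and the threshold and cooldown values are illustrative; a production service would more likely use an established resilience library.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Raised whenever the HSM cannot be used; the API layer would map this to HTTP 503. */
final class HsmUnavailableException extends RuntimeException {
  HsmUnavailableException(String msg, Throwable cause) { super(msg, cause); }
}

final class FailClosedBreaker {
  private final int threshold;          // consecutive failures before opening
  private final long cooldownMillis;    // how long to reject calls once open
  private final LongSupplier nowMillis; // injectable clock, for testing
  private int consecutiveFailures = 0;
  private long openedAt = -1;           // -1 means the circuit is closed

  FailClosedBreaker(int threshold, long cooldownMillis, LongSupplier nowMillis) {
    this.threshold = threshold;
    this.cooldownMillis = cooldownMillis;
    this.nowMillis = nowMillis;
  }

  // synchronized keeps the sketch short; a real breaker would not hold a lock across the RPC.
  synchronized <T> T call(Supplier<T> hsmOp) {
    if (openedAt >= 0 && nowMillis.getAsLong() - openedAt < cooldownMillis) {
      // Open: reject immediately. There is deliberately no fallback that stores the PAN.
      throw new HsmUnavailableException("circuit open", null);
    }
    try {
      T result = hsmOp.get();           // after the cooldown this acts as the trial call
      consecutiveFailures = 0;
      openedAt = -1;
      return result;
    } catch (RuntimeException e) {
      if (++consecutiveFailures >= threshold) {
        openedAt = nowMillis.getAsLong();
      }
      throw new HsmUnavailableException("HSM call failed", e);
    }
  }
}
```

The point of the design is what is absent: no code path writes a PAN when the HSM call has not succeeded.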

"How do you rotate the master key without downtime?"

Activate the new KEK; all new tokens use it. A background worker re-wraps old DEKs with the new KEK in batches (say, 1M rows/hour) — token IDs do not change, only the wrapped DEK does. Old KEK kept readable until the last row rotates, then destroyed in the HSM.
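
The re-wrap step can be sketched as follows. This is an illustration under stated assumptions: in the real design the unwrap and wrap happen inside the HSM and the KEKs never leave it, whereas here the JDK's in-process AES key wrap (the "AESWrap" cipher) stands in for the HSM. RewrapSketch and its method names are hypothetical.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

final class RewrapSketch {
  static SecretKey newAesKey() {
    try {
      KeyGenerator kg = KeyGenerator.getInstance("AES");
      kg.init(256);
      return kg.generateKey();
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Wrap (encrypt) a DEK under a KEK. */
  static byte[] wrap(SecretKey kek, SecretKey dek) {
    try {
      Cipher c = Cipher.getInstance("AESWrap");
      c.init(Cipher.WRAP_MODE, kek);
      return c.wrap(dek);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Unwrap a DEK; fails if the KEK is not the one it was wrapped under. */
  static SecretKey unwrap(SecretKey kek, byte[] wrappedDek) {
    try {
      Cipher c = Cipher.getInstance("AESWrap");
      c.init(Cipher.UNWRAP_MODE, kek);
      return (SecretKey) c.unwrap(wrappedDek, "AES", Cipher.SECRET_KEY);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Rotation for one row: only the wrapped DEK changes; token and PAN ciphertext do not. */
  static byte[] rewrap(SecretKey oldKek, SecretKey newKek, byte[] wrappedDek) {
    return wrap(newKek, unwrap(oldKek, wrappedDek));
  }
}
```

Since the DEK itself is unchanged, the PAN ciphertext in each row never has to be touched, which is why rotation can run as a slow background batch.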

"What if two services both call `POST /tokens` with the same idempotency key concurrently?"

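The first call claims the key atomically with Redis SET NX EX before it mints anything. The second call's SET fails, so it waits for the first to finish (or returns a retryable error), then reads the existing token from metadata and returns it. The claim has to come before the mint: the simplified tokenize() listing above reads the key first and only writes it in step 6, which would let two concurrent calls both pass the check and mint two tokens.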
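
A minimal in-process sketch of the claim-then-mint ordering. It assumes a ConcurrentHashMap in place of Redis, with putIfAbsent playing the role of SET NX; TTL handling (the EX part) is omitted. IdempotentMinter is a hypothetical name, not part of the service shown earlier.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

final class IdempotentMinter {
  // idempotency key -> the (possibly still pending) token for that key
  private final ConcurrentMap<String, CompletableFuture<String>> claims =
      new ConcurrentHashMap<>();

  /** Returns the token for this key, running the mint at most once per key. */
  String tokenize(String idempotencyKey, Supplier<String> mint) {
    CompletableFuture<String> mine = new CompletableFuture<>();
    // Atomic claim, made BEFORE minting: only one caller gets null back.
    CompletableFuture<String> winner = claims.putIfAbsent(idempotencyKey, mine);
    if (winner != null) {
      return winner.join();              // lost the race: wait for and reuse the winner's token
    }
    try {
      String token = mint.get();
      mine.complete(token);
      return token;
    } catch (RuntimeException e) {
      claims.remove(idempotencyKey, mine); // release the claim so a retry can succeed
      mine.completeExceptionally(e);
      throw e;
    }
  }
}
```

With Redis the loser cannot block on a future, so it would poll the key or return a retryable error instead; the ordering of claim and mint is the part that carries over.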

"How do you delete a customer's data for GDPR right-to-erasure?"

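"Crypto-shred": delete the wrapped DEK for that customer's rows. Without the DEK, recovering the PAN from the ciphertext is computationally infeasible, so the PAN is erased immediately and irreversibly without rewriting the row. This only holds once no copy of the wrapped DEK survives elsewhere, so backups and replicas that hold it must be purged or allowed to age out. Metadata can stay (or be tombstoned) for legitimate audit needs.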
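
The idea can be sketched with two in-memory maps standing in for the vault's ciphertext and wrapped-DEK columns. This is an illustration, not the vault schema: CryptoShredSketch is a hypothetical class, the DEK is held unwrapped for brevity, and the row layout (12-byte IV followed by the AES-GCM ciphertext) mirrors the decrypt code shown earlier.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class CryptoShredSketch {
  private final Map<String, byte[]> rows = new HashMap<>();    // token -> IV || ciphertext
  private final Map<String, SecretKey> deks = new HashMap<>(); // token -> DEK (stand-in for wrapped DEK)
  private final SecureRandom rng = new SecureRandom();

  void store(String token, String pan) {
    try {
      KeyGenerator kg = KeyGenerator.getInstance("AES");
      kg.init(256);
      SecretKey dek = kg.generateKey();                        // one DEK per record
      byte[] iv = new byte[12];
      rng.nextBytes(iv);
      Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
      c.init(Cipher.ENCRYPT_MODE, dek, new GCMParameterSpec(128, iv));
      byte[] ct = c.doFinal(pan.getBytes(StandardCharsets.UTF_8));
      byte[] row = new byte[iv.length + ct.length];
      System.arraycopy(iv, 0, row, 0, iv.length);
      System.arraycopy(ct, 0, row, iv.length, ct.length);
      rows.put(token, row);
      deks.put(token, dek);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  Optional<String> read(String token) {
    byte[] row = rows.get(token);
    SecretKey dek = deks.get(token);
    if (row == null || dek == null) return Optional.empty();   // no key, no PAN
    try {
      Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
      c.init(Cipher.DECRYPT_MODE, dek, new GCMParameterSpec(128, row, 0, 12));
      return Optional.of(new String(c.doFinal(row, 12, row.length - 12), StandardCharsets.UTF_8));
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Erasure: drop only the key. The ciphertext row is left exactly as it was. */
  void shred(String token) { deks.remove(token); }

  boolean ciphertextPresent(String token) { return rows.containsKey(token); }
}
```

After shred() the ciphertext is still on disk, but nothing in the system can turn it back into a PAN.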

"Why don't you put PAN and metadata in one row?"

To keep the metadata table out of PCI scope. The moment metadata and PAN share a row (or even a database), analytics joining that table inherits PCI scope. Splitting them costs a dual-write and saves the entire data team from PCI hell.

"What's the p99 actually look like in production?"

Metadata read (cache hit): ~1 ms. Metadata read (cache miss): ~8 ms. Tokenize (HSM-bound): ~12–18 ms (HSM round-trip + 2× DB write). Detokenize (HSM-bound): ~10–15 ms. Tail is dominated by HSM RPC and Kafka acks=all.

Final intuition check: if you can explain why we have two databases (Vault vs Metadata), why the HSM is not in the read path 95% of the time, and why Kafka is on the synchronous path despite the latency hit, you've understood the design. Everything else is implementation detail.