A payment-grade vault that turns raw 16-digit PANs into opaque tokens, survives PCI-DSS audit, and tokenizes 50,000 card transactions per second — explained as a beginner-friendly HLD walkthrough with the LLD code appendix at the bottom.
Imagine Raj buys a coffee in the Starbucks app. He types his 16-digit Visa card number, 4111 2222 3333 4444, into the checkout screen. That string of digits is called the PAN (Primary Account Number, the number printed on the front of every credit card). For Starbucks to charge Raj, the PAN has to travel from his phone to a payment processor like Stripe, then to Visa, then to Raj's bank. So far so good: that's just a payment.
Now imagine Raj uses Starbucks every morning for two years. Starbucks would love to never ask him for his card again — they want to remember it. The naive solution is "store the PAN in the database." That instantly creates three nightmares:
A database dump (SQL injection, a stolen backup, a rogue admin) leaks millions of card numbers. Each PAN is worth $5–$50 on the dark web. The 2013 Target breach exposed 40 million PANs and cost Target ~$292M.
PCI-DSS (Payment Card Industry Data Security Standard) is the contractual standard the card networks (Visa, Mastercard and others) impose on anyone storing PANs. The moment your DB has even one PAN, every machine that DB lives on falls "in scope": quarterly scans, annual assessments, network segmentation, and a $250K+/year compliance bill.
The PAN gets cached in logs, copied into analytics, replicated to a backup region, printed in a debug trace. Now you have to find and scrub every copy. Most breaches happen because of the copies, not the original.
Tokenization is the way out. It says: "We'll keep the PAN in exactly one vault, locked behind hardware encryption. Everywhere else — your application DB, your analytics, your logs, your CRM — we'll store a meaningless lookup string called a token that looks like a card number (tok_8392_1947_2055_8821) but maps to the real PAN only inside the vault." Now if your application DB leaks, the attacker walks away with tokens that are worthless without access to the vault. Your PCI scope shrinks from "everything" to "just the vault."
This page designs a tokenization service that handles 50,000 tokenize/detokenize calls per second — roughly the volume of every major Indian payment gateway combined. The bulk of the page is the high-level design (capacity, architecture, network topology, deployment, multi-region, observability). The low-level Java code lives at the bottom as an appendix.
Before drawing boxes, pin down what the system must do, what it must guarantee, and — just as important — what it will not do. This list is what you'd hand the interviewer in the first 5 minutes.
Numbers ground every design decision. Before choosing Redis or Cassandra or Postgres, work out how much data we're storing, how much we're moving across the network, and how many machines we'll need. The interviewer is watching for whether you can do this without a calculator.
| Metric | Calculation | Result |
|---|---|---|
| Sustained throughput | given | 50,000 req/s |
| Peak throughput (2× headroom) | 50K × 2 | 100,000 req/s |
| Daily requests | 50K × 86,400 s | ~4.3 billion/day |
| Reads (detokenize + metadata) | 91% of total | ~45,500 req/s |
| Writes (new tokenize) | 9% of total | ~4,500 req/s |
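The arithmetic in the table above can be checked with a few lines. This is a minimal sketch that uses only the figures given in the table; the constant names are illustrative.

```python
# Traffic figures taken from the table above.
SUSTAINED_RPS = 50_000      # given
PEAK_FACTOR = 2             # 2x headroom
READ_SHARE_PERCENT = 91     # detokenize + metadata
SECONDS_PER_DAY = 86_400

peak_rps = SUSTAINED_RPS * PEAK_FACTOR
daily_requests = SUSTAINED_RPS * SECONDS_PER_DAY
read_rps = SUSTAINED_RPS * READ_SHARE_PERCENT // 100
write_rps = SUSTAINED_RPS - read_rps

print(f"peak:   {peak_rps:,} req/s")
print(f"daily:  {daily_requests:,} requests")
print(f"reads:  {read_rps:,} req/s")
print(f"writes: {write_rps:,} req/s")
```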
Each row in the vault is roughly: token (24 B) + encrypted PAN (32 B) + merchant_id (16 B) + BIN (6 B) + last4 (4 B) + expiry (4 B) + metadata (50 B) + timestamps (16 B) = 152 B, rounded up to ~200 bytes per record (with indices and overhead, budget 500 B).
| Horizon | Calculation | Result |
|---|---|---|
| Active card-on-file count (industry) | 500M unique cards globally | ~500M rows |
| Steady-state storage | 500M × 500 B | ~250 GB (fits 1 Postgres node easily; we shard anyway) |
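| New tokens/day | 4,500/s × 86,400 s ≈ 389M tokenize calls/day; × 0.5 dedup ≈ 194M | ~390M/day with no dedup; ~195M/day at a 50% dedup rate; as low as ~10M/day if most calls are repeat cards that reuse an existing token |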
| Audit-log writes (every call) | 50K/s × 200 B | ~10 MB/s ≈ ~860 GB/day — needs cheap blob storage (S3) + a query tier (ClickHouse) |
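As a quick check of the storage numbers, the sketch below sums the per-field sizes listed earlier and reproduces the steady-state and audit-log volumes. The field sizes and the 500 B budget come from the text; the variable names are illustrative.

```python
# Per-row field sizes in bytes, as listed in the text.
ROW_FIELDS = {
    "token": 24,
    "encrypted_pan": 32,
    "merchant_id": 16,
    "bin": 6,
    "last4": 4,
    "expiry": 4,
    "metadata": 50,
    "timestamps": 16,
}
raw_row_bytes = sum(ROW_FIELDS.values())   # before indices and overhead

BUDGET_ROW_BYTES = 500                     # budget including indices/overhead
ACTIVE_CARDS = 500_000_000
steady_state_gb = ACTIVE_CARDS * BUDGET_ROW_BYTES / 1e9

AUDIT_EVENT_BYTES = 200
CALLS_PER_SECOND = 50_000
audit_mb_per_s = CALLS_PER_SECOND * AUDIT_EVENT_BYTES / 1e6
audit_gb_per_day = CALLS_PER_SECOND * AUDIT_EVENT_BYTES * 86_400 / 1e9

print(raw_row_bytes, steady_state_gb, audit_mb_per_s, audit_gb_per_day)
```

The listed fields sum to 152 B, which the text rounds up to ~200 B before applying the 500 B budget.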
| Resource | Calculation | Result |
|---|---|---|
| Inbound bandwidth (50K × 500 B request) | 50K × 0.5 KB | ~25 MB/s (200 Mbps) per region — trivial |
| HSM operations/sec | 4,500 tokenize writes/s (metadata reads never touch the HSM; detokenize decrypts add to this) | Need cluster of 4–6 HSMs (each handles ~2K ops/s) |
| Cache hit ratio target | hot tokens (BIN/last4 reads) | ≥ 95% — keeps Postgres reads at ~2,500/s, well within one shard |
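The HSM count and the residual database read load follow from simple division. The sketch below assumes the ~2K ops/s per-HSM figure from the table and the 60% utilization ceiling mentioned later in the document; both helper functions are hypothetical.

```python
import math

def hsms_needed(ops_per_second: int, per_hsm_ops: int = 2_000,
                max_utilization: float = 1.0) -> int:
    """Smallest HSM count that keeps each box at or below max_utilization."""
    return math.ceil(ops_per_second / (per_hsm_ops * max_utilization))

def db_reads_per_second(read_rps: int, cache_hit_percent: int) -> int:
    """Reads that miss the cache and fall through to the database."""
    return read_rps * (100 - cache_hit_percent) // 100

print(hsms_needed(4_500))                        # bare minimum for tokenize writes
print(hsms_needed(4_500, max_utilization=0.6))   # with a 60% utilization ceiling
print(db_reads_per_second(45_500, 95))           # cache misses reaching Postgres
```

At a 95% hit ratio the fall-through is about 2,275 reads/s, which the table rounds to ~2,500/s. Detokenize decrypts add to the HSM load, which is one reason to provision above the bare minimum.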
The vault is an internal service; humans never touch it directly. The "actors" are other services inside our own payments platform. Naming them in a flowchart prevents the common mistake of designing the vault in isolation from the apps it serves.
Notice three things from this diagram. First, the Merchant App never sees the PAN after the initial tokenize call — it only holds tokens. Second, the Payment Gateway is the only caller allowed to fetch the full PAN (and only when it's about to send the transaction to Visa). Third, Analytics can read masked fields (BIN, last4) without unlocking the vault — this is where most "useful queries" actually happen and we want them to be cheap and safe.
A good vault has a deliberately small API. Every extra endpoint is another attack surface. Five endpoints cover 100% of the use cases above.
| Endpoint | Purpose | Auth | Idempotent? |
|---|---|---|---|
| POST /v1/tokens | Tokenize a PAN. Body: { pan, expiry, merchant_id, idempotency_key }. Returns: { token, bin, last4, network }. | mTLS + service JWT | Yes (via idempotency_key; replays return the same token) |
| GET /v1/tokens/{token} | Detokenize. Returns full PAN. Caller must have scope=pan:read. | mTLS + JWT + scope | Yes |
| GET /v1/tokens/{token}/metadata | Cheap read: BIN, last4, network, expiry. Used by Fraud, Analytics, UI. | mTLS + JWT | Yes |
| DELETE /v1/tokens/{token} | Revoke. Future detokenize returns 410 Gone. Metadata still readable for forensics. | mTLS + JWT + scope | Yes |
| POST /v1/tokens/{token}/rotate | Re-encrypt PAN under a new KEK (Key Encryption Key). Token stays the same; the ciphertext underneath changes. Run during quarterly key rotation. | mTLS + ops scope | Yes |
Why no PATCH /tokens/{token}? Cards are immutable from the vault's perspective — if the customer's card changes, the merchant tokenizes the new PAN and gets a new token. Mutability would create audit-trail nightmares.
Idempotency matters: if a merchant retries POST /v1/tokens three times, all three calls must return the same token. We pass an idempotency_key (a UUID generated by the merchant) and store it in Redis for 24h. Without this, you create duplicate tokens for the same card on every retry, a silent data-quality bug.
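The retry behaviour can be sketched with an in-memory store standing in for Redis. Everything here is an assumption for illustration: `IdempotencyStore`, `mint_token` and the injected clock are hypothetical, and a real deployment would rely on an atomic Redis operation (such as SET with NX and EX) instead of a process-local dict.

```python
import time
import uuid
from typing import Callable, Dict, Tuple

TTL_SECONDS = 24 * 3600  # the 24h retention from the text

class IdempotencyStore:
    """Process-local stand-in for Redis; not safe across processes."""

    def __init__(self, ttl_seconds: float = TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get_or_create(self, key: str, mint: Callable[[], str]) -> str:
        """Return the token already stored for key, or mint and store one."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]                  # replay: same token as before
        token = mint()
        self._entries[key] = (token, now + self._ttl)
        return token

def mint_token() -> str:
    # Hypothetical token format, for the demo only.
    return "tok_" + uuid.uuid4().hex

store = IdempotencyStore()
key = str(uuid.uuid4())                    # generated by the merchant
tokens = {store.get_or_create(key, mint_token) for _ in range(3)}
print(len(tokens))                         # three retries, one distinct token
```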
Rather than dumping the final diagram and labelling boxes, let's discover the architecture the way you'd reason through it on a whiteboard. Three passes: naive → mental model → production. At each step a real failure forces the next piece into existence.
The simplest thing that could work: one Postgres database with a table (token, pan, merchant_id, ...). Stripe → Postgres → done. Two endpoints, one table.
Now list what breaks. This is the most important sentence of any architecture interview: "This naive design fails because…"
The PAN is stored in plaintext (or with a key that lives on the same box as the data). One pg_dump and the attacker has 500 million cards. PCI-DSS would refuse to certify this design on day one.
One Postgres node handles ~10K writes/s with fsync on. At 50K TPS sustained you'd hit lock contention, WAL backpressure, and 100% disk utilization within minutes. There's no horizontal scaling story here.
Every metadata lookup hits disk. A cold Postgres read is 5–10 ms by itself. Add network + auth + serialization and you're already past the 5 ms p50 budget. No room for the payment gateway, fraud check, etc.
Each of these three failures will drive one specific component into the production design — a Vault Service with HSM for security, sharded storage for throughput, and a read-through cache for latency.
Here's the central architectural idea, and it's the one most candidates miss: the PAN and the token are two completely different objects with completely different access patterns.
The PAN: touched only by the Vault Service. Stored encrypted, key inside an HSM. Read by one caller (the Payment Gateway), and only at the moment of card-network submission. Very few reads. Locked-down network. PCI-DSS scope.
Optimize for: security, not speed.
The token: touched by everyone (merchant apps, fraud, analytics, UI, BI dashboards). Mostly reads. Mostly metadata (BIN, last4), not the PAN itself. Cache-friendly. No PCI scope as long as it never holds the PAN.
Optimize for: speed, scale, cheap reads.
Once you internalize this split, every other choice falls out naturally: encrypted PAN goes in a separate datastore from the metadata, the cache holds only metadata (never PAN), the HSM sits in front of the PAN-store and nothing else. The architecture is just the careful drawing of this fence.
Now we draw the real architecture. Every numbered box is a component we added to solve a specific failure mode from Pass 1 or to honor the split from Pass 2. After the diagram, each number gets its own card with "what it does" and "what problem it solves."
Use the numbers in the diagram above to find the matching card below. Every card has the same shape: what it does in plain language, then what would break if we removed it.
A small JavaScript / iOS / Android library that the merchant embeds in their checkout page. When Raj types his card, the SDK collects the PAN in the browser, opens a TLS connection straight to our API gateway, and hands back a token to the merchant's own backend. The merchant's server never sees the raw PAN — that's the magic that keeps the merchant out of PCI scope.
Solves: "iframe trick" — if the SDK didn't exist and the merchant POSTed the PAN from their own backend, the merchant's web server, load balancer, and logs would all be in PCI scope. The SDK shrinks scope to just our infra.
The front door. Terminates TLS 1.3, enforces mutual TLS (the client presents a certificate proving "I am Merchant 42's checkout SDK"), drops malformed traffic, rate-limits by client cert. Built on Envoy or NGINX. Stateless — easy to scale horizontally to hundreds of pods behind an L4 load balancer.
Solves: raw exposure of the Tokenize Service. Without a gateway you'd put auth, rate-limiting, and TLS termination inside the app — every change forces a redeploy of business code. The gateway lets us update certificates and rate-limit rules with zero impact on the core service.
Mints and validates short-lived JWTs that carry the caller's identity (merchant_id, service_name) and scopes (pan:read, token:create, token:revoke). A merchant SDK gets token:create only, never pan:read. The Payment Gateway gets both. This separation is what stops a compromised merchant key from unlocking PANs.
Solves: "anyone with the right URL gets PAN" — without scope-aware auth, every authenticated caller could detokenize. PCI-DSS § 7 explicitly mandates least-privilege access to PANs.
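A scope check of this kind can be sketched as a default-deny lookup. The sketch assumes the JWT signature has already been verified elsewhere; the `Claims` type, the route table and the mapping of DELETE to token:revoke are illustrative assumptions, not the service's actual configuration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

@dataclass(frozen=True)
class Claims:
    """Identity and scopes taken from an already-verified JWT."""
    subject: str
    scopes: FrozenSet[str]

# Hypothetical route-to-scope table; routes not listed here are denied.
REQUIRED_SCOPE: Dict[Tuple[str, str], str] = {
    ("POST", "/v1/tokens"): "token:create",
    ("GET", "/v1/tokens/{token}"): "pan:read",
    ("DELETE", "/v1/tokens/{token}"): "token:revoke",
}

def is_allowed(claims: Claims, method: str, route: str) -> bool:
    """Default deny: the route must be known and its scope must be held."""
    required = REQUIRED_SCOPE.get((method, route))
    return required is not None and required in claims.scopes

merchant_sdk = Claims("merchant-42-sdk", frozenset({"token:create"}))
payment_gateway = Claims("payment-gateway", frozenset({"token:create", "pan:read"}))

print(is_allowed(merchant_sdk, "POST", "/v1/tokens"))            # can mint
print(is_allowed(merchant_sdk, "GET", "/v1/tokens/{token}"))     # cannot read PAN
print(is_allowed(payment_gateway, "GET", "/v1/tokens/{token}"))  # can read PAN
```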
The hot write path. Stateless, auto-scaled, written in Go or Java. For every POST /tokens: (1) check idempotency key in Redis; (2) call HSM to encrypt PAN with the active DEK (Data Encryption Key); (3) generate a token (UUIDv7 or HMAC-based, see §8); (4) write (token, ciphertext, KEK_id, merchant_id) to Vault DB; (5) write (token, BIN, last4, expiry) to Metadata DB; (6) publish audit event to Kafka. Returns the token to the caller.
Solves: hand-rolling all of these steps in the merchant or gateway code. Centralizing them here means there's exactly one place that knows how to mint a token correctly — and one place to audit.
A 6-node Redis cluster holding the hot token → {BIN, last4, expiry, network} mapping. TTL 1 hour, evicted LRU. Holds no PAN ever — only the safe-to-display fields. Hit ratio ≥ 95% in steady state because the same cards are charged repeatedly (subscriptions, recurring buyers).
Solves: the 5 ms p50 latency budget. Without the cache, every metadata read hits Postgres and you're at 10–15 ms p50. With the cache, hot reads are ~0.5 ms and the DB sees only 5% of traffic.
A separate service from Tokenize, on a separate network segment, with stricter IAM. Only callers with scope=pan:read reach it. For every call: verify scope, fetch ciphertext from Vault DB, ask HSM to decrypt, return PAN. Every call writes an audit event before the response leaves the service.
Solves: blast radius. If the Tokenize Service is compromised, the attacker can mint tokens but cannot read PANs. The two paths share zero code and zero credentials.
Hardware Security Modules: physical, tamper-resistant appliances such as Thales HSMs or AWS CloudHSM. They hold the KEK (Key Encryption Key) that wraps every DEK (Data Encryption Key). The DEK encrypts the PAN. The KEK never leaves the HSM and never enters our application memory; the PAN's plaintext exists in a service only for the brief moment a tokenize or detokenize request is being handled. Active-active cluster of 4–6 boxes for failover and throughput (~2K ops/s each).
Solves: the moral hazard of "where is the master key?" If the master key lives in a config file or environment variable, anyone who reads it owns every PAN. An HSM physically prevents key extraction — even root on the box can't dump it.
Postgres or DynamoDB, sharded by hash(token) across 16 shards. Stores (token, ciphertext, kek_id, merchant_id, status, created_at). PAN is encrypted with AES-256-GCM using a DEK that's itself encrypted by the HSM's KEK (envelope encryption). PCI-DSS in-scope — full disk encryption, network-isolated, no analytics access.
Solves: a single-DB hotspot. With 16 shards, write load is spread evenly. Sharding by hash(token) — not by merchant_id — avoids one big merchant becoming a hot shard.
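Routing by hash(token) can be sketched as below. This uses plain modulo placement over a SHA-256 digest, which is an assumption chosen for brevity; it is simpler than consistent hashing and would move most keys if the shard count changed.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16

def shard_for(token: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a token to a shard index using a stable hash.

    Python's built-in hash() is salted per process, so a cryptographic
    digest is used to get the same answer on every pod.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Rough look at how evenly synthetic tokens spread across shards.
counts = Counter(shard_for(f"tok_{i}") for i in range(16_000))
print(min(counts.values()), max(counts.values()))
```

Because the shard depends only on the token, one large merchant's cards scatter across all 16 shards instead of piling onto one.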
A separate Postgres (or DynamoDB) with (token, BIN, last4, expiry, network, merchant_id, status). No PAN, no ciphertext. Because it never holds PAN data, it's not in PCI scope — analytics, fraud, BI can query it freely. Replicated to a read replica for analytics.
Solves: "every analyst is a PCI risk." When metadata lives on a non-PCI box, your data team can join it with order tables without escalating scope.
Every tokenize / detokenize / revoke call writes a structured event to a Kafka topic: {event_id, ts, actor, action, token, merchant_id, ip, scope_used}. Producer waits for acks=all on the write path because PCI-DSS § 10 says the audit log must be durable before the operation completes — if the log is missing, the operation didn't happen.
Solves: "did you actually log every PAN read?" Synchronous Kafka write with acks=all gives durability without coupling your latency to a slow log DB.
Kafka events are streamed into ClickHouse via Kafka Engine tables. Auditors query: "show me every detokenize on token X in the last 90 days" — ClickHouse answers in < 100 ms over billions of rows. Compaction + columnar layout keeps storage cheap.
Solves: running ad-hoc queries on Postgres would slow down the write path. A columnar OLAP store is built exactly for this audit pattern.
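PCI-DSS Requirement 10 requires at least 12 months of audit-log history, with the most recent 3 months immediately available for analysis; this design keeps 7 years, well beyond that minimum. Storing all that in ClickHouse would cost a fortune. After 90 days, events tier down to S3 Glacier: pennies per GB, and retrieval takes hours, but auditors don't need millisecond access for 5-year-old events.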
Solves: the long-tail-storage bill. ClickHouse handles the last 90 days hot, Glacier handles the next 6 years cold, deletion happens at year 7.
Let's trace one real request through every numbered component. Watch how the data-plane / control-plane split from Pass 2 keeps the PAN in one place.
1. Raj types 4111 2222 3333 4444 into Starbucks checkout. The Merchant SDK ① running in his browser collects the PAN and POSTs it to our API Gateway ② over TLS 1.3, presenting Starbucks' client cert.
2. The gateway validates the cert, the Auth Service confirms the caller's scope is token:create only (no pan:read), then routes the request to Tokenize Service ④.
3. Tokenize Service ④ checks the idempotency key, has the HSM ⑦ encrypt the PAN, and generates the token tok_018f....
4. It writes the ciphertext to Vault DB ⑧ and the metadata (BIN 411122, last4 4444) to Metadata DB ⑨ in a two-phase commit.
5. It publishes the audit event to Kafka with acks=all, then returns the token to Starbucks. Total latency: ~7 ms.
6. Later, to charge the card, Starbucks sends { token: tok_018f..., amount: 250 } to its payment gateway.
7. The payment gateway calls the Detokenize Service with scope=pan:read. Detok service hits Cache ⑤ first for the BIN routing (95% hit), then fetches ciphertext from Vault DB ⑧, asks HSM ⑦ to decrypt, gets the PAN back for 3 ms, and immediately submits to Visa.

"p99 ≤ 25 ms" sounds easy until you add up where the milliseconds actually go. The 25 ms budget has to cover network round-trips, mTLS handshakes (mostly cached), auth, the HSM RPC, two DB writes, a Kafka acks=all, and the response serialization. If any one of these blows its share, the budget is gone. Let's pull each flow apart by component so you can defend it on a whiteboard.
Tokenize is the most expensive call because it hits the HSM and writes to two databases. Here's the millisecond-by-millisecond breakdown a senior engineer would walk through:
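Sum of per-hop p50s ≈ 8.5 ms (≈ 7.5 ms when the two DB writes run in parallel) · end-to-end p99 ≈ 18 ms (HSM tail dominates; the per-hop p99s total ~23 ms but rarely coincide)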
| Hop | p50 | p99 | What's happening |
|---|---|---|---|
| TLS handshake (amortized) | 0.05 ms | 0.5 ms | TLS 1.3 0-RTT or session resumption; only first call on a connection pays the full 1-RTT cost. |
| API Gateway routing | 0.2 ms | 0.6 ms | Envoy decode, route match, header rewrite. Tail comes from worker-thread queuing under load (Envoy is C++, so there are no GC pauses). |
| Auth Service (JWT verify) | 0.3 ms | 0.8 ms | Public-key verify (RS256) — CPU bound. Cached JWKs avoid network calls. |
| Idempotency lookup (Redis) | 0.3 ms | 1 ms | SET NX EX single round-trip in same AZ. |
| HSM encrypt (the bottleneck) | 4 ms | 10 ms | Wrap DEK + AES-GCM encrypt PAN. CloudHSM RPC + crypto = ~3 ms median; tail is HSM queue depth. |
| Vault DB write | 1 ms | 3 ms | One INSERT with sync commit on Postgres primary in same AZ. |
| Metadata DB write | 1 ms | 3 ms | Second INSERT, parallel with vault write under best implementation. |
| Kafka acks=all | 1.5 ms | 4 ms | Replicate to 2 ISRs before ack. We accept this cost because PCI § 10 demands durable audit before responding. |
| Response serialization | 0.2 ms | 0.5 ms | JSON encode + send. |
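Adding up the rows of the table is a useful sanity check on the budget. The sketch below copies the per-hop figures from the table and sums them; treating the two DB writes as parallel is the "best implementation" case the table mentions.

```python
# (hop, p50 ms, p99 ms) copied from the table above.
HOPS = [
    ("TLS handshake (amortized)", 0.05, 0.5),
    ("API Gateway routing", 0.2, 0.6),
    ("Auth Service (JWT verify)", 0.3, 0.8),
    ("Idempotency lookup (Redis)", 0.3, 1.0),
    ("HSM encrypt", 4.0, 10.0),
    ("Vault DB write", 1.0, 3.0),
    ("Metadata DB write", 1.0, 3.0),
    ("Kafka acks=all", 1.5, 4.0),
    ("Response serialization", 0.2, 0.5),
]

p50_serial = sum(p50 for _, p50, _ in HOPS)
p99_total = sum(p99 for _, _, p99 in HOPS)

# If the two DB writes overlap, only the slower one counts toward the total.
vault_p50 = HOPS[5][1]
meta_p50 = HOPS[6][1]
p50_parallel = p50_serial - min(vault_p50, meta_p50)

slowest_hop = max(HOPS, key=lambda hop: hop[1])[0]

print(f"p50, serial DB writes:   {p50_serial:.2f} ms")
print(f"p50, parallel DB writes: {p50_parallel:.2f} ms")
print(f"sum of per-hop p99s:     {p99_total:.1f} ms")
print(f"largest p50 share:       {slowest_hop}")
```

The sum of per-hop p99s is usually pessimistic as an end-to-end estimate, because the slowest case of every hop rarely lands on the same request.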
Detokenize skips the metadata write but still touches the HSM. The hot read is from Vault DB (which is small enough to fit fully in page cache) rather than disk.
Sum p50 ≈ 8 ms · p99 ≈ 15 ms
The cheap path. No HSM, no Vault DB. Just gateway + auth + Redis. This is where most of the 50K TPS goes.
Sum p50 ≈ 1.6 ms · p99 ≈ 5 ms
Single HSMs handle ~2K ops/sec. When all 4 boxes hit 80% utilization, queuing latency adds 5–10 ms to p99. Mitigation: keep utilization under 60% via aggressive autoscaling on HSM-pool depth metric.
Go's STW pause is < 1 ms with current GC; older Java apps can pause 50+ ms during full GCs. Pin the Tokenize Service to G1GC with low pause targets, or use Shenandoah/ZGC.
If a Kafka broker fails mid-request, the producer waits for a new leader to be elected (typically < 6 s). During this window every audit-blocking write spikes — we use very short retry budgets to fail fast instead of waiting.
"Generate a token" sounds trivial — give it a UUID and call it a day. In practice there are three families of approaches, each with very different consequences for the database, the cache, and the security posture.
Generate a cryptographically random ID (UUIDv4 or v7), store the row (token → PAN) in the vault DB. Token has zero relation to the PAN — knowing the token tells the attacker nothing.
Pros: Strongest security. Format flexible. Easy to revoke (delete row).
Cons: Requires DB lookup every detokenize (mitigated by cache). Same PAN tokenized twice gives different tokens unless we dedupe.
Use when: default choice for 90% of vaults. This is what we use.
Token = HMAC-SHA256(PAN, secret_key) truncated to 16 bytes. Same PAN always gives the same token. No DB lookup needed for tokenize.
Pros: Idempotent for free. Useful when merchants want "same card = same token" for analytics across sessions.
Cons: If the secret leaks, an attacker can tokenize a guessed PAN and check if it matches your stored tokens — known-PAN attack. Mitigation: per-merchant secret. We still need a DB to detokenize.
Use when: the merchant explicitly needs cross-session card recognition without storing PANs themselves.
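A deterministic token of this kind can be sketched with the standard library's `hmac` module. The `tok_` prefix, hex encoding and digit normalization are assumptions made for the example; the truncation to 16 bytes follows the description above.

```python
import hashlib
import hmac

def deterministic_token(pan: str, merchant_secret: bytes) -> str:
    """HMAC-SHA256 over the PAN digits, truncated to 16 bytes."""
    digits = "".join(ch for ch in pan if ch.isdigit())   # ignore spaces/dashes
    mac = hmac.new(merchant_secret, digits.encode("ascii"), hashlib.sha256)
    return "tok_" + mac.digest()[:16].hex()

# Placeholder secrets for the demo; real ones would come from a key store.
secret_a = b"merchant-a-demo-secret"
secret_b = b"merchant-b-demo-secret"

t1 = deterministic_token("4111 2222 3333 4444", secret_a)
t2 = deterministic_token("4111222233334444", secret_a)
t3 = deterministic_token("4111222233334444", secret_b)
print(t1 == t2)   # same card, same merchant: same token
print(t1 == t3)   # same card, different merchant: different token
```

Using a separate secret per merchant limits a leaked secret to that merchant's tokens, which is the mitigation named above.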
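Use the NIST FF1/FF3-1 algorithms to encrypt the PAN into a string that also looks like a PAN (16 digits; keeping a valid Luhn checksum takes an extra step, such as encrypting the first 15 digits and recomputing the check digit). The "token" is the encrypted PAN itself, so no vault lookup is needed.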
Pros: Token fits in existing card-number columns in legacy systems. No DB hop for detokenize — just decrypt.
Cons: Compromise of the FPE key compromises every token. Tokens are not revocable independently — you'd rotate the key for everyone. Slower than HMAC.
Use when: retrofitting legacy mainframe systems that cannot hold a non-numeric token.
Our choice: random tokens by default. If a merchant passes "strategy":"deterministic" in the tokenize call, we instead compute HMAC-SHA256(PAN, merchant_specific_pepper). The merchant-specific pepper means a leak in one merchant's keys doesn't blow up every merchant.
Token format we emit: tok_<base32(uuidv7)> — e.g. tok_018f7a2b9c4d6e8f9a1b2c3d. Prefix lets logs/dashboards tell tokens apart from other IDs at a glance. Base32 (not base64) avoids URL-encoding issues.
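One way to produce a `tok_<base32(uuidv7)>` string is sketched below. The UUIDv7 is assembled by hand following the RFC 9562 bit layout, since older Python versions do not ship a `uuid7` function. The encoding choice (RFC 4648 base32, lowercased, padding stripped) is an assumption, so the output is 26 characters after the prefix and will not look exactly like the example above.

```python
import base64
import os
import time
import uuid
from typing import Optional

def uuid7(now_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7: 48-bit ms timestamp, version, variant, random bits."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    value = (now_ms & ((1 << 48) - 1)) << 80          # timestamp in the top 48 bits
    value |= int.from_bytes(os.urandom(10), "big")    # fill the low 80 bits randomly
    value = (value & ~(0xF << 76)) | (0x7 << 76)      # set version = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)      # set variant = binary 10
    return uuid.UUID(int=value)

def format_token(u: uuid.UUID) -> str:
    """Encode the 16 UUID bytes as unpadded, lowercase base32 with a prefix."""
    body = base64.b32encode(u.bytes).decode("ascii").rstrip("=").lower()
    return "tok_" + body

token = format_token(uuid7())
print(token, len(token))
```

Because the timestamp sits in the most significant bits, tokens minted later sort after earlier ones, which tends to be friendlier to B-tree indexes than fully random IDs.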
We deliberately keep the PAN-bearing table in a different schema (and on a different DB) from the metadata. The ER diagram below shows both schemas as if they were one logical model — but in production they live behind different network boundaries.
Three things to note:
- The two schemas share the same key (token) but live in different physical databases.
- The join happens only inside the Vault Service: the Metadata DB never knows the ciphertext exists, and the Vault DB never holds BIN/last4 (so a vault leak doesn't reveal which cards belong to whom).
- Key rotation is online: make kek_id_v2 active for new tokens, and rotate old rows in a background job. The token PK never changes.

Every token goes through one of these states. Transitions are the only writes that mutate an existing row.
| State | Reads return | Writes allowed? |
|---|---|---|
| ACTIVE | PAN (with scope) + metadata | only revoke, rotate |
| REVOKED | metadata only (last4 + network for forensics); detokenize returns 410 Gone | none; terminal until purge |
| EXPIRED | metadata only; PAN refused | none; terminal until purge |
| ROTATING | PAN still readable through new KEK; old KEK kept until job completes | only background re-encryption worker |
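The state table can be encoded as an explicit transition map so that illegal writes fail loudly. In this sketch the action names and the ROTATING to ACTIVE transition on completion are assumptions; how a token reaches EXPIRED is not specified in the table, so it is left out.

```python
from enum import Enum
from typing import Dict, Tuple

class TokenState(Enum):
    ACTIVE = "ACTIVE"
    REVOKED = "REVOKED"
    EXPIRED = "EXPIRED"
    ROTATING = "ROTATING"

# (current state, action) -> next state. Anything not listed is rejected.
TRANSITIONS: Dict[Tuple[TokenState, str], TokenState] = {
    (TokenState.ACTIVE, "revoke"): TokenState.REVOKED,
    (TokenState.ACTIVE, "rotate"): TokenState.ROTATING,
    # Assumed: the background re-encryption worker returns the row to ACTIVE.
    (TokenState.ROTATING, "rotation_done"): TokenState.ACTIVE,
}

def transition(state: TokenState, action: str) -> TokenState:
    """Return the next state, or raise if the table does not allow the move."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed in state {state.value}") from None

def pan_readable(state: TokenState) -> bool:
    """Detokenize succeeds only while ACTIVE or mid-rotation."""
    return state in (TokenState.ACTIVE, TokenState.ROTATING)

state = transition(TokenState.ACTIVE, "rotate")
state = transition(state, "rotation_done")
print(state.value, pan_readable(state))
```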
PCI scope is defined by network reachability, not by what your code does. If a machine can open a TCP connection to a machine that holds PANs, the auditor considers it in-scope. So the network diagram is literally how you keep the auditor away from 95% of your fleet. Think of it as concentric rings, laid out as the five zones in the table below, with firewall rules that only allow traffic going inward, never sideways or outward.
| Zone | What lives here | Who can talk in | PCI scope? |
|---|---|---|---|
| Zone 1 · DMZ | ALB, API Gateway (Envoy). TLS termination, mTLS, WAF. | Public internet (port 443 only) | Connected — minimally in scope |
| Zone 2 · App | Auth Service, Tokenize, Metadata API, Redis, Metadata DB | Only DMZ; outbound to Zone 3 + 5 allowed | Tokenize is in-scope (touches PAN); rest is connected-only |
| Zone 3 · PCI-in-scope | Detokenize Service, Vault DB (encrypted PAN) | Only Zone 2 services with specific IAM role; nothing else | Fully in scope |
| Zone 4 · HSM | HSM cluster, key-rotation tooling | Only Tokenize + Detokenize via dedicated VPC endpoint; admin via separate jump host | In scope — physical + logical isolation |
| Zone 5 · Audit | Kafka, ClickHouse, S3 archive | Inbound only — no outbound back to Zones 2/3 | Connected (holds audit data, not PAN) |
If an attacker compromises a pod in Zone 2 (say, a vulnerable image in the Metadata API), here's what they can and cannot reach:
PCI-DSS (Payment Card Industry Data Security Standard) is the rulebook every system touching PANs must follow. It has 12 top-level requirements; the ones that actively shape this design are the ones below. Treat them as hard constraints, not nice-to-haves: failing an assessment can mean fines or, ultimately, losing the right to process cards.
PAN at rest must be unreadable. We use envelope encryption: the DEK (per-record AES-256 key) encrypts the PAN; the KEK (master key) encrypts the DEK; the KEK lives inside an HSM and never appears in our application memory. A dump of the Vault DB yields ciphertext that needs the HSM to decrypt — and the HSM refuses anyone without the right IAM role.
TLS 1.2+ everywhere. mTLS between services. Old TLS, SSL, and any cipher with known weaknesses (RC4, 3DES) are disabled at the gateway. Within our VPC we still use TLS — not "trust the network" — because lateral movement is the most common attack vector after initial compromise.
JWT scopes split token:create from pan:read. The Merchant SDK can only mint tokens; only the Payment Gateway gets pan:read. Even within the team, on-call engineers cannot read PANs — only incident-time elevated access through a separate break-glass workflow that pages the security team.
Every tokenize, detokenize, revoke, and admin action writes an immutable Kafka event with acks=all before the response leaves the service. Logs are retained for 7 years (90 days hot in ClickHouse, then S3 Glacier). Failed audit writes block the operation — yes, this can cause an outage; that's by design.
KEKs rotate quarterly. Rotation is online: new tokens get the new KEK immediately; old tokens are re-wrapped in the background. Old KEKs are retained (read-only) until the last token using them is rotated or purged. Key generation, retirement, and destruction are all logged.
HSMs are FIPS 140-2 Level 3 — tamper-resistant, alarmed, audited. On AWS we use CloudHSM (managed) or self-host Thales boxes in a co-located facility. The HSM management network is air-gapped from the application network.
The token tok_018f... contains no embedded BIN, no merchant_id, no expiry; leaking a token gives the attacker an opaque blob.

The architecture diagram tells you what services exist. The deployment topology tells you where they run: which K8s namespace, how many replicas, how they autoscale, and how the HSM (which is not a Pod; it's literally a physical box) plugs in. This is where interviewers test whether you've actually run a production system, or just drawn boxes.
| Service | Per-pod capacity | Pods at 50K TPS | Resources |
|---|---|---|---|
| API Gateway (Envoy) | ~8K req/s | ~14 (with headroom 24) | 2 vCPU · 2 GB · no PV |
| Tokenize Service (Java/Go) | ~400 req/s (HSM-bound) | ~50 (with headroom 60) | 2 vCPU · 1 GB |
| Detokenize Service | ~400 req/s (HSM-bound) | ~30 (with headroom 40) | 2 vCPU · 1 GB |
| Metadata Read Service | ~3K req/s (Redis-bound) | ~20 (with headroom 30) | 1 vCPU · 512 MB |
| Auth Service | ~5K req/s | ~12 | 1 vCPU · 512 MB |
Most teams default to CPU-based HPA. For a tokenization service that's wrong because the bottleneck is the HSM, not CPU. We use custom metrics via Prometheus Adapter:
Scale up if hsm_pool_depth > 100 OR p95_latency_ms > 15. Scale down if both metrics have been below 50% threshold for 10 min.
Scale on requests_per_pod > 6000 (= 75% of capacity). Drop a pod only if RPS < 3000 for 5 min — slow scale-down avoids thrash.
Scale on redis_connection_pool_pct > 70%. Redis is the gating resource; CPU rarely matters here.
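The first rule above (HSM pool depth and p95 latency) can be written as a small decision function. This is only a sketch of the logic: in a real cluster the thresholds would live in an HPA spec fed by custom metrics, and the function and parameter names here are hypothetical.

```python
HSM_DEPTH_LIMIT = 100       # scale up above this queue depth
P95_LIMIT_MS = 15.0         # scale up above this p95 latency
QUIET_MINUTES = 10          # how long both metrics must stay low

def scaling_decision(hsm_pool_depth: float, p95_latency_ms: float,
                     minutes_below_half: float) -> str:
    """Return 'scale_up', 'scale_down' or 'hold' for the HSM-bound services."""
    if hsm_pool_depth > HSM_DEPTH_LIMIT or p95_latency_ms > P95_LIMIT_MS:
        return "scale_up"
    both_low = (hsm_pool_depth < HSM_DEPTH_LIMIT * 0.5
                and p95_latency_ms < P95_LIMIT_MS * 0.5)
    if both_low and minutes_below_half >= QUIET_MINUTES:
        return "scale_down"
    return "hold"

print(scaling_decision(120, 9.0, 0))    # deep HSM queue
print(scaling_decision(30, 5.0, 12))    # quiet for long enough
print(scaling_decision(30, 5.0, 3))     # quiet, but not for long enough
```

Scaling up on either signal but down only when both are low keeps the asymmetry the rule describes: quick to add capacity, slow to remove it.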
HSMs are not Pods. They're physical (or AWS-managed) appliances reached over a VPC endpoint. We run 4 HSMs per region (2 per AZ for HA). The app reaches them via an internal HSM proxy that:
One region is enough for capacity. Two regions are required for survival. The hard question is which kind of two regions: active-active (both serve writes) or active-passive (one serves, one waits). For a payment vault, the answer is active-passive with regional pinning, and we'll walk through why active-active is a trap here.
Active-active feels obviously better — more throughput, faster regional failover, lower latency for users. But for tokenization, it creates a correctness problem you can't paper over:
Imagine the merchant retries a tokenize call. The first hits us-east, the retry (after a 1-second timeout) hits us-west. Both regions check the idempotency key — both see "not found" because cross-region replication takes 30 s. Both encrypt the PAN under different DEKs, write to their own Vault DB, return different tokens. Now the merchant has two tokens for the same card, and nobody knows.
Two regions, both minting UUIDv7s. A UUIDv7 carries 74 random bits next to its millisecond timestamp, so a cross-region collision is vanishingly unlikely, but clock drift between regions does break the time-ordering that makes UUIDv7 attractive as a key. More importantly, the audit log now lives in two places with no single source of truth for PCI § 10.
| Data | Mechanism | Lag | Why this way |
|---|---|---|---|
| Vault DB (encrypted PAN) | Postgres logical replication, async | ~30 s RPO | Sync would tie us-east p99 to cross-region latency (60+ ms). We accept 30 s of potential data loss for healthy latency. |
| Metadata DB | Aurora Global async replica | ~5 s | Read-from-replica for analytics in west region; failover RTO ~ 1 minute. |
| Kafka audit topic | MirrorMaker 2 | ~10 s | Audit logs replicate to both regions so a region loss does not erase compliance trail. |
| Redis cache | Don't replicate | n/a | Cache is rebuildable. After failover, west's Redis is cold for ~5 min; we accept the cache-miss storm and pre-warm top tokens during failover drills. |
| HSM keys | Manual sync (operator-driven) | quarterly | HSM key material never traverses the network. Operators export wrapped KEK from primary HSM, import to DR HSM, audit both sides. Done as part of the quarterly rotation drill. |
Fail over the database with aws rds failover-db-cluster. Repoint app config via consul.

EU data-protection rules (GDPR's limits on cross-border transfers, plus PSD2) push EU cardholder data to stay in the EU. EU merchants are pinned to eu-west-1 at the Route 53 layer; their PANs never touch us-east. The EU region is its own isolated tokenization vault with its own HSM keys, so a US-region breach doesn't expose EU cards. The cost is operating two parallel vaults, but compliance leaves little choice.
You almost never scale "the database" first. The three real bottlenecks in a tokenization vault, in order, are: HSM throughput, audit-log durability, and cache miss storms. Solving them is most of the scaling work.
A single HSM does ~2,000 wrap/unwrap operations per second. At 4,500 tokenize writes/sec we need at least 3 HSMs; with headroom, run 6 in active-active and load-balance via an HSM proxy. Critically: the 45K metadata reads/sec do NOT hit the HSM — only PAN encrypt/decrypt does. That's why the cache exists.
Synchronous Kafka writes with acks=all + min.insync.replicas=2 on a 6-broker cluster. 50K events/sec × 200 B ≈ 10 MB/s — Kafka handles that with one hand tied. The trick is keeping producer batching tight enough that acks=all doesn't add > 3 ms p99.
If Redis loses its hot keys (failover, eviction storm), suddenly all ~45K reads/s hit Postgres. Mitigations: (1) Redis cluster with replicas + automatic failover; (2) "request coalescing": when N concurrent requests miss on the same key, only one hits the DB; (3) warm-cache job on cold start.
500M tokens × 500 B = 250 GB. Fits one node, but we shard for write throughput and blast radius:
Shard key: hash(token) with consistent hashing across 16 shards. Why not merchant_id? Because one huge merchant (Amazon) would be a hot shard. Hashing the token spreads rows near-uniformly.

Both Tokenize Service and Detokenize Service are stateless: scale by adding pods behind an L4 load balancer. Per-pod budget: ~500 req/s comfortably (Go) or ~300 req/s (Java with JIT warm). For 50K TPS plan ~150 pods total across regions. CPU scales linearly; memory is mostly Redis client pools and connection pools.
A payment vault that's "down" but not alerted is worse than no vault at all — every payment in flight times out, merchants disable cards, customer support floods. So observability isn't optional plumbing; it's part of the product. Three layers: SLOs set the contract, metrics + traces tell you whether you're meeting it, and alerts wake the on-call when you're not.
| SLI (what we measure) | SLO (the target) | Error budget |
|---|---|---|
| Availability (any 5xx ratio per minute) | 99.99% over 28 days | ~4 min downtime / 28 d |
| Tokenize p99 latency | < 25 ms over 28 days | 1% of requests may exceed |
| Detokenize p99 latency | < 30 ms over 28 days | 1% |
| Audit log completeness | 100% — no missing events | 0 (hard requirement) |
| Failed-tokenize PAN-not-stored guarantee | 100% — never partial state | 0 |
Once the error budget is burned, all new feature deploys are frozen until the budget resets. This is the SRE contract — it forces the team to invest in reliability, not just feature velocity.
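The availability row's error budget follows directly from the SLO and the window. A minimal sketch of that arithmetic (the class name is hypothetical):

```java
import java.time.Duration;

// Hypothetical helper: converts an availability SLO into allowed downtime for a window.
final class ErrorBudget {
    private ErrorBudget() {}

    static Duration allowedDowntime(double sloPercent, Duration window) {
        double budgetFraction = 1.0 - sloPercent / 100.0; // 99.99% leaves 0.01%
        return Duration.ofMillis(Math.round(window.toMillis() * budgetFraction));
    }
}
```

`allowedDowntime(99.99, Duration.ofDays(28))` comes to 241.92 seconds, about 4 minutes, which is the figure in the table.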
Latency: p50 / p95 / p99 per endpoint, per region. Broken out by tokenize vs detok vs metadata. Histograms with 1-min granularity; 7-day retention.
Traffic: RPS per endpoint per region. Per-merchant top-10 to spot a single tenant blasting the API.
Errors: 4xx + 5xx ratios. 4xx by reason (auth fail, scope mismatch, idempotency conflict). 5xx triggers paging.
Saturation: HSM queue depth, Redis connection pool utilization, DB connection pool, Kafka producer queue depth, pod CPU.
Every request gets an OpenTelemetry trace from gateway → service → HSM → DB → Kafka. Sampled at 1% in steady state, 100% for any request flagged as anomalous (e.g., 5xx, p99 outlier). Traces let an engineer answer "where did this one slow request spend its time?" without log-grepping.
ClickHouse continuous query — actor X's detok rate is 5× their 30-day rolling average over a 5-min window. Possible compromised credential or bulk exfil attempt. Page security team, not just on-call SRE.
If Kafka MirrorMaker 2 lag exceeds 60 s or ClickHouse ingestion lag exceeds 5 min, treat it as a compliance gap and page: the audit trail must always be present.
Any single HSM > 1% error rate over 5 min → quarantine. > 2 HSMs degraded → page; we're running on reduced capacity.
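A KEK older than 95 days triggers a warning; 100 days triggers a page. These thresholds assume a quarterly rotation policy; PCI DSS itself requires rotating keys at the end of a defined cryptoperiod rather than mandating one fixed interval.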
A synthetic test logs a fake-but-valid-Luhn number every hour; if our log scrubber misses it, page security. This catches the day a developer adds a debug print.
Reconciler job: for each new token, both the vault row and the metadata row must exist within 5 s. A mismatch rate above 0.01% warrants investigation. This catches dual-write bugs.
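The PAN-in-logs canary above depends on recognising a Luhn-valid digit run in a log line. A minimal sketch of that detector, assuming PANs appear as an unbroken run of 13 to 19 digits (the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical detector for card-like numbers in log lines.
final class PanCanary {
    // Assumption: unformatted PANs only; "4111 1111 ..." with spaces or dashes is not matched.
    private static final Pattern CANDIDATE = Pattern.compile("\\b\\d{13,19}\\b");

    private PanCanary() {}

    /** Luhn checksum: double every second digit from the right, sum, test mod 10. */
    static boolean luhnValid(String digits) {
        if (digits.isEmpty()) return false;
        int sum = 0;
        boolean doubleIt = false;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (d < 0 || d > 9) return false;
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    /** True if the line contains at least one Luhn-valid digit run of PAN length. */
    static boolean containsPan(String logLine) {
        Matcher m = CANDIDATE.matcher(logLine);
        while (m.find()) {
            if (luhnValid(m.group())) return true;
        }
        return false;
    }
}
```

The synthetic test would emit a line containing a well-known test number such as 4111111111111111 and page if `containsPan` is still true on the scrubbed output.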
For each failure, the question is the same: does the system corrupt data, return wrong data, or just fail loudly? Tokenization can never afford the first two — better to fail loud than to silently mint a token whose PAN cannot be retrieved.
| Failure | Detection | Mitigation |
|---|---|---|
| HSM unreachable | Healthcheck on HSM client; circuit breaker opens after 3 timeouts | Tokenize returns 503 (fail closed). Never "encrypt later" — the PAN must not enter the DB unencrypted. |
| Vault DB shard down | Driver-level errors; per-shard health | Replica auto-promoted within ~10 s. Calls to that shard fail 503 during gap. |
| Kafka unavailable | Producer cannot get acks | Block the operation (audit log is mandatory). Optional WAL-style local disk buffer as last-resort fallback, with strict TTL and replay on Kafka recovery. |
| Vault write succeeds, metadata write fails | Outbox / CDC reconciler detects mismatch | Async reconciliation job replays the metadata write from the Kafka audit event (which has everything needed to rebuild the metadata row). |
| Redis cache cluster failover | Increased miss rate, p99 spike | Request coalescing prevents thundering herd; new primary takes over < 30 s; warm-up job preloads top 1M tokens. |
| Compromised JWT signing key | Out-of-band alert; anomalous detok rate | Rotate signing key, invalidate all live JWTs (short TTLs make this cheap), audit-log all detok in the affected window. |
| Compromised KEK (worst case) | HSM tamper alarm or insider report | Mass re-encryption job with new KEK; mark all rows ROTATING. The old KEK is destroyed only after the last row finishes rotation. PCI-required incident report within 24h. |
| Replay attack on tokenize | Idempotency key collision | Idempotency store returns the original token — replay is a no-op by design. |
| Token enumeration | Detok 404 rate spike from one actor | Tokens are 128-bit UUIDv7 values with 74 random bits, so enumeration is infeasible. Rate-limit + IP block on the actor. |
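The "circuit breaker opens after 3 timeouts" behaviour in the first row can be sketched as a small consecutive-failure breaker. This is an assumption-laden sketch: `HsmCircuitBreaker`, the cool-down parameter, and the injected clock are hypothetical, and a production service would more likely use an existing resilience library.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical consecutive-failure circuit breaker guarding HSM calls.
final class HsmCircuitBreaker {
    /** Thrown while the circuit is open; the API layer would map this to HTTP 503. */
    static final class OpenException extends RuntimeException {
        OpenException() { super("HSM circuit open: failing closed"); }
    }

    private final int failureThreshold;
    private final long coolDownMillis;
    private final LongSupplier clockMillis; // injected so tests can control time
    private int consecutiveFailures = 0;
    private long openedAt = -1;             // -1 means the circuit is closed

    HsmCircuitBreaker(int failureThreshold, long coolDownMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.coolDownMillis = coolDownMillis;
        this.clockMillis = clockMillis;
    }

    <T> T call(Supplier<T> hsmOp) {
        if (!allowRequest()) throw new OpenException(); // reject without touching the HSM
        try {
            T result = hsmOp.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    // Closed, or open long enough that a trial call is allowed through.
    private synchronized boolean allowRequest() {
        return openedAt < 0 || clockMillis.getAsLong() - openedAt >= coolDownMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        if (++consecutiveFailures >= failureThreshold) openedAt = clockMillis.getAsLong();
    }
}
```

While the breaker is open the tokenize path never reaches the encryption step, which is what keeps the failure mode "fail closed" rather than "encrypt later".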
The HLD is the system view. The LLD here zooms into one box from the architecture diagram — the Tokenize Service (④) — and shows the classes inside it. Reserved for the last 15 minutes of an interview if asked "now show me the code."
| Pattern | Where | Why |
|---|---|---|
| Strategy | TokenGenerator (UUIDv7 vs HMAC) | The choice between random and deterministic tokens is a runtime decision per merchant. |
| Repository | VaultRepository, MetadataRepository | Abstracts away whether the storage is Postgres, DynamoDB, or a mock — critical for unit tests with no DB. |
| Facade | TokenizationService | Hides the 6-step orchestration (idempotency check → mint → encrypt → store → audit → memoize) behind one method. |
| Adapter | HsmEncryptionService | The HSM vendor SDK (Thales, AWS CloudHSM) has its own bespoke API — adapter normalizes it to our EncryptionService interface so we can swap vendors. |
| Builder | TokenizeRequest, AuditEvent | Both have > 5 fields, several optional. Builder avoids 10-arg constructors. |
| Observer | AuditPublisher consumers | ClickHouse, S3, and fraud all subscribe to the same Kafka topic without the Tokenize Service knowing about them. |
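The Strategy row can be made concrete with a sketch of the `TokenGenerator` interface, which the appendix uses but does not show. The three-argument `generate` signature is inferred from its call site; the `tok_` prefix, the hex encoding, the in-memory HMAC key, and the use of plain random bytes instead of UUIDv7 for the random strategy are all simplifying assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.HexFormat;
import java.util.Map;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

enum TokenStrategy { RANDOM, DETERMINISTIC } // repeated here so the sketch compiles on its own

interface TokenGenerator {
    String generate(String pan, String merchantId, TokenStrategy strategy);
}

// Random strategy: the token carries no information about the PAN.
final class RandomTokenGenerator implements TokenGenerator {
    private static final SecureRandom RNG = new SecureRandom();

    @Override
    public String generate(String pan, String merchantId, TokenStrategy strategy) {
        byte[] raw = new byte[16];
        RNG.nextBytes(raw);
        return "tok_" + HexFormat.of().formatHex(raw);
    }
}

// Deterministic strategy: same (merchant, PAN) pair always yields the same token.
final class HmacTokenGenerator implements TokenGenerator {
    private final byte[] key; // assumption: in production this key would live in the HSM

    HmacTokenGenerator(byte[] key) { this.key = key.clone(); }

    @Override
    public String generate(String pan, String merchantId, TokenStrategy strategy) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            mac.update(merchantId.getBytes(StandardCharsets.UTF_8));
            mac.update((byte) 0); // separator so ("ab","c") and ("a","bc") cannot collide
            byte[] tag = mac.doFinal(pan.getBytes(StandardCharsets.UTF_8));
            return "tok_" + HexFormat.of().formatHex(tag, 0, 16); // truncate to 128 bits
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}

final class TokenGenerators {
    private TokenGenerators() {}

    /** Builds the strategy map that a service like TokenizationService would be given. */
    static Map<TokenStrategy, TokenGenerator> defaults(byte[] hmacKey) {
        return Map.of(TokenStrategy.RANDOM, new RandomTokenGenerator(),
                      TokenStrategy.DETERMINISTIC, new HmacTokenGenerator(hmacKey));
    }
}
```

Scoping the HMAC input by merchant means two merchants tokenizing the same card get different tokens, so tokens cannot be joined across tenants.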
Java 17+, no frameworks. The interesting parts are the orchestration in TokenizationService.tokenize and the envelope-encryption sequence in HsmEncryptionService.encrypt. Everything else is plumbing.
import java.time.Instant;

public enum TokenStrategy { RANDOM, DETERMINISTIC }
public enum TokenStatus { ACTIVE, REVOKED, EXPIRED, ROTATING }
public enum CardNetwork { VISA, MASTERCARD, AMEX, RUPAY, DISCOVER }

// cipher = 12-byte IV followed by the AES-GCM ciphertext; wrappedDek = DEK wrapped by the KEK named kekId
public record Ciphertext(byte[] cipher, byte[] wrappedDek, String kekId) {}
public record VaultRecord(String token, String merchantId, Ciphertext payload,
                          TokenStatus status, Instant createdAt, Instant revokedAt) {}
public record MetadataRecord(String token, String bin, String last4, CardNetwork network,
                             String expiryYYYYMM, String merchantId, TokenStatus status) {}
public record AuditEvent(String eventId, Instant ts, String actor, String action,
                         String token, String merchantId, String ip, String scopeUsed) {}
public record TokenizeResponse(String token, String bin, String last4, CardNetwork network) {}
public record DetokenizeResponse(String pan, String expiryYYYYMM) {}
public final class TokenizeRequest {
    private final String pan, expiryYYYYMM, merchantId, idempotencyKey, ip;
    private final TokenStrategy strategy;

    private TokenizeRequest(Builder b) {
        this.pan = b.pan; this.expiryYYYYMM = b.expiry;
        this.merchantId = b.merchantId; this.idempotencyKey = b.idem;
        this.ip = b.ip; this.strategy = b.strategy;
    }

    public String pan() { return pan; }
    // Named expiryYYYYMM() to match the call in TokenizationService.tokenize
    public String expiryYYYYMM() { return expiryYYYYMM; }
    public String merchantId() { return merchantId; }
    public String idempotencyKey() { return idempotencyKey; }
    public String ip() { return ip; }
    public TokenStrategy strategy() { return strategy; }

    public static Builder builder() { return new Builder(); }

    public static final class Builder {
        private String pan, expiry, merchantId, idem, ip;
        private TokenStrategy strategy = TokenStrategy.RANDOM;   // default when the caller does not choose

        public Builder pan(String v) { this.pan = v; return this; }
        public Builder expiry(String v) { this.expiry = v; return this; }
        public Builder merchantId(String v) { this.merchantId = v; return this; }
        public Builder idempotencyKey(String v) { this.idem = v; return this; }
        public Builder ip(String v) { this.ip = v; return this; }
        public Builder strategy(TokenStrategy v) { this.strategy = v; return this; }
        public TokenizeRequest build() { return new TokenizeRequest(this); }
    }
}
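import java.security.SecureRandom;
import java.util.Arrays;

public final class HsmEncryptionService implements EncryptionService {
    private static final SecureRandom RNG = new SecureRandom();
    private static final int IV_BYTES = 12;                       // 96-bit GCM nonce

    private final HsmClient hsm;                                  // vendor SDK, thread-safe
    private final String currentKekId;

    public HsmEncryptionService(HsmClient hsm, String currentKekId) {
        this.hsm = hsm;
        this.currentKekId = currentKekId;
    }

    @Override
    public Ciphertext encrypt(byte[] plaintext) {
        byte[] dek = new byte[32]; RNG.nextBytes(dek);            // fresh DEK per record
        try {
            byte[] iv = new byte[IV_BYTES]; RNG.nextBytes(iv);
            byte[] cipher = AesGcm.encrypt(plaintext, dek, iv);   // local AES-256-GCM
            byte[] wrapped = hsm.wrap(currentKekId, dek);         // KEK never leaves HSM
            return new Ciphertext(concat(iv, cipher), wrapped, currentKekId);
        } finally {
            Arrays.fill(dek, (byte) 0);                           // best-effort zeroize, even if wrap throws
        }
    }

    @Override
    public byte[] decrypt(Ciphertext c) {
        byte[] dek = hsm.unwrap(c.kekId(), c.wrappedDek());
        try {
            byte[] iv = Arrays.copyOfRange(c.cipher(), 0, IV_BYTES);
            byte[] payload = Arrays.copyOfRange(c.cipher(), IV_BYTES, c.cipher().length);
            return AesGcm.decrypt(payload, dek, iv);
        } finally {
            Arrays.fill(dek, (byte) 0);
        }
    }

    // Stores the IV in front of the ciphertext so decrypt can split them again.
    private static byte[] concat(byte[] a, byte[] b) {
        byte[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }
}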
import java.nio.charset.StandardCharsets;
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

public final class TokenizationService {
    private final Map<TokenStrategy, TokenGenerator> generators;
    private final EncryptionService encryption;
    private final VaultRepository vault;
    private final MetadataRepository metadata;
    private final AuditPublisher audit;
    private final IdempotencyStore idem;
    private final Clock clock;

    public TokenizationService(Map<TokenStrategy, TokenGenerator> generators,
                               EncryptionService encryption, VaultRepository vault,
                               MetadataRepository metadata, AuditPublisher audit,
                               IdempotencyStore idem, Clock clock) {
        this.generators = Map.copyOf(generators);
        this.encryption = encryption;
        this.vault = vault;
        this.metadata = metadata;
        this.audit = audit;
        this.idem = idem;
        this.clock = clock;
    }

    public TokenizeResponse tokenize(TokenizeRequest req) {
        // 1) idempotency: a replayed key returns the original token
        // NOTE: steps 1 and 6 are get-then-put, so two concurrent requests with the same key
        // can both reach step 2; an atomic claim up front (e.g. Redis SET NX) closes that window.
        Optional<String> existing = idem.get(req.idempotencyKey());
        if (existing.isPresent()) {
            MetadataRecord m = metadata.findByToken(existing.get()).orElseThrow();
            return new TokenizeResponse(m.token(), m.bin(), m.last4(), m.network());
        }
        // 2) mint
        String token = generators.get(req.strategy())
                .generate(req.pan(), req.merchantId(), req.strategy());
        // 3) encrypt (throws if the HSM is unavailable, so nothing is stored unencrypted)
        Ciphertext payload = encryption.encrypt(req.pan().getBytes(StandardCharsets.UTF_8));
        // 4) two writes: encrypted PAN to the vault, non-sensitive fields to metadata
        Instant now = clock.instant();
        vault.save(new VaultRecord(token, req.merchantId(), payload, TokenStatus.ACTIVE, now, null));
        String bin = req.pan().substring(0, 6);
        String last4 = req.pan().substring(req.pan().length() - 4);
        CardNetwork network = NetworkResolver.fromBin(bin);
        metadata.save(new MetadataRecord(token, bin, last4, network,
                req.expiryYYYYMM(), req.merchantId(), TokenStatus.ACTIVE));
        // 5) audit (sync)
        audit.publish(new AuditEvent(UUID.randomUUID().toString(), now,
                req.merchantId(), "TOKENIZE", token, req.merchantId(), req.ip(), "token:create"));
        // 6) memoize idempotency
        idem.putIfAbsent(req.idempotencyKey(), token, Duration.ofHours(24));
        return new TokenizeResponse(token, bin, last4, network);
    }

    public DetokenizeResponse detokenize(String token, String actor, String ip) {
        VaultRecord r = vault.findByToken(token)
                .orElseThrow(() -> new NotFoundException(token));
        if (r.status() != TokenStatus.ACTIVE) throw new GoneException("token is " + r.status());
        byte[] pan = encryption.decrypt(r.payload());
        try {
            // Everything after decrypt sits inside the try so a failed audit or lookup still zeroizes.
            audit.publish(new AuditEvent(UUID.randomUUID().toString(), clock.instant(),
                    actor, "DETOKENIZE", token, r.merchantId(), ip, "pan:read"));
            MetadataRecord m = metadata.findByToken(token).orElseThrow();
            return new DetokenizeResponse(new String(pan, StandardCharsets.UTF_8), m.expiryYYYYMM());
        } finally {
            Arrays.fill(pan, (byte) 0);   // clears the byte[]; the response String still holds the PAN
        }
    }
}
TokenizationService is stateless once constructed; all collaborators are themselves thread-safe (HSM client connection pool, Redis client, JDBC pool, Kafka producer). The mutable secrets on the heap are the DEK and decrypted-PAN byte arrays, which we zeroize after use. That is best-effort, since the GC may already have copied them and the PAN also travels as an immutable String in the request and response; the HSM is the real boundary.
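The appendix calls `AesGcm.encrypt` and `AesGcm.decrypt` without showing them. One way that helper might look, using only `javax.crypto` from the JDK; the signatures are inferred from the call sites, and the 128-bit tag length and exception wrapping are assumptions.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the AES-256-GCM helper assumed by HsmEncryptionService.
final class AesGcm {
    private static final int TAG_BITS = 128; // assumed tag length

    private AesGcm() {}

    /** Returns the ciphertext with the 16-byte authentication tag appended. */
    static byte[] encrypt(byte[] plaintext, byte[] dek, byte[] iv) {
        return run(Cipher.ENCRYPT_MODE, plaintext, dek, iv);
    }

    /** Throws IllegalStateException if the key, IV, or ciphertext is wrong or was tampered with. */
    static byte[] decrypt(byte[] cipherAndTag, byte[] dek, byte[] iv) {
        return run(Cipher.DECRYPT_MODE, cipherAndTag, dek, iv);
    }

    private static byte[] run(int mode, byte[] input, byte[] dek, byte[] iv) {
        try {
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(mode, new SecretKeySpec(dek, "AES"), new GCMParameterSpec(TAG_BITS, iv));
            return c.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES-GCM operation failed", e);
        }
    }
}
```

Because GCM authenticates as well as encrypts, decrypting with anything other than the exact DEK fails outright instead of returning garbage, so a ciphertext whose DEK is gone is unreadable.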
The "right answer" in an interview isn't a design — it's recognising what you traded away. Walk through these out loud and you signal seniority.
| Decision | Alternative we rejected | Why |
|---|---|---|
| Random tokens with vault lookup | Format-Preserving Encryption (FPE) for stateless detokenize | Vault lookup is cheap with the cache; FPE makes revocation impossible at the per-token level and ties every detokenize to one master key's lifetime. |
| Separate Tokenize & Detokenize services | Single service handles both | Blast-radius reduction: a Tokenize-side RCE shouldn't be able to read existing PANs. Two services, two IAM roles, two network segments. |
| Envelope encryption (DEK per record + KEK in HSM) | Single global key for all PANs | Per-record DEKs let us rotate keys without re-encrypting every row immediately, and limit the blast radius of a memory-disclosure bug. |
| Sync Kafka audit write before response | Async fire-and-forget audit | PCI DSS Requirement 10 requires durable audit logs. We accept ~1–2 ms p99 cost for the audit write because compliance demands it; losing audit logs invalidates the entire system. |
| Separate Metadata DB (no PAN) | Single DB with PAN + metadata | Letting analytics query a non-PCI DB is an enormous win for the data team. The cost is dual-write complexity, which we solve with outbox reconciliation. |
| Shard by hash(token) | Shard by merchant_id | Hash-sharding gives even write distribution. Merchant sharding makes a huge merchant a hot shard. |
| Active-passive multi-region | Active-active writes | Cross-region idempotency on the same key is a correctness nightmare. We accept ~10 min failover RTO instead. |
| UUIDv7 (time-ordered) over UUIDv4 | UUIDv4 random | UUIDv7 keeps inserts append-only on the B-tree, dramatically improving index locality and write throughput. Still 74 random bits — unguessable. |
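The JDK has no built-in UUIDv7 generator, so the last row's "74 random bits" claim is worth seeing in code. A minimal sketch of the bit layout (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits); `UuidV7` is a hypothetical class, and it does not add the optional monotonic counter some implementations use within a millisecond.

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical UUIDv7 generator: time-ordered prefix plus 74 random bits.
final class UuidV7 {
    private static final SecureRandom RNG = new SecureRandom();

    private UuidV7() {}

    static UUID generate() {
        return generate(System.currentTimeMillis());
    }

    static UUID generate(long epochMillis) {
        byte[] r = new byte[10];
        RNG.nextBytes(r);

        long randA = ((r[0] & 0x0FL) << 8) | (r[1] & 0xFFL);    // 12 random bits
        long msb = ((epochMillis & 0xFFFFFFFFFFFFL) << 16)      // 48-bit timestamp
                 | 0x7000L                                      // version 7
                 | randA;

        long randB = 0;
        for (int i = 2; i < 10; i++) randB = (randB << 8) | (r[i] & 0xFFL);
        long lsb = (randB & 0x3FFFFFFFFFFFFFFFL)                // 62 random bits
                 | 0x8000000000000000L;                         // variant bits "10"
        return new UUID(msb, lsb);
    }
}
```

Because the timestamp occupies the leading hex digits, tokens minted later sort after earlier ones as strings, which is what keeps B-tree inserts clustered at the right-hand edge.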
Implement EncryptionService with AWS KMS, GCP Cloud KMS, or HashiCorp Vault; no change in TokenizationService.
Add an FpeTokenGenerator implementing TokenGenerator and register it in the strategy map. Existing code is untouched.
Write a VaultRepository implementation that calls Visa Token Service instead of writing to Postgres.
Wrap VaultRepository + MetadataRepository in routing decorators that pick the closest replica on reads.

Circuit breaker opens, service returns 503, retry budget on the merchant SDK kicks in (max 3 retries with backoff). Critically, we never store a PAN without HSM-backed encryption, even temporarily. Fail closed is the only safe option.
Activate the new KEK; all new tokens use it. A background worker re-wraps old DEKs with the new KEK in batches (say, 1M rows/hour) — token IDs do not change, only the wrapped DEK does. Old KEK kept readable until the last row rotates, then destroyed in the HSM.
"What if two clients call POST /tokens with the same idempotency key concurrently?"
The first call wins via Redis SET NX EX. The second call sees the existing key, reads the existing token from metadata, returns it. No double-tokenize, no race.
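The "first call wins" rule can be sketched with an atomic put-if-absent. Here an in-memory `ConcurrentHashMap` stands in for Redis, with `putIfAbsent` playing the role of `SET key value NX` (the `EX` TTL is omitted); `IdempotentMinter` is a hypothetical name, and claiming the key with a candidate token ID before any encryption or writes is one possible ordering, not necessarily the one this design uses.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical in-memory stand-in for the Redis idempotency store.
final class IdempotentMinter {
    private final ConcurrentHashMap<String, String> claims = new ConcurrentHashMap<>();

    /**
     * Binds the idempotency key to exactly one token. Every caller mints a cheap
     * candidate ID, but only the first putIfAbsent succeeds; everyone else gets
     * the winner's token back and discards their own candidate.
     */
    String tokenFor(String idempotencyKey, Supplier<String> mintCandidate) {
        String candidate = mintCandidate.get();
        String winner = claims.putIfAbsent(idempotencyKey, candidate); // atomic, like SET NX
        return winner == null ? candidate : winner;
    }
}
```

In a real flow a losing caller may still need to wait or retry briefly until the winner's vault and metadata rows are visible.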
"Crypto-shred": delete the wrapped DEK for that customer's rows. The ciphertext becomes mathematically unrecoverable without re-encryption — instant, irreversible deletion of the PAN without rewriting the row. Metadata can stay (or be tombstoned) for legitimate audit needs.
To keep the metadata table out of PCI scope. The moment metadata and PAN share a row (or even a database), analytics joining that table inherits PCI scope. Splitting them costs a dual-write and saves the entire data team from PCI hell.
Metadata read (cache hit): ~1 ms. Metadata read (cache miss): ~8 ms. Tokenize (HSM-bound): ~12–18 ms (HSM round-trip + 2× DB write). Detokenize (HSM-bound): ~10–15 ms. Tail is dominated by HSM RPC and Kafka acks=all.