How to Stop a Load Balancer From Being a Single Point of Failure

Companion read

This is the high-availability counterpart to the caching and backend-fundamentals deep dives. If you want the broader picture of how load balancers sit next to API gateways, DNS, and the rest of the request path, read Backend Fundamentals alongside this. Here we go deep on one question only: how do you stop the load balancer itself from being the thing that takes you down?

The Incident

The night the front door fell over

It's 2 a.m. and Priya, the on-call engineer, gets paged. Every dashboard is red. The API is down, the website returns nothing, the mobile app spins forever. She SSHes into a web server — it's healthy. CPU is idle, the database responds, the app logs show zero incoming requests. Ten perfectly healthy servers, sitting there, receiving no traffic at all.

The problem isn't the servers. It's the one box that sits in front of them — the load balancer. It crashed. And because every single request entered through that one box, when it died, all ten healthy servers became unreachable. The thing whose entire job was to protect the system from a single failure had become the single failure.

The cruel irony: a load balancer is the component you add specifically to eliminate single points of failure among your servers. Put one in front of ten servers and any server can die without users noticing. But if there's only one load balancer, you didn't remove the single point of failure — you just moved it from the servers to the load balancer. And now it's a single point that 100% of traffic depends on.

This article is the answer to Priya's 2 a.m. problem. We'll build up, one failure at a time, the set of techniques that make sure no single device — not a server, not a load balancer, not even a whole data center — can take the front door down.

Foundations

What a load balancer really is

Before we make it redundant, let's be crystal clear about what this box does — because the way you make it redundant depends entirely on how it works.

Picture a busy restaurant. The host at the door doesn't cook and doesn't serve. They do one thing: when a group walks in, they decide which table (which waiter) gets them, so no single waiter is swamped while another stands idle. A load balancer (LB) is that host. It sits in front of a pool of backend servers, and for every incoming request it picks one healthy server to handle it. Clients only ever talk to the LB; they never know how many servers are behind it, or which one served them.
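As a rough illustration of the host's job, here is a minimal sketch of round-robin selection that skips unhealthy servers. The class name, method names, and server labels are hypothetical, and real load balancers offer many other algorithms (least connections, weighted, hashing).

```python
from itertools import count


class RoundRobinBalancer:
    """Pick the next healthy backend in rotation (illustrative sketch only)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def mark_down(self, backend):
        # Take a backend out of rotation, e.g. after failed health checks.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Put a known backend back into rotation.
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none are left."""
        for _ in range(len(self.backends)):
            candidate = self.backends[next(self._counter) % len(self.backends)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["s1", "s2", "s3"])
lb.mark_down("s2")  # s2 crashed; clients never find out
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['s1', 's3', 's1', 's3']
```

Clients only ever see the balancer, so removing `s2` from rotation is invisible to them.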

The one address everyone points at

Here's the detail that makes the LB dangerous. Your domain — say api.shop.com — resolves in DNS to one IP address, and that IP belongs to the load balancer. This is called the VIP (Virtual IP — a single address that represents the whole service, not any one physical machine). Every phone, browser, and partner system on Earth sends its packets to that VIP. The LB receives them and fans them out to the servers.

So the LB isn't just a component in the path — it is the path. It's the funnel every byte squeezes through. That's exactly why losing it is catastrophic, and why a big chunk of this article is about making sure that VIP keeps answering even when the machine behind it dies.

L4 load balancer

Works at the transport layer (TCP/UDP). It forwards packets based on IP + port without reading the contents — fast, simple, protocol-agnostic. Think AWS NLB, or HAProxy in TCP mode. Failover here is mostly about keeping the VIP alive.

L7 load balancer

Works at the application layer (HTTP). It reads the request — URL path, headers, cookies — and can route /images to one pool and /api to another. Smarter, but heavier. Think AWS ALB, nginx, Envoy. More state to lose on failover (e.g. in-flight requests).
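To make the L7 difference concrete, here is a small sketch of path-based routing using a longest-prefix match. The pool names and the `pool_for_path` helper are assumptions for illustration; an L4 balancer could not make this decision because it never reads the URL.

```python
# Hypothetical pools; the names are placeholders, not real hosts.
ROUTES = {
    "/images": ["img-1", "img-2"],
    "/api": ["api-1", "api-2", "api-3"],
}
DEFAULT_POOL = ["web-1", "web-2"]


def pool_for_path(path, routes=ROUTES, default=DEFAULT_POOL):
    """Return the backend pool whose prefix is the longest match for `path`."""
    best = ""
    for prefix in routes:
        # Match "/api" and "/api/..." but not "/apiary".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return routes[best] if best else default


print(pool_for_path("/images/cat.png"))  # ['img-1', 'img-2']
print(pool_for_path("/checkout"))        # ['web-1', 'web-2']
```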

So what: a load balancer is a host-at-the-door for your servers, reachable at a single VIP that all clients target. Keeping the service alive means keeping that VIP answering — even when the box behind it falls over.

Pass 1 — The Naive Design

The naive design — and how it breaks

Let's draw the simplest thing that could plausibly work — the design Priya's team actually had. One load balancer, one pool of servers behind it.

flowchart LR
  C([Clients]) -->|all traffic| LB[Single Load Balancer<br/>VIP: 203.0.113.10]
  LB --> S1[Server 1]
  LB --> S2[Server 2]
  LB --> S3[Server 3]
  style LB fill:#e05252,stroke:#e05252,color:#fff
  style C fill:#4a90d9,stroke:#4a90d9,color:#fff
  style S1 fill:#38b265,stroke:#38b265,color:#fff
  style S2 fill:#38b265,stroke:#38b265,color:#fff
  style S3 fill:#38b265,stroke:#38b265,color:#fff

It looks fine. Add servers, the LB spreads load, any backend can die and users never notice. The servers are redundant. But look at that red box — there's exactly one of it, and every arrow passes through it. Here are the concrete ways it kills you:

① The box crashes

Kernel panic, out-of-memory, a hung process, a failed disk, someone trips over the power cable. The instant the LB process dies, the VIP stops answering. 100% of traffic blackholes — even though every backend is perfectly healthy. This is Priya's incident.

② The bad config push

An engineer pushes a broken nginx/HAProxy config and reloads. The single LB now rejects or misroutes everything. With one box there's no canary, no "roll it out to half the fleet" — the blast radius is the entire service, instantly.

③ It saturates

The LB is also a throughput bottleneck. A single box has a finite packet-per-second and bandwidth ceiling. A traffic spike (or a SYN flood) can max out its NIC or CPU, and now it's dropping connections for everyone — a brownout, not even a clean crash.

④ You can never patch it

Security patch for the OS? Upgrade HAProxy? Reboot the host? With one LB, every maintenance action is a planned outage. Teams that run a single LB end up never patching it out of fear — which is its own time bomb.

The pattern: every one of these failures maps to a fix we'll add. Crash & patching → a second LB (redundant pair). Config push → staged rollout across the pair. Saturation → active-active so two boxes share the load. And the question lurking underneath all of them → how do clients find the surviving box? That's the next section.

Pass 2 — The Mental Model

The mental model — who balances the balancers?

The obvious fix is "add a second load balancer." But that immediately raises the question that trips up most engineers: if clients send traffic to one VIP, and that VIP lives on one box, how does a second box help? When LB-A dies, the VIP dies with it — clients are still pointed at a dead address. You seem to have just created a recursive problem: now you need something to balance between the two load balancers… and won't that thing be a single point of failure too?

This is the key insight of the whole article, so let's name it plainly. The recursion does bottom out — not by adding infinitely many balancers, but by climbing to layers that are progressively dumber and more reliable, until you reach a layer that has no single device at all.

Layer A — the box

Two LB machines. The trick: don't tie the VIP permanently to one box. Make it a floating IP that can jump from the dead box to the survivor in seconds. The network layer below decides which box currently owns the VIP.

Layer B — the network

Routers and switches are inherently multi-path. Protocols like VRRP (for the floating IP) and ECMP/anycast (announce the same IP from many boxes) let the network fabric itself spread or redirect traffic — and the fabric has no single device.

Layer C — DNS

DNS can hand out multiple IPs for one name, health-check them, and stop returning dead ones. It's globally distributed by design (many servers, many copies), so there's no single DNS box to lose either.

The resolution: "who balances the balancers?" is answered by going up the stack to layers that are stateless, replicated, and have no single physical device — DNS resolvers, the BGP-routed internet, the L3 switching fabric. Each is dumb enough and distributed enough that there's nothing left to single-point-fail. You don't add a smarter box on top; you lean on infrastructure that's already redundant by nature.

With that settled, the rest is just choosing which of these layers to use, and how to wire them together. We'll do the two box-level patterns first, then the layers above.

Pattern 1

Active-Passive + a floating IP

This is the classic, battle-tested two-box pattern, and it's the easiest to reason about. You run two identical load balancers: call them LB-A (active) and LB-B (passive). Only one holds the VIP and serves traffic at any moment. The other sits quietly, fully configured, watching and waiting.

flowchart TB
  C([Clients]) -->|traffic to VIP 203.0.113.10| VIP{{Floating VIP}}
  VIP -.owned by.-> A[LB-A · ACTIVE<br/>holds the VIP]
  VIP -.standby.-> B[LB-B · PASSIVE<br/>ready to take over]
  A <-->|VRRP heartbeat every 1s| B
  A --> P[Server Pool]
  B -.-> P
  style A fill:#38b265,stroke:#38b265,color:#fff
  style B fill:#7b8599,stroke:#7b8599,color:#fff
  style VIP fill:#e8743b,stroke:#e8743b,color:#fff
  style C fill:#4a90d9,stroke:#4a90d9,color:#fff
  style P fill:#171d27,stroke:#232b38,color:#d4dae5

How the VIP "floats"

The magic is a protocol called VRRP (Virtual Router Redundancy Protocol — a standard way for two machines to share one IP and agree on who owns it), usually run by a daemon called keepalived. Here's the dance, told plainly:

Both boxes are configured with the same VIP. Right now LB-A "owns" it — its network card answers for 203.0.113.10.
LB-A shouts a tiny heartbeat onto the local network every second: "I'm alive, I'm the master." LB-B listens.
The moment LB-B misses ~3 heartbeats in a row (≈3 seconds), it concludes LB-A is dead and promotes itself to master.
LB-B then sends a gratuitous ARP — a broadcast that essentially says "Hey everyone, the VIP 203.0.113.10 now lives at MY hardware address." The local switch updates its table, and the very next packet for the VIP is delivered to LB-B instead of the corpse of LB-A.

From the client's perspective, nothing changed. They're still sending to 203.0.113.10. The IP didn't move — only the machine answering for it did, and it happened in a few seconds. That's why it's called a floating IP: it floats from the dead box to the live one.
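The standby's side of that dance can be sketched as a tiny state machine. This is a simplified model, not keepalived's implementation: the class and method names are hypothetical, priorities and preemption are ignored, and real VRRP adds a small skew time on top of the three missed advertisements.

```python
class StandbyMonitor:
    """Backup-side view of a VRRP-style pair (simplified sketch)."""

    def __init__(self, advert_interval=1.0, missed_limit=3):
        # Declare the master dead after this many seconds of silence.
        self.dead_after = advert_interval * missed_limit
        self.last_heartbeat = 0.0
        self.state = "BACKUP"

    def on_heartbeat(self, now):
        """Record an advertisement received from the current master."""
        self.last_heartbeat = now

    def tick(self, now):
        """Promote to MASTER once the master has been silent too long.

        Returns the action the caller should take, or None.
        """
        silent_for = now - self.last_heartbeat
        if self.state == "BACKUP" and silent_for >= self.dead_after:
            self.state = "MASTER"
            return "send_gratuitous_arp"
        return None


standby = StandbyMonitor()
standby.on_heartbeat(now=10.0)  # last sign of life from LB-A
actions = [standby.tick(now=t) for t in (11.0, 12.0, 13.0, 14.0)]
print(actions)  # [None, None, 'send_gratuitous_arp', None]
```

The takeover fires exactly once, three seconds after the last heartbeat, which is where the "~3s" failover figure comes from.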

Before vs after

Before (single LB): box dies → VIP is dead → 100% outage until a human rebuilds it (minutes to hours).

After (active-passive): box dies → standby grabs the VIP in ~3s via gratuitous ARP → a brief blip, then service resumes. No human in the loop.

The honest trade-off

Half your LB capacity sits idle as a hot spare — you pay for two, use one. And it does nothing for the saturation problem: a spike that overwhelms one box overwhelms the standby too, because only one serves at a time. For that, you want active-active.

So what: active-passive turns "the LB died, call a human" into "the LB died, the standby grabbed the address in 3 seconds." It's the cheapest, simplest way to kill the SPOF — and for many systems it's all you need.

Pattern 2

Active-Active — both boxes earn their keep

Active-passive wastes half your hardware and doesn't help with overload. Active-active fixes both: both load balancers serve traffic simultaneously, so you get double the capacity and redundancy. The catch — and it's the whole game — is that now clients must somehow spread their traffic across two different boxes, and keep working when either one dies. There are three ways to do that, at three different layers.

Option A — DNS hands out both IPs

Give api.shop.com two A-records, one per LB. DNS round-robin sends roughly half of clients to each. If LB-A dies, a health-checked DNS service (like Route 53 with health checks) notices and stops handing out LB-A's IP. Simple, works anywhere — but slow to fail over, because clients and resolvers cache DNS answers for the TTL (Time To Live). Even with a 60-second TTL, some clients keep hammering the dead IP for a minute or more. Good for coarse, cross-region steering; too sluggish to be your only defense.
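A back-of-the-envelope way to see why DNS failover is slow: the worst case is roughly the time to detect the failure plus one full TTL. The function below is a sketch under that assumption; the probe interval and failure threshold are made-up example values, and it ignores resolvers or clients that cache longer than the TTL.

```python
def worst_case_dns_failover_s(check_interval_s, failure_threshold, ttl_s):
    """Estimate the longest time a client may keep using a dead IP.

    detection: consecutive failed probes needed before the record is pulled.
    ttl_s:     a client may have cached the answer just before it was pulled.
    """
    detection = check_interval_s * failure_threshold
    return detection + ttl_s


# Example values (assumed): probe every 10s, 3 failures to mark down, 60s TTL.
print(worst_case_dns_failover_s(10, 3, 60))  # 90
```

Lowering the TTL shrinks only one term; the detection time is set by the health-check configuration.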

Option B — ECMP at the router (same IP on both boxes)

Here's the elegant one. Both LBs are configured with the same VIP, and they each "announce" that IP to the upstream router using a routing protocol (BGP/OSPF). The router sees two equal-cost paths to the same destination and uses ECMP (Equal-Cost Multi-Path: the router hashes each connection's 5-tuple and pins it to one of the available paths) to spread connections across both boxes. When a box dies, it stops announcing the route, the router drops that path within a second or two (given fast failure detection such as BFD or tightly tuned timers; default BGP hold timers are far longer), and all traffic flows to the survivor. This is how serious on-prem and cloud-edge load balancing works: fast failover, full capacity, no DNS caching headaches.
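The 5-tuple hashing can be sketched in a few lines. This is only an illustration: real routers use hardware hash functions rather than SHA-256, and the `ecmp_pick` helper and the example addresses are assumptions.

```python
import hashlib


def ecmp_pick(flow, paths):
    """Hash a connection's 5-tuple and pin it to one of the live paths."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return paths[int.from_bytes(digest[:8], "big") % len(paths)]


# (source IP, source port, destination IP, destination port, protocol)
flow = ("198.51.100.7", 51514, "203.0.113.10", 443, "tcp")

both_alive = ["LB-A", "LB-B"]
first = ecmp_pick(flow, both_alive)
second = ecmp_pick(flow, both_alive)  # same flow, same path every time

survivor_only = ecmp_pick(flow, ["LB-B"])  # LB-A withdrew its route
```

Because the hash depends only on the 5-tuple, every packet of a connection lands on the same box. Note that simple modulo hashing can move existing flows when the path count changes, which is why some routers offer resilient or consistent hashing.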

flowchart TB
  C([Clients]) --> R[Router / L3 Switch<br/>ECMP across equal-cost paths]
  R -->|path 1| A[LB-A · ACTIVE<br/>announces VIP via BGP]
  R -->|path 2| B[LB-B · ACTIVE<br/>announces VIP via BGP]
  A --> P[Server Pool]
  B --> P
  style A fill:#3cbfbf,stroke:#3cbfbf,color:#fff
  style B fill:#3cbfbf,stroke:#3cbfbf,color:#fff
  style R fill:#e8743b,stroke:#e8743b,color:#fff
  style C fill:#4a90d9,stroke:#4a90d9,color:#fff
  style P fill:#171d27,stroke:#232b38,color:#d4dae5

Option C — Anycast across the whole internet

Take Option B and stretch it across the planet. With anycast, the same VIP is announced via BGP from data centers in dozens of cities. The internet's own routing delivers each user to the nearest location announcing that IP, where "nearest" means the best BGP path, which usually but not always tracks geography. London users typically hit London; Tokyo users typically hit Tokyo. If an entire location goes dark, its BGP announcement is withdrawn and the internet re-routes those users to the next-closest site automatically, usually within seconds. This is how Cloudflare, Google, and other large CDNs make a single IP resilient to entire data centers vanishing.
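A toy model of that behavior: each user has a routing "cost" to every site, and traffic goes to the cheapest site still announcing the VIP. The cost numbers and the `anycast_route` helper are invented for illustration; real BGP path selection is driven by AS-path length and operator policy, not a single latency-like number.

```python
def anycast_route(user_costs, announcing):
    """Pick the lowest-cost site among those still announcing the VIP."""
    candidates = {
        site: cost for site, cost in user_costs.items() if site in announcing
    }
    if not candidates:
        return None  # nobody announces the VIP: the address is unreachable
    return min(candidates, key=candidates.get)


# Hypothetical routing costs as seen from a user in London.
london_user = {"london": 5, "frankfurt": 20, "tokyo": 240}

normal = anycast_route(london_user, {"london", "frankfurt", "tokyo"})
after_outage = anycast_route(london_user, {"frankfurt", "tokyo"})
print(normal, after_outage)  # london frankfurt
```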

Mechanism	Layer	Failover speed	Best for
DNS round-robin + health checks	DNS (L7-ish)	Slow (TTL-bound, 30s–mins)	Coarse cross-region steering, last-resort failover
ECMP + BGP/OSPF	Network (L3)	Fast (~1–2s)	Active-active LBs in one DC / cloud edge
Anycast + BGP	Internet (L3)	Fast (seconds)	Global, multi-DC, DDoS-resilient front door
VRRP floating IP	LAN (L2)	Fast (~3s)	Active-passive pair in one rack/subnet

So what: active-active gives you capacity and redundancy, but you pay for it in complexity — you need a layer that can spread traffic across boxes and react when one dies. ECMP/anycast (fast, network-layer) and DNS (slow, but works everywhere) are the two halves of that answer, and real systems use both together.

The Layer Above

The DNS & anycast layer — and its own SPOF

Sharp readers are already uneasy. We keep saying "DNS notices the dead box and stops handing out its IP." But that just pushes the question up a level: isn't DNS itself a single point of failure? If your DNS provider goes down, nobody can resolve api.shop.com at all — and it doesn't matter how many load balancers you have behind it. (This is exactly what happened in the famous 2016 Dyn outage that took down Twitter, Reddit, and Spotify for hours.)

So how does DNS avoid being a SPOF? By being distributed and redundant by design, in three ways worth understanding:

Many nameservers

Your domain lists multiple authoritative nameservers (often 4+), ideally across two independent DNS providers. A resolver that can't reach one simply tries the next. No single nameserver is load-bearing.

Anycast for DNS too

With most managed DNS providers, each nameserver IP is itself anycast, announced from many locations. The query reaches the nearest copy. Losing a location just shifts queries elsewhere. DNS uses the same trick we use for the LBs.

Caching everywhere

Answers are cached at every resolver for the TTL. So even a total authoritative-DNS outage doesn't instantly break everyone — cached records keep resolving until they expire. Caching is a liability for failover speed but an asset for availability.

GSLB — the smart DNS that health-checks your LBs

The piece that ties DNS to your load balancers is GSLB (Global Server Load Balancing — a DNS service that actively probes each of your endpoints and only returns IPs that are currently healthy). Instead of blindly round-robining two A-records, GSLB continuously pings each LB/region. When LB-A in London fails its health check, GSLB simply stops including London's IP in the answers it gives out. New clients get only healthy IPs. Route 53's failover/latency routing, NS1, Akamai GTM, and F5 GSLB all do this.
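The core of that behavior is a filter over the endpoint list. The sketch below is not any vendor's API: `gslb_answer` and the example IPs are assumptions, and the "fail open" fallback (return everything if every probe fails) is a design choice made here on the theory that a broken checker is more likely than every region dying at once.

```python
def gslb_answer(endpoints, is_healthy, max_records=2):
    """Return the IPs to hand out for one DNS query."""
    healthy = [ip for ip in endpoints if is_healthy(ip)]
    # Fail open: an empty answer would guarantee an outage, so fall back
    # to the full list if nothing passes its health check.
    return (healthy or list(endpoints))[:max_records]


endpoints = ["203.0.113.10", "198.51.100.20"]  # e.g. London, then Virginia
probe_results = {"203.0.113.10": False, "198.51.100.20": True}

answer = gslb_answer(endpoints, probe_results.get)
print(answer)  # ['198.51.100.20']
```

Clients that already cached the London IP keep using it until their TTL expires; only new lookups benefit immediately.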

So what: you defeat the DNS SPOF the same way you defeat the LB SPOF — redundancy and the absence of any single device. Use 4+ anycast nameservers across two providers, keep TTLs low enough for failover but not so low you melt your DNS, and let GSLB steer clients away from dead regions. DNS becomes a coarse, slow layer of failover sitting on top of the fast network-layer failover below it.

Pass 3 — The Production Shape

The production shape — layered redundancy

Now we assemble everything into the design a real, serious service runs. The principle is defense in depth: no single layer is trusted to be the only thing standing between users and an outage. Each layer below catches what the layer above missed — fast network failover for box deaths, slower DNS failover for whole-region deaths. Every box in this diagram is numbered; the matching card below explains what it does and, more importantly, what breaks without it.

flowchart TB
  U([① Users worldwide])
  DNS[② GSLB / Health-checked DNS<br/>multi-provider · anycast NS]
  U --> DNS
  DNS -->|nearest healthy region| AN{{③ Anycast VIP}}
  subgraph R1 [Region: US-East]
    direction TB
    AN -->|BGP/ECMP| L1A[④ LB-A · active]
    AN -->|BGP/ECMP| L1B[⑤ LB-B · active]
    L1A --> H1[⑥ Health Checker]
    L1B --> H1
    L1A --> P1[⑦ Server Pool · AZ-1]
    L1B --> P2[⑧ Server Pool · AZ-2]
  end
  subgraph R2 [Region: EU-West]
    direction TB
    AN -.failover.-> L2[⑨ LB pair · active]
    L2 --> P3[⑩ Server Pool]
  end
  style U fill:#4a90d9,stroke:#4a90d9,color:#fff
  style DNS fill:#d4a838,stroke:#d4a838,color:#fff
  style AN fill:#e8743b,stroke:#e8743b,color:#fff
  style L1A fill:#3cbfbf,stroke:#3cbfbf,color:#fff
  style L1B fill:#3cbfbf,stroke:#3cbfbf,color:#fff
  style H1 fill:#9b72cf,stroke:#9b72cf,color:#fff
  style P1 fill:#38b265,stroke:#38b265,color:#fff
  style P2 fill:#38b265,stroke:#38b265,color:#fff
  style L2 fill:#3cbfbf,stroke:#3cbfbf,color:#fff
  style P3 fill:#38b265,stroke:#38b265,color:#fff

Component-by-component — what each numbered box does

Use the numbers in the diagram above to find the matching card below. Each card answers: what is it, why it exists, and what would break if you deleted it tomorrow.

① Users worldwide

Real people and apps — Priya's customers — scattered across continents, all typing api.shop.com. They have no idea any of the machinery below exists, and that's the point: every layer's job is to keep that name working no matter what dies.

② GSLB / Health-checked DNS

The first decision point. It resolves the name to an IP, health-checks every region, and returns only healthy ones — steering users to the nearest live region. Solves: whole-region failure and geographic latency. Without it, a dead region keeps receiving traffic until someone manually edits DNS.

③ Anycast VIP

A single IP announced via BGP from every region. The internet routes each user to the nearest site announcing it. Solves: the "clients are pinned to one address" problem — when a site withdraws the announcement, the internet re-routes in seconds, faster than DNS could.

④⑤ LB-A & LB-B (active-active)

Two load balancers in the region, both serving, both announcing the VIP via BGP/ECMP. Solves: the original SPOF and capacity. Either box can die and the router drops its path within ~2s; the survivor carries the load. Patching is now rolling, not an outage.

⑥ Health Checker

The subsystem that constantly probes backends and LBs, marking dead ones out of rotation. Solves: silent failure — without it, an LB happily forwards traffic to a crashed server, or a dead LB keeps announcing its VIP. It's the nervous system that makes every other failover trigger.

⑦⑧ Server pools across AZs

Backends split across two Availability Zones (AZs — physically separate data centers in the same region, with independent power and network). Solves: correlated failure. If all servers shared one rack or one AZ, a power event takes them all out at once. Spreading them means one AZ can burn and the service survives.

⑨ Standby region (EU-West)

A second full stack — its own LB pair and servers — in another region. Solves: the entire-region-down scenario (a regional cloud outage, a fiber cut, a config blast radius). GSLB (②) fails users over to it. Without it, a regional outage is a full outage.

⑩ Standby server pool

The backends behind the standby region's LBs, kept warm and in sync. Solves: the "we failed over DNS but there's nothing to fail over to" trap. A standby region with no running capacity is just a slower outage.

The layering, summarized: a single server dies → the LB routes around it. A single LB dies → ECMP/anycast routes around it in ~2s. An entire AZ dies → the other AZ absorbs it. An entire region dies → GSLB fails users over to the standby region in tens of seconds. At no layer is there a single device whose death takes everything down. That's what "no single point of failure" actually looks like when you draw it.
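Every one of those failovers is triggered by a health verdict, so it is worth seeing how the Health Checker (⑥) might avoid flapping on a single lost probe. The sketch below uses consecutive-failure and consecutive-success thresholds; the `fall` and `rise` names echo a common convention in load-balancer configs, but the class itself is hypothetical.

```python
class TargetHealth:
    """Track one target's state with fall/rise thresholds (sketch)."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall      # consecutive failures before marking down
        self.rise = rise      # consecutive successes before marking up
        self.healthy = True
        self._streak = 0      # consecutive probes disagreeing with state

    def record(self, probe_ok):
        """Feed in one probe result and return the current verdict."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state
            return self.healthy
        self._streak += 1
        limit = self.rise if probe_ok else self.fall
        if self._streak >= limit:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


target = TargetHealth()
probes = [True, False, False, True, False, False, False, True, True]
verdicts = [target.record(p) for p in probes]
print(verdicts)
# [True, True, True, True, True, True, False, False, True]
```

Two isolated failures do not remove the target; three in a row do, and it takes two clean probes to bring it back.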

Walkthrough

A failover, traced step by step

Let's make it concrete. It's 2 a.m. again — but this time the team has the layered design above. LB-A in US-East crashes hard. Follow what happens to Priya's customer, Sam, who's mid-checkout in New York. Watch how many components quietly do their job before anyone even gets paged.

sequenceDiagram
  actor Sam as Sam (New York)
  participant Net as Internet / BGP
  participant LBA as LB-A (US-East)
  participant LBB as LB-B (US-East)
  participant HC as Health Checker
  participant Pool as Server Pool
  Note over LBA: 💥 LB-A crashes at 02:00:00
  Sam->>Net: POST /checkout → anycast VIP
  Net->>LBA: route to LB-A (nearest path)
  Note over Net,LBA: first packets blackhole (~1-2s)
  LBA--xNet: stops announcing BGP route
  HC->>LBA: health probe, no response (3x)
  Note over HC: marks LB-A DEAD at 02:00:02
  Net->>Net: withdraw LB-A path, keep LB-B
  Sam->>Net: TCP retransmit (automatic)
  Net->>LBB: re-route to LB-B
  LBB->>Pool: forward to healthy server
  Pool-->>LBB: 200 OK
  LBB-->>Sam: checkout succeeds ✓
  Note over Sam: saw a ~2s delay, not an error

Here's the same story in words, because the timing is the whole point:

02:00:00 — LB-A dies. Sam's checkout request, already in flight, hits the void. His first packets are lost.
02:00:00–02:00:02 — Two things happen in parallel. LB-A's BGP session drops (a dead box can't keep announcing), so the router/internet stops sending it new traffic. Meanwhile the Health Checker notices LB-A missed its probes and marks it dead.
02:00:02 — The path to LB-A is withdrawn. Packets for the anycast VIP now route only to LB-B.
02:00:02+ — Sam's device does what TCP always does when packets are lost: it retransmits. That retransmit follows the new path to LB-B, which forwards it to a healthy server. Checkout completes.
Sam's experience: a 2-second pause on the checkout button — annoying, instantly forgotten. Not an error page. Not a lost order.

So what: notice that no human did anything during the failover. The network layer (BGP), the health checker, and TCP's own retransmit conspired to heal the path in ~2 seconds. Priya still gets paged — "LB-A is down, go investigate" — but it's a daytime-calm page about a degraded-but-working system, not a 2 a.m. total-outage page. That difference is the entire payoff.

The Gotchas

Health checks & the split-brain trap

Redundancy isn't free magic — it introduces its own failure modes. Two boxes that are supposed to coordinate can miscoordinate. Here are the traps that turn a redundant pair into a new kind of outage, and how to avoid them.

Split-brain — when both boxes think they're the master

Recall the active-passive pair: the standby promotes itself when it stops hearing the active's heartbeat. But what if the heartbeat link between them breaks, while both boxes are still alive and reachable by clients? The standby assumes the active died and grabs the VIP. Now two boxes claim the same IP — they fight over it, ARP tables flap, connections break randomly. This is split-brain, and it can be worse than a clean outage because the failure is intermittent and maddening to debug.

Fix: multiple heartbeat paths

Don't let a single cable be the thing that decides life-or-death. Run VRRP heartbeats over two independent links (e.g. the data network and a dedicated crossover). Both must fail before the standby promotes — making a false "the master is dead" verdict far less likely.

Fix: fencing / quorum

Use a third witness (a quorum node, or the cloud's API) to break ties: a box only becomes master if it can confirm with the witness that the other is truly gone. "I can't hear my peer" should never, by itself, authorize a takeover.
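Both fixes combine into a single promotion rule, sketched below. The function and argument names are hypothetical; how the witness is queried (a quorum node, a cloud API call) is left out.

```python
def should_promote(peer_seen_on_links, witness_says_peer_down):
    """Decide whether the standby may take over the VIP.

    peer_seen_on_links: one bool per heartbeat link, True if the peer's
        heartbeat still arrives on that link.
    witness_says_peer_down: True, False, or None if the witness itself
        cannot be reached.
    """
    if any(peer_seen_on_links):
        return False  # the peer is alive on at least one path
    # Silence alone is not proof of death: require explicit confirmation.
    return witness_says_peer_down is True


print(should_promote([False, True], True))    # False: one link still works
print(should_promote([False, False], None))   # False: witness unreachable
print(should_promote([False, False], True))   # True: confirmed dead
```

Refusing to promote when the witness is unreachable trades a little availability for safety: a standby that is itself cut off from the network stays passive instead of fighting for the VIP.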

Health checks: shallow vs deep

A health check is only as good as what it actually tests. The classic mistake is a shallow check — "does the server's port 80 accept a TCP connection?" The port can be open while the app is broken: database disconnected, dependency down, returning 500s to every real request. The LB sees "healthy" and keeps sending users to a server that errors on everything.

Rule: use a deep health check — a dedicated /healthz endpoint that actually exercises the critical path (can it reach the DB? the cache?) and returns 200 only if the server can truly serve. But don't over-couple it: if /healthz fails the instant a shared dependency hiccups, every server fails its check at once and the LB pulls the entire pool out of rotation — turning a minor blip into a total outage. Check what this server needs, not the whole world's health.
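One way to read that rule is to split checks into two groups: those that decide whether this server can serve at all, and shared dependencies that are reported but never fail the check. The sketch below follows that interpretation; the function names and check labels are hypothetical, and which dependencies count as critical is a judgment call for each service.

```python
def _safe(check):
    """Run a check callable, treating any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def healthz(critical_checks, shared_checks):
    """Return (status_code, report) for a /healthz request.

    critical_checks: failures here mean this server cannot serve (503).
    shared_checks:   reported as degraded, but never fail the check, so one
                     shared hiccup cannot empty the whole pool at once.
    """
    report = {}
    status = 200
    for name, check in critical_checks.items():
        ok = _safe(check)
        report[name] = "ok" if ok else "failing"
        if not ok:
            status = 503
    for name, check in shared_checks.items():
        report[name] = "ok" if _safe(check) else "degraded"
    return status, report


def broken_dependency():
    raise TimeoutError("recommendations service timed out")


status, report = healthz(
    critical_checks={"app_ready": lambda: True, "db_pool_has_conn": lambda: True},
    shared_checks={"recommendations": broken_dependency},
)
print(status, report["recommendations"])  # 200 degraded
```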

Don't forget the things behind the LB

Making the LB redundant is pointless if everything it points to shares a hidden SPOF. Common ones to hunt down:

Single AZ: two LBs and ten servers, all in one data center / availability zone. One power event = total outage. Spread across AZs.
Single database: the most common "real" SPOF hiding behind a beautifully redundant LB tier. Replicate it, with automated failover.
Shared config / control plane: if both LBs pull config from one source and that source serves a bad config, both die together. Stage rollouts; validate before reload.
Sticky sessions: if the LB pins users to a specific server via in-memory sessions, losing that server logs everyone out. Store session state externally (Redis) so any server — and any LB — can serve any user.
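For the shared-config item above, "stage rollouts; validate before reload" can be sketched as a loop that stops at the first unhealthy box. The callables are placeholders: in practice `validate` might wrap a syntax check such as `haproxy -c -f <file>` or `nginx -t`, and `apply` and `is_healthy` would call your own deployment and monitoring tooling.

```python
def staged_rollout(boxes, validate, apply, is_healthy):
    """Apply a config one box at a time, halting at the first failure.

    Returns the list of boxes that received the new config.
    """
    if not validate():
        return []  # never ship a config that fails its own syntax check
    updated = []
    for box in boxes:
        apply(box)
        updated.append(box)
        if not is_healthy(box):
            break  # leave the remaining boxes on the known-good config
    return updated


applied = []
touched = staged_rollout(
    boxes=["LB-A", "LB-B"],
    validate=lambda: True,
    apply=applied.append,
    is_healthy=lambda box: box != "LB-A",  # pretend LB-A breaks after reload
)
print(touched)  # ['LB-A']
```

LB-B never receives the bad config, so the pair degrades to one healthy box instead of zero.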

The Cloud

How the cloud hides all of this

Here's the good news for most engineers in 2026: if you use a managed cloud load balancer — AWS ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer — you get most of this for free, because the cloud provider already solved it at massive scale. But you have to understand what's handled and what's still your job, or you'll build a SPOF on top of a non-SPOF.

What the cloud LB handles for you

It's not a single box. An AWS ALB is a fleet of nodes that auto-scales with traffic — there's no one machine to crash. You get a DNS name, not a fixed IP, precisely because the IPs change as the fleet scales.
Built-in health checks and automatic removal of unhealthy targets.
The fleet spans the AZs you enable — kill an AZ, the LB keeps serving from the others.
GCP goes further: its global LB uses a single anycast IP across the planet, so the cross-region steering is built in.

What is still YOUR job

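Enable multiple AZs, and put targets in each. A load balancer or target group confined to one AZ is back to being a SPOF (an NLB can be enabled in a single AZ; an ALB requires at least two, but its targets can still all live in one). This is a configuration choice people forget.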
Cross-region failover. AWS ALB/NLB are regional. Surviving a whole-region outage needs Route 53 health-checked failover (or Global Accelerator) in front of LBs in two regions.
Backends & database redundancy. The cloud makes the LB tier robust; your data tier is still on you (Multi-AZ RDS, read replicas, etc.).
Health check depth & sticky-session pitfalls: the same rules from "Health checks & the split-brain trap" apply.

So what: the cloud turns "build an HA load-balancer pair with keepalived and VRRP" into "tick the multi-AZ box and put Route 53 failover in front for multi-region." The hard L2/L3 plumbing is hidden — but the architecture decisions (which AZs, how many regions, how deep the health check, where session state lives) are still yours to get right.

Wrap-Up

Checklist & interview Q&A

The on-call checklist

Never run one load balancer. Minimum is a redundant pair — active-passive (VRRP/keepalived + floating IP) or active-active (ECMP/anycast).
Make the VIP float, so it survives a box death without a human or a DNS change.
Spread LBs and backends across AZs — redundancy in one data center dies with that data center.
Use deep health checks (/healthz that tests the real path), but scope them so one shared dependency can't fail the whole pool at once.
Layer your failover: fast network-layer (ECMP/anycast, ~2s) for box deaths, slower DNS/GSLB (tens of seconds) for region deaths.
Kill the DNS SPOF with 4+ anycast nameservers across two providers; keep TTLs low enough to fail over.
Guard against split-brain with redundant heartbeat links and a quorum/fencing witness.
Externalize session state (Redis) so any LB and any server can serve any user.
Hunt the hidden SPOF behind the LB — usually the database or a shared control plane.
On managed clouds: enable multi-AZ, add cross-region failover, and remember the LB tier being safe doesn't make your data tier safe.

Interview Q&A

If a load balancer removes single points of failure, how can it be one itself?

Because it relocates the SPOF rather than eliminating it. Ten servers behind one LB means any server can die safely — but now 100% of traffic depends on that one LB. You've made the servers redundant and the LB single. The fix is to make the LB tier redundant too: a pair (or fleet) with a floating/anycast VIP so no single LB box is in the critical path.

Active-passive vs active-active — when do you pick which?

Active-passive (one serves, one hot-standby, VRRP floating IP): simplest, great when one box has enough capacity and you just want failover. Downside: half the hardware idles, and it doesn't help with overload. Active-active (both serve via ECMP/anycast): full capacity + redundancy, faster failover, but more complex and needs a network layer that spreads traffic. Pick active-passive for simplicity, active-active when you need the capacity or the faster failover.

Two LBs share one VIP — how does that even work, and how does failover happen?

Two mechanisms depending on layer. On a LAN: VRRP. Both boxes know the VIP and one "owns" it; on failure the standby sends a gratuitous ARP announcing that the VIP now maps to its own MAC address, and the switch redirects packets (~3s). At L3: both boxes announce the same IP via BGP and the router uses ECMP to spread connections; a dead box stops announcing and its path is withdrawn (~2s). Clients never change the address they target.

Isn't DNS the real single point of failure then?

It would be, if you ran one nameserver. You defeat it the same way as the LB: redundancy and no single device. Use 4+ authoritative nameservers across two independent providers, each on an anycast IP, plus the natural caching at resolvers that keeps records resolving through a brief authoritative outage. GSLB adds health-checked DNS so dead regions are dropped from answers. The 2016 Dyn outage is the cautionary tale for relying on a single DNS provider.

What's split-brain and how do you prevent it?

When the heartbeat link between an active-passive pair breaks but both boxes are alive, the standby wrongly promotes itself and both claim the VIP — they fight, connections flap. Prevent it with multiple independent heartbeat paths (both must fail to trigger promotion) and a quorum/fencing witness so "I can't hear my peer" never alone authorizes a takeover.

You've made the LB redundant. What's the next SPOF you'd look for?

Usually the database — a gorgeous redundant LB tier in front of a single primary DB is the most common real-world SPOF. After that: a single AZ hosting everything, a shared config/control plane that can push one bad config to both LBs, and in-memory sticky sessions that tie users to one server. Redundancy is only as strong as the least-redundant thing in the request path.

On AWS, is an ALB a single point of failure?

No — an ALB is a horizontally-scaled fleet, not one box, which is why you get a DNS name instead of a fixed IP. But it's regional, and it's only as redundant as the AZs you enable. So: enable multiple AZs (or it's a SPOF again), and for whole-region resilience put Route 53 health-checked failover (or Global Accelerator's anycast) in front of ALBs in two regions.
