A load balancer exists to remove single points of failure — so what happens when it becomes one?
This is the high-availability counterpart to the caching and backend-fundamentals deep dives. If you want the broader picture of how load balancers sit next to API gateways, DNS, and the rest of the request path, read Backend Fundamentals alongside this. Here we go deep on one question only: how do you stop the load balancer itself from being the thing that takes you down?
It's 2 a.m. and Priya, the on-call engineer, gets paged. Every dashboard is red. The API is down, the website returns nothing, the mobile app spins forever. She SSHes into a web server — it's healthy. CPU is idle, the database responds, the app logs show zero incoming requests. Ten perfectly healthy servers, sitting there, receiving no traffic at all.
The problem isn't the servers. It's the one box that sits in front of them — the load balancer. It crashed. And because every single request entered through that one box, when it died, all ten healthy servers became unreachable. The thing whose entire job was to protect the system from a single failure had become the single failure.
This article is the answer to Priya's 2 a.m. problem. We'll build up, one failure at a time, the set of techniques that make sure no single device — not a server, not a load balancer, not even a whole data center — can take the front door down.
Before we make it redundant, let's be crystal clear about what this box does — because the way you make it redundant depends entirely on how it works.
Picture a busy restaurant. The host at the door doesn't cook and doesn't serve. They do one thing: when a group walks in, they decide which table (which waiter) gets them, so no single waiter is swamped while another stands idle. A load balancer (LB) is that host. It sits in front of a pool of backend servers, and for every incoming request it picks one healthy server to handle it. Clients only ever talk to the LB; they never know how many servers are behind it, or which one served them.
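To make the host analogy concrete, here is a minimal sketch of the picking logic, assuming plain round-robin over whichever backends are currently marked healthy. The `LoadBalancer` class, its method names, and the `10.0.0.x` addresses are hypothetical, invented for illustration; real balancers offer several algorithms and do far more bookkeeping.

```python
from itertools import count


class LoadBalancer:
    """Toy round-robin balancer: each request goes to the next healthy backend."""

    def __init__(self, backends):
        self.backends = list(backends)     # the pool, in a fixed order
        self.healthy = set(self.backends)  # assume everything starts healthy
        self._counter = count()            # ever-increasing request counter

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none are left."""
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return candidates[next(self._counter) % len(candidates)]


lb = LoadBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
first_three = [lb.pick() for _ in range(3)]     # one request to each backend
lb.mark_down("10.0.0.2")                        # a backend dies
after_failure = [lb.pick() for _ in range(4)]   # traffic skips the dead one
```

Clients never see any of this: they send to one address, and the dead backend simply stops appearing in the rotation.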
Here's the detail that makes the LB dangerous. Your domain — say api.shop.com — resolves in DNS to one IP address, and that IP belongs to the load balancer. This is called the VIP (Virtual IP — a single address that represents the whole service, not any one physical machine). Every phone, browser, and partner system on Earth sends its packets to that VIP. The LB receives them and fans them out to the servers.
So the LB isn't just a component in the path — it is the path. It's the funnel every byte squeezes through. That's exactly why losing it is catastrophic, and why a big chunk of this article is about making sure that VIP keeps answering even when the machine behind it dies.
A Layer 4 (L4) load balancer works at the transport layer (TCP/UDP). It forwards packets based on IP + port without reading the contents: fast, simple, protocol-agnostic. Think AWS NLB, or HAProxy in TCP mode. Failover here is mostly about keeping the VIP alive.
A Layer 7 (L7) load balancer works at the application layer (HTTP). It reads the request (URL path, headers, cookies) and can route /images to one pool and /api to another. Smarter, but heavier. Think AWS ALB, nginx, Envoy. There is more state to lose on failover (e.g. in-flight requests).
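The difference between the two layers comes down to what the balancer is allowed to look at. The sketch below is illustrative only: the pool names and both function names are made up, and SHA-256 merely stands in for whatever hash a real L4 balancer uses.

```python
import hashlib

# Hypothetical backend pools, named for illustration.
IMAGE_POOL = ["img-1", "img-2"]
API_POOL = ["api-1", "api-2"]
DEFAULT_POOL = ["web-1"]


def route_l7(path):
    """L7: read the request path and choose a pool by prefix."""
    if path.startswith("/images"):
        return IMAGE_POOL
    if path.startswith("/api"):
        return API_POOL
    return DEFAULT_POOL


def route_l4(src_ip, src_port, dst_ip, dst_port, pool):
    """L4: sees only addresses and ports, so it hashes them to pick a server."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]
```

`route_l7("/api/v1/cart")` can return the API pool because it read the URL. `route_l4` never sees a URL at all, which is why it is cheaper and why it cannot make content-based decisions.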
Let's draw the simplest thing that could plausibly work — the design Priya's team actually had. One load balancer, one pool of servers behind it.
It looks fine. Add servers, the LB spreads load, any backend can die and users never notice. The servers are redundant. But look at that red box — there's exactly one of it, and every arrow passes through it. Here are the concrete ways it kills you:
Kernel panic, out-of-memory, a hung process, a failed disk, someone trips over the power cable. The instant the LB process dies, the VIP stops answering. 100% of traffic blackholes — even though every backend is perfectly healthy. This is Priya's incident.
An engineer pushes a broken nginx/HAProxy config and reloads. The single LB now rejects or misroutes everything. With one box there's no canary, no "roll it out to half the fleet" — the blast radius is the entire service, instantly.
The LB is also a throughput bottleneck. A single box has a finite packets-per-second and bandwidth ceiling. A traffic spike (or a SYN flood) can max out its NIC or CPU, and now it's dropping connections for everyone: a brownout, not even a clean crash.
Security patch for the OS? Upgrade HAProxy? Reboot the host? With one LB, every maintenance action is a planned outage. Teams that run a single LB end up never patching it out of fear — which is its own time bomb.
The obvious fix is "add a second load balancer." But that immediately raises the question that trips up most engineers: if clients send traffic to one VIP, and that VIP lives on one box, how does a second box help? When LB-A dies, the VIP dies with it — clients are still pointed at a dead address. You seem to have just created a recursive problem: now you need something to balance between the two load balancers… and won't that thing be a single point of failure too?
This is the key insight of the whole article, so let's name it plainly. The recursion does bottom out — not by adding infinitely many balancers, but by climbing to layers that are progressively dumber and more reliable, until you reach a layer that has no single device at all.
Two LB machines. The trick: don't tie the VIP permanently to one box. Make it a floating IP that can jump from the dead box to the survivor in seconds. The network layer below decides which box currently owns the VIP.
Routers and switches are inherently multi-path. Protocols like VRRP (for the floating IP) and ECMP/anycast (announce the same IP from many boxes) let the network fabric itself spread or redirect traffic — and the fabric has no single device.
DNS can hand out multiple IPs for one name, health-check them, and stop returning dead ones. It's globally distributed by design (many servers, many copies), so there's no single DNS box to lose either.
With that settled, the rest is just choosing which of these layers to use, and how to wire them together. We'll do the two box-level patterns first, then the layers above.
This is the classic, battle-tested two-box pattern, and it's the easiest to reason about. You run two identical load balancers — call them Raj's box (active) and the standby (passive). Only one holds the VIP and serves traffic at any moment. The other sits quietly, fully configured, watching and waiting.
The magic is a protocol called VRRP (Virtual Router Redundancy Protocol — a standard way for two machines to share one IP and agree on who owns it), usually run by a daemon called keepalived. Here's the dance, told plainly:
From the client's perspective, nothing changed. They're still sending to 203.0.113.10. The IP didn't move; only the machine answering for it did, and it happened in a few seconds. That's why it's called a floating IP: it floats from the dead box to the live one.
Before (single LB): box dies → VIP is dead → 100% outage until a human rebuilds it (minutes to hours).
After (active-passive): box dies → standby grabs the VIP in ~3s via gratuitous ARP → a brief blip, then service resumes. No human in the loop.
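The takeover logic can be sketched as a timer on the standby. This is a simplified model, not keepalived's implementation: the `Standby` class is hypothetical, the 1-second advertisement interval is an assumed setting, and the small skew time that real VRRP adds to the timeout is ignored.

```python
ADVERT_INTERVAL = 1.0                    # seconds between heartbeats (assumed)
MASTER_DOWN_AFTER = 3 * ADVERT_INTERVAL  # silence tolerated before takeover


class Standby:
    """Passive box: listens for heartbeats and claims the VIP after silence."""

    def __init__(self, vip):
        self.vip = vip
        self.owns_vip = False
        self.last_heartbeat = 0.0

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def tick(self, now):
        """Promote if the active box has been silent for too long."""
        if not self.owns_vip and now - self.last_heartbeat >= MASTER_DOWN_AFTER:
            self.owns_vip = True
            return f"gratuitous ARP: {self.vip} is now at the standby's MAC"
        return None


standby = Standby("203.0.113.10")
events = []
for t in [0.0, 1.0, 2.0]:        # the active box is alive and advertising
    standby.on_heartbeat(t)
    events.append(standby.tick(t))
for t in [3.0, 4.0, 5.0]:        # the active box died just after t=2
    events.append(standby.tick(t))
```

In this run the standby stays quiet through t=4 and promotes itself at t=5, three seconds after the last heartbeat, which is where the "~3s" figure comes from under these assumed timers.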
Half your LB capacity sits idle as a hot spare — you pay for two, use one. And it does nothing for the saturation problem: a spike that overwhelms one box overwhelms the standby too, because only one serves at a time. For that, you want active-active.
Active-passive wastes half your hardware and doesn't help with overload. Active-active fixes both: both load balancers serve traffic simultaneously, so you get double the capacity and redundancy. The catch — and it's the whole game — is that now clients must somehow spread their traffic across two different boxes, and keep working when either one dies. There are three ways to do that, at three different layers.
Option A, DNS: give api.shop.com two A-records, one per LB. DNS round-robin sends roughly half of clients to each. If LB-A dies, a health-checked DNS service (like Route 53 with health checks) notices and stops handing out LB-A's IP. Simple, works anywhere, but slow to fail over, because clients and resolvers cache DNS answers for the TTL (Time To Live). Even with a 60-second TTL, some clients keep hammering the dead IP for a minute or more. Good for coarse, cross-region steering; too sluggish to be your only defense.
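The TTL problem is easy to see in a toy model of a caching resolver. Everything here is an assumption for illustration: the `CachingResolver` class, the example addresses, and the explicit `now` clock used instead of real time.

```python
class CachingResolver:
    """Toy resolver: asks the authoritative source only when its cache expires."""

    def __init__(self, authoritative, ttl):
        self.authoritative = authoritative  # callable returning the current IPs
        self.ttl = ttl                      # seconds an answer may be reused
        self._cached = None
        self._expires_at = 0.0

    def resolve(self, now):
        if self._cached is None or now >= self._expires_at:
            self._cached = list(self.authoritative())
            self._expires_at = now + self.ttl
        return self._cached


healthy_ips = ["198.51.100.1", "198.51.100.2"]   # LB-A and LB-B
resolver = CachingResolver(lambda: healthy_ips, ttl=60)

before = resolver.resolve(now=0)      # both IPs cached until t=60
healthy_ips.remove("198.51.100.1")    # LB-A dies; health-checked DNS drops it
stale = resolver.resolve(now=30)      # cache still holds the dead IP
fresh = resolver.resolve(now=60)      # TTL expired, dead IP finally gone
```

The authoritative side reacted immediately, yet this resolver kept handing out the dead address until its cached answer expired. Multiply that by every resolver and client cache on the path and the slow tail of DNS failover follows.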
Option B, ECMP: here's the elegant one. Both LBs are configured with the same VIP, and they each "announce" that IP to the upstream router using a routing protocol (BGP/OSPF). The router sees two equal-cost paths to the same destination and uses ECMP (Equal-Cost Multi-Path: the router hashes each connection's 5-tuple and pins it to one of the available paths) to spread connections across both boxes. When a box dies, it stops announcing the route, the router drops that path within a second or two (assuming fast failure detection such as BFD or short protocol timers; default BGP hold timers are far longer), and all traffic flows to the survivor. This is how serious on-prem and cloud-edge load balancing works: fast failover, full capacity, no DNS caching headaches.
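The 5-tuple hashing can be sketched in a few lines. This is an illustration under stated assumptions: `ecmp_pick` is a hypothetical helper, SHA-256 stands in for the router's hardware hash, and plain modulo stands in for its path-selection table.

```python
import hashlib


def ecmp_pick(five_tuple, next_hops):
    """Hash a flow's 5-tuple and map it onto one of the live next hops."""
    key = "|".join(str(field) for field in five_tuple).encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[h % len(next_hops)]


# (source IP, source port, destination IP, destination port, protocol)
flow = ("192.0.2.7", 51514, "203.0.113.10", 443, "tcp")
paths = ["lb-a", "lb-b"]

chosen = ecmp_pick(flow, paths)        # every packet of this flow hashes the same way
same_again = ecmp_pick(flow, paths)

survivors = [p for p in paths if p != "lb-a"]   # lb-a withdraws its route
after_withdrawal = ecmp_pick(flow, survivors)   # only lb-b is left to choose
```

Because the hash depends only on the 5-tuple, all packets of one connection land on the same box, which is what keeps TCP sessions intact. With more than two paths, a plain modulo like this one can also move flows between surviving boxes when the path count changes, which is why some routers offer resilient or consistent hashing.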
Option C, anycast: take Option B and stretch it across the planet. With anycast, the same VIP is announced via BGP from data centers in dozens of cities. The internet's own routing naturally delivers each user to the nearest location announcing that IP. London users hit London; Tokyo users hit Tokyo. If an entire location goes dark, BGP withdraws its announcement and the internet re-routes those users to the next-closest site automatically, typically within seconds. This is how Cloudflare, Google, and many large CDNs make a single IP resilient to entire data centers vanishing.
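As a rough mental model, anycast behaves like "pick the closest site that is still announcing the VIP". The sketch below assumes a single numeric distance per site; real BGP chooses by path attributes and operator policy rather than geography, and the function and site names are hypothetical.

```python
def nearest_announcing_site(distance_by_site, announcing):
    """Pick the closest site that still announces the VIP.

    The distance value is a stand-in for BGP's path preference.
    """
    reachable = {s: d for s, d in distance_by_site.items() if s in announcing}
    if not reachable:
        raise RuntimeError("no site is announcing the VIP")
    return min(reachable, key=reachable.get)


london_user = {"london": 1, "frankfurt": 3, "tokyo": 12}
announcing = {"london", "frankfurt", "tokyo"}

normal = nearest_announcing_site(london_user, announcing)    # London serves London
announcing.discard("london")                                 # London site goes dark
rerouted = nearest_announcing_site(london_user, announcing)  # next-closest site
```

The user's destination address never changes in this model. Only the set of sites announcing it does, which is why no client-side cache has to expire before traffic moves.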
| Mechanism | Layer | Failover speed | Best for |
|---|---|---|---|
| DNS round-robin + health checks | DNS (L7-ish) | Slow (TTL-bound, 30s–mins) | Coarse cross-region steering, last-resort failover |
| ECMP + BGP/OSPF | Network (L3) | Fast (~1–2s) | Active-active LBs in one DC / cloud edge |
| Anycast + BGP | Internet (L3) | Fast (seconds) | Global, multi-DC, DDoS-resilient front door |
| VRRP floating IP | LAN (L2) | Fast (~3s) | Active-passive pair in one rack/subnet |
Sharp readers are already uneasy. We keep saying "DNS notices the dead box and stops handing out its IP." But that just pushes the question up a level: isn't DNS itself a single point of failure? If your DNS provider goes down, nobody can resolve api.shop.com at all — and it doesn't matter how many load balancers you have behind it. (This is exactly what happened in the famous 2016 Dyn outage that took down Twitter, Reddit, and Spotify for hours.)
So how does DNS avoid being a SPOF? By being distributed and redundant by design, in three ways a newbie should understand:
Your domain lists multiple authoritative nameservers (often 4+), ideally across two independent DNS providers. A resolver that can't reach one simply tries the next. No single nameserver is load-bearing.
Each nameserver IP is itself anycast — announced from hundreds of locations. The query reaches the nearest copy. Losing a location just shifts queries elsewhere. DNS uses the same trick we use for the LBs.
Answers are cached at every resolver for the TTL. So even a total authoritative-DNS outage doesn't instantly break everyone — cached records keep resolving until they expire. Caching is a liability for failover speed but an asset for availability.
The piece that ties DNS to your load balancers is GSLB (Global Server Load Balancing — a DNS service that actively probes each of your endpoints and only returns IPs that are currently healthy). Instead of blindly round-robining two A-records, GSLB continuously pings each LB/region. When LB-A in London fails its health check, GSLB simply stops including London's IP in the answers it gives out. New clients get only healthy IPs. Route 53's failover/latency routing, NS1, Akamai GTM, and F5 GSLB all do this.
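The core of that behavior fits in one function. This is a sketch, not any vendor's API: `gslb_answer`, the endpoint table, and the probe are hypothetical, and the fallback of returning every IP when all probes fail is a design choice assumed here rather than something the text above specifies.

```python
def gslb_answer(endpoints, probe):
    """Return only the IPs whose health probe currently passes."""
    healthy = [ip for ip in endpoints if probe(ip)]
    # If every probe fails, answer with the full list instead of nothing.
    # This "fail open" fallback is an assumption made for this sketch.
    return healthy or list(endpoints)


# Hypothetical endpoints: IP -> region
ENDPOINTS = {"198.51.100.1": "london", "198.51.100.2": "us-east"}
down = {"198.51.100.1"}               # London is failing its health check


def probe(ip):
    """Stand-in for a real HTTP or TCP health probe."""
    return ip not in down


answer = gslb_answer(ENDPOINTS, probe)   # London's IP is left out
```

Failing open guards against a broken prober taking the whole service out of DNS. The opposite choice, returning nothing, is also defensible when serving from a sick endpoint is worse than not serving at all.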
Now we assemble everything into the design a real, serious service runs. The principle is defense in depth: no single layer is trusted to be the only thing standing between users and an outage. Each layer below catches what the layer above missed — fast network failover for box deaths, slower DNS failover for whole-region deaths. Every box in this diagram is numbered; the matching card below explains what it does and, more importantly, what breaks without it.
Use the numbers in the diagram above to find the matching card below. Each card answers: what is it, why it exists, and what would break if you deleted it tomorrow.
Real people and apps — Priya's customers — scattered across continents, all typing api.shop.com. They have no idea any of the machinery below exists, and that's the point: every layer's job is to keep that name working no matter what dies.
The first decision point. It resolves the name to an IP, health-checks every region, and returns only healthy ones — steering users to the nearest live region. Solves: whole-region failure and geographic latency. Without it, a dead region keeps receiving traffic until someone manually edits DNS.
A single IP announced via BGP from every region. The internet routes each user to the nearest site announcing it. Solves: the "clients are pinned to one address" problem — when a site withdraws the announcement, the internet re-routes in seconds, faster than DNS could.
Two load balancers in the region, both serving, both announcing the VIP via BGP/ECMP. Solves: the original SPOF and capacity. Either box can die and the router drops its path within ~2s; the survivor carries the load. Patching is now rolling, not an outage.
The subsystem that constantly probes backends and LBs, marking dead ones out of rotation. Solves: silent failure — without it, an LB happily forwards traffic to a crashed server, or a dead LB keeps announcing its VIP. It's the nervous system that makes every other failover trigger.
Backends split across two Availability Zones (AZs — physically separate data centers in the same region, with independent power and network). Solves: correlated failure. If all servers shared one rack or one AZ, a power event takes them all out at once. Spreading them means one AZ can burn and the service survives.
A second full stack — its own LB pair and servers — in another region. Solves: the entire-region-down scenario (a regional cloud outage, a fiber cut, a config blast radius). GSLB (②) fails users over to it. Without it, a regional outage is a full outage.
The backends behind the standby region's LBs, kept warm and in sync. Solves: the "we failed over DNS but there's nothing to fail over to" trap. A standby region with no running capacity is just a slower outage.
Let's make it concrete. It's 2 a.m. again — but this time the team has the layered design above. LB-A in US-East crashes hard. Follow what happens to Priya's customer, Sam, who's mid-checkout in New York. Watch how many components quietly do their job before anyone even gets paged.
Here's the same story in words, because the timing is the whole point:
Redundancy isn't free magic — it introduces its own failure modes. Two boxes that are supposed to coordinate can miscoordinate. Here are the traps that turn a redundant pair into a new kind of outage, and how to avoid them.
Recall the active-passive pair: the standby promotes itself when it stops hearing the active's heartbeat. But what if the heartbeat link between them breaks, while both boxes are still alive and reachable by clients? The standby assumes the active died and grabs the VIP. Now two boxes claim the same IP — they fight over it, ARP tables flap, connections break randomly. This is split-brain, and it can be worse than a clean outage because the failure is intermittent and maddening to debug.
Don't let a single cable be the thing that decides life-or-death. Run VRRP heartbeats over two independent links (e.g. the data network and a dedicated crossover). Both must fail before the standby promotes — making a false "the master is dead" verdict far less likely.
Use a third witness (a quorum node, or the cloud's API) to break ties: a box only becomes master if it can confirm with the witness that the other is truly gone. "I can't hear my peer" should never, by itself, authorize a takeover.
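Combined, the two fixes above reduce to a small decision rule. The function below is a hypothetical sketch of that rule, not a keepalived feature: it assumes the standby can observe each heartbeat link separately and can ask a witness for an independent verdict.

```python
def should_promote(heartbeat_links_up, witness_says_peer_dead):
    """Decide whether the standby may take the VIP.

    Promotion requires that every heartbeat link is silent AND that an
    independent witness confirms the peer is really gone.
    """
    all_links_silent = not any(heartbeat_links_up)
    return all_links_silent and witness_says_peer_dead


# One link cut, the other still carrying heartbeats: stay passive.
one_link_cut = should_promote([False, True], witness_says_peer_dead=True)

# Both links silent, but the witness can still reach the peer: a partition.
partitioned = should_promote([False, False], witness_says_peer_dead=False)

# Both links silent and the witness agrees the peer is dead: take over.
real_failure = should_promote([False, False], witness_says_peer_dead=True)
```

Only the last case authorizes a takeover. The middle case is the split-brain scenario, and the witness is what stops the standby from acting on "I can't hear my peer" alone.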
A health check is only as good as what it actually tests. The classic mistake is a shallow check — "does the server's port 80 accept a TCP connection?" The port can be open while the app is broken: database disconnected, dependency down, returning 500s to every real request. The LB sees "healthy" and keeps sending users to a server that errors on everything.
The fix is a deep /healthz endpoint that actually exercises the critical path (can it reach the DB? the cache?) and returns 200 only if the server can truly serve. But don't over-couple it: if /healthz fails the instant a shared dependency hiccups, every server fails its check at once and the LB pulls the entire pool out of rotation, turning a minor blip into a total outage. Check what this server needs, not the whole world's health.
Making the LB redundant is pointless if everything it points to shares a hidden SPOF. Common ones to hunt down:
Here's the good news for most engineers in 2026: if you use a managed cloud load balancer — AWS ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer — you get most of this for free, because the cloud provider already solved it at massive scale. But you have to understand what's handled and what's still your job, or you'll build a SPOF on top of a non-SPOF.
Health checks are still your job: make them deep (a /healthz that tests the real path), but scope them so one shared dependency can't fail the whole pool at once.
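One way to get a check that is deep but scoped is to split it into two groups. The sketch below is framework-agnostic and entirely hypothetical: `healthz`, the split into critical and advisory checks, and the check names are assumptions, and deciding which dependency belongs in which group is a judgment call for each service.

```python
def _run(check):
    """Run one check; any exception counts as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def healthz(critical_checks, advisory_checks):
    """Return (status_code, detail) for a health endpoint.

    critical_checks: things this server itself needs in order to serve.
    advisory_checks: shared dependencies; reported, but never fatal, so a
    shared hiccup cannot pull every server out of rotation at once.
    """
    detail = {}
    status = 200
    for name, check in critical_checks.items():
        ok = _run(check)
        detail[name] = ok
        if not ok:
            status = 503
    for name, check in advisory_checks.items():
        detail[name] = _run(check)  # informational only
    return status, detail


# This server's own connection pool works, but a shared cache is unhappy.
status, detail = healthz(
    critical_checks={"db_pool": lambda: True},
    advisory_checks={"shared_cache": lambda: False},
)
```

Here the server stays in rotation with a 200 while the response body still reports the cache problem, so monitoring can alert on it without the LB draining the entire pool.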