High Availability vs Fault Tolerance vs Disaster Recovery – Explained with an Analogy

As a full-stack developer, designing systems that are reliable, resilient and always available is a top priority. But what do terms like "high availability", "fault tolerance", and "disaster recovery" actually mean in practice? And how do you choose the right approach for your application?

In this in-depth guide, we'll break down these concepts using a simple analogy of a pizza restaurant chain. We'll also look at real-world examples and code patterns for building highly available, fault tolerant applications that can withstand disasters. By the end, you'll have a clear framework for architecting systems that meet your reliability goals.

The High Cost of Downtime

Before we jump into the technical details, it's important to understand the business context. In today's 24/7 digital economy, application downtime is incredibly costly. According to Gartner, the average cost of IT downtime is $5,600 per minute. For larger companies, that figure can be as high as $540,000 per hour.

But the costs go beyond just lost revenue. Downtime can also lead to:

  • Reputational damage and loss of customer trust
  • Lost productivity for employees who can't access critical systems
  • Compliance and regulatory issues in some industries
  • Increased churn as frustrated users switch to competitors

Given these high stakes, it's no wonder that CIOs and CTOs rank availability and uptime as top concerns. In a 2020 survey by Uptime Institute, 73% of respondents said improving availability and minimizing downtime was their most important priority.

The Pizza Restaurant Analogy

To illustrate the differences between high availability, fault tolerance, and disaster recovery, let's use the analogy of a chain of pizza restaurants. Imagine you're the owner of this chain. Your goal is to serve up delicious pizza to hungry customers 24/7, without interruption.

High Availability

In our analogy, high availability means maximizing the amount of time the restaurant can take and fulfill customer orders. You don't ever want to have to turn customers away because you're unable to make pizzas.

To achieve high availability, you might staff the restaurant with multiple chefs and pizza makers, so there's always someone ready to jump in if a staff member can't make it to work. You'd also carefully manage your supply chain to ensure you never run out of key ingredients like dough, sauce and cheese.

The key metric for availability is uptime, often expressed as a number of "9s". For example:

  • 99% availability (2 nines) = 3.65 days of downtime per year
  • 99.9% availability (3 nines) = 8.76 hours of downtime per year
  • 99.99% availability (4 nines) = about 52.6 minutes of downtime per year
  • 99.999% availability (5 nines) = about 5.3 minutes of downtime per year
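
The figures above follow from simple arithmetic: multiply the hours in a year by the fraction of time the system is allowed to be down. A minimal sketch, assuming a 365-day year (the function name downtimeHoursPerYear is made up for this illustration):

```javascript
// Hours in a non-leap year; an assumption that keeps the arithmetic simple.
const HOURS_PER_YEAR = 365 * 24;

// Returns the downtime budget, in hours per year, for an availability percentage.
function downtimeHoursPerYear(availabilityPercent) {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError('availabilityPercent must be between 0 and 100');
  }
  return HOURS_PER_YEAR * (1 - availabilityPercent / 100);
}

// Print the budget for each level of "nines" in hours and minutes.
for (const pct of [99, 99.9, 99.99, 99.999]) {
  const hours = downtimeHoursPerYear(pct);
  const minutes = hours * 60;
  console.log(`${pct}% -> ${hours.toFixed(2)} hours (${minutes.toFixed(1)} minutes) per year`);
}
```

Each extra nine cuts the downtime budget by a factor of ten, which is why every step up the ladder is so much harder to reach than the last.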

Achieving higher levels of 9s gets exponentially harder and more expensive. Going from 99% to 99.9% is fairly straightforward – perhaps hiring an extra chef and keeping spare supplies on hand. But going from 99.9% to 99.99% might require major investments like redundant pizza ovens, multiple power generators, and on-call staff 24/7.

For most pizza restaurants, 99.9% uptime is more than sufficient. The cost of an occasional hour of lost sales pales in comparison to the massive ongoing expense of maintaining true 24/7/365 availability. It's all about striking the right cost-benefit balance for the business.
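
To make that cost-benefit balance concrete, here is a rough sketch that compares the revenue saved by an availability upgrade with its extra yearly cost. Every number in it is hypothetical, and it assumes downtime is spread evenly across hours that all earn the same revenue, which real outages rarely respect:

```javascript
// Expected yearly revenue lost to downtime, assuming a 365-day year.
function expectedDowntimeCost(availabilityPercent, revenuePerHour) {
  const hoursDown = 8760 * (1 - availabilityPercent / 100);
  return hoursDown * revenuePerHour;
}

// Compares the revenue saved by moving to a higher availability target
// against the extra annual cost of getting there.
function evaluateUpgrade({ currentPct, targetPct, revenuePerHour, extraAnnualCost }) {
  const savings =
    expectedDowntimeCost(currentPct, revenuePerHour) -
    expectedDowntimeCost(targetPct, revenuePerHour);
  return { savings, worthIt: savings > extraAnnualCost };
}

// Hypothetical restaurant: $500 of sales per hour, $50,000 per year for
// redundant ovens, generators, and on-call staff.
const verdict = evaluateUpgrade({
  currentPct: 99.9,
  targetPct: 99.99,
  revenuePerHour: 500,
  extraAnnualCost: 50000,
});
console.log(verdict);
```

With these made-up inputs the upgrade recovers roughly $3,900 of sales per year, far less than it costs. The same calculation can tip the other way for a business that loses thousands of dollars per minute of downtime.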

Fault Tolerance

Now let's say you've staffed up and stocked up to achieve high availability. Orders are rolling in and pizzas are flying out the door. But then disaster strikes – your main pizza oven goes down! With only one oven, your restaurant would be dead in the water until repairs can be made.

To guard against this type of single point of failure, you need fault tolerance. In the pizza analogy, that means having spare ovens that can immediately take over if the main one fails. It might also mean having redundant dough mixers, backup power systems, and cross-trained staff who can step in for any role.

Fault tolerance is about building in redundancy so that the failure of any one component doesn't take down the whole system. It's closely related to high availability – without fault tolerance, a single failure can lead to extended downtime.

But it's also distinct, in that true fault tolerance means being able to withstand failures without any interruption of service. With high availability, you might have a 5-10 minute window of downtime while a backup system comes online. With fault tolerance, there's no noticeable downtime at all from the customer's perspective.

Achieving this level of zero-downtime fault tolerance requires sophisticated engineering. At the infrastructure level, you need redundant servers, load balancers, and data storage. At the application level, you need to build in graceful degradation, circuit breakers, and automated failover.
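
As one illustration of automated failover at the application level, the sketch below tries a list of redundant backends in order and returns the first successful result. The helper callWithFailover and the two "oven" functions are hypothetical names invented for this example. A production version would typically add timeouts, health checks, and limits on retries:

```javascript
// Try each redundant backend in order and return the first successful result.
// `backends` is an array of async functions that all accept the same request.
async function callWithFailover(backends, request) {
  const errors = [];
  for (const backend of backends) {
    try {
      return await backend(request);
    } catch (err) {
      errors.push(err); // Remember the failure and move on to the next backend
    }
  }
  // Every backend failed, so surface all of the collected errors at once.
  throw new AggregateError(errors, 'All backends failed');
}

// Hypothetical backends: the main oven is broken, the spare one works.
const mainOven = async () => {
  throw new Error('main oven is down');
};
const spareOven = async (order) => `${order} baked in spare oven`;

callWithFailover([mainOven, spareOven], 'margherita')
  .then((result) => console.log(result));
```

The caller never sees the first failure. It simply gets its pizza from the spare oven, which is the behavior fault tolerance is after.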

Here's a simplified example in Node.js of an API route handler that uses a circuit breaker to provide fault tolerance:

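import express from 'express';
import CircuitBreaker from 'opossum';

// Assumes an Express app; the port below is chosen only for illustration.
const app = express();

// Placeholder for the real database call. It must return a promise.
async function fetchDataFromDB() {
  return { status: 'ok' };
}

const circuitBreaker = new CircuitBreaker(fetchDataFromDB, {
  timeout: 3000, // If our function takes longer than 3 seconds, trigger a failure
  errorThresholdPercentage: 50, // When 50% of requests fail, trip the circuit
  resetTimeout: 30000 // After 30 seconds, let a trial request through
});

app.get('/data', async (req, res) => {
  try {
    const result = await circuitBreaker.fire();
    res.json(result);
  } catch (err) {
    // 503 tells clients the outage is temporary, whether the call failed or the circuit is open
    res.status(503).json({ error: 'Service unavailable' });
  }
});

app.listen(3000);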

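In this example, if the fetchDataFromDB function starts timing out or throwing errors often enough to cross the 50% threshold, the circuit breaker trips (opens). While the circuit is open, calls fail immediately instead of piling up behind a struggling database, which prevents cascading failures and lets the rest of the system degrade gracefully. After 30 seconds the breaker lets a trial request through: if it succeeds, the circuit closes and normal traffic resumes; if it fails, the circuit stays open for another 30 seconds.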
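
To show what happens inside a breaker, here is a deliberately minimal state machine with the three classic states: closed, open, and half-open. It is a teaching sketch, not how opossum is implemented. It trips on a count of consecutive failures rather than an error percentage, and it does not limit concurrent trial calls. The class name SimpleCircuitBreaker and its options are made up for this example:

```javascript
class SimpleCircuitBreaker {
  // `now` is injectable so the clock can be controlled in tests.
  constructor(action, { failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'closed'; // closed = traffic flows normally
    this.failures = 0;     // consecutive failures seen so far
    this.openedAt = 0;     // timestamp of the most recent trip
  }

  async fire(...args) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit is open'); // Fail fast without calling the action
      }
      this.state = 'half-open'; // Reset timeout elapsed: allow a trial call
    }
    try {
      const result = await this.action(...args);
      this.state = 'closed'; // Any success closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

The important property is the fail-fast branch. While the circuit is open, the wrapped action is never invoked, which gives the failing dependency room to recover.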

Disaster Recovery

So we've made our pizza restaurant highly available and fault tolerant. We have redundant ovens, backup generators, and a well-oiled team of pizza makers. But what happens if a meteor hits your restaurant and reduces it to rubble? (Okay, that's a bit extreme, but stay with me.)

No amount of redundant onsite equipment will help you recover from a true disaster that knocks out the entire facility. To recover from that kind of widespread failure, you need a disaster recovery plan.

In the pizza analogy, a DR plan might look like:

  1. Fail over incoming orders to another restaurant location outside the impact zone
  2. Retrieve the latest order and customer data from an offsite backup
  3. Provision a new temporary kitchen at a backup location
  4. Route delivery drivers to the new location
  5. Notify customers of temporary changes and expected resolution time

The key to effective disaster recovery is having all of this planned out and practiced in advance. You don't want to be scrambling to find a backup location or restore data from backups while hungry customers are pounding on your door.

The same principles apply to IT disaster recovery. The goal is to get the system back up and running at an alternate location with minimal data loss. To enable this, you need:

  • Reliable, tested backups of data and configurations
  • Redundant infrastructure in geographically separate locations
  • Automated provisioning and deployment to spin up new environments quickly
  • Detailed recovery plans and runbooks that are regularly practiced

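Disaster recovery is the most extreme (and expensive) form of business continuity planning. Two targets define how much protection you are paying for: the recovery time objective (RTO), the maximum acceptable time to restore service, and the recovery point objective (RPO), the maximum acceptable window of data loss, measured back from the moment of failure. Many businesses choose to accept some data loss and downtime in the face of a true disaster, opting for an RTO of a few hours or even days.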
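
One way to keep those choices honest is to score every recovery drill against its targets: an RTO for downtime and an RPO for data loss. The sketch below does that with three timestamps. The function name assessRecovery, its parameters, and the sample times are all hypothetical:

```javascript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;

// All timestamps are milliseconds since the Unix epoch.
function assessRecovery({ lastBackupAt, failureAt, restoredAt, rpoMs, rtoMs }) {
  const dataLossMs = failureAt - lastBackupAt; // Changes made after the last backup are lost
  const downtimeMs = restoredAt - failureAt;   // Time the service was unavailable
  return {
    dataLossMs,
    downtimeMs,
    metRpo: dataLossMs <= rpoMs,
    metRto: downtimeMs <= rtoMs,
  };
}

// Hypothetical drill: last backup at 02:00, failure at 02:45, service restored at 05:15.
const drill = assessRecovery({
  lastBackupAt: Date.parse('2024-01-01T02:00:00Z'),
  failureAt: Date.parse('2024-01-01T02:45:00Z'),
  restoredAt: Date.parse('2024-01-01T05:15:00Z'),
  rpoMs: 1 * HOUR_MS, // Tolerate at most 1 hour of lost data
  rtoMs: 2 * HOUR_MS, // Tolerate at most 2 hours of downtime
});
console.log(drill);
```

In this made-up drill the 45 minutes of lost data is within the 1-hour RPO, but the 2.5 hours of downtime misses the 2-hour RTO. Either the restore process needs to get faster or the target needs to be renegotiated.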

The important thing is to make this an intentional choice based on business needs. Don't just assume you can quickly recover from a disaster if you haven't put in the planning and preparation.

Putting It All Together

So when should you use high availability vs fault tolerance vs disaster recovery in your applications?

The answer, as with most things in engineering, is "it depends". You need to weigh the business need for uptime and data protection against the cost and complexity of each approach.

Here are some general guidelines:

  • For customer-facing web applications where a few minutes of downtime is acceptable, design for high availability. Use redundant servers and databases, but don't over-engineer. Aim for 99.5-99.9% uptime.

  • For critical APIs and backend systems that need to be "always on", design for fault tolerance. Use circuit breakers, redundant components and auto-scaling. Accept that this will be more complex and expensive to operate.

  • For any system that is critical to the business, have a disaster recovery plan. The more important the system and data, the lower your RTO and RPO should be. But be pragmatic – an hour of lost data may be painful, but going out of business because you over-invested in DR is worse.

The key is to be intentional about your availability and resiliency design choices. By understanding the differences between high availability, fault tolerance, and disaster recovery, you can make informed tradeoffs and build systems that strike the right balance for your customers and business.