reliability in an unpredictable world

Much is written about reliability for computing services, web services in particular. “Five nines” uptime became a popular catchphrase a few years back (referring to 99.999% uptime, which works out to about 5.26 minutes of downtime per year).
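The “nines” figures are simple arithmetic, so here is a minimal sketch of the calculation. The function name and the choice of a 365.25-day average year are my own assumptions for illustration; a 365-day year gives almost the same answer.

```python
# Minutes in an average year (365.25 days, an assumption that accounts for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the downtime budget for two through five nines.
for nines in ("99", "99.9", "99.99", "99.999"):
    minutes = downtime_minutes_per_year(float(nines))
    print(f"{nines}% uptime allows {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is a big part of why chasing the last one gets so expensive.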

Reliability is obviously a good thing; you want your website to be up whenever your customers might need it. However, this often leads people to spend far more time, energy, and money building out a large infrastructure (with excess capacity to handle peak load) than is really necessary or useful. It can even hurt your reliability if you end up centralizing too much.

Assuming you have the same amount of resources (human, financial, etc.) to spend on a centralized location with a small number of large servers, or on several locations with a large number of small servers, I’d almost always go for the latter. Also, having redundant machines and a way to remotely restart and diagnose your equipment should be a given; it doesn’t matter where in the world your servers physically reside.

I’ve seen a tendency to “scale up” rather than “scale out”: for example, putting a critical web site on one giant server on one fast and reliable network, and attempting to scale using that model (buy a bigger server, move to a more reliable network, and if it gets to be too much for one server, split the work onto two servers that don’t have the same data). The problem is that this puts all of your eggs in one basket. If your provider goes down (they will), or your software fails (it will), or something completely unexpected that should never happen happens, then you’re out of luck until you get that physical machine at that physical location back up (or replace the bad part, or move to a different service provider, whatever).

It’s also a lot of work to have to break your data processing application into halves after the fact, instead of thinking about how to break down your tasks from the start. Breaking down a hard problem into a series of easy problems is a great exercise in general; you will frequently discover that many of the hard problems you need to solve have a lot in common, and someone’s already found a way to do the easy stuff, which you can learn from.

For most applications and workloads, the problem isn’t that there’s too much data, it’s that there are too many incoming requests. Ideally, you never want data to exist in only one place, reachable over only one network; having a larger number of smaller servers, distributed geographically, is a much better way to ensure that your customers get the data they need when they request it. Distributing your servers geographically also tends to make the site more responsive to users, since they can be routed to the closest available server.
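As a rough illustration of “closest available server”, here is a minimal sketch that picks the lowest-latency server that is currently up. The `Server` class, its fields, and `pick_server` are hypothetical names of my own; in practice this decision is usually made by DNS or a load balancer rather than by code like this.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Server:
    name: str
    latency_ms: float  # assumed: measured round-trip time from the user
    healthy: bool      # assumed: result of some external health check


def pick_server(servers: Iterable[Server]) -> Optional[Server]:
    """Return the lowest-latency healthy server, or None if every server is down."""
    candidates = [s for s in servers if s.healthy]
    return min(candidates, key=lambda s: s.latency_ms, default=None)


servers = [
    Server("us-east", 20.0, healthy=False),  # closest, but currently down
    Server("us-west", 70.0, healthy=True),
    Server("europe", 110.0, healthy=True),
]
chosen = pick_server(servers)
print(chosen.name if chosen else "no server available")
```

The point of the sketch is the fallback: when the nearest site is down, the user is sent somewhere slower instead of nowhere.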

Even in the case where you absolutely must have large servers pulling in a lot of data and keeping it all in memory at the same time, there’s no reason that these servers should be providing your web presence (they should be on the “back-end”, not directly accessible to users like your “front-end” web servers are).

This arrangement tends to let your front-end servers handle a lot more load, since they can (usually) serve almost all static content themselves. Even if your application is 100% dynamic, an empty server response (or a “sorry we’re down” message) is better than a refused connection, and slightly stale data is better than an empty server response. Best is when your back-end servers are down and back up before your front-end servers (or your customers!) notice.
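That ordering of preferences (fresh data, then slightly stale data, then a “sorry we’re down” message) can be sketched in a few lines. Everything here is an assumption for illustration: `serve`, `fetch_from_backend`, and the plain dictionary used as a cache are hypothetical, and the 503 status code is simply my choice for the apology response.

```python
def serve(path, fetch_from_backend, cache):
    """Return (status, body) for a request, degrading gracefully.

    fetch_from_backend is any callable that returns the body for a path
    or raises an exception when the back-end is unavailable.
    """
    try:
        body = fetch_from_backend(path)
        cache[path] = body  # remember the last good answer
        return 200, body
    except Exception:
        if path in cache:
            # Back-end is down: slightly stale data beats an empty response.
            return 200, cache[path]
        # Nothing cached: an apology still beats a refused connection.
        return 503, "Sorry, we're down for the moment. Please try again shortly."
```

A real front end would also expire cache entries and log the failure, but the fallback order is the part that matters here.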

If I were deploying a new service today, I would:

  • separate the content generation from the content serving as much as possible. A lot of information can be pre-generated and saved as HTML and images, instead of being generated in response to every single HTTP query.
  • buy as few front-end servers as possible; I’d look first at virtual hosting, then virtual servers, then at hosting my own physical servers
  • look into using different network and hosting providers simultaneously – there’s no reason to use the same company for a relatively small server deployment; you usually only get a price break when you’re hosting a lot.
  • ensure secure, remote access to all of my services (and 24/7 contact info for all providers)
  • build in full redundancy – I should be able to shut down any single site or server and suffer no permanent data loss.
  • come up with a reasonable maintenance plan (set aside a certain number of hours per month for regular maintenance, and use it). This includes things like validating backups, working on better monitoring, thinking about how to make things better.
  • automate the basic setup, update, and rollback of servers as well as software. The only manual work should be plugging in the new server (and if it’s virtual, not even that); the OS install and custom software/data can be deployed (and updated) automatically.
  • do phased rollouts for software updates; for example update the OS, custom software, and/or data on a small number of servers, and deploy to larger and larger numbers if no problems are found (despite the best testing, problems are still often found only in the field).
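
The phased rollout in the last item can be sketched roughly as follows. This is only an illustration under my own assumptions: `phased_rollout`, `deploy`, `is_healthy`, and the stage sizes are hypothetical names and values, and a real deployment tool would do far more (waiting between stages, rolling back, alerting).

```python
def phased_rollout(servers, deploy, is_healthy, stage_sizes=(1, 5, 25)):
    """Deploy to growing batches of servers, stopping at the first bad batch.

    Returns (updated, failed): the servers that received the update, and the
    servers in the last batch that failed their health check (empty if none).
    """
    updated = []
    remaining = list(servers)
    sizes = list(stage_sizes)
    while remaining:
        # Use the next stage size; once stages run out, take everything left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for server in batch:
            deploy(server)
        updated.extend(batch)
        failed = [s for s in batch if not is_healthy(s)]
        if failed:
            return updated, failed  # stop before touching more servers
    return updated, []
```

If a batch fails, the caller knows exactly which servers in `updated` need to be rolled back, and the rest of the fleet never saw the bad update.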

In short, I’d spend more time thinking about how to manage and distribute my data, and plan ahead for the fact that there will be problems. One downside is that more servers means more complexity, but a surprising number of tasks can be automated such that you rarely need to log directly into an individual server, except in the cases where one is having problems.

Another key part of this strategy is to notify customers early and often.

Ideally, a given customer never has problems with your service. Less ideally (but more realistically), there is a problem, and you notify customers of what is wrong and when you expect it to be back up (with a followup on what went wrong and how you will prevent it in the future).

The worst is when a customer rings you up to tell you that your email is down, so they can’t tell you that your site isn’t working. Oh, and they had to dig your phone number out of the Google cache, so they are good and frustrated by that, too :)
