What I Learnt Losing a Million Pageviews

2014-01-06

I have worked at Memrise for about a year and a half, and am now in charge of the whole web development side. We’ve been growing a lot, and the size of our database is, naturally, ever increasing.

In the week of the 9th December, we very unexpectedly went down. The database reported irreparable corruption after we ran a routine reboot, and since everything relies on it, we had to take the whole website offline. We were unavailable for 52 hours while we worked on recovery, and therefore we lost over a million pageviews (~25,000 per hour × 52 hours ≈ 1.3 million).

Our long fight for recovery involved:

restoring the morning backup, and discovering that it was also corrupt;
restoring the past two days’ backups, and discovering they too were corrupt;
restoring every other daily backup we had, and discovering the newest uncorrupted one was five days old;
using the good, five-day-old backup to recover, by replaying the five days of activity from the log files (a separate record of data changes kept for backup purposes).
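
The steps above boil down to "find the newest good backup, then replay everything logged since". Here is a minimal Python sketch of that logic. The `Backup` class, the `log_position` numbers and both function names are made up for illustration; a real recovery would use the database's own backup and log tooling rather than anything like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    age_days: int       # 0 = this morning's backup
    log_position: int   # how far the change log had got when the backup was taken
    corrupt: bool       # outcome of restoring the backup and checking it


def newest_good_backup(backups):
    """Return the most recent uncorrupted backup, or None if all are corrupt."""
    good = [b for b in backups if not b.corrupt]
    return min(good, key=lambda b: b.age_days, default=None)


def changes_to_replay(log, backup):
    """Return the logged changes made after `backup`, in their original order.

    `log` is an iterable of (log_position, change) pairs.
    """
    newer = [entry for entry in log if entry[0] > backup.log_position]
    return sorted(newer, key=lambda entry: entry[0])


# Seven daily backups (hypothetical positions); all newer than five days are corrupt.
backups = [Backup(age, 1000 - 100 * age, corrupt=age < 5) for age in range(7)]
base = newest_good_backup(backups)
print(base.age_days)  # 5
```

The important property is that the log lives separately from the backups: without it, falling back to a five-day-old backup would have meant losing five days of data.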

It was an ordeal, especially as it seemed to come out of nowhere. Amazon later confirmed that there was a fault on their end; they had deployed a bug in their virtual disk software “EBS”, which had caused the corruption in our database.

In the vein of the classic “What I Learned Losing a Million Dollars”¹, I’m going to analyze the hard-earned lessons this experience gave me, and hopefully you too can learn from it.

What to Learn

Determining the main takeaways from such a complex catastrophe is hard. Many things happened in succession, for many different reasons (I’ve only told you the simplified story). Picking out the root causes, and seeing what should be done differently next time, is tricky.

Random failures in systems are inevitable; even if this time it was an Amazon bug, the next time it could be cosmic rays, or an earthquake, or aliens. It would be easy to believe “Amazon EBS is buggy, don’t use EBS”; however, I don’t think this single event really says much about its general reliability.

The most useful way to look at such a large failure is to ask how we failed to spot that this problem could knock us out so badly, in line with Nassim Taleb’s suggestion in Antifragile²:

…after the occurrence of an event, we need to switch the blame from the inability to see an event coming… to the failure to understand (anti) fragility, namely, “why did we build something so fragile to these types of events?”

Basically the goal is: have some perspective.

Fixing SPOFs Early

SPOF stands for ‘Single Point of Failure’, which means a part of a system that would bring the whole thing down should it fail. Examples include: a washing machine’s power cord; a car’s wheels (when not carrying a spare) - all four are SPOFs, as a puncture in any would mean a complete stop; or a website with all its data on a single database server!

Since such an architecture is dangerously fragile, how can we avoid it? Well actually, there is a tried and tested technique for MySQL (the database technology we use) - and that is replication. This involves a second database that contains an exact copy of the data, and keeps it up-to-date live. It’s almost magical. Should the original ‘master’ database fail, one can just turn the replica into the new ‘master’, and continue merrily.

Unfortunately, we already knew this, and were still caught with our pants down. In fact, the reason we performed the ‘routine reboot’ that revealed the corruption was that we were trying to fix replication. A parameter change made in the past meant that our replication kept breaking randomly, and by rebooting we were hoping to fix the parameter change, launch a fresh replica, and be back in the ‘safe zone’ where we had more than one database. It just so happened that the corruption struck at the worst possible time, reiterating Murphy’s Law very poignantly: “Anything that can go wrong will go wrong.”

So even though we knew the single database was a SPOF, the real error was that we had not prioritized fixing it at all. There had been months in which the replication was known to be broken and could have been fixed; it was always an ‘important but never urgent’ task (very deadly). Even though the chance at a given point in time of random disk corruption is low, the damage it can do is terrible, as we found out. In risk, it seems to be better to focus on exposure to error, rather than the raw probability; therefore, finding and eliminating SPOFs should always be prioritized, especially when it doesn’t appear urgent.
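
One way to stop broken replication from sitting in the ‘important but never urgent’ pile is to have a scheduled check shout about it. Below is a minimal Python sketch, assuming the replica's status has already been fetched as a dictionary from MySQL's `SHOW SLAVE STATUS` (newer MySQL versions rename this to `SHOW REPLICA STATUS` with correspondingly renamed fields). The function name and the 60-second lag threshold are my own assumptions, not something we had in place.

```python
def replication_problems(status, max_lag_seconds=60):
    """Return a list of problems found in one SHOW SLAVE STATUS row.

    `status` is the row as a dict, or None/empty if the server is not
    set up as a replica. An empty result means replication looks healthy.
    """
    if not status:
        return ["no replication configured"]

    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("IO thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread is not running")

    # MySQL reports NULL here when it cannot measure the lag.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replica is {int(lag)}s behind")
    return problems
```

Wiring the returned list into whatever alerting is already in use turns a silent SPOF into a page that somebody has to acknowledge.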

In the Cloud, Make Parallel Bets

Once we found the corruption, we launched a long investigation into which backups were corrupt. With hindsight, it’s easy to see that this is the part of our recovery process where we lost a lot of time. The investigation could have been much shorter, had we acted in a more parallel fashion.

Our investigation proceeded in a very linear manner: first we checked the same day’s backup; then we checked the previous two days’ backups; and then we checked the remaining four days’. However, each round of restores took about an hour, so we lost a lot of time just waiting.

What I now wish we had done was to assume the very worst straightaway. Having seen the first sign of disk corruption, we would then have launched all the backup restores in parallel. Had we done this, one round of restores would have replaced three, and the investigation would have taken about a third of the time.

And this is down to the somewhat unintuitive power of ‘the cloud.’ It’s quite human to think of ‘our database server’ as a little machine somewhere, but really, Amazon has a whole heap of computing resources—much more than a small startup could possibly use. Since one can get a hold of it cheaply for just a few hours, launching seven backup restores at once is actually a great idea—even if it would only save an hour of downtime.

I think the main lesson here is: before diving into a problem, take a step back, take a deep breath, and survey all the resources that are available. Once we had started, it was easy to feel we were making good progress on discovering which backups were corrupted; but by working in a serial fashion, we were moving at a snail’s pace, relative to working in parallel.
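
To show what the parallel approach might look like, here is a minimal Python sketch. `restore_and_check` is a hypothetical function standing in for "launch a server from this backup and report whether it is corrupt"; it is not a real API. The one-hour-per-round figure is the rough number from our own restores.

```python
from concurrent.futures import ThreadPoolExecutor


def check_backups_in_parallel(backup_ids, restore_and_check):
    """Run restore_and_check(backup_id) for every backup at once.

    Returns a dict mapping each backup id to True if it is corrupt.
    """
    backup_ids = list(backup_ids)
    # One worker per backup, so no restore waits for another to finish.
    with ThreadPoolExecutor(max_workers=max(1, len(backup_ids))) as pool:
        results = pool.map(restore_and_check, backup_ids)
        return dict(zip(backup_ids, results))


def wall_clock_hours(rounds, hours_per_round=1):
    """Elapsed time when each round runs in parallel but rounds run in sequence."""
    return len(rounds) * hours_per_round


# Backups identified by age in days: what we did versus what I wish we had done.
serial_rounds = [[0], [1, 2], [3, 4, 5, 6]]
parallel_rounds = [[0, 1, 2, 3, 4, 5, 6]]
print(wall_clock_hours(serial_rounds), wall_clock_hours(parallel_rounds))  # 3 1
```

Threads are enough here because each worker would spend its time waiting on a remote restore rather than computing anything locally.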

Count Your Blessings

It would have been easy to see the downtime as just plain bad luck. However, in some ways, nothing could be further from the truth.

Since we only keep seven days of backups, if we hadn’t rebooted the database for two more days, we wouldn’t have found out that the corruption even existed until it was too late. And that would have meant inevitable data loss—and of an unknown size. We are obviously incredibly glad that didn’t happen!
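
The ‘two more days’ figure falls straight out of the retention window, as this small Python sketch shows. The function name is made up for illustration; the numbers are the seven-day retention and five-day-old good backup described above.

```python
def days_until_last_good_backup_expires(retention_days, newest_good_age_days):
    """Days left before the newest uncorrupted backup ages out of retention."""
    return max(0, retention_days - newest_good_age_days)


print(days_until_last_good_backup_expires(7, 5))  # 2
```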

Additionally, the end result of our downtime was not so bad. We didn’t have an Adobe-scale account leak, nor did we lose any important data. And best of all, seeing the positive perspective on bad luck is one of the best ways to become lucky³, so I’m glad I can do that.

References

¹ What I Learned Losing a Million Dollars by Jim Paul and Brendan Moynihan (1994)

Jim Paul tells his story of massive loss, why he stuck to his position, and how to avoid it.

² Antifragile: Things That Gain From Disorder by Nassim Nicholas Taleb (2013)

An in-depth discussion of the implications of risk and probability that overlaps partially with his other two books.

³ The Luck Factor by Richard Wiseman (2004)

How to become lucky in four easy steps - derived from a large scientific study.