Bitfield Consulting

Night of the Runbooks: a DevOps horror story

It was a dark and stormy night...

Not really, of course; it was warm and pleasantly mild (the weather never seems to co-operate on these occasions), but indulge me by imagining appropriate meteorological conditions for the spine-tingling story I'm about to relate.

It was Morgan's very first night on call, and they were a mite nervous, as you can imagine. They would be solely responsible for an online system handling thousands of users and millions of transactions, including credit card payments, and if anything went wrong, they would have to deal with it. Somehow.

Fortunately, automated monitoring checks would be keeping an eye on the system second by second, ready to send an alert to Morgan's phone if any problems were detected. So as the warm summer evening wore on, Morgan settled into a comfortable chair with a blood-chilling tale of terror, phone within reach, and relaxed a little.

Beep, beep

At around 11pm, the phone suddenly beeped, the alert sounding absurdly loud in the quiet house. Morgan grabbed it quickly. Would this be a false alarm, or a real incident? "Service process-payments is CRITICAL", they read. "Oh, great," thought Morgan. "This is exactly what I need. I don't know anything about process-payments; it belongs to a completely different team. I wouldn't even know how to check it. Wait, though... what's this?" They read on:

"Please click the following link to troubleshoot this service."

Intrigued, Morgan opened the message on their laptop and went to the link provided. It opened a web page entitled "Runbook for process-payments incidents", with a large, friendly heading saying "Start Here". "Great!" thought Morgan; where to start was exactly what had been worrying them. It looked like someone had taken the trouble to prepare a complete troubleshooting guide for exactly this kind of incident.

The first thing in the document was a series of clear, simple steps for Morgan to follow. First, an embedded button to click to acknowledge the alert, to let others know that they were handling the incident, so it wouldn't automatically escalate to someone else.

Next, a set of live images showed the current status of various aspects of the payments service: percentage of failed requests, latency, and so on. The checklist asked Morgan to confirm from these dashboard displays that there was a problem with the service.

It looked like there was, since the number of failures had gone way up, above the alert threshold, and was still rising. Morgan went on to the next step in the checklist. It was a button that triggered a script to capture the current status and log messages from the service, automatically copying them into a new web document for the incident. This would come in handy later when trying to figure out what went wrong.

"So far, so good," thought Morgan. "What's next?" Next came another button, to automatically restart the service. Sometimes things just hang up or get stuck for no particular reason, and as we all know, turning them off and on again can be a useful default remedy. To save Morgan having to go to the appropriate cloud control panel, log in, find the service, and work out the right sequence of clicks to restart it, the whole procedure was automated, and all Morgan needed to do was click the button in the runbook.

Computer says no

After watching the graphs for a few minutes, Morgan noticed that the restart didn't seem to have fixed the issue. "Oh well," they thought, "it was worth trying. So now what do I do?"

The next step in the checklist said "If the issue is still unresolved, click this button to fail over to the backup service, and continue following this checklist to troubleshoot." Luckily, Morgan's team had a Plan B for this service: a third-party equivalent which could take over temporarily and make sure requests still went through, albeit not as fast as with the primary service.

Morgan clicked the failover button, their confidence growing with each step of the runbook they had successfully followed. After confirming from the monitoring dashboard that the backup service was starting to clear down the number of queued requests, they moved on to the next part of the checklist.

This advised them to read the log messages from the failing service (automatically captured in a previous step, and now available as part of the incident report document). The runbook gave a few examples of the different error messages they might see, with suggestions for what to do in each case.

"Hmm, this says I should look for a message indicating that a new version of the service was deployed recently," Morgan thought. "Let's see... wait, here it is. Deploying update... all instances successfully restarted... Version XXX running. So there was a fresh deploy about five minutes before the first errors started coming in. Very suspicious!"

The runbook advised Morgan that, in this situation, the first thing to try was to click the button to automatically roll back to the previously-deployed version. "Oh, very useful!" thought Morgan. "This saves me having to check out the code repo, figure out how to find the last known good version, and deploy it from the command line, or having to do it via the CI/CD platform. Great. Click!"

It took a few minutes for the deploy to complete, while Morgan anxiously refreshed the service dashboard, but one by one, red lights started to turn green, and the graphs started returning to normal. "This is looking good!" Morgan thought. "What do I need to do now?"

That wasn't too bad...

The checklist said that if the rollback fixed the problem, they should first click a button to switch traffic back to the primary service, cutting out the backup, and then add their own notes to the incident report explaining what they did and what happened. The report would go automatically to Morgan's team lead for review the next day, and then to the whole team for review in their weekly meeting. The other team responsible for the process-payments service would be automatically notified, too, and a ticket would be opened for them to identify what went wrong and fix it; in the meantime, new deploys would be blocked, to prevent someone inadvertently putting the buggy version back into production.

With this done, and the incident officially closed, Morgan could finally relax. "That wasn't anywhere near as scary as I'd thought it might be," they mused. "The runbook took a lot of the fear out of the process, and it gave me something to focus on and a clear sequence of actions to take. Even better, most of the actions were automated, so that all I had to do was trigger them. I suppose every time something like this happened in the past, the team learned from it, updated the runbook with information about it, and built a little more automation to help fix the issue. That's a very enlightened way to run your web operations, now I come to think of it."

The terror

Suddenly a harsh beeping jolted Morgan out of their reverie.

"Service process-payments is CRITICAL."

With mounting horror, Morgan realized that there was no runbook link attached. It had all been a dream... and now Morgan's true night of terror was about to begin.