By design, DevOps is supposed to minimize the risk of errors and mitigate their impact when they occur. But mistakes are still inevitable.
A process flow and playbook can’t cover everything that can go wrong. Errors affect every company, from the one-person startup to AWS, the largest cloud vendor. In fact, just a few months back, a massive AWS S3 outage took down a big chunk of the Internet’s largest websites. AWS reported that it was caused by a mistake during routine debugging of some of the S3 servers: a single incorrect command caused an outage in the entire US-EAST-1 region of AWS. Even with their best efforts, it took AWS about four hours to restore service.
How you respond to failures in continuous delivery will make all the difference between successful and struggling DevOps teams. Let’s discuss the key elements of great incident management in DevOps.
Stateless services build resilience
One organization known for its efforts to build resilient systems is Netflix. They were largely unaffected by the AWS S3 outage, even though they use US-EAST-1-based S3 storage. They did experience slower performance, but that is far better than the many organizations whose websites and apps went down completely. Netflix attributes this resilience to stateless services.
Netflix’s app architecture relies heavily on the microservices model and, as a result, is highly distributed. They use three AWS regions (Northern Virginia, Oregon, and Ireland) to serve traffic globally, routing each user to the closest region. Proximity decides which region normally handles a request, but a request can be served by another region when needed, whether because of a traffic spike in one region or a large-scale failure in one of them, which is what happened in the case of the S3 failure. Netflix keeps its services stateless by storing state in a cache using EVCache, a tool they’ve developed and open-sourced. This means that when a server fails in one region, a server in another region can take over its requests. Since the system is autoscaled, this replication happens automatically.
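To make the idea concrete, here is a minimal sketch of a stateless request handler with regional failover. It is not Netflix’s implementation: `Region`, `SharedCache`, `pick_region`, and `handle_request` are hypothetical names, and `SharedCache` is a plain in-memory dictionary standing in for an external, replicated cache such as EVCache (it does not use EVCache’s real API).

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical serving region; `healthy` would come from health checks."""
    name: str
    healthy: bool = True


class SharedCache:
    """Stand-in for an external, replicated cache. A plain dict, for illustration only."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


def pick_region(preferred, regions):
    """Return the preferred region if it is healthy, else the first healthy one."""
    by_name = {region.name: region for region in regions}
    if by_name[preferred].healthy:
        return by_name[preferred]
    for region in regions:
        if region.healthy:
            return region
    raise RuntimeError("no healthy region available")


def handle_request(user_id, preferred, regions, cache):
    """Serve a request without keeping any state on the server itself."""
    region = pick_region(preferred, regions)
    # All per-user state lives in the shared cache, so any region can continue
    # where another one left off.
    count = cache.get(user_id, 0) + 1
    cache.set(user_id, count)
    return region.name, count
```

If the preferred region is marked unhealthy between two calls, the second call is answered by the next healthy region and still sees the user’s previous state, because nothing was stored on the server that failed.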
Netflix adds a disclaimer that while this is a service-centric solution, they still have a lot of work to do to prepare for the failure of an entire availability zone. Following up on their popular open source tool Chaos Monkey, they’re teasing an upcoming Chaos Gorilla.
Identify performance bottlenecks
It’s not just outages: slow performance also affects user experience and, in turn, an application’s success. Every request goes through multiple servers and complex networks. There are many reasons performance can slow down, including DNS delays, redirects, and client rendering delays. Logs and debugging data should show these precise details, so you can quickly find the bottleneck and fix the issue.
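As a rough illustration of what “finding the bottleneck” means, the sketch below takes per-phase timings for a single request and reports the phase that consumed the largest share of the total. The function name `find_bottleneck` and the phase names are assumptions made for this example; real timing data would come from your own logs or browser timing APIs.

```python
def find_bottleneck(timings_ms):
    """Return (phase, share_of_total) for the slowest phase of a request.

    `timings_ms` maps a phase name (hypothetical, e.g. "dns") to its
    duration in milliseconds.
    """
    total = sum(timings_ms.values())
    if total <= 0:
        raise ValueError("timings must sum to a positive duration")
    # The phase with the largest duration is the first place to look.
    phase = max(timings_ms, key=timings_ms.get)
    return phase, timings_ms[phase] / total


# Example input: made-up numbers for one page load.
sample = {"dns": 40, "redirect": 310, "connect": 50, "server": 120, "render": 280}
slowest_phase, share = find_bottleneck(sample)
```

A single request proves little on its own. In practice you would aggregate many requests and look at which phase dominates consistently.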
LinkedIn has a tool named BOSS (BOttlenecks for Site Speed) that analyzes bottlenecks and identifies the exact root cause of a slowdown. It works only on the client side, and LinkedIn plans to have a similar solution for the server side as well. They report that so far they’ve had many success stories with BOSS. Performance bottlenecks are easy to ignore and hard to spot, but managing them effectively leads to a better user experience and a lower chance of small issues escalating into something bigger.
ChatOps gets everyone on the same page
When incidents happen, communication is key to resolving the issue fast and returning services to normal. For DevOps, the best way to facilitate communication during a crisis is to use ChatOps. Tools like Slack and HipChat enable teams to communicate in chat rooms in real time, and bring in relevant information that can aid in resolving the incident.
The key to successful ChatOps is creating a dedicated “war room” or “hot room” where you gather all relevant people in the same place. You can save time by integrating your monitoring tools with your ChatOps tool. This way you can bring all relevant information into your chat room automatically, and ensure everyone knows what’s happening in the moment. Unlike the traditional approach, in which a lot of time goes into diagnosing the issue, ChatOps lets the team jump right into fixing the problem and getting services back up and running.
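One common way to wire a monitoring tool into a chat room is an incoming webhook: the monitoring side sends an HTTP POST with a small JSON body, and the message appears in the room. The sketch below assumes a Slack-style incoming webhook that accepts a JSON object with a `text` field. The alert dictionary shape and the function names are hypothetical, and the webhook URL would be one you create for your own room.

```python
import json
import urllib.request


def build_message(alert):
    """Turn a monitoring alert (a hypothetical dict shape) into a chat payload."""
    text = "[{severity}] {service}: {summary}".format(
        severity=alert["severity"].upper(),
        service=alert["service"],
        summary=alert["summary"],
    )
    return {"text": text}


def post_to_chat(webhook_url, payload, timeout=5):
    """POST the payload as JSON to an incoming-webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping message formatting separate from sending makes the formatting easy to test without touching the network, and lets you swap the chat tool later by changing only `post_to_chat`.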
The sheer number of alerts triggered during an incident can lead to alert spam, where irrelevant alerts drown out the important ones. To prevent alert fatigue, some ChatOps tools let you group similar alerts together so they’re easier to read and act on. HipChat has a plugin, BigPanda, that does this.
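Grouping can be as simple as collapsing alerts that share the same service and check into one summary line. The sketch below shows that idea only. It is not how BigPanda or any particular plugin works, and the alert fields (`service`, `check`, `host`) are assumptions for this example.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse alerts sharing (service, check) into one summary line each."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["service"], alert["check"])].append(alert["host"])
    # One line per group: how many alerts fired, and on which distinct hosts.
    return [
        "{}/{}: {} alert(s) on {}".format(
            service, check, len(hosts), ", ".join(sorted(set(hosts)))
        )
        for (service, check), hosts in sorted(groups.items())
    ]
```

Four raw alerts, three of them latency alerts for the same service, would then show up in the room as two lines instead of four.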
In summary, DevOps is not without faults. In fact, the faster you move, the more you’ll break things. But when things do break, you need a plan of action to respond to the crisis. Additionally, building resilience into your system from the start will go a long way to ensure it holds up even under large-scale failures. No organization is free from failures and incidents, but any organization can be better prepared to handle issues when they do occur.
About the Author
Twain Taylor began his career at Google, where, among other things, he was involved in technical support for the AdWords team. His work involved reviewing stack traces, resolving issues affecting both customers and the Support team, and handling escalations. Later, he built branded social media applications and automation scripts to help startups better manage their marketing operations. Today, as a technology journalist, he helps IT magazines and startups change the way teams build and ship applications.