A formal incident response process is extremely important, yet many organizations overlook it. No matter what industry you're in, if a well-defined incident response process doesn't exist or isn't used, the impact of an incident could be significant.
In this two-part series, I will detail what Uberflip has found to be a highly efficient and effective process for dealing with incidents, which we are currently implementing.
What is an Incident?
One of the first and most important steps of this process is to define what counts as an incident. Classification matters because you don't want an incident created for every little issue. That leads to incident overload or desensitization, which in turn causes truly important incidents to be neglected.
We use the following criteria to help guide whether something is an incident:
- When a component-level failure occurs and requires human intervention to address it.
- When a service's availability or performance is negatively affected in production and human intervention is required quickly.
- When an event related to security or business continuity occurs, e.g. data loss, actual or suspected network/system breach, security vulnerability exploitation or discovery, active attack, etc.
To me this definition seems quite concise and comprehensible for everyone in our organization. We've also clarified with our teams that the majority of bugs aren't considered incidents under the second point above.
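The three criteria above can be sketched as a small predicate. This is only an illustration, not Uberflip's actual tooling: the `Event` class, its field names, and `is_incident` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical description of something that went wrong."""
    component_failure: bool = False       # a component-level failure occurred
    production_degraded: bool = False     # availability or performance hurt in production
    needs_human: bool = False             # a person has to intervene
    urgent: bool = False                  # that intervention is required quickly
    security_or_continuity: bool = False  # data loss, breach, exploit, active attack, ...


def is_incident(event: Event) -> bool:
    """Return True if the event meets any one of the three criteria."""
    # Criterion 1: component-level failure that requires human intervention.
    if event.component_failure and event.needs_human:
        return True
    # Criterion 2: production availability/performance hit needing a human quickly.
    if event.production_degraded and event.needs_human and event.urgent:
        return True
    # Criterion 3: any security or business continuity event.
    return event.security_or_continuity


# An ordinary bug: someone must fix it, but production is not degraded.
print(is_incident(Event(needs_human=True)))  # False
# A degraded production service that needs someone right away.
print(is_incident(Event(production_degraded=True, needs_human=True, urgent=True)))  # True
```

Encoding the definition this way makes the "most bugs are not incidents" rule explicit: a bug only qualifies when it degrades production and needs quick human attention.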
Once you've created a simple definition of an incident, you're ready to empower others.
At Uberflip we've been promoting and encouraging ownership heavily within the product team. We're transitioning to a model where individuals and small teams have end-to-end responsibility over what they've built, including operations and incident response.
This is where the importance of individual or team-based alerting comes in. Instead of a single person or ops group handling all incidents, imagine having specific individuals or small teams handle incidents that relate to their services.
When the team that built a service handles all aspects of operating it, incidents are more likely to be resolved quickly and correctly, because that team knows its own service in greater depth than a general ops group does.
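A minimal sketch of team-based alert routing might look like the following. The service names, team names, and the fallback to a general ops group are assumptions made up for this example; they do not describe Uberflip's real setup or any particular alerting product.

```python
# Hypothetical ownership map: service name -> team that built and operates it.
SERVICE_OWNERS = {
    "web-frontend": "team-frontend",
    "billing-api": "team-billing",
}

# Assumption: alerts for services with no registered owner go to a general ops group.
FALLBACK_TEAM = "team-ops"


def route_alert(service: str) -> str:
    """Return the team that should be paged for an alert on the given service."""
    return SERVICE_OWNERS.get(service, FALLBACK_TEAM)


print(route_alert("billing-api"))     # team-billing
print(route_alert("legacy-cron-job")) # team-ops
```

The point of the lookup is that the page lands with the people who know the service best, rather than with whoever happens to be on a single shared rotation.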
A Team Based Exercise
This approach also distributes on-call duty more evenly across the entire product team, so any one individual is woken up less often.
Another very important concept, borrowed from firefighting and emergency management, is the Incident Commander (IC) role. A key takeaway is that the first person who responds to the incident is in charge of that incident. This matters because, in a time of potential chaos, a single person needs to lead the team in bringing things back to the way they were. Once the IC is in charge of the incident, it is imperative that the steps they follow are very simple.
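The "first responder is in charge" rule is simple enough to express in a few lines. The `Incident` class and its `acknowledge` method below are hypothetical names for this sketch, not part of any real incident management tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record used only to illustrate the IC rule."""
    title: str
    commander: Optional[str] = None
    responders: List[str] = field(default_factory=list)

    def acknowledge(self, person: str) -> str:
        """Record a responder and return the current Incident Commander.

        The first person to acknowledge becomes the IC; later responders
        join the incident but do not replace the IC.
        """
        if person not in self.responders:
            self.responders.append(person)
        if self.commander is None:
            self.commander = person
        return self.commander


incident = Incident("Web servers responding slowly")
print(incident.acknowledge("alice"))  # alice
print(incident.acknowledge("bob"))    # alice
```

Having exactly one commander recorded on the incident removes any doubt about who is leading the response when several people join at once.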
Finally, the freedom to make technology choices coupled with being responsible for the effects of those choices is a great motivator - it helps ensure that individuals will do everything necessary to minimize being woken up when on-call. As the saying goes, "if you break it, you bought it"... or in our case, "own it".
K.I.S.S. - Keep it Simple Stupid
In a stressful situation or when you're half-asleep, checklists and runbooks are invaluable tools. The idea is simple: write down how to solve these incidents while you are in the right frame of mind, so that in a hairy predicament you can better manage the situation.
Uberflip is considering VictorOps' solution for checklists and runbooks. The concept is quite simple: using the VictorOps Transmogrifier, we are able to upload checklists and runbooks that pertain to specific incident types. For example, imagine your monitoring software has detected that your web servers are completing HTTP requests in under one second only 95% of the time, and you are alerted about this incident at 01:30.
Wouldn't it be wonderful if your alerting system also then presented what you could do to solve the problem? That hypothetical scenario is where automation comes into play.
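To make the example concrete, here is a rough sketch of an alert that carries its own checklist. It does not use the VictorOps API. The 99% target, the `RUNBOOKS` table, the checklist steps, and the function names are all assumptions invented for illustration; only the one-second threshold and the 95% figure come from the scenario above.

```python
from typing import Dict, List, Optional

# Hypothetical runbook store: incident type -> checklist steps (placeholder text).
RUNBOOKS: Dict[str, List[str]] = {
    "slow_http": [
        "Check whether a deploy went out recently.",
        "Check web server load and connection counts.",
        "Escalate to the owning team if not resolved.",
    ],
}


def fraction_under(latencies_s: List[float], threshold_s: float = 1.0) -> float:
    """Fraction of requests that completed in under threshold_s seconds."""
    if not latencies_s:
        return 1.0  # no traffic, so nothing is slow
    fast = sum(1 for t in latencies_s if t < threshold_s)
    return fast / len(latencies_s)


def build_alert(latencies_s: List[float], target: float = 0.99) -> Optional[dict]:
    """Return an alert with its runbook attached, or None if the target is met.

    The 0.99 default target is an assumed value for this sketch.
    """
    fraction = fraction_under(latencies_s)
    if fraction >= target:
        return None
    return {
        "type": "slow_http",
        "message": f"Only {fraction:.0%} of HTTP requests completed in under 1s",
        "runbook": RUNBOOKS["slow_http"],
    }


# 95 fast requests and 5 slow ones reproduce the 95% scenario.
alert = build_alert([0.2] * 95 + [2.5] * 5)
print(alert["message"])  # Only 95% of HTTP requests completed in under 1s
for step in alert["runbook"]:
    print("-", step)
```

The person woken at 01:30 receives the symptom and the first steps to try in the same message, instead of having to hunt for documentation.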
In part two of this series, I will cover tools, automation, and knowing where to go.
About the Author: Rob Damiano