In the first part of this series, we introduced incident response, defined incidents, and covered ownership and runbooks. Read on for Part 2 of The Best Practices For Incident Response Management.
Automate, Automate, Automate… and Integrate!
We saw how beneficial checklists and runbooks can be in the incident response process, but what if they could run automatically, without our intervention? Full or partial automation of steps in the incident process is great, especially when time is of the essence and you've been woken up at 1:30 in the morning to solve a problem half asleep.
What's in Uberflip's Automation Toolkit
The first tool that's absolutely necessary for incident response automation is monitoring. At Uberflip we use Zenoss, but there are many other great alternatives for both component-level and service-level monitoring. Zenoss allows us to monitor all of our infrastructure components and services.
To reliably notify the appropriate individuals when incidents occur, it's essential to have an alerting/paging tool, such as PagerDuty or VictorOps. Paging tools not only notify the appropriate individuals but also verify that each individual has been reached and is actually working on the incident. Uberflip chose VictorOps because of the automation tools it makes available. VictorOps integrates with Slack and other IM systems, so all correspondence is captured and searchable for future use. It also integrates with status page tools such as StatusPage.io to automatically update customers on the status and availability of services.
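To make the "verify that the individual has been reached" idea concrete, here is a minimal Python sketch of an escalation chain. It is not how VictorOps or PagerDuty work internally; the `Responder` class, the `ack_after_seconds` field, and the timeout are hypothetical stand-ins for a real paging tool's acknowledgements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Responder:
    name: str
    # Hypothetical stand-in for a real pager reply: seconds until this person
    # acknowledges the page, or None if they never respond.
    ack_after_seconds: Optional[int] = None


def page_until_acknowledged(chain: List[Responder], timeout_seconds: int) -> Optional[str]:
    """Page responders in order and return the first one who acknowledges in time."""
    for responder in chain:
        delay = responder.ack_after_seconds
        if delay is not None and delay <= timeout_seconds:
            return responder.name  # reached, and confirmed to be on the incident
        # No acknowledgement within the timeout: escalate to the next person.
    return None  # nobody acknowledged


on_call = [Responder("primary"), Responder("secondary", ack_after_seconds=120)]
print(page_until_acknowledged(on_call, timeout_seconds=300))  # prints "secondary"
```

The point of the sketch is the loop: a page only counts once someone acknowledges it, and silence moves the alert down the chain.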
An interesting automation that we're considering is an integration with our ticketing system, Redmine. Many companies use ticketing systems such as Redmine, Bugzilla, or JIRA, and having tickets created automatically for incidents that aren't resolved is a compelling concept.
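As a rough illustration of that concept, the sketch below picks out incidents that are still open after a grace period and builds a ticket payload for each. Everything here is an assumption for the sake of the example: the `Incident` class, the 30-minute grace period, and the payload keys are hypothetical, and nothing is actually sent to Redmine or any other ticketing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Incident:
    incident_id: str
    summary: str
    opened_at: datetime
    resolved_at: Optional[datetime] = None  # None means still open


def tickets_for_unresolved(
    incidents: List[Incident],
    now: datetime,
    grace: timedelta = timedelta(minutes=30),  # assumed threshold, tune to taste
) -> List[Dict[str, str]]:
    """Build one ticket payload per incident still open after the grace period."""
    tickets = []
    for incident in incidents:
        if incident.resolved_at is None and now - incident.opened_at >= grace:
            tickets.append({
                "subject": f"[incident {incident.incident_id}] {incident.summary}",
                "description": f"Opened {incident.opened_at.isoformat()}, still unresolved.",
            })
    return tickets
```

In a real integration the returned payloads would be handed to the ticketing system's own API; keeping the selection logic separate makes it easy to test without one.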
Finally, another automation relates to creating post-mortems. A post-mortem is an examination of the incident in the hope of finding its root causes. It would be virtually impossible to fully automate creating a post-mortem, but again VictorOps provides a tool that automates a good portion of it. The concept of a post-mortem is simple: gather the correspondence that happened around the incident, create a root cause analysis, and tag the post-mortem for future reference.
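The part of a post-mortem that lends itself to automation is the gathering. The Python sketch below is one possible shape for it, not the VictorOps tool: it collects the messages sent during the incident window, leaves the root cause analysis blank for a human, and attaches tags. The `Message` class and the dictionary keys are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Message:
    sent_at: datetime
    author: str
    text: str


def draft_post_mortem(
    messages: List[Message], start: datetime, end: datetime, tags: List[str]
) -> Dict[str, object]:
    """Assemble a post-mortem draft from correspondence inside the incident window."""
    in_window = sorted(
        (m for m in messages if start <= m.sent_at <= end),
        key=lambda m: m.sent_at,
    )
    return {
        "correspondence": [f"{m.sent_at:%H:%M} {m.author}: {m.text}" for m in in_window],
        "root_cause_analysis": "",  # deliberately empty: this part needs a person
        "tags": sorted(set(tags)),  # de-duplicated so the draft is searchable later
    }
```

The empty `root_cause_analysis` field reflects the limit described above: the correspondence can be collected mechanically, but the analysis cannot.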
The Right Tools for the Job
The expression "right tool for the job" is just as applicable in incident response. Below is a list of tools that Uberflip uses to ease the incident response process, along with some popular alternatives:
Monitoring: Zenoss, Nagios, Cacti, Munin, Zabbix, New Relic
Paging/alerting: VictorOps, PagerDuty, OpsGenie
Status pages for communicating incidents to external parties: StatusPage.io, StatusHub.io, Status.io
Ticketing: Redmine, Bugzilla, JIRA, Mantis
Instant messaging: Slack, HipChat, Skype
The last tool that I want to mention is integral to the incident response process but easily overlooked: the telephone (remember those?).
The phone cannot be beat when it comes to ease-of-use, speed, reliability, and most importantly its ability to wake you up at an ungodly hour in the morning.
When in doubt about how to reach a team member to help with an incident, call them. This uncertainty in the incident response process leads to another best practice: know where to go.
Know Where to Go
Nothing is worse or more destructive during the incident response process than individuals not knowing where to get information. It leads to a lot of unnecessary questions, which can distract the Incident Commander (IC) and cause confusion and delay.
The best thing to do is to document, in advance, where individuals should go to obtain information. At Uberflip each person involved in incident response knows exactly where to go to get this information. The IC uses the monitoring tools to gain insight into the problem. They are also provided with checklists and runbooks to put out the fire.
Lastly, the IC can reference post-mortems to help find the solution to the problem. Our customer success team and the developers involved in the incident know that all correspondence is available in a specific place in Slack. In the near future we plan to direct customers to an externally hosted status page to gain insight into the current and historical availability and performance of Uberflip services.
Knowing where to go is just as important as knowing what to say, which leads us to our next best practice of communication.
As in any good relationship, constant, relevant communication is necessary. You need to ask yourself these three questions when considering communication: "who is my audience?", "what do they need to know?", and "what is the best means of letting them know?".
You need to know who your audience is so that you can choose a tone and technical level they will understand. If you are too technical with your customers, it will only cause more confusion and questions, which will eat up more of your time.
Knowing what your customers need to know
It is hard to answer the question of what the stakeholders need to know, but at least in the case of enterprise SaaS delivery, customers may want to see current and historical availability and performance information about your services, brief descriptions of incidents/impact, estimates on service restoration times during incidents, and what was done to minimize the potential of an incident happening again.
Your customer success team (or equivalent) may want to see everything that the customer sees, as well as a simplified root cause analysis, and more in-depth resolution details. DevOps like to see what the customer sees, what the customer success team sees, as well as low-level details, tickets that were created if the incident isn't resolved immediately, a robust root cause analysis, and a complete post-mortem.
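The three audiences described above nest inside one another, which can be sketched as layered sets of fields. This is only an illustration: the field names are hypothetical labels for the items listed in the text, and a real report would carry whatever structure your own tooling produces.

```python
from typing import Dict

# Field names are hypothetical labels for the items listed in the text.
CUSTOMER_FIELDS = {
    "availability_history",
    "incident_summary",
    "estimated_restoration",
    "preventive_actions",
}
# Customer success sees everything the customer sees, plus more detail.
CUSTOMER_SUCCESS_FIELDS = CUSTOMER_FIELDS | {"simplified_root_cause", "resolution_details"}
# DevOps sees everything customer success sees, plus the low-level material.
DEVOPS_FIELDS = CUSTOMER_SUCCESS_FIELDS | {
    "low_level_details",
    "tickets",
    "full_root_cause",
    "post_mortem",
}

FIELDS_BY_AUDIENCE = {
    "customer": CUSTOMER_FIELDS,
    "customer_success": CUSTOMER_SUCCESS_FIELDS,
    "devops": DEVOPS_FIELDS,
}


def view_for(audience: str, report: Dict[str, str]) -> Dict[str, str]:
    """Return only the parts of an incident report meant for this audience."""
    allowed = FIELDS_BY_AUDIENCE[audience]
    return {key: value for key, value in report.items() if key in allowed}
```

Building each set from the previous one keeps the rule "each team sees at least what the customer sees" true by construction.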
I know I've mentioned root cause analysis many times in this article. It's the last topic we'll tackle in Incident Response Management.
Root Cause Analysis
Without getting too in-depth on what belongs in a root cause analysis (because that's enough to fill an entire article), I'd like to share a few items that help me while I'm writing mine. The most useful tool for me is a root cause analysis template.
In my opinion, the most important sections of this document are the event description, a timeline (or at least the start and end times of the incident), the root causes, and the corrective actions. If you're able to complete those fields, that should be sufficient when you have to look back at the root cause analysis in the future.
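If you keep your template in code rather than in a document, those key sections could look something like the Python sketch below. This is not the template itself, just one hedged way to model it; the class and field names are hypothetical, and `missing_sections` is an assumed helper for spotting what is still blank.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class RootCauseAnalysis:
    event_description: str = ""
    started_at: Optional[datetime] = None
    ended_at: Optional[datetime] = None
    root_causes: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """List the key sections that have not been filled in yet."""
        checks = {
            "event_description": bool(self.event_description.strip()),
            "started_at": self.started_at is not None,
            "ended_at": self.ended_at is not None,
            "root_causes": bool(self.root_causes),
            "corrective_actions": bool(self.corrective_actions),
        }
        return [name for name, filled in checks.items() if not filled]
```

An empty list from `missing_sections()` means the analysis has the minimum it needs to be useful when someone reads it months later.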
As VictorOps says, "being on-call sucks".
I hope that this article can act as a guide to help you create or improve your incident response process. It isn't meant to be an all-encompassing recipe for how to do incident response. I'm sure that there are better ways to do some of the practices listed here, but right now this process is working wonders for Uberflip. Since this process works for us, I hope it works for you, because nothing would make me happier than you waking up at 01:30 and it sucking a lot less.
Now that you know how we tackle Incident Response Management here at Uberflip, how do you go about it?
About the Author: Rob Damiano