How do you measure and improve the effectiveness of your incident responses? You can start by looking at the times associated with your responses. You can set targets for these times, and evaluate how well a given incident response met your targets. Over multiple incidents, you may be able to identify trends, and take steps to tune your response methods based on those trends.
When I’m reviewing an incident, I try to identify 6 key timestamps, and then consider the relationships between them:
- Start Time is when your customers or users first started being impacted, whether they noticed it or not. This time usually has to be determined after the fact, through log analysis. It can sometimes be tricky to determine this time, as the problem might not have affected all customers, or it might have been something that gradually grows over time (a system getting progressively slower, for example), but you need to drive a stake in the ground somewhere.
- Detection Time is when your monitoring systems first detected the problem. This is the earliest that you could have alerted someone to the problem, though you might have delayed the alert for a while, to avoid noisy alerts.
- Alert Time is when the system actually alerted a human about the problem. We can debate what “alerted” means; was it when the first ticket was filed, even if the ticket didn’t generate a page? Was it when someone was paged? Was it when someone acknowledged the page? Pick a definition that makes sense for your circumstances, and stick to it.
- Response Time is when a person first started actively working on the problem. This means someone with their hands on the keyboard and their eyes on a screen, logged in and actively working the problem. Note that this is not the time at which someone acknowledged the alert, unless they were able to immediately begin working on the problem after acknowledging it. There might be a considerable amount of time between “Alerted” and “Responded”, while somebody finds their laptop, gets connected, gets their VPN set up, gets logged in, etc.; you want to know that.
- Mitigation Time is when the problem was resolved from the customer’s point of view; when they stopped being affected by it, or stopped seeing symptoms. You may still be working on the problem, but it’s no longer apparent to the customer.
- Resolution Time is when the incident response is “finished” from the responder’s point of view. The problem is solved, a permanent-enough fix is in place, and they can step away from the keyboard and go back to sleep. There might still be followup work to do (bugs to fix, etc.) after the Resolution Time, but basically, this is when the system is back to a normal, functioning state.
Given those timestamps, consider that:
- Customer Impact is from Start Time to Mitigation Time. This is the duration of the problem, from the customer’s point of view.
- Company Impact is from Alert Time to Resolution Time. This is the duration of the incident response, from the company’s point of view. It starts at Alert Time, because that’s when this problem first caused the company to take any action to address this situation.
Notice that the customer’s view of the duration of the incident or outage differs from the company’s view; that is, Customer Impact can differ significantly from Company Impact. At the beginning, customers (or at least some of them) are generally feeling an impact before the company notices the problem, and before the company does anything to address it; that’s the difference between Start Time and Alert Time. At the end, the company is often still working on the problem even after it’s “fixed” from the customer’s point of view; that’s the difference between Mitigation Time and Resolution Time.
In many circumstances, the Mitigation Time and the Resolution Time might be identical. This is true if there are no further steps to take beyond the mitigation, to return the system to a normal, functioning state.
You can use these timestamps to determine several useful intervals:
- Time to Detect is the difference between Start Time and Detection Time. If that interval is too long, you can shorten it by improving your monitoring, and in particular by focusing your monitoring on the customer’s experience. For example, your customer doesn’t care what the load average on your web server is, or how much free memory it has; your customer cares about how long it takes to load the web page they’re visiting, so focus your monitoring on measuring page load times.
- Time to Alert is the difference between Detection Time and Alert Time. If that interval is too long, you can shorten it by improving your alerting system. Beware, however, that you don’t want to create noisy alerts which generate a lot of false positives; that will create alert fatigue, where your responders are run ragged by their pagers, and can’t immediately tell the difference between bogus alerts and serious alerts. This is like your home smoke detector going off when you cook something; if it happens too often, you get annoyed and take the batteries out of the smoke detector, which is obviously bad.
- Time to Respond is the difference between Alert Time and Response Time. If that interval is too long, you can shorten it by improving your team’s response expectations. This may involve paying them more, if they’re on a tighter oncall schedule. There’s a big difference a 30-minute response expectation and a 5-minute response expectation; with a 30-minute response expectation, you can still go out to dinner or a movie, as long as you bring your laptop and are prepared to step out if you get paged, while with a 5-minute response expectation, even a trip to the bathroom might be a challenge.
- Time to Mitigate is the difference between Start Time and Mitigation Time. This is the duration of the incident from your customer’s point of view. If you mitigate a problem, it’s still a problem that you’re working on (perhaps urgently), but the problem has been solved from the customer’s point of view, and they can get on with their work.
- Time to Resolve is the difference between Alert Time and Resolution Time. This is the duration of the incident from your company’s point of view. This is the time during which the incident has an effect on actions taken by your team; it begins at Alert Time (rather than Start Time), because Alert Time is the first point at which your team does something tangible (i.e., respond to the alert) to address the incident.
It’s a good idea to set targets for these various intervals, and to track your team’s performance against those targets across multiple incidents. As part of the blameless postmortem review for a particular incident, you should consider how well you did against your targets. By looking at multiple incidents, you can identify trends or common factors, and make adjustments accordingly to your standards and practices.
The intervals that you have the most control over are Time to Detect, Time to Alert, and Time to Respond. You would expect those should be fairly consistent from incident to incident, and it’s obvious what steps you can take to shorten each of them, if needed (just beware of the impact of false positives and alert fatigue, if you try to tighten them too much).
Time to Mitigate and Time to Resolve are harder to predict and tend to be much less consistent from one incident to the next. What steps you can take to mitigate and resolve the incident depends on the nature of the incident, and of the responders. There’s less statistical validity in comparing these across multiple incidents, unless the incidents are all similar in type and scope.
One thing to consider about Time to Mitigate and Time to Resolve is, how long did it take to get the right people involved in the response? You might find, for example, that what initially got reported as an application problem, turned out to be a database problem, which itself turned out to be a storage problem, which finally turned out to be a networking problem. Each of these levels might take 30 minutes or more to work through, as you page a particular team, they respond, they investigate, and they determine that the problem is deeper in the technology stack. When you finally get ahold of the last team, they might identify and solve the problem within 5 minutes, but you took 2 hours to get to them.
You can improve Time to Mitigate and Time to Resolve by alerting a broader swath of teams earlier in the response, and essentially “swarming” the response. That way, when your apps expert needs a database expert, or your database expert needs a storage expert, and so forth, they’re already engaged in the response. The downside of this is, on many incidents those various specialists aren’t going to be needed, and you’re wasting their time and disrupting their other work by looping them in unnecessarily; this has both productivity and burnout consequences. You have to decide which is more important for your company: minimizing customer impact (which means swarming the problem and wasting responder time, just in case they’re needed on a particular incident) or minimizing company impact (which means longer customer impact, as you involve your experts in sequence, only if they’re really needed for the particular incident).
There are no perfect answers here, but rather a series of tradeoffs. It’s better to consider these tradeoffs and make explicit decisions about them when you’re not in the middle of a crisis. Blameless postmortems are often a good opportunity to revisit and review the tradeoffs that you’ve chosen, and make adjustments if appropriate.
Want to learn more about how to manage your outages and other IT emergencies more effectively, and thereby reduce all of the times discussed above? Come to my Mastering Outages One-Day Class! There’s one scheduled for Friday 18 May 2018 in the San Francisco Bay Area; full details and registration are at greatcircle.com/class, and you can save $100 by using discount code “Measure”.