In one of the Slack channels that I frequent, someone recently asked what a reasonable duty cycle is for engineers in a 24/7 on-call rotation with a single digit number of pages per week. In other words, under those circumstances, is it reasonable for a given individual to be on call one week in three, one week in four, or what? At least according to Google SRE, “reasonable” is a lot less often than that.
It depends a lot on when those pages occur, and how disruptive they are to your work and your life. If they’re all during your normal workday, and each takes only a few minutes to deal with, it’s not a big deal (though if they take only a few minutes to deal with, can’t you automate them away?). On the other hand, if you’re being awakened every night by pages, and each page takes hours to deal with, even a “single digit number of pages per week” can be a huge burden.
It also depends somewhat on how quickly you are expected to respond to each page. How long is it, from the time you are paged, until you are expected to be hands-on-keyboard, connected, logged in, and actively working on the problem? If that’s 30 minutes, then you can still have most of a life even when on call; you can go to dinner, go to the movies, and so forth, as long as you bring your laptop and mobile hotspot, and you’re prepared to step out if you get paged. If your response expectation is 5 minutes, on the other hand, you’re pretty much tied to your home or office; you can step out long enough for a trip to the laundry room to change loads, but that’s about all.
At Google, the two standard response levels for SRE on-call rotations are 30 minutes and 5 minutes, and a “single digit number of pages per week”, or slightly more, is typical. On-call SREs at Google are paid for the time they spend on call that falls outside of their normal workday; for a 30-minute rotation, they’re paid at 33% of their normal rate, and for a 5-minute rotation, at 66%. In order to prevent burnout, the amount of on-call bonus pay is capped at 15% of base salary; this is a constraint on managers, who are strongly expected to keep all their people under this 15% cap, and must seek VP-level approval to exceed it (and had better have a damned good explanation of why this is a temporary situation that will be fixed within the next quarter). If you work backwards through the constraints, you find that it takes a 7-person rotation to provide 30-minute on-call coverage 24/7 without anybody exceeding the cap, and a 14-person rotation to provide 5-minute coverage. And therefore, any given individual would be on call no more than one week in seven for a 30-minute response, or one week in fourteen for a 5-minute response.
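That back-of-the-envelope arithmetic can be sketched in a few lines of code. This is my own reconstruction, assuming a 40-hour normal workweek (so 128 off-hours per on-call week, paid at the stated fraction of the normal hourly rate); Google’s internal accounting may differ in the details:

```python
# Rough check: how does the 15% cap on on-call bonus pay translate
# into a minimum rotation size? Assumptions (mine, not Google's exact
# accounting): a 40-hour normal workweek, so 168 - 40 = 128 off-hours
# per on-call week, paid at the given fraction of the normal rate.

def oncall_pay_fraction(rotation_size: int, pay_rate: float) -> float:
    """Fraction of base pay earned as on-call bonus, given that each
    person is on call one week in every `rotation_size` weeks."""
    off_hours_per_oncall_week = 168 - 40      # hours outside the normal workday
    bonus_hours = off_hours_per_oncall_week * pay_rate
    base_hours = 40 * rotation_size           # normal hours over the whole cycle
    return bonus_hours / base_hours

# A 7-person rotation at the 30-minute (33%) rate, and a 14-person
# rotation at the 5-minute (66%) rate, both land right at the 15% cap:
print(round(oncall_pay_fraction(7, 0.33), 3))   # ~0.151
print(round(oncall_pay_fraction(14, 0.66), 3))  # ~0.151
```

Under these assumptions, smaller rotations push the bonus fraction over the cap, which is exactly the constraint that forces rotations of 7 and 14 people.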
Google is perhaps an outlier, in this respect. Most companies are probably going to put their people on call more frequently than Google does, but they should be aware that they’re burning the candle at both ends. They’re going to disrupt the ongoing project-based work that their engineers are expected to be accomplishing, and they risk driving their engineers to on-call burnout.
Many companies also don’t explicitly pay their engineers for being on call, dismissing it as “just part of the job”, but I think that’s short-sighted. Being on call is a burden, and having to pay people for it helps keep it under control by providing an incentive for the company to invest in making its systems more stable, reliable, and automated. And in the end, the company, its customers, and its employees all benefit from that.
What do you do when the pager goes off for something that’s bigger than one person can handle? How do you coordinate an effective, efficient, timely, and scalable response? Learn how at my Mastering Outages One-Day Class! There’s one scheduled for Friday 18 May 2018 in the San Francisco Bay Area; full details and registration are at greatcircle.com/class, and you can save $100 by using discount code “On-Call”.
How do you measure and improve the effectiveness of your incident responses? You can start by looking at the times associated with your responses. You can set targets for these times, and evaluate how well a given incident response met your targets. Over multiple incidents, you may be able to identify trends, and take steps to tune your response methods based on those trends.
When I’m reviewing an incident, I try to identify 6 key timestamps, and then consider the relationships between them:
- Start Time is when your customers or users first started being impacted, whether they noticed it or not. This time usually has to be determined after the fact, through log analysis. It can sometimes be tricky to determine this time, as the problem might not have affected all customers, or it might have been something that gradually grows over time (a system getting progressively slower, for example), but you need to drive a stake in the ground somewhere.
- Detection Time is when your monitoring systems first detected the problem. This is the earliest that you could have alerted someone to the problem, though you might have delayed the alert for a while, to avoid noisy alerts.
- Alert Time is when the system actually alerted a human about the problem. We can debate what “alerted” means; was it when the first ticket was filed, even if the ticket didn’t generate a page? Was it when someone was paged? Was it when someone acknowledged the page? Pick a definition that makes sense for your circumstances, and stick to it.
- Response Time is when a person first started actively working on the problem. This means someone with their hands on the keyboard and their eyes on a screen, logged in and actively working the problem. Note that this is not the time at which someone acknowledged the alert, unless they were able to immediately begin working on the problem after acknowledging it. There might be a considerable amount of time between “Alerted” and “Responded”, while somebody finds their laptop, gets connected, gets their VPN set up, gets logged in, etc.; you want to know that.
- Mitigation Time is when the problem was resolved from the customer’s point of view; when they stopped being affected by it, or stopped seeing symptoms. You may still be working on the problem, but it’s no longer apparent to the customer.
- Resolution Time is when the incident response is “finished” from the responder’s point of view. The problem is solved, a permanent-enough fix is in place, and they can step away from the keyboard and go back to sleep. There might still be followup work to do (bugs to fix, etc.) after the Resolution Time, but basically, this is when the system is back to a normal, functioning state.
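To make the six timestamps concrete, here’s how they might be captured as a simple record per incident (a hypothetical sketch; the type and field names are mine, not a standard schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class IncidentTimeline:
    """The six key timestamps for one incident, in chronological order."""
    start: datetime       # customers first impacted (often found via log analysis)
    detection: datetime   # monitoring first detected the problem
    alert: datetime       # a human was actually alerted
    response: datetime    # someone began actively working the problem
    mitigation: datetime  # customers stopped seeing symptoms
    resolution: datetime  # responders finished; system back to normal
```

Recording these six fields consistently for every incident is what makes the interval analysis below the list possible.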
Given those timestamps, consider that:
- Customer Impact is from Start Time to Mitigation Time. This is the duration of the problem, from the customer’s point of view.
- Company Impact is from Alert Time to Resolution Time. This is the duration of the incident response, from the company’s point of view. It starts at Alert Time, because that’s when this problem first caused the company to take any action to address this situation.
Notice that the customer’s view of the duration of the incident or outage differs from the company’s view; that is, Customer Impact can differ significantly from Company Impact. At the beginning, customers (or at least some of them) are generally feeling an impact before the company notices the problem, and before the company does anything to address it; that’s the difference between Start Time and Alert Time. At the end, the company is often still working on the problem even after it’s “fixed” from the customer’s point of view; that’s the difference between Mitigation Time and Resolution Time.
In many circumstances, the Mitigation Time and the Resolution Time might be identical. This is true if there are no further steps to take beyond the mitigation, to return the system to a normal, functioning state.
You can use these timestamps to determine several useful intervals:
- Time to Detect is the difference between Start Time and Detection Time. If that interval is too long, you can shorten it by improving your monitoring, and in particular by focusing your monitoring on the customer’s experience. For example, your customer doesn’t care what the load average on your web server is, or how much free memory it has; your customer cares about how long it takes to load the web page they’re visiting, so focus your monitoring on measuring page load times.
- Time to Alert is the difference between Detection Time and Alert Time. If that interval is too long, you can shorten it by improving your alerting system. Beware, however, that you don’t want to create noisy alerts which generate a lot of false positives; that will create alert fatigue, where your responders are run ragged by their pagers, and can’t immediately tell the difference between bogus alerts and serious alerts. This is like your home smoke detector going off when you cook something; if it happens too often, you get annoyed and take the batteries out of the smoke detector, which is obviously bad.
- Time to Respond is the difference between Alert Time and Response Time. If that interval is too long, you can shorten it by tightening your team’s response expectations. This may involve paying them more, if they’re on a tighter on-call schedule. There’s a big difference between a 30-minute response expectation and a 5-minute response expectation; with a 30-minute response expectation, you can still go out to dinner or a movie, as long as you bring your laptop and are prepared to step out if you get paged, while with a 5-minute response expectation, even a trip to the bathroom might be a challenge.
- Time to Mitigate is the difference between Start Time and Mitigation Time. This is the duration of the incident from your customer’s point of view. If you mitigate a problem, it’s still a problem that you’re working on (perhaps urgently), but the problem has been solved from the customer’s point of view, and they can get on with their work.
- Time to Resolve is the difference between Alert Time and Resolution Time. This is the duration of the incident from your company’s point of view. This is the time during which the incident has an effect on actions taken by your team; it begins at Alert Time (rather than Start Time), because Alert Time is the first point at which your team does something tangible (i.e., respond to the alert) to address the incident.
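Putting those definitions together, the intervals fall out as simple differences between the timestamps. A minimal sketch (the names are my own, not a standard schema):

```python
from datetime import datetime

def incident_intervals(start, detection, alert, response, mitigation, resolution):
    """Derive the five intervals from the six key timestamps."""
    return {
        "time_to_detect":   detection - start,
        "time_to_alert":    alert - detection,
        "time_to_respond":  response - alert,
        "time_to_mitigate": mitigation - start,   # the customer's view of the outage
        "time_to_resolve":  resolution - alert,   # the company's view of the response
    }

# Example: an outage that started at 02:00, was detected at 02:04,
# paged a human at 02:05, was actively worked from 02:20,
# mitigated at 03:00, and resolved at 03:45.
t = lambda h, m: datetime(2018, 4, 1, h, m)
iv = incident_intervals(t(2, 0), t(2, 4), t(2, 5), t(2, 20), t(3, 0), t(3, 45))
print(iv["time_to_mitigate"])  # 1:00:00 -- customer impact
print(iv["time_to_resolve"])   # 1:40:00 -- company impact
```

Note how the example makes the asymmetry visible: the customer’s clock starts at Start Time, while the company’s clock starts at Alert Time.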
It’s a good idea to set targets for these various intervals, and to track your team’s performance against those targets across multiple incidents. As part of the blameless postmortem review for a particular incident, you should consider how well you did against your targets. By looking at multiple incidents, you can identify trends or common factors, and make adjustments accordingly to your standards and practices.
The intervals that you have the most control over are Time to Detect, Time to Alert, and Time to Respond. You would expect those to be fairly consistent from incident to incident, and it’s obvious what steps you can take to shorten each of them, if needed (just beware of the impact of false positives and alert fatigue, if you try to tighten them too much).
Time to Mitigate and Time to Resolve are harder to predict and tend to be much less consistent from one incident to the next. The steps you can take to mitigate and resolve an incident depend on the nature of the incident, and on who is responding. There’s less statistical validity in comparing these across multiple incidents, unless the incidents are all similar in type and scope.
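Tracking the more consistent intervals against targets across incidents can be as simple as comparing medians. A sketch of the idea (the target values and incident data here are made up for illustration, not benchmarks):

```python
from datetime import timedelta
from statistics import median

# Hypothetical targets for the intervals you control most directly.
targets = {
    "time_to_detect":  timedelta(minutes=5),
    "time_to_alert":   timedelta(minutes=2),
    "time_to_respond": timedelta(minutes=30),
}

# Measured intervals from three (fabricated) incidents.
incidents = [
    {"time_to_detect": timedelta(minutes=3),
     "time_to_alert": timedelta(minutes=1),
     "time_to_respond": timedelta(minutes=25)},
    {"time_to_detect": timedelta(minutes=8),
     "time_to_alert": timedelta(minutes=4),
     "time_to_respond": timedelta(minutes=40)},
    {"time_to_detect": timedelta(minutes=6),
     "time_to_alert": timedelta(minutes=2),
     "time_to_respond": timedelta(minutes=20)},
]

for name, target in targets.items():
    med = median(i[name] for i in incidents)
    status = "OK" if med <= target else "OVER TARGET"
    print(f"{name}: median {med} vs target {target} -> {status}")
```

With real incident data, a report like this in each postmortem cycle makes trends visible; in this fabricated example, Time to Detect is running over target, suggesting monitoring is the place to invest.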
One thing to consider about Time to Mitigate and Time to Resolve is, how long did it take to get the right people involved in the response? You might find, for example, that what initially got reported as an application problem, turned out to be a database problem, which itself turned out to be a storage problem, which finally turned out to be a networking problem. Each of these levels might take 30 minutes or more to work through, as you page a particular team, they respond, they investigate, and they determine that the problem is deeper in the technology stack. When you finally get ahold of the last team, they might identify and solve the problem within 5 minutes, but you took 2 hours to get to them.
You can improve Time to Mitigate and Time to Resolve by alerting a broader swath of teams earlier in the response, and essentially “swarming” the response. That way, when your apps expert needs a database expert, or your database expert needs a storage expert, and so forth, they’re already engaged in the response. The downside is that, on many incidents, those various specialists aren’t going to be needed, and you’re wasting their time and disrupting their other work by looping them in unnecessarily; this has both productivity and burnout consequences. You have to decide which is more important for your company: minimizing customer impact (which means swarming the problem and wasting responder time, just in case they’re needed on a particular incident) or minimizing company impact (which means longer customer impact, as you involve your experts in sequence, only if they’re really needed for the particular incident).
There are no perfect answers here, but rather a series of tradeoffs. It’s better to consider these tradeoffs and make explicit decisions about them when you’re not in the middle of a crisis. Blameless postmortems are often a good opportunity to revisit and review the tradeoffs that you’ve chosen, and make adjustments if appropriate.
Want to learn more about how to manage your outages and other IT emergencies more effectively, and thereby reduce all of the times discussed above? Come to my Mastering Outages One-Day Class! There’s one scheduled for Friday 18 May 2018 in the San Francisco Bay Area; full details and registration are at greatcircle.com/class, and you can save $100 by using discount code “Measure”.
“How do we keep senior managers from disrupting incident responses?” That audience question generated the strongest response last week at my workshop on Incident Command for IT at the fantastic USENIX SREcon18 Americas.
Senior management definitely has a critical role to play in incident response, but as soon as somebody asked that question, the room lit up; it seemed like all 200 people had tales to share about active incident responses that were inadvertently derailed by directors, executives, and other senior managers. It was clear that this was a significant source of frustration for incident responders and incident leaders in the room.
Incident management is about controlling chaos, and senior management can be a significant source of chaos during an incident, usually without meaning to be. Why is this so, and how can senior managers, incident leadership, and responders all work together to avoid this?
Senior management has a legitimate need for timely information about incident responses in progress. But if they simply barge in on the phone bridge or Slack channel that’s being used to coordinate the response, they disrupt the response by creating confusion about who is in charge of the response, which ultimately delays the resolution of the incident. It’s bad enough when the drive-by senior managers are simply asking for information; those requests take on a life of their own, as responders scramble to answer the questions, instead of continuing whatever they were doing at the direction of the incident leadership. It’s even worse when the senior managers start giving orders, usurping the roles of the incident commander and their leadership team; if the senior manager wants to do that, they should explicitly take over as incident commander (assuming they have the appropriate training), but that seldom seems to happen.
Senior managers can avoid this scenario by being aware that their mere presence in an incident response is going to make waves, and therefore being careful about when and where they appear. If they have questions about the incident response, they should address those questions privately to the incident commander, via direct message or other channels. They should not be asking questions of individual responders directly, or in the main communications channel of the response.
Incident commanders can address this issue by recognizing that senior managers have legitimate interests in incident responses, and need to be kept informed. Periodic proactive updates to senior management can go a long way towards filling this information gap. If senior managers can trust that the IC will keep them informed as the response progresses, then they’re less likely to barge in seeking ad hoc updates. It takes time, likely over several incident responses, to build up this trust, but it’s absolutely worth doing.
On larger responses, or within larger organizations, it can be very effective for one senior manager to take on a “liaison” role between the response and the rest of the senior management team. The liaison usually works directly with the incident commander, passing information along to the rest of the senior management team and representing their concerns to the incident commander. PagerDuty’s open source incident response guide has a good writeup of the internal liaison role.
Individual responders have a role to play here, too. They need to remember that the incident commander is in charge of the incident response, and avoid the urge to defer to wandering senior management. When you’re part of an organized incident response, you’re operating under different rules and norms than day-to-day, within the structure of a temporary org chart created just for this incident. If the senior manager asking questions or giving orders isn’t part of this particular incident response, then you need to politely but firmly redirect them to the incident commander, rather than dropping your IC-assigned tasks in order to respond to the not-involved senior manager.
Senior managers as a group need to establish and monitor norms for each other, in this area. It’s tough for a responder to tell an SVP “I’m sorry, sir, but you should be talking to the incident commander, not me”, even when that’s the right thing to do; the responder shouldn’t be put in that position. A quiet word of advice from a fellow executive who sees this happening, reminding their wayward peer that they shouldn’t meddle in active responses, can go a long way.
Senior managers also have a very important role to play in supporting the organization’s overall incident management program. They need to make sure that the plans get developed, the responders get trained, the exercises get carried out, the responses get reviewed, the postmortems get written, and the action items get followed up on. Explicit and visible support for incident response from senior management is essential for developing and maintaining an effective incident management capability.
If you want to learn more about this and many other incident management topics, I’m teaching a Mastering Outages One-Day Class in the San Francisco Bay Area on Friday 18 May 2018. You can save $100 when you register with discount code “disrupt”, plus save an additional $100 if you register before 16 April 2018.