“How did we get lucky?”
I find that this is often the most interesting section of an incident postmortem. In other words, what might have happened, but didn’t? What could have happened, that would have been worse? Incidents often open your eyes to new and frightening possibilities that you hadn’t previously considered, and the postmortem is a good place to explore them.
“Incidents are unplanned investments in your company’s survival”, according to John Allspaw, whom many credit with popularizing the blameless postmortem movement. When you have an incident, you should get the most out of your “unplanned investment” by conducting a blameless postmortem. Among other things, that postmortem should consider “what if?”. What if this incident had occurred during peak load, instead of during off hours? What if you hadn’t happened to still have a copy of the prior config in your working files? What if you weren’t already logged in to the critical server, when all the SSH keys were wiped out by the broken automation distributing new keys?
Sometimes the most important lessons from a postmortem are what didn’t happen, but could have…
Where can you learn more about blameless postmortems, and how to conduct them?
Well, John Allspaw’s seminal Blameless PostMortems and a Just Culture article is a good start and a quick read. His presentation at the 2017 DevOps Enterprise Summit is also quite enlightening; at first, it doesn’t seem to have much to do with postmortems, but if you watch it all the way through (it’s only 33 minutes), you won’t be disappointed.
If you want more depth, Google has long had a very strong blameless postmortem culture, carefully nurtured by their legendary Site Reliability Engineering (SRE) organization, which I was proud to be a part of for several years. They’ve written about it in the O’Reilly Site Reliability Engineering book (which includes both a chapter on postmortems and an example postmortem), and in a blog post in their CRE Life Lessons series.
Finally, for specific guidance on how to conduct a blameless postmortem, Will Gallego delivered a fantastic talk on Architecting a Technical Postmortem at USENIX SREcon18 Americas.
Would your team or professional group like a free on-site tech talk on incident management? I deliver a limited number of these 1-hour talks every month to companies and user groups throughout the San Francisco Bay Area, as well as in other cities that I’m traveling to. Contact me to join the growing list of organizations that have benefited from these free talks, including LinkedIn, Atlassian, Pivotal, Okta, and BayLISA!
In one of the Slack channels that I frequent, someone recently asked what a reasonable duty cycle was, for engineers in a 24/7 on-call rotation with a single digit number of pages per week. In other words, under those circumstances, is it reasonable for a given individual to be on call one week in three, one week in four, or what? At least according to Google SRE, “reasonable” is a lot less often than that.
It depends a lot on when those pages occur, and how disruptive they are to your work and your life. If they’re all during your normal workday, and each takes only a few minutes to deal with, it’s not a big deal (though if they take only a few minutes to deal with, can’t you automate them away?). On the other hand, if you’re getting awakened every night by pages, and each page takes hours to deal with, even a “single digit number of pages per week” can be a huge burden.
It also depends somewhat on how quickly you are expected to respond to each page. How long is it, from the time you are paged, until you are expected to be hands-on-keyboard, connected, logged in, and actively working on the problem? If that’s 30 minutes, then you can still have most of a life even when on call; you can go to dinner, go to the movies, and so forth, as long as you bring your laptop and mobile hotspot, and you’re prepared to step out if you get paged. If your response expectation is 5 minutes, on the other hand, you’re pretty much tied to your home or office; you can step out long enough for a trip to the laundry room to change loads, but that’s about all.
At Google, the two standard response levels for SRE on-call rotations are 30 minutes and 5 minutes, and a “single digit number of pages per week”, or slightly more, is typical. On-call SREs at Google are paid for the time they spend on call that falls outside of their normal workday; for a 30-minute rotation, they’re paid at 33% of their normal rate, and for a 5-minute rotation, at 66%. In order to prevent burnout, the amount of on-call bonus pay is capped at 15% of base salary; this is a constraint on managers, who are strongly expected to keep all their people under this 15% cap, and must seek VP-level approval to exceed it (and had better have a damned good explanation of why this is a temporary situation that will be fixed within the next quarter). If you work backwards through the constraints, you find that it takes a 7-person rotation to provide 30-minute on-call coverage 24/7 without anybody exceeding the cap, and a 14-person rotation to provide 5-minute coverage. And therefore, any given individual would be on call no more than one week in seven for a 30-minute response, or one week in fourteen for a 5-minute response.
Google is perhaps an outlier, in this respect. Most companies are probably going to put their people on call more frequently than Google does, but they should beware that they’re burning the candle at both ends. They’re going to disrupt the ongoing project-based work that their engineers are expected to be accomplishing, and they risk driving their engineers to on-call burnout.
Many companies also don’t explicitly pay their engineers for being on call, or say “it’s just part of the job”, but I think that’s short-sighted. Being on call is a burden, and having to pay people for it helps keep it under control by providing an incentive for the company to invest in making its systems more stable, reliable, and automated. And in the end, the company, its customers, and its employees all benefit from that.
What do you do when the pager goes off for something that’s bigger than one person can handle? How do you coordinate an effective, efficient, timely, and scalable response? Learn how at my Mastering Outages One-Day Class! There’s one scheduled for Friday 18 May 2018 in the San Francisco Bay Area; full details and registration are at greatcircle.com/class, and you can save $100 by using discount code “On-Call”.
How do you measure and improve the effectiveness of your incident responses? You can start by looking at the times associated with your responses. You can set targets for these times, and evaluate how well a given incident response met your targets. Over multiple incidents, you may be able to identify trends, and take steps to tune your response methods based on those trends.
When I’m reviewing an incident, I try to identify 6 key timestamps, and then consider the relationships between them:
- Start Time is when your customers or users first started being impacted, whether they noticed it or not. This time usually has to be determined after the fact, through log analysis. It can sometimes be tricky to determine this time, as the problem might not have affected all customers, or it might have been something that grew gradually over time (a system getting progressively slower, for example), but you need to drive a stake in the ground somewhere.
- Detection Time is when your monitoring systems first detected the problem. This is the earliest that you could have alerted someone to the problem, though you might have delayed the alert for a while, to avoid noisy alerts.
- Alert Time is when the system actually alerted a human about the problem. We can debate what “alerted” means; was it when the first ticket was filed, even if the ticket didn’t generate a page? Was it when someone was paged? Was it when someone acknowledged the page? Pick a definition that makes sense for your circumstances, and stick to it.
- Response Time is when a person first started actively working on the problem. This means someone with their hands on the keyboard and their eyes on a screen, logged in and actively working the problem. Note that this is not the time at which someone acknowledged the alert, unless they were able to immediately begin working on the problem after acknowledging it. There might be a considerable amount of time between “Alerted” and “Responded”, while somebody finds their laptop, gets connected, gets their VPN set up, gets logged in, etc.; you want to know that.
- Mitigation Time is when the problem was resolved from the customer’s point of view; when they stopped being affected by it, or stopped seeing symptoms. You may still be working on the problem, but it’s no longer apparent to the customer.
- Resolution Time is when the incident response is “finished” from the responder’s point of view. The problem is solved, a permanent-enough fix is in place, and they can step away from the keyboard and go back to sleep. There might still be followup work to do (bugs to fix, etc.) after the Resolution Time, but basically, this is when the system is back to a normal, functioning state.
Given those timestamps, consider that:
- Customer Impact is from Start Time to Mitigation Time. This is the duration of the problem, from the customer’s point of view.
- Company Impact is from Alert Time to Resolution Time. This is the duration of the incident response, from the company’s point of view. It starts at Alert Time, because that’s when this problem first caused the company to take any action to address this situation.
Notice that the customer’s view of the duration of the incident or outage differs from the company’s view; that is, Customer Impact can differ significantly from Company Impact. At the beginning, customers (or at least some of them) are generally feeling an impact before the company notices the problem, and before the company does anything to address it; that’s the difference between Start Time and Alert Time. At the end, the company is often still working on the problem even after it’s “fixed” from the customer’s point of view; that’s the difference between Mitigation Time and Resolution Time.
In many circumstances, the Mitigation Time and the Resolution Time might be identical. This is true if there are no further steps to take beyond the mitigation, to return the system to a normal, functioning state.
You can use these timestamps to determine several useful intervals:
- Time to Detect is the difference between Start Time and Detection Time. If that interval is too long, you can shorten it by improving your monitoring, and in particular by focusing your monitoring on the customer’s experience. For example, your customer doesn’t care what the load average on your web server is, or how much free memory it has; your customer cares about how long it takes to load the web page they’re visiting, so focus your monitoring on measuring page load times.
- Time to Alert is the difference between Detection Time and Alert Time. If that interval is too long, you can shorten it by improving your alerting system. Beware, however, that you don’t want to create noisy alerts which generate a lot of false positives; that will create alert fatigue, where your responders are run ragged by their pagers, and can’t immediately tell the difference between bogus alerts and serious alerts. This is like your home smoke detector going off when you cook something; if it happens too often, you get annoyed and take the batteries out of the smoke detector, which is obviously bad.
- Time to Respond is the difference between Alert Time and Response Time. If that interval is too long, you can shorten it by tightening your team’s response expectations. This may involve paying them more, if they’re on a tighter on-call schedule. There’s a big difference between a 30-minute response expectation and a 5-minute response expectation; with a 30-minute response expectation, you can still go out to dinner or a movie, as long as you bring your laptop and are prepared to step out if you get paged, while with a 5-minute response expectation, even a trip to the bathroom might be a challenge.
- Time to Mitigate is the difference between Start Time and Mitigation Time. This is the duration of the incident from your customer’s point of view. If you mitigate a problem, it’s still a problem that you’re working on (perhaps urgently), but the problem has been solved from the customer’s point of view, and they can get on with their work.
- Time to Resolve is the difference between Alert Time and Resolution Time. This is the duration of the incident from your company’s point of view. This is the time during which the incident has an effect on actions taken by your team; it begins at Alert Time (rather than Start Time), because Alert Time is the first point at which your team does something tangible (i.e., respond to the alert) to address the incident.
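The relationships between the six timestamps and the five intervals above can be sketched in a few lines of Python. This is only a minimal illustration: the class name, the field names, and the sample incident are all invented for the example, and how you pin down each timestamp is still a judgment call for your own circumstances.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """The six key timestamps for one incident (names are illustrative)."""
    start: datetime       # customers first impacted
    detection: datetime   # monitoring first detected the problem
    alert: datetime       # a human was alerted
    response: datetime    # someone began actively working the problem
    mitigation: datetime  # customers stopped being affected
    resolution: datetime  # response finished from the responder's view

    def intervals(self) -> dict[str, timedelta]:
        """Derive the five intervals from the six timestamps."""
        return {
            "time_to_detect": self.detection - self.start,
            "time_to_alert": self.alert - self.detection,
            "time_to_respond": self.response - self.alert,
            # Customer Impact: Start Time to Mitigation Time.
            "time_to_mitigate": self.mitigation - self.start,
            # Company Impact: Alert Time to Resolution Time.
            "time_to_resolve": self.resolution - self.alert,
        }


# Hypothetical incident, invented purely for illustration.
incident = IncidentTimeline(
    start=datetime(2018, 4, 2, 3, 0),
    detection=datetime(2018, 4, 2, 3, 4),
    alert=datetime(2018, 4, 2, 3, 9),
    response=datetime(2018, 4, 2, 3, 21),
    mitigation=datetime(2018, 4, 2, 4, 5),
    resolution=datetime(2018, 4, 2, 5, 30),
)

for name, delta in incident.intervals().items():
    print(f"{name}: {delta}")
```

Note that Time to Mitigate and Time to Resolve are measured from different starting points (Start Time and Alert Time respectively), which is why the two can differ so much for the same incident.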
It’s a good idea to set targets for these various intervals, and to track your team’s performance against those targets across multiple incidents. As part of the blameless postmortem review for a particular incident, you should consider how well you did against your targets. By looking at multiple incidents, you can identify trends or common factors, and make adjustments accordingly to your standards and practices.
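One way to track performance against targets might look like the sketch below. The target values, the incident labels, and the measured intervals are all hypothetical; pick targets that make sense for your own service and response expectations.

```python
from datetime import timedelta
from statistics import median

# Hypothetical targets; choose values that fit your own service.
TARGETS = {
    "time_to_detect": timedelta(minutes=5),
    "time_to_alert": timedelta(minutes=5),
    "time_to_respond": timedelta(minutes=30),
}

# Hypothetical measured intervals from three past incidents.
PAST_INCIDENTS = {
    "incident-A": {
        "time_to_detect": timedelta(minutes=4),
        "time_to_alert": timedelta(minutes=5),
        "time_to_respond": timedelta(minutes=12),
    },
    "incident-B": {
        "time_to_detect": timedelta(minutes=11),
        "time_to_alert": timedelta(minutes=2),
        "time_to_respond": timedelta(minutes=41),
    },
    "incident-C": {
        "time_to_detect": timedelta(minutes=3),
        "time_to_alert": timedelta(minutes=9),
        "time_to_respond": timedelta(minutes=25),
    },
}


def missed_targets(measured, targets=TARGETS):
    """Return the names of the intervals that exceeded their target."""
    return [name for name, limit in targets.items() if measured[name] > limit]


def median_minutes(incidents, name):
    """Median of one interval, in minutes, across all recorded incidents."""
    return median(m[name].total_seconds() / 60 for m in incidents.values())


# Per-incident view, suitable for a single postmortem review.
for label, measured in PAST_INCIDENTS.items():
    print(label, "missed:", missed_targets(measured) or "nothing")

# Cross-incident view, for spotting trends.
for name in TARGETS:
    print(f"median {name}: {median_minutes(PAST_INCIDENTS, name):.0f} min")
```

The per-incident check fits naturally into an individual postmortem, while the median across incidents is a crude but useful first look at trends.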
The intervals that you have the most control over are Time to Detect, Time to Alert, and Time to Respond. You would expect those to be fairly consistent from incident to incident, and it’s obvious what steps you can take to shorten each of them, if needed (just beware of the impact of false positives and alert fatigue, if you try to tighten them too much).
Time to Mitigate and Time to Resolve are harder to predict and tend to be much less consistent from one incident to the next. What steps you can take to mitigate and resolve the incident depends on the nature of the incident, and of the responders. There’s less statistical validity in comparing these across multiple incidents, unless the incidents are all similar in type and scope.
One thing to consider about Time to Mitigate and Time to Resolve is, how long did it take to get the right people involved in the response? You might find, for example, that what initially got reported as an application problem, turned out to be a database problem, which itself turned out to be a storage problem, which finally turned out to be a networking problem. Each of these levels might take 30 minutes or more to work through, as you page a particular team, they respond, they investigate, and they determine that the problem is deeper in the technology stack. When you finally get ahold of the last team, they might identify and solve the problem within 5 minutes, but you took 2 hours to get to them.
You can improve Time to Mitigate and Time to Resolve by alerting a broader swath of teams earlier in the response, and essentially “swarming” the response. That way, when your apps expert needs a database expert, or your database expert needs a storage expert, and so forth, they’re already engaged in the response. The downside of this is, on many incidents those various specialists aren’t going to be needed, and you’re wasting their time and disrupting their other work by looping them in unnecessarily; this has both productivity and burnout consequences. You have to decide which is more important for your company: minimizing customer impact (which means swarming the problem and wasting responder time, just in case they’re needed on a particular incident) or minimizing company impact (which means longer customer impact, as you involve your experts in sequence, only if they’re really needed for the particular incident).
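To make the tradeoff concrete, here is a deliberately simplified toy model comparing the two approaches. Everything in it is an assumption made for illustration: each team needs a fixed number of minutes once engaged (40 minutes for the first three teams, so that three handoffs add up to two hours, and 5 minutes for the last), sequential responders are busy only during their own turn, and swarmed responders all stay engaged until the owning team fixes the problem.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    minutes_to_mitigate: int  # rough proxy for customer impact
    responder_minutes: int    # rough proxy for company impact


def sequential(minutes_per_team, owner):
    """Page one team at a time; each hands off until the owning team fixes it."""
    elapsed = sum(minutes_per_team[: owner + 1])
    # Assumption: each team is busy only during its own turn.
    return Outcome(elapsed, elapsed)


def swarm(minutes_per_team, owner):
    """Page every team at once; all stay engaged until the owning team fixes it."""
    elapsed = minutes_per_team[owner]
    return Outcome(elapsed, elapsed * len(minutes_per_team))


# Hypothetical stack and timings, in escalation order.
TEAMS = ["application", "database", "storage", "networking"]
MINUTES = [40, 40, 40, 5]

for owner in (TEAMS.index("networking"), TEAMS.index("application")):
    print(f"fault owned by {TEAMS[owner]}:")
    print("  sequential:", sequential(MINUTES, owner))
    print("  swarm:     ", swarm(MINUTES, owner))
```

Under these assumptions, swarming helps a great deal when the fault is deep in the stack, but when the fault turns out to be in the application after all, the customer sees no improvement and the other teams have been pulled away from their work for nothing.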
There are no perfect answers here, but rather a series of tradeoffs. It’s better to consider these tradeoffs and make explicit decisions about them when you’re not in the middle of a crisis. Blameless postmortems are often a good opportunity to revisit and review the tradeoffs that you’ve chosen, and make adjustments if appropriate.
Want to learn more about how to manage your outages and other IT emergencies more effectively, and thereby reduce all of the times discussed above? Come to my Mastering Outages One-Day Class! There’s one scheduled for Friday 18 May 2018 in the San Francisco Bay Area; full details and registration are at greatcircle.com/class, and you can save $100 by using discount code “Measure”.
“How do we keep senior managers from disrupting incident responses?” That audience question generated the strongest response last week at my workshop on Incident Command for IT at the fantastic USENIX SREcon18 Americas.
Senior management definitely has a critical role to play in incident response, but as soon as somebody asked that question, the room lit up; it seemed like all 200 people had tales to share about active incident responses that were inadvertently derailed by directors, executives, and other senior managers. It was clear that this was a significant source of frustration for incident responders and incident leaders in the room.
Incident management is about controlling chaos, and senior management can be a significant source of chaos during an incident, usually without meaning to be. Why is this so, and how can senior managers, incident leadership, and responders all work together to avoid this?
Senior management has a legitimate need for timely information about incident responses in progress. But if they simply barge in on the phone bridge or Slack channel that’s being used to coordinate the response, they disrupt the response by creating confusion about who is in charge of the response, which ultimately delays the resolution of the incident. It’s bad enough when the drive-by senior managers are simply asking for information; those requests take on a life of their own, as responders scramble to answer the questions, instead of continuing whatever they were doing at the direction of the incident leadership. It’s even worse when the senior managers start giving orders, usurping the roles of the incident commander and their leadership team; if the senior manager wants to do that, they should explicitly take over as incident commander (assuming they have the appropriate training), but that seldom seems to happen.
Senior managers can avoid this scenario by being aware that their mere presence in an incident response is going to make waves, and therefore being careful about when and where they appear. If they have questions about the incident response, they should address those questions privately to the incident commander, via direct message or other channels. They should not be asking questions of individual responders directly, or in the main communications channel of the response.
Incident commanders can address this issue by recognizing that senior managers have legitimate interests in incident responses, and need to be kept informed. Periodic proactive updates to senior management can go a long way towards filling this information gap. If senior managers can trust that the IC will keep them informed as the response progresses, then they’re less likely to barge in seeking ad hoc updates. It takes time, likely over several incident responses, to build up this trust, but it’s absolutely worth doing.
On larger responses, or within larger organizations, it can be very effective for one senior manager to take on a “liaison” role between the response and the rest of the senior management team. The liaison usually works directly with the incident commander, passing information along to the rest of the senior management team and representing their concerns to the incident commander. PagerDuty’s open source incident response guide has a good writeup of the internal liaison role.
Individual responders have a role to play here, too. They need to remember that the incident commander is in charge of the incident response, and avoid the urge to defer to wandering senior management. When you’re part of an organized incident response, you’re operating under different rules and norms than day-to-day, within the structure of a temporary org chart created just for this incident. If the senior manager asking questions or giving orders isn’t part of this particular incident response, then you need to politely but firmly redirect them to the incident commander, rather than dropping your IC-assigned tasks in order to respond to the not-involved senior manager.
Senior managers as a group need to establish and monitor norms for each other, in this area. It’s tough for a responder to tell an SVP “I’m sorry, sir, but you should be talking to the incident commander, not me”, even when that’s the right thing to do; the responder shouldn’t be put in that position. A quiet word of advice from a fellow executive who sees this happening, reminding their wayward peer that they shouldn’t meddle in active responses, can go a long way.
Senior managers also have a very important role to play in supporting the organization’s overall incident management program. They need to make sure that the plans get developed, the responders get trained, the exercises get carried out, the responses get reviewed, the postmortems get written, and the action items get followed up on. Explicit and visible support for incident response from senior management is essential for developing and maintaining an effective incident management capability.
If you want to learn more about this and many other incident management topics, I’m teaching a Mastering Outages One-Day Class in the San Francisco Bay Area on Friday 18 May 2018. You can save $100 when you register with discount code “disrupt”, plus save an additional $100 if you register before 16 April 2018.
Come join me at next week’s BayLISA meeting in San Jose, as I’ll be speaking about Learning from the Fire Department: Experiences with Incident Command for IT. I’ll be sharing key lessons, along with a few war stories, from companies such as Google, Heroku, and PagerDuty, who have all developed successful incident management practices based on the public safety world’s Incident Command System (ICS).
The BayLISA meeting is at 7:30pm on Thursday 15 March 2018, at the PayPal campus near San Jose International Airport. Full details are on the BayLISA web page at baylisa.org.
Speaking at BayLISA is a bit of a homecoming for me. I’ve been going to BayLISA meetings for over half my life, and I’ve been the featured speaker at several of them over the years. I’m not one of the group’s founders, but I started going within a year or two of its formation, and I was a regular participant for many years when I lived in the heart of Silicon Valley. BayLISA played a very important role in my early career, and I’m glad to see that it’s still going strong.
By the way, don’t forget my full-day class on Mastering Outages on Fri 18 May, here in the SF Bay Area. There are a limited number of Early Bird seats available at a $100 discount until 16 April, plus get another $100 discount with code “Fire”. greatcircle.com/class
Was your last outage a triumph, or a tragedy? Was it like Apollo 13, or more like Titanic?
Are you ready for your next outage? Are you prepared to respond quickly, smoothly, effectively, and efficiently? Users expect uptime, all the time, and it’s critical to your success. Whether it’s a massive web service like Google Search, or the file server in your 6-person office, if the service is unavailable, nothing else about it matters.
Mastering outages makes a critical difference by reducing downtime and disruption. This protects your users, your reputation, your staff, and your bottom line.
Preparation is key to mastering outages. Is your organization prepared to respond effectively to outages, or do you simply react haphazardly? You prevent an emergency from becoming a crisis with a response that is quick, smooth, effective, and efficient, in order to reliably solve the problem with minimum fuss and disruption.
Learn how to reduce downtime and the impact of outages on your company, customers, and staff at our Mastering Outages 1-day class in the San Francisco Bay Area on Friday 18 May 2018.
Sign up now at greatcircle.com/class.
A limited number of discounted Early Bird seats are available now, plus get a further $100 discount with code “Apollo13”.
Can’t make this date? Join our list to hear about future dates and locations for this class.
Interested in consulting and in-house training for your company? Questions about this class? Want to suggest your city for a future class? Contact us!
I’m excited to share that I just signed the venue contract for a full-day “Mastering Outages” public class on incident management, for here in the SF Bay Area on Friday 18 May 2018. I’ll be posting full details shortly, as I get everything else set up, but you should join my mailing list so that you don’t miss any announcements (or discounts!)…
Effective incident management matters because it both reduces downtime for your service, and reduces the impact of dealing with that downtime on your staff.
If you’re a service provider, uptime is critical for your success. Whether it’s a massive web service like Google Search, or the print service in your 6-person office: if the service is not available, nothing else about it matters. Your users expect uptime, and they notice downtime.
Besides the direct impact of outages on users, outages also have a huge impact on your staff and management. Outages are disruptive to your organization’s ongoing work, whatever that might be. Projects and other deliverables are delayed when staff are interrupted to deal with outages, and both the quality of their work and their own quality of life suffer.
In the worst cases, organizations risk entering a toxic cycle, as the resources involved in dealing with outages are drawn away from other work, which in turn causes delays, missed deadlines, and more stress for all concerned. In the wake of each outage, the work that was neglected while dealing with the outage still needs to get done and therefore becomes more urgent, while work to identify and address the root causes of the outage gets short shrift. The root causes remain unaddressed, which in turn leads to more outages, thus perpetuating the toxic cycle.
There are two basic ways to reduce the impact of downtime: reduce the number of outages, and reduce the time and disruption of dealing with outages. You reduce the number of outages by making the service more resilient through architectural and infrastructure changes, and you reduce both the duration and impact of each outage by improving your incident management practices.
By adopting proven incident management methods, you can both shorten the duration and user impact of outages, and reduce the disruptions that outages inflict on your own staff and their ongoing development work.
I’ll be presenting a 3-hour workshop on incident management at the USENIX SREcon18 Americas conference in Silicon Valley on 27 March 2018:
Incident Command for IT—What We’ve Learned from the Fire Department
Don’t delay in registering for the conference; it has sold out each of the past several years, often before the Early Bird discount registration date arrives! This conference is always one of my favorites each year, and I highly recommend it.
I’ve also got an extended full-day version of this training available; contact me if you’d be interested in a private presentation for your organization.
Finally, I’m working on scheduling a public presentation of this full-day extended training for mid-May in San Francisco. If you want to be notified when that is scheduled, please sign up for our mailing list.
Everybody wants to be a hero when there’s an outage or similar service problem, swooping in to save the day with their knowledge, skill, and wisdom, and then reaping their reward of praise and adulation from management and peers. However, if heroics are what you reward, then you’ll get more heroics, and that’s not good for your team or your service. Instead, you should prepare to respond *without* heroics, by using good incident management practices, and reward folks for avoiding the need for heroics.
If a heroic response is needed, it means that something has gone wrong with your service, and your users are impacted. Obviously, you’d prefer to minimize that impact, either by avoiding the problem in the first place, or by catching it sooner and dealing with it faster. If you reward heroics in responding to such problems, however, you risk inadvertently creating an incentive to let problems fester until heroic measures are required (and rewarded).
Heroic responses are bad for both the hero and their organization. The hero risks exhaustion and burnout, and the time and energy required for heroics cut into their “real” job, which is usually project-oriented; heroics delay the hero’s project work, and also often impact its quality. Organizationally, the hero becomes a bottleneck or a single point of failure; if they are unavailable for some reason, outages take longer to resolve and have more serious consequences and side-effects, and are more disruptive for the rest of the team, because more of the team is needed for the response.
Preparation and good incident management practices enable you to respond effectively to outages and service problems as a routine matter, rather than as a crisis demanding a heroic response. After the problem is resolved, a good post-incident review (in the form of a blameless postmortem) helps you become even better prepared for future incidents.
Often, it seems that a heroic response is the only option, because nobody else knows the systems, nobody else has the necessary passwords and other admin privileges, or whatever. However, these are actually signs of failure to prepare before the incident ever occurred, through training, documentation, development of monitoring and tools, and so forth. Outages and other problems are going to occur in any system; by choosing *not* to prepare, the organization is implicitly choosing to require a heroic response to future problems, thus perpetuating the cycle of heroic responses.
By all means, recognize heroic actions when they do occur, but also ask the hard questions about why such heroic measures were needed. What could have been done to prevent the need for heroic measures?
Finally, make sure that the folks who do good, solid work on *preventing* problems are getting properly recognized for their work. They are your organization’s real heroes.
“Routine emergency” might seem like an oxymoron, but it’s a good description of effective incident management. We do our best to avoid emergencies, but we still have to be prepared to deal with them when they occur. Dealing with an emergency doesn’t have to be a crisis, though; with good incident management practices, it can be routine.
You see this in professions that deal with emergencies on a daily basis. For example, you seldom see firefighters running at a fire. Because of their extensive training, preparation, and practice, an event that is incredibly stressful for the victims is just another day at the office for the firefighters. It’s routine.
For dealing with an emergency to be routine, you need to be prepared to respond to emergencies. Who is going to respond? How are they going to be notified to respond? What steps are they going to take when notified? How can they enlist help from others if needed? How are all the responders going to organize and coordinate their activities? How are they going to communicate with each other, and with stakeholders beyond the response? How do you scale up and scale down the response, as the situation unfolds? How do you wrap up the response and return to normal operations? Effective incident management practices address these questions, and more.
If you want your emergency responses to be routine, you need to have considered, planned, trained, and practiced all of this, before the emergency arises.
Are you prepared? We can help.
I’m pleased to announce the return of my consulting practice, Great Circle Associates, with a focus on helping organizations develop and strengthen their incident management capabilities.
Service outages and other incidents are a fact of life. You do your best to avoid them, but organizations are judged by how effectively they handle them. A poorly managed incident can be a major black eye; conversely, a well-managed incident can be a significant confidence builder for both your team and, ironically, your users (they expect problems, but they appreciate how well you handle those problems).
I am an expert in managing service outages and other emergencies, enabling organizations to restore service quickly while minimizing impact to both users and staff. I can help organizations develop the procedures, tools, and skills to handle incidents effectively.
- While I was a part of Google’s legendary Site Reliability Engineering (SRE) organization, I created, developed, and taught Google’s internal incident management procedures, known as IMAG (Incident Management at Google).
- I’ve presented talks about incident management practices in a variety of forums, such as the USENIX LISA conference and the O’Reilly Velocity conference.
- For over two decades, while working primarily in high tech, I’ve also been an incident manager for a variety of public safety agencies. I have managed searches for missing aircraft, coordinated community emergency/disaster responses, and supervised the emergency dispatch centers for large multi-day art & music festivals.
Let’s talk about how I can help your organization improve your incident management capabilities.