Mastering Outages full-day class scheduled

I’m excited to share that I just signed the venue contract for a full-day “Mastering Outages” public class on incident management, to be held here in the SF Bay Area on Friday 18 May 2018. I’ll be posting full details shortly as I get everything else set up, but you should join my mailing list so that you don’t miss any announcements (or discounts!)…

Why does incident management matter?

Effective incident management matters because it reduces both the downtime of your service and the impact that dealing with that downtime has on your staff.

If you’re a service provider, uptime is critical for your success. Whether it’s a massive web service like Google Search, or the print service in your 6-person office: if the service is not available, nothing else about it matters. Your users expect uptime, and they notice downtime.

Besides the direct impact of outages on users, outages also have a huge impact on your staff and management. Outages are disruptive to your organization’s ongoing work, whatever that might be. Projects and other deliverables are delayed when staff are interrupted to deal with outages, and both the quality of their work and their own quality of life suffer.

In the worst cases, organizations risk entering a toxic cycle, as the resources involved in dealing with outages are drawn away from other work, which in turn causes delays, missed deadlines, and more stress for all concerned. In the wake of each outage, the work that was neglected while dealing with the outage still needs to get done and therefore becomes more urgent, while work to identify and address the root causes of the outage gets short shrift. The root causes remain unaddressed, which in turn leads to more outages, thus perpetuating the toxic cycle.

There are two basic ways to reduce the impact of downtime: reduce the number of outages, and reduce the time and disruption of dealing with outages. You reduce the number of outages by making the service more resilient through architectural and infrastructure changes, and you reduce both the duration and impact of each outage by improving your incident management practices.
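As a rough illustration of those two levers, total downtime can be treated as the number of outages multiplied by their average duration. The sketch below uses made-up numbers (12 outages a year at 2 hours each) purely as an assumption for the example; the function name is hypothetical and the figures are not measurements from any real service.

```python
def annual_downtime_hours(outages_per_year: float, mean_hours_per_outage: float) -> float:
    """Total downtime is the number of outages times their average duration."""
    return outages_per_year * mean_hours_per_outage


# Hypothetical baseline: 12 outages a year, 2 hours each.
baseline = annual_downtime_hours(12, 2.0)

# Lever 1: a more resilient service has fewer outages.
fewer_outages = annual_downtime_hours(6, 2.0)

# Lever 2: better incident management shortens each outage.
shorter_outages = annual_downtime_hours(12, 1.0)

# Both levers together compound.
both = annual_downtime_hours(6, 1.0)

print(baseline, fewer_outages, shorter_outages, both)
```

Under these assumed numbers, halving either factor halves total downtime, and doing both cuts it to a quarter. The model leaves out the staff disruption described above, which incident management also reduces.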

By adopting proven incident management methods, you can both shorten the duration and user impact of outages, and reduce the disruptions that outages inflict on your own staff and their ongoing development work.


I’ll be presenting a 3-hour workshop on incident management at the USENIX SREcon18 Americas conference in Silicon Valley on 27 March 2018:

Incident Command for IT—What We’ve Learned from the Fire Department

Don’t delay in registering for the conference; it has sold out each of the past several years, often before the Early Bird discount registration date arrives! This conference is always one of my favorites each year, and I highly recommend it.

I’ve also got an extended full-day version of this training available; contact me if you’d be interested in a private presentation for your organization.

Finally, I’m working on scheduling a public presentation of this full-day extended training for mid-May in San Francisco. If you want to be notified when that is scheduled, please sign up for our mailing list.

Everybody wants to be a hero

Everybody wants to be a hero when there’s an outage or similar service problem, swooping in to save the day with their knowledge, skill, and wisdom, and then reaping their reward of praise and adulation from management and peers. However, if heroics are what you reward, then you’ll get more heroics, and that’s not good for your team or your service. Instead, you should prepare to respond *without* heroics, by using good incident management practices, and reward folks for avoiding the need for heroics.

If a heroic response is needed, it means that something has gone wrong with your service, and your users are impacted. Obviously, you’d prefer to minimize that impact, either by avoiding the problem in the first place, or by catching it sooner and dealing with it faster. If you reward heroics in responding to such problems, however, you risk inadvertently creating an incentive to let problems fester until heroic measures are required (and rewarded).

Heroic responses are bad for both the hero and their organization. The hero risks exhaustion and burnout, and the time and energy required for heroics cut into their “real” job, which is usually project-oriented; heroics delay the hero’s project work, and often reduce its quality as well. Organizationally, the hero becomes a bottleneck or a single point of failure. If they are unavailable for some reason, outages take longer to resolve, have more serious consequences and side effects, and are more disruptive for the rest of the team, because more of the team is needed for the response.

Preparation and good incident management practices enable you to respond effectively to outages and service problems as a routine matter, rather than as a crisis demanding a heroic response. After the problem is resolved, a good post-incident review (in the form of a blameless postmortem) helps you become even better prepared for future incidents.

Often, it seems that a heroic response is the only option, because nobody else knows the systems, or nobody else has the necessary passwords and other admin privileges, or for some similar reason. However, these are actually signs of a failure to prepare before the incident occurred, through training, documentation, development of monitoring and tools, and so forth. Outages and other problems are going to occur in any system; by choosing *not* to prepare, the organization is implicitly choosing to require a heroic response to future problems, thus perpetuating the cycle of heroic responses.

By all means, recognize heroic actions when they do occur, but also ask the hard questions about why such heroic measures were needed. What could have been done to prevent the need for heroic measures?

Finally, make sure that the folks who do good, solid work on *preventing* problems are getting properly recognized for their work. They are your organization’s real heroes.

Routine Emergencies

“Routine emergency” might seem like an oxymoron, but it’s a good description of effective incident management. We do our best to avoid emergencies, but we still have to be prepared to deal with them when they occur. Dealing with an emergency doesn’t have to be a crisis, though; with good incident management practices, it can be routine.

You see this in professions that deal with emergencies on a daily basis. For example, you seldom see firefighters running at a fire. Because of their extensive training, preparation, and practice, an event that is incredibly stressful for the victims is just another day at the office for the firefighters. It’s routine.

For dealing with an emergency to be routine, you need to be prepared to respond to emergencies. Who is going to respond? How are they going to be notified to respond? What steps are they going to take when notified? How can they enlist help from others if needed? How are all the responders going to organize and coordinate their activities? How are they going to communicate with each other, and with stakeholders beyond the response? How do you scale up and scale down the response, as the situation unfolds? How do you wrap up the response and return to normal operations? Effective incident management practices address these questions, and more.

If you want your emergency responses to be routine, you need to have considered, planned, trained, and practiced all of this, before the emergency arises.

Are you prepared? We can help.

Offering Incident Management Consulting and Training

I’m pleased to announce the return of my consulting practice, Great Circle Associates, with a focus on helping organizations develop and strengthen their incident management capabilities.

Service outages and other incidents are a fact of life. You do your best to avoid them, but organizations are judged by how effectively they handle them. A poorly managed incident can be a major black eye; conversely, a well-managed incident can be a significant confidence builder for both your team and, ironically, your users (they expect problems, but they appreciate how well you handle those problems).

I am an expert in managing service outages and other emergencies, enabling organizations to restore service quickly while minimizing impact to both users and staff. I can help organizations develop the procedures, tools, and skills to handle incidents effectively.

  • While I was a part of Google’s legendary Site Reliability Engineering (SRE) organization, I created, developed, and taught Google’s internal incident management procedures, known as IMAG (Incident Management at Google).
  • I’ve presented talks about incident management practices in a variety of forums, such as the USENIX LISA conference and the O’Reilly Velocity conference.
  • For over two decades, while working primarily in high tech, I’ve also been an incident manager for a variety of public safety agencies. I have managed searches for missing aircraft, coordinated community emergency/disaster responses, and supervised the emergency dispatch centers for large multi-day art & music festivals.

Let’s talk about how I can help your organization improve your incident management capabilities.