Effective incident management matters because it reduces both the downtime of your service and the toll that dealing with that downtime takes on your staff.

If you’re a service provider, uptime is critical for your success. Whether it’s a massive web service like Google Search, or the print service in your 6-person office: if the service is not available, nothing else about it matters. Your users expect uptime, and they notice downtime.

Besides the direct impact of outages on users, outages also have a huge impact on your staff and management. Outages are disruptive to your organization’s ongoing work, whatever that might be. Projects and other deliverables are delayed when staff are interrupted to deal with outages, and both the quality of their work and their own quality of life suffer.

In the worst cases, organizations risk entering a toxic cycle, as the resources involved in dealing with outages are drawn away from other work, which in turn causes delays, missed deadlines, and more stress for all concerned. In the wake of each outage, the work that was neglected while dealing with the outage still needs to get done and therefore becomes more urgent, while work to identify and address the root causes of the outage gets short shrift. The root causes remain unaddressed, which in turn leads to more outages, thus perpetuating the toxic cycle.

There are two basic ways to reduce the impact of downtime: reduce the number of outages, and reduce the time and disruption of dealing with outages. You reduce the number of outages by making the service more resilient through architectural and infrastructure changes, and you reduce both the duration and impact of each outage by improving your incident management practices.

By adopting proven incident management methods, you can both shorten the duration and user impact of outages and reduce the disruption that outages inflict on your own staff and their ongoing development work.


I’ll be presenting a 3-hour workshop on incident management at the USENIX SREcon18 Americas conference in Silicon Valley on 27 March 2018:

Incident Command for IT—What We’ve Learned from the Fire Department

Don’t delay in registering for the conference; it has sold out in each of the past several years, often before the Early Bird discount registration date arrives! This conference is always one of my favorites, and I highly recommend it.

I’ve also got an extended full-day version of this training available; contact me if you’d be interested in a private presentation for your organization.

Finally, I’m working on scheduling a public presentation of this full-day extended training for mid-May in San Francisco. If you want to be notified when that is scheduled, please sign up for our mailing list.