“How did we get lucky?”
I find that this is often the most interesting section of an incident postmortem. In other words, what might have happened, but didn’t? What could have happened, that would have been worse? Incidents often open your eyes to new and frightening possibilities that you hadn’t previously considered, and the postmortem is a good place to explore them.
“Incidents are unplanned investments in your company’s survival”, according to John Allspaw, whom many credit with popularizing the blameless postmortem movement. When you have an incident, you should get the most out of your “unplanned investment” by conducting a blameless postmortem. Among other things, that postmortem should consider “what if?”. What if this incident had occurred during peak load, instead of during off hours? What if you hadn’t happened to still have a copy of the prior config in your working files? What if you hadn’t already been logged in to the critical server when all the SSH keys were wiped out by the broken automation distributing new keys?
Sometimes the most important lessons from a postmortem are what didn’t happen, but could have…
Where can you learn more about blameless postmortems, and how to conduct them?
Well, John Allspaw’s seminal Blameless PostMortems and a Just Culture article is a good start and a quick read. His presentation at the 2017 DevOps Enterprise Summit is also quite enlightening; at first, it doesn’t seem to have much to do with postmortems, but if you watch it all the way through (it’s only 33 minutes), you won’t be disappointed.
If you want more depth, Google has long had a very strong blameless postmortem culture, carefully nurtured by their legendary Site Reliability Engineering (SRE) organization, which I was proud to be a part of for several years. They’ve written about it in the O’Reilly Site Reliability Engineering book (which includes both a chapter on postmortems and an example postmortem), and in a blog post in their CRE Life Lessons series.
Finally, for specific guidance on how to conduct a blameless postmortem, Will Gallego delivered a fantastic talk on Architecting a Technical Postmortem at USENIX SREcon18 Americas.
Would your team or professional group like a free on-site tech talk on incident management? I deliver a limited number of these 1-hour talks every month to companies and user groups throughout the San Francisco Bay Area, as well as in other cities that I’m traveling to. Contact me to join the growing list of organizations that have benefited from these free talks, including LinkedIn, Atlassian, Pivotal, Okta, and BayLISA!