Great Circle Waypoints Blog
Items of interest to Great Circle clients and friends
What’s the most interesting question in a blameless postmortem?
“How did we get lucky?”
I find that this is often the most interesting section of an incident postmortem. In other words, what might have happened, but didn’t? What could have happened that would have been worse? Incidents often open your eyes to new and frightening possibilities that you hadn’t previously considered, and the postmortem is a good place to explore them.
How often should your engineers be on call?
In one of the Slack channels that I frequent, someone recently asked what a reasonable duty cycle is for engineers in a 24/7 on-call rotation with a single-digit number of pages per week. In other words, under those circumstances, is it reasonable for a given individual to be on call one week in three, one week in four, or what? At least according to Google SRE, it’s a lot less often than that. […]
How to improve your incident response times
How do you measure and improve the effectiveness of your incident responses? You can start by looking at the times associated with your responses. You can set targets for these times, and evaluate how well a given incident response met your targets. Over multiple incidents, you may be able to identify trends, and take steps to tune your response methods based on those trends. In this blog post, learn what the key times are, and how to improve each of them. […]
Senior managers, stop disrupting your team’s incident responses
“How do we keep senior managers from disrupting incident responses?” That audience question generated the strongest response last week at my workshop on Incident Command for IT at the fantastic USENIX SREcon18 Americas.
Senior management definitely has a critical role to play in incident response, but as soon as somebody asked that question, the room lit up; it seemed like all 200 people had tales to share about active incident responses that were inadvertently derailed by directors, executives, and other senior managers. It was clear that this was a significant source of frustration for incident responders and incident leaders in the room.
Incident management is about controlling chaos, and senior management can be a significant source of chaos during an incident, usually without meaning to be. Why is this so, and how can senior managers, incident leadership, and responders all work together to avoid this? […]
Speaking about Lessons from the Fire Department at BayLISA next Thu 3/15 in San Jose
Come join me at next week's BayLISA meeting in San Jose, as I'll be speaking about Learning from the Fire Department: Experiences with Incident Command for IT. I'll be sharing key lessons, along with a few war stories, from companies such as Google, Heroku, and...
Was your last outage like Apollo 13, or more like Titanic?
Was your last outage a triumph, or a tragedy? Was it like Apollo 13, or more like Titanic? Are you ready for your next outage? Are you prepared to respond quickly, smoothly, effectively, and efficiently? Users expect uptime, all the time, and it's critical to your...
Great Circle Associates, Inc.
International: +1 415 861 3588
USA Toll Free: 877 GRT CRCL