Why is it important to separate the Incident Commander and Tech Lead roles?

There are many roles in high-tech incident response, such as incident commander, tech lead, communications lead, subject matter expert, and so forth. Individuals often fill multiple roles simultaneously, especially in the early stages of an incident; generally, this is OK, and particular roles can be handed off to other individuals as more people join the response. However, in my experience with incident response at Google and elsewhere, having one person try to act as both the incident commander (IC) and the tech lead (TL) is a recipe for trouble.

The fundamental problem is that these two roles have significantly different responsibilities, and it’s difficult for an individual to task-switch between them effectively. The TL needs to be head down, hands on the keyboard, eyes on the screen, 100% focused on leading the subject matter experts who are resolving the problem at hand. In contrast, the IC needs to be head up and looking around, maintaining a broad view and keeping their mind on the “big picture” for the response. Essentially, in order to enable the TL and the subject matter experts to focus on solving the problem at hand, the IC and other incident leaders (communications lead, scribe, liaison, etc.) handle everything else related to the incident.

Trying to be both the IC and the TL for an incident is an enticing trap, and I’ve fallen into it myself a number of times, always to my regret. To fill both roles simultaneously, you need to continuously shift back and forth between leaning in for tight focus on the problem as the TL, and stepping back and looking around at the big picture as the IC. While you’re head-down acting as the TL, you aren’t doing the IC’s job of considering the big picture, communicating to the rest of the organization, obtaining additional resources for the response, liaising with key stakeholder groups beyond the response, and so forth. Similarly, when you step back to do the IC’s job, you aren’t giving the TL role the focus that it typically needs in order to troubleshoot and solve the problem. You tend to get “stuck” in one of the roles for many minutes at a time, while the other role suffers.

In the most effective responses, the IC and the TL work closely together to solve the problem quickly and effectively. The TL is focused on the technical problem itself, and on leading the team of subject matter experts who have been recruited for the response. The IC is dealing with everything else, in order to enable the TL and their team to maintain that focus.

I’ve seen many incidents resolved quickly with only two responders involved: the IC and the TL. While the IC is making notifications, recruiting more responders, setting up incident communications and documentation, and so forth, the TL is able to focus on the problem, and quickly resolves it, enabling the IC to cancel the rest of the responders before they even really get engaged.

It might seem obvious to start with the IC when organizing a response, but it often makes more sense for an incident’s first responder to take the TL role, rather than the IC role. Frequently, an on-call engineer is paged to investigate some anomaly, evaluates the situation, and decides that a full incident response is called for. In such cases, it’s usually better to make the on-call engineer the TL for the response, and recruit another responder to be the IC. While investigating the problem before deciding to launch an incident response, the on-call engineer has already become familiar with the technical aspects of the problem at hand; they’ve called up the relevant dashboards and documentation, poked around at the system, formed and tested some hypotheses about what’s happening, and so forth. In other words, they’ve already got their head in the game for solving the problem, which is exactly what you want the TL to focus on. If they become the IC at that point, somebody else has to step into the TL role and basically start from scratch on building their own understanding of the situation, which takes valuable time. Instead, if the on-call engineer takes the TL role, then as soon as they’ve gotten the incident response ball rolling and identified an IC, the TL can get back to work on the problem while the IC gets the rest of the response organized.

Another benefit of separating the IC and TL roles is that this allows you to draw from a larger pool of potential ICs within your company, because they don’t need to have the hard-core technical skills required of a TL. The IC only needs to be technical enough to understand the systems at a high level, and to engage in a discussion of requirements and priorities with the TL and the subject matter experts. When the IC and TL roles are separated this way, organizations often find that they have excellent ICs beyond their engineering team, among their project/program/product managers, customer care teams, marketing teams, and so forth. These folks often make great ICs because they understand not just the organization’s technology, but also its business and the overall business context in which the incident response is taking place.

To summarize:

  • Have separate people as IC and TL for an incident, so that the TL can focus on solving the problem while the IC deals with everything else.
  • For many responses, IC and TL are all you need.
  • If an on-call engineer initiates an incident response, recruit somebody else as IC, and make the on-call engineer the TL; let them quickly get back to working the problem, while the IC organizes the rest of the response.
  • Expand your pool of ICs beyond your engineering organization; your ICs need to be technical enough to engage with your TL and subject matter experts, but need not be technical experts themselves.
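
For teams that encode their incident process in tooling, the assignment rule summarized above could be sketched roughly as follows. This is only an illustration under stated assumptions: the names Responder, Incident, open_incident, and roles_separated are hypothetical and do not come from any particular incident management product, and the fallback of letting the on-call engineer hold both roles until someone else joins is my reading of the "roles can be handed off" point earlier in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Responder:
    """A person who can join a response (hypothetical model)."""
    name: str
    team: str  # e.g. "sre", "program-management", "customer-care"


@dataclass
class Incident:
    """Tracks who holds the two roles this post is about."""
    summary: str
    tech_lead: Responder
    incident_commander: Responder

    @property
    def roles_separated(self) -> bool:
        # True once the IC and TL are two different people.
        return self.incident_commander != self.tech_lead


def open_incident(summary: str, on_call: Responder,
                  ic_pool: List[Responder]) -> Incident:
    """Make the on-call engineer the TL and recruit someone else as IC.

    Assumption: if nobody else is available yet, the on-call engineer
    temporarily holds both roles, and roles_separated stays False as a
    reminder to hand off the IC role as soon as possible.
    """
    ic = next((person for person in ic_pool if person != on_call), on_call)
    return Incident(summary=summary, tech_lead=on_call, incident_commander=ic)


if __name__ == "__main__":
    alice = Responder("alice", "sre")
    bob = Responder("bob", "program-management")
    incident = open_incident("elevated API error rate", alice, [alice, bob])
    print(incident.tech_lead.name, incident.incident_commander.name,
          incident.roles_separated)
```

Note that the IC pool here is deliberately not restricted to engineers, which mirrors the last point in the summary: the candidate from program management is a perfectly valid IC.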
