Everybody wants to be a hero when there’s an outage or similar service problem, swooping in to save the day with their knowledge, skill, and wisdom, and then reaping their reward of praise and adulation from management and peers. However, if heroics are what you reward, then you’ll get more heroics, and that’s not good for your team or your service. Instead, you should prepare to respond *without* heroics, by using good incident management practices, and reward folks for avoiding the need for heroics.
If a heroic response is needed, it means that something has gone wrong with your service, and your users are impacted. Obviously, you’d prefer to minimize that impact, either by avoiding the problem in the first place, or by catching it sooner and dealing with it faster. If you reward heroics in responding to such problems, however, you risk inadvertently creating an incentive to let problems fester until heroic measures are required (and rewarded).
Heroic responses are bad for both the hero and their organization. The hero risks exhaustion and burnout, and the time and energy required for heroics cut into their “real” job, which is usually project-oriented; heroics delay the hero’s project work and often degrade its quality. Organizationally, the hero becomes a bottleneck or a single point of failure. If they are unavailable for some reason, outages take longer to resolve and have more serious consequences and side effects. Such outages are also more disruptive for the rest of the team, because more of the team is needed for the response.
Preparation and good incident management practices enable you to respond effectively to outages and service problems as a routine matter, rather than as a crisis demanding a heroic response. After the problem is resolved, a good post-incident review (in the form of a blameless postmortem) helps you become even better prepared for future incidents.
Often, it seems that a heroic response is the only option, for reasons such as nobody else knowing the systems or having the necessary passwords and other admin privileges. However, these are actually signs that the organization failed to prepare before the incident occurred, through training, documentation, development of monitoring and tools, and so forth. Outages and other problems are going to occur in any system; by choosing *not* to prepare, the organization is implicitly choosing to require a heroic response to future problems, thus perpetuating the cycle of heroic responses.
By all means, recognize heroic actions when they do occur, but also ask the hard questions about why such heroic measures were needed. What could have been done to prevent the need for heroic measures?
Finally, make sure that the folks who do good, solid work on *preventing* problems are getting properly recognized for their work. They are your organization’s real heroes.