Incident Management – Consulting and Training
Bring proven, effective incident management to your organization through my consulting and training services
Together, we can strengthen your organization’s incident management capabilities so that you can respond to incidents quickly, effectively, and efficiently. I can help you:
- Conduct a gap analysis of your incident management capabilities:
  - What incident management capabilities do you need?
  - What incident management capabilities do you already have?
  - How do you get from where you are, to where you want to be?
- Develop and refine your organization’s incident management plan
- Train your staff in incident management principles and practices
- Educate your senior management about their role in effective incident management (i.e., how best to support and enable it, rather than inadvertently disrupt it)
- Identify incident management tools and technology that are compatible with your organization’s environment and culture
- Create, conduct, and critique incident management exercises
- Facilitate a blameless incident review of a past incident
My one-day Incident Management Tutorial is an effective way to bring your team up to speed on incident management and get everyone on the same page. The tutorial covers:
- The basic principles of Incident Command
- How to prepare for an effective response
- How to launch and manage an effective response
- How to evolve your response on the fly, scaling it up and down, as both the situation and your resources change
- How to communicate effectively among responders
- How to communicate beyond the responders, to management, customers, investors, regulators, the public, and others
- When and how to conclude a response and return to normal operations
- How to follow up effectively with a blameless incident review
In addition, the tutorial includes:
- A guided incident management exercise, so that participants can practice applying these principles and methods
- Discussions of how best to adapt and apply these lessons to your organization
I’ve helped Slack, Google, and others develop these critical capabilities
When I hired Brent at Slack I asked him to give us a year to help us transform our engineering incident response. We were lucky to get over 3 years of Brent’s time, and he made such a huge impact. He led the creation of the Slack engineering incident response, taught endless classes, represented the program to executives, and scaled the program as Slack grew. Slack has a world class incident response process thanks in no small part to Brent’s efforts.
I was the lucky beneficiary of Brent’s work on Incident Management at Google. His leadership and direct effort resulted in nothing short of a full reformulation of this key competency of SRE. […] Through Brent’s work and training we were able to build structure and automation around this critical area of expertise, which has resulted in significant reductions in MTTR and improvements in the org’s ability to learn and grow from its service interruptions (e.g. through postmortems).
Bring effective Incident Management to your organization
I am an expert at emergency management for IT services, guiding companies to prevent, prepare for, respond to, and learn from emergencies. I work from a strong background in IT infrastructure, site reliability engineering (SRE), and public safety emergency management.
Slack recruited me to lead incident response and incident management for their Engineering organization and the company as a whole. I designed and built Slack’s incident management capabilities; kept incident management running smoothly day-to-day; helped the company learn from, prevent, and prepare for incidents; and shared Slack’s incident management story with Slack’s customers and the industry.
As a leader in Google’s legendary SRE organization, I convinced senior management of the need to strengthen and standardize the company’s incident management practices, and created the Incident Management at Google (IMAG) system used throughout the company. I also helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small.
I bring a unique perspective to my work in IT, as a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, and a Community Emergency Response Team (CERT) member and instructor.
Throughout my career, I have designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. I have worked with dozens of organizations in Silicon Valley and globally, and with various non-profit and government entities. I am the co-author of the highly regarded O’Reilly book Building Internet Firewalls, the developer of widely used open-source software, and a sought-after speaker at conferences worldwide.
I have a rare combination of experience as an emergency manager, technology manager, people manager, software developer, network/systems engineer, and educator. My extensive experience enables me to quickly and effectively dive in, assess a situation, and deliver results.