Founder and Principal
I am an expert in emergency management for IT services, guiding companies to prevent, prepare for, respond to, and learn from emergencies. I draw on a strong background in IT infrastructure, site reliability engineering (SRE), and public safety emergency management.
Slack recruited me to lead incident response and incident management for their Engineering organization and the company as a whole. I designed and built Slack’s incident management capabilities; kept incident management running smoothly day-to-day; helped the company learn from, prevent, and prepare for incidents; and shared Slack’s incident management story with Slack’s customers and the industry.
As a leader in Google’s legendary SRE organization, I convinced senior management of the need to strengthen and standardize the company’s incident management practices, and created the Incident Management at Google (IMAG) system used throughout the company. I also helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small.
I bring a unique perspective to my work in IT as a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, and a Community Emergency Response Team (CERT) member and instructor.
Throughout my career, I have designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. I have worked with dozens of organizations in Silicon Valley and globally, and with various non-profit and government entities. I am the co-author of the highly regarded O’Reilly book Building Internet Firewalls, the developer of widely used open-source software, and a sought-after speaker at conferences worldwide.
I have a rare combination of experience as an emergency manager, technology manager, people manager, software developer, network/systems engineer, and educator. My extensive experience enables me to quickly and effectively dive in, assess a situation, and deliver results.
Outages and other IT emergencies are expensive in many different ways, including lost sales, lost productivity, damaged reputation, and damaged morale.
I’m an expert in site reliability engineering, which is about avoiding these problems in the first place. However, despite everyone’s best efforts, sometimes things still go wrong, so I specialize in incident management, which is about resolving these problems quickly and effectively when they do occur.
It’s essential to be prepared, and to learn from each incident so that you’re better prepared for next time.
I can guide your organization to develop these critical incident management capabilities.