Mastering Outages – One‑day Class
Friday 18 May 2018, San Francisco Bay Area
Are you ready for your next outage?
You do your best to avoid problems and outages, but sometimes things go wrong.
Be prepared to respond effectively and efficiently.
Users expect uptime, all the time
Uptime is critical for your success. Whether it’s a massive web service like Google Search, or the file server in your 6-person office, if your service is unavailable, nothing else about it matters.
Mastering outages is critical
Mastering outages makes a difference by reducing downtime. This protects your users, your reputation, your staff, and your bottom line.
Preparation is the key to mastering outages
You prevent an emergency from becoming a crisis with a response that is quick, smooth, effective, and efficient. This enables you to reliably solve the problem with minimum fuss and disruption.
Are you prepared?
Is your organization prepared to respond effectively to outages, or do you simply react haphazardly?
We can help!
Great Circle can strengthen your organization’s incident management. We can help you identify and develop the capabilities that you need.
Are you satisfied with your answers to these questions?
- Who is going to respond? How are they going to be alerted to respond? What steps are they going to take when alerted?
- How can they enlist help from others if needed?
- How are all the responders going to organize and coordinate their activities?
- How are the responders going to communicate with each other, and with stakeholders beyond the response?
- How do you scale up and scale down the response, as the situation unfolds?
- How do you deal with multiple incidents simultaneously?
- How do you wrap up a response and return to normal operations?
- How do you follow up with a blameless postmortem, to improve your next response?
- What incident management tools, technologies, and practices are compatible with your organization’s environment and culture?
- What is your senior management’s role in responding to outages? How can they support and enable effective responses, rather than inadvertently disrupting them?
Outages are inevitable, but downtime is not
Learn how to reduce downtime and the impact of outages on your company, customers, and staff.
What You Will Learn
- How to launch and manage an effective response
- How to adapt your response on the fly as the situation evolves
- How to communicate effectively among responders
- How to communicate beyond the responders, to the rest of the company, customers, and others
- How to conclude a response and return to normal operations
- How to follow up effectively with a blameless postmortem, to improve your next response
- How to deal with multiple incidents simultaneously
Who Should Attend
- Individual contributors who respond to outages and other IT incidents
- Managers who want to prepare their teams to respond effectively
- Leaders who want to reduce the impact of outages on their company’s customers, reputation, and staff
The proven incident management practices and principles taught in this class apply to all types of environments:
- Service providers or service consumers
- Internal or external users and services
- SaaS, PaaS, IaaS, or other delivery model
- DevOps, SRE, Agile, traditional, or other operating model
- On-premises, shared, or cloud infrastructure
“I was the lucky beneficiary of Brent’s work on Incident Management at Google. His leadership and direct effort resulted in nothing short of a full reformulation of this key competency of SRE. […] Through Brent’s work and training we were able to build structure and automation around this critical area of expertise, which has resulted in significant reductions in MTTR and improvements in the org’s ability to learn and grow from its service interruptions (e.g. through postmortems).”
Marc Alvidrez
About the instructor, Brent Chapman
Brent Chapman is an expert at emergency management, and at guiding organizations to prepare for and learn from emergencies, working from a strong background in IT infrastructure and site reliability engineering (SRE).
As a leader in Google’s legendary SRE organization, Brent convinced senior management of the need to strengthen and standardize the company’s incident management practices, and created the Incident Management at Google (IMAG) system that is now used throughout the company. He also helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small.
Brent brings a unique perspective to his work in IT, as a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, and a Community Emergency Response Team (CERT) member and instructor.
Throughout his career, Brent has designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. He is the coauthor of the highly regarded O’Reilly book Building Internet Firewalls, the developer of widely used open source software, and a popular speaker at conferences worldwide. He has worked with dozens of organizations both in Silicon Valley and around the world, as well as with a variety of non-profit and government entities.