Site Reliability Engineering Consulting
Making SRE Work for Your Organization
Google is the birthplace of Site Reliability Engineering (SRE), but what works for Google may not work for your organization, due to differences in size, technology, culture, or other factors. Great Circle founder and principal Brent Chapman spent several years at Google as part of its legendary SRE organization, both as a high-level individual contributor and as manager of an SRE team. Brent also has extensive consulting experience with organizations large and small, which enables him to quickly analyze your unique circumstances and figure out how best to adapt SRE to your particular situation.
Brent can help you bring SRE’s powerful philosophies, principles, practices, and perspectives to your organization.
What is Site Reliability Engineering (SRE)? How does it relate to DevOps?
According to Benjamin Treynor Sloss of Google, the “father of SRE”:
SRE is what happens when you ask a software engineer to design an operations team.
[…] SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. […] That means hiring more people to do the same tasks over and over again.
DevOps or SRE?
The term “DevOps” emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles—involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
Benjamin Treynor Sloss
Make Site Reliability Engineering work for your organization
Brent Chapman is an expert in emergency management and in helping organizations prepare for and learn from emergencies, drawing on a strong background in IT infrastructure and Site Reliability Engineering (SRE).
While Brent was part of the legendary SRE organization at Google, he created and launched the Incident Management at Google (IMAG) system that is now used throughout the company for emergency management, and helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small. Brent is also a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, a Community Emergency Response Team (CERT) member and instructor, and a highly regarded speaker at conferences worldwide.
Throughout his career, Brent has designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. He has worked with dozens of organizations both in Silicon Valley and around the world, as well as with a variety of non-profit and government entities.