Site Reliability Engineering Consulting
Making SRE Work for Your Organization
Google is the birthplace of Site Reliability Engineering (SRE), but what works for Google may not work for your organization because of differences in size, technology, culture, or other factors. Great Circle founder and principal Brent Chapman spent several years in Google’s legendary SRE organization, both as a high-level individual contributor and as manager of an SRE team. Brent also has extensive consulting experience with organizations large and small, which enables him to quickly analyze your unique circumstances and determine how best to adapt SRE to your situation.
Brent can help you bring SRE’s powerful philosophies, principles, practices, and perspectives to your organization:
What is your organization’s current situation? What challenges do you face? What capabilities do you have? What constraints must you operate under?
Which SRE principles are most relevant for your situation? Which SRE practices are feasible for your organization? Which can make the biggest difference the soonest? How can you best integrate SRE into your organization’s culture?
How should your SRE team be structured? What capability gaps do you have, and how can you best fill them? How can you bring existing staff up to speed with SRE? What should you be looking for in new hires?
How do you champion SRE principles in a constantly changing environment? How do you create buy-in throughout your organization, both up and down the management chain? What do you need to do to excel?
Brent brings his whole experience to a job, sharing ideas from all the different things he’s done, and pragmatically integrating them into solutions to the task at hand. When deploying complex services to demanding clients, it felt good to know that Brent was on the team. I could count on him doing his part of the job right the first time, but if necessary digging in with me as deep as necessary to find out why something wasn’t working.
Brent is the best possible person you could ever hope to hire in a start-up or rapidly growing organization. More than just his technical knowledge (impressive as that is), it’s Brent’s intense concern to “do the right thing” for the organization and its customers that makes him such an incredible asset.
What is Site Reliability Engineering (SRE)? How does it relate to DevOps?
According to Benjamin Treynor Sloss of Google, the “father of SRE”:
SRE is what happens when you ask a software engineer to design an operations team.
[…] SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. […] That means hiring more people to do the same tasks over and over again.
DevOps or SRE?
The term “DevOps” emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles—involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
The quotes above are from the excellent book Site Reliability Engineering: How Google Runs Production Systems, published by O’Reilly. Brent can help you understand these principles and practices, and determine how best to apply them to your organization’s unique circumstances.
Make Site Reliability Engineering work for your organization
Brent Chapman is an expert at emergency management and at helping organizations prepare for and learn from emergencies, working from a strong background in IT infrastructure and site reliability engineering (SRE).
While Brent was part of the legendary SRE organization at Google, he created and launched the Incident Management at Google (IMAG) system that is now used throughout the company for emergency management, and helped refine the Postmortems at Google (PMAG) system that the company uses to learn from incidents large and small. Brent is also a former air search and rescue pilot and incident commander, an emergency dispatcher and dispatch supervisor for major art & music festivals and events, a Community Emergency Response Team (CERT) member and instructor, and a highly regarded speaker at conferences worldwide.
Throughout his career, Brent has designed, built, managed, and scaled IT infrastructure and teams for everything from embryonic startups to giants such as Google, Apple, and Netflix. He has worked with dozens of organizations both in Silicon Valley and around the world, as well as with a variety of non-profit and government entities.