In one of the Slack channels that I frequent, someone recently asked what a reasonable duty cycle is for engineers in a 24/7 on-call rotation with a single-digit number of pages per week. In other words, under those circumstances, is it reasonable for a given individual to be on call one week in three, one week in four, or what? At least according to Google SRE, “reasonable” is a lot less often than that.
It depends a lot on when those pages occur, and how disruptive they are to your work and your life. If they’re all during your normal workday, and each takes only a few minutes to deal with, it’s not a big deal (though if they take only a few minutes to deal with, can’t you automate them away?). On the other hand, if you’re getting awakened every night by pages, and each page takes hours to deal with, even a “single-digit number of pages per week” can be a huge burden.
It also depends somewhat on how quickly you are expected to respond to each page. How long is it, from the time you are paged, until you are expected to be hands-on-keyboard, connected, logged in, and actively working on the problem? If that’s 30 minutes, then you can still have most of a life even when on call; you can go to dinner, go to the movies, and so forth, as long as you bring your laptop and mobile hotspot, and you’re prepared to step out if you get paged. If your response expectation is 5 minutes, on the other hand, you’re pretty much tied to your home or office; you can step out long enough for a trip to the laundry room to change loads, but that’s about all.
At Google, the two standard response levels for SRE on-call rotations are 30 minutes and 5 minutes, and a “single digit number of pages per week”, or slightly more, is typical. On-call SREs at Google are paid for the time they spend on call that falls outside of their normal workday; for a 30-minute rotation, they’re paid at 33% of their normal rate, and for a 5-minute rotation, at 66%. In order to prevent burnout, the amount of on-call bonus pay is capped at 15% of base salary; this is a constraint on managers, who are strongly expected to keep all their people under this 15% cap, and must seek VP-level approval to exceed it (and had better have a damned good explanation of why this is a temporary situation that will be fixed within the next quarter). If you work backwards through the constraints, you find that it takes a 7-person rotation to provide 30-minute on-call coverage 24/7 without anybody exceeding the cap, and a 14-person rotation to provide 5-minute coverage. And therefore, any given individual would be on call no more than one week in seven for a 30-minute response, or one week in fourteen for a 5-minute response.
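To make the “work backwards through the constraints” step concrete, here is a rough sketch of the arithmetic in Python. It rests on assumptions that are mine rather than Google’s: a 40-hour normal work week, one person covering a full week at a time, and the 15% cap applied to the long-run average. The function name `min_rotation_size` is purely illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours in a week


def min_rotation_size(pay_rate, cap=0.15, work_hours_per_week=40):
    """Approximate rotation size needed to keep on-call pay at the cap.

    pay_rate: fraction of the normal hourly rate paid for off-hours on-call
              time (0.33 for 30-minute response, 0.66 for 5-minute response).
    cap: maximum on-call bonus as a fraction of base salary.
    work_hours_per_week: ASSUMED length of the normal work week.
    """
    # Hours in an on-call week that fall outside the normal workday.
    off_hours = HOURS_PER_WEEK - work_hours_per_week
    # Bonus earned in one on-call week, as a fraction of one week's base pay.
    bonus_per_oncall_week = off_hours * pay_rate / work_hours_per_week
    # With n people, each is on call 1/n of the time, so the long-run bonus
    # is bonus_per_oncall_week / n. Setting that equal to the cap gives n.
    return bonus_per_oncall_week / cap


for label, rate in (("30-minute", 0.33), ("5-minute", 0.66)):
    print(f"{label} response: about {min_rotation_size(rate):.1f} people")
```

Under these assumptions the answer comes out at roughly 7 and 14 people, in line with the figures above. The exact number shifts a little with the assumed work week and with rounding in the pay rates.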
Google is perhaps an outlier in this respect. Most companies are probably going to put their people on call more frequently than Google does, but they should beware that they’re burning the candle at both ends. They’re going to disrupt the ongoing project-based work that their engineers are expected to be accomplishing, and they risk driving their engineers to on-call burnout.
Many companies also don’t explicitly pay their engineers for being on call, saying “it’s just part of the job”, but I think that’s short-sighted. Being on call is a burden, and having to pay people for it helps keep it under control by providing an incentive for the company to invest in making its systems more stable, reliable, and automated. And in the end, the company, its customers, and its employees all benefit from that.