Uptime matters, but so do your people. At Intercom, keeping our product online and working well at all times is critical to the success of our business.
Out-of-hours on-call is inherently disruptive to your life as an engineer. You need to be ready to respond quickly and competently to an alert about something being broken. This means having a decent Internet connection, a computer, power for the computer, whatever you’re using for 2FA, and your passwords available.

However, we realized that we had ended up with an on-call setup that we weren’t proud of, with a number of problems to solve. Too many people were on call at any one time. The quality of alarms and runbooks was inconsistent across teams, and the review processes for new and existing alarms were ad hoc. We decided to tackle these problems by creating a new virtual team, made up of volunteers, not conscripts, from teams across the engineering organization, that would take over all out-of-hours on-call work. This talk covers the process we applied, the positive impact on our on-call, and the lessons learned.