Creating an effective process for on-call engineering teams

Uptime matters, but so do your people. At Intercom, keeping our product online and working well at all times is critical to the success of our business.

Speakers: Brian Scanlan

July 29, 2020

Out-of-hours on-call is inherently disruptive to your life as an engineer. You need to be ready to respond quickly and competently to an alert about something being broken. This means having a decent Internet connection, a computer, power for the computer, whatever you’re using for 2FA, and your passwords available.

We realized that we had ended up with an on-call setup that we weren’t proud of, with a number of problems to solve. Too many people were on-call at any one time. The quality of alarms and runbooks was inconsistent across teams, and review processes for new and existing alarms were ad hoc. We decided to tackle these problems by creating a new virtual team to take over all out-of-hours on-call work, made up of volunteers, not conscripts, from teams across the engineering organization. This talk covers the process we applied, the positive impact on our on-call, and the lessons we learned.