Intercom’s mission is to make internet business personal. But it’s impossible to be personal when your product is broken. Uptime is critical to the success of our business, not just because our customers are paying us, but also because we heavily dogfood our own product. If our product is down, we acutely feel our customers’ pain.
Uptime is influenced by many factors, such as the software architecture and the quality of day-to-day operations. However, quite often it comes down to having a human on call, responding to an alert from PagerDuty. On call work like this can be a powerful customer-oriented activity that connects engineers to the value customers get from your product. It can also be a great learning and growth opportunity – after all, outages and errors can be complex events to understand and remediate.
But at the same time, being on call out of office hours is inherently disruptive to your life. You need to be ready to respond quickly and competently to an alert about something being broken. Even without being paged, being on call creates anxiety – I know from personal experience that it is very disruptive to sleep, even if nothing actually breaks. Being on call regularly can lead to burnout, apathy or a general desire to never see a computer again.
The history of on call at Intercom
Back in the early days of Intercom, our CTO Ciaran was the entirety of the on call team, both in and out of the office. As Intercom grew, we built an operations team to help Ciaran out. Soon after, new teams started building a lot of new features and services, and they assumed full on call responsibilities.
This felt natural at the time, as it was a lightweight way to scale our on call team and was consistent with our values that emphasized the importance of ownership. Without deliberately planning it, we had ended up with four or five teams that were regularly being paged out of hours. The remaining product teams had a handful of alarms that rarely, if ever, resulted in an out of hours page.
We realized that we had ended up with an on call setup we weren’t proud of, and had a number of critical problems that we wanted to solve, such as:
There were too many people on call at any moment in time – our infrastructure wasn’t so large that it required at least five engineers to have their weekends disrupted.
The quality of our alarms and on call procedures was inconsistent across teams, and we were using ad hoc review processes for new and existing alarms. Runbooks (the procedures to follow when an alarm fires) were mostly conspicuous by their absence.
There were inconsistent expectations for engineers depending on which team they ended up working on. For example, only the original operations team had any form of compensation for doing on call shifts other than time off in lieu.
There appeared to be a general level of tolerance for unnecessary out of hours pages.
Finally, it doesn’t always suit everybody to do this type of work. Life circumstances can mean that on call shifts are just too disruptive for some people.
Finding the right on call process
We decided to create a new virtual team that would take all out of hours on call work from every team. The team would consist of volunteers, not conscripts, from any team in the engineering organization. Engineers would rotate out of the virtual team after six months or so, having done a handful of weeks on call. Thankfully, we had no problems getting enough volunteers to start the virtual team.
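To make "a handful of weeks" concrete, here is a minimal sketch of a round-robin weekly rotation. It is not our actual scheduling tooling: the `weekly_rotation` function, the volunteer names, the start date and the team size of seven are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import cycle


def weekly_rotation(volunteers, start, weeks):
    """Assign one volunteer per week, round-robin, beginning on `start`."""
    if not volunteers:
        raise ValueError("at least one volunteer is required")
    people = cycle(volunteers)
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, next(people)))
    return schedule


# Hypothetical roster: seven volunteers covering a 26-week (six-month) term.
volunteers = ["ana", "ben", "chi", "dev", "eve", "fay", "gus"]
schedule = weekly_rotation(volunteers, date(2024, 1, 1), 26)

# Count how many week-long shifts each volunteer ends up with.
shifts_per_person = Counter(name for _, name in schedule)
```

With these assumed numbers, each volunteer covers three or four week-long shifts over the six months, which matches the "handful of weeks" described above.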
The team then agreed upon and defined what acceptable alarms and runbooks look like, and described an acceptance process for moving alarms over to the new on call team. They defined all our alarms in code using a Terraform module, and started using peer review for every change. We put in place a level of compensation that we were happy with for taking a week’s worth of on call shifts. We also created a “Level 2” escalation team made up of engineering managers to be a single point of escalation for the on call engineer.
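An acceptance process like this can be expressed as a simple check that runs before an alarm is handed over. The sketch below is illustrative only: the `Alarm` fields and the three criteria (a runbook exists, the alarm is actionable out of hours, the definition was peer reviewed) are assumptions, not the exact rules the team wrote down, and the real alarm definitions live in Terraform rather than Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; field names are assumptions for this sketch."""
    name: str
    owning_team: str
    runbook_url: str = ""
    actionable_out_of_hours: bool = False
    peer_reviewed: bool = False


def acceptance_problems(alarm):
    """Return the reasons an alarm is not ready for the out-of-hours team."""
    problems = []
    if not alarm.runbook_url.strip():
        problems.append("missing runbook")
    if not alarm.actionable_out_of_hours:
        problems.append("not actionable out of hours")
    if not alarm.peer_reviewed:
        problems.append("definition has not been peer reviewed")
    return problems


def is_accepted(alarm):
    """An alarm is accepted only when no problems are found."""
    return not acceptance_problems(alarm)


# Hypothetical examples: one alarm that is ready and one that is not.
ready = Alarm(
    name="api-error-rate-high",
    owning_team="platform",
    runbook_url="https://runbooks.example.com/api-error-rate-high",
    actionable_out_of_hours=True,
    peer_reviewed=True,
)
draft = Alarm(name="queue-depth-warning", owning_team="messaging")
```

Returning a list of reasons, rather than a bare yes or no, gives the originating team something specific to fix before resubmitting the alarm.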
It took a few months of hard work to iron out our process, but our on call went from being spread across more than 30 engineers to just 6 or 7. Our engineering teams still do on call for their features and services during office hours, which is when things tend to break the most, but our out of office on call is fully owned by a dedicated set of volunteers.
What we learned
After we launched our virtual on call team, we expected a large amount of follow-up work after an on call shift, such as researching the causes of alarms and collaborating on solving the problem that caused the page. However, our engineering teams took very strong ownership of anything that caused a page, and follow-up work has generally been acted on promptly. We also haven’t had to threaten the nuclear option – handing an alarm back to the team where it originated due to lack of follow-up and forcing them to do out-of-hours on call again.
Our formal escalation process has rarely been used. A more common scenario is that the on call engineer is informally helped out by engineers who happen to be online at the time, notably those in our San Francisco office. Numerous problems have been repaired or mitigated through on-the-fly collaboration and teamwork.
Engineers in our San Francisco office have joined the team fully and gone beyond just ad hoc support. There is a degree of added overhead, but spreading team membership across multiple offices has been very successful for us – it’s a great way to build relationships and depth of knowledge about the technology stack which we all work on.
The experience of being an engineer in Intercom is now way more consistent across our teams, and we can confidently advertise Systems Engineer positions on our Careers website stating that there is no on call required, unless you really want to do it.
Along with foundational work stabilizing and scaling out our datastores, the consistent focus on resolving pages has resulted in the number of out-of-hours pages dropping to fewer than 10 a month, a number we’re very proud of.
We are continuing to work to maintain and improve our on call team, and as Intercom grows we may need to revisit these decisions – what works today may not work the next time our team doubles in size. That said, this work has been very positive for our engineering organization, significantly improving our engineers’ quality of life, the quality of our on call response and, above all, the customer experience.