How do you find the space to take on high-risk, high-reward projects, especially when it involves a system the whole company relies on?
Every company has a business-critical system that was built when the company was a certain size and that, with a proven track record of stability over the years, nobody has much reason to touch. But as the company grows, a few problems start to surface, though nothing that can’t be solved by spinning up more instances and spending more money on Amazon Web Services (AWS).
However, as the cycle keeps repeating, the problems become bigger and bigger. Soon you’re in a vicious circle, left with a suboptimal and expensive juggernaut of a system. Deployments are messy if done during peak hours. Writing new tests is tedious and iterations are slowing down. None of this makes new engineers eager to jump in and do the work of their lives. Oh, and there are the costs. Running 280 large AWS instances comes with a price tag that would make your head spin!
At Intercom, that system was Nexus, our real-time messaging system, which was built in-house and routes hundreds of millions of messages every week. It worked fine, but it was demanding more and more resources to keep running efficiently. Updating dependencies was risky and running tests was slow.
Several people tried to optimize and simplify Nexus but there was always more pressing work to be done. After all, optimizing systems that already work adequately, if expensively, is difficult to make a priority.
High-risk, high-reward
So how do you find the space to take on such a high-risk, high-reward project? How do you approach the sort of experimentation necessary to rewrite a system the whole company depends on? One of our engineers, William Tabi, thought he knew how – he decided to turn Nexus on its head and rewrite the system entirely in Go. He thought this would be easier than trying to update and retrofit the existing inefficient and overly complex Java-based implementation. However, finding the time to do it among his other commitments was going to be a challenge.
That’s when a new initiative opened up some space for exactly this sort of project. At the engineering level, Intercom’s quarters are broken down into two six-week cycles with corresponding goals. This leaves one extra week per quarter, and last year it was decided to turn that extra week into a “wiggle week”, a dedicated space on the calendar that lets individual contributors decide what they want to work on for that week, with minimal planning. Wiggle week projects are initiated, proposed and implemented from the ground up by the people doing the work. They are not generally on Intercom’s immediate road map, which is decided top down, with defined goals and timelines for completion.
Hence, the outcome of a wiggle week project doesn’t really matter. Wiggle weeks make room for the projects that sit on the “I’ll-get-to-it-in-my-free-time” to-do list. They give developers and designers the freedom to be creative and pursue their own ideas for making Intercom better in the wider sense, be it for employees or for customers.
Fostering innovation
Wiggle weeks are, in a sense, Intercom’s equivalent of Google’s famous 20% time, which was conceived as a way of fostering innovation. But because the wiggle week is baked into the quarterly schedule, the space for creativity and innovation is made explicit, rather than being incumbent on people to carve out time and space for themselves.
Wiggle weeks allow for the type of high-risk, high-reward project that William was envisaging: rewriting one of Intercom’s core functionalities using Go, a language he had never used before.
Understandably enough, that’s the sort of proposal that can be hard to get a green light for. Convincing our manager to use a completely new approach was no easy feat. Managers are well aware that engineers can get very excited by shiny new baubles, and that an engineer’s enthusiasm for a new technology does not mean it will produce a better solution than what is already in place. There is a good reason conventional engineering wisdom holds that you should “work with what you know, rather than what you don’t know”.
Taking bets on big high-risk undertakings is not, in general, a sound approach, especially when the real-time service is so integral to Intercom that we cannot afford for it to go down for even a few minutes. This was a scenario where it looked like the stability of the existing Java solution was more important than the potential savings and efficiencies offered by rewriting it in Go.
Addressing concerns
Our managers voiced concerns that needed to be addressed with initial testing data and statistics. We had to demonstrate that the new solution worked and was much more efficient than the old Java solution, and an upcoming wiggle week gave us the chance to do just that.
William used one wiggle week to build a functional prototype to see if it would actually be possible to make the basic real-time functionality of Intercom work in Go. Once he had established that, he ran some tests to see how efficient it was compared to the old Java version, and the results were very promising.
Indeed, they were so promising that he enlisted the two of us to devote a second wiggle week to the project. Neither of us had worked with Go before, but we helped him polish the project from its rough and early state into a production-ready state, complete with a suite of tests to make sure that it worked exactly the same way as the old one.
The results were clear-cut, to say the least. Where engineers used to waste half a day setting up Nexus-Java, Nexus-Go is set up in a couple of minutes. It builds in 1.5 seconds, compared to the 41 seconds it took to build the Java version. The Nexus-Java tests do run in a bit less time than the Nexus-Go tests, but that is because of the Java version’s low test coverage and extensive use of mocked objects, whereas in Go we don’t mock anything. Furthermore, writing tests in Go is a breeze, so test coverage went up too. Higher test coverage made us more comfortable with introducing changes and iterating quickly.
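To give a flavour of what a mock-free check can look like in Go, here is a minimal sketch. It is not Nexus code: `echoHandler` and `roundTrip` are hypothetical names invented for illustration, and the handler simply upper-cases whatever it receives. The point is that the standard library's `net/http/httptest` package can start a real in-process HTTP server, so the check exercises real network code rather than a mocked object.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// echoHandler is a hypothetical stand-in for a real message endpoint
// (not part of Nexus): it replies with the upper-cased request body.
func echoHandler(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "cannot read body", http.StatusBadRequest)
		return
	}
	fmt.Fprint(w, strings.ToUpper(string(body)))
}

// roundTrip starts a real in-process HTTP server on a loopback port,
// posts msg to it with a real HTTP client, and returns the reply.
// Nothing is mocked: the request travels through the actual HTTP stack.
func roundTrip(msg string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(echoHandler))
	defer srv.Close()

	resp, err := http.Post(srv.URL, "text/plain", strings.NewReader(msg))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	reply, err := io.ReadAll(resp.Body)
	return string(reply), err
}

func main() {
	reply, err := roundTrip("hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(reply)
}
```

In a real project the body of `main` would live in a `_test.go` file as a `func TestXxx(t *testing.T)` and be run with `go test`. The trade-off is the one described above: tests like this tend to run a little slower than heavily mocked ones, but they cover the code that actually ships.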
Significant advantages
The advantages in deployment are stark, too. Nexus-Java deployment took 2 hours on average, and deploying during peak hours was flaky at best. Nexus-Go, on the other hand, deploys in 25 minutes without a hiccup, every time, which is 4.8 times faster.
But the biggest advantages were in performance, which led to increased stability and major cost savings. While Nexus-Java used around 30GB of memory per instance, Nexus-Go uses around 10GB. Since we didn’t need so much memory, we switched to compute-optimized instances. This, in turn, reduced our average message latency from 80 to 7 milliseconds. We’re now able to have hundreds of thousands of active WebSocket connections per server. And it was even more efficient than we thought: we managed to reduce the number of AWS instances from 280 to 50, which cut our costs by a factor of approximately 5.6. That’s a pretty significant saving, by anyone’s standards.
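The improvement factors quoted above are simple before/after ratios, and a few lines of Go are enough to check them. In this sketch the figures come from this article, while the `ratio` helper and the labels are ours. Note that reading the 5.6 figure as a cost saving assumes cost scales roughly with instance count, which ignores any price difference between instance types.

```go
package main

import "fmt"

// ratio reports how many times smaller `after` is than `before`.
func ratio(before, after float64) float64 {
	return before / after
}

func main() {
	// Before/after figures as quoted in the article.
	figures := []struct {
		name          string
		before, after float64
	}{
		{"deploy time (minutes)", 120, 25},
		{"AWS instances", 280, 50},
		{"memory per instance (GB)", 30, 10},
		{"average message latency (ms)", 80, 7},
	}

	for _, f := range figures {
		fmt.Printf("%s: %.1fx\n", f.name, ratio(f.before, f.after))
	}
}
```

Running it reproduces the 4.8x deployment speed-up and the 5.6x reduction in instance count. It also shows roughly 3x less memory per instance and roughly 11x lower average latency.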
Make space for unexpected wins
The success of the Nexus-Go project was primarily a vindication of William’s instincts and our hard work. But it was also an example of how beneficial it can be to encourage experimentation and innovation, how dedicating time and space to approach problems outside the usual company priorities can result in all sorts of unexpected rewards. Sure, it won’t always bring such impressive results and not every experiment will come off, but building the concept of innovation time into the culture allows for these sorts of unexpected wins. And it’s also a lot of fun.