Lessons Learned from On-Call and Managing Alert Noise
How we scaled infrastructure 20x and kept developers sane.
Tomasz Finc | December 10, 2016
Building a workplace where every engineer feels happy, productive, and empowered to quickly see the results of their work should be the goal of all engineering teams. In practice, though, that’s challenging, requiring consistent re-assessment of tech, process, and the magic pixie dust that makes your team and products unique.
Recently, we told the story of how we scaled our MySQL databases to not be crushed by our user growth. That work was critical but only told part of our story. Today, we’re going to tell the next part of our journey and how we took charge of our operations infrastructure.
When you have to scale your infrastructure 20x like we did, you take out a lot of loans and quickly accumulate technical debt. Some of it is natural and necessary when you have a constrained amount of resources (people, money, etc.), and as a growing startup you need to iterate quickly and keep moving and innovating to stay alive. Over time, though, you start to notice the cruft building up. All of us have had that moment where we know we should pay down the loan but either can’t or choose not to do it. Wait a couple more months and the blinders are so great that it just seems like business as usual, even though many underlying issues have not been fixed.
This point is where you must step back and think about strategic improvements and not simply firefighting. At Nylas, we decided to focus on on-call, alerting noise, and removing distractions.
An on-call rotation is one of the best ways for engineers to truly see the unexpected things that happen in production. If you don’t have your whole engineering team on call, then you’re losing out on precious training and growth for both your teams and your infrastructure. Up to this point, the Nylas on-call rotation was limited to backend engineers.
To increase our rotation size and grow in-house knowledge, we decided to train any interested engineer to be on call. Backgrounds didn’t matter as long as they were eager and hungry to grow. We never strove to cover everything that could go wrong. You can’t. If you could, then you should have already fixed it. Instead, we used the training as an exercise to generate both excitement and critical thinking about our infrastructure.
On-call bootcamp consisted of architecture presentations, hands-on deployment of services, and business-hours pager duty to give engineers the responsibility of on-call while still having a quick escalation path. We weren’t sure how many new people would join our regular rotation, but we were overjoyed when our rotation more than doubled by simply asking for help.
During training, we even heard someone say “Can I be on call more?” which made all of us smile with glee.
That shift was critical because we had changed the perception of on-call from something that was painful and uninspiring to something that acted as a vehicle for positive infrastructure change. Too many times, engineering teams see on-call as a necessary evil, when they should instead be using it as one of the best motivators for change.
Collective ownership of development and operations creates a shared definition of success between all team members and helps to bridge any divide that may exist between frontend and backend engineers. For instance, at Nylas, we expect all engineers to be able to ship their own code, which requires them to know not only the impact of their changes but also how the infrastructure will react to them. While it may not be possible for an engineer to know all parts of the stack, having every engineer on call acts as a great forcing function to continue learning when you come across something new.
Lesson Learned: Everyone should be on call
Alerting when something is going wrong is critical. Alerting so much that your engineers are being constantly interrupted and are in firefighting mode is a recipe for alert fatigue in the short term and burnout over the long term. We had alerts that were caused by actual broken pieces of our infrastructure, mixed with persistent noise. These constant interruptions and the lack of clarity surrounding noise versus real problems made it difficult to prioritize the work we needed to do.
To remedy the issue, we first started tracking alert noise. We thought about doing something fancy like writing scripts using the PagerDuty API but ultimately settled on a simple document to keep track of what was being loud and prioritize tackling the noisiest alerts first.
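For teams that would rather script the tally than keep a document, the same prioritization can be sketched in a few lines of Python. This is only an illustration: the alert names and the `pages` log below are hypothetical, and it reads from a hand-built list rather than from the PagerDuty API.

```python
from collections import Counter

def rank_noisy_alerts(pages, top_n=5):
    """Count pages that did NOT point at a real problem, noisiest alert first."""
    noise = Counter(name for name, was_real_problem in pages if not was_real_problem)
    return noise.most_common(top_n)

# Hypothetical page log: (alert name, did it point at a real problem?)
pages = [
    ("api_success_rate_99", False),
    ("api_latency", False),
    ("api_success_rate_99", False),
    ("disk_full", True),
    ("api_success_rate_99", False),
    ("api_latency", False),
]

for name, count in rank_noisy_alerts(pages):
    print(f"{name}: {count} noisy pages")
```

Alerts that only ever fired for real problems (like `disk_full` here) drop out of the ranking, which keeps the list focused on what to fix first.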
One of our most important alerts is the success rate alert on our API. We were seeing periodic noise in the 99% and 95% threshold alerts, which, upon investigation, appeared to be triggering when nothing was wrong. To figure out what was happening, we dug deeper into the metrics data we were sending to Graphite:
Something was happening that was causing our success rate calculation to get super confused. Here’s how we calculated the success rate using Graphite functions:
Drilling down into the HAProxy metric data, we found that when this happened, we were missing data in Graphite:
Lesson Learned: If a Graphite-based alert is doing something unexpected, you can often find the root cause by drilling down into the data backing the metric’s calculation
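To illustrate why missing data can confuse a ratio-style calculation, here is a minimal Python sketch. It is not our actual Graphite expression; the function name, the sample numbers, and the choice to model a missing datapoint as `None` are all assumptions made for the example.

```python
def success_rate(successes, totals, treat_missing_as_zero):
    """Per-interval success rate in percent; None marks a missing datapoint."""
    rates = []
    for ok, total in zip(successes, totals):
        if ok is None or total is None:
            if not treat_missing_as_zero:
                rates.append(None)  # gap-aware: no data means no verdict
                continue
            ok = ok or 0            # naive: pretend the gap is a real zero
            total = total or 0
        rates.append(100.0 * ok / total if total else None)
    return rates

# Hypothetical series: the success counter is missing for the middle interval.
successes = [990, None, 995]
totals = [1000, 1000, 1000]

print(success_rate(successes, totals, treat_missing_as_zero=True))
print(success_rate(successes, totals, treat_missing_as_zero=False))
```

In the naive mode the middle interval looks like a total outage even though the requests may have succeeded, which is the kind of false alarm a threshold alert will happily page on.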
When you run an API platform, it’s not good enough to have the service up all the time; it also needs to serve requests with reasonable latency. To ensure that, we have alerts on request latency as well as success rate. Our first attempt at implementing latency checks involved consuming HAProxy ‘ttime’ statistics, since HAProxy is the last piece of our stack that processes a request before it returns to the client. Unfortunately, the statistics that HAProxy emits are aggregated across all requests, and our API has several endpoints, like our streaming endpoint, that intentionally take a long time to complete since the client holds the connection open to stream data. We developed a workaround by (1) putting a high threshold on the HAProxy latency check and (2) creating a second check that used latency measurements from the API application and excluded certain supposed-to-be-slow endpoints.
We were seeing noise with no clear cause in our latency alerts. Sometimes an alert meant “a database cluster is heavily loaded,” but more often it meant “who knows.” As a result, engineers learned to ignore the alerts.
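The second, application-level check can be sketched roughly as follows. The endpoint paths, the threshold, and the use of a simple mean are assumptions for illustration only; they are not the real Nylas endpoints or check definition.

```python
import statistics

# Hypothetical endpoints that hold the connection open on purpose.
SLOW_BY_DESIGN = frozenset({"/example-stream"})

def latency_alert_fires(samples, threshold_ms, excluded=SLOW_BY_DESIGN):
    """Return True if mean latency of non-excluded requests exceeds the threshold."""
    considered = [ms for endpoint, ms in samples if endpoint not in excluded]
    if not considered:
        return False  # nothing to judge, so do not page
    return statistics.mean(considered) > threshold_ms

# Hypothetical request samples: (endpoint, latency in milliseconds)
samples = [
    ("/example-messages", 120),
    ("/example-threads", 180),
    ("/example-stream", 60000),
]

print(latency_alert_fires(samples, threshold_ms=500))
print(latency_alert_fires(samples, threshold_ms=500, excluded=frozenset()))
```

With the exclusion in place the check stays quiet; without it, a single intentionally slow request is enough to push the mean over the threshold. The weakness of this design is that the exclusion list has to be kept up to date as new slow endpoints are added.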
On diving into an instance of the application-level check going off when nothing was wrong, we found an interesting pattern:
We drilled down in this data to the individual endpoints, pulled out the top five slowest ones, and immediately found that there was one slow endpoint which we were incorrectly including in the check: a longpolling endpoint that had been added to the API after the check was created.
Excluding this endpoint from the check input data as well, we found a much more reasonable looking data set which would no longer trigger the alert:
With the application-level alert fixed, we still faced noise from the HAProxy latency alert. We found that, depending on the breakdown of our API traffic amongst normal and supposed-to-be-slow endpoints, the alert would sometimes trigger when nothing was wrong—there were just more streaming or long-polling endpoint requests being made.
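A small worked example shows how sensitive an aggregate number is to the traffic mix. The latencies and traffic shares below are made-up values, chosen only to show the effect; the function name is hypothetical.

```python
def aggregate_mean_ms(slow_share, normal_ms=150.0, slow_ms=30000.0):
    """Mean latency across all requests, given the fraction that is slow by design."""
    return (1.0 - slow_share) * normal_ms + slow_share * slow_ms

# Normal endpoints are equally fast in both cases; only the mix changes.
for share in (0.01, 0.05):
    print(f"{share:.0%} slow-by-design traffic: {aggregate_mean_ms(share):.1f} ms mean")
```

Moving from 1% to 5% streaming traffic more than triples the aggregate mean in this example, even though no individual endpoint got slower.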
The only way to remove that noise would be to start generating Graphite metrics from the HAProxy logs (which contain per-endpoint data) rather than HAProxy’s summary statistics, which we decided was a yak that wasn’t worth shaving right now. We’ve never seen a problem that would have been detected by a latency alert at the HAProxy level and not at the application level, so we deleted the alert.
Lesson Learned: If an alert is paging engineers but not telling them about real problems, it’s better for that alert to not exist at all
We were already sending alerts to two Sensu PagerDuty handlers: one for critical alerts (like our API success rate) and another for less urgent alerts (like our staging servers being down). The non-critical alerts were configured not to page between 12am and 8am, using Sensu filters. That meant that engineers who went to sleep earlier would still be woken for non-critical issues, and, to make matters worse, some non-critical alerts were sent to the critical handler.
We decided to do a full review of all alerts to make sure they were going to the right place and changed our off hours for non-critical alerts to be 7pm to 9am PST to respect our engineers’ time and sleep—regardless of whether they are night owls or early risers.
Lesson Learned: Only wake up engineers for critical issues
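The routing rule itself is simple enough to sketch in Python. This is not Sensu filter syntax; the function names and severity labels are hypothetical, and the times are assumed to already be in the team’s local time zone.

```python
from datetime import time

QUIET_START = time(19, 0)  # 7pm local time
QUIET_END = time(9, 0)     # 9am local time

def in_quiet_hours(now, start=QUIET_START, end=QUIET_END):
    """True if `now` falls in [start, end), including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end

def should_page(severity, now):
    """Critical alerts always page; everything else waits until quiet hours end."""
    if severity == "critical":
        return True
    return not in_quiet_hours(now)

print(should_page("critical", time(3, 0)))
print(should_page("non-critical", time(3, 0)))
print(should_page("non-critical", time(12, 0)))
```

The wrap-around branch matters: a window from 7pm to 9am crosses midnight, so a plain `start <= now < end` comparison would never match.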
Paging Volume Results
After spending a few weeks laser-focused on the highest priority problems that were showing up in our alerts doc, we got our alerts down from a firehose to a trickle, with much higher signal:
Building happy engineering teams needs to be your top priority if you want to build great products. Through collective ownership, increasing trust, removing noise, and being bold with new ideas, you can begin to not only improve your practices but also allow new ideas to flourish organically. Let new eyes push you to fix issues, and take their fresh perspective not as criticism but as a catalyst for change.
After our strategic investments, our paging volume is down, service quality is up, and we’re better positioned to move even faster to make email suck less.
We’re extremely grateful to the folks at Honeycomb for their advice and support during this tough transition. Without the sage words of Charity Majors and Ben Hartshorne, it would have been a much less straightforward path. Several of the ideas shared here are theirs and we can take credit only for applying them to our specific context and passing the wisdom on for others to benefit.