A few months ago, we found that deploying to the Sync Engine (the powerhouse that allows the Nylas APIs to sync billions of emails, calendar events, and contacts into our sharded MySQL database) was adversely affecting database performance, which drastically slowed down our deploys and started setting off alerts. We conjectured that it was related to changes we had made to improve security for end-users by adding support for granular authentication scopes.
Shortly after Nylas’ inception in 2013, we started using Honeycomb.io to help diagnose performance degradations. Honeycomb provides relevant information about a query’s internals and breaks it down in different ways (more on that to come).
We enabled the MySQL slow query log, which records queries that exceed a configurable execution-time threshold. In addition, to home in on where the drag was happening in the database, we added custom instrumentation on the application side to track every single query from our sync engine and determine where it was coming from.
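For reference, the slow query log is enabled through a handful of MySQL server settings. The values below are illustrative, not the exact thresholds we used:

```ini
# my.cnf — illustrative values
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1      # seconds; queries slower than this are logged
log_output          = FILE
```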
Troubleshooting the Problem with Honeycomb
The first step in the troubleshooting process was delving deeper into the data we already had in Honeycomb, which was structured data from the MySQL slow query log. Honeycomb handily sorts the queries with the highest number of rows_examined at the top so it’s easy to tell at a glance which queries are causing the problems.
In the chart above, you can see that rows_examined is far larger than rows_sent: the ratio of rows scanned in the database to rows actually returned by the query was 2361:1.
We realized that just having the slow query log data from the database side wasn’t enough to determine the frequency, or where in our code the queries were coming from. This drove us to add custom instrumentation to our application using SQLAlchemy’s events. We were able to log every single query across our sync engine with the help of Honeycomb’s Python SDK to automatically sample the data.
Here’s an example of an event we sent to Honeycomb to track the queries coming from our sync engine. Note the ‘query_location’ attribute which proved essential in pointing us towards the cause of the problem.
{
    'comments': ' nylas/services/sync/events/google_poll_event_sync.py:45 nylas/services/sync/sync/base_sync.py:147 nylas/common/retry.py:77 nylas/services/sync/util/retry.py:157 nylas/services/sync/sync/base_sync.py:128 /Users/jonny/cloud-core/.venv/lib/python2.7/site-packages/gevent/greenlet.py:536 ',
    'params': "(datetime.datetime(2020, 3, 17, 17, 56, 11, 138986), 'REDACTED', 562949953421313, 'REDACTED', 562949953421338, 'REDACTED', 'REDACTED')",
    'query': 'INSERT INTO transaction (created_at, public_id, namespace_id, object_type, record_id, object_public_id, command) VALUES (%s, %s, %s, %s, %s, %s, %s)',
    'query_location': 'nylas/services/sync/contacts/remote_sync.py:219 nylas/services/sync/contacts/remote_sync.py:282 nylas/services/sync/sync/base_sync.py:147 nylas/common/retry.py:77 nylas/services/sync/util/retry.py:157 nylas/services/sync/sync/base_sync.py:128',
}
This is what the results of the slow query log looked like:
This visualization helped us see a few things right off the bat. One look at the ratio of ‘rows_examined’ to ‘count’ revealed the extraordinarily high number of rows being examined per query. This helped point us towards the root of the problem.
The Problem: Inefficient MySQL Queries
Taking a look at the ‘rows_examined’ field, we saw that over 180 billion rows were scanned over a two-hour span. Why were so many rows being read? The answer was that these queries were coming from our GoogleHistoricalMailSyncer.
As the name indicates, the GoogleHistoricalMailSyncer is responsible for fetching all the historical mail for a Gmail account once it has authenticated. In practice, most of our customers care mainly about recent emails, but because this syncer ran on every restart, all the historical messages for an account were being refreshed as well. The query only needed four columns: id, is_draft, namespace_id, and g_msgid; but because there was no covering index, MySQL had to read the whole row. So even though the query needed just 25 bytes (the total length of the four columns), it read the entire row (average length: 9KB). As a result, 99.7% of the reads resulting from this query were wasted.
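A quick back-of-the-envelope check of that wasted-read figure, using the numbers above:

```python
# Numbers from the post: 25 bytes needed vs. ~9KB average row length
bytes_needed = 25          # id + is_draft + namespace_id + g_msgid
avg_row_bytes = 9 * 1024   # average row length of ~9KB

wasted = 1 - bytes_needed / avg_row_bytes
print(round(wasted * 100, 1))  # ~99.7% of each row read is thrown away
```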
The Solution: Implementing a SyncProgress Tracker
Some of the solutions we brainstormed involved reordering the logic or creating a monotonic ID for messages, but as is so often the case, the simplest solution was the most effective. We pivoted to tracking whether GoogleHistoricalMailSyncer had finished during an account’s lifetime using a SyncProgress tracker: on each restart we check the sync status recorded in SyncProgress, continue syncing if it isn’t complete, and skip the historical sync entirely if it is. To verify the fix, we monitored logs for sync starting, sync ending, and the slow query dataset to make sure it wasn’t spiking.
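The gating logic amounts to a completed-flag check before the syncer runs. Here is a minimal sketch; in production the state lives on a SyncProgress row per account, whereas a dict stands in for the database here, and the function names are ours, not Nylas internals:

```python
from enum import Enum

class SyncStatus(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"

# Stand-in for the per-account SyncProgress row (hypothetical storage).
_sync_progress = {}

def should_run_historical_sync(account_id):
    """GoogleHistoricalMailSyncer runs only if it never completed for this account."""
    return _sync_progress.get(account_id) != SyncStatus.COMPLETED

def mark_historical_sync_complete(account_id):
    # Called once the historical backfill finishes, so restarts skip it.
    _sync_progress[account_id] = SyncStatus.COMPLETED
```

With this in place, a restart no longer re-triggers the full historical scan for accounts that already finished it.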
We saw that the query distribution had quick-resolving spikes:
The Final Result: We Reduced Our Query Volume by 50%
Although the problem and solution seem very straightforward, it actually took weeks of identifying and fixing dozens of seemingly problematic queries before we finally homed in on the culprit. We had to analyze the data in many different ways before we knew exactly which columns and ratios of data points related to load on the database.
Being able to quickly run aggregates on different fields with different time frames in Honeycomb, then sharing links to pre-configured queries with other team members was crucial to solving this problem.
When we finally tied a bow on this issue, we had cut our query volume by almost 50% and reduced the load on our databases by even more.