As we monitored the API calls, we saw that connection times between client and server were quite lengthy, especially when it came to connections outside the United States. The multi-step SSL/TLS handshake—in which the client and server validate the identity of the other party and generate a common secret key for encryption of data and prevention of forgery—was causing a large amount of latency (around 1,000 ms or even more).
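To see how much of a call's latency the handshake accounts for, you can time the TCP connect and the TLS handshake separately. Here is a minimal sketch using only Python's standard library; the hostname you point it at is up to you:

```python
import socket
import ssl
import time

def tls_handshake_ms(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Time the TCP connect and the TLS handshake separately, in milliseconds."""
    ctx = ssl.create_default_context()
    t0 = time.perf_counter()
    # TCP three-way handshake
    with socket.create_connection((host, port), timeout=timeout) as raw:
        t1 = time.perf_counter()
        # TLS handshake (certificate validation + key agreement)
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            t2 = time.perf_counter()
            protocol = tls.version()  # e.g. 'TLSv1.3'
    return {
        "tcp_ms": (t1 - t0) * 1000.0,
        "tls_ms": (t2 - t1) * 1000.0,
        "protocol": protocol,
    }
```

Running this from a distant region against your API's hostname makes the handshake cost visible immediately: the TLS portion is often a multiple of the TCP connect time, because the handshake adds extra roundtrips on top of the TCP connect.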
To solve this issue, we first had to:
Establish our target response time metrics
Uncover the cause of the latency
Implement the solution & monitor
Let’s dive in.
Using APImetrics to Monitor API Response Times
To benchmark our API, we use a tool called APImetrics. It helps our team stay on top of the health of the API and provides us with the reporting and alerting necessary to analyze and continually improve our performance. APImetrics allows us to create and run real API calls, just like our customers do, and gives us detailed results of those calls with a breakdown of, among other things, handshake and processing time.
The APImetrics dashboard provides an easy way to access data about our endpoints in the form of reports that compile information such as the region from which the call is being made and the frequency per hour, per location. When focusing on a particular subset of endpoints, we can easily create a custom report limited to those endpoints.
We knew that we wanted our average API response time across all endpoints to be under 100ms. Through consistent monitoring, we were able to ascertain that our metrics from the West Coast were up to par, but internationally, there was room for improvement.
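The kind of check we ran against that 100ms budget can be sketched as a simple comparison of mean response times per region. The function and the numbers below are illustrative only, not APImetrics output:

```python
def regions_over_budget(samples_by_region: dict, budget_ms: float = 100.0) -> list:
    """Return the regions whose mean response time (ms) exceeds the budget."""
    return sorted(
        region
        for region, samples in samples_by_region.items()
        if samples and sum(samples) / len(samples) > budget_ms
    )

# Illustrative numbers only: West Coast within budget, international calls over it.
timings = {
    "us-west": [42.0, 55.0, 61.0],
    "eu-west": [820.0, 1100.0, 950.0],
    "ap-south": [700.0, 1300.0],
}
print(regions_over_budget(timings))  # ['ap-south', 'eu-west']
```

Feeding per-region samples into a check like this turns "room for improvement internationally" into a concrete list of regions to fix.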
Reducing Latency with a Faster TLS Handshake
With this information in hand, we began researching ways to decrease the connection time latency. Big picture-wise, we knew our team would be setting up an EU data center, but in the interim, we wanted to find an efficient way to get our API durations down. Cloudflare, one of the biggest networks operating on the Internet, presented itself as a promising solution. The sheer size of Cloudflare’s Content Delivery Network (CDN), with data centers in 200 cities across the world, means it can leverage this geographic distribution in a number of ways.
As a globally distributed network, a CDN reduces the geographical distance between users and website resources and improves the speed with which users receive data. It reduces the number of roundtrips required and speeds up the SSL/TLS process by optimizing connection reuse and enabling TLS False Start. As RFC 7918 delineates, TLS False Start allows the client to start sending application data when the full handshake is only partially complete, if certain conditions are met, thus reducing latency.
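Connection reuse is easy to observe from the client side: the first request on a connection pays the TCP connect (and, over HTTPS, the TLS handshake), while subsequent requests on the same open connection skip both. A minimal sketch with Python's standard library:

```python
import http.client
import time

def timed_get_ms(conn: http.client.HTTPConnection, path: str = "/") -> float:
    """Issue a GET on an existing connection and return the elapsed time in ms."""
    start = time.perf_counter()
    conn.request("GET", path)
    conn.getresponse().read()  # drain the body so the connection can be reused
    return (time.perf_counter() - start) * 1000.0
```

Calling this twice on the same `HTTPSConnection` against a keep-alive server typically shows the first request taking markedly longer than the second, since only the first pays the connection setup cost; False Start shaves a roundtrip off that first request's handshake.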
Improving Dashboard Speed
Users of the Nylas dashboard will be happy to note that this reduction in latency is apparent in the dashboard as well. It can be seen when carrying out common tasks such as updating your billing information; pulling up authentication, API, Mailsync, Syncback, and Webhook logs; and canceling accounts.
We were surprised to discover that SSL/TLS connection negotiation and termination were contributing significantly to our API latency. That discovery came out of our investigation into the discrepancy between our target times and the times we were observing in APImetrics for international calls to our API. With APImetrics, we were able to monitor the metrics of our API calls, collect them in one place, and examine the data presented in its dashboard to gain insight into our performance. Analyzing the issues and researching possible solutions led us to Cloudflare, a drop-in replacement that improved our API and dashboard speed by a factor of two. Other developers building APIs where response time matters would do well to consider APImetrics and Cloudflare to improve their performance.