Reliability is the foundation of trust for every API platform. At Nylas, we take that trust seriously, not just by claiming high uptime but by holding ourselves to one of the strictest reliability standards in the industry.
Developers building on APIs depend on predictable performance. A single failed call can break entire workflows. That’s why reliability isn’t just a metric at Nylas; it’s a core engineering principle that guides how we design, deploy, and measure success.
While many companies define their SLA in terms of uptime (whether their systems are technically reachable), Nylas measures it by API success rate:
Total successful API calls ÷ total API calls.
This approach captures the actual customer experience. If a request fails, it counts — no matter how short or isolated the issue. It’s a tougher standard, but one that keeps us focused on real reliability instead of surface-level uptime metrics.
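To make the definition concrete, here is a minimal sketch (illustrative only, not Nylas production code) of how a success-rate SLA could be computed from raw request counts:

```python
# Illustrative sketch: the SLA measured as success rate rather than uptime.
def api_success_rate(successful_calls: int, total_calls: int) -> float:
    """Return the success rate as a percentage of all API calls."""
    if total_calls == 0:
        return 100.0  # no traffic means no observed failures
    return 100.0 * successful_calls / total_calls

# Example: 9,999,100 successes out of 10,000,000 calls -> 99.991%
print(f"{api_success_rate(9_999_100, 10_000_000):.3f}%")
```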
Kubernetes excels at running stateless workloads, but we found that certain components, especially our databases and primary API gateway, perform better on dedicated infrastructure. Moving these services off Kubernetes gave us tighter control over performance, latency, and failure recovery.
When we migrated our API gateway and all databases to dedicated compute, we saw a 12% reduction in average request latency and significantly reduced load on CoreDNS/KubeDNS. Stateless workloads continue to thrive in Kubernetes, but high-throughput components benefit from the predictability of dedicated infrastructure where we control every variable, including failover timing and I/O profiles.
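As a rough illustration of how a migration like this might be evaluated, the sketch below compares mean and p99 latency across two sets of sampled request timings. The numbers and helpers are hypothetical, not actual Nylas measurements:

```python
# Illustrative sketch: comparing sampled gateway latencies before and after a
# move to dedicated infrastructure. All figures below are made up.
import statistics

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    index = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[index]

before_ms = [42.0, 45.1, 51.3, 48.7, 120.4, 44.9]  # hypothetical samples on Kubernetes
after_ms = [37.2, 39.8, 44.1, 42.5, 98.6, 40.0]    # hypothetical samples on dedicated compute

for label, samples in (("before", before_ms), ("after", after_ms)):
    print(label, f"mean={statistics.mean(samples):.1f}ms", f"p99={percentile(samples, 99):.1f}ms")
```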
Every new release goes through an automated canary phase before full rollout. We direct a small percentage of live traffic, ramping in stages from 5% to 50%, to the new version and monitor API success rates in real time.
We compare every metric against the previous release. If the new build shows even a 0.01% regression in success rate, the deployment halts automatically and rolls back within minutes. This measurable guardrail closes the feedback loop between code and reliability, ensuring that every change improves the developer experience rather than degrading it.
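A simplified sketch of that kind of canary gate might look like the following. The class and threshold names are illustrative, not the actual Nylas deployment tooling:

```python
# Illustrative canary gate: compare the canary's API success rate against the
# current release and signal a rollback if the regression exceeds 0.01%.
from dataclasses import dataclass

REGRESSION_THRESHOLD = 0.01  # percentage points, per the policy described above

@dataclass
class ReleaseStats:
    successful_calls: int
    total_calls: int

    @property
    def success_rate(self) -> float:
        return 100.0 * self.successful_calls / max(self.total_calls, 1)

def should_roll_back(baseline: ReleaseStats, canary: ReleaseStats) -> bool:
    """Return True when the canary regresses beyond the allowed threshold."""
    return (baseline.success_rate - canary.success_rate) > REGRESSION_THRESHOLD

# Example: baseline at 99.995% vs canary at 99.980% -> roll back
baseline = ReleaseStats(successful_calls=9_999_500, total_calls=10_000_000)
canary = ReleaseStats(successful_calls=99_980, total_calls=100_000)
print(should_roll_back(baseline, canary))  # True
```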
We continuously simulate real-world failures across our databases and API services to ensure the platform self-heals. Any single node or subsystem can be brought down and replaced without customer impact.
Our chaos testing program runs multiple times per week, injecting controlled failures such as database node loss, API rate-limit spikes, and regional disruptions. These drills have surfaced hidden dependencies early, allowing us to harden retry logic, add regional redundancy, and verify that all critical paths recover without manual intervention. The result is a platform designed to remain stable even when individual components fail.
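As one example of the kind of retry logic such drills exercise, here is a minimal, hypothetical sketch of a retry helper with exponential backoff and jitter (not Nylas's actual implementation):

```python
# Illustrative retry helper: exponential backoff with jitter around any
# callable that may fail transiently.
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=2.0):
    """Invoke `operation`, retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids retry storms

# Usage (hypothetical client): call_with_retries(lambda: client.get("/v3/messages"))
```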
Through this disciplined engineering approach, Nylas evolved from 99.9% to 99.99% API reliability, a tenfold reduction in allowed downtime and a major milestone in customer trust.
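A quick back-of-the-envelope calculation shows what that tenfold tightening means in concrete terms (figures are illustrative, assuming a 30-day month):

```python
# Error budgets implied by the two targets: moving from 99.9% to 99.99%
# cuts the allowed failure budget tenfold.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

for target in (99.9, 99.99):
    budget_fraction = 1 - target / 100
    print(f"{target}% -> {budget_fraction:.2%} of calls may fail, "
          f"~{budget_fraction * MINUTES_PER_MONTH:.1f} min of downtime per month")
# 99.9%  -> 0.10% of calls may fail, ~43.2 min of downtime per month
# 99.99% -> 0.01% of calls may fail, ~4.3 min of downtime per month
```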
In practice, that means fewer failed requests, faster recovery from incidents, and a more consistent, developer-first experience across billions of API calls each month. For our customers, reliability isn’t an invisible feature; it’s a reason they can build confidently on our platform.
Our journey continues, but one thing is clear: when you measure what truly matters—successful outcomes for every API call—the right architectural choices naturally follow.
Reliability is never ‘done.’ It’s an ongoing commitment to the developers who build on us every day.

Interested in building on a platform that prioritizes reliability? Explore the Nylas API and see how we help developers ship faster with confidence.
Director, Site Reliability Engineering