Autoscaling With AWS
A Site Reliability Engineer’s deep-dive into imminent failures and how to avoid (or reduce) them with autoscaling.
Brady Wetherington | May 30, 2019
Early on when I started at Nylas, the one thing I kept finding myself repeating was the word “autoscalers.” I felt like I was repeating it so much that it started not even sounding like a real word. Autoscalers, autoscalers, ‘atuocslraes’ [sic].
But let me take a step back. What do we mean by an autoscaler? And why do we like them so much?
Autoscalers (sometimes referred to as an Auto Scaling Group, a Managed Instance Group, an Amazon EC2 Fleet, or whatever you want to call them) help protect your services from imminent failures. When you engineer systems to be hosted in a dynamic, ephemeral environment like Amazon Web Services (AWS, our provider of choice), you need to be prepared for systems to occasionally fail: crash, abnormally terminate, get stuck, or whatever you like to call your potential outages. If you don’t plan for that worst-case scenario, you’re going to have a very bad time.
Autoscalers are an ingenious method for handling these errors. You define a ‘fleet’ of servers at a particular size, and the Autoscaling Group makes sure that that number of servers is always running. If a server stops being able to respond to requests — e.g., if it fails “health checks” — it should be terminated, and a new one will be brought up in its place.
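To make that loop concrete, here is a minimal sketch in Python of what "keep the fleet at a fixed size" boils down to. It is a toy simulation, not how AWS implements it; the Server class and the reconcile function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical instance-id generator, for illustration only


@dataclass
class Server:
    healthy: bool = True
    id: int = field(default_factory=lambda: next(_ids))


def reconcile(fleet, desired_size):
    """Return a new fleet: drop unhealthy servers, then launch replacements."""
    # Anything failing its health check is terminated.
    survivors = [s for s in fleet if s.healthy]
    # If the fleet is larger than desired, keep only the first desired_size.
    survivors = survivors[:desired_size]
    # Launch fresh servers until we are back at the desired size.
    while len(survivors) < desired_size:
        survivors.append(Server())
    return survivors
```

Run reconcile on a schedule and the fleet converges back to its desired size no matter which servers died in between.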
Even better, you can start to define autoscaling ‘triggers’ that make your fleet grow larger or smaller based on load being high or low. This is one of the cornerstones of building a resilient, reliable, scalable service.
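As a rough illustration of a load-based trigger, the sketch below sizes the fleet in proportion to how far average CPU is from a target. This is a simplified proportional rule, not AWS's actual scaling algorithm, and the function name, the 50% target, and the size limits are all assumptions chosen for the example.

```python
import math


def desired_capacity(current, avg_cpu, target_cpu=50.0, min_size=2, max_size=20):
    """Pick a fleet size that would bring average CPU back toward target_cpu."""
    if current <= 0:
        return min_size
    # If CPU is twice the target, we want roughly twice the servers.
    wanted = math.ceil(current * avg_cpu / target_cpu)
    # Never go below the floor or above the ceiling.
    return max(min_size, min(max_size, wanted))
```

The clamp matters as much as the formula: the floor keeps a quiet night from shrinking you to nothing, and the ceiling keeps a runaway metric from running up the bill.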
Autoscaling Embraces the Inevitability of Failure
Autoscaling is really the true cloud-native way to architect services. Autoscaling embraces the inevitability of failure. It’s really just a numbers game: when you have a handful of servers, you can make a bet that none of them will ever go down, and that will often work out. You can even take great pains to try to ensure that they don’t go down. But when you start to get to the scale of hundreds of servers, it becomes harder and harder to make that bet a smart one.
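The numbers game is easy to check for yourself. The sketch below assumes each server fails independently with the same probability over some period; the 1% figure is a made-up number for illustration, not a measured failure rate.

```python
def p_any_failure(n_servers, p_single):
    """Probability that at least one of n independent servers fails."""
    return 1.0 - (1.0 - p_single) ** n_servers


# Hypothetical 1% chance that any given server fails in a given month.
small_fleet = p_any_failure(5, 0.01)
large_fleet = p_any_failure(300, 0.01)
```

Under those assumptions a five-server fleet sees a failure in roughly 5% of months, while a 300-server fleet sees one in roughly 95% of them. The bet that "nothing goes down" stops being a bet and becomes a losing position.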
Even through no fault of your own, failures will happen. Amazon has been known to take down servers on a whim. But if you instead start from the perspective of assuming that something bad can and will happen, and then plan for it — rather than trying to prevent anything bad from ever happening — it can be an enormously powerful and liberating way to build things. That’s what’s great about autoscalers: they work specifically from that perspective.
Implementing Autoscaling: The Devil is in the Details
Usually, the first service to look at for autoscaling is HTTP(S)-based API endpoints. These are good candidates because in most cases each API server is completely separate from every other, and it’s usually pretty easy to determine good scaling metrics (CPU usage is often a great start).
The trick about autoscalers is always the same: it’s never the initial autoscaler setup that’s hard. Hell, I had a decent configuration for our API up in staging in less than a month. No, the real problem is literally everything else.
How well does your service handle having a server stop in the middle of a request? How well can you handle having a new server booting at literally the worst-possible time you can imagine? Or having an old, ailing server terminate at the worst-possible time? How can you ensure that the latest configurations are always propagated to these newly booted servers? What are the right metrics to use to scale up? What are the right metrics you need to scale down? How can you determine that an instance is ‘healthy’? How deep should your health checks go — should they go ‘deep’ throughout your entire stack to ensure that everything is really working — or should they be ‘shallow’ and just check the layer that they’re operating at? And of course, the answer to these questions, like most things in technology, is the same: “Well, it depends…”
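The shallow-versus-deep question can be sketched in a few lines of Python. The function names and the dependency probes here are hypothetical; a real check would sit behind whatever endpoint your load balancer or autoscaler polls.

```python
from typing import Callable, Dict


def shallow_check() -> bool:
    """Only answers: is this process up and able to respond at all?"""
    return True


def deep_check(dependencies: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Probe every dependency; a probe that raises counts as unhealthy."""
    results = {}
    for name, probe in dependencies.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return results


def is_healthy(results: Dict[str, bool]) -> bool:
    """Healthy only if every dependency probe passed."""
    return all(results.values())
```

The trade-off shows up quickly: the deep check catches a server that is up but useless, but if a shared dependency blips, every server fails at once and the autoscaler may try to replace the whole fleet. Which one you want really does depend.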
When configuring autoscalers, the devil is in the details. At Nylas, for example, we can add servers pretty easily — but removing them can be a real problem, especially when they’re servicing long-lived connections. We don’t want to give our customers error messages when we decide to terminate a server! So we had to invest some development time in making it so that we could gracefully close those connections without sending the customer an error message.
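A minimal sketch of that kind of graceful drain, in Python, might look like the following. It is not our actual implementation; the ConnectionDrainer class and its method names are hypothetical, and how the shutdown notice reaches the process is left out.

```python
import time


class ConnectionDrainer:
    """Track open connections and drain them before shutdown."""

    def __init__(self):
        self.accepting = True
        self.active = set()

    def open(self, conn_id):
        if not self.accepting:
            return False  # caller should send the client to another server
        self.active.add(conn_id)
        return True

    def close(self, conn_id):
        self.active.discard(conn_id)

    def drain(self, timeout_s, poll_s=0.05, sleep=time.sleep, clock=time.monotonic):
        """Stop taking new connections, then wait for existing ones to end.

        Returns True if everything closed before the deadline.
        """
        self.accepting = False
        deadline = clock() + timeout_s
        while self.active and clock() < deadline:
            sleep(poll_s)
        return not self.active
```

If drain returns False, the deadline passed with connections still open, and you are back on the abrupt path for whatever is left.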
These are the kinds of concerns you have to think about in a world with autoscalers. Imagine the worst possible time for a server to boot. Or terminate. And then assume that that’s the time when your servers are going to start or stop. Then plan for that.
There Are Two Paths for Autoscaling
Sometimes you can be tempted to try to build two paths. One is the “nice path”: an autoscaler terminates your instance gracefully, and you have time to carefully shut down services, hand off connections, let old connections finish cleanly, and so on. The other is the “rude path,” i.e., when your server has just stopped. For seemingly no good reason. It’s not even responsive any more. Even if you wanted to do something gracefully, you can’t.
I always recommend you handle the rude path first. If you do that, usually you’ll find that you don’t even need to handle the nice path at all — just assume that all instance terminations are abnormal and sudden, and if you can handle that well enough, then there’s no reason to build out any kind of separate code path to handle the ‘nice’ way. For reasons mentioned before, that approach wouldn’t work out for us. But it usually does work out like that in most other environments.
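One common way to survive the rude path is to have callers assume any server can vanish mid-request and simply try another one. The sketch below is an illustration under that assumption, not something described above as our approach; call_with_failover and the send callback are hypothetical names, and blind retries like this are only safe for idempotent requests.

```python
def call_with_failover(servers, request, send):
    """Try each server in turn; treat a dropped connection as routine."""
    last_error = None
    for server in servers:
        try:
            return send(server, request)
        except ConnectionError as err:
            # The server died or was terminated mid-request: move on.
            last_error = err
    raise RuntimeError("all servers failed") from last_error
```

If this works well enough, a graceful termination is just a special case of the server disappearing, and no second code path is needed.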
But that’s not even the hardest part. The hardest part has to do with configuration management. Nylas has an enormously complex and enormously powerful clustered database setup. Using MySQL, ProxySQL, and other software, we’re able to provide an extremely robust, resilient, high-performance database service. But what’s most important about our database service is how we can respond to load. We can make changes and propagate them throughout our entire fleet so that every server knows about every database.
That’s a great deal of operational flexibility for our database administrators (DBAs), and allows us to rapidly respond to changing load requirements. However, it brings up an issue we first noticed in our staging environment: what happens if we’re in the middle of dealing with a failover, or relocating database shards, or any other kind of operational database change, and suddenly a new server has to be brought online?
Trying to manually ship out configuration changes in the middle of an autoscaling event would be a real non-starter for us. So, instead, we built a solution using ProxySQL’s Clustering feature, which enables us to have one single grand unified view of our database topology. And this means that, just like we want, a server can boot or shut down at any time and we’ll still have a single source of truth about how our database environment is currently operating.
Life After Autoscaling
So what’s next? Once our API autoscaler is live, we’ll move on to making our actual sync fleet autoscaled. We already have great ideas on when we will want to scale our fleet up and down, so we’ll just need to ensure that the sync fleet can read the same ProxySQL clustering configuration that our API fleet works with.
There’s still room for us to add more autoscalers, and I’m sure we will. Sometimes we’ll have a one-off server that provides a useful service — if there’s a way to autoscale that, even if the autoscaler is only ever at size ‘1’ — we should do it to help ensure that single instance failures can’t take us down.
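A size-1 autoscaler is just a group whose minimum, maximum, and desired capacity are all 1. The sketch below builds the arguments for boto3's create_auto_scaling_group call without actually calling AWS; the helper name and the example group, template, and subnet names are hypothetical.

```python
def singleton_group_kwargs(name, launch_template_name, subnet_ids):
    """Build arguments for an Auto Scaling Group pinned at exactly one instance."""
    return {
        "AutoScalingGroupName": name,
        # Min = max = desired = 1: never scale, only replace on failure.
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        # boto3 expects subnets as one comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "HealthCheckType": "EC2",
    }


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(
#       **singleton_group_kwargs("one-off-tool", "one-off-template", ["subnet-a", "subnet-b"]))
```

Listing more than one subnet lets the replacement come up somewhere else if the original location is the thing that failed.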
When Not to Autoscale
The things that are hard to autoscale tend to be anything that stores data, like databases. In some cases there are workarounds to handle this, but these tend to be poor fits. Better to treat your databases as “snowflake” servers and handle them carefully, as individualized servers, each with their own precious data on them.
So, not everything fits in an autoscaler. And things that don’t fit well in an autoscaler shouldn’t be put in them. (We even tried to put our ProxySQL Cluster array into an ASG, but were told in no uncertain terms that we needed to… not do that. So we don’t any more.) But if you can fit something in there, you should. I’d even recommend doing it before you have to. You learn how to deal with failure before you are forced to, and can pick and choose your battles slowly and steadily. You quickly learn how to deal with systems that really are truly ephemeral. That can unlock approaches that can be novel and simple, though also coarse and brute-force-ey.
Server not working right? Don’t bother fixing it, just let it get terminated and replaced on its own. Want to deploy new code, or new configurations? Just spin up new boxes with your new stuff already on them, and get rid of the old ones. Just set your Termination Policy for your autoscaler to OldestLaunchTemplate, OldestLaunchConfiguration, OldestInstance, then resize it, and you’re pretty much good to go!
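Here is one way that deploy could be sketched with boto3's update_auto_scaling_group call. The termination policy names are the ones mentioned above; the helper name, and the choice to double the fleet and then shrink it back, are assumptions for illustration, since "resize it" could be done several ways.

```python
TERMINATION_POLICIES = [
    "OldestLaunchTemplate",
    "OldestLaunchConfiguration",
    "OldestInstance",
]


def rollout_steps(group_name, size, max_size):
    """Return two update calls: surge to double size, then shrink back."""
    surge = size * 2  # assumption: doubling is one simple way to resize
    scale_up = {
        "AutoScalingGroupName": group_name,
        "TerminationPolicies": list(TERMINATION_POLICIES),
        "MaxSize": max(max_size, surge),
        "DesiredCapacity": surge,
    }
    # Shrinking terminates the oldest instances first, leaving the new ones.
    scale_down = {
        "AutoScalingGroupName": group_name,
        "MaxSize": max_size,
        "DesiredCapacity": size,
    }
    return [scale_up, scale_down]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   client = boto3.client("autoscaling")
#   up, down = rollout_steps("api-fleet", size=4, max_size=6)
#   client.update_auto_scaling_group(**up)
#   ... wait until the new instances pass health checks ...
#   client.update_auto_scaling_group(**down)
```

The wait between the two calls is the part you cannot skip: shrink too early and you terminate old servers before the new ones can take traffic.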
Opening up your infrastructure to the types of approaches unlocked by autoscalers can let your tech folks focus on the real guts of what you’re doing, not running around spinning up and down servers all the time. That’s definitely worth the investment.