Bringing Structure to Infrastructure
How our growing SRE team learned to coordinate, plan, and go beyond reacting to outages.
Breanne Boland | June 19, 2019
Interested in joining Nylas engineering? We’re currently hiring for our TechOps team (and many other roles). Check out our career page!
I joined the Nylas TechOps team in November 2018 as an SRE. The team’s rapid expansion kept pace with engineering’s growth overall, and it meant that I arrived at an exciting time, and also a time when we needed to figure out how we wanted to function as a team with a newly expanded capacity to plan and look to the future.
TechOps maintains the infrastructure for the Nylas API, creates internal tools for use by us and engineering at large, addresses security issues, and works with our outsourced DBA and security teams. We’re in the process of moving control of our AWS infrastructure to Terraform, and we’re a Python shop (with an occasional taste for Bash, when we’re feeling spicy). We like to make our infrastructure as resilient and automated as possible. There are a lot of possibilities for someone (like me, maybe like you too) who really wants to dig deep into some classic SRE challenges using modern tools.
Our team quadrupled in size in just six months. When a team is one person, you don’t really need sprint planning; furthermore, since the job is largely reacting to needs with zero delegation, extensive documentation might feel superfluous. When you have two people, it’s pretty easy to keep each other up to date via Slack or daily Hangouts, so a project management board might still feel unnecessary. All very reasonable, right?
Reasonable, and also untenable when the team expands to three and then four and beyond. It takes a metric butt-ton of Slack chat (that’s straight from the ISO standards for quantifying communication) to replicate the easy syncing that comes from Phab, Trello, or even Jira (I know), when accompanied by a regular and deliberate cadence of meetings. And when your team is four people or more, you suddenly have the ability to look ahead and plan instead of just reacting to the next alert or engineering request.
When our team grew, it became clear that our fellow engineers didn’t know when to ask us questions, what to expect from us, or how to most effectively reach us — meaning that interactions with us were informed by a mix of personal history, old habits, and winging it in the #techops Slack channel. Not ideal.
We needed structure. With the blessing of my team (who checked with me regularly to ensure I was ok doing this glue work and made sure I knew — regularly — that it was appreciated), I started building processes to help our team work more effectively. First step: documentation.
Documentation: Part Description, Part Conjuring
I started by making a statement of purpose of sorts, with the goal of expanding our pretty basic team description in the Nylas engineering team doc. I set out to answer these questions:
- Do we want to teach people to make their own AWS resources, or are they supposed to ask us to make things for them?
- How do we want people to communicate with us?
- How can we set expectations for what will happen when someone reaches out to us on Slack?
- What do we own? What do we explicitly not own?
- What’s a small ask? What’s a big ask? How does the resulting process differ for each?
In doing this, I talked to my team at length. I talked to multiple teams at Nylas: the platform team (whose work is often adjacent to ours), the developer success team (a team of developers on the front line of customer issues), and our VP of engineering. I watched for questions in Slack and kept tabs on common complaints from across engineering.
In the course of answering these questions internally (because some of them had never been asked), I found that I was working with people who genuinely enjoyed helping their colleagues finish their work and learn something new while they did it. However, in the team’s long-time incarnation as firefighters, this capacity hadn’t been fully explored or understood. In writing our team description, I narrowed down what we want to do, what we don’t want to do, and how we want to help our colleagues. And when others were able to finally understand that, they sought us out more. Win/win! We ended up describing ourselves like this:
Why this team exists
To build, maintain, and troubleshoot systems infrastructure, networking, internal scripts and tooling, data stores, and automation, ensuring smooth and secure operations. We educate and guide other teams to responsibly use all of these resources. We build tools to enable Nylas engineers to complete their work efficiently and effectively.
Done and done. Our section of the doc also explained our areas of responsibility, how to reach us, and how we measured success. It was less than a page long, but it made a tremendous difference in perception and understanding. Now that people had a better idea of what we did, though, they were better able to ask us for things, which made it very clear that we needed a place to put those more complex tasks from other teams. Enter: PM software.
If It’s Not Written Down, It’s Not Real
Planning work on the team had been, up until now, something that evolved through many conversations across time, with inconsistency in format, in the resulting documentation, and in other nice-to-haves. This became harder to work with in our expanded team.
We initially tried using the GitHub project board. It may have evolved since our late-2018 experiment, but the version of it we used didn’t have the granularity we needed, and it was tied too closely to the existence of GitHub issues, while we use Phabricator.
After a couple of unhappy weeks, we moved to Phab’s project workboards. It isn’t a perfect fit, but it’s close enough for us. We can point, we can move items between columns to indicate where they are in our process, and we can tag people and leave updates. We don’t define rigid sprints (because a significant amount of our work will always be reacting to engineer needs), but it’s easy to tell what the next week will be like, barring any emergencies.
Our pointing process, as of this writing, is a little ad hoc, happening weekly (ish) or as needed, shaping the backlog as our priorities shift. In our section of the Nylas engineering team doc, we invite folks to drop tasks into the inbox if needed, but mostly the team creates new tasks based on conversations we have with other Nylas engineers and among ourselves.
How those conversations happen has shifted too, though.
Just @ Us Next Time 👋
When I started at Nylas, our expectations for how people would reach out to us were friendly, well-intended, and generally uncommunicated. It’s kind of like saying “Just let me know how I can help” to someone in an emergency — it’s done with the best of intentions but isn’t actually very helpful. We had a #techops channel, which had a good amount of activity, but it was unfocused and reflected that people weren’t really sure if the channel was for us, for them, or for something else entirely.
In the team description doc I mentioned, we specified a few things:
- We will meet you however you want to reach out to us.
- We prefer public channel discussion to DMs, to better socialize information and get more input.
- A point person is specified in the channel topic, but reaching out to any of us is ok, so long as you’re fine with a wait.
In time, we added an @techopsen alias, so the team could be summoned from other channels too.
More recently, we refined the “point person” idea, as we tended to forget to rotate names until two or more weeks had passed. It wasn’t a bottleneck exactly, but it’s also the opposite of the consistent automation we’re trying to cultivate. Recently, one of my teammates implemented a quiet PagerDuty rotation that names a point person each week, which doesn’t have alerts tied to it (yet) but is linked in the channel topic.
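For readers who want the same "nobody has to remember to rotate" effect without a paging tool, here is a minimal Python sketch of a weekly point-person rotation. It is not the PagerDuty setup described above and does not call PagerDuty’s API. The roster names, the anchor date, and the function names are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical roster; placeholder names, not the actual team.
ROSTER = ["alex", "blake", "casey", "devon"]

# Assumed anchor: a Monday on which the first person in ROSTER is on point.
ROTATION_START = date(2019, 1, 7)


def point_person(today, roster=ROSTER, start=ROTATION_START):
    """Return whoever is on point for the week containing `today`."""
    if today < start:
        raise ValueError("date is before the rotation started")
    # Whole weeks since the anchor Monday decide whose turn it is.
    weeks_elapsed = (today - start).days // 7
    return roster[weeks_elapsed % len(roster)]


def channel_topic(today):
    """Build a topic string that could be pasted into a channel topic."""
    return f"TechOps point person this week: @{point_person(today)}"
```

Counting whole weeks from a fixed anchor date, rather than using the calendar week number, keeps the order stable across year boundaries.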
To further open up the lines of communication, we created a second channel: #techops-internal. This is a public channel too, but it’s just for team business, which we’d been putting into DMs in the interest of not cluttering a channel we all intuitively felt was better used for working with other teams. I’ve generally experienced overuse of DMs for routine (read: not sensitive) team business as a warning sign, and the new channel is addressing that nicely. Now other teams don’t have to see our chit chat about what video meeting solution we’re using or what nuance of Terraform we’re dealing with today, but team proceedings are still transparent.
We also needed to address how we talked to each other, it turns out.
Meetings != Conversations 🗣
It’s a pleasure to work with affable people who enjoy discussing subjects both tech and not. It only becomes an issue when a standup, a type of meeting designed to be deliberately short, consistently becomes a broad discussion without an agenda, something from which good decisions can come… but through happenstance rather than good planning.
I’ve dealt with enough disorganized meeting culture that I’ve burned through whatever patience I may have once had for it, and in the last couple of years, I’ve grown accustomed to tightly structured meetings designed to free their captives, er, participants as quickly as possible. This was not that.
Like the other conventions I’ve mentioned, this was a choice that made sense when the team was two people, but made less so at three and completely fell apart with four of us. To remedy this, I proposed that most mellifluous of phrases: meeting facilitation rotation. (I loathed myself a little bit every time I said it, but there’s nothing else for it.) I quote one teammate: “I think I understand a lot of those words individually.” Fair! I’m not generally known for bloviating corporate speak, which this sounds like but isn’t.
What it is is a practice that keeps meetings predictable, levels up everyone’s facilitation skills, and keeps meeting organization from being a single person’s responsibility. Our meetings feel easier, and we’ve taken to all of the roles (facilitator, recordkeeper, and timekeeper) with a surprising amount of enthusiasm.
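If it helps to see the rotation spelled out, here is a small Python sketch of one way to hand out the three roles so that they shift by one person each meeting. The role names come from this post. The function name, the roster, and the shift-by-one rule are assumptions for illustration, not a description of how our team actually assigns roles.

```python
ROLES = ("facilitator", "recordkeeper", "timekeeper")


def assign_roles(team, meeting_number):
    """Map each role to a teammate, shifting by one person per meeting."""
    if len(team) < len(ROLES):
        raise ValueError("need at least one person per role")
    # The starting person advances by one each meeting and wraps around.
    offset = meeting_number % len(team)
    return {
        role: team[(offset + i) % len(team)]
        for i, role in enumerate(ROLES)
    }


# Hypothetical roster; placeholder names only.
team = ["alex", "blake", "casey", "devon"]
first_meeting = assign_roles(team, 0)
second_meeting = assign_roles(team, 1)
```

With four people and three roles, one person sits out each meeting, and everyone cycles through every role over four meetings.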
It’s a relief to my manager not to have to try to perform all three roles, and it’s a relief to me that meetings generally don’t go longer than is actually needed and that their contents are generally communicated ahead of time. And with an ongoing standup notes doc, it’s easy to look back to see what we’ve discussed and if there was a resolution to a matter that we forgot.
Meetings are fast, but decisions aren’t ephemeral — and are available to anyone who wants to duck into our team folder in Dropbox Paper, a nice passive assist to our efforts toward transparency.
A series of all-engineering meetings has also helped us increase transparency across the org at large.
“Ask TechOps” and Ansible School 🎓
On the technical side, part of what we’ve been dealing with is identifying and vanquishing silos. It’s a natural side effect of having a one-person team for so long: the team structure literally took the form of a silo, inasmuch as a silo looks like a single person, standing alone, who has to contain all the things.
But this siloing affected other teams too. I realized in my first couple of months here that some of our tools and processes were near-total mysteries to other engineers, especially ones who had arrived, like me, after Nylas engineering started rapidly expanding in mid-2018.
I started broadly, setting up an Ask TechOps meeting every fourth Friday morning. All available TechOps team members showed up, and so did several members of engineering. We took notes on all questions asked, to ensure the information would persist. We addressed mysteries like:
- What even lives on that server with the weird mountain name?
- How can you tell if an Ansible converge failed?
- What’s Filebeat, and how do we use it?
- How do logs rotate, and what’s a log’s lifecycle?
- Where are things hosted within our infrastructure?
A pleasing mix of “what’s this standard ops tool for” and “why do we do this thing this way” emerged. Every time, I’ve sat in the meeting room at the designated start time, wondering if anyone would show up to our party. And every time, we’ve had great attendance and even better participation.
As we’ve addressed old mysteries and group understanding has increased, the conversation has turned to more day-to-day matters (“what does this error really mean, though”), and I expect the conversation to keep evolving as organizational knowledge increases. It’s a great way to get fast, regular feedback on where common understanding is and on what new mysteries have arisen in the previous month.
Ansible emerged as a common source of questions and confusion, so we added Ansible School meetings, which we expect to run until we’ve turned our notes into comprehensive documentation that makes a regular q&a unnecessary. We’re also working on making our Ansible practice simpler, and hearing people’s recurring problems with it has been incredibly valuable for ensuring our efforts go toward meaningful changes.
The information exchange is valuable, but the desiloing effect has been even more important. All TechOps team members get to contribute something, and I think we all usually learn something new from each other, while we’re filling other engineers in on things. And that’s what we most want to retain as part of our ongoing process.
Keep Talking, Keep Listening
These steps have helped TechOps make more sense to engineering at large, but they’ve also helped us make sense to ourselves. And it’s still evolving. I’m still working on a number of fronts, which include things like:
- Listening for cross-team curiosity about new tools, like Terraform, and setting up meetings to talk through it or offering to pair with interested engineers
- Figuring out the ways in which our Ansible setup still isn’t as clear as we’d like and finding how we can simplify it or better document things
- Discussing when we should use Python vs. Bash (I believe there’s a place for both, but others feel differently)
Our processes keep evolving because Nylas keeps evolving. Our team will be evolving soon, too, as it happens.
So what did all this work yield for us?
We went from the perpetual fourth-place choice (out of, yes, four teams) for engineers’ quarterly team selection to a fairly frequent number two. Moving to TechOps is not the same kind of lateral move that going between other teams is because of the tools used and team priorities, so it feels like an extra success to know that we’ve communicated who we are well enough to our fellow engineers that they can see themselves doing the work we do.
Now, our meetings have structure, and our roles in those meetings are defined, so we can stick to discussing technical problems without solving the problem of how we all work together over and over again. This has helped us get closer to our goals of planning further into the future too, which means that our work is more interesting, as we don’t have to bother so much with these subsistence-level problems.
We’re looking to add one more person to our team, and I’d tell our future SRE to expect to learn a lot of interesting things while having the chance to make a big impact across the company. TechOps works with every part of engineering in Nylas and has the gratifying job of making everyone’s lives easier, from on-call to observability to getting the most AWS can offer us. The team is young, and that means there’s still room to make a big difference.
And with shorter standups, there’s more time to do it. ;)
If you’re looking to join a fast-paced team full of technophiles, we’re currently hiring on the TechOps team! Take a peek at our careers page here.
How to Bring Structure to Your Team
- Make meetings matter: have structure, agendas, and roles, so you don’t have to reinvent things every time.
- Know what you’re there to do. Are you there to build, to teach, to scale, or something else? Make sure your team knows and that other teams know too.
- Find what people don’t know and fill them in. Even if you think your team’s function is obvious, someone doesn’t know. Talk to people to ensure you’re doing what you think you’re doing.
- If other teams depend on you, keep an open door. That can be through Slack, online or office presence, or regular q&a meetings. Invite outside perspective and learn from it.