Site Reliability Is Pushing Sendoso Forward

The website is down. The server won’t connect. There has been an unexpected interruption.

There are few things more disruptive to business than the company’s website crashing, causing chaos not only for the company itself but also for all customers relying on its services. Google was the first to realize the need for a specific team to combat these issues, proposing and forming the first site reliability team dedicated solely to infrastructure maintenance and high-level modernization in 2003.

Nineteen years later, many companies have still not formed such a team. In fact, the adoption of site reliability engineering (SRE) models only grew from 15 percent in 2020 to 20 percent in 2021, according to the DevOps Institute’s 2021 Upskilling Report. So why should more companies consider SRE?

“As the company continues to grow, having a group that is capable of looking at production as a whole to ensure a consistent level of reliability is essential,” explained Luis Davim, a staff site reliability engineer at sending platform Sendoso.

A study by Statista found that the average cost of critical server outages in 2020 was between $301,000 and $400,000 per hour, though 17 percent of respondents said their cost per hour was in excess of $5 million. If time is money, then a downed website is the same as a bounced check.

Site reliability engineering can be the difference between a company’s server running smoothly and efficiently and a financial and reputational setback. We sat down with Davim to talk further about the improvements he’s seen since Sendoso started its site reliability journey, as well as what he looks for in new candidates for the SRE team.

Sendoso team members at Top Golf — Sendoso

Luis Davim

Staff Site Reliability Engineer • Sendoso

What prompted you to create a site reliability team within your engineering organization?

As the company continues to grow, having a group that is capable of looking at production as a whole to ensure a consistent level of reliability is essential. This team also focuses on providing and maintaining common infrastructure and tooling, allowing the other engineering teams to focus on the business logic as much as possible.

What advantages or improvements have you seen since implementing site reliability engineering?

We’re still at the early stages of our SRE journey, but we’ve already seen improvements in consistency on CI/CD approaches across teams and awareness of the importance of observability and performance. We’ve introduced a production review process that helps with a range of areas.

PRODUCTION REVIEW BENEFITS:

Visibility — Ensures all employees are aware of the main production issues and how these affect the bigger picture for customers.
Opening up silos — Allows each team to see how their work affects other teams, which helps with prioritizing and alignment.
Quality — Allows leadership and employees to look at potential areas for improvement and create action items.
Knowledge transfer — Allows teams to see the issues faced by other teams, minimizing similar issues in the future.

How is your site reliability team structured, and what skills do you look for in a good SRE?

Our main goal is to enhance developer experience, making developers more autonomous and enabling self-service by providing tools and libraries that they can use. We’re building a CLI that simplifies interactions with Kubernetes, AWS and other infrastructure.

When we’re hiring an SRE, the first thing we look for is the ability to write code. Experience with any language like Python or Ruby is fine, but we use Golang at Sendoso. We also use Kubernetes and Helm a lot, so it’s important that candidates have some exposure to them and are familiar with at least the basic concepts. Next, we’re strong believers in infrastructure as code, so knowledge of Terraform is a plus. Knowledge of security and networking principles is also beneficial.
