If engineers could predict the future, what would they want to see? The outcome of the presidential election? Whether flying cars will ever take off?
More likely, they’d seek a timeline of future customer usage surges so they can properly prepare to scale.
For companies like e-commerce platform BigCommerce and Sovrn, a company that provides advertising tools and services to content creators, calculating when and how to autoscale helps engineers plan product releases and schedule process changes. Other companies, like delivery service Postmates and social platform Discord, saw an unplanned increase in users due to shelter-in-place orders from COVID-19. Open-source technology, new practices to address key issues and a few late nights in the office ensured customers still had a quality experience on their respective platforms.
Preparing for scalability is at the forefront of the following nine engineers’ minds. An upgrade to servers or processes can be a drain on resources and time, but an unexpected surge can cause latency issues that drive customers away for good. A repeated tip? Embrace simplicity. Scalability means the company can absorb user growth with the fewest possible engineering changes.
Infrastructure Architect Jerry Wong said scalability is implemented in all engineering processes at anime and manga digital media company Crunchyroll. When analyzing architectural design, the company asks if the tech can support the current scale — and also if it can support 10 times the users on the same scale.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
Scalability means that all components of a product — including architecture, infrastructure, as well as the underlying services — are able to handle the current requests of customers and gracefully meet future demands. Scalability is a very important cornerstone to our technology and platform because it serves our fans at a global scale. With over 2 million paying subscribers, our fans expect the best quality, stability and functionalities of our video on demand platform.
How do you build this tech with scalability in mind? What are some specific strategies or engineering practices you follow?
Our engineering processes encourage all engineers to design and think with scalability in mind for the microservices they are instantiating. For every product concept, there is always an architectural review immediately followed by technical discovery. We always ask two questions when analyzing an architectural design: “Can this design support the current Crunchyroll scale?” and “Can this design support 10 times the users of the current Crunchyroll scale?” These simple yet effective questions open up the scalability problem and encourage engineers to think about their design solutions in a very broad way.
These questions are asked whether we’re building a new continuous delivery pipeline, a new machine learning service that will be able to serve better recommendations to users or even prospecting a build versus buy solution.
Moving past architectural design, we always iterate toward perfection. Nobody expects the initial design of anything to be perfect from day one. Supporting functions in our company encourage effective iteration, such as QA/QE load tests and performance tests. In essence, after a technology has gone out to production, iterating on that technology is our key to success in terms of scalability.
“Our engineering processes encourage all engineers to design and think with scalability in mind for the microservices they are instantiating.”
What tools or technologies does your team use to support scalability, and why?
AWS is our cloud provider for building scalable infrastructure. The abundance of services AWS provides enables us to easily create, build and deploy services that contain core functionalities of our VOD platform on a global scale.
Observability is the next key concept we embrace to measure how well a service is performing and to identify the key offending components, which we can then iterate on and fix. New Relic is one of the major observability tools we use to instrument our services and support scalability. Content delivery networks (CDNs) are another major technology we use to better serve our fans. CDNs let us cache popular video assets at edge locations closer to our customers, giving them a better, more performant experience.
Vice President of Infrastructure Dustin Pearce said designing systems with limits helps control scalability at online grocery delivery service Instacart. Circuit breakers and controls that limit data access keep small tweaks in customer behavior or code changes from turning into tidal wave-sized problems.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
Scale exaggerates imperfections — this is why simplicity scales. Each part of your system needs to be very well defined and understood in order to scale. As you scale a system, you introduce new complexity in the form of new failure modes. More computers and more connections equal more opportunities for something to go wrong. Reasoning about those failure modes and building resilience into your system is important. It is equally important to make sure you are learning from system outages and pouring that learning back into the system in the form of additional resilience.
“Scale exaggerates imperfections — this is why simplicity scales.”
How do you build this tech with scalability in mind? What are some specific strategies or engineering practices you follow?
When approaching scale, one of the most important aspects is designing systems with limits. Writing and reading data from a database is often where scale issues are most acute. Since small changes in the behavior of your users or your code can explode into a tidal wave at scale, you need to design circuit breakers and controls that limit data access. In the earliest stages of scaling, this means that developers have to get used to doing work in batches. The most common mechanism is query interfaces, since they require paging and limit how much data any given query can return.
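The batched-access pattern described above can be sketched in a few lines. This is an illustrative example, not Instacart’s actual code; the function names and the page-size cap are hypothetical:

```python
# Sketch of batched, limit-enforced data access: every read goes through a
# paging interface with a hard cap, so no single query can explode at scale.
MAX_PAGE_SIZE = 500  # system-wide cap enforced by the query interface (illustrative)

def fetch_page(rows, cursor=0, limit=MAX_PAGE_SIZE):
    """Return at most `limit` rows starting at `cursor`, plus the next cursor."""
    limit = min(limit, MAX_PAGE_SIZE)  # callers can ask for less, never more
    page = rows[cursor:cursor + limit]
    end = cursor + len(page)
    next_cursor = end if end < len(rows) else None
    return page, next_cursor

def process_all(rows, handle):
    """Work through a large dataset in bounded batches instead of one giant query."""
    cursor = 0
    while cursor is not None:
        page, cursor = fetch_page(rows, cursor)
        for row in page:
            handle(row)
```

The same shape applies whether the backing store is a list in memory or a real database with keyset pagination; the point is that the interface itself makes unbounded reads impossible.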
What tools or technologies does your team use to support scalability, and why?
Web servers and other stateless services use immutable infrastructure and auto-scaling. This allows us to rapidly expand capacity as needed. When working with cloud infrastructure, we don’t troubleshoot or debug a single node. If there is a single node misbehaving, it just gets replaced. This is the “cattle, not pets” mentality made popular by Netflix. On the database front, we keep things simple with typical RDBMS systems managed by our cloud provider. We use our application to limit access to these databases and keep their size and workload manageable by spreading the load across several databases that hold parts of the data. This is a process called horizontal sharding.
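Horizontal sharding as described above comes down to a deterministic routing rule: each key always maps to the same database. A minimal sketch, with hypothetical shard names and hash choice:

```python
import hashlib

# Illustrative horizontal-sharding router: hash each key (e.g. a user ID) and
# use the hash to pick one of N databases. Shard names are hypothetical.
SHARDS = ["db-0", "db-1", "db-2", "db-3"]

def shard_for(key: str) -> str:
    """Deterministically map a key to a shard, spreading load across databases."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the mapping is a pure function of the key, the application layer can enforce that every query for a given user touches exactly one database, keeping each database’s size and workload manageable. (Real systems often use consistent hashing instead, so that adding a shard doesn’t remap most keys.)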
At fintech company TrueAccord, engineers are given autonomy on how they want to design their services. Director of Engineering Jeffrey Ling said they keep teams cohesive by having experienced engineers act as architects and give advice.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
Scaling software is more than just having servers work through high load. It’s also about being able to enable our team to build features quickly and safely. Enabling that takes thought on the organization as well as the tech.
Scaling is important to us because we work with millions of active accounts under strict federal and state compliance rules. If our system lags the wrong way, we are liable for any mistakes the system makes. Worse, our consumers would lose trust in us and our ability to help them out of their debts. We’re constantly working on ways to allow us to continue to build new features and meet consumer needs while dealing with these constraints.
How do you build this tech with scalability in mind?
Each engineer has full autonomy over how they want to design their services, though we do provide a recommended toolset with best practices as guidance. We keep things cohesive by having experienced engineers act as architects who find points of reusability and give advice on potential adverse effects.
As we build our tech, we try to keep a very strong separation of concerns throughout the system. We align the service boundaries with our teams in mind, working through scenarios with the purpose of reducing the amount of cross-team work needed for each scenario.
“Scaling software is more than just having servers work through high load.”
What tools or technologies does your team use to support scalability, and why?
For our back end, we use Go, Scala and Java, as well as Node.js where appropriate. Go is great for light microservices like Lambda functions, while Scala and Java are great for complex business logic that needs a more expressive language.
We use a whole smorgasbord of AWS products to host our services and train our machine learning models. We scale servers with Kubernetes and scale our warehouses with Snowflake.
For our team, we align on architecture using Miro, which deserves a special shoutout for being a great collaborative whiteboard and diagramming tool. We have a live version of our architecture that anyone can comment and ask questions on. This has allowed teams to work on architecture asynchronously as we embrace remote work.
At the onset of stay-at-home orders, delivery service Postmates hit unexpected growth. Manager of Engineering Sanket Agarwal said established code review and design processes, combined with hiring the right talent, helped meet customer demand while keeping essential employees safe.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
Scalability is providing a high-quality service to our ever-growing base of customers. We need to scale our people organization and our culture, and make our systems robust. We also take on a larger social responsibility as we scale, especially when we are providing essential services and employment during a pandemic.
Postmates is at a point where scale is a way of life. Any line of code that we add affects millions of users. This brings challenges and opportunities. On one hand, we need process and automation to avoid bugs; on the other, we can harness the power of our user base to build an experience that is magical. For instance, Postmates can recommend food that you may like. You can only build such an experience with the power of data and scale.
“Scalability is providing a high-quality service to our ever-growing base of customers.”
How do you build this tech with scalability in mind?
Robust engineering practices and a culture of customer obsession are key to building products with love, care and scalability in mind. We often care about shaving off a few milliseconds to make the user experience better.
We invest in our code review and design processes and train people to be self-sufficient. We also invest in a postmortem process where we identify and address key issues with our systems.
Hiring the right individuals is also key. With our brand, we’ve been able to attract some of the best talent the industry has to offer. They bring the experience of having built these systems and give us foresight into best practices.
During the pandemic we had a rush of people trying to stay indoors and order food, and we had to react to a surge in demand. We couldn’t have prepared for that. But our team huddled during weekends to keep the lights on. The ability to quickly react and adapt is also key to scalability, especially in a hyper-competitive environment.
What tools or technologies does your team use to support scalability, and why?
At Postmates, we rely on battle-hardened technologies. We’ve built a host of custom technology where nothing off the shelf exists for our use case or scale. But we also heavily rely on open source to avoid reinventing the wheel.
We have built an in-house mapping and planning system that can match couriers to deliveries at scale. This helps optimize delivery times while balancing batching. We also have a highly scalable data infrastructure on top of BigQuery that, along with machine learning, enables recommendation engines like our feed and search.
Most of our applications are built on a service-oriented architecture. We use open source technologies like Kubernetes and Docker to host our services.
Senior Software Engineer Kirill Golodnov knew to expect a growth surge when editing software Grammarly became available on Google Docs. Replacing programming languages, retiring outdated AWS instances and adding a caching layer prepared them for traffic spikes.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
Grammarly’s digital writing assistant is used by millions of people every day across web browsers and devices, and we’re always expanding availability. To provide this support, we deliver sophisticated writing suggestions to millions of devices in real time around the world, which means we process millions of simultaneous server connections, requiring hundreds of servers. As Grammarly continues to accelerate growth, finding solutions to these challenges is key to maintaining a seamless experience for our users.
“We believe it’s important to take a nuanced approach to scalability and not rely on adding more servers without thinking more critically about our pipelines.”
How do you build this tech with scalability in mind?
We believe it’s important to take a nuanced approach to scalability and not rely on adding more servers without thinking more critically about our pipelines. Horizontal scalability cannot solve all problems because it creates new ones; specifically, the need to run and manage thousands of servers. We use a number of tactics to reduce this need by an order of magnitude. We optimize our algorithms and neural networks. We also strive to identify programming languages and runtimes that will lessen our server load.
We historically used Common Lisp in our suggestion engine, but it was not well optimized for the high-load processes our product requires. So we recently replaced it with Clojure, a dialect of Lisp that runs on the JVM. As a result, we were able to retire many of our AWS instances. We’ve also found that sometimes adding a caching layer helps. When we were preparing to support Grammarly in Google Docs, we anticipated big spikes in traffic. Caching worked great for us in that scenario.
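The caching tactic mentioned above is easy to sketch. This is a minimal illustration of a time-to-live (TTL) cache, not Grammarly’s implementation; the class name and the 60-second TTL are assumptions:

```python
import time

# Minimal TTL cache sketch: expensive results are reused for `ttl` seconds,
# so a traffic spike of identical requests mostly hits the cache.
class TTLCache:
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key, compute):
        """Return the cached value, or call `compute()` once and cache the result."""
        value, expiry = self._store.get(key, (None, 0.0))
        if time.monotonic() < expiry:
            return value          # cache hit: skip the expensive computation
        value = compute()         # cache miss: do the real work
        self._store[key] = (value, time.monotonic() + self.ttl)
        return value
```

In front of a suggestion service, the “expensive computation” would be the backend call; during a spike, only the first request per key within each TTL window does real work.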
What tools or technologies does your team use to support scalability, and why?
Our team runs back ends on AWS. For computations, we run Docker containers on AWS Elastic Container Service with auto-scaling, which is pretty typical these days. For us, the more complex challenge is scaling storage and data processing. Grammarly users who write documents in the Grammarly editor need to be able to trust that they can safely store them there, so we’ve needed to find creative solutions for making sure this storage is distributed and resistant to failure. Ultimately, we came up with a custom architecture based on AWS DynamoDB and S3, coordinated by Apache ZooKeeper, to ensure consistency of stored documents. For big data workflows, our team uses AWS-managed Kafka as well as Apache Spark on AWS EMR, and also AWS Athena and Redshift.
Senior Software Engineer Daisy Zhou saw an uptick in users at social platform Discord due to COVID-19. Outside of the pandemic, Discord engineers configure autoscaling ahead of time so they can respond to user growth proactively rather than react to customer traffic.
In your own words, describe what scalability means to you. Why is scalability important for the technology you're building?
At Discord, we are building a welcoming platform where everyone can talk, hang out with their friends and build their communities. With so many people moving big parts of their life online recently, we have seen a huge, unexpected increase in active users of more than 50 percent since the previous year. Discord has always been known as a fast, reliable product. Scalability is the engineering work necessary to maintain the quality that users expect even in the face of such unexpected growth.
How do you build this tech with scalability in mind?
We run stateless services that can be easily scaled up or down whenever possible. Configuring how and when to autoscale ahead of time is much easier than having to manually rescale when traffic changes drastically. It also simplifies many other aspects of maintaining a service, so it is almost always worth setting services up this way from the beginning when possible.
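The “configure autoscaling ahead of time” idea boils down to a simple decision rule evaluated continuously. Below is a toy sketch in the spirit of what managed autoscalers (such as GCP instance-group autoscaling or a Kubernetes HorizontalPodAutoscaler) apply; the function name, thresholds and bounds are illustrative, not Discord’s configuration:

```python
import math

# Toy target-utilization autoscaling rule: pick a replica count that moves
# average utilization toward the target, clamped to configured bounds.
def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 0.6,
                     min_replicas: int = 2, max_replicas: int = 100) -> int:
    """Return how many replicas to run given current average utilization."""
    if current_utilization <= 0:
        return min_replicas           # idle: shrink to the floor
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))
```

Encoding the policy once, up front, is exactly what makes drastic traffic changes a non-event: the same rule that handles a quiet Tuesday handles a surge, with no human in the loop.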
For most other scaling optimizations, finding the balance between not over-engineering early and being ready in time to support more load is the trickiest part. It’s hard to pinpoint, but the ideal time to start working on scalability is after the problems we will encounter are clear but before they overwhelm us. We use a combination of resource utilization metrics and performance metrics to measure our system performance as well as the user’s experience of our system. Some clear indicators that a service needs some love are multiple shards running hot, slightly degraded performance at peak traffic times or degraded performance for individual power users and groups.
Our specific strategies depend on the specific problem. In the last few months we’ve worked on two significant scaling projects in the chat infrastructure of Discord that use pretty common strategies: request coalescing and horizontal scalability. We recently built a service that stands in front of our messages database that now coalesces requests and will allow us to easily add other optimizations in the future. Another is a re-architecture of our guilds service, which, as our biggest communities have grown, had started struggling to handle 100,000 connected sessions per Elixir process. We horizontally scaled the Elixir processes so that we can keep scaling as the number of connected sessions increases.
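Request coalescing (sometimes called “single-flight”) is the pattern behind the messages-database service described above: when many identical reads arrive at once, only one backend call is made and everyone shares the result. A minimal, illustrative sketch in Python (Discord’s service is Elixir; names here are hypothetical, and error handling is omitted for brevity):

```python
import threading

# Single-flight request coalescer: concurrent callers asking for the same key
# share one backend fetch instead of each hitting the database.
class Coalescer:
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done Event, one-item result list)

    def get(self, key, fetch):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), [])
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            holder.append(fetch(key))   # only the leader calls the backend
            with self._lock:
                del self._inflight[key]
            done.set()
        else:
            done.wait()                 # followers piggyback on the in-flight call
        return holder[0]
```

Under a read spike for a hot channel, this turns N simultaneous database queries into one, which is precisely what makes the layer in front of a messages database worth building.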
“We run stateless services that can be easily scaled up or down whenever possible.”
What tools or technologies does your team use to support scalability, and why?
A lot of Discord runs in Google Cloud Platform (GCP), and whenever possible we run stateless services as GCP autoscaling instance groups.
Many of our stateful services that maintain connections to clients, and process and distribute messages in real time, are written in Elixir, which is well suited to the real-time message-passing workload. But in recent years we have run up against some scaling limitations. Discord has written and open-sourced several libraries that ease some of the pains we’ve encountered managing a large, distributed Elixir system with hundreds of servers. These include ZenMonitor for coalescing process down-monitoring messages, Manifold for batching message passing across nodes, and Semaphore, which is helpful for throttling as services approach their limits.
Culturally, we don’t shy away from trying out new technologies when our current ones are not cutting it. ScyllaDB and Rust are two examples where our explorations have paid off and they have been good solutions to some of our problems.
At edtech company Udemy, Vice President of Engineering Vlad Berkovsky and Senior Director of Engineering Cathleen Wang said that scalability is about building solutions that focus on customers’ problems without adding complexities. To do that, they adopted a hybrid cloud infrastructure and run the website from a private and public cloud.
In your own words, describe what scalability means to you. Why is scalability important for the technology you're building?
“Scalability means building high-quality products and services that solve business problems,” Wang said. “This is important because as our business grows, the needs of features and capabilities naturally become more complex. Designing and building solutions that focus on key customer problems without introducing unnecessary complexity or a less desirable user experience is critical to delivering high-quality products and services that can enable and support business growth. Solutions we build should be both performant and flexible to address business needs.”
“We designed our infrastructure with scalability in mind.”
How do you build this tech with scalability in mind?
“Decision on the architecture of systems is typically specific to the application,” Berkovsky said. “Specific issues can arise around the site response times, the volume of data reads and writes, etc. Generally speaking, there are three options for technology scaling: horizontal duplication, splitting the system by function, and splitting the system into individual chunks.
“At Udemy, we designed our infrastructure with scalability in mind. If the infrastructure doesn’t scale, the application scalability won’t save the day. This is why we adopted a hybrid cloud architecture: running the site from both the private and public clouds. Private cloud provides us with predictable performance and cost, and public cloud ensures a virtually unlimited capacity for scaling out as needed.
“On the people side, we are a DevOps company. Our engineers write and then own their code in production. This continued ownership increases quality and accountability. The goal of the site operations teams is to enable developers to maintain this ownership.”
What tools or technologies does your team use to support scalability, and why?
“Once the capacity question is solved and the capacity is secured, the next big questions are around the application scalability,” Berkovsky said. “It is very difficult to predict which parts of the system will become bottlenecks as the site load grows. This is why we are making use of ‘game days’ when we stress test our site or site components to identify the bottlenecks under specific load profiles.
“Automation is essential to help scale the site operations teams. Automation helps to move the focus from performing repeatable tasks manually to automating these and focusing on more important strategic work. At Udemy, we adopted the infrastructure-as-code approach to managing infrastructure. We don’t allow direct manual changes to the infrastructure; any changes need to be implemented as code to ensure predictable and repeatable results.
“We make use of popular configuration and orchestration tools such as Ansible and Terraform. Because these tools are vendor agnostic, they work equally well in public and private cloud implementations. This allows us to use the same toolkit across multiple cloud platforms, reducing our effort to manage these platforms.”
Sandeep Ganapatiraju, a lead software development engineer, said the goal of scalability at the e-commerce platform BigCommerce is to increase customer usage with the least amount of technical changes. Multiple simulations of customer types and usage help them plan out when to release new business features.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
To me, it’s more about the ability of the platform to adapt to increasing customer usage with the least amount of changes needed.
This means having a clear strategy by simulating various customer usage scenarios upfront, based on how the business expects usage to grow over the next few months, including product releases moving from alpha to beta to general availability.
We have a clear scalability plan when we can calculate that over the next “x” months customer usage will grow by “y,” while the system is able to tolerate a “y + z” load for a short period of time.
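The “y + z” rule above is simple arithmetic: grow the current load by the expected growth, then add burst headroom on top. A worked example with illustrative numbers (not BigCommerce’s actual figures):

```python
# Capacity planning per the "y + z" rule: expected growth y over the next x
# months, plus short-burst headroom z on top of the grown load.
def required_capacity(current_load: float, growth_pct: float, burst_pct: float) -> float:
    """Peak load the system must tolerate for short periods."""
    grown = current_load * (1 + growth_pct)   # load after expected growth y
    return grown * (1 + burst_pct)            # plus burst headroom z

# e.g. 10,000 req/s today, 50% expected growth, 25% burst headroom:
# required_capacity(10_000, 0.50, 0.25) -> 18750.0 req/s
```

Load tests then target that “y + z” number rather than today’s traffic, so the plan and the simulations described below measure the same thing.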
“Scalability is more about the ability of the platform to adapt to increasing customer usage with the least amount of changes needed.”
How do you build this technology with scalability in mind? Share some specific strategies or engineering practices you follow.
We run multiple simulations: the largest user generating even heavier usage, an average user operating at average usage and a sudden short burst of usage, such as a sale.
Each of these scenarios is specifically thought about while keeping business in mind. Then, we keep the business informed to help a planned rollout of new features.
What tools or technologies does your team use to support scalability, and why?
BigCommerce uses JMeter and query EXPLAIN plans for quick checks. Then we use BlazeMeter, once we have a solid test laid out, to test overall system performance.
We do a lot of monitoring of production using Grafana dashboards to see how various services are performing. We also do canary deployments, releasing a new version to a specific subset of users to catch any unexpected bugs or scaling issues. New Relic alerts us if requests get backed up and time out by more than “x” percent in a given time. We also monitor requests that take a long time with distributed tracing tools to see which specific microservice is slowing the overall request.
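The timeout-percentage alert described above is a threshold rule over a time window. A minimal sketch of the logic (illustrative only; the real rule lives in New Relic’s alert configuration, and the 1 percent default here is an assumption):

```python
# Illustrative alerting rule: page when more than `threshold_pct` percent of
# requests in a window have timed out.
def should_alert(timeouts: int, total: int, threshold_pct: float = 1.0) -> bool:
    """Return True when the window's timeout rate exceeds the threshold."""
    if total == 0:
        return False  # no traffic in the window: nothing to alert on
    return 100.0 * timeouts / total > threshold_pct
```

Expressing the rule as a rate rather than an absolute count is what keeps it meaningful across traffic levels: 15 timeouts is noise at a million requests and an incident at a thousand.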
At Sovrn, scalability is crucial since its network handles tens of thousands of API requests per second. Engineering Manager Theo Chu said they must anticipate future demand when building in order to prevent latency and keep customers happy.
In your own words, describe what scalability means to you. Why is scalability important for the technology you’re building?
Scalability means building our technology and products from the ground up with future scale in mind. It means anticipating future demand and being able to meet that demand without having to re-engineer or overhaul our system. Scalability is especially crucial for us since we handle tens of thousands of API requests per second. Our data pipelines process and analyze billions of events per day. We support traffic from major publishers on the web and our partners expect us to handle their scale on a daily basis without sacrificing latency or reliability.
“Scalability means building our technology and products from the ground up with future scale in mind.”
How do you build this tech with scalability in mind?
Building for scalability means designing, building and maintaining engineering systems with a deep technical understanding of the technologies that we use and the performance constraints of our systems. Our approach has been to build for scalability from the bottom up through both technical and product perspectives. On the technical front, we rely on underlying technologies and frameworks that enable scale.
On the product front, we find that building foundational components early on for anticipated scale pays off much better than having to re-architect the system later. This was exemplified by our Optimize product, where we have been able to scale effortlessly after designing our database in earlier iterations to handle hundreds of millions of mappings.
What tools or technologies does your team use to support scalability, and why?
We rely on a range of technologies that support and process data at scale, including Cassandra, Kafka and Spark. Since these are the foundational blocks of our system, we optimize them heavily and load test each component up to multiples of our current scale to enable scalability and address any bottlenecks. Since our infrastructure is fully on AWS, we also utilize AWS tools that support scalability such as autoscaling groups.