The Technologies These Bay Area Data Science Teams Rely On

Experts from Instacart, Coursera, Mindstrong and Opendoor share their pipeline stacks.

Written by Kelly O'Halloran
Published on Jul. 21, 2021

Instacart became an essential service during the pandemic, which led to hundreds of new retailers partnering with the platform, millions of additional customers and the company’s first profitable month since launching in 2012.

Massive spikes in data followed. 

“Many of our systems had grown to hit their limits last year, and our data pipelines were no exception,” said Senior Data Engineer Zack Wu. 

Since joining Instacart in 2018, Wu had relied on Snowflake for scalability. But as he pointed out, cloud data warehouses only work well when the pipelines feeding them are built efficiently, and not all of Instacart’s were.

“There were multiple legacy pipelines that were built on the premise of small datasets that were OK to reprocess the entire historical dataset every run,” Wu said. “We had to rethink how we ran pipelines incrementally in these cases in order to really scale with the business.”

New tools, an organizational shift and the initial stages of a data mesh model have since been implemented to help Instacart keep pace with its swelling business.

Data engineering teams at Coursera, Mindstrong and Opendoor have had to make similar adjustments. Here’s how all four companies are staying ahead of their rising volumes of data.

 

Sherry Wang
Data Science Manager • Opendoor

Data tech stack at Opendoor:

Snowflake, Apache Airflow, Segment and Fivetran 


What technologies or tools are you currently using to build Opendoor’s data pipeline, and why did you choose them?

We use a suite of commercial and open-source tools to build our data pipelines. Two primary technologies we rely on are Snowflake and Airflow. We use Snowflake for data warehousing and extraction, transformation and loading processes. The cloud nature and flexible architecture of Snowflake make it easy to integrate with our existing cloud infrastructure on Amazon Web Services (AWS) and allow for scalability when we need to optimize for query performance. Its standardized and intuitive SQL patterns are user-friendly to data scientists and analysts. In addition, the integration with an ecosystem of other technologies, such as Segment and Fivetran for data ingestion, has allowed us to easily expand to new use cases.

We leverage Airflow for data pipeline management. Our engineers built customized internal tools and libraries with Airflow that allow data scientists to create, modify and deploy data pipelines and jobs by simply writing an SQL script.
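
What that can look like in practice: below is a minimal, illustrative Airflow DAG (not Opendoor’s internal library) that schedules a standalone SQL script against Snowflake. It assumes Airflow 2.x with the Snowflake provider installed and a configured “snowflake_default” connection; the DAG name and script path are hypothetical.

```python
# Illustrative only -- not Opendoor's internal tooling. A minimal Airflow DAG
# that turns a standalone SQL file into a scheduled Snowflake job, assuming the
# apache-airflow-providers-snowflake package and a "snowflake_default" connection.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="daily_listings_rollup",           # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    template_searchpath=["/opt/airflow/sql"],  # hypothetical folder of SQL scripts
) as dag:
    # The operator loads the data scientist's SQL script directly, so
    # contributing a new pipeline is mostly a matter of writing SQL.
    build_rollup = SnowflakeOperator(
        task_id="build_rollup",
        snowflake_conn_id="snowflake_default",
        sql="daily_listings_rollup.sql",       # hypothetical script name
    )
```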

In order to ensure data pipelines scale, we take a holistic approach with technology, process and culture.”

 

As Opendoor — and thus, your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

This year, we have already launched in 18 new cities. As we expand our offerings of services and products, challenges arise in managing both the volume and complexity of our data. In order to ensure data pipelines scale, we take a holistic approach with technology, process and culture. We constantly try to understand the evolution of needs and evaluate tools and technology we need to support new capabilities. Our data platform teams follow new developments from the industry to assess and decide the right options for us.

We’re increasingly committed to improving internal processes and documentation so that, as teams grow, we rely less on tribal knowledge. Well-defined documents and guides help more data users and pipeline contributors to get started on new topics seamlessly. Additionally, getting data right is not just the responsibility of data scientists and engineers. We create an insight-driven culture that invites everyone across different functions and roles to leverage data. This incentivizes employees to prioritize capturing new data in useful ways.

 

Opendoor is a residential real estate platform that makes the entire process of buying and selling property possible from an app. 

 

Xinying Yu
Sr. Manager, Data Science and Machine Learning • Coursera

Powering the Pipeline at Coursera:

Amazon products such as Simple Storage Service (S3), Redshift and SageMaker, as well as Databricks Delta Lake and Apache Airflow 


What technologies or tools are you currently using to build Coursera’s data pipeline?

We use S3 for unstructured and unprocessed data, Databricks Delta Lake for structured and standardized data, and Redshift for highly transformed and aggregated data that powers our analytics. We also use Apache Airflow for pipeline orchestrations and SageMaker to develop machine learning models. These technologies enable us to leverage state-of-the-art tools and work more efficiently; infrastructure engineers don’t have to spend cycles building features that are already available in the market. They’ve also improved our data product launch efficiency and scalability.
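
As a rough illustration of the first two hops, the PySpark sketch below (not Coursera’s actual jobs) reads raw JSON events from S3 and lands them as a structured Delta Lake table; the bucket, columns and table name are hypothetical, and a downstream job would build the aggregated Redshift layer.

```python
# A minimal PySpark sketch of the S3 -> Delta Lake hop. All names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # pre-created on Databricks

# Unstructured, unprocessed zone: raw JSON events in S3.
raw = spark.read.json("s3://example-raw-events/course_enrollments/")

# Standardize types and keep only the columns downstream consumers need.
standardized = (
    raw.withColumn("enrolled_at", F.to_timestamp("enrolled_at"))
       .select("learner_id", "course_id", "enrolled_at")
)

# Structured, standardized zone: a Delta Lake table.
standardized.write.format("delta").mode("append").saveAsTable("events.course_enrollments")
```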
 

We’ve enabled new data scientists and engineers to start making an impact on day one.”

 

As Coursera — and your volume of data — grows, what steps are you taking to ensure your pipeline continues to scale with the business?

We’ve enabled new data scientists and engineers to start making an impact on day one by migrating from homegrown tools to industry-standard software that enhances the data and machine learning stacks. Our design with S3, Databricks, Redshift and Airflow is able to support our data pipeline at a significant scale. Meanwhile, SageMaker gives data scientists the ability to build, train and deploy machine learning models quickly for offline, nearline and online use cases.
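
For context, a hedged sketch of that build-train-deploy loop with the SageMaker Python SDK might look like the following; the training script, IAM role and S3 paths are hypothetical and not Coursera’s.

```python
# A sketch of training and deploying a model with the SageMaker Python SDK.
# The entry point, role ARN and S3 prefixes are placeholders.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/ExampleSageMakerRole"  # hypothetical

estimator = SKLearn(
    entry_point="train.py",        # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.0-1",
    sagemaker_session=session,
)

# Train on features exported by the data pipeline (hypothetical S3 prefix),
# then stand up a real-time endpoint for online use cases.
estimator.fit({"train": "s3://example-feature-store/recommendations/train/"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```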

In addition, we continuously evaluate the stability and scalability of our data infrastructure. On top of the technologies we’ve migrated to, our team is enhancing end-to-end development and deployment workflows to facilitate the scale of real-time machine learning applications.

 

Coursera’s edtech platform hosts open online courses, specializations and degrees.

 

Zack Wu
Senior Data Engineer • Instacart

Tools Backing Instacart's Data Growth:

Snowflake, PostgreSQL, S3 and in-house tools


What technologies or tools are you currently using to build Instacart’s data pipeline? 

Even though it’s good to have a level of abstraction between the data sources and the data pipeline that runs downstream, it’s important to understand the many types of data that feed into our pipeline. That includes PostgreSQL, Snowplow, events, data sharing and integrations. Our actual data warehouse lives in Snowflake, which has enabled us to scale out pipelines with a simple click of a button. To coordinate our data pipeline, we leverage Airflow 1.x as well as an in-house scheduler system that utilizes PostgreSQL databases, and we’re actively looking to migrate to a new system for better long-term scalability.

For our data pipeline, we run an in-house tool, written in Python, that’s similar to dbt and allows our users to write pipelines more as software rather than straight SQL. We wanted to approach data pipelines in a similar way to how software engineers approach code, with reuse and abstraction being key components. Prior to this, data pipelines consisted of various SQL blocks stitched together with rudimentary dependencies in a manner that was not very reproducible. 
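
Instacart’s tool is in-house and not public, but the general idea of pipelines-as-software can be sketched as follows: each step is a reusable Python object with declared dependencies rather than an ad hoc SQL block. Everything below is illustrative.

```python
# Illustrative only -- a toy version of the pipelines-as-code idea, not Instacart's tool.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SqlStep:
    name: str
    sql: str                                   # query run against the warehouse
    depends_on: List[str] = field(default_factory=list)


def run_pipeline(steps: Dict[str, SqlStep], execute: Callable[[str], None]) -> None:
    """Run steps in dependency order (simple depth-first topological sort)."""
    done: set = set()

    def visit(name: str) -> None:
        if name in done:
            return
        for upstream in steps[name].depends_on:
            visit(upstream)
        execute(steps[name].sql)
        done.add(name)

    for name in steps:
        visit(name)


# Hypothetical usage: shared steps can be reused across pipelines.
steps = {
    "stg_orders": SqlStep("stg_orders", "SELECT * FROM raw.orders"),
    "daily_gmv": SqlStep(
        "daily_gmv",
        "SELECT order_date, SUM(total) FROM stg_orders GROUP BY 1",
        depends_on=["stg_orders"],
    ),
}
run_pipeline(steps, execute=print)  # swap print for a Snowflake cursor in practice
```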
 

We wanted to approach data pipelines in a similar way to how software engineers approach code.”


As Instacart — and your volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

We’ve had to rethink our approach to data as we scale. We moved our data team to an infrastructure-oriented model, whereas before we had a member of our data team embedded into each business function. We built out self-service tools and platforms that allow any number of people to create their own data pipelines across the company.

Metrics, tables and dashboards all grew in a similar pattern with data, leading to inconsistent definitions and unnecessary duplication, all while making ownership and lineage tough to track. This led us to buy into the concept of a “data mesh,” which serves to decentralize and distribute the creation of data products across business domains while enforcing centralized governance and processes. We’re in the process of moving toward this model over the next few years, which will mean that any given business domain can work cross-functionally to create and operate a data product in a framework that inherently provides a standardized model with best practices. 

 

Instacart operates grocery delivery and pick-up services for more than 600 retailers.

 

Jon Knights
Director, Data Science • Mindstrong

Mindstrong's Data Pipeline:

Databricks, Delta Lake, Airflow, MongoDB and Apache Spark

 

What technologies does Mindstrong use to build a data pipeline?

We primarily leverage Databricks and Delta Lake on AWS, integrated with a few in-house tools and data stores like Airflow and MongoDB. For our team, some differentiators are the ease of integrating and sharing notebooks with Spark, which has streamlined ingestion and processing of our data. Other features that have won over our group are how easily recurring jobs can be set up and monitored, configurable and shareable cluster settings, and straightforward debugging thanks to accessible local log files. Further, the ability to work in multiple programming languages and the option to plug in RStudio Server offer flexibility that is attractive in the long term. Because we’re in healthcare, Delta Lake’s support for ACID transactions, which eases compliance with the California Consumer Privacy Act and General Data Protection Regulation down the road, is even more attractive.
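
To illustrate why those ACID guarantees matter for privacy requests, here is a minimal Delta Lake sketch (not Mindstrong’s code) in which one participant’s records are removed in a single atomic transaction; the table and column names are hypothetical.

```python
# A minimal sketch of a privacy-driven delete on a Delta Lake table.
# Table and column names are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created on Databricks

measurements = DeltaTable.forName(spark, "health.daily_measurements")

# Remove one participant's records in one ACID transaction (e.g., a CCPA/GDPR request).
measurements.delete("participant_id = 'example-participant-id'")

# Later, VACUUM permanently removes the underlying files that held the deleted
# rows once they fall outside the retention window.
measurements.vacuum(retentionHours=168)
```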
 

The combination of Delta Lake with Apache Spark — coupled with AWS resources — has positioned the team nicely to continue to scale modeling efforts for the midterm.”

 

As Mindstrong — and its volume of data — grows, what steps are you taking to ensure your data pipeline continues to scale with the business?

Our migration to Databricks happened fairly recently. We believe that the combination of Delta Lake with Apache Spark — coupled with AWS resources — has positioned the team nicely to continue to scale modeling efforts for the midterm. However, we continue to keep an eye out for new best practices and solutions. We partner closely with our data engineering team and external collaborators to assess how our data ingestion and utilization compares with the rest of our industry and others, and what larger organizations are using to solve similar problems.

 

Mindstrong provides care for people living with serious mental health conditions. 

 

Responses have been edited for clarity and length. Images via listed companies and Shutterstock.
