Why DataOps Is the Solution for Smarter Machine Learning
Many of us like to binge-watch, but if the speed and features of our streaming service don't keep us engaged, we might not renew our subscription. To address this issue in 2015, Netflix turned to machine learning and released its first global recommendation engine, which used dozens of algorithms to cross-compare data and find similar users across more than 190 countries.
In a quarterly journal of the Association for Computing Machinery, Netflix's CPO, Neil Hunt, explained that this foray into machine learning would save the company $1 billion per year. The reason: the speed and accuracy of the recommendation engine would sustain user interest and significantly reduce churn.
Machine learning doesn't always go smoothly, however, and the key issue is exactly what made Netflix so confident: accuracy. While machine learning can spot correlations across enormous amounts of data, algorithms can grow stale and their outputs inaccurate, and training data can carry inherent human biases, including, as ProPublica has reported, incidents of racial bias.
On top of this, a report on Medium notes that data science teams incur enormous technical debt by running machine learning systems without the tools and processes to monitor and maintain them.
Enter DataOps. Combining DevOps, statistical process control and Agile software development, DataOps is an automated, process-oriented methodology that improves the velocity and accuracy of data analytics and supports enhanced machine learning efforts. To learn more, we sat down with Skylar Payne of Mindstrong to discuss exactly how and why his team uses DataOps to support machine learning and improve the quality of the data it produces.
Skylar Payne is a senior software engineer at Mindstrong, a mental healthcare company that offers a virtual care platform for people with mental illness. Working with a subject as sensitive as mental wellbeing means reliability is paramount, and Payne said he eventually recognized that long lead times and contention over data were causing significant holdups.
What first prompted your team to adopt a DataOps strategy for machine learning?
We noticed that we had long lead times — sometimes around three weeks — to get access to new data sources for analysis. We also often found surprising issues within our data and had serious contention between producers and consumers of data. It’s difficult to do great data analytics or data science in an environment like that.
What was the biggest challenge you faced in implementing DataOps, and how did you overcome it?
Implementing a DataOps model is challenging because it necessarily involves both processes and technology, and there are many options for each. To make sure we evaluated solutions focused on the right problems, we first conducted a survey across all data-centric roles. We then evaluated several tools against the survey findings, including Databricks for compute and storage and Bigeye for data monitoring.
What’s the biggest efficiency or benefit your team has seen as a result of adopting DataOps?
By focusing on DataOps, we have made our data more reliable. We aren’t as concerned that there are issues with the data, because it’s monitored. We aren’t as concerned that it is stale because we optimized data producers. DataOps is never finished — we still have more we can improve — but we are already seeing value from our investments.
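The monitoring Payne describes can be as simple as automated checks on each new batch of data for freshness and completeness. The sketch below is an illustrative example of that idea, not Mindstrong's or Bigeye's actual implementation; the thresholds, record format, and function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Illustrative DataOps-style data-quality checks.
# Thresholds and the record format are assumptions for this example.

FRESHNESS_LIMIT = timedelta(hours=24)   # batches older than this are "stale"
MAX_NULL_RATE = 0.05                    # tolerate up to 5% missing values

def check_batch(records, required_field, loaded_at, now=None):
    """Return a list of human-readable issues found in a batch of records."""
    now = now or datetime.now(timezone.utc)
    issues = []

    # Freshness: flag the batch when it was loaded too long ago.
    if now - loaded_at > FRESHNESS_LIMIT:
        issues.append(f"stale: loaded {now - loaded_at} ago")

    # Completeness: flag when too many records miss a required field.
    if records:
        nulls = sum(1 for r in records if r.get(required_field) is None)
        null_rate = nulls / len(records)
        if null_rate > MAX_NULL_RATE:
            issues.append(f"null rate {null_rate:.0%} for '{required_field}'")
    else:
        issues.append("empty batch")

    return issues
```

Run on a schedule against each data source, a check like this surfaces stale or incomplete data before analysts and data scientists build on it, which is the kind of producer/consumer contention the team set out to eliminate.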