Technologists at 4 SF Companies Explain Their Data Science Methodologies

Before a data science undertaking truly gets off the ground, the minds behind it need to have a clear idea of where it’s going.

According to Pai Liu, who leads data science at e-commerce destination Wish, “starting with the end result” is a foundational move in the elementary stages.

“Impact is what data scientists are shooting for; we achieve it by helping businesses make the right decisions through good use of data,” Liu said.

Similarly, keeping an eye on the “why” is key for TrueAccord’s Aviv Peretz, who described his team’s data science practice as “closest to CRISP-DM,” the Cross-Industry Standard Process for Data Mining.

“When we’re conceptualizing a new model, the first step is always understanding its objective from a business perspective,” the principal data scientist said.

For Jonathan Rioux, of design and product consulting organization EPAM Systems, once he and his colleagues define what they’re aiming to address, they can embark on a data science technique he likened to a “series of experiments” that fit into the broader strategic framework.

“The results of each experiment should feed into the greater goal, and learning is always the main objective,” the analytics consultant said.

And while a certain methodology can be an important mechanism for steadying data science efforts, Atlassian’s Head of Data Science Mark Scarr — who described his team’s approach as influenced by CRISP-DM and other methodologies — said that it’s necessary for it to remain malleable.

“In essence, a methodology is only a guide; it should be flexible enough to be tailored to a specific situation as needed,” Scarr said.

Mark Scarr

Head of Data Science • Atlassian

Tell us a bit about the data science methodology your team follows. What are the core tenets or steps of this methodology?

Data science is very much a team sport, and we adopt a collaborative engagement model. We start out by ensuring that any AI projects are firmly anchored within a business context, which is critical for the overall success of any data science team. Then we follow the key high-level ML lifecycle steps: business context, data prep, build model, validate, deploy, monitor, and refine. This is an iterative process. We ensure a close partnership with the appropriate teams at each stage, ultimately creating a “flywheel effect” to optimize and compress the ML model R&D deployment cycle.

What steps did you take to develop this methodology? Was it inspired or influenced by another methodology?

The current methodology has evolved over many years through study and experimentation. It is inspired by multiple sources, including: PDSA (plan-do-study-act, or the “Deming” or “Shewhart” Cycle); DMAIC (define-measure-analyze-improve-control) cycle; and CRISP-DM. It is essentially a hybrid of these different approaches tailored to the specific environment and use case.

How have you evolved this methodology over time to ensure it suits your team’s needs? How did you know it needed refining?

As data science here has grown, the approach and process have evolved and become more formalized. Infrastructure development around core ML services and tooling has also allowed us to adjust and scale accordingly. The key principle around this, or any methodology, is flexibility: the ability to adapt and pivot as required in an ever-changing environment. In essence, a methodology is only a guide; it should be flexible enough to be tailored to a specific situation as needed.

Pai Liu

Head of Data Science • Wish

Tell us a bit about the data science methodology your team follows. What are the core tenets or steps of this methodology?

The data science approach involves three main steps: generating data and information, then knowledge and insights, and finally wisdom and influence. The first step involves developing a strong data foundation. The second step involves generating actionable insights from data and validating them through experimentation. The last step involves connecting the dots and influencing decisions.

What steps did you take to develop this methodology? Was it inspired or influenced by another methodology?

Developing the approach involves two steps. One, starting with the end result or impact, which is what data scientists are shooting for. We achieve impact by helping businesses make the right decisions through good use of data. Two, considering the life cycle of data, from generation through processing to insight, and improving quality and efficiency throughout that cycle.

How have you evolved this methodology over time to ensure it suits your team’s needs? How did you know it needed refining?

On one hand, technology is always evolving, and we leverage state-of-the-art technology and develop our own innovative solutions so that we can make an impact more efficiently and at greater scale. On the other hand, the general flow and life cycle of data doesn’t change much. The impact-driven culture for the data science team probably won’t change in the foreseeable future.

Aviv Peretz

Principal Data Scientist • TrueML

Tell us a bit about the data science methodology your team follows. What are the core tenets or steps of this methodology?

My core tenet of data science methodology is “get the data right.” This may sound simplistic, but it is anything but. A lot has been said about opting for more powerful learning algorithms, as well as avoiding pitfalls such as overfitting. All of that is true. However, having spent many years in the field, I can attest that when a model is not performing as expected, it’s almost always because either there are discrepancies in the data set or the data is modeled incorrectly.

Therefore, “get the data right” essentially means knowing your data set inside and out; paying close attention to data quality and preparation issues; and picking the representation of a data point wisely. In my experience, when these things are taken care of, everything else tends to fall into place. Nevertheless, this work is delicate and necessitates exceptional attention to detail — “The Princess and the Pea”-level attention to detail.
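As a rough illustration of the kind of data quality checks described above, the Python sketch below flags missing values and exact duplicate records in a small list of records. It is not TrueAccord’s code: the function name `find_data_issues`, the field names and the sample data are all hypothetical, and a real pipeline would check far more than this.

```python
from collections import Counter


def find_data_issues(rows, required_fields):
    """Return human-readable problems found in a list of dict records.

    Assumes every value in a record is hashable (numbers, strings, None).
    """
    issues = []

    # Check 1: every record has a non-empty value for each required field.
    for index, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                issues.append(f"row {index}: missing value for '{field}'")

    # Check 2: exact duplicate records, which can silently skew a model.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    for record, count in counts.items():
        if count > 1:
            issues.append(f"duplicate record seen {count} times: {dict(record)}")

    return issues


# Hypothetical sample data: one missing balance and one duplicated record.
sample = [
    {"account_id": 1, "balance": 120.0},
    {"account_id": 2, "balance": None},
    {"account_id": 1, "balance": 120.0},
]
issues = find_data_issues(sample, ["account_id", "balance"])
for issue in issues:
    print(issue)
```

Running checks like these before any modeling work makes discrepancies visible early, rather than leaving them to surface later as unexplained model behavior.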

What steps did you take to develop this methodology? Was it inspired or influenced by another methodology?

I would say that our data science methodology is closest to CRISP-DM. When we’re conceptualizing a new model, the first step is always understanding its objective from a business perspective. To that end, we make it a habit to work closely with the product team, as well as the team responsible for “Heartbeat,” our proprietary decisioning engine. Oftentimes, the business understanding step also includes some brainstorming around the treatments of a model’s future experiment. Next come the data understanding and data preparation steps, whereby we map out all the data sources and make sure that all required pieces of data are available. Then comes the data modeling phase.

How have you evolved this methodology over time to ensure it suits your team’s needs? How did you know it needed refining?

Not only have we evolved our methodology over time, we’ve also streamlined it. When I joined TrueAccord, I was the only data scientist; I literally had to write the backbone of the data science codebase from scratch. Looking back, I would say that my methodology took shape as I went along and laid the data science foundation.

As the team was growing and we were rolling out more models, we started taking advantage of code reuse. As a result, our codebase has become more modular, which in turn helped us streamline model development.

Moreover, tackling various bugs has prompted us to enrich our codebase with debugging modules geared toward validating the underlying data of our models. Once again, this evolution happened naturally, as one thing led to another.

Finally, as our models become more intricate, they require a more nuanced grasp of the product. Therefore, gaining a profound understanding of the product has been a significant refinement of our methodology that has helped us up our game.

Jonathan Rioux

Consultant, Enterprise Analytics • EPAM Systems

Tell us a bit about the data science methodology your team follows. What are the core tenets or steps of this methodology?

We approach data science in a very practical way, since we often develop both the data science model(s) and the platform. Because of this, we’ve adopted approaches from our engineering excellence group and our agile competency center for our insights-driven projects.

The first step of applied data science is finding the right problem to solve and understanding the problem well enough to be impactful. Then, the team usually frames the model development as a series of experiments. I tend to prefer this over the traditional sprint since we are closer to the “hypothesis–validation–conclusion” cycle seen in experimental sciences. We set a goal, build one or more models to support our experiment and draw conclusions. The results of each experiment should feed into the greater goal, and learning is always the main objective.
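One way to picture the “hypothesis–validation–conclusion” cycle is as a small experiment log. The Python sketch below is illustrative only and is not EPAM’s tooling: the class names, the metric values and the thresholds are invented for the example, and the `run` callables stand in for real model training and evaluation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str            # what we expect to learn or show
    run: Callable[[], float]   # builds and evaluates a model, returns a metric
    threshold: float           # metric value needed to support the hypothesis


@dataclass
class Conclusion:
    hypothesis: str
    metric: float
    supported: bool


def run_experiments(experiments: List[Experiment]) -> List[Conclusion]:
    """Run each experiment in order and record what was learned."""
    log = []
    for experiment in experiments:
        metric = experiment.run()  # validation step
        supported = metric >= experiment.threshold
        log.append(Conclusion(experiment.hypothesis, metric, supported))
    return log


# Hypothetical experiments; the lambdas stand in for real model evaluation.
experiments = [
    Experiment("Baseline features reach the target accuracy", lambda: 0.71, 0.75),
    Experiment("Adding engagement features reaches the target", lambda: 0.80, 0.75),
]
log = run_experiments(experiments)
for conclusion in log:
    status = "supported" if conclusion.supported else "not supported"
    print(f"{conclusion.hypothesis}: {conclusion.metric:.2f} ({status})")
```

Keeping unsupported hypotheses in the log matters as much as keeping the successes, since each result is meant to feed the broader goal.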

What steps did you take to develop this methodology? Was it inspired or influenced by another methodology?

My background is in risk management (actuarial), so the concept of risk mitigation or management really inspired the methodology that my team and I use. Running a data science project comes with multi-faceted risks: Is the problem clearly defined? Does the data support the problem? Do we understand the data adequately? Are we solving the right problem? I found a lot of similarities to how other disciplines approach problems and found the vocabulary approachable and easy to communicate. Plus, the lab analogy works well.

The methodologies listed in the question tend to be hyper-focused on the data; it’s a means and an end. It can be very appropriate when mining or exploring data but, for me, modeling is all about learning something or solving a business problem. By putting the problem statement front and center, you avoid playing “model collection” — building models for the sake of building them. More importantly, you minimize the risk of solving the wrong problem.

How have you evolved this methodology over time to ensure it suits your team’s needs? How did you know it needed refining?

Data scientists come from diverse backgrounds. As a result, you get many perspectives about how to approach any given problem. As you tackle new problems with new people, your way of working must be adaptable. Every project and colleague becomes a learning agent, and you gain more tools and experiences to approach the next engagement or project along the way. I tend to ask my coworkers, supervisors and team members a lot of questions so that I can incorporate their feedback in a meaningful way when approaching data science problems.
