Data science

The work

Producing predictive models to power production applications (think user-facing recommendation engines) or operational workflows.

Owned by

The predictive modeling process is owned by a data scientist or machine learning engineer.

Analytics engineers support the process by preparing datasets, although data scientists may also join in the fun and write their own dbt data models.

Downstream dependencies

Production ML models will make customer-facing or internal business-user-facing recommendations.

These recommendations often have direct business consequences — either users don’t enjoy your product because it’s feeding them weak recommendations, or business users make poor decisions.

Prerequisites

Predictive models depend on all predictive features being arranged into a single row per observation ID.

These features may live in many disparate tables, so data transformation work is required to get them into an ML model-ready format.
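As a minimal sketch of that transformation, the example below joins two feature tables into one row per observation ID. The table names (`signups`, `orders`) and columns are hypothetical, and the SQL runs through Python's built-in `sqlite3` module only so the example is runnable on its own; in a dbt project, a SELECT of this shape would live in a model file instead.

```python
import sqlite3

# In-memory database standing in for the warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (observation_id INTEGER PRIMARY KEY, signup_channel TEXT);
    CREATE TABLE orders (observation_id INTEGER, order_total REAL);
    INSERT INTO signups VALUES (1, 'ads'), (2, 'organic'), (3, 'referral');
    INSERT INTO orders VALUES (1, 20.0), (1, 35.0), (2, 12.5);
""")

# Aggregate the many-rows-per-ID table so the result has exactly one
# row per observation ID, even for IDs with no orders at all.
FEATURES_SQL = """
    SELECT
        s.observation_id,
        s.signup_channel,
        COUNT(o.observation_id) AS order_count,
        COALESCE(SUM(o.order_total), 0.0) AS total_spend
    FROM signups AS s
    LEFT JOIN orders AS o ON o.observation_id = s.observation_id
    GROUP BY s.observation_id, s.signup_channel
    ORDER BY s.observation_id
"""

feature_rows = conn.execute(FEATURES_SQL).fetchall()
```

The LEFT JOIN keeps observation 3 in the output with zero orders, so every observation ID still gets exactly one feature row.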

The easiest way to produce a low-quality predictive model is to start with low-quality training data, so the entire process depends on high-quality analytics engineering work.

Note: Although feature engineering generally falls under the data science workflow, dbt is increasingly being used for pre-processing of features: normalizing, scaling or encoding values.

This light version of feature engineering can be done with the dbt_ml_preprocessing package.
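To make "scaling" and "encoding" concrete, here is a small plain-Python sketch of min-max scaling and one-hot encoding. It illustrates the general techniques only; it is not the dbt_ml_preprocessing package's API, and the function names are invented for this example.

```python
def min_max_scale(values):
    """Rescale numbers to the [0, 1] range; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot_encode(values):
    """Return (categories, rows); each row has a 1 in its own category's slot."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical feature columns.
scaled_spend = min_max_scale([10.0, 20.0, 30.0])
categories, encoded_channel = one_hot_encode(["ads", "organic", "ads"])
```

Here `scaled_spend` is `[0.0, 0.5, 1.0]`, and each channel value becomes a two-element indicator row ordered as `["ads", "organic"]`.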

The backstory: how this came to be

This backstory is excerpted from an NYC dbt community meetup talk by Sam Swift, VP of Data & AI @ Bowery Farming. Catch his full presentation here.

Typically, in the predictive modeling process, we get started by arranging all predictive features into a tabular, organized format.

But, if these features reside in different places, how do you bring them together to train a model, deploy it, then run it in production?

There have typically been 3 types of machine learning tooling setups:

Custom Builds

If you’re a big tech company, you may have an infrastructure team build a custom stack of tools that relate specifically to your ecosystem — but most of us don’t have this luxury.

Notebook Modeling

At the other end of the spectrum are the functional, low-overhead notebook-based solutions where everything runs linearly, doing everything you need it to — in a sense.

A growing system running on notebooks carries engineering risks, because these tools were designed for exploration and iteration rather than for long-running production systems. Not to mention, code that’s built inside a notebook is tough to reuse.

Machine Learning as a Service

Another option is to employ machine learning as a service to handle your data prediction needs — a convenient option indeed, but you’ll make significant tradeoffs for convenience.

You’ll have to move all of your data from a first-party world to a third-party world, and there’s a significant long-term risk when you let someone else make all of the engineering decisions.

So what’s the solution?

dbt for Predictive Modeling

dbt gives you the best aspects of the above options, such as feature reuse, limited overhead, simple upstream data flow, and reliable implementation - with just enough engineering rigor to avoid burdening the system.

Feature Reuse

For small and mid-size teams, feature reuse is one of the most significant benefits of using dbt. You can reuse official definitions, macros, predictors, or anything else useful throughout your machine learning work.

Graceful Retraining

In dbt, the machine learning workflow simply involves selecting a subset of observation IDs from a single table. Because dbt updates that table at regular intervals, it keeps growing over time, so retraining is fluid and efficient.
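The retraining idea can be sketched as re-running the same selection against a table that has grown since the last run. The `observations` table, its columns, and the `training_ids` helper are all hypothetical, and the two statements that simulate a scheduled refresh stand in for whatever dbt run would update the table in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# outcome stays NULL until the real-world result is known.
conn.executescript("""
    CREATE TABLE observations (
        observation_id INTEGER PRIMARY KEY,
        observed_at TEXT,
        feature_value REAL,
        outcome INTEGER
    );
    INSERT INTO observations VALUES
        (1, '2023-01-01', 0.2, 0),
        (2, '2023-01-08', 0.9, 1),
        (3, '2023-01-15', 0.4, NULL);
""")


def training_ids(connection, cutoff):
    """Select observation IDs with a known outcome, observed up to the cutoff."""
    cursor = connection.execute(
        "SELECT observation_id FROM observations "
        "WHERE outcome IS NOT NULL AND observed_at <= ? "
        "ORDER BY observation_id",
        (cutoff,),
    )
    return [row[0] for row in cursor]


first_run = training_ids(conn, "2023-01-31")

# Simulate a scheduled refresh: one outcome arrives, one new observation lands.
conn.execute("UPDATE observations SET outcome = 1 WHERE observation_id = 3")
conn.execute("INSERT INTO observations VALUES (4, '2023-01-22', 0.7, NULL)")

second_run = training_ids(conn, "2023-01-31")
```

The selection logic never changes between runs; only the table does. The second run picks up observation 3 once its outcome is known, while observation 4 remains a case to predict.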

Upstream Data Flow

Realizations and observations you’ve made about your data travel upstream back into dbt, so that improvements made upstream in your data warehouse flow through to everything downstream.

dbt Model Structure in the Machine Learning Workflow

The dbt predict schema consists of three dbt models that accompany every machine learning model you want to bring into production: an observations model, a predictions model, and an overall production model.

The Observations Model

The observations model has exactly one record per observation for which you could make a prediction, whether the outcome value is known or not. Records with known outcomes give you plenty of data to train on, while records whose outcomes are still undetermined are your cases to predict.

The Predictions Model

The predictions model collects all answers you’ve given to your prediction problem in production. It holds one record per prediction, and potentially more than one per observation ID, allowing you to compare the accuracy of these predictions over time quite easily.
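One way to picture that accuracy comparison is to group predictions by the date they were made and score each group against known outcomes. The sketch below uses mean absolute error as an assumed metric; the record layout and the `mean_absolute_error_by_date` helper are hypothetical.

```python
from collections import defaultdict

# Known outcomes keyed by observation ID (observation 3 is still unknown).
outcomes = {1: 1.0, 2: 0.0}

# One record per prediction, so an observation can appear more than once.
predictions = [
    {"observation_id": 1, "predicted_at": "2023-01-01", "predicted_value": 0.6},
    {"observation_id": 2, "predicted_at": "2023-01-01", "predicted_value": 0.4},
    {"observation_id": 1, "predicted_at": "2023-02-01", "predicted_value": 0.9},
    {"observation_id": 2, "predicted_at": "2023-02-01", "predicted_value": 0.1},
    {"observation_id": 3, "predicted_at": "2023-02-01", "predicted_value": 0.5},
]


def mean_absolute_error_by_date(prediction_rows, known_outcomes):
    """Average |prediction - outcome| per prediction date, skipping unknown outcomes."""
    errors = defaultdict(list)
    for row in prediction_rows:
        actual = known_outcomes.get(row["observation_id"])
        if actual is None:
            continue  # outcome not yet known, so this prediction cannot be scored
        errors[row["predicted_at"]].append(abs(row["predicted_value"] - actual))
    return {date: sum(errs) / len(errs) for date, errs in sorted(errors.items())}


mae_by_date = mean_absolute_error_by_date(predictions, outcomes)
```

In this toy data the error drops from roughly 0.4 for the January predictions to roughly 0.1 for the February ones, which is the kind of trend the predictions model makes easy to track.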

The Production (Master) Model

This third dbt model joins the observations and predictions while keeping the most recent data. This model is your production master — what everything downstream points back to.
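A rough sketch of that join is shown below, keeping only the most recent prediction for each observation. Table and column names are hypothetical, and the query runs through Python's `sqlite3` purely for illustration; the speaker's actual model may be structured differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE observations (observation_id INTEGER PRIMARY KEY, outcome INTEGER);
    CREATE TABLE predictions (
        prediction_id INTEGER PRIMARY KEY,
        observation_id INTEGER,
        predicted_at TEXT,
        predicted_value REAL
    );
    INSERT INTO observations VALUES (1, 1), (2, NULL), (3, NULL);
    INSERT INTO predictions VALUES
        (10, 1, '2023-01-01', 0.60),
        (11, 1, '2023-02-01', 0.85),
        (12, 2, '2023-02-01', 0.30);
""")

# Find each observation's latest prediction timestamp, then join back to
# fetch that prediction; observations with no prediction keep NULLs.
PRODUCTION_SQL = """
    SELECT o.observation_id, o.outcome, p.predicted_at, p.predicted_value
    FROM observations AS o
    LEFT JOIN (
        SELECT observation_id, MAX(predicted_at) AS latest_at
        FROM predictions
        GROUP BY observation_id
    ) AS latest ON latest.observation_id = o.observation_id
    LEFT JOIN predictions AS p
        ON p.observation_id = latest.observation_id
        AND p.predicted_at = latest.latest_at
    ORDER BY o.observation_id
"""

production_rows = conn.execute(PRODUCTION_SQL).fetchall()
```

Observation 1 has two predictions but appears once, carrying the later value of 0.85, while observation 3 has no prediction yet and still appears with empty prediction columns.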

Incorporating dbt as an essential component of your data science evolutionary cycle ensures your predictive modeling operates on regularly updated data, at a scale your organization can handle.

Without dbt, the process is still doable, but like a square peg in a round hole, it won’t quite be the right fit.