Data transformation

The work

Transforming raw source data into the format required for downstream use (whether that’s reporting, operational automation, ML models, or exploratory analysis).

Teams practicing analytics engineering do this work in a transparent, easy-to-debug manner.

This transformation work is generally done in a few chunks:

  • Data modeling itself (writing SQL queries)
  • Writing tests on those models
  • Writing documentation for model + field definitions
  • Version controlling all of the above with Git

We’ll cover each of these 4 chunks of work in the following pages, hearing from various members of the Fishtown Analytics team.
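
As a preview of the first chunk, here's a minimal sketch of what a single data model might look like in a dbt project. The source and column names (a raw shop.orders table) are hypothetical; the point is that each model is just a SELECT statement in its own file, which makes it straightforward to test, document, and version control alongside the rest of the project.

```sql
-- models/staging/stg_orders.sql
-- A hypothetical staging model: each dbt model is a single SELECT statement,
-- and source()/ref() declare its dependencies so dbt can build a dependency graph.

with raw_orders as (

    select * from {{ source('shop', 'orders') }}  -- assumed raw source table

)

select
    id          as order_id,
    customer_id,
    status,
    amount      as order_amount,
    created_at  as ordered_at
from raw_orders
where status is not null
```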

Owned by

The modeling process is owned by analytics engineers.

Other roles contribute in a couple of ways:

  • Consumers of datasets (analysts, data scientists, business users) contribute requirements for the output of transformations and will generally review that output during development.
  • Data engineers may own deployment of transformation code, in the event that data models must be orchestrated within a broader code-based data pipeline workflow (e.g., Airflow / Astronomer, Dagster, Prefect).

Prerequisites

Before embarking on a data modeling project, it’s helpful to have:

  1. Raw data flowing into the warehouse, or at least a sample dataset to work with.
  2. A solid sense of the end output that’s required.

Having these two in place will give you a feel for the shape of transformation that's required:

What tables must be joined together, and how? What metrics must be calculated, and where?
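
For instance, if the required end output were a per-customer revenue summary (a hypothetical example), the shape of the transformation might be a join plus an aggregation over assumed stg_customers and stg_orders models:

```sql
-- Hypothetical final model: joins customers to orders and calculates metrics,
-- answering "what gets joined, and where do metrics get computed?"

with customers as (
    select * from {{ ref('stg_customers') }}  -- assumed upstream model
),

orders as (
    select * from {{ ref('stg_orders') }}     -- assumed upstream model
)

select
    customers.customer_id,
    customers.customer_name,
    count(orders.order_id)    as number_of_orders,    -- metric: order count
    sum(orders.order_amount)  as lifetime_revenue     -- metric: total revenue
from customers
left join orders
    on orders.customer_id = customers.customer_id
group by 1, 2
```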

The backstory: how this came to be

Let’s take a step back to 2016, when dbt was born.

Raw source data was flowing into MPP (massively parallel processing) databases with (relative) ease. At that time, teams were visualizing it in reporting tools like Looker, Tableau, and many others. But the transformation layer in between was being duct-taped together.

Teams were writing complex stored procedures with often-confusing dependencies, or sometimes even manually refreshing queries to calculate metrics for visualizations.

Refreshing transformed metrics often required human intervention:

When a query failed, a human would re-run each individual query from that point forward in the dependency graph, inferring those dependencies by intuition or experience.

There wasn't much of a concept of an automatically generated dependency graph.

Turns out that many teams were running into this same challenge: plentiful raw data, a useful tool to visualize it, but nothing in between to get datasets ready for use.

dbt emerged to empower data teams to solve this problem.