Modular data modeling

Randy is a solutions architect on the Fishtown Analytics team — prior to joining Fishtown, he was a data consultant and on the data team at a Fortune 100 company.

Editor's note: this post has been heavily scrubbed for expletives.

Back when I started working in the data industry, as part of recruitment you’d get this Army-style pamphlet about all the cool stuff you were going to do.

Then you sit down at your desk.

Now, often in life, the unhealthy things are kind of fun: french fries and Coca-Cola, or fireworks.

Data modeling is not one of those things.

The before times: monolithic SQL scripting

Before dbt was released, the most reliable way that I had to transform data was SQL scripting.

This often looked like writing one 10,000-line SQL file that needed to run cleanly in one shot. Or, if you wanted to get fancy, you could split that file into a bunch of separate SQL files or stored procedures that were run in order by a Python script.

But if these break, what happens?

Does it keep running? Does it tell you where it broke? Can you start it from that spot? Oh no, you can’t restart it from that spot - so you manually delete everything that already ran and re-run from there.

It was a mess: manual dependency management, running hot in production.

The first time I tried building data models modularly with dbt, I wasn’t convinced it’d change anything for the better. I had committed to giving a talk on it - so there I was, working up a demo.

And then it broke. That’s where I actually started to get excited.

Because I’d built models modularly, one stacking on the next, dbt made it really clear to me why the pipeline of models was broken, and where to fix it.

So I could go in, update the broken model, run it again — and everything downstream of that broken model picked up the fix.

That was my “a ha” moment with modular modeling — it was so much faster to debug + fix issues, which is what eats up so much of our time as data people.
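To make “modular” concrete, here’s a rough sketch of what one of those stacked models looks like in dbt; the model and column names are made up:

```sql
-- models/orders.sql: a hypothetical model built on top of a staging model
select
    order_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}  -- ref() is how dbt knows this model depends on stg_orders
where status = 'completed'
```

Because the dependency lives in that ref() call, dbt can point at exactly which model broke, and something like `dbt run --select stg_orders+` rebuilds the fixed model plus everything downstream of it.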

Now, having done modular data modeling for the last 3 years with dbt, I’ve found it’s helped me do better work, faster, in four ways:

  1. Get straight to the point: Build the leanest possible proof-of-concept.
  2. Stand on the shoulders of giants: Leverage open source contributions of those who came before you.
  3. Optimize last: Many data products end up as “shelfware” — don’t over-optimize yours before it’s actually being used.
  4. Get the team involved: Non-data people can write better SQL than you think.

Get straight to the point

My career hit the data space right when cloud warehouses like Databricks, BigQuery, Snowflake et al. were first being adopted - I’ve had the luxury of never worrying too much about compute + storage cost.

So when I think about project planning for a data transformation project, I don’t really think in terms of style (Kimball, Data Vault, star schema, etc), although plenty of people find those techniques to be useful.

I just think about the left and right side of the pipeline:

  • On the left, we’ve got raw data sources, which we don’t have a ton of control over. We’ve got to do all the initial upfront cleaning required at the staging layer.
  • On the right, we have use cases — real humans who are trying to solve a problem.

My job is to find the shortest path between the staged data I have on the left, and the data products that would actually solve a problem for them.
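To sketch what that left-side cleanup looks like, here’s a hypothetical staging model that just renames, casts, and lightly cleans a raw source (the source and column names are invented):

```sql
-- models/staging/stg_customers.sql: hypothetical staging model over a raw table we don't control
select
    id as customer_id,
    lower(email) as email,
    cast(created_at as timestamp) as signed_up_at
from {{ source('app_db', 'customers') }}
```

Everything to the right of this builds on the cleaned columns, not the raw ones.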

Stand on the shoulders of giants

When I was doing data consulting, my client calls sounded like iPhone ads.

Need to standardize the way you build date spines? “There’s a macro for that.”

Need to model Snowplow events? “There’s a package for that.”

Need to calculate MRR? “There’s a playbook for that.”

If you want to be sprinting out data models quickly, you must leverage macros and packages built by those who have come before you. Eventually, you’ll learn how to build your own, and contribute back to that knowledge loop.
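For example, that date spine case is covered by a macro in the dbt-utils package. A rough sketch, assuming dbt-utils is installed via packages.yml:

```sql
-- models/dim_dates.sql: hypothetical date spine generated with a dbt_utils macro
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2025-01-01' as date)"
) }}
```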

I end up writing macros for the things that I can’t natively do in SQL, and the things I hate doing in SQL.
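A minimal sketch of what one of those homegrown macros might look like; the name and logic here are just an illustration:

```sql
-- macros/cents_to_dollars.sql: hypothetical macro for a conversion I'd rather not copy-paste
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

Then any model can call `{{ cents_to_dollars('amount_cents') }}` instead of repeating the arithmetic everywhere.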

Optimize last

Working modularly in an open source framework like dbt, I always ended up with free hands on the team faster than I’d expected, because we were doing less unnecessary work.

That frees up time for optimization after we’ve already delivered a proof of concept, because our stack hasn’t forced us to optimize up front.

So those free hands are going to dive into our DAG and look for bottlenecks in our data models. We’ll see a triple join that’s taking forever, and add surrogate keys to speed it up.

Those free hands are going to add incremental materializations to long-running models that don’t need a nuclear full-refresh on every run.
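A rough sketch of that incremental pattern, with made-up model and column names (and dbt-utils for the surrogate key):

```sql
-- models/fct_events.sql: hypothetical incremental model
{{ config(materialized='incremental', unique_key='event_id') }}

select
    {{ dbt_utils.generate_surrogate_key(['session_id', 'event_timestamp']) }} as event_id,
    session_id,
    event_type,
    event_timestamp
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only process rows that landed since the last run of this model
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```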

So it’s only after shipping a rough version of the data product that we need to get into the nitty-gritty of making it scalable and cost-effective over the long term.

We’re much more focused on actually delivering value to our team and fulfilling the use case, vs on doing the housekeeping work of making sure our data models are perfectly designed.

Get the team involved

As a consultant, I was always surprised at the high quality of SQL modeling that “non-technical” analysts were able to do - once the technical barriers (managing infrastructure, all the DDL + environments, promotion automation) came down.

What I found is that people are already finding ways to do this themselves, in “shadow IT” like spreadsheets or notebooks.

Giving them a shared Git repo of data models to contribute to, rather than leaving them to work in saved queries on their local machines, was like releasing a fish back into the water.

They already knew how to do it, if given the right environment.

So that’s what I see as our collective work going forward: how do we help more people roll up their sleeves, start building data models, and make sense of their own data?

How do we continue to make this modular data modeling learning curve less of a rope ladder, and more like an escalator?