As an analytics engineer on the professional services team at dbt Labs, I’ve helped a dozen or so data teams refactor from a legacy SQL transformation workflow to modular data modeling in dbt.
I’ve noticed a clear pattern: rare is the team that starts with a completely clean slate. More often than not, teams have accumulated years of SQL code powering business-critical, often complex data transformations.
These SQL transformations may run in a stored procedure, in an Airflow SQL operator, or some other scripting environment—but regardless of where it runs, that legacy SQL code must be migrated!
How can that migration be done with a) minimal headache and b) maximum quality?
Announcing a new (free!) on-demand course
Over the last 18 months, the dbt Labs professional services team has evolved a process to help teams migrate legacy SQL queries to dbt SQL.
And as of today, you can now learn + implement that same refactoring process, with a new on-demand course.
If you’re completely new to dbt, you may want to start with the dbt Fundamentals course before diving into this one.
All the links
Everyone learns a bit differently, so you can pick up the course materials in a few different formats:
- The video course, Refactoring SQL for Modularity, which includes a practice query + dataset to test your refactoring skills.
- A pull request in a GitHub repo, where you can follow commit-by-commit how a monolithic query (as from a stored procedure) breaks down modular dbt code (note: these commits are covered in-depth in the video course).
- A written tutorial on legacy SQL -> dbt SQL refactoring, in case you prefer reading or bookmarking the short version.
Our SQL refactoring process, in short
For a bit of background, the migration workflow that we follow internally is:
- Migrate your legacy SQL code, unchanged—make sure it runs and produces an identical output.
- Implement dbt sources (as opposed to raw table references).
- Choose a refactoring strategy.
- Implement clean CTEs (we ❤️ CTEs).
- Separate transformation logic into standardized data model layers (often the longest step).
- Audit the output (often using the audit_helper dbt package).
There is, of course, no one true way to refactor legacy SQL into dbt models!
I’d love to hear from you (on Twitter or in #learn-on-demand in dbt Community Slack) how your approach varies from the one we lay out in the course.
Last modified on: Apr 19, 2022