How Airflow + dbt Work Together
Session Recap: Using dbt with Airflow
The dbt Live: Expert Series features Solution Architects from dbt Labs, taking live audience questions and covering topics like how to design a deployment workflow, how to refactor stored procedures for dbt, or how to split dev and prod environments across databases. Sign up to join future sessions!
In the latest session, Sung Won Chung, Senior Solutions Architect at dbt Labs, addressed a question he hears often on the job:
How does dbt differ from Airflow, and how (or why) might some teams use both?
- Watch the full session replay.
- Read Sung’s full blog post, from which the session and recap below are adapted.
- Guides to setting up dbt Core or Cloud + Airflow
What Do Airflow and dbt Solve?
Airflow and dbt share the same high-level purpose: to help teams deliver reliable data to the people they work with, using a common interface to collaborate on that work.
But the two tools handle different parts of that workflow:
- Airflow helps orchestrate jobs that extract data, load it into a warehouse, and handle machine-learning processes.
- dbt homes in on a subset of those jobs, enabling team members who use SQL to transform data that has already landed in the warehouse.
With a combination of dbt and Airflow, each member of a data team can focus on what they do best, with clarity across analysts and engineers on who needs to dig in (and where to start) when data pipeline issues come up.
The Right Path for Your Team
Consider the skills and resources on your team, versus what is needed to support each path:
- Using Airflow alone
- Using dbt Core/Cloud alone
- Using dbt Core/Cloud + Airflow
For those who are ready to move on to configuration, below are guides to each approach:
Airflow + dbt Cloud
- Install the dbt Cloud Provider, which enables you to orchestrate and monitor dbt jobs in Airflow without needing to configure an API
- Step-by-step tutorial with video
- Code examples for a quick start in your local environment
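As a rough illustration of the dbt Cloud route, the sketch below triggers a single dbt Cloud job from an Airflow DAG with the provider's DbtCloudRunJobOperator. It assumes Airflow 2.4 or later with the apache-airflow-providers-dbt-cloud package installed and an Airflow connection already configured under the provider's default id. The job id, DAG id, and the run_job_kwargs and build_dag helpers are hypothetical names made up for this example, not part of the session.

```python
from datetime import datetime

DBT_CLOUD_CONN_ID = "dbt_cloud_default"  # the provider's default connection id
JOB_ID = 12345  # hypothetical dbt Cloud job id; replace with your own


def run_job_kwargs(job_id, conn_id=DBT_CLOUD_CONN_ID, check_interval=60, timeout=3600):
    """Collect the operator arguments for one dbt Cloud job run (hypothetical helper)."""
    return {
        "task_id": f"trigger_dbt_cloud_job_{job_id}",
        "dbt_cloud_conn_id": conn_id,
        "job_id": job_id,
        "check_interval": check_interval,  # seconds between status polls
        "timeout": timeout,                # seconds to wait before giving up
        "wait_for_termination": True,      # block the task until the run finishes
    }


def build_dag():
    """Build the DAG; Airflow is imported here so the helper above works without it."""
    from airflow import DAG
    from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

    with DAG(
        dag_id="dbt_cloud_example",  # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule=None,               # trigger manually; assumes Airflow 2.4+
        catchup=False,
    ) as dag:
        DbtCloudRunJobOperator(**run_job_kwargs(JOB_ID))
    return dag
```

In a real DAG file, Airflow only discovers DAG objects at module level, so you would likely end the file with something like dag = build_dag().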
Airflow + dbt Core
- Considerations for using the dbt CLI + BashOperator, or using the KubernetesPodOperator for each dbt job
- Shopify Engineering recently shared lessons learned from running Apache Airflow at scale, to much discussion from others in the data-engineering community.
- GitLab has open-sourced its data engineering infrastructure, including comprehensive docs and examples of how the team uses dbt Core with Airflow.
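For the dbt CLI + BashOperator option mentioned above, a minimal sketch might look like the following. It assumes Airflow 2.4 or later, with dbt Core installed on the Airflow worker and a dbt project checked out at the path shown. The path, the DAG and task ids, and the dbt_command and build_dag helpers are all hypothetical and only there to illustrate the pattern.

```python
import shlex
from datetime import datetime

PROJECT_DIR = "/opt/dbt/project"  # hypothetical location of the dbt project


def dbt_command(subcommand, project_dir, select=None):
    """Build a shell-quoted dbt CLI invocation (hypothetical helper)."""
    args = ["dbt", subcommand, "--project-dir", project_dir]
    if select:
        args += ["--select", select]  # limit the run to the selected models
    return shlex.join(args)


def build_dag():
    """Build the DAG; Airflow is imported here so the helper above works without it."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator  # Airflow 2.x import path

    with DAG(
        dag_id="dbt_core_example",  # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule=None,              # trigger manually; assumes Airflow 2.4+
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command=dbt_command("run", PROJECT_DIR),
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command=dbt_command("test", PROJECT_DIR),
        )
        dbt_run >> dbt_test  # only test the models after they have been built
    return dag
```

As with any DAG file, the object has to exist at module level to be picked up, for example via dag = build_dag(). A KubernetesPodOperator setup would follow the same shape, with each dbt command running in its own pod rather than on the worker.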
After the demo, Sung and fellow Solutions Architect Matt Cutini answered a range of attendee questions.
A sample of questions:
- (22:00) - How can a complex dbt DAG be displayed in Airflow? (Resource)
- (24:00) - How can we automate full refreshes for schema changes in incremental models? (Resource)
- (27:30) - Where can I find examples and best practices for using dbt? (Resource)
- (28:15) - How can I do deferred runs in dbt Cloud – rerun only the things that have changed since a prior run? (Resource)
- (32:15) - Can I use dbt Core with Airflow, and get all of the same functionality as dbt Cloud?
- (45:00) - How do I set up data status tiles in my BI dashboards? (Resource)
- (47:00) - What is the right use case for the incremental strategy in dbt?
Last modified on: Nov 22, 2023