Building a Resilient dbt Project
The dbt Live: Expert Series features Solution Architects from dbt Labs stepping through technical demos and taking audience questions on how to use dbt. Catch up with recaps of past sessions, and sign up to join future events live!
Aaron Steichen dedicated his session to an overview of the testing and monitoring features you can use to strengthen your data pipeline in dbt.
In former roles as a data engineer, Aaron learned the hard way that there’s no such thing as an error-free data pipeline.
It’s hard to stand up a pipeline, and even harder to make it resilient. Every time a stakeholder gets burned by bad data, they lose trust in the data product. With repeated incidents, they may even revert to legacy solutions, and then your analytics foundation becomes little more than a very expensive proof of concept.
To help others avoid a similar fate, Aaron proposes a “belt-and-suspenders” approach – using multiple features to routinely test, monitor, and provide transparency on your dbt project to stakeholders.
- Add exposures: Define the downstream dashboards and products that rely on your dbt transformations
- Add dashboard tiles: Signal the quality of data behind key dashboards
- Set up alerts: Get notified when a job run succeeds, fails, or is cancelled
- Use CI/CD: Enable safe and efficient changes to your analytics code
- Use dbt packages: Automate routine tests to make sure code (and data) are quality-checked before they reach production
- Use the Metadata API: Track runtime performance & reliability over time
Watch the full session replay, or read on to learn more about each method.
Define exposures in your DAG
What are exposures?
Exposures are defined in a YAML file in your dbt project. They represent the downstream apps, data products, or dashboards that rely on the data transformed in that project.
Why use exposures?
By declaring exposures, you can:
- Run dbt commands that reference those exposures using dbt’s node selection syntax. This can come in handy if you ever need to update a specific dashboard off-schedule, for example to make sure the freshest data is populated before a board update.
- Include those exposures in your dbt documentation & lineage graph. This can be useful for:
  - Tracing downstream dependencies: for instance, identifying all the dashboards that would be affected if you change or deprecate a specific resource.
  - Tracing upstream resources: for instance, helping to answer “where does this number come from?” when debugging unexpected data.
- Enable the use of dashboard status tiles (more on this below).
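To make the downstream-tracing idea concrete, here is a minimal Python sketch that walks a lineage graph to find every exposure affected by a change to one model. It assumes the `child_map` section of dbt's `manifest.json` artifact (each node's unique ID mapped to the IDs of its direct children) and that exposure IDs start with `exposure.`. The project and node names in the sample are hypothetical.

```python
from collections import deque


def downstream_exposures(manifest: dict, start_id: str) -> list[str]:
    """Return the sorted unique IDs of all exposures downstream of start_id."""
    child_map = manifest.get("child_map", {})
    seen = {start_id}
    queue = deque([start_id])
    found = set()
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            if child in seen:
                continue  # already visited via another path
            seen.add(child)
            if child.startswith("exposure."):
                found.add(child)
            queue.append(child)
    return sorted(found)


# Hypothetical fragment shaped like the "child_map" key of manifest.json.
SAMPLE_MANIFEST = {
    "child_map": {
        "model.shop.stg_orders": ["model.shop.fct_orders"],
        "model.shop.fct_orders": [
            "exposure.shop.weekly_kpis",
            "model.shop.orders_summary",
        ],
        "model.shop.orders_summary": ["exposure.shop.board_report"],
        "model.shop.dim_customers": [],
    }
}

affected = downstream_exposures(SAMPLE_MANIFEST, "model.shop.stg_orders")
print(affected)
```

In a real project you would load the dictionary from the manifest file with `json.load` rather than using the inline sample, and it is worth checking the artifact's structure against the dbt version you run.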
Add dashboard status tiles
What are dashboard status tiles?
Dashboard status tiles are iframes that can be embedded on a dashboard.
Why use dashboard status tiles?
The tiles can give dashboard viewers a quick, visual way to gauge the freshness and reliability of underlying data, with links to more detail if needed.
Set up alerts and notifications
What types of alerts can I create?
You can configure dbt Cloud to send alerts via email or a chosen Slack channel when a job run succeeds, fails, or is cancelled.
Use CI/CD
What is CI/CD?
CI (continuous integration) is the practice of ensuring new code will integrate with the existing code base before it is pushed to production.
CD can stand for either of the below, depending on your team’s preferred way of working:
- Continuous Delivery: approved code is batched and pushed to production on a chosen schedule.
- Continuous Deployment: approved code is pushed to production as soon as it passes CI checks.
Why use CI/CD?
Using CI/CD helps ensure that you can make changes to your data pipeline:
- reliably, with each change tested before it reaches production
- efficiently, with automated test-and-merge processes to minimize manual overhead
How do I use CI/CD in dbt Cloud?
- Enabling CI
- Customizing CI/CD pipelines
- See this in action at 8:50 of Aaron’s demo, or watch a more detailed demo on enabling CI in dbt Cloud.
Use dbt packages for testing
What are packages?
dbt packages are libraries of code that help automate or simplify common routines. The package hub is a directory of packages maintained by members of the dbt community.
How do I set up packages in my dbt project?
- How to add a package
- Watch an example of how to set this up at ~24:00 in Sean McIntyre’s dbt live session.
- Hear more about the test packages below at ~13:18 of Aaron’s demo.
Recommended packages to help automate data quality checks
- This package checks test and documentation coverage in a given part of your dbt project.
- Put this on critical parts of your project to ensure models in those directories have tests and/or documentation before they can reach production.
- This package audits your dbt project and checks for violations of best practices for modeling, testing, documentation, and file structure/naming.
- This package provides a host of data quality tests, including checks for:
  - Table shape
  - Missing values, unique values, types
  - Expected value ranges and distributional functions (e.g. “fail if value is more than 3 standard deviations away from average”)
- This package writes job artifacts, like model run times and statuses, to a Snowflake table so you can monitor performance over time. This can help you identify brittle parts of your pipeline.
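The “more than 3 standard deviations away from average” rule mentioned above is easy to illustrate in a few lines of Python. This is only a sketch of the arithmetic behind such a test, not how any particular dbt package implements it. The function name and sample data are hypothetical, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def values_outside_n_stdevs(values, n_stdevs=3.0):
    """Return the values lying more than n_stdevs standard deviations from the mean."""
    if not values:
        return []
    average = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    if spread == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [v for v in values if abs(v - average) > n_stdevs * spread]


# Hypothetical daily order counts with one suspicious spike.
daily_orders = [100] * 20 + [1000]
outliers = values_outside_n_stdevs(daily_orders)
if outliers:
    print(f"FAIL: {len(outliers)} value(s) beyond 3 standard deviations: {outliers}")
```

A check like this is most useful on metrics that are normally stable, such as daily row counts, where a sudden spike or drop usually signals an upstream problem rather than real change.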
Use the dbt Cloud Metadata API
What is the Metadata API?
With each job run, dbt Cloud generates metadata on the accuracy, recency, configuration, and structure of views and tables in your warehouse.
The Metadata API is a GraphQL service that supports queries over this metadata. It is available to dbt Cloud customers on Team and Enterprise plans.
Why use the Metadata API?
The API offers a configurable way to track runtime performance, reliability, and a host of other details about your dbt project over time, in near real time.
How do I use the API?
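As a rough starting point, the Python sketch below builds a GraphQL request for the run statistics of one job and ranks the models in a response by execution time. Everything specific here is an assumption to verify against the API schema for your account: the endpoint URL, the `models(jobId: ...)` query, the `uniqueId`, `status`, and `executionTime` fields, and bearer-token authentication. The sample response and model names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint for the dbt Cloud Metadata API; confirm it for your account.
METADATA_API_URL = "https://metadata.cloud.getdbt.com/graphql"


def build_models_request(job_id: int, token: str) -> urllib.request.Request:
    """Build (but do not send) a GraphQL POST asking for one job's model stats."""
    # Query shape and field names are assumptions; check them against the schema.
    query = "{ models(jobId: %d) { uniqueId status executionTime } }" % int(job_id)
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        METADATA_API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def slowest_models(response: dict, top_n: int = 3) -> list[tuple[str, float]]:
    """Rank the models in a response payload by execution time, slowest first."""
    models = response.get("data", {}).get("models") or []
    timed = [
        (m["uniqueId"], m["executionTime"])
        for m in models
        if m.get("executionTime") is not None  # skipped models have no timing
    ]
    return sorted(timed, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical response payload, shaped like the query above.
SAMPLE_RESPONSE = {
    "data": {
        "models": [
            {"uniqueId": "model.shop.stg_orders", "status": "success", "executionTime": 1.2},
            {"uniqueId": "model.shop.fct_orders", "status": "success", "executionTime": 42.5},
            {"uniqueId": "model.shop.dim_customers", "status": "success", "executionTime": 7.1},
            {"uniqueId": "model.shop.old_report", "status": "skipped", "executionTime": None},
        ]
    }
}

print(slowest_models(SAMPLE_RESPONSE, top_n=2))
```

To send the request, pass the object returned by `build_models_request` to `urllib.request.urlopen` and parse the body with `json.load`. Storing each run's results lets you chart execution times and spot models that are getting slower or failing more often.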
After the demo, Aaron and fellow Senior Solutions Architect Randy Pitcher answered a range of attendee questions.
To hear all of the Q&A, replay the video (starting ~22:20) and visit the #events-dbt-live-expert-series Slack channel to see topics raised in the chat.
A sample of questions:
- (31:00) - Is there any way to log rows processed over time at the job/model level? (Resource)
- (32:00) - Can you share examples of how to use the dbt API? (Resource)
- (32:50) - Best practices for setting up my dbt infrastructure? (Resource)
- (40:10) - We declare variables in multiple places in dbt. What is the order of precedence dbt will follow to interpret these? (Resource)
- (50:00) - How do I implement blue/green deployment with dbt? (Resource)
Last modified on: Nov 22, 2023