Tracking dbt Test Success Rates Over Time

What is dbt Live?
The dbt Live: Expert Series consists of 60-minute interactive sessions where dbt Labs Solution Architects share practical advice on how to tackle common analytics engineering problems. In this session, Lee Bond-Kennedy demonstrated how to collect and store dbt test results.
Event agendas are shaped by audience requests, so whether you’re new to dbt, or just want to sharpen your skills, we’d love to see you there! Register to take part in the next live session.
Session Recap: Tracking dbt Tests Success
The APAC-friendly edition of dbt Live featured Lee Bond-Kennedy, Solutions Architect and Lucas O’Loughlin, Account Executive from dbt Labs.
Lee dived right into a topic that he’s helped a number of dbt prospects and customers with: How to track the success and failures of dbt tests over time to understand data quality trends.
You can watch the full replay here or read on for a summary of the session.
What are dbt tests?
dbt tests are assertions you make about your models and other resources in your dbt project. Defining tests is a great way to confirm that your code is working correctly, and helps prevent regressions when your code changes. Including tests can help alert your data team or business stakeholders when there is an issue with a particular dbt job that may affect the quality of your data. The faster you can be aware of these issues with your data, the more likely you are to keep your organization trusting your data.
dbt includes tests that already have the SQL developed, saving you and your team the time to code tests or you can write your own SQL-based tests. dbt tests your data during runtime and stores how many rows fail test conditions in the run logs and dbt manifest. It is important to note, dbt does not store data from your data warehouse for failed rows, just a count of the failed rows and a reference to the model where the error occurred.
Out of the box, dbt will run your tests each time a job is run with the dbt build
or dbt test
commands. When your run completes it outputs the test results, but does not aggregate the results over time. But, what if we want to create a history table containing success and failure results over time with details like:
- When the test was run
- Test name
- Test result
- Column name tested
- Model
- Source refs
Lee set the table with this question and then walked us through a couple of ways to aggregate these test results. Let’s see what he suggested.
How do we aggregate test results over time?
Before diving into the how, it’s good to have an understanding of what Jinja and dbt macros are and how they work. Macros in Jinja are pieces of code that can be reused multiple times – they are analogous to “functions” in other programming languages, and are extremely useful if you find yourself repeating code across multiple models. In this session, Lee showed how to use a macro to create the test results output and then insert the data into your data warehouse. The following approaches make it fast to get started with storing your test results.
One way to get started is using dbt_artifacts package that was developed by dbt Labs partner, Brooklyn Data. This package helps you process artifacts dbt produces to create fact and dimension tables in Snowflake to examine data quality. Along with using Snowflake as your data warehouse, you need to be able to upload dbt JSON artifacts to Snowflake. Instructions on how to install and configure this package can be found in the dbt Package hub.
What if you have a different data warehouse than Snowflake? No problem! Lee walked us through how to handle this using the on-run-end
config in the dbt project.yml
file and a macro he created and shared. The on-run-end
config can be used to run a macro after a dbt run or build is complete.
Lee jumped into dbt Cloud to show us how to add his macro to the project.yml
file in the IDE:
on-run-end:
- “{{ store_test_results(results) }}”
This single line of code will execute the store_test_results.sql macro with the variable results. The macro grabs the different node types and if the node is a test then it will store the results in a table in your data warehouse. If you run this macro in your production environment, it will combine the results to a single table, but if you run the macro in your development environment, it will only provide a point in time snapshot of your test results.
Lee built one of his models in his project that contained sources with tests. After the build process ended, he showed the test results from his run in his data warehouse:
You can check out Lee’s full walk through of the macro in the replay below.
dbt Live: Expert Series with Lee (June 17th APAC)
Participant Questions
Following Lee’s presentation and macro run walkthrough, he answered Community member questions received in advance of the session and coming up live from session attendees.
Here are some of the questions:
How can we measure test coverage?
Tim asked, “How can we measure test coverage in our dbt project?” Lee had two suggestions to help with measuring how many models had tests. First, Lee recommended the dbt_meta_testing package that enables you to require tests or documentation to be included for models within a directory in your project. is to establish rules for contributors around when tests are expected. Then he suggested the new dbt-project-evaluator package. In addition to helping illustrate what test coverage looks like in your project this package checks your dbt project structure against dbt best practices. And, identifies how many models are undocumented and any DAG issues.
Is there a way to set thresholds for storing test failure results?
Nadia asked, “Is there a way to only store test failure results if there have been a certain number of failures?” Yes, this can be accomplished with the use of the severity configuration for tests. Here is an example configuration that will check the error_if
condition first. If the error condition is met, the test returns an error. If it’s not met, dbt will then check the warn_if
condition. If the warn condition is met, the test warns; if it’s not met, the test passes.
version: 2
models:
- name: large_table
columns:
- name: slightly_unreliable_column
tests:
- unique:
config:
severity: error
error_if: ">1000"
warn_if: ">10"
How do we share common sources between teams?
The next question was submitted ahead of the session, which asked, “We work in an organization with common sets of data we want to share while we work on our models. How do we share sources between teams while keeping our models in separate projects?”
This is a question we hear more and more as companies expand their use of dbt to multiple business units that share common data sources. They have their own specific use cases and models that do not need to be shared with other parts of the organization. Lee talked about how he’s helped some organizations create a dbt project that houses sources that will be shared across the organization.
The source project is usually maintained by a central data team and then other teams import that project as a package for their project. Lee shared dbt docs that show how to manage packages to help with adopting this approach.
How do we migrate from our existing stored procedures to dbt?
A common pattern Lee and his colleagues see is the adoption of a cloud data platform by a customer and then adoption of dbt to refactor legacy stored procedures. This is done to leverage the software development workflow dbt encourages data teams to follow, especially working with modular code and version control. Lee recommended folks going through this transition check out this guide on refactoring legacy SQL and the free dbt Learn course - Refactoring SQL for Modularity.
How do we pass parameters into models during run time?
Lee said this is one of his favorite topics to discuss, when he read this question, “We are getting started with dbt, how do we pass in arguments or parameters into their models during run time?”
Within dbt there is this concept of variables. You can pass in a key value for a variable during a dbt run to be used in your models where the variable is referenced. Within a model, this is what a SELECT
statement would look like with a variable:
select * from events where event_type = '{{ var("event_type") }}'
And then when it comes time to run the model, you could use a run command like this:
$ dbt run --vars '{"event_type": "concert"}'
There are ways to set up variables that can be used within multiple environments to help keep your build process consistent. You can learn more about environment variables in this dbt blog post.
Wrapping Up
Thank you all for joining our dbt Live: Expert Series with Lee and Lucas!
Please join the dbt Community Slack and the #events-dbt-live-expert-series channel to see more question responses and to ask your questions for upcoming sessions.
Want to see more of these sessions? You’re in luck as we have more of them in store. Go register for future dbt Live: Expert Series sessions with more members of the dbt Labs Solution Architects team. And if you’re as excited as we are for Coalesce 2022, please go here to learn more and to register to attend.
Until next time, keep on sharing your questions and thoughts on this session in the dbt Community Slack!
Last modified on: Dec 6, 2023