It’s Monday morning. You get the dreaded Slack message, “This dashboard seems to be broken.” Well, there go all your plans for the day. You know you’ll be spending the next few hours trying to figure out what went wrong.I know. I’ve been there.
Often, finding the issue is a process of moving backwards: first checking the dashboard code, then the data the dashboard selects from, then the transformations that produce those data, and on and on, until you find the culprit.
The dream for every data analyst is to be able to catch these issues before that dreaded Slack message. I’m here to tell you–it’s possible.
By thinking about data testing holistically, you can add tests where they matter most and separate signal from the noise. Then when something (inevitably) goes wrong, either you’ll know before your stakeholders find out, or, you’ll be able to quickly diagnose to find the underlying issue. Here is the testing framework I use to catch data quality issues before they impact my stakeholders.
Farm-to-table data testing
I like to think of my data tests using the mental model of a farm-to-table supply chain. This helps me focus on four critical questions:
- 🌱 Where is the raw data grown? Which applications are my farm? What can go wrong on the farm?
- 🚚 Where is data moved? How is it transported along the supply chain? Are there any warehouses and distribution centers where we collect data (nodes) and any routes by which we move data (edges)? What can go wrong as data moves?
- 🍳 Where is raw data transformed into reports that our customers consume? How are you preparing data for your end users? How can these steps spoil the data? (I am trying hard here to avoid the metaphor of cooking the data 😃)
- 🍴Where is the data consumed? Which BI tool is your table and how can it fail?
Thinking of your data supply chain is not exactly a novel concept (see Katie Lazell-Fairman’s great piece on data supply chains for Machine Learning models) but keeping specific farm-truck-kitchen-table metaphors in mind as you go through your own data warehouse and pipeline can serve as a fun checklist along the way.
🌱 Farm - Where is data grown?
If the data comes from your own application database, talk to your engineers – they can help you understand how CRUD is implemented in the web application and what referential integrities are enforced in your application database. If your application uses a third-party platform (~ integration to manage your subscriptions) or the data comes from humans (~ salespeople recording funnel steps in your CRM), it is helpful to think about the relationships between your business processes, the data captured in the UI, and the data output by the API. A test that enforces that each subscriber ID from your payment portal has a single corresponding ID in the CRM and a single account ID in production can help catch accidental duplicates and diagnose how they are created. If you use tracking on a website, think about the relationships between your jobs to be done, user flows, and your tracking plan. In this case, talking to UX or product before conferring with engineering can be really helpful in understanding what event data should look like.
➡Questions to ask about your data:
- How do tables relate to each other? Test with relationship tests in dbt
- Do my events have the correct payload? Test with Segment protocols,
🚚 Truck - Where is data moved?
If you set up your project using dbt best practices (bookmark this resource here), you only have two edges (~ moves) in the Extract and Load phases of your ELT pipeline:
- One from the source application to your raw database, loaded into the data warehouse by a third party tool.
- Another from your sources folder to your staging folder. Ideally, here you are only doing minor cleaning, such as renaming columns or recasting a data type for overall consistency, without changing the granularity of the underlying data.
At this stage, there are two critical questions to ask: Did the extraction succeed? & Was all the data extracted? For the first question, you can set up alerting on your ETL tool so you know exactly when an extraction fails. For the second question though, the answer is more complicated. Dbt allows you to test source freshness, which is time based, or you can write custom statistical tests based on the distribution of your data (see some ideas in this post or the Great Expectations documentation here). Questions to ask about your data:
- Is my data arriving regularly? Test with freshness tests in dbt
- Did the extraction succeed? Use alerting in your ETL tool (Stitch, Alooma, Fivetran, etc.)
- Did all my data arrive? Test with distributional tests in Great
🍳 Kitchen - Where is data transformed?
Just as you do exploratory data analysis (EDA) on the source table, you might want to do “pretend” EDA on the table you are building before you write a line of code. Some questions to consider include:
- What do I expect this table to look like?
- Will it have categorical or numeric data? Can there be null values?
- Could it be summarized? And if so, how?
- How do the source and target tables relate? What should remain unchanged between the two?
- Is this going to be a fact or a dimension table?
- If it is a fact table, what type and what is the grain? What do we expect a single table row to look like?
Then, turn those answers into tests. Here’s some more concrete examples:
- If you’re building a transaction fact table test that the ID in the transformed data corresponds to an ID in your source data
- If you’re building a factless fact table, use an accepted values test to ensure a categorical variable doesn’t have any unexpected values
- If you’re building a periodic snapshot fact table, create a surrogate key based on the source columns that make each row of data unique
- If you’re building a dimension table, test that each ID is unique.
Pro-tip: when writing complex transformations, pick out some problematic records in your source data to work with — for example: if you’re modeling recurring revenue, find some customers with a strange history of subscription upgrades and downgrades. (Try to get a representative selection that covers all angles and use synthetic data if you can’t find good samples.) Then, run your transformations on those records only, and check if you get the output you expect using these steps. Questions to ask about your data:
- How do the source and target tables relate? Test with relationship tests in dbt
- What is the grain? Test with unique tests in dbt, surrogate key util in dbt
- What kind of table is it? Test with accepted values and unique tests in dbt
🍴Table - Where is data served?
In this last phase, a great deal depends on the tools you are using to deliver your data. Looker for instance, has an option to add testing to your models directly, see the documentation here and here. Another open-source tool that helps you test LookML is Spectacles. I try to do all my data tests with dbt directly on the tables that my SQL runner is reading. That being said, that last step is still a node in your data supply chain, so it is worth asking when the reports need to be distributed and when their underlying data sources are refreshed. Questions to ask about your data:
- Is my dashboard stale? Test with Looker built-in alerting
- Is my model stale? Test with Looker model testing, Spectacles
- Is data refreshing in a timely fashion? Test with recency test in
How-to get started
Using the farm-to-table mental model will help you think about the health of your unique data pipeline holistically and design a system of tests to catch most data quality issues before they break your dashboards.
However, if all this just sounds too overwhelming, here’s where to begin:
- If you’ve already got an ETL pipeline in place, start at the end. Adding tests later in your pipeline will catch upstream failures — these failures may not be easy to debug since they can be a few steps removed from the actual issue, but they don’t require much overhead to add. When these tests fail and uncover data quality issues, start being more intentional about where you test, by moving tests up the pipeline.
- If you’re building a new ETL pipeline, spend effort up front building tests early in the pipeline. It’s easiest to add these when you’re first working with your data since it’s fresh in your mind. Yes, it’s extra work upfront, but you’ll be able to work faster and break fewer things in the future!
Ready to implement? Here are some great resources:
- dbt documentation on Testing (remember, tests are “gifts you send to your future self” 🎁) and on Custom schema tests
- The phenomenal dbt repo on Macros and Data audit helpers
- Josh Temple’s post, Automated testing in the modern data warehouse
Last modified on: Apr 25, 2022