Creating a data quality framework for scale
It’s 8:30 am. You’re hardly awake, nursing your coffee. Ping.
“Hi! I was just wondering why the count of new subscriptions is showing 30 for yesterday in Looker, but the Facebook Ad platform is saying we had 72.”
And so it begins.
This DM is all too familiar to many data practitioners, and data quality is one of the highest concerns for data teams. But how do data teams create scalable practices to ensure they have high-quality data? How do they get to a world where the DM above goes from a once-a-week thing to a once-every-six-months thing?
The answer lies in building data quality frameworks that are supported by software and practices that enable practitioners to work in collaborative and code-based environments.
In this article, we’ll walk through how a scalable data quality framework supports the ability to define:
- what data tests you run on your data,
- where you test your data, and
- who is involved in the process.
Why you need a data quality framework
A data quality framework, beyond improving the actual data quality of your business, does two fundamental things for your data team:
- Develops trust between your team and the business
- Establishes a system of governance and accountability for your data quality
Build trust between data teams and business stakeholders
When your data meets the quality standards of your business, everyone wins. Stakeholders don’t have to second-guess the reliability of the data and can conduct analyses with confidence and speed. As a result, business users begin to understand that the data team can be trusted with the work they’re handling and therefore start seeing data teams as strong partners for driving their initiatives.
For data teams, when business stakeholders are using the data to its fullest, data practitioners feel deeply connected to the business, and the “mythical data team ROI” begins to feel a little more tangible. Being on a data team becomes rewarding when business users trust and use the data that analytics practitioners work so hard to provide.
Create an analytics foundation built on governance, scale, and accountability
By having an attainable data quality framework, data teams help ensure that data transformation work is held to the standards of the business because they’re following a process that aligns with those standards. With a proper framework, appropriate tests are defined, placed in the correct parts of the pipeline, and reviewed by the right people. For each new data source or business entity needed, data teams can follow a framework to guarantee they’re building tables of high-quality data and creating accountable parties for when these tests fail.
Notice how I said an attainable data quality framework: there’s little use in creating frameworks that are hard to follow and scale. If a data analyst needs to write hundreds of tests and review checklists beyond their capabilities, this data quality framework will not work in the long term. Instead, data quality frameworks should be feasible and contain elements of automation that make following the framework less intimidating and a more natural part of the data transformation process.
The framework we propose below is meant to be adjusted for the needs of your business and resources on your data team, but it supports the concepts of feasibility, automation, and peer review.
Step 1: Define which quality checks are important to you
Data quality will never be 100% perfect. (If you work for an organization that has perfect data quality, please send them our way 😉.) But, data teams can use data quality dimensions, paired with data tests, to work towards that perfection.
We generally define data quality around seven primary dimensions: usefulness, accuracy, validity, uniqueness, completeness, consistency, and freshness. These dimensions help identify which tests you should be running on your data and during your data transformation process.
It helps to create a checklist, or requirements list, for each new data source you capture or transformation you create to determine what data quality dimensions and tests are most important. An example of this checklist for an ecommerce company’s orders table could look like:
- Primary key (order_id) is unique and non-null
- Order date is non-null
- Order currency is in USD
- Data should be updated in my data warehouse every six hours
- …and so on
Some of these tests will be non-negotiable (e.g., primary keys must always be unique and non-null), and some will have more flexibility to them (e.g., freshness baselines), so use your understanding of the business to create the tests that are most applicable to your data.
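To make the checklist above concrete, here is a minimal sketch in plain Python. It is not dbt's API; in practice you would express these checks as tests in your transformation tool. The function name `check_orders`, the `currency` field, and the "error" versus "warn" severities are assumptions made for illustration, with severity standing in for the difference between non-negotiable and more flexible tests.

```python
from datetime import datetime, timedelta, timezone


def check_orders(rows, loaded_at, now, max_age=timedelta(hours=6)):
    """Run the example checklist and return (severity, message) tuples for failures."""
    failures = []

    # Non-negotiable: the primary key must be non-null and unique.
    ids = [row.get("order_id") for row in rows]
    if any(value is None for value in ids):
        failures.append(("error", "order_id contains nulls"))
    non_null_ids = [value for value in ids if value is not None]
    if len(non_null_ids) != len(set(non_null_ids)):
        failures.append(("error", "order_id contains duplicates"))

    # Order date must be present on every row.
    if any(row.get("order_date") is None for row in rows):
        failures.append(("error", "order_date contains nulls"))

    # Every order must be in USD (hypothetical column name: currency).
    if any(row.get("currency") != "USD" for row in rows):
        failures.append(("error", "currency is not USD"))

    # More flexible: freshness is reported as a warning, not an error.
    if now - loaded_at > max_age:
        failures.append(("warn", "data is stale"))

    return failures


# Example: a duplicated key, a missing order date, and a load seven hours old.
sample_rows = [
    {"order_id": 1, "order_date": "2023-11-01", "currency": "USD"},
    {"order_id": 1, "order_date": None, "currency": "USD"},
]
now = datetime(2023, 11, 2, 12, 0, tzinfo=timezone.utc)
failures = check_orders(sample_rows, loaded_at=now - timedelta(hours=7), now=now)
for severity, message in failures:
    print(f"{severity}: {message}")
```

Keeping severity alongside each failure lets a team block a run on the non-negotiable checks while only flagging the flexible ones.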
For each new data source you need to transform, we recommend you start off by creating these foundational lists, which can easily be represented, implemented, and automated with code-based tests and jobs in dbt Cloud.
Step 2: Determine where you test
It’s our firm belief that data quality should be tested and monitored throughout the ELT lifecycle: in its raw, transformed, and production forms.
| Stage | What’s happening here | Types of recommended testing |
|---|---|---|
| Source-level | Raw source data is being loaded into your data warehouse via some type of ETL tool or custom script. | Primary key uniqueness and non-nullness; Source freshness |
| Development | You’re transforming raw data into meaningful business entities via a data modeling process, and merging those transformations into a git repository. | Out-of-the-box dbt tests; dbt Cloud continuous integration (CI) capabilities; Data quality-specific packages such as dbt_expectations, audit_helper, and dbt_utils |
| Production | New data is added and columns are renamed as your data goes through its normal lifecycle. In production, you’ll want to ensure your data is fresh and tested to your standards, even when it’s updated or changed. | dbt Cloud jobs to automatically run models and tests; Dashboard status tiles to embed the freshness and quality of your data directly in your BI tool |
Read more here on the when and where of data quality testing.
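One way to picture "where you test" is a registry that maps each stage to the checks that run there. The sketch below is plain Python for illustration only, not how dbt organizes tests. The names `CHECKS_BY_STAGE` and `run_stage`, and the `amount` column with its non-negative rule, are hypothetical.

```python
def primary_key_ok(rows):
    """Primary key (order_id) is non-null and unique."""
    keys = [row.get("order_id") for row in rows]
    return None not in keys and len(keys) == len(set(keys))


def amounts_non_negative(rows):
    """Hypothetical business rule: every order amount is present and >= 0."""
    return all(
        row.get("amount") is not None and row["amount"] >= 0 for row in rows
    )


# Assumed mapping of pipeline stage to the checks that run at that stage.
# Raw source data only gets structural checks; business rules apply once
# the data has been transformed.
CHECKS_BY_STAGE = {
    "source": [primary_key_ok],
    "development": [primary_key_ok, amounts_non_negative],
    "production": [primary_key_ok, amounts_non_negative],
}


def run_stage(stage, rows):
    """Return the names of the checks that failed for the given stage."""
    return [
        check.__name__ for check in CHECKS_BY_STAGE[stage] if not check(rows)
    ]


rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": -5.0},
]
source_failures = run_stage("source", rows)
development_failures = run_stage("development", rows)
print(source_failures, development_failures)
```

The same rows pass at the source level but fail in development, because the business-rule check is only registered for the later stages.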
The importance of peer review in data testing
One of the powerful things you unlock by having code-based and version-controlled data transformations, tests, and documentation is regular peer review and collaboration.
In the analytics engineering workflow, data folks conduct their own data transformation work in a separate git branch and use pull requests (PRs) to merge those changes into a central repository. PRs should be peer-reviewed by data team members for code validity and formatting. With a central repository, data team members can easily review other team members’ transformation code and tests to double-check that the data work is correct and meets the standards and expectations of the business.
With a system like this, data work is no longer siloed to one data engineer or analyst—instead, a process of accountability, partnership, and understanding is created when multiple team members leave a thumbprint on data transformation and testing work. There can now be multiple data team members who are familiar with specific parts of the pipeline (or failing tests!), instead of one person managing all of that work.
Step 3: Repeat
Your data quality framework will be utilized anytime there’s a new data source (say your marketing team starts serving ads on a new ad platform) or there are business updates that change the logic in your data transformation (say a new way of calculating gross margin). For each of these additions or refinements to your data transformation work, you’ll follow your data quality framework:
- Determine which data quality tests are appropriate for this data
- Determine where you test those things along your data transformation workflow
- Refine and review with peers
Analytics work is often iterative, whether it’s updating a dashboard, transformation logic, or testing requirements, and your data quality framework is not excluded from this type of development cycle. As your team experiments with which practices, requirements, and tools work for enforcing and standardizing data quality, your framework may change over time. And that’s okay!
At the end of the day, high data quality is what keeps data teams employed and business teams empowered, so a data quality framework deserves this level of iteration and attention.
If you’re interested in learning more about how dbt can support your team with a data quality framework, check out the following resources:
- The 5 essential data quality checks in analytics
- How we cut our tests by 80% while increasing data quality: the power of aggregating test failures in dbt
- Power up your data quality with grouped checks
Last modified on: Nov 29, 2023