Data quality testing: What it is, where and why you should have it

on Apr 09, 2025

Data teams hold a lot of responsibility—is the data they provide to end users reliable? Are their questions being answered with the data? Did I double-check that there weren’t any duplicate rows in that table?

The stress of data quality is very real; it’s what keeps data analysts up at night. When business users question the data quality, they don’t trust the data (or the data team) and start conducting ad hoc analyses and creating their own metrics—forming the wild west of data analysis and reporting.

Instead of waiting for data quality issues to seep into your business intelligence (BI) tool and eventually be caught by a stakeholder, the practice of analytics engineering believes data quality testing should be proactive—integrated throughout the data transformation process. Data teams should test the data in its raw format once it hits your data warehouse, rigorously tested as its transformation and business logic is added to it, before it’s moved into production (and exposed to end users), and throughout its lifecycle as new data is added and updated.

What is data quality testing?

Data quality testing is the practice of making assertions about your data, and then testing whether these assertions are valid. You can use this concept to test both the quality of your raw source data and to validate that the code in your data transformations is working as intended.

Why is data quality testing important?

Having your data meet the quality standards for your organization is the foundation for any analytics team to develop trust with their end business users. With high data quality, your data team ensures:

Business users can make the best data-informed decisions
Data teams can trust business users for reporting
Big data audits become less scary

When data quality is high—and trust in the data team is high—there’s this almost indescribable flow state that exists for data teams.

But how are you regularly guaranteeing to your end users that the data you're exposing in a BI tool is accurate and reliable? Where is there accountability built into your data pipeline to hold the data team to these standards?

The answer: code-based and automated data quality tests.

What data quality tests should you be running?

The general data quality dimensions of accuracy, validity, completeness, freshness, and consistency will guide the data quality tests you run. Data tests in the analytics world typically cover these dimensions across two concepts:

Correctness: Are key columns unique and non-null? Do column values match your expectations? Are columns in the correct format? Are there concerning duplicate rows?
Freshness: How recently was raw data updated? If it wasn’t, how do I alert someone about this? Is the data available to me on the cadence I need?

These two heuristics and the tests that form them are the foundation for how modern data teams are approaching data quality testing. Below, we’ll unpack where exactly to implement these tests of correctness and freshness in your data pipelines.

Where and when should you test your data?

There are regular points in time at which you’ll want to test your data for correctness and freshness. During development, you’ll be testing for data quality as well as testing data transformation logic.

When you push new analytics code to production you’ll want to run your tests to make sure that your new code isn’t breaking anything in the existing code base. And then once your code is in production, you’ll want to have automated tests running at regular intervals to make sure that everything in your ETL pipeline and transformation logic is still working as expected.

1. During development

When you’re creating new data transformations for new data sources and business entities, you’ll want to integrate testing through this development cycle. In general, you’ll likely find yourself testing both your raw source data and new transformations.

Raw source data

With your raw data, you’re really checking the original quality of your data and asking yourself, “How much work am I really going to need to do here?” 😉 Here, you’ll likely find yourself creating tests checking for:

Primary key uniqueness and non-nullness
Column values meeting your basic assumptions
Duplicate rows

Data transformations

When you transform data, you’re cleaning, aggregating, joining, and implementing business logic to raw data. As you weave in more data transformations, fan out joins, and create new metrics and dimensions with SQL and Python, the room for error grows. In these newly formed tables, you’re going to want to test that:

Primary keys are unique and non-null
Row counts are correct
Duplicate rows are not introduced in joins
Relationships between upstream and downstream dependencies line up with your expectations

This is not an exhaustive list of what you might test after you transform your data—this shouldn’t be an exhaustive list. How you test data, and how your team defines high-quality data, is dependent on your raw data and the needs of your business.

2. During pull requests

Before merging data transformation changes into your analytics code base, test your data.

By having an additional guardrail when you use a git-based data transformation tool, you welcome in other data team members into your work; other peers can review your code changes, help you debug errors, and ultimately build out an analytics foundation that is created with transparency and of higher quality.

However you manage the process, it’s vital to ensure that new data models and data transformation code is never entering your code base without being tested against your standards and peer-reviewed.

3. In production

Once your data transformations and tests are built and merged into your main production branch, you want to run them on a regularly scheduled cadence. Why? After your data model is created, things will happen to it over time:

An engineer will push a new feature into production that changes your source data
A business user will add a new field in the CRM, breaking the business logic of your data transformation
Your ETL pipeline will experience an issue, pushing duplicate or missing data into your warehouse
Etc, etc.

Automated tests ensure that when something changes about your business or the data, you—not your end user—are the first to know.

Popular tools for running data tests in production include Airflow, automation servers like GitLab CI/CD or CodeBuild, or scheduling cron jobs (read more about the pro’s and con’s of these different options here).

Building a foundation for data quality

To sum it all up:

Data quality testing should be throughout your data pipelines, integrated in a way that’s seamless and accessible.
You should test your data in its raw format, transformed shape, and throughout its lifecycle.
Data practitioners should proactively test data to build trust between data teams and their end users.

To level-set, these things don’t happen overnight; building out a high-quality data stack, regularly tested, and organized takes time, effort, and tooling.

However, there is new software, such as dbt, that’s making data testing a standard in the analytics workflow, providing built-in testing capabilities, and encouraging data teams to embrace software engineering best practices.

Try dbt Cloud for free and start building trust in your data with built-in testing today.

Published on: Mar 24, 2023

Don’t just read about data — watch it live at Coalesce Online

Register for FREE online access to keynotes and curated sessions from the premier event for data teams rewriting the future of data.

Secure your spot now

Set your organization up for success. Read the business case guide to accelerate time to value with dbt.

Read now

Latest posts

Community6 min

Here’s why you (and your team) should attend Coalesce 2025

Daniel Poppy

on Aug 05, 2025

Learn16 min

From stored procedures to dbt: A modern migration playbook

Kathryn Chubb

on Aug 04, 2025

Product4 min

The dbt Fusion engine public beta is now available on Redshift

Azzam Aijazi

on Aug 04, 2025

The dbt Community

Join the largest community shaping data

The dbt Community is your gateway to best practices, innovation, and direct collaboration with thousands of data leaders and AI practitioners worldwide. Ask questions, share insights, and build better with the experts.

Join the Community Explore the community

100,000+active members

50k+teams using dbt weekly

50+Community meetups