Data quality best practices: Six essential principles for analytics and AI

Poor-quality data can torpedo your analytics and AI initiatives. One survey by Fivetran found that bad data costs companies an average of $406 million per year.

Basic data hygiene techniques, such as testing, play a huge role in improving data quality. However, writing a few tests isn't enough. You need a comprehensive approach that implements checks and balances at multiple points in the process. In this article, we'll look at six principles you can use to enhance data quality at every step of the data life cycle.

Data quality best practices

In working with hundreds of clients in the data space, we’ve distilled a few common best practices any company can benefit from. These include:

  • Build a quality-oriented analytics workflow
  • Work with data through a single data control plane
  • Shift data quality testing left
  • Monitor and report continuously in production
  • Control access to data
  • Don’t over-index on data quality

Let’s take a look at each one in turn, what it means, and how you can implement it.

Build a quality-oriented analytics workflow

Historically, a lot of work in data analytics has been ad hoc. Data engineers write data transformation scripts in a variety of languages and often for their own personal use (i.e., not checked into source control). Changes may get run against production systems with minimal verification or testing. Over time, this leads to poor data quality and data corruption.

The first step towards improving data quality is creating a mature analytics workflow. This is a workflow that:

  • Represents all analytics changes as code and puts all assets under version control
  • Enables collaboration at scale
  • Is accessible to the widest variety of data users possible
  • Enables a high deployment velocity
  • Supports security, auditing, and governance

We've laid out our approach to a mature analytics workflow, which we call the Analytics Development Lifecycle (ADLC). The ADLC is a vendor-agnostic approach that builds quality into every part of an analytics workflow:

  • At the plan stage, a unified team of data engineers, analysts, and business users defines data quality explicitly and determines who should have access to what data
  • During development, engineers track all of their changes using version control. They save their code to non-production branches of a source repo and can only merge their changes to the production branch by creating a Pull Request (PR) and undergoing a peer review
  • Test requires all engineers to validate their changes to ensure their code does what they intend it to do under a variety of circumstances, and fails gracefully when it encounters an unexpected situation
  • Deploy uses automated Continuous Integration/Continuous Deployment (CI/CD) pipelines to run tests in pre-production environments before exposing changes to users, ensuring changes can be deployed both quickly and safely (a minimal CI sketch follows below)
  • Operate and observe provides always-on access to analytics and continuously tests in production to find errors before users do
  • Discover and analyze makes all data models and their accompanying documentation easily discoverable and available to data stakeholders, enabling them to validate both the correctness and timeliness of the data they’re using

A mature analytics process does more than test code. It provides transparency and visibility to all data stakeholders throughout the entire development process, so that everyone understands and trusts the data they’re using.
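As an illustration of the deploy stage, here's a minimal sketch of a CI job that builds and tests a dbt project on every pull request. It assumes GitHub Actions, a Snowflake adapter, and a dedicated `ci` target; the workflow file name, secret name, and profiles directory are all hypothetical and would vary by team.

```yaml
# .github/workflows/dbt_ci.yml (hypothetical file name)
# Build all models and run all tests against a pre-production target on every pull request.
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt
        run: pip install dbt-snowflake   # adapter choice is an assumption; swap for your warehouse
      - name: Build models and run tests
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}   # hypothetical secret name
        run: dbt build --target ci --profiles-dir ./ci
```

Because every change enters production through this gate, a failing test blocks the merge rather than surfacing as a data incident later.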

Work with data through a single data control plane

In the past, data pipelines were complicated. They were written in a plethora of languages, were often buggy and temperamental, and required detailed knowledge of arcane processes to run. This meant there was no single, common approach to implementing and verifying data quality.

The explosion of data and the growth of data-driven applications, including analytics and now AI, means companies need easy-to-use tools for transforming data and verifying data quality that they can use across the business. Ideally, they need a single place where data stakeholders can manage all data transformations and verify the accuracy, timeliness, and origins of data.

We call this the data control plane—an open, cross-platform approach that provides a single point of access for transforming, storing, discovering, using, and managing data. With a data control plane, you should be able to:

  • Model, transform, and test data
  • Democratize data so that data engineers, analysts, and even savvy business users can build and deploy analytics code changes with high velocity
  • Optimize data platform costs across all workloads
  • Discover data and its accompanying documentation and see a map of your data estate using automatically generated data lineage

dbt Cloud functions as a data control plane by providing a common framework for managing data. Engineers and analysts can create models and tests using SQL or Python with the same toolset and process each time, reusing models and tests across projects. Meanwhile, data stakeholders can view data, documentation, and important data quality information, such as lineage and data freshness, via a single portal.
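For example, a model and its documentation live side by side in the project, so what stakeholders see in the portal is generated from the same code the engineers ship. The model, source, and column names below are purely illustrative; this is a sketch of the pattern, not a prescribed schema.

```sql
-- models/orders.sql (hypothetical model)
-- A cleaned, reusable orders model that other teams can discover and build on.
select
    order_id,
    customer_id,
    order_total,
    ordered_at
from {{ source('shop', 'raw_orders') }}
where order_total >= 0
```

```yaml
# models/schema.yml -- documentation that surfaces in the portal alongside the model
version: 2
models:
  - name: orders
    description: "Cleaned orders, one row per order, built from the (hypothetical) shop.raw_orders source."
```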

Shift data quality testing left

Shift-left testing has been an important concept for years in the software development world. It refers to moving testing earlier in the development process rather than treating testing as a separate phase that happens at the end.

Shifting testing left recognizes two realities. First, bugs detected earlier in the development process are easier and cheaper to fix than bugs discovered in production. Second, doing testing "at the end" usually means testing never happens at all. Starting testing earlier, for example during the development phase, ensures that testing is an integral part of the process rather than an afterthought.

With dbt Cloud, developers can create tests alongside their code. Teams new to shift-left testing can start by implementing the essential data quality checks. From there, they can implement more advanced testing, including domain-specific custom tests.
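As a hedged sketch of that progression, the YAML below applies dbt's built-in generic tests (not_null, unique, accepted_values, relationships) to a hypothetical orders model, and the SQL file adds a domain-specific singular test; the names and accepted values are illustrative.

```yaml
# models/schema.yml -- essential checks using dbt's built-in generic tests
version: 2
models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```

```sql
-- tests/assert_no_negative_order_totals.sql
-- A domain-specific singular test: it fails if the query returns any rows.
select order_id, order_total
from {{ ref('orders') }}
where order_total < 0
```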

One aspect of shift-left testing that's remained challenging for data has been testing locally during development. Developers have either had to maintain their own data warehouse instances or run their changes through a full deployment pipeline iteration to test them fully.

To address this, dbt is integrating SDF, which emulates data stores like Snowflake locally to enable full end-to-end testing before developers even check in their code. This will accelerate testing and reduce time spent in the dev cycle fixing broken check-ins. The result is higher data quality delivered in less time.

Monitor and report continuously in production

CI/CD pipelines contribute to data quality by running your data tests against mock data in pre-production environments. This identifies the majority of issues in your code before you make your changes live for stakeholders.
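One common refinement, sometimes called "slim CI," is to build and test only the models a pull request actually changed, plus everything downstream of them. The step below extends the earlier CI sketch and assumes the production run's manifest has been downloaded to a local ./prod-artifacts directory.

```yaml
      # Additional step for the CI job sketched earlier: rebuild and test only modified
      # models and their downstream dependents, deferring unchanged references to production.
      - name: Slim CI build
        run: dbt build --select state:modified+ --defer --state ./prod-artifacts --target ci
```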

However, you can never fully predict how your changes will act in production against real world data. Especially in long-running data systems, it's almost impossible to foresee every exception and edge case you may encounter.

You can account for this by continuing to test even after deployment. Testing in production runs your tests against incoming data changes, raising alerts whenever an issue is discovered. This keeps your data clean and helps find errors before they impact customers.
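In a dbt project, one way to do this is to pair scheduled test runs with source freshness checks, so stale upstream data raises an alert before anyone queries it. The source name, thresholds, and timestamp column below are assumptions for illustration.

```yaml
# models/sources.yml -- freshness thresholds checked in production via `dbt source freshness`
version: 2
sources:
  - name: shop
    loaded_at_field: _loaded_at          # assumed load-timestamp column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: raw_orders
```

A scheduled production job can then run `dbt test` and `dbt source freshness` after each load and notify the team on warnings or failures.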

You can further keep your data clean by using data profiling to assess and track quality along the different data quality dimensions. Assessing overall data quality means you can set benchmarks, measure progress across teams, and mentor individual teams on how to improve.
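A lightweight way to start, assuming the same hypothetical orders model, is a query that tracks a few dimensions (completeness, uniqueness, freshness) over time; dedicated profiling or observability packages can take this much further.

```sql
-- analyses/profile_orders.sql (hypothetical) -- a simple profile along three dimensions
select
    count(*)                                              as row_count,
    1 - count(customer_id) * 1.0 / nullif(count(*), 0)    as customer_id_null_rate,  -- completeness
    count(distinct order_id) * 1.0 / nullif(count(*), 0)  as order_id_uniqueness,    -- uniqueness
    max(ordered_at)                                       as latest_order_at         -- freshness
from {{ ref('orders') }}
```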

Control access to data

An often overlooked aspect of data quality is preventing unauthorized changes. Ad hoc changes that don't go through your analytics workflow development process can inject errors that might take weeks or even months to discover.

Rather than provide direct access to data systems, use your data control plane to ensure all analytics code changes go through the entire ADLC process, which includes peer review and automated testing. This ensures accountability and conformance to your company’s data quality standards.

Your data control plane should provide a way to control access to data based on roles versus hard-coding access to individuals. Use role-based access control (RBAC) wherever possible to both grant and revoke access based on a user’s job role. (dbt Cloud supports RBAC for controlling access to dbt assets on a per-project basis.)
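Warehouse-level grants can also be expressed as code so that access is always tied to roles rather than individuals. A minimal sketch using dbt's grants config, with hypothetical project, folder, and role names:

```yaml
# dbt_project.yml -- grant read access by role; dbt applies these grants when models build
models:
  my_project:          # hypothetical project name
    marts:
      +grants:
        select: ['reporting_role', 'bi_tool_role']   # hypothetical warehouse roles
```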

Use Single Sign-On (SSO) so that a user’s permissions are revoked as soon as their corporate ID is decommissioned. This prevents situations where, for example, individuals maintain access to critical systems even after their employment is terminated.

Don’t over-index on data quality

The 10x9 rule in systems engineering says that each additional nine of reliability you add (for example, going from 99.9% to 99.99% uptime) makes your system ten times more reliable, but at roughly ten times the total cost of your solution.

What does this mean for data testing? It means that, at some point, more testing is too much. The exact definition of "too much" will differ between teams and depend on your use cases.

Use metrics to measure your data reliability across the various data quality dimensions and establish an achievable KPI for each dimension. Once you hit it, focus on making it easier and faster to ship high-quality analytics code rather than spending ever more chasing perfection.

Conclusion

A serious commitment to data quality requires a cultural shift. It means establishing processes and inculcating habits that may feel new or strange to all parties involved. Done well, however, they enable data stakeholders to create, deploy, and use new data products with higher quality and in less time than ever.

While data quality is more than just tools, a good tools framework can simplify and democratize access to data across your entire organization. Using dbt Cloud as your data control plane, you can provide a consistent approach to data quality to all data stakeholders with minimal overhead. To learn more, ask for a demo today.
