Data quality dimensions: Tackling data quality with intentionality

Data quality dimensions create a general profile of your data's quality. How you define what "good" data quality is depends on how that data is used.

If you’re reading this page, you’re looking to improve the data quality within your organization. You know there’s room for improvement, and you want to learn the secret sauce that makes for higher data quality.

This is where we take a step back and take a good, hard look at our industry: there is no secret sauce for higher data quality. But there is a framework, a series of data quality dimensions that, when approached with intentionality, can help you figure out the most effective way to improve the data quality within your organization.

The data quality dimensions in this article might be different from, or defined differently than, others you might find. And that’s okay! This page is meant to question and further your understanding and expectations of data quality, and force you to look at data quality through the lens of your organizational needs.

The data quality dimensions you need

We’ll get deeper into this later, but data quality is something that is relatively tailored to your business; what good or bad data quality looks like is dependent on the data’s original state and how it’s transformed and ultimately used for business decisions.

However, there are core dimensions of data quality that can help you create a general profile of where your data quality stands, and where it can go:

  • Usefulness
  • Accuracy
  • Validity
  • Uniqueness
  • Completeness
  • Consistency
  • Freshness

Use this page to understand what these dimensions are, how they might apply to your analytics work, and how you can use modern data transformation tooling to improve your data quality and save yourself and your business time, money, and energy.

Usefulness

The data quality tenet of usefulness, meaning the intrinsic value the data brings to your organization, is often forgotten in talks about data quality. There’s little point in spending hours of time and energy building complex tables that no one ever uses; ask any data practitioner, and they’ll tell you this is their literal nightmare.

Before you go down the path of data ingestion, exploration, transformation, and testing, it’s important to truly understand the use cases for the data: what business problem would this solve if someone had this data? Is this the right data to solve your problems? Is that data useful for a point in time, or will it be a data source that will need to be revisited and iterated on in the future?

All these questions will lead to tough, important conversations with your end users to help you get a grasp on the true usefulness of your data. This data quality dimension is also unique in that it’s more difficult to quantify than, say, accuracy or completeness. But usefulness is one of those heuristics that will cost you some time up front and save you considerable time and energy in the long run.

There are no real tools, tricks, or tomfoolery to help you unpack the usefulness of your data; the other data quality dimensions will help you identify which data may be usable, but it’s the deep knowledge of your business and end users that will ultimately help guide this tenet.

Accuracy

The cornerstone of all data—is it accurate? Is it trustworthy?

As data practitioners, it’s your job to ensure that the data you expose to end users is accurate, meaning it contains values that reflect reality. If your business sold 198 new subscriptions today, 198 should be represented in your raw data.

When data is accurate, it’s trustworthy. When there’s trust between data teams and business users, collaboration is easier, insights are faster, and the Slack notifications are quieter.

That sounds great and all, but ensuring accuracy is probably one of the greatest challenges of any analytics practitioner. So how do you test for accuracy in your system in an efficient way?

Data quality testing answers this question. dbt offers code-based tests to help data folks create levels of accountability for their data (are specific column values correct, are primary keys non-null, and so on) and supports packages that contain robust testing methods to use across your data.
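To make the idea of a code-based test concrete, here is a minimal Python sketch. It is not dbt syntax or dbt's API; it only mirrors the general pattern of collecting the records that violate an expectation and passing when none come back. The function names, the sample `subscriptions` rows, and the reuse of the 198 figure from the example above are all illustrative assumptions.

```python
from typing import Callable, Iterable

Row = dict


def failing_rows(rows: Iterable[Row], is_ok: Callable[[Row], bool]) -> list[Row]:
    """Return every row that violates the expectation `is_ok`."""
    return [row for row in rows if not is_ok(row)]


def row_count_matches(rows: list[Row], expected: int) -> bool:
    """Compare the loaded row count with a figure from a trusted source."""
    return len(rows) == expected


# Hypothetical raw data: 198 new subscriptions sold today.
subscriptions = [{"subscription_id": i, "plan": "pro"} for i in range(1, 199)]

# Accuracy check 1: does the raw data reflect the number the business reported?
count_ok = row_count_matches(subscriptions, expected=198)

# Accuracy check 2: is every primary key populated? An empty list means "pass".
null_keys = failing_rows(subscriptions, lambda r: r["subscription_id"] is not None)

print(count_ok, len(null_keys))
```

In practice the expected count would come from an independent system of record rather than a hard-coded number.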

Data accuracy will likely always be the uphill climb for data practitioners, but there’s a way to make that trail a little less lonely and a little more supported: using software that makes data quality testing automated and second nature in the analytics workflow.

Validity

Validity is the pulse check on your data: if a column value is not valid (does not meet your expectations), it will likely break downstream analyses. To check if your data is valid, we recommend using accepted values tests in your data transformations; these tests pass when a column contains only the values you expect and fail when it doesn’t.
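An accepted values check can be sketched in a few lines of Python. This is an illustration of the concept rather than dbt's implementation; the `orders` rows, the `status` column, and the list of accepted statuses are hypothetical.

```python
def invalid_values(rows, column, accepted):
    """Return the rows whose `column` value is not in the accepted set."""
    accepted = set(accepted)
    return [row for row in rows if row[column] not in accepted]


# Hypothetical orders data with one misspelled status.
orders = [
    {"order_id": 1, "status": "placed"},
    {"order_id": 2, "status": "shipped"},
    {"order_id": 3, "status": "shiped"},  # typo that would break a downstream filter
]

bad = invalid_values(orders, "status", ["placed", "shipped", "completed", "returned"])

# The check fails if any row comes back.
print([row["order_id"] for row in bad])
```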

Validity, uniqueness, and completeness of your data ultimately roll up into the concept of your overall data accuracy.

Uniqueness

Data is unique when it’s non-duplicative and structured in a way that allows for growth. The baseline for uniqueness is checking rows for duplication, usually through a test of primary key uniqueness. What takes an analytics foundation to next-level uniqueness is creating a structure that supports data transformation modularity; data transformation logic that is used widely should be accessible and referenceable across your data models, not land-locked in a script stored on your analyst’s local machine.

If you use a data transformation tool like dbt, you can use packages such as the dbt_project_evaluator package to automatically test if your data transformations are following best practices of structure, modularity, and naming conventions.
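For the baseline check mentioned in this section, primary key uniqueness, a minimal Python sketch might look like the following. The `orders` rows and the `order_id` key are hypothetical, and the function is an illustration of the idea, not part of any package named above.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that appear on more than one row, sorted."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical orders table where order 102 was loaded twice.
orders = [
    {"order_id": 101, "amount": 20},
    {"order_id": 102, "amount": 35},
    {"order_id": 102, "amount": 35},
    {"order_id": 103, "amount": 12},
]

# An empty list means the primary key is unique.
print(duplicate_keys(orders, "order_id"))
```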

Completeness

Completeness in data quality is a bit like an onion:

  • The outermost layer of the onion is the foundational bits: If you run an e-commerce shop, your order_id and order_date fields should never be null in your orders table. But perhaps it’s okay if the returned_date field has some nulls; after all, a returned_date on every order would be bad for business 😉
  • The inner layers of the onion are where things start getting interesting: Your e-commerce shop needs reporting on orders, customers, and shipments—there’s more data modeling work to do here. Data sources need to be extracted and loaded into your data warehouse, new data models need to be built and tested, and this new data needs to be brought into your BI tool. This could be considered the layer of schema completeness.
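The outermost layer described above, required fields that must never be null alongside optional fields that may be, can be sketched in Python as follows. The column names come from the example in the list; the sample rows and the function name are hypothetical.

```python
# Columns that must always be populated in the orders table.
REQUIRED_COLUMNS = ("order_id", "order_date")


def rows_missing_required(rows, required=REQUIRED_COLUMNS):
    """Return rows where any required column is null (None) or absent."""
    return [row for row in rows if any(row.get(col) is None for col in required)]


# Hypothetical orders: returned_date is allowed to be null, the others are not.
orders = [
    {"order_id": 1, "order_date": "2023-04-01", "returned_date": None},
    {"order_id": 2, "order_date": None, "returned_date": None},
    {"order_id": None, "order_date": "2023-04-03", "returned_date": "2023-04-10"},
]

incomplete = rows_missing_required(orders)

# Only the second and third rows are flagged; a null returned_date is fine.
print(len(incomplete))
```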

Completeness starts with checking that individual tables in your database meet your standards of accuracy and validity; as completeness at that level develops, you begin folding in new data sources, expanding the reach of your data work, and building out a lineage of data transformations that scale with the demands of your business.

In modern analytics work, this is the data quality concept of completeness.

Consistency

Data consistency is twofold:

  1. Are data values consistent across platforms?
  2. Is the structure of the data and the transformations supporting them consistent in a way that anyone could maneuver through your data pipelines with minimal effort?

The first concern is likely the one most data practitioners worry about. Fact: it’s challenging to ensure column values and core metrics are consistent across platforms. This is because raw data from different sources comes in different timezones, currencies, structures, and naming conventions. SQL and Python are needed to make this data readable and consumable during the data transformation process. When you use a tool like dbt for data transformation that uses version control, encourages peer review, and supports integrated testing, data consistency becomes something that is more controllable and truly ingrained into your analytics workflow.
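As a small illustration of the kind of normalization this involves, the Python sketch below converts timestamps from different timezones to UTC and amounts from different currencies to a single one. The exchange rates are hard-coded, made-up values for demonstration only, and the function names are hypothetical. It assumes every incoming timestamp is an ISO 8601 string that carries its own UTC offset.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Made-up conversion rates, for illustration only.
RATES_TO_USD = {"USD": Decimal("1"), "EUR": Decimal("1.10")}


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an offset and convert it to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


def to_usd(amount: str, currency: str) -> Decimal:
    """Convert an amount to USD using the illustrative rates above."""
    return (Decimal(amount) * RATES_TO_USD[currency]).quantize(Decimal("0.01"))


# The same moment reported by two hypothetical platforms in different timezones.
platform_a = to_utc("2023-04-21T09:00:00-05:00")
platform_b = to_utc("2023-04-21T16:00:00+02:00")

print(platform_a == platform_b)
print(to_usd("100.00", "EUR"))
```

Doing this once, in a shared and version-controlled transformation, is what keeps the two platforms from disagreeing later.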

Because dbt is also built upon the concept of a DAG, or directed acyclic graph, changes to upstream models flow downstream. Say a condition in a case when statement changes, so there’s a new value for a user_bucket column: you can rebuild your table with the new column value and update every downstream use of user_bucket, because of dbt’s ability to reference models.
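To illustrate why a DAG makes this possible, here is a minimal Python sketch that finds every model downstream of a changed one. It is a toy graph traversal, not how dbt is implemented, and the model names and dependencies are hypothetical.

```python
from collections import deque

# Hypothetical DAG: each model maps to the upstream models it references.
PARENTS = {
    "stg_users": [],
    "user_buckets": ["stg_users"],
    "fct_orders": ["user_buckets"],
    "marketing_report": ["user_buckets", "fct_orders"],
    "stg_payments": [],
}


def downstream_of(changed, parents=PARENTS):
    """Return every model that directly or indirectly depends on `changed`."""
    # Invert the mapping so each model points at the models that reference it.
    children = {model: [] for model in parents}
    for model, deps in parents.items():
        for dep in deps:
            children[dep].append(model)

    affected, queue = [], deque([changed])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in affected:
                affected.append(child)
                queue.append(child)
    return affected


# Changing the user_bucket logic affects both models that build on it.
print(sorted(downstream_of("user_buckets")))
```

The result lists which models are affected; it is not meant as a build order.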

dbt also supports the use of metrics via its Semantic Layer, allowing you to define a core metric once (in a version-controlled, code-based environment) and ultimately expose that same core metric across your database and BI tool. At the end of the month, the CFO and COO shouldn’t have two different values for ARR; with a semantic layer approach to metrics, KPIs are governed and maintained in one place, so every consumer sees the same number.

The second aspect of consistency is something that comes with maturity and time. Standardizing and governing your data transformations, with consistent naming conventions for tables and columns and a structure that encourages scalable growth, will naturally develop as your team figures out which standards work for you and which ones don’t. There’s no perfect formula for developing this level of organizational consistency, but we recommend investing in a data transformation tool that supports modular data modeling and growth, collaborative development environments, and version control to build an analytics foundation that can one day be governed with ease.

Freshness

Data is only useful if it’s timely—if it’s there when a stakeholder needs it. When an end user runs a query against your analytics database, creates a visualization in a BI tool, or receives an automated report to their email, how are you ensuring the data they’re looking at is up-to-date?

How often your data is refreshed depends on your use cases, and intentionality plays a big role in these decisions. Is your business making minute-by-minute optimizations? (Probably not.) Do you need the data refreshed every few hours to understand if today’s going to be a good day for your business? (Probably.) The cadence you choose for refreshing data is based on your end users’ needs, but should be approached with intentionality: it must meet user expectations, respect the limits of your data pipelines, and recognize impacts on cost.

To ensure the data that you’re exposing is fresh (to your standards), you can use a tool like dbt that supports automated tests on data source freshness to indicate whether source data is out-of-sync in an ETL tool, and dashboard status tiles to clearly display data freshness and quality statuses within your BI tool. These features allow you to put the knowledge business users need about the data in the tools they feel most comfortable with—building a system of trust and reliability between data teams and end consumers.
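A freshness check boils down to comparing the most recent load time against thresholds you choose. The Python sketch below shows that comparison with separate warn and error thresholds; the function name, the thresholds, and the timestamps are hypothetical, and this is an illustration of the concept rather than dbt's implementation.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(latest_loaded_at, now, warn_after, error_after):
    """Classify a source as 'pass', 'warn', or 'error' based on its age."""
    age = now - latest_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


# Hypothetical source that last loaded seven hours before the check ran.
now = datetime(2023, 4, 21, 12, 0, tzinfo=timezone.utc)
latest_loaded_at = datetime(2023, 4, 21, 5, 0, tzinfo=timezone.utc)

status = freshness_status(
    latest_loaded_at,
    now,
    warn_after=timedelta(hours=6),    # stale enough to flag
    error_after=timedelta(hours=24),  # stale enough to fail
)
print(status)
```

The thresholds are where intentionality comes in: they should reflect the refresh cadence your end users actually need.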

Lead with rigor, lean into intentionality

So there it is: all you need is useful, accurate, valid, unique, complete, consistent, and fresh data. That’s all there is to it, right?

Data quality is one of those things data practitioners will forever strive for. Perfection is beautiful, but beauty is in the eye of the beholder, and perfection is often not attainable given resources and tooling. At the end of the day, your data quality dimensions and tests are dependent on your business and standards. Maybe your data doesn’t need to be 100% accurate 100% of the time; maybe 90% directionally accurate 100% of the time is all your end users need to make a data-informed decision.

Ensuring high data quality is hard—as is defining what high-quality data looks like—but using a data transformation tool that allows you to create data transformations, tests, and pipelines that reflect the state of your business is the starting point to building a high-quality, trusted analytics workflow.

Last modified on: Apr 21, 2023