Poor quality data has a domino effect that negatively impacts decision-making, compliance, operational efficiency, and trust. By contrast, working actively to improve data quality gives your stakeholders the confidence that they can rely on it to make critical decisions and power new business ideas, such as AI apps.
Achieving and maintaining a high level of data quality requires an active commitment from both data producers and data consumers. In this article, we’ll introduce the principles of data quality management, discuss how to implement it in practice, and look at some tools your teams can leverage in their day-to-day data workflows.
What is data quality management?
Data quality management is the set of policies, practices, and methods an organization uses to ensure its data is accurate, complete, and trustworthy. Ideally, these data quality management standards are a part of a larger data governance framework that sets criteria for data quality and consistency across the entire company.
The outputs of a data quality management process may include artifacts such as:
Data retention policies
How long data should be held and what happens to it after that time period. These policies take into account not just the business value of the data, but also compliance with applicable laws and regulations.
Data formatting standards
How to encode and represent data across the organization. This can include general rules (e.g., date formats) as well as guidelines for company-specific data (e.g., uniform customer IDs, acceptable ranges of values for a specific field). Data formatting standards help prevent downstream pipeline breakages and make data synchronization easier.
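As a quick illustration, a formatting standard can often be enforced directly in SQL during staging. This is only a sketch: the column names, zero-padded ID format, and warehouse functions below are assumptions, and the exact function names vary by warehouse.

```sql
-- A minimal sketch of enforcing formatting standards in a staging query.
-- raw_orders, raw_customer_id, and the formats shown are assumptions.
select
    lpad(cast(raw_customer_id as varchar), 10, '0') as customer_id,  -- uniform, zero-padded customer ID
    to_char(order_date, 'YYYY-MM-DD') as order_date,                 -- ISO 8601 date format
    upper(trim(country_code)) as country_code                        -- two-letter codes, upper case
from raw_orders
```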
Data tests
Code that checks a data set for various attributes of data quality. A data test might ensure, for example, that all of the required fields in a record are specified, that individual fields are formatted correctly, and that no anomalies exist between records.
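In dbt, for example, such a check can be written as a singular test: a SQL query that returns the rows violating the rule, so an empty result means the test passes. The model and column names below are hypothetical.

```sql
-- tests/assert_orders_are_well_formed.sql
-- A dbt singular test: returns offending rows, so zero rows returned means the test passes.
-- The orders model and its columns are hypothetical.
select *
from {{ ref('orders') }}
where order_id is null              -- required field missing
   or customer_id is null           -- required field missing
   or order_total < 0               -- value outside the acceptable range
   or order_date > current_date     -- anomaly: order recorded in the future
```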
Data quality metrics
Data quality metrics provide a quantitative measure of data quality, enabling your organization to identify gaps and improve quality over time. Examples of data quality metrics include total number of data incidents, time to data incident detection, time since last data refresh, and number of passed/failed data tests for a table, among others.
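For instance, a "time since last data refresh" metric can be computed with a simple query. The table, column, 24-hour threshold, and Snowflake-style date functions below are assumptions for the sketch.

```sql
-- A sketch of a "time since last data refresh" metric.
-- The analytics.orders table, loaded_at column, and 24-hour threshold are assumptions.
select
    max(loaded_at) as last_refresh_at,
    datediff('hour', max(loaded_at), current_timestamp()) as hours_since_refresh,
    case
        when datediff('hour', max(loaded_at), current_timestamp()) > 24 then 'stale'
        else 'fresh'
    end as freshness_status
from analytics.orders
```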
Data quality management in practice
In practice, data quality management typically encompasses several core activities.

Data profiling

Assesses the structure of your organization’s data and how each table and field relates to the others. This often takes the form of:
- A repository of data models describing data sources, data destinations, and data transformation rules.
- Data lineage, typically represented as a Directed Acyclic Graph (DAG) that shows how both tables and columns relate to one another.
A data profile gives you a complete picture of the data you own and how it flows throughout your organization. Data engineers can leverage it to identify the root causes of data quality issues and fix them at their source.
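In a dbt project, for instance, lineage is derived automatically from the code itself: every ref() or source() call in a model becomes an edge in the DAG. The model and column names below are hypothetical.

```sql
-- models/rpt_customer_orders.sql (hypothetical model)
-- Each ref() below becomes an edge in the project's lineage DAG,
-- so dbt can trace this report back to its upstream staging models.
select
    c.customer_id,
    count(o.order_id) as order_count,
    sum(o.order_total) as lifetime_value
from {{ ref('stg_customers') }} as c
left join {{ ref('stg_orders') }} as o
    on o.customer_id = c.customer_id
group by 1
```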
Data transformation
Corrals your data into tables and views that meet the business requirements of data consumers. Transformation also involves cleaning and formatting data to fit your organization’s data quality policies as set forth in your data governance framework.
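A dbt staging model is one common place to apply this cleanup. The sketch below assumes a hypothetical CRM source and customer columns; the cleaning rules are illustrative only.

```sql
-- models/staging/stg_customers.sql (hypothetical staging model)
-- Cleans and standardizes raw customer data per the governance framework's formatting rules.
with source as (
    select * from {{ source('crm', 'customers') }}    -- assumed source definition
),

cleaned as (
    select
        cast(customer_id as varchar) as customer_id,
        trim(lower(email)) as email,                  -- normalize casing and whitespace
        nullif(trim(phone), '') as phone,             -- treat empty strings as missing
        cast(signup_date as date) as signup_date
    from source
)

select * from cleaned
```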
Data validation
Uses automated and manual validation to check data for its overall accuracy and sensibility. (For example, ensuring that an Age field never has a negative value.) Data validation ensures that the work done during the transformation phase is correct and that the data is free of obvious defects.
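The Age rule above maps directly onto a dbt singular test; the customers model below is hypothetical.

```sql
-- tests/assert_age_is_never_negative.sql
-- Validates the Age example above: any returned row is a defect.
-- The customers model is hypothetical.
select
    customer_id,
    age
from {{ ref('customers') }}
where age < 0
```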
Metadata development
Identifies characteristics such as a table’s owner, when the data was last modified, a description of the data and how it was calculated, and any related data tests.
By capturing the data’s current status and business purpose, metadata enables better data discovery and usage. This helps you reduce “dark data,” or data that goes unused because no one can find it or validate its meaning.
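In dbt, some of this metadata can be attached in the model file itself using the meta and tags configs (column and model descriptions usually live in the project’s YAML properties files). The owner, cadence, and tags below are assumptions.

```sql
-- models/marts/fct_orders.sql (hypothetical model)
-- Attaches ownership and refresh metadata to the model so it surfaces in docs and the catalog.
{{ config(
    meta = {
        'owner': 'data-platform-team',     -- assumed owning team
        'refresh_cadence': 'daily'         -- assumed custom metadata field
    },
    tags = ['finance', 'certified']        -- assumed tags used for discovery
) }}

select * from {{ ref('stg_orders') }}
```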
Monitoring and reporting
Tools and alert systems that track the effectiveness of your data quality efforts and proactively monitor for errors. These include data quality metrics dashboards and data usage statistics. Data engineering team members can set up ongoing testing on production data and issue alerts immediately upon detecting a data anomaly.
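One lightweight form of anomaly monitoring is a scheduled test that compares recent volume against a trailing baseline; a failure then triggers an alert. The model name and the 50% thresholds below are assumptions for the sketch.

```sql
-- tests/assert_order_volume_within_expected_range.sql (hypothetical singular test)
-- Fails (returns rows) when yesterday's row count deviates more than 50%
-- from the trailing 7-day average, prompting an alert from the scheduled run.
with daily_counts as (
    select
        cast(created_at as date) as order_day,
        count(*) as row_count
    from {{ ref('fct_orders') }}
    group by 1
),

baseline as (
    select avg(row_count) as avg_count
    from daily_counts
    where order_day between current_date - 8 and current_date - 2
),

latest as (
    select row_count
    from daily_counts
    where order_day = current_date - 1
)

select latest.row_count, baseline.avg_count
from latest
cross join baseline
where latest.row_count < baseline.avg_count * 0.5
   or latest.row_count > baseline.avg_count * 1.5
```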
Implementing data quality management
Implementing data quality management requires a combination of processes and tools.
Processes ensure that all appropriate stakeholders—both technical and business line leaders—are involved in data quality management. In particular, processes need to involve data domain owners to validate that data conforms to business requirements and outcomes.
Tools support key elements of the data quality management process, including creating data transformation pipelines and automating data quality standards, testing, and metrics collection and reporting.
Both processes and tools are necessary to implement data quality management at scale. Approaches such as the Analytics Development Lifecycle (ADLC) combine processes and tools into a unified framework that enables data producers and consumers to improve data quality via short, rapid development cycles.
Using an ADLC approach, companies can identify important data quality use cases and address them through iterative improvements.
A typical data quality management life cycle will include:
Plan
Technical and business stakeholders work to identify existing data quality issues and prioritize them. For example, the team might identify that duplicate records in sales data are preventing an accurate analysis of purchasing trends.
After identifying use cases, the team will establish procedures for reconciling records, preventing duplicates, and creating a data set that accurately reflects the current business reality. The team should also establish metrics to monitor and verify correctness (e.g., fewer than x% duplicate records in the final data set, data test pass rate).
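As a sketch, the duplicate-rate metric from this example could be expressed as a query the team tracks over time; the stg_sales model and order_id key are assumptions.

```sql
-- A sketch of a duplicate-rate metric for the sales example above.
-- The stg_sales model and order_id key are assumptions.
select
    count(*) as total_rows,
    count(distinct order_id) as distinct_orders,
    round(100.0 * (count(*) - count(distinct order_id)) / count(*), 2) as duplicate_pct
from {{ ref('stg_sales') }}
```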
Develop
The data engineering team will then create data transformation pipelines that produce a clean data set in line with data consumers’ requirements. They’ll also create tests to run against both pre-production and production data to verify the quality of the output.
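Continuing the duplicate-sales example, the pipeline might deduplicate with a window function that keeps the most recently updated record per key. The model and column names below are assumptions.

```sql
-- models/marts/fct_sales.sql (hypothetical model)
-- Deduplicates sales records, keeping the most recently updated row per order_id.
with ranked as (
    select
        *,
        row_number() over (
            partition by order_id        -- assumed natural key for a sale
            order by updated_at desc     -- most recent record wins
        ) as row_num
    from {{ ref('stg_sales') }}
)

select *
from ranked
where row_num = 1
```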
Test and deploy
The data engineering team checks all of its data transformation and testing code into source control, using Pull Requests (PRs) to review data code changes internally before deployment. It also implements a Continuous Integration and Continuous Deployment (CI/CD) process to test data quality management code in a pre-production environment before releasing to production.
Operate, observe, discover, and analyze
Data consumers use the new, clean data set to create reports and data-driven applications. Along the way, they report any issues they identify back to the data engineering team for fixing. Simultaneously, the data team tracks metrics and alerts to identify potential issues before they result in report or application downtime.
All teams involved in the ADLC continue iterating over this cycle, delivering new data quality use cases with every release.
Automating data quality management with dbt
It takes time to implement the processes and tools required for an effective data quality management program. It takes even longer if you have to build all of your tooling and pipelines from scratch.
dbt Cloud offers a host of features that significantly reduce the time and effort required to ship high-quality data. These include:
- Transformation: Create models that import data from multiple sources, cleaning and transforming them into new data sets that are ready to use.
- Documentation: Add descriptions directly to data models. Publish new documentation automatically with every push to production, providing other data users with detailed information on the origin and meaning of your data.
- Testing: Leverage built-in tests (e.g., not-null checks) and create custom tests that implement quality checks specific to each data domain.
- Version control integration: Check data models, transformations, and tests into source control to ensure all changes are tracked and reviewed. Isolate in-development changes in branches so that data engineers can work freely on new features or fixes without affecting the current state of production.
- Job scheduling and orchestration: Regularly run your dbt models and tests to bring data changes into production and ensure that data quality checks are performed continuously. Unlike other tools, dbt Cloud makes it easy to automate data imports and testing in a single data pipeline.
- CI/CD support: Automatically run jobs from your Git provider based on check-ins or completed PRs. Test changes in a pre-production environment before releasing to users.
- Data cataloging: Data producers and consumers alike can use dbt Explorer to find existing data sets and related documentation, as well as trace data lineage to verify the origin of data and troubleshoot upstream data issues.
- Monitoring and dashboards: Monitor metrics and fire alerts in response to dbt Cloud test failures.
Leveraging these tools in dbt Cloud, your team can build a robust data quality management process in a fraction of the time it’d take to build from scratch.
Learn more about how dbt Cloud can kickstart your data quality management journey—contact us today for a demo.