Data products and data mesh: Leveraging both to simplify data management
May 21, 2024
Many data engineering teams have struggled to keep up with the growing demand for data. Two related solutions have emerged to make it easier for teams to manage data at scale: data mesh and data products.
A data mesh is an architectural pattern for scaling data systems and providing domain teams with more local, granular control of their data. Data products are the tangible outputs of this process - curated and managed data assets that serve specific business needs. Used together, the two can make it easier for enterprises to scale, discover, and manage high-quality data sets.
In this article, we’ll examine how data mesh and data products work, how the two are closely connected, and the tools available for implementing and maintaining both.
What are data mesh and data products?
Even as software application teams have embraced microservice architectures, many data engineering teams have continued managing data in monolithic data stores. This has resulted in data engineering teams becoming the single point of contact for data requests.
This approach has two major downsides. The first is that the data team becomes a chokepoint. Inundated with requests for new data or data pipeline revisions, the team quickly falls behind, unable to devote much - if any - energy to improving the organization’s overall data infrastructure.
The second is that data quality decreases. A single centralized data engineering team often doesn’t understand the full business context behind a given data set. The resulting disconnect between data producers and data consumers leads to inaccurate data, which propagates downstream to reports and dashboards, producing misinformed insights and decisions.
Data mesh and data products are approaches to managing and packaging data that aim to make it easier to manage data at scale.
What is data mesh?
A data mesh is a decentralized data management architecture comprising domain-specific data. Instead of relying on a single centralized data repository, each domain team owns the processes around its own data.
A data mesh architecture implements four key design principles:
Data as a product. In a Data as a Product (DaaP) mindset, data teams think of data not as a monolith, but as individual products built for the benefit of consumers. This means taking the needs of data users across the organization into account when designing data sets, as well as managing changes to avoid breaking existing consumers.
Domain-oriented data and pipelines. In the data mesh pattern, business teams own their own data. Each team is responsible for maintaining its data stores, pipelines, and tests. This enables each team to define and enforce its own rules around its data. It can then provide this data to other teams via well-defined interfaces.
Self-service data infrastructure. Requiring every domain team to build its own data infrastructure would be a waste of resources. It’d also create an impossibly high bar for many teams to clear. To enable data ownership, the data engineering team creates the tools required to provision data stores, transform and clean data, verify data, and render analytics.
Security and federated governance. The opposite of a “data monarchy” isn’t “data anarchy.” Federated governance enables teams to secure and classify their data to remain compliant with industry and governmental regulations. It also provides the organization with the tools required to automatically monitor and ensure compliance in a distributed data environment.
What is a data product?
A data product is a data container or unit of data that directly solves a customer or business problem. An outgrowth of a Data as a Product mindset, a data product contains a usable unit of data - e.g., a table, a report, a Machine Learning model - as well as the metadata, pipelines, API contracts, and documentation required to produce and use it.
Data products are defined by a core set of attributes, illustrated in the configuration sketch after this list:
Discoverable. Other teams can search and find them via some mechanism, such as a data catalog. This eliminates the wasteful overhead involved in finding and using existing data.
Addressable. Each data set has a unique, labeled location that everyone across the organization can reference to identify it.
Trustworthy & truthful. The data product contains information about who owns the data set, how often it’s updated, and where it sources its data from. This increases confidence in a data set by providing potential users with self-service answers to fundamental questions about data quality and provenance.
Self-describing. Data products use metadata, documentation, and contracts to define the data’s format, its intended business purpose, and any other relevant usage information.
Interoperable. Data products leverage standardized concepts and fields across the organization. Additionally, they expose their information via standard query languages, data interchange formats, and Application Programming Interfaces (APIs) to enable data exchange across the org.
Secure & governed. Data products encrypt their data at rest and in transit, control access to data via role-based access control, and adhere to organizational standards for classifying and protecting sensitive data, such as a customer’s Personally Identifiable Information (PII).
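As one concrete way to express several of these attributes, here’s a minimal sketch using dbt’s model schema format. The `fct_orders` model, the `orders_domain` group, and the column names are hypothetical, and the tests shown are illustrative rather than prescriptive:

```yaml
# models/schema.yml -- a hypothetical data product definition
models:
  - name: fct_orders                # addressable: a unique, stable name
    description: "One row per completed customer order."  # self-describing
    group: orders_domain            # trustworthy: records the owning team
    access: public                  # discoverable and consumable by other teams
    columns:
      - name: order_id
        description: "Unique surrogate key for the order."
        tests:                      # trustworthy: automated quality checks
          - unique
          - not_null
      - name: order_total
        description: "Order value in USD, including tax."
```

Running `dbt test` exercises the declared checks, and the descriptions flow into generated documentation - which is what makes the product discoverable in a catalog.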
The connection between data products and data mesh
A data product is a key component within a data mesh, designed to make data easy to find, consume, share, and govern. It embodies the architecture's four key design principles:
Data as a product
When working in a data as a product mindset, organizations put data products through a development lifecycle. They view data itself as a valuable product that must be managed, curated, and delivered with the same rigor as software applications. Since high-quality data is fundamental to a successful data product, teams put a greater emphasis on ensuring quality and usability for data consumers.
Domain-oriented data and pipelines
In a data mesh architecture, business domain teams - the teams that truly understand their own needs and requirements - own their own data. This includes owning all of the associated processes, including ingestion, transformation, governance, and serving.
In addition, data domain teams are responsible for maintaining data quality, versioning their data sets, and monitoring data usage and storage to reduce costs wherever possible.
Packaging data as a data product - with its associated metadata, documentation, contracts, and interoperability mechanisms - makes achieving these objectives easier. Data products give teams a uniform structure for packaging, deploying, and discovering their data sets.
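dbt Mesh makes these cross-team interfaces explicit through project dependencies. As a sketch, a downstream team could declare a dependency on a hypothetical upstream `orders_platform` project:

```yaml
# dependencies.yml in the consuming project -- the project name is hypothetical
projects:
  - name: orders_platform
```

Models in the consuming project can then reference the upstream team’s public models with dbt’s two-argument ref - e.g., `{{ ref('orders_platform', 'fct_orders') }}` - keeping the interface between domains explicit and version-controlled.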
Self-service data products
It doesn’t make sense for every team that owns its own data to reinvent the wheel. Key data and basic data functions - e.g., the tools required to store data, create data pipelines, render analytics, etc. - should still be owned by the data engineering team.
In a data mesh paradigm, the difference is that these tools are open and available to all data domain teams who need them. This open data architecture democratizes data by giving every team a consistent and reliable method for creating their own data products.
Strong security and federated governance
Data products are built with discoverability and governance as part of their design. Each team is responsible for associating access controls with its products and classifying and tagging data.
Teams are also responsible for publishing their data products to a data catalog, where they can not only be discovered by other teams, but monitored centrally to audit and report on organizational compliance initiatives.
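In dbt, this style of federated governance can be expressed with groups and access modifiers. Here’s a sketch with hypothetical team, model, and classification names:

```yaml
# models/groups.yml -- a hypothetical domain group with a named owner
groups:
  - name: orders_domain
    owner:
      name: Orders Team
      email: orders-team@example.com

# models/schema.yml -- a private model only the owning group may reference
models:
  - name: stg_customer_pii
    group: orders_domain
    access: private               # models outside orders_domain cannot ref this
    config:
      meta:
        classification: pii       # hypothetical tagging convention for audits
```

The `meta` tags can then be read by catalog and compliance tooling, giving central teams an audit trail without taking ownership away from the domain.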
Data product and data mesh tools
Standing up a data mesh architecture and creating data products isn’t something that happens overnight. Fortunately, frameworks like dbt Mesh can make this easier with out-of-the-box tooling support for data mesh and data products.
Here are the systems and processes that are part of a combined data mesh/data product architecture, along with the support that dbt provides:
Data sources. The internal and external data storage systems - relational databases, NoSQL databases, data warehouses, etc. - that hold your raw, unprocessed data.
Data pipeline and data transformation tools. dbt lets you use SQL or Python to define models that transform raw source data into clean, well-structured data sets for a specific data product. dbt also supports versioning model code in source control and automatically running data quality tests using Continuous Integration (CI).
Data destinations. The locations - relational databases like MySQL and PostgreSQL, data warehouses like Snowflake and Amazon Redshift, data lakes, object storage, etc. - where you store data after it’s been transformed.
Data management tools. Your architecture needs a mechanism to version data products, enabling teams to release breaking changes without interrupting the workflows of downstream consumers. dbt enables explicitly versioning data products via dbt models, as well as establishing contracts that guarantee the shape and constraints of a data product’s data (see the first sketch after this list).
Additionally, dbt supports defining fine-grained access controls on data, as well as using metadata to define data sensitivity and classification levels.
Data discovery tools. Providing tools to find and use data products reduces the risk of data silos and enables organizations to gain maximum business value from their data. dbt Explorer gives data consumers a full, 360-degree view of organizational data, along with its corresponding metadata, documentation, and data lineage.
Data analytics tools. Once exported to a data destination and exposed in a data catalog, a data product’s end users can use their favorite Business Intelligence and analytics tools to consume a team’s work. You can further use the dbt Semantic Layer to standardize key organizational metrics in a single source (see the second sketch after this list).
Provisioning architecture. Instead of spending all of its time building data pipelines, the data engineering team builds tooling that ties all of these elements together into a self-service data infrastructure. This includes provisioning storage capacity, configuring access controls, and creating data pipeline frameworks and base data models from standardized templates. For provisioning support, you can use dbt to modularize and centralize your analytics code while giving your data teams the ability to collaborate on data models, version them, and test and document queries before safely deploying them to production. dbt Cloud plays a critical role in this infrastructure by providing a centralized, turnkey platform for transformation and modeling.
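To make the data management bullet concrete, here’s a sketch of a contracted, versioned model in dbt. The model and column names are hypothetical:

```yaml
# models/schema.yml -- a contracted model with two live versions
models:
  - name: fct_orders
    latest_version: 2
    config:
      contract:
        enforced: true            # the build fails if the model's shape drifts
    columns:
      - name: order_id
        data_type: integer
      - name: order_total
        data_type: numeric
      - name: legacy_status
        data_type: varchar
    versions:
      - v: 1                      # kept live so existing consumers keep working
      - v: 2
        columns:
          - include: all
            exclude: [legacy_status]   # the breaking change lands only in v2
```

Downstream consumers can pin a version - e.g., `ref('fct_orders', v=1)` - until they’re ready to migrate to the latest one.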
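And as a sketch of standardizing a metric in the dbt Semantic Layer, again with hypothetical names:

```yaml
# models/semantic_models.yml -- a hypothetical canonical revenue metric
semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum

metrics:
  - name: total_revenue
    label: "Total Revenue"
    description: "Sum of order value across completed orders."
    type: simple
    type_params:
      measure: order_total
```

Any BI tool connected through the Semantic Layer then resolves `total_revenue` the same way, rather than each dashboard re-deriving it.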
Conclusion
Data mesh and data products work together to give business teams the ability to own and publish their own data products. That enables greater data scalability and reliability across the organization. By leveraging dbt, you can eliminate a lot of the architectural grunt work involved in standing up the infrastructure required to support this decentralized and democratized approach to data management.
For more information on how to get started with dbt Mesh in your company, contact us for a demo today.