Key components of data mesh: Creating and managing data products
In a previous article in our data mesh series, we looked at the four principles of data mesh. Let’s dive more deeply into one of those principles: Data as a Product. We’ll review what data products are, how they can improve data discovery and governance, and how to create them.
What is a data product?
A data product is a data container or a unit of data that directly solves a customer or business problem.
As a deliverable, a data product can be as simple as a report or as complex as a new machine learning model. Data products will also contain any metadata required for consumption, such as API contracts and documentation.
In a data mesh architecture, data domain teams own their own data and all attendant processes. Think of data products as the domain team’s primary deliverables.
Attributes of a data product
So what makes something a “data product” as opposed to a lump of data?
Data products have a core set of attributes that set them apart from bare-bones data. A data product should be:
- Discoverable
- Addressable
- Trustworthy & truthful
- Self-describing
- Interoperable
- Secure & governed
Let’s look at each of these attributes in detail.
Discoverable
A data product should be easy to find - e.g., via central registration in a data catalog that tracks data products across the company.
Discoverability solves the chronic problem of finding relevant, high-quality data within a company. A survey by Coveo found that employees spend up to 3.6 hours per day searching for relevant information. The stats are even worse for IT professionals, who spend 4.2 hours per day.
Improving data discoverability reduces or eliminates this wasteful overhead and ensures that teams can start creating new data products promptly. For example, a team that wants to build a product recommendation engine could use discoverability tools—such as dbt’s native documentation and lineage functionality—to find where the organization keeps an anonymized dataset of past customer orders. Once discovered, they can request access to the dataset, then create their own data pipelines to transform it into the format they need.
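To make this concrete, here is a minimal sketch of catalog-based discovery in Python. The `DataCatalog` and `CatalogEntry` names and fields are hypothetical, not any particular catalog product's API:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A registered data product in a company-wide catalog (illustrative schema)."""
    name: str
    owner: str
    description: str
    tags: set = field(default_factory=set)

class DataCatalog:
    """Minimal keyword-searchable registry of data products."""
    def __init__(self):
        self._entries = []

    def register(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, keyword: str) -> list:
        kw = keyword.lower()
        return [
            e for e in self._entries
            if kw in e.name.lower()
            or kw in e.description.lower()
            or kw in {t.lower() for t in e.tags}
        ]

# Example: a domain team registers its product so others can discover it.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders_anonymized",
    owner="orders-domain-team",
    description="Anonymized dataset of past customer orders",
    tags={"orders", "pii-scrubbed"},
))
```

A team building a recommendation engine could then call `catalog.search("orders")` to find the dataset and its owning team before requesting access.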
Addressable
Addressable means that a data product has a unique, labeled location from which data teams can retrieve the asset.
The addressing format will differ based on the asset. For a database table, this may consist of a server name, port number, and schema/table path. For data exported by a partner, it might be a Parquet or CSV file stored in an Amazon S3 bucket. The only requirements are that the address uniquely identifies the asset and that it can be retrieved on demand by anyone with the proper permissions.
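As a sketch, such an address can be modeled as a small structured value that round-trips to and from a URI-style string. The scheme and field names here are illustrative, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProductAddress:
    """Uniquely identifies a data asset; fields are illustrative, not a standard."""
    scheme: str    # e.g. "warehouse" for a database table, "s3" for an object store
    location: str  # server name and port, or bucket name
    path: str      # schema/table path, or object key

    def uri(self) -> str:
        return f"{self.scheme}://{self.location}/{self.path}"

    @classmethod
    def parse(cls, uri: str) -> "DataProductAddress":
        scheme, rest = uri.split("://", 1)
        location, path = rest.split("/", 1)
        return cls(scheme, location, path)

# Example: a warehouse table address that a consumer can resolve on demand.
addr = DataProductAddress("warehouse", "analytics-db:5432", "sales/orders")
```

The point is only that the address is unique and resolvable; in practice the registry would map it to the appropriate retrieval mechanism (a database connection, an S3 download, etc.).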
Trustworthy & truthful
In one survey of 220 data governance professionals, 46% said concern over data quality impedes use. Even if employees can find data, they won’t use it if they can’t trust it.
Data products can help improve people’s confidence in data by providing self-service answers to fundamental questions about data:
- Who owns the dataset?
- How often is it updated (e.g., near-real-time, x times/day, every 24 hours, weekly) and when was the last update?
- What procedures do the data owners take to cleanse and validate data?
- Has the data been tested?
Data contracts also help enforce the trustworthiness of data products by ensuring they meet a specific structure and standard.
Self-describing
Data without accompanying metadata can be difficult to interpret - it’s hard to tell what it means or why it exists. By contrast, data products provide mechanisms to describe what data they make available, its format, its intended business purpose, and any other relevant usage information.
A good example of a technology used to create a self-describing data product is a dbt data model. Models do more in dbt than specify how to transform data. They can also describe each model’s data and how it relates to other models in the company. This provides critical information to other teams looking to leverage the data in their projects.
Another example is a data model contract. A contract specifies a set of constraints to which a data model adheres. Teams can version contracts when they need to make breaking changes, allowing dependent teams time to transition from the old to the new model.
Interoperable
Data products within a company should be interoperable - i.e., provide mechanisms for working seamlessly with other data products. Part of this work involves standardization of concepts (e.g., “customer”, “product”) common to the organization. Another part involves defining Application Programming Interfaces (APIs) and data formats to enable data exchange across teams and divisions.
Secure & governed
Finally, data products should leverage federated computational governance to ensure security and compliance. All data should be stored encrypted at rest and in transit, and protected via a strong “zero-trust” permissions model. The company should also set global rules and standards for classifying and restricting access to sensitive data, such as a customer’s Personally Identifiable Information (PII).
Why do we need data products?
Thinking of data as a “data product” reverses the way most teams approach data.
Traditionally, the industry has focused on data from the technology side (where is it stored? how is it processed?). Data products instead put the focus on people, processes, and delivering business value. The emphasis is less on the technology used to deliver data and more on how it’s packaged and delivered to the end user.
Thinking about data this way can resolve three chronic issues with handling data at scale.
Data products define standards for documenting, cataloging, and discovering data. Without such standards, companies end up with terabytes of underutilized business information. This so-called “dark data” can comprise over half of a company’s data assets. This data is not only not generating income—it’s costing the company money in storage and compute to maintain it.
Data products can help a company standardize access control mechanisms for sensitive data. For example, in dbt, you can specify public, private, and protected access levels for data models. This provides greater security by cleanly distinguishing between data of interest to the company and data internal to a team.
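As a rough sketch, here is how such access levels might be enforced. This is a simplified stand-in for dbt's actual model-access rules, not its implementation:

```python
from enum import Enum

class Access(Enum):
    PRIVATE = "private"      # referenced only within the owning group
    PROTECTED = "protected"  # referenced only within the same project
    PUBLIC = "public"        # referenced by any downstream consumer

def may_reference(access: Access, same_group: bool, same_project: bool) -> bool:
    """Simplified rule for whether one model may reference another."""
    if access is Access.PUBLIC:
        return True
    if access is Access.PROTECTED:
        return same_project
    return same_group  # PRIVATE
```

Marking a model `private` keeps a team's internal staging tables out of reach, while `public` models form the team's supported interface to the rest of the company.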
Without data products, teams often don’t publish detailed documentation or contracts specifying what data they expose to others. This means that, when they make a change - e.g., delete a column, change the format of data in a string - there’s no way to communicate this to downstream consumers. That ends up breaking reports and applications that depend on the data.
Data products provide definition around the structure of their data. This enables teams to publish and support multiple versions of their data products. The reduction in unexpected reporting and pipeline errors saves both time and money.
How to define and deploy data products
Architecturally speaking, a data product consists of several components. The most important of these are the data specification and data contract.
Data specifications are one or more human- and machine-readable documents that define the format of data, data definitions, access policies, and data transformations.
Data contracts are guarantees of behavior for a given version of the data product. You can think of them like APIs in a service-oriented software architecture.
Contracts guarantee that the data product’s output for that version will always return consistent results. That’s because a contract is a machine-readable specification that can be used for testing and verification. For example, a centralized quality management system (e.g., a data catalog) can run a contract against a portion of a data domain team’s data when it submits a new data product.
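A minimal sketch of this idea in Python: a contract expressed as a machine-readable column specification, checked against a sample of rows. The contract format here is hypothetical:

```python
# Hypothetical contract: column name -> (expected type, nullable)
CONTRACT_V1 = {
    "order_id": (int, False),
    "amount": (float, False),
    "coupon_code": (str, True),
}

def validate(rows: list, contract: dict) -> list:
    """Check a sample of rows against a contract; return violation messages."""
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, nullable) in contract.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
            elif row[col] is None:
                if not nullable:
                    errors.append(f"row {i}: column {col!r} must not be null")
            elif not isinstance(row[col], typ):
                errors.append(f"row {i}: column {col!r} expected {typ.__name__}")
    return errors
```

A quality management system could run this kind of check when a team submits a new data product and reject the submission if any violations are returned.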
Other assets that comprise the data product include:
- Tests. Code that verifies the validity of your models against representative data.
- Version control: Leverage git to check in and track changes to data definitions, contracts, and data pipeline code. This serves to document changes and enable rollback to previous versions when required.
- Data storage: Object file storage, RDBMS/NoSQL database tables, data warehouses, data lakes, etc. to hold raw and transformed data.
- Orchestration pipeline. Computing processes that transform data, run tests, and deploy changes to one or more environments.
- Additional deliverables. Any additional artifacts that comprise the data product, such as reports and metrics.
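Putting a few of these assets together, here is a toy orchestration pipeline: a transform step, tests that gate deployment, and a deploy step that stands in for writing to a warehouse table. All names and logic are illustrative:

```python
def transform(raw: list) -> list:
    """Toy transformation: keep completed orders and normalize amounts to cents."""
    return [
        {"order_id": r["order_id"], "amount_cents": round(r["amount"] * 100)}
        for r in raw
        if r["status"] == "completed"
    ]

def run_tests(rows: list) -> None:
    """Fail the pipeline before deployment if the output violates expectations."""
    assert all(r["amount_cents"] >= 0 for r in rows), "negative amount"
    assert len({r["order_id"] for r in rows}) == len(rows), "duplicate order_id"

def deploy(rows: list, target: dict) -> None:
    """Stand-in for publishing to a warehouse table."""
    target["orders_clean"] = rows

def run_pipeline(raw: list, target: dict) -> None:
    rows = transform(raw)
    run_tests(rows)   # a failing test stops the deploy
    deploy(rows, target)

# Example run against sample input.
warehouse = {}
run_pipeline(
    [
        {"order_id": 1, "amount": 9.99, "status": "completed"},
        {"order_id": 2, "amount": 5.00, "status": "cancelled"},
    ],
    warehouse,
)
```

Real pipelines delegate each step to dedicated tooling (dbt for transformations and tests, an orchestrator for scheduling), but the shape is the same: transform, verify, then deploy.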
In a data mesh architecture, a centralized data platform team will often provide self-service methods that enable a data domain team to provision the elements of a new data stack and initialize their domain. This ensures that every team is developing data products against the same standards and with a consistent toolset.
Once provisioned, the data domain team develops the models, permissions, tests, ELT processes, reports, and other deliverables that comprise its data product. After extensive testing, the team publishes its initial version to the data catalog. The registration includes information such as the data model, the current contract specification, the address of the data product, and any additional metadata required by the registry.
After registration, the data domain team resolves any security and compliance issues detected by the registry. From there, other teams can discover and use the data product in their workflows.
When the data team needs to introduce a breaking change, it creates a new contract with a new version and publishes it to the registry. It also provides an “end of life” date for retiring the previous contract. The registry can use data lineage information to inform owners of downstream data products of the upcoming change.
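A sketch of how a registry might resolve versioned contracts and refuse ones past their end-of-life date. The `ContractVersion` structure is hypothetical:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ContractVersion:
    version: int
    columns: dict                          # column name -> type name
    end_of_life: Optional[date] = None     # None means the version is current

def resolve(versions: list, requested: int, today: date) -> ContractVersion:
    """Return the requested contract version, refusing versions past end of life."""
    for v in versions:
        if v.version == requested:
            if v.end_of_life is not None and today > v.end_of_life:
                raise ValueError(f"contract v{requested} retired on {v.end_of_life}")
            return v
    raise KeyError(f"no contract version {requested}")

# Example: v1 is scheduled for retirement; v2 is the current contract.
contracts = [
    ContractVersion(1, {"order_id": "int"}, end_of_life=date(2023, 6, 30)),
    ContractVersion(2, {"order_id": "int", "amount": "float"}),
]
```

Downstream teams pinned to v1 keep working until the end-of-life date passes, giving them a window to migrate to v2.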
Data products represent a new, consumer-oriented way of thinking about data. By treating containers of data as modular, self-contained deliverables, companies can enhance data discovery, tighten data security, and reduce costly rework. That lowers the barrier to creating new data-driven products - and opens up a world of new business opportunities.
Last modified on: Nov 22, 2023