What are the four principles of data mesh?
How do you implement a data mesh? Let’s explore the four principles of data mesh, how they relate to one another, and how you can use them to guide and optimize your implementation process.
What is data mesh?
A data mesh is a decentralized data management architecture comprising domain-specific data.
In a data monolith architecture, all elements of a data product are stored and managed in one centralized location by one centralized team. By contrast, with data mesh, a centralized team enables data mesh scenarios through core data handling services, while individual teams retain ownership and control over their own domain-specific data.
Implemented properly, data mesh strikes an ideal balance between data democratization and data governance. It enables data domain teams to move quickly and adapt to changing requirements or market conditions, while simultaneously enabling the organization to effectively manage and monitor data for quality and compliance.
[Join us at Coalesce October 16-19 to learn from data leaders about how they built scalable data mesh architecture at their organizations.]
Why the four principles of data mesh are important
Zhamak Dehghani, the progenitor of the data mesh architecture, laid out the four principles of data mesh during her time at Thoughtworks.
These principles are key because data mesh requires more than just re-architecting your data-driven applications. It involves a mindset shift in how organizations manage data. It also takes diligence to achieve the correct balance between agility and effective oversight.
You can use these four principles to guide your organization through its own data mesh journey. Let’s look at each one in detail.
Principle 1: Domain-driven data ownership
The foundational principle of data ownership is that individual business domain teams should own their own data. This idea is built on Eric Evans’ work around domain-driven design. The aim of domain-driven data ownership is to align responsibility with the business (marketing, finance, etc.), and not with the technology (data warehouses, data lakes).
As an example, consider a fictional e-commerce company. At the highest level, the company breaks down into natural data domains by department and function: e.g., sales, marketing, finance, engineering.
This principle of data mesh is critical and is also often the hardest to come to terms with. It requires a fundamental shift in how organizations approach data. Principally, it requires a new data architecture to support it: a self-serve data platform operating on a federated governance model. (We’ll dive into those principles further below.)
However, when done correctly, domain-driven ownership brings a number of benefits to an organization compared to the data monoliths of old:
- Clear ownership and demarcation. One of the persistent problems with existing data architectures is determining who owns a given data set. With domain-driven ownership, teams own their own data and register their ownership with a data catalog. In our example above, this means finance would own financial data, sales would own sales data, and so on.
- Scalability. In data monolith architectures, a single data engineering team must own and manage the data and associated business logic for every team across the organization. That creates bottlenecks. With domain-driven ownership, individual teams own their own data and pipelines, which distributes the burden.
- Increased data quality. Centralized data engineering teams don’t always have all the information they need to make the best decisions about what constitutes “good” data. Yielding these decisions back to the data domain owners results in better decision-making and better data quality across the organization.
- Faster time to market. Data domain teams are more knowledgeable about their own data and own their own data pipelines and business logic. This means, on average, that they can deliver new solutions in less time than if they had to hand the full implementation over to a centralized data engineering team.
Getting started with domain-driven ownership
It doesn’t make sense for every domain data team to reinvent the wheel. To support domain-driven ownership, organizations should task a data platform team with creating the infrastructure that enables domain teams to manage their own data.
A centralized data mesh enablement architecture offers centralized services for data management, including storage, orchestration, ingestion, transformation, cataloging and classification, and monitoring & alerting.
On the data domain side, teams need to define their own data contexts and data products (which we’ll discuss more below). They may also want to have embedded data engineers and analytics engineers to support managing their own data pipelines and reports.
Principle 2: Data as a product
No team should live in a silo. Teams need to share data with one another constantly. This is where the mesh portion of data mesh comes in. Each team defines not just the data that they own, but what data they produce and consume from others.
In software engineering, many teams have shifted to service-oriented architectures (SOAs) in which each team defines explicit interfaces that other teams and components can call. This enables teams to stay small (“two-pizza” size) and move quickly. It also enables teams to version their contracts so they can introduce changes without breaking existing dependencies.
One goal of data mesh architecture is to achieve the same level of speed and agility as service-oriented Agile software teams. This is where the idea of a data product comes in.
A data product is a well-defined, self-contained unit of data that solves a business problem. A data product can be simple (a table or a report) or complex (an entire machine learning model).
In order to succeed, data products must be:
- Discoverable. Other teams need a way to find data products and incorporate them into their own data-driven applications.
- Trustworthy. Teams that rely on a data product should be confident that it will not suddenly change or break, causing their own workflows to grind to a halt. They should also feel that they can trust the underlying data.
- Accessible. Teams should be able to access and provision data products easily—ideally in a self-service format, such as a self-service BI report.
How do data products achieve this?
- Interfaces that define how the team exposes data to others: e.g. the columns, data types, and constraints that compose a given data product
- Contracts that serve as written specifications for interfaces and that all teams can use to validate their conformance to the interface
- Versions to define new revisions of a contract while supporting previous versions for backward compatibility
- Access rules that define who can access what data in a data product. For example, the product should mask sensitive data fields—such as personally identifiable information (PII)—from personnel who do not have a clear business reason to see it.
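The interface, contract, and access-rule ideas above can be sketched in a few lines of code. This is an illustrative sketch only: the contract format, column names, and the `pii` flag are hypothetical simplifications, not the schema of any real tool.

```python
# Hypothetical v1 contract for an "orders" data product: each column
# declares its expected type, nullability, and whether it contains PII.
ORDERS_CONTRACT_V1 = {
    "order_id": {"type": int, "nullable": False},
    "customer_email": {"type": str, "nullable": False, "pii": True},
    "amount": {"type": float, "nullable": False},
    "coupon_code": {"type": str, "nullable": True},
}

def validate_row(row: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for a single record."""
    errors = []
    for column, spec in contract.items():
        if column not in row:
            errors.append(f"missing column: {column}")
            continue
        value = row[column]
        if value is None:
            if not spec["nullable"]:
                errors.append(f"null in non-nullable column: {column}")
        elif not isinstance(value, spec["type"]):
            errors.append(f"wrong type for {column}: {type(value).__name__}")
    return errors

def mask_pii(row: dict, contract: dict) -> dict:
    """Mask columns flagged as PII for consumers without a business need."""
    return {
        col: "***" if contract.get(col, {}).get("pii") else val
        for col, val in row.items()
    }

good = {"order_id": 1, "customer_email": "a@b.com", "amount": 9.99, "coupon_code": None}
bad = {"order_id": "1", "customer_email": None, "amount": 9.99}

print(validate_row(good, ORDERS_CONTRACT_V1))  # []
print(validate_row(bad, ORDERS_CONTRACT_V1))   # three violations
print(mask_pii(good, ORDERS_CONTRACT_V1))
```

Because the contract is an explicit, versioned artifact, a producing team could publish an `ORDERS_CONTRACT_V2` alongside v1 and support both while consumers migrate.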
When teams fail to think of data as a product, they end up with no clear plan for interoperability. This wastes time as they scramble to understand how partners structure their data. It also leads to broken data pipelines and reports when some upstream or downstream team changes its implicit data contract unexpectedly (e.g., removing a column, or changing a data type).
By contrast, defining contracts makes interfaces explicit and reduces roadblocks to interoperability. It also limits downstream breakages, as teams can introduce v2 of an interface while offering limited-time support for v1.
Getting started with data as a product
To implement data as a product, data platform teams need to support tools for defining, validating, and versioning contracts. For example, using dbt’s model contracts features, teams can easily define their models’ columns, data types, and constraints.
Model access control gives teams finer-grained control over which of their models they make public. This enables differentiating between public models designed for interoperability, protected models used within a project, and private models that are exposed only within a group.
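As a rough sketch, a contract-enforced, publicly accessible model might be configured in a dbt YAML file along these lines (the model and column names here are invented for illustration):

```yaml
# models/schema.yml — illustrative example; names are hypothetical
models:
  - name: fct_orders
    access: public          # exposed for cross-team consumption
    config:
      contract:
        enforced: true      # dbt checks columns, data types, and constraints at build time
    columns:
      - name: order_id
        data_type: integer
        constraints:
          - type: not_null
      - name: amount
        data_type: numeric
```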
Treating data as a product also means managing it as a product. Data domain teams should work to understand and document their existing data workflows, as well as create backlogs to track and manage upcoming releases.
An important success factor in data mesh is a data enablement team. The enablement team assists domain teams in making the shift to data as a product by defining modeling best practices, designing reference examples, and training users on tools and processes. Organizationally, the data enablement team is often a part of the data platform team.
Principle 3: Self-serve data platform
Domain data teams manage their own data products end to end. This includes all associated ingestions, data transformations, data quality testing, and analytics.
But as mentioned above, it doesn’t make sense for every domain data team to stand up this toolset on their own. Most won’t have the time or capability. At best, this results in redundant contracts (e.g., multiple teams licensing multiple data storage solutions from different vendors) and incompatible tooling.
A more efficient way to manage this is by creating a self-serve data platform. This platform consists of the tools that data domain teams need to ingest, transform, store, clean, test, and analyze data.
The self-serve data platform also provides other tools, such as:
- Security controls for managing access to data
- A data catalog for registering, finding, and managing data products across the company
- An orchestration platform for governing access to data products and provisioning them
Without this self-service platform, many teams will lack the tools required to join the data mesh. By providing these tools, the data platform team unlocks the scalability of a data mesh architecture.
Getting started with a self-serve data platform
The self-serve data platform is where the data mesh moves from theory to reality. So it’s critical that the data platform team and data domain teams work closely together to create a toolset that works for all stakeholders.
The most successful toolsets will be those that leverage technology that your engineers and analysts already know. For example, using a tool like dbt for data transformation requires less ramp-up time, as it leverages a language—SQL—that most engineers and analysts already know and use daily.
When introducing a major change such as a self-serve data platform, it’s best to start small. The data platform team should gather requirements as broadly as possible—but implementation should start with a single data domain team. After onboarding that team and incorporating their feedback, the data platform and enablement teams can onboard the next team, and then the next. Along the way, the platform and domain teams iterate on the toolset and processes, while the enablement team expands its training and repertoire of best practices.
Principle 4: Federated computational governance
A data mesh can become a data anarchy without proper governance controls. That’s why the final—and perhaps most important—principle of data mesh is federated computational governance.
Compliance laws like the General Data Protection Regulation (GDPR) require companies to classify, secure, and—when needed—delete sensitive customer data from their systems. In a data monolith architecture, the data platform team could enforce these rules from the top down. While this worked (mostly), it was often manual, hard to scale, and introduced what domain teams saw as bureaucratic delays.
In a data mesh architecture, while domain teams own their data products, the data platform and the corporate data governance team track and manage compliance centrally via a data catalog and data governance tools.
The data governance team sets standards for compliance—what constitutes sensitive information, who should have access to it, and how it should be labeled in the system. The data governance team also defines standards for data quality to ensure consistency across teams. The data platform team implements these policies through automation (the “computational” part of computational governance).
For example, the data governance team would mandate that all government-issued ID numbers need to be marked as PII in a data catalog. The data platform team can enforce this computationally by requiring appropriate tagging for all PII across all registered data products. It could also scan for fields that match sensitive data patterns and issue warnings if it discovers PII with inadequate access controls.
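A computational check like the one just described might look like the following sketch. The catalog entry structure, tag names, and the pattern list are hypothetical simplifications—real scanners use far richer rule sets and integrate with the data catalog’s API.

```python
import re

# Pattern for values that look like one kind of government-issued ID
# (a US SSN); purely illustrative.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def find_untagged_pii(data_product: dict) -> list[str]:
    """Return names of fields whose sample values look sensitive
    but are not tagged as PII in the (hypothetical) catalog entry."""
    violations = []
    for field in data_product["fields"]:
        if "pii" in field.get("tags", []):
            continue  # already tagged; access rules apply downstream
        for sample in field.get("sample_values", []):
            if any(p.match(str(sample)) for p in SENSITIVE_PATTERNS.values()):
                violations.append(field["name"])
                break
    return violations

product = {
    "name": "hr_employees",
    "fields": [
        {"name": "employee_id", "tags": [], "sample_values": ["E-1042"]},
        {"name": "national_id", "tags": [], "sample_values": ["123-45-6789"]},
        {"name": "ssn_hash", "tags": ["pii"], "sample_values": ["123-45-6789"]},
    ],
}

print(find_untagged_pii(product))  # ["national_id"]
```

A check like this would run automatically against every registered data product, surfacing violations for the owning domain team to fix.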
When it creates a new data product, a data domain team onboards it to the data platform’s data catalog. The team is then responsible for responding to and fixing any compliance issues with its data, such as classifying unclassified values or removing sensitive information from access logs.
Federated computational governance enables data governance at scale. Automated policy enforcement reduces the manual labor required to remain compliant with the growing, complex body of data regulations worldwide. And when there are issues, the data domain team—the data owners—can resolve them quickly.
Getting started with federated computational governance
After establishing firm data governance policies, the data governance and data platform teams invest in tools that support federated computational governance. Many data catalogs offer robust data governance tools out of the box or as add-ons purchased separately. Data platforms and data transformation tools also provide governance features like role-based access control, testing, and model governance.
It’s the data platform’s job to convert data governance policies into automated governance. This includes setting appropriate access controls, enforcing classification rules, establishing rules for data quality, and configuring anomaly detection, among others.
The data governance team, which is itself composed of domain experts, works with the enablement and data domain teams to educate everyone on data governance best practices, including the domain teams’ new responsibilities as data owners.
The four principles of data mesh define a new approach to data architecture: domain-driven data ownership, data as a product, a self-serve data platform, and federated computational governance. By putting these four principles into practice, you can shift your organization to a highly scalable data model that effectively balances speed with responsibility.
Last modified on: Nov 22, 2023