What is data mesh? The definition and importance of data mesh
There are many parallels between data analytics workflows and software engineering processes. Both have had to deal with mounting scale and complexity, larger networks of collaborators, and tighter deadlines.
Software engineering has dealt with this complexity by moving from a hero mentality to a team mindset. Many in the industry realized that creating monolithic applications with massive teams was a recipe for increased costs and decreased quality. As a result, companies focused on creating small teams building well-defined components in a service-oriented architecture.
But the same thing hasn’t happened with data. Data analytics, for the most part, still centers on creating monolithic stores managed by single data engineering teams. This results in overworked teams - which leads to shipping delays and a decline in data quality.
How do we bring the hard-won lessons of software engineering into the data realm? In this article, we’ll examine how data mesh architecture turns the monolithic data paradigm on its head - and how it can help you deliver data-driven projects more quickly and with higher reliability.
What is a data mesh?
A data mesh is a decentralized data management architecture comprising domain-specific data. Instead of having a single centralized data platform, teams own the processes around their own data.
In a data mesh framework, teams own not only their own data but also the data pipelines and processes associated with transforming it. A central data engineering team maintains both key data sets and a suite of self-service tools to enable individual data ownership. Domain-specific data teams then exchange data via well-defined and versioned contracts.
Why data mesh was born
Zhamak Dehghani first incubated the concepts behind data mesh during her time at Thoughtworks. She created the data mesh architecture to address what she saw as a set of problems with the way companies handle their data.
Slowdowns and silos in the data monolith
With the advent of the cloud, more development teams have moved away from monolithic applications and embraced microservices architectures. However, in the data world, many companies still store their data in monolithic databases, data warehouses, or data lakes.
The architectural choice to use a data monolith has numerous knock-on effects. Monolithic approaches break down a data processing pipeline into several stages—ingestion, processing, and serving. A single team often handles all of these stages. This approach can work at first, but breaks down with scale. Ultimately, funneling all requests through a single team slows down the delivery of new features.
In this model, data engineering teams often can’t gain the full context behind the underlying data. Since they’re responsible for maintaining data sets from multiple disparate teams, they often don’t fully understand the business rationale behind the data.
This can lead them to make uninformed, and sometimes harmful, choices that affect business decision making. For example, a data engineering team may format data in a way that the sales department doesn’t expect. This can lead to broken reports or even lost data.
Complex and brittle systems
Monolithic systems rarely have clear contracts or boundaries. This means that data formatting changes upstream can break an untold number of downstream consumers. The result? Teams end up not making necessary changes for fear of breaking everything. This leads to monolithic systems gradually becoming outdated, brittle, and hard to maintain.
Finally, as dbt founder Tristan Handy notes, collaboration also becomes more difficult in a monolithic system. Since no one is familiar with the entire codebase, it takes more people and more time to complete data-related tasks. This affects time to market for new products and features - which impacts the company’s bottom line.
Principles of data mesh
To see more clearly how data mesh differs from the monolithic approach, let’s look at some of its main principles.
Data as a product
At the heart of data mesh is the idea of creating not just data but data products: well-defined, self-contained units of data that solve a business problem. A data product can be simple (a table or a report) or complex (an entire machine learning model).
The hallmark of a data product is that it has defined interfaces with validated contracts and versioning. This ensures that anyone who depends on the data product knows exactly how to integrate with it. It also prevents sudden and unexpected breakages, as the data domain team packages and deploys all changes as new versions.
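To make the idea of a validated, versioned contract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular tool implements contracts: the `DataContract` class, the `validate` and `is_breaking_change` helpers, and the `orders` columns are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: a product name, a version, and column types."""
    product: str
    version: int
    columns: dict  # column name -> expected Python type


def validate(contract, rows):
    """Return a list of human-readable violations (empty means valid)."""
    violations = []
    for i, row in enumerate(rows):
        for name, expected in contract.columns.items():
            if name not in row:
                violations.append(f"row {i}: missing column '{name}'")
            elif not isinstance(row[name], expected):
                violations.append(f"row {i}: '{name}' is not {expected.__name__}")
    return violations


def is_breaking_change(old, new):
    """A change breaks consumers if it drops or retypes an existing column."""
    return any(
        name not in new.columns or new.columns[name] is not expected
        for name, expected in old.columns.items()
    )


# Three hypothetical versions of the same data product's contract.
orders_v1 = DataContract("orders", 1, {"order_id": int, "amount": float})
orders_v2 = DataContract("orders", 2, {"order_id": int, "amount": float, "currency": str})
orders_v3 = DataContract("orders", 3, {"order_id": str, "amount": float})

good_rows = [{"order_id": 1, "amount": 9.5}]
bad_rows = [{"order_id": "A-1"}]
```

In this sketch, adding a column (version 2) leaves existing consumers untouched, while retyping `order_id` (version 3) is flagged as breaking. That is the kind of change a domain team would ship as a new version rather than apply in place.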
Domain-oriented data and pipelines
In a data mesh, the team that owns the data truly owns the data. This includes all related processes, including ingestion, processing, and serving. In other words, each team owns its own data pipelines.
In a data mesh model, pipelines function as internal implementation. The caller of a method on a class in an object-oriented programming language doesn’t need to know how the method is implemented. Likewise, users of a data product don’t require visibility into how data was processed.
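The object-oriented analogy can be sketched directly in Python. The example below is hypothetical: `OrdersDataProduct`, its `read` method, and the underscore-prefixed pipeline steps are invented for illustration. It shows consumers depending only on a public interface while ingestion and transformation stay private to the owning team.

```python
class OrdersDataProduct:
    """Hypothetical data product owned by a sales domain team."""

    def __init__(self, raw_records):
        self._raw_records = raw_records  # internal: source data

    def read(self):
        """Public interface: the only thing consumers depend on."""
        return self._transform(self._ingest())

    def _ingest(self):
        # Internal step: drop records that have no amount.
        return [r for r in self._raw_records if r.get("amount") is not None]

    def _transform(self, records):
        # Internal step: convert cents to a rounded currency amount.
        return [
            {"order_id": r["id"], "amount": round(r["amount"] / 100, 2)}
            for r in records
        ]


product = OrdersDataProduct([
    {"id": 1, "amount": 1250},
    {"id": 2, "amount": None},
])
rows = product.read()
```

The owning team could rewrite `_ingest` or `_transform` entirely. As long as `read` keeps returning the same shape, callers would not need to change.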
Because data domain teams own their own data, they feel a stronger sense of responsibility and stewardship. This leads to more accurate and higher quality data over time.
Data infrastructure as a self-service platform
It doesn’t make sense for every team that owns its own data to reinvent the wheel. Key data and basic data functions—e.g., the tools required to store data, create data pipelines, render analytics, etc.—should still be owned by the data engineering team.
In a data mesh paradigm, the difference is that these tools are open and available to all data domain teams who need them. This open data architecture democratizes data by giving every team a consistent and reliable method for creating their own data products.
Strong security and federated governance
Making data self-service means ending the “data monarchy” imposed by monolithic data stores. But ending data monarchy doesn’t mean embracing data anarchy.
Companies should still set and enforce standards for secure access, data formatting, and data quality. And it’s critical to monitor all data sources for compliance with industry and governmental regulations, such as the General Data Protection Regulation (GDPR).
As part of the self-service platform it provides, data engineering also provides a consistent framework for security and data governance. This includes tools such as data catalogs for indexing and finding data, as well as tagging tools to classify sensitive data elements (e.g., personally identifiable information), and policy enforcement automation to flag discrepancies and verify compliance.
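As a rough illustration of tagging plus policy enforcement automation, the Python sketch below flags sensitive columns that are left unprotected. Everything in it is an assumption made for the example: the `catalog` structure, the `pii` tag, the `masked` flag, and the `find_policy_violations` helper do not come from any specific catalog or governance product.

```python
# Hypothetical catalog entries: each column carries tags and a masking flag.
catalog = {
    "customers": {
        "customer_id": {"tags": set(), "masked": False},
        "email": {"tags": {"pii"}, "masked": True},
        "phone": {"tags": {"pii"}, "masked": False},
    },
    "orders": {
        "order_id": {"tags": set(), "masked": False},
    },
}


def find_policy_violations(catalog, sensitive_tag="pii"):
    """Return (table, column) pairs tagged sensitive but left unmasked."""
    return sorted(
        (table, column)
        for table, columns in catalog.items()
        for column, meta in columns.items()
        if sensitive_tag in meta["tags"] and not meta["masked"]
    )


violations = find_policy_violations(catalog)
```

A check like this could run on a schedule or in a deployment pipeline, so that governance rules are verified automatically instead of through manual review.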
Data mesh team structure
Based on these four principles, we can identify three major teams and areas of responsibility in a data mesh architecture.
Data platform team
The data platform team owns the self-service data platform. This comprises the main, centralized data plus all architectural components owned by data engineering and/or IT.
The data platform team typically owns architectural components such as data stores (databases, data warehouses, non-structured large object storage), BI and analytics tools, security, policy automation, monitoring, and alerting. They also maintain the tools that domain data teams will use, including contract enforcement, data transformation, and data pipeline creation tools.
What the data platform team does not own in a data mesh architecture is the individual models, workflows, reports, and processes for a specific data domain. That work now belongs to the data domain teams - the true owners of the data.
Data domain teams
Domain teams use the tools provided by the data platform team to create their own domain-specific data products. These teams own their own data pipelines, data contracts and versioning, and reporting and analytics.
Domain data teams are also responsible for maintaining data quality, versioning their changes properly, and monitoring and reducing data-related costs where possible.
Governance and enablement team
The governance team defines data quality and formatting standards, as well as data governance policies for defining, tagging, and flagging sensitive data.
Data platform teams implement these decisions via security controls and enforcement automation. Domain data teams apply the standards set by the governance team to their own data sets.
Finally, the enablement team assists domain data teams in understanding and adopting the self-service tools provided by the data platform team.
Benefits of adopting a data mesh architecture
The primary benefit of a data mesh architecture is scalability. It achieves this through the democratization of data.
The monolithic approach creates unnecessary roadblocks by funneling all data projects and requests through a single team. Returning ownership of data to its owners lets domain data teams create new data products without waiting on an overwhelmed data engineering team. The result is improved time to market, as well as more accurate and up-to-date data on which to base business decisions.
Data mesh achieves this democratization without sacrificing governance. Compliance concerns often lead organizations to adopt centralization strategies to better enable monitoring and enforcement. The data mesh architecture addresses security and compliance needs by combining node autonomy with centralized governance. It’s a “best of both worlds” approach that turns compliance into an enabling function instead of a roadblock.
Need proof? Check out how the supply chain experts at Flexport built a scalable data mesh with Snowflake and dbt: “Since shifting to data mesh with Snowflake, 5.5x more people are using data across the business on a regular basis,” says Flexport’s Head of Growth and Analytics, Abhi Sivasailam.
A Fortune 500 company 1) doubled the number of people collaborating on data modeling, 2) eliminated 3 weeks of work on regulatory reporting, and 3) freed up $10 million that it was able to put back into the business.
Challenges of a data mesh architecture
Adopting data mesh is not without its challenges.
Data mesh requires a high level of organizational maturity. Technologically, it requires a robust data platform layer that can serve the needs of a diverse user base. Organizationally, it requires buy-in from data domain teams - and people on each team who can effectively use the new data platform.
Moving to data mesh isn’t an overnight job. It requires planning, careful design, implementation, and an effective training strategy backed by a strong enablement function.
Additionally, without proper governance controls, it’s possible to slip into “data anarchy” and end up with bad, duplicative data proliferating throughout the org. Companies that adopt data mesh must have a strong data governance policy in place, along with the automated tooling to back it up.
Why is data mesh important?
For years, software engineering has successfully embraced the concept of small units of work performed by “two-pizza teams”. Each team owns its own component of a larger system. Teams integrate with one another through well-defined, versioned interfaces.
Sadly, data has yet to catch up with software. Monolithic data architecture is still the norm - even though there are clear drawbacks.
Data mesh brings the hard lessons learned from software engineering into data engineering. In the data mesh framework, each team can define its own contracts and integrate with other teams’ data via that team’s contracts. Instead of understanding a monolithic data model, data domain teams need only understand their own surface area plus the contracts exposed by partner teams.
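To illustrate what "understanding only the contracts exposed by partner teams" could look like in practice, here is a small Python sketch. It assumes a hypothetical registry of published contracts keyed by product and version; the `published` and `dependencies` structures and the `check_dependencies` helper are invented for this example.

```python
# Hypothetical registry of contracts published by partner domain teams.
published = {
    ("orders", 2): {"order_id", "amount", "currency"},
    ("customers", 1): {"customer_id", "region"},
}

# What a consuming team (say, finance) relies on: product, version, columns.
dependencies = [
    ("orders", 2, {"order_id", "amount"}),
    ("customers", 1, {"customer_id", "segment"}),
    ("shipments", 1, {"shipment_id"}),
]


def check_dependencies(published, dependencies):
    """Return one problem description per unmet dependency."""
    problems = []
    for product, version, needed in dependencies:
        offered = published.get((product, version))
        if offered is None:
            problems.append(f"{product} v{version}: contract not published")
            continue
        missing = sorted(needed - offered)
        if missing:
            problems.append(f"{product} v{version}: missing columns {missing}")
    return problems


problems = check_dependencies(published, dependencies)
```

The consuming team never inspects how `orders` or `customers` are built. It only compares what it needs against what each partner contract promises.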
The result is a more scalable model for our modern data age. Data domain teams can develop new data products more quickly and with less overhead. Contracts and versioning minimize downstream breakages - and can even eliminate them entirely. Meanwhile, the central data team can continue to enforce standards and track data lineage across the system.
Data mesh isn’t a magic bullet that will solve all of today’s data engineering woes. But it’s an important and necessary paradigm shift in the way we manage data.
Last modified on: Sep 20, 2023