The components of a data mesh architecture
Sep 22, 2023
We’ve talked about the four principles that drive data mesh architecture. But what are the components of the data mesh architecture in today’s modern data applications? Let’s examine each of the layers of a data mesh architecture and how each piece works together to deliver increased scalability, greater trust in and reliability of data, and faster delivery of data products.
How data mesh changes data architecture
Data mesh is a new approach to data architecture. Rather than managing all data and data processing as a single monolith, it decomposes data into a series of data domains, each owned by the team closest to that data.
In a monolithic data management approach, technology drives ownership. A single data engineering team typically owns all the data storage, pipelines, testing, and analytics for multiple teams, such as Finance and Sales.
In a data mesh architecture, business function drives ownership. The data engineering team still owns a centralized data platform that offers services such as storage, ingestion, analytics, security, and governance. But teams such as Finance and Sales would each own their data and its full lifecycle (e.g. making code changes and maintaining code in production). Moving to a data mesh architecture brings numerous benefits:
- It removes roadblocks to innovation by creating a self-service model for teams to create new data products
- It democratizes data while retaining centralized governance and security controls
- It decreases data project development cycles, saving money and time that can be driven back into the business
Because it’s evolved from previous approaches to data management, data mesh uses many of the same tools and systems that monolithic approaches use, yet exposes these tools in a self-service model combining agility, team ownership, and organizational oversight.
[Join us at Coalesce October 16-19 to learn from data leaders about how they built scalable data mesh architectures at their organizations.]
The architectural elements of a data mesh implementation
How does data mesh accomplish this? What does a data mesh architecture look like?
Central services
The central services component of a data mesh architecture implements the technologies and processes required to enable a self-service data platform with federated computational (automated) governance. It’s further subdivided into two areas: management and discovery.
Management includes functions for provisioning software stacks necessary to process and store data. This software stack will form the data platform that will then be leveraged by various domain teams. Central services implements a solution that creates the resources a team needs to manage a new stack.
Self-service data stacks include a standardized set of infrastructure that each team can leverage. This includes storage subsystems (object storage, databases, data warehouses, data lakes), data pipeline tools to import data from raw sources, and ELT tools such as dbt. They also include tools for creating versioned data contracts so that teams can register and expose their work to others as a reusable data product.
Management also includes federated computational governance. It enforces access controls, provides tools for enabling and enforcing data classification for regulatory compliance, and enforces policies around data quality and other data governance standards. It also provides centralized monitoring, alerting, and metrics services for organizational data users.
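To make "enforcing policies via code" concrete, here is a minimal sketch of how governance rules might be expressed as small functions that run against a data product's metadata. The metadata fields (`owner`, `classification`, `allowed_roles`), the classification labels, and the policy names are all assumptions made for illustration; they do not correspond to any specific tool.

```python
from typing import Callable, Optional

# A policy inspects product metadata and returns an error message,
# or None when the product passes. (Hypothetical shape, for illustration.)
Policy = Callable[[dict], Optional[str]]

# Assumed classification labels; a real organization would define its own.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "confidential", "pii"}


def requires_owner(meta: dict) -> Optional[str]:
    return None if meta.get("owner") else "product has no owning team"


def requires_classification(meta: dict) -> Optional[str]:
    if meta.get("classification") not in ALLOWED_CLASSIFICATIONS:
        return "product has no valid data classification"
    return None


def pii_requires_restricted_access(meta: dict) -> Optional[str]:
    if meta.get("classification") == "pii" and not meta.get("allowed_roles"):
        return "PII product must restrict access to named roles"
    return None


POLICIES: list[Policy] = [
    requires_owner,
    requires_classification,
    pii_requires_restricted_access,
]


def evaluate(meta: dict, policies: list[Policy] = POLICIES) -> list[str]:
    """Run every policy and collect the failures, in policy order."""
    failures = []
    for policy in policies:
        message = policy(meta)
        if message is not None:
            failures.append(message)
    return failures


compliant = {"name": "finance.orders", "owner": "finance", "classification": "internal"}
non_compliant = {"name": "sales.leads", "classification": "pii"}

print(evaluate(compliant))
print(evaluate(non_compliant))
```

Because the policies are plain functions, the central team can add or tighten a rule in one place and have it apply to every domain automatically.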
Because central services acts as a clearinghouse for managing an organization’s data, it also serves an important discovery function. Users can use a data catalog to search organizational data and find both raw data and data products that they can incorporate into their own data sets.
Producers (data domains)
The producers represent the collection of data domains owned by data teams.
Architecturally, producers make use of the stacks provisioned for them by centralized services. The producers draw on one or (usually) many data sources through data pipelines to create new data sets.
The output of the producers’ work is one or more data products. A data product exposes a subset of the producers’ data that other teams can leverage in their work. Each data product has a contract that specifies the structure of the data it exposes, as well as access policies that control who can see what data and code.
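As a rough illustration of what a contract could capture, the sketch below models one as a small Python class that describes the expected columns and the roles allowed to read the product. The `DataContract` class, its fields, and the example product name are hypothetical; real contracts are usually declared in a tool's own configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class DataContract:
    """Hypothetical contract: expected columns (name -> Python type)
    plus the roles that may read the data product."""
    product: str
    version: int
    columns: dict[str, type]
    allowed_roles: frozenset[str] = field(default_factory=frozenset)

    def violations(self, row: dict) -> list[str]:
        """Return the problems found in a single row, if any."""
        problems = []
        for name, expected in self.columns.items():
            if name not in row:
                problems.append(f"missing column: {name}")
            elif not isinstance(row[name], expected):
                problems.append(f"wrong type for {name}: expected {expected.__name__}")
        for name in row:
            if name not in self.columns:
                problems.append(f"unexpected column: {name}")
        return problems

    def can_read(self, role: str) -> bool:
        return role in self.allowed_roles


orders_contract = DataContract(
    product="finance.orders",
    version=1,
    columns={"order_id": int, "amount": float, "currency": str},
    allowed_roles=frozenset({"finance_analyst", "sales_ops"}),
)

good_row = {"order_id": 1, "amount": 19.99, "currency": "USD"}
bad_row = {"order_id": "1", "amount": 19.99}

print(orders_contract.violations(good_row))
print(orders_contract.violations(bad_row))
print(orders_contract.can_read("marketing_intern"))
```

The point of the sketch is that both the shape of the data and the access policy travel with the product, so consumers can check them without asking the producing team.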
Consumers
Consumers take the output from the producers and use it to drive business decisions. Consumers can be salespeople or decision-makers developing BI reports, analytics engineers further refining data for data analytics, data scientists building machine learning or AI models, or others.
Additionally, producers and consumers often overlap. A team can be a consumer of one team’s data while also producing data that another team uses. Because every team publishes its output as data products that others can discover through the central data catalog, it’s easy for teams to build a web of connectivity between each other. This is what puts the “mesh” in “data mesh.”
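One way to picture this web is as a dependency graph built from catalog entries. The sketch below assumes a hypothetical catalog that records, for each data product, the products it consumes, and then walks the graph to find everything downstream of a given product. The product names are invented for the example.

```python
from collections import defaultdict

# Hypothetical catalog: data product -> the data products it consumes.
catalog = {
    "finance.invoices": [],
    "sales.orders": [],
    "sales.revenue_by_region": ["sales.orders", "finance.invoices"],
    "marketing.campaign_roi": ["sales.revenue_by_region"],
}


def downstream_of(product: str, catalog: dict[str, list[str]]) -> set[str]:
    """Return every product that depends on `product`, directly or transitively."""
    # Invert the catalog: upstream product -> its direct consumers.
    consumers = defaultdict(set)
    for name, inputs in catalog.items():
        for upstream in inputs:
            consumers[upstream].add(name)

    seen: set[str] = set()
    stack = [product]
    while stack:
        current = stack.pop()
        for consumer in consumers.get(current, ()):
            if consumer not in seen:
                seen.add(consumer)
                stack.append(consumer)
    return seen


print(sorted(downstream_of("sales.orders", catalog)))
```

Here the Sales team is both a consumer (of `finance.invoices`) and a producer (for Marketing), which is the overlap described above.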
How the pieces fit together
With these pieces in place, a workflow across the architectural components emerges.
Data governance leaders (a combination of business stakeholders and members of the data platform team) define policies for data governance and data quality. Centralized services then implement support for self-service data products and federated computational governance, enforcing data governance policies via code.
From there, data producers use the self-service data platform to create a new stack they can use to create a new data product, using the data catalog supported by central services to find other data and data products.
Once ready, producers publish the initial version of their data product to the data catalog, where consumers and other producers can find it. Consumers find and utilize data products either as an end product in itself (reports) or as input to another process (a machine learning model).
As data and business requirements evolve, data producers release new versions of their data products under new versions of their contracts, keeping the previous contract version available for a period to preserve backwards compatibility. Consumers and other producers receive alerts about the new version of the contract for the data product they’re consuming, and update their workflows to use this new contract before the previous version expires.
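Deciding when a change needs a new contract version can itself be automated. The sketch below compares two versions of a contract, each represented as a hypothetical mapping of column names to type names, and lists the changes that would break existing consumers. It assumes that removing a column or changing its type is breaking while adding a column is not; your organization may draw that line differently.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes in `new` that would break consumers built against `old`.

    Assumption: removed columns and changed types are breaking;
    newly added columns are treated as backwards compatible.
    """
    changes = []
    for column, old_type in old.items():
        if column not in new:
            changes.append(f"removed column: {column}")
        elif new[column] != old_type:
            changes.append(f"changed type of {column}: {old_type} -> {new[column]}")
    return changes


# Hypothetical contract versions for an orders data product.
v1 = {"order_id": "int", "amount": "float", "region": "str"}
v2 = {"order_id": "int", "amount": "decimal", "channel": "str"}

for change in breaking_changes(v1, v2):
    print(change)
```

A check like this could run whenever a producer proposes a change, so that a non-empty result triggers a new contract version and an alert to the consumers listed in the catalog.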
Centralized services and data governance leaders work to onboard more teams to the self-service data platform and use metrics on data quality and usage from the data catalog to measure progress towards KPIs.
What data mesh brings to modern data architecture
These components of data mesh all work together to bring a number of benefits to your existing data architecture.
Scalability
Scalability comes from two areas. First, it comes from the self-service data platform. In a monolithic data management architecture, employees who want a report or data set created have to send the request to a central data engineering team, which inevitably results in large backlogs and delays. With a self-service platform, data producers can automatically provision the capabilities they need to create new data products.
Second, scalability comes from the data producer layer itself. Each team (and possibly each data product) can request the computing resources it needs to store, transform, and analyze data. Each data domain, architecturally, runs as its own separate data processing hub.
Increased trust in data
The centralized services layer supports a data catalog that enables all data producers to register their data products and data sources in a single location. The data catalog serves as the single source of truth within the company. This enables producers to own their own data domains while the company enforces data quality and classification standards across all owners.
Via the data catalog and other data governance tools, the data governance team can quantify and track the quality of data across the company. For example, it can report statistics on the accuracy, consistency, and completeness of the data it monitors, as well as produce reports on how much of the company’s data is properly classified.
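As a simple illustration of how such statistics might be computed, the sketch below measures completeness for a set of rows and the share of data products that carry a classification. The function names, the sample rows, and the `classification` metadata field are assumptions for the example, not part of any particular catalog's API.

```python
def completeness(rows: list[dict], required: list[str]) -> float:
    """Share of required fields that are present and not None across all rows."""
    total = len(rows) * len(required)
    if total == 0:
        return 1.0  # nothing to check, so nothing is missing
    filled = sum(
        1 for row in rows for column in required if row.get(column) is not None
    )
    return filled / total


def classified_share(products: list[dict]) -> float:
    """Share of data products whose metadata includes a classification."""
    if not products:
        return 0.0
    classified = sum(1 for product in products if product.get("classification"))
    return classified / len(products)


# Hypothetical sample data.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3},
]
products = [
    {"name": "finance.orders", "classification": "internal"},
    {"name": "sales.leads", "classification": "pii"},
    {"name": "sales.orders", "classification": "internal"},
    {"name": "marketing.clicks"},
]

print(f"completeness: {completeness(rows, ['id', 'amount']):.0%}")
print(f"classified: {classified_share(products):.0%}")
```

Tracking numbers like these over time gives the governance team a way to see whether data quality is improving as more domains come onto the platform.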
Finally, because data domain teams own their own data, they can ensure that it’s kept up to date and that its structure reflects the changing realities of their business. All of these factors lead to an increased trust in data as companies move closer towards a data mesh architecture.
Greater reliability and reduced rework
The self-service data platform also helps enforce uniformity across data domains through the use of contracts for data products.
One of the primary sources of data issues is the disruption caused by sudden and unexpected changes in the format of data. Because data products are required to have contracts, producers can alert consumers to upcoming breaking changes. This improves reliability across the data ecosystem, saves everyone involved time, and further increases confidence and trust in the company’s data.
Faster deployment of data products
The increased scalability, increased trust in data, and greater reliability of data mean teams can bring new data products to market faster.
One of the largest obstacles to launching new data products is finding data you can trust. In a 2022 study by Enterprise Strategy Group, 46% of respondents said that identifying the quality of source data was an impediment to using data effectively. By building increased trust in data, companies can empower data domain teams to ship innovative ideas in less time.
Data mesh uses new and existing data management technologies to create a distributed, federated approach to managing data at scale. Understanding what each layer contains and how each one works together gives you a roadmap for transitioning to the next evolution of modern data architecture.
Last modified on: Jun 03, 2024