Data mesh architecture: How did we get here?
We discussed what data mesh is and how companies can benefit from it, but how did this new data architecture come to be? Let’s explore how data mesh addresses the issues that came before it and the road that it’s paving to manage the explosive growth of data.
When we say “data architecture,” we mean not only how data is stored and ingested but, more importantly, who in an organization owns these processes.
The emergence of data mesh architecture
Data mesh is a decentralized approach to data management that enables individual teams to own their own data and associated pipelines.
Done well, data mesh balances two competing yet important priorities: data democratization and data governance. It unblocks data domain teams from waiting on data engineering to implement their data pipelines for them, which enables faster time-to-market for data-driven products. It also enables data engineering teams to be specific about which datasets are “consumption ready” and as a result, what standards they agree to meet for those data. Data mesh combines this independence with security and policy controls that prevent data democracy from degenerating into a data anarchy.
With data mesh architecture, teams leverage a shared infrastructure consisting of core data as well as tools for creating data pipelines, managing security and access, and enforcing corporate data governance policies across all data products. This architecture enables decentralization while ensuring data consistency, quality, and compliance. It also allows teams to leverage centralized functionality for managing data and data transformation pipelines without each team re-inventing the wheel.
If this seems more complex than “just stick everything in one big database”…well, it is. But there’s a good reason for the complexity.
To understand why we need to strike this balance between data democratization and data governance, let’s look at what came before data mesh and the limitations each approach ultimately ran up against.
The journey to a data mesh architecture
BC (Before Cloud) Appliances
Before cloud systems, data was stored on-premises. Companies stored most of it in large online transaction processing (OLTP) systems built on a symmetric multi-processing (SMP) model like Teradata. These technology stacks were sometimes determined by the CEO and CTO, leaving teams with little flexibility in how to manage their own data.
Some companies—those who could afford it—stood up data warehouses. Built around an online analytical processing (OLAP) model and massively parallel processing (MPP), data warehouses gave data analysts, business analysts, and managers faster access to data for reporting and analytics.
To load data into data warehousing systems, IT or data engineering teams used Extract, Transform, and Load (ETL) processes. This cleaned and transformed the data into a state suitable for general use.
While this approach worked, it had numerous drawbacks:
- The expense of building data warehouses on-prem prohibited many companies from leveraging the technology.
- Most workloads had to go through the IT and (later) data engineering teams, which quickly became bottlenecks.
- These centralized solutions produced one-size-fits-all datasets that couldn’t meet the needs of every team’s use case.
- Many ETL pipelines were hard-coded and brittle, resulting in fire drills whenever one went down due to bad data or faulty code.
- Some teams, frustrated by these bottlenecks, built their own solutions, resulting in data silos and the growth of “shadow data IT.”
The Hadoop era
Before the next major shift came the Hadoop era. Hadoop made it possible to use clusters of machines to process massive datasets in parallel.
Hadoop enabled more teams to run their own data workloads and processes. This gave teams more flexibility than a centralized data warehouse architecture offered.
But Hadoop still had multiple drawbacks. First off, it was hard to work with: query tools were limited, and resource management and performance remained a constant struggle.
In addition, because everyone ran their own Hadoop clusters, it made the shadow data IT issue worse. Multiple teams were now managing their own unmonitored, unregulated data silos. It also pushed a lot of business logic out of data pipelines and into front-end BI tools.
Redshift and the modern data stack
As dbt Labs founder Tristan Handy has written before, the emergence of Amazon Redshift changed the game.
Cloud computing eliminated the need for massive up-front capital expenditures for large-scale computing projects. Any team or company could now spin up a data warehouse, which enabled more organizations to crunch large volumes of data and mine business value from it.
The availability of on-demand computing power also changed the way we processed data. Because cloud computing was so affordable, many teams could move from ETL processes to an ELT (Extract, Load, Transform) model that transformed unstructured and partially structured data after loading it into a data warehouse. Because it leveraged the processing power of the target system and applied a schema-on-read approach, ELT often proved more performant and more flexible.
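To make the ELT ordering concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for a cloud data warehouse, which is an assumption made purely for illustration. The table names, column names, and sample rows are all hypothetical. The point is the order of operations: raw data lands untouched first, and the cleanup runs afterward as SQL inside the target system.

```python
import sqlite3

# Hypothetical raw records; every value arrives as text, including a bad one.
RAW_ORDERS = [
    ("1001", " 19.99 ", "2023-09-01"),
    ("1002", "5.00", "2023-09-01"),
    ("1003", "not-a-number", "2023-09-02"),
]


def load_raw(conn, rows):
    """Extract + Load: land the data as-is, with no cleanup on the way in."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, order_date TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_warehouse(conn):
    """Transform: derive a typed table using the target system's SQL engine."""
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT CAST(order_id AS INTEGER)  AS order_id,
               CAST(TRIM(amount) AS REAL) AS amount,
               order_date
        FROM raw_orders
        -- Crude numeric check for this sketch: digits, a dot, more digits.
        WHERE TRIM(amount) GLOB '[0-9]*.[0-9]*'
        """
    )


def daily_totals(conn):
    """A second transformation over the same loaded data."""
    return conn.execute(
        "SELECT order_date, SUM(amount) FROM orders "
        "GROUP BY order_date ORDER BY order_date"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn, RAW_ORDERS)
    transform_in_warehouse(conn)
    for order_date, total in daily_totals(conn):
        print(order_date, round(total, 2))
```

Because the raw table is kept, a team could later write a different transformation over the same rows (for example, one that reports the rejected records) without re-ingesting anything.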
Most of what we regard as the modern data stack, as Tristan says, revolved around Redshift. BI tools like Looker and Chartio emerged that enabled data engineers and business analysts to plumb data to drive business decisions. More people had access to more data than ever before.
From data warehouses to data lakes (where we are today)
Despite these advances, some of the old problems with data management continued to linger. Concepts such as version control, testing, and transparent data lineage were mostly unheard of. Data warehouses were also still primarily storehouses for structured data. They couldn’t serve as a source of truth for both structured and unstructured data.
And businesses were consuming more unstructured data: large volumes of data from IoT devices, social media feeds, mobile applications, and other sources.
Around this time, more organizations also realized they had a problem with finding data. The cloud didn’t solve the data silo problem; if anything it helped data silos proliferate. That made it harder than ever for data teams and business users to find the data they needed. Even when they found it, they often had no way to tell whether the data was accurate, and it often proved easier to build their own solution from scratch.
This proliferation of data also complicated data governance. Without a way to track all of their data consistently, companies struggled to comply with regulations such as the European Union’s General Data Protection Regulation (GDPR). Many companies lacked the tools to tell who had access to what data.
In addition, while cloud computing made data easier to process than ever, it also made it easier to run up large bills. Cloud charges ballooned as teams ran large data warehouses with minimal oversight.
Data lakes addressed the first problem of storing unstructured data. The data lake gave companies a scalable and cost-effective way to store unstructured data in affordable cloud storage systems like Amazon S3.
Storing raw data also injected additional flexibility into the applications, as data engineers could transform the same data set multiple ways to fit different requirements. This approach also fit well with ELT processes.
On the data discovery front, data catalogs became an increasingly popular way to search, tag, and classify data. Using a data catalog, companies could build a single index of all data and associated metadata across various departments. Many data catalog tools also added or improved their data governance features, enabling them to enforce data policies and flag potential governance violations.
The arrival of data mesh architecture
However, data lakes and data catalogs still don’t address some of the core structural issues around data processing.
Many companies still rely on a centralized model for data storage and processing. A data engineering team must still import data into a central repository (a data warehouse or data lake) via a data pipeline. The problem is that they won’t know whether they did it correctly until the data analysts downstream verify that the data is correct.
In other words, a single team (still) remains a single bottleneck to data processing. And the distance between the data processors and the data owners results in long load-test-fix cycles that delay data project delivery times.
The complexity of these systems means few people understand how all the parts work together. And since there are no contracts or versions associated with any of the stored data, downstream systems have no elegant way to handle change. An alteration to a data type or even the format of a text field can break any application or report that depends on that data.
Neither of these problems is new. They’ve plagued data-driven applications since the on-prem days. But they’ve become more noticeable as data architectures grow more complex in scope and scale.
Data mesh architecture aims to solve these two lingering issues by adopting the same approach to data systems that software engineering teams take to software systems.
Decentralizing data returns power to the data domain teams by entrusting them, not just with their data, but with the entire end-to-end process of managing their data. Teams can eliminate the dependency on data engineering by utilizing their own data transformation pipelines and analytics tools to create new data-driven solutions.
In a data mesh architecture, every team creates not just data, but a data product. A data product is a self-contained data deliverable that acts as a trusted source of data for a specific application.
To ensure other teams can interface with their deliverable, teams create contracts that specify exactly what data the product returns. Whenever teams change their data model, they can version the contract and provide a plan to retire the older version. This gives partner teams time to revise their own data products to remain compatible and prevent breakages.
In other words, each data domain team maintains a deliverable akin to a microservice: a complete, self-contained, deployable unit of functionality, defined explicitly through versioned contracts.
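As a rough illustration of what a versioned contract could look like, the Python sketch below models a contract as a named, numbered set of expected columns and checks rows against it. This is an assumption-laden toy, not how any particular data mesh tool implements contracts: the `DataContract` class, the `violations` function, and the `orders` columns are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: what a data product promises to return."""
    product: str
    version: int
    columns: dict             # column name -> expected Python type
    deprecated: bool = False  # True once a newer version supersedes this one


def violations(contract, rows):
    """Return human-readable problems; an empty list means the rows conform."""
    problems = []
    for i, row in enumerate(rows):
        # Columns the contract promises but the row does not provide.
        for col in sorted(contract.columns.keys() - row.keys()):
            problems.append(f"row {i}: missing column {col!r}")
        # Columns that are present but carry the wrong type.
        for col, expected in contract.columns.items():
            if col in row and not isinstance(row[col], expected):
                problems.append(f"row {i}: {col!r} should be {expected.__name__}")
    return problems


# Version 1 exposed a float amount; version 2 switches to integer cents.
# Version 1 stays published but is flagged as deprecated so consumers can migrate.
ORDERS_V1 = DataContract(
    "orders", 1, {"order_id": int, "amount": float}, deprecated=True
)
ORDERS_V2 = DataContract("orders", 2, {"order_id": int, "amount_cents": int})


if __name__ == "__main__":
    new_rows = [{"order_id": 1001, "amount_cents": 1999}]
    print("v2:", violations(ORDERS_V2, new_rows))
    print("v1:", violations(ORDERS_V1, new_rows))
```

A consumer still pinned to version 1 gets an explicit, checkable signal that the shape has changed, rather than discovering it through a broken report.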
Data mesh architecture builds on the technologies that preceded it. Readily available cloud computing tools and data transformation tools like dbt make it technically and financially feasible for teams to own their own data. Semantic layer tools allow BI applications to provide consistent, self-service analytics. And data catalogs enable tracking of all the data across a company’s data architecture, regardless of where it lives.
The world has witnessed explosive growth in the scale and scope of data in the past decade, and it has created a lot of new opportunities. It has also created new problems and highlighted some old ones that remain unaddressed.
Data mesh architecture is a natural evolution of the growth of data and the need to manage it at scale. By combining local ownership with global oversight, data mesh can help organizations deliver new data-driven solutions with greater speed and reliability.
Last modified on: Sep 20, 2023