Key components of data mesh: self-serve data platform
In our series on data mesh, we’ve looked at each of the four principles behind data mesh architecture. In this article, we’re looking at one of the most exciting ones: the self-serve data platform—what it is, why you need it, and how to begin building one.
What is a self-serve data platform?
A self-serve data platform is a data platform that supports creating new data products without the need for custom tooling or specialized knowledge. It’s the technical heart of data mesh, enabling data domain teams to take ownership of their data without unnecessary bottlenecks.
The self-serve data platform fulfills three critical functions:
- Support developing data products
- Enable data discovery
- Institute data governance policies
Support developing data products
In a data mesh architecture, data domain teams take control of their data and all related artifacts and processes. The end product of their work is a data product - a self-contained unit of data that directly solves a customer or business problem.
The data product consists of a number of assets. Chief among these is the data contract, which defines the data product’s ownership, structure, and current version. Other assets include data transformation pipelines, tests, orchestration pipelines, and data storage, among others.
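To make the idea of a data contract concrete, here is a minimal sketch in Python. It assumes a contract carries only the three things named above (ownership, structure, and current version). The class names, field names, and the example `orders` product are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str              # e.g. "string", "integer", "timestamp"
    nullable: bool = True


@dataclass(frozen=True)
class DataContract:
    product: str            # name of the data product
    owner: str              # owning data domain team
    version: str            # current contract version
    columns: tuple[Column, ...] = ()

    def column_names(self) -> list[str]:
        """Return the column names in declaration order."""
        return [c.name for c in self.columns]


# Hypothetical contract for an illustrative "orders" data product.
orders_contract = DataContract(
    product="orders",
    owner="sales-domain-team",
    version="1.0.0",
    columns=(
        Column("order_id", "integer", nullable=False),
        Column("customer_email", "string"),
        Column("ordered_at", "timestamp", nullable=False),
    ),
)
```

Making the contract immutable (`frozen=True`) is a design choice in this sketch: any change to the structure has to be published as a new contract object, which pairs naturally with bumping the version.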
In a data mesh approach, a centralized data platform team creates the infrastructure and tooling that supports creating data products. This ensures that each team is creating data products in a uniform and consistent manner.
Enable data discovery
The distributed nature of data mesh means there needs to be a systematic way for teams to discover and use the data products that other teams have developed. This fosters collaboration and helps prevent data from becoming “siloed,” locked away and leveraged only by those in the know. As part of the self-serve data platform, the data platform team provides tools such as a data catalog for registering, discovering, and enriching data products.
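As a rough illustration of registering, discovering, and enriching data products, the sketch below uses a small in-memory catalog. It is an assumption-laden toy: the `DataCatalog` class and its methods are hypothetical, and a real catalog would be a shared service with persistence and access control.

```python
class DataCatalog:
    """Hypothetical in-memory catalog of data products."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    def register(self, name: str, owner: str, description: str, tags=()) -> None:
        """Add (or overwrite) a data product entry."""
        self._entries[name] = {
            "owner": owner,
            "description": description,
            "tags": set(tags),
        }

    def enrich(self, name: str, *tags: str) -> None:
        """Attach extra tags to an existing entry."""
        self._entries[name]["tags"].update(tags)

    def search(self, term: str) -> list[str]:
        """Return names of products whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            name
            for name, entry in self._entries.items()
            if term in name.lower()
            or term in entry["description"].lower()
            or term in {t.lower() for t in entry["tags"]}
        )


catalog = DataCatalog()
catalog.register("orders", "sales-domain-team", "One row per customer order")
catalog.register("shipments", "logistics-domain-team", "Parcel tracking events")
catalog.enrich("shipments", "orders")  # record that shipments relate to orders
```

With this setup, `catalog.search("orders")` finds both products: one by name and one through the tag another team added.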
Institute data governance policies
Another potential challenge with the distributed nature of data mesh is maintaining a consistent quality bar across data domains. Without a way to enforce data quality and compliance standards, a data mesh can quickly turn into data anarchy.
That’s why another key component of the self-serve data platform is federated computational governance. This consists of a set of data governance tools that monitor data mesh nodes to ensure all data meets organizational standards for formatting, validity, accuracy, timeliness, and compliance. It ensures that all potentially “sensitive” data - e.g., a customer’s Personally Identifiable Information (PII) - is properly classified, secured, and stored.
Benefits of a self-serve data platform
A self-serve data platform gives a data mesh several advantages over more “monolithic” approaches to storing and managing data.
Reduces “shadow IT” and unnecessary duplicate costs
In a monolithic approach to data architecture, the data storage system often becomes so unwieldy that only a handful of people know it well enough to build upon it. This leads some teams to build “shadow IT” data stacks that are simpler to use. That, in turn, leads to redundant effort and licensing costs.
A self-serve data platform prevents teams from feeling like they have to “reinvent the wheel.” It reduces shadow IT efforts and gives everyone a single, consistent toolset. It also enables the data platform team to reduce spend by negotiating bulk licensing discounts and optimizing cloud spend on expensive resources such as data warehouses.
Easier to onboard data product developers
By providing a consistent set of tooling for data products, the data platform team can also simplify the experience of creating new products.
Instead of delivering a complicated set of tools that only other data engineers might find usable, the platform team can incorporate standard languages like SQL that promote simplicity at scale. This potentially expands the pool of employees who can create data products to include analytics engineers, software engineers, business analysts, and more.
Better standardization across teams
Because everyone is using the same toolset to build data products, the end products themselves are more consistent from team to team. This results in better documentation across teams, better interoperability of data sets, easier movement of team members across domains, and better security standardization and data governance.
Capabilities of a self-serve data platform
A self-serve data platform isn’t a simple undertaking. In one of her original articles on the topic, data mesh inventor Zhamak Dehghani gave a long list of capabilities self-serve data platforms offer to data domain teams.
Let’s look at a few of the most important ones below, with the understanding that this is by no means an exhaustive list.
Basic data storage capabilities
If you have data, you need some place to put it. A self-serve data platform administers and provides secure access to storage in multiple forms - RDBMS, NoSQL, object storage, data warehouses, data lakes, etc. It also standardizes transfer and storage protocols (e.g., encryption at rest and in transit).
Data product creation
The self-serve data platform also offers all the tooling necessary to turn data storage into data products. This can include:
- A format for defining and testing model contracts, which define the commitments the data domain team makes to its consumers about the data it emits.
- A data transformation layer, such as dbt models, for converting source data into the formats required by the data domain team.
- Tools for orchestrating and running data pipeline jobs to transform new, incoming data periodically.
- Source control (i.e., Git support) for tracking changes to contracts, transformations, and other project assets.
- Support for data caching, data access API creation, and other related functionality.
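To illustrate what testing a contract can mean in practice, here is a minimal sketch that checks rows against declared column types and nullability. It is plain Python rather than the contract syntax of any specific tool, and the `ORDERS_CONTRACT` layout and `validate_rows` helper are hypothetical.

```python
def validate_rows(contract: dict, rows: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means the rows conform."""
    errors = []
    for i, row in enumerate(rows):
        for column, rule in contract.items():
            value = row.get(column)
            if value is None:
                # Missing and null values are treated the same way here.
                if not rule["nullable"]:
                    errors.append(f"row {i}: {column} must not be null")
            elif not isinstance(value, rule["type"]):
                errors.append(f"row {i}: {column} should be {rule['type'].__name__}")
        # Columns the contract does not declare are also reported.
        for extra in sorted(row.keys() - contract.keys()):
            errors.append(f"row {i}: unexpected column {extra}")
    return errors


# Hypothetical contract: column name -> expected Python type and nullability.
ORDERS_CONTRACT = {
    "order_id": {"type": int, "nullable": False},
    "amount": {"type": float, "nullable": False},
    "coupon": {"type": str, "nullable": True},
}

rows = [
    {"order_id": 1, "amount": 19.99, "coupon": None},
    {"order_id": "2", "amount": 5.0},
    {"order_id": 3, "amount": None, "note": "rush"},
]
violations = validate_rows(ORDERS_CONTRACT, rows)
```

In this example the first row conforms, the second has a string where an integer is expected, and the third has a null in a required column plus an undeclared column. A platform could run a check like this in the pipeline and refuse to publish a data product that breaks its own contract.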
Data discovery
The data catalog and accompanying data lineage provide basic data discoverability capabilities. Together, they provide visibility into how data flows throughout the entire company, which enables advanced data debugging techniques such as root cause analysis and change impact analysis.
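Both root cause analysis and change impact analysis can be viewed as walks over a lineage graph: downstream to see what a change affects, upstream to see where a problem might originate. The sketch below assumes lineage is available as a simple mapping from each asset to its direct consumers. The asset names and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["orders"],
    "stg_customers": ["orders", "customer_ltv"],
    "orders": ["revenue_dashboard", "customer_ltv"],
    "customer_ltv": [],
    "revenue_dashboard": [],
}


def downstream(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Change impact: breadth-first walk over everything that depends on `asset`."""
    seen: set[str] = set()
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(lineage.get(node, []))
    return sorted(seen)


def upstream(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Root cause candidates: every asset that feeds `asset`, directly or not."""
    reverse: dict[str, list[str]] = {}
    for parent, children in lineage.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    # Walking the reversed graph "downstream" is the same as walking upstream.
    return downstream(asset, reverse)
```

For instance, `downstream("stg_orders", LINEAGE)` lists the assets to re-test before changing that model, while `upstream("revenue_dashboard", LINEAGE)` lists where to look when the dashboard shows bad numbers.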
Security and data governance
A self-serve data platform must also provide methods for restricting data access to authorized individuals.
On top of security, the platform must provide a uniform approach to data governance. Identifying sensitive values, such as a customer’s Personally Identifiable Information (PII), ensures they’re properly handled and never fall into unauthorized hands. It also simplifies compliance with laws such as the General Data Protection Regulation (GDPR) that give consumers the right to control their information.
Data governance support includes:
- An ability to tag, classify, and mask/redact sensitive information
- Tools (like a data catalog) for discovering where a customer’s data lives throughout the company
- Data quality policy enforcement tools that ensure data is formatted, stored, and updated to a consistent level of (high) quality across all data domain teams
- Anomaly detection and alerting (e.g., issuing warnings when a field isn’t tagged as sensitive but appears to contain PII)
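Two of the items above, masking tagged fields and warning about untagged fields that look like PII, can be sketched in a few lines. This is a simplified illustration: it only detects email-like strings using a deliberately loose pattern, and the `SENSITIVE` tag set, helper names, and sample rows are hypothetical.

```python
import re

# Loose, illustrative pattern; real PII detection needs far more than this.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def mask(value: str) -> str:
    """Redact all but the first character of a sensitive value."""
    return value[:1] + "*" * (len(value) - 1) if value else value


def apply_masking(row: dict, sensitive: set[str]) -> dict:
    """Return a copy of `row` with every column tagged sensitive masked."""
    return {
        col: mask(str(val)) if col in sensitive and val is not None else val
        for col, val in row.items()
    }


def untagged_pii(rows: list[dict], sensitive: set[str]) -> list[str]:
    """Flag columns that are not tagged sensitive but hold email-like values."""
    flagged = set()
    for row in rows:
        for col, val in row.items():
            if col not in sensitive and isinstance(val, str) and EMAIL.fullmatch(val):
                flagged.add(col)
    return sorted(flagged)


SENSITIVE = {"email"}  # hypothetical tags applied by a domain team
rows = [
    {"id": 1, "email": "ana@example.com", "contact": "ana@example.com"},
    {"id": 2, "email": "bo@example.com", "contact": "call after 5pm"},
]
```

Here `apply_masking` redacts the tagged `email` column, while `untagged_pii(rows, SENSITIVE)` reports the free-text `contact` column because one of its values looks like an email address.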
Reporting and monitoring
As part of its base-level support for data products, the self-serve data platform should also provide basic support for logging and monitoring (e.g., auditing data changes and logging access requests). It can also provide useful statistics about data across the company, including:
- Which data assets are the most/least used
- Metrics on overall data quality and corporate compliance
- Alerts on important changes - e.g., alerting downstream consumers of a data product when a new version of the data contract has been published
- Reporting on costs and usage of the platform
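As a small sketch of two of these statistics, the code below ranks data products by usage and builds alerts for downstream consumers when a contract version changes. It assumes the platform records access requests as (team, product) pairs. The log contents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical access log entries: (consumer team, data product).
ACCESS_LOG = [
    ("finance", "orders"),
    ("marketing", "orders"),
    ("finance", "orders"),
    ("marketing", "customer_ltv"),
]


def usage_ranking(log, known_products):
    """Return (product, access_count) pairs, most used first, unused included."""
    counts = Counter(product for _, product in log)
    return sorted(
        ((p, counts.get(p, 0)) for p in known_products),
        key=lambda pair: (-pair[1], pair[0]),
    )


def contract_change_alerts(log, product, old_version, new_version):
    """Build one alert per team that has read `product`, if the version changed."""
    if old_version == new_version:
        return []
    consumers = sorted({team for team, p in log if p == product})
    return [
        f"{team}: contract for {product} changed from {old_version} to {new_version}"
        for team in consumers
    ]
```

Passing the full list of known products to `usage_ranking` matters: products that never appear in the log show up with a count of zero, which is how the least-used assets become visible.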
How to implement a self-serve data platform
How do you build a self-serve data platform from the ground up?
The good news is, you’re probably already running some of the technological components - data warehouses, data catalogs, transformation tools, pipelines, etc. - that the new system will need. Beyond technology, however, you’ll also need to determine how to structure your data and how different data domain teams will interface with each other.
Here are some general guidelines to get you started.
Obtain executive buy-in
A self-serve data platform isn’t a minor undertaking. It’ll require a significant, ongoing investment in data platform engineering resources.
The first step is creating a business plan showing how the platform will deliver a return on investment. You can calculate this by contrasting the estimated cost of the new system in terms of time, labor, and licensing with revenue and cost-saving factors such as:
- The business value of the data projects that the new approach will unblock
- The reduction in time/resources spent debugging data quality issues
- The reduction in redundant infrastructure effort and associated licensing costs
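The comparison described above can be expressed as a simple return-on-investment ratio: total estimated benefit minus total estimated cost, divided by cost. The sketch below is only an illustration of that arithmetic. The `roi` helper is hypothetical, and every figure is a placeholder rather than a benchmark.

```python
def roi(total_cost: float, benefits: dict[str, float]) -> float:
    """Return (total benefit - total cost) / total cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (sum(benefits.values()) - total_cost) / total_cost


# Placeholder figures for illustration only; substitute your own estimates.
estimate = roi(
    total_cost=500_000,  # time, labor, and licensing for the new platform
    benefits={
        "unblocked_data_projects": 400_000,
        "less_data_quality_debugging": 150_000,
        "less_redundant_infrastructure": 100_000,
    },
)
```

Keeping the benefits as named line items, rather than a single total, makes it easier for stakeholders to challenge or refine each assumption separately.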
Create first cut of self-serve data platform architecture
Identify the technology you currently have to support a self-serve data platform, as well as any gaps (e.g., data contract authoring/validation tools). Design a rollout plan for the new platform that introduces key features over successive releases. The platform itself is a product, with releases, roadmaps, SLAs, etc.
For example, your initial release might contain the minimum features required to support self-service for new data products built on top of your data warehouse. This may include scoping permissions correctly for each team to access the storage and existing data they need, and providing a data catalog for all teams to discover and view data contracts for other teams’ data products.
Onboard first key team and iterate
When you’re ready, onboard your first key team to the new platform. It’s important to move slowly and onboard a single team at a time. This enables you to vet your initial assumptions and make adjustments during onboarding. It also gives you a quick win you can use to demonstrate the value of the new system to other stakeholders and senior leadership.
Once you’ve onboarded your first team, continue onboarding additional teams while also iterating on the capabilities of the self-serve data platform.
The self-serve data platform is the beating heart of a data mesh, powering a decentralized, distributed data architecture that empowers teams to take ownership of their data. Implemented well, it can reduce the time and cost required to bring new data products to market while improving data quality and oversight.
Last modified on: Nov 22, 2023