Key components of data mesh: Federated computational governance
In our series on data mesh architecture, we’ve looked at what the data mesh is, its components, and the four principles behind data mesh. Let’s review how federated computational governance helps ensure security and compliance across all of the nodes of a data mesh, preventing it from becoming a data anarchy.
Why we need federated computational governance in data mesh
In a data mesh architecture, data producers own and operate their own data domains via a self-serve data platform. This enables companies to assign ownership of data by business function as opposed to by who owns the underlying technology. An individual data domain can produce data for and consume data from other domains, thus putting the “mesh” in data mesh.
That raises the question: how do you maintain consistency and oversight in such a distributed, highly scalable data network? There are several reasons organizations need to exercise some measure of oversight over the data that domain teams manage:
- Ensure data quality. Poor quality data - malformed dates, missing values, corrupt structured data, etc. - can lead to poor decision-making and broken data pipelines. According to some estimates, poor data quality costs organizations up to $12.9 million a year. Enforcing standards around data formatting and testing can catch these errors before they happen - and save the company millions in expensive troubleshooting.
- Verify regulatory compliance. Regulations such as the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), and many others worldwide specify firm rules around handling consumer data - and stiff penalties for violating them. This requires that companies have mechanisms for tagging, classifying, protecting, and expunging consumer data no matter where it lives in the organization.
- Simplify data interoperability. If different teams define similar concepts in different ways, it can complicate sharing data and building on one another’s work. Companies that define standard data structures for common concepts such as “customer” or “product” help foster a data environment where interoperability is painless.
- Provide accountability and transparency. A clearly documented set of data governance policies defines the rules every data product team must follow in an open and auditable format. This minimizes the risk of errors and intentional abuse.
This is where federated computational governance comes into play. Each data domain team continues to own its data (plus associated assets and processes). Each team is also required to register its data products with one or more data governance platforms. The data governance platforms, in turn, run automated data governance policies that ensure all data products conform to organizational standards for quality and compliance.
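The registration-plus-automated-policy loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real governance platform's API: the `DataProduct` structure, the registry, and both policy functions are invented for the example.

```python
# Minimal sketch of federated computational governance: domain teams
# register data products centrally, and automated policies run against
# every registered product. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner_team: str
    schema: dict               # column name -> type
    tags: set = field(default_factory=set)

# Central registry that every domain team publishes into.
registry: list[DataProduct] = []

def policy_has_owner(product: DataProduct) -> list[str]:
    """Every data product must declare an owning team."""
    return [] if product.owner_team else [f"{product.name}: missing owner"]

def policy_pii_tagged(product: DataProduct) -> list[str]:
    """Columns that look like PII must carry a 'pii' tag."""
    pii_like = {"email", "ssn", "birth_date"}
    if pii_like & set(product.schema) and "pii" not in product.tags:
        return [f"{product.name}: PII columns present but not tagged 'pii'"]
    return []

POLICIES = [policy_has_owner, policy_pii_tagged]

def run_governance_scan() -> list[str]:
    """Apply every organizational policy to every registered product."""
    return [v for product in registry
              for policy in POLICIES
              for v in policy(product)]

# A domain team registers its product; the platform scans everything.
registry.append(DataProduct("orders", "checkout-team",
                            {"order_id": "int", "email": "str"}))
violations = run_governance_scan()
print(violations)
```

The key property is that domain teams keep ownership of their data while a single set of organization-wide policies runs against every registered product.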
Hallmarks of federated computational governance
Governance isn’t a new concept. Many of the tools used to implement federated computational governance - data catalogs and data governance platforms - pre-date the concept of data mesh.
Federated computational governance in a data mesh is unique in that it balances decentralized autonomy with global interoperability. It combines the benefits of data mesh’s scalable data model and local ownership with automated implementation of policies and best practices. It accomplishes this by being federated, automated, scalable, and collaborative.
Data mesh is an alternative to data architectures that separate the operational and analytical data planes. Data mesh, by contrast, seeks to harmonize these two planes through a federated model that focuses on data domains instead of technology.
Federation frees data domain teams from the downsides of working with data monoliths and gives them greater control over their own data and its processes. But it does so without sacrificing oversight. Without federation via data governance software, each data domain risks becoming its own data silo, enforcing (or, even worse, not enforcing) its own data governance and quality rules.
Another aspect of federated computational governance is automation. It leverages data governance software to implement data governance, security, compliance, data quality, and more as code - e.g., as declarative policies, scripts, etc.
This isn’t to say there are never human-verified quality gates, checks, or inspections in a data mesh architecture. However, the majority of data governance and quality checks are automated in order to achieve greater scalability and faster delivery time for new data products.
Data governance policies are written down in both human- and machine-readable formats. That means they can be checked in to Git and version controlled, code reviewed, tested, and read by others. This increases accountability, transparency, and accuracy in the data governance process.
Federation and automation mean that data projects can scale faster. Automated policies can verify more data in a shorter time than a human ever could through manual inspection. They can also react to fast-changing data more quickly. This helps establish confidence that a new data product is high quality and fully compliant with all appropriate regulations, which means it can ship faster.
Automation can also increase scalability by performing some data governance actions automatically. For example, if a field is marked as sensitive, an automated process could use data lineage information to propagate this sensitivity tag to all of the data sources both upstream and downstream from the field’s data store.
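The tag-propagation example above amounts to a graph traversal over lineage metadata. Here is a hedged sketch: the lineage graph, its field names, and the symmetric edge representation are all made up for illustration.

```python
# Hypothetical sketch: propagating a sensitivity tag through a lineage
# graph. Edges are stored symmetrically, so a single breadth-first walk
# covers both upstream and downstream fields.

from collections import deque

# field -> set of directly connected fields (illustrative lineage)
lineage = {
    "crm.email":          {"staging.email"},
    "staging.email":      {"crm.email", "mart.contact_email"},
    "mart.contact_email": {"staging.email"},
    "mart.order_total":   set(),
}

def propagate_tag(start: str, graph: dict[str, set]) -> set[str]:
    """Return every field reachable from `start`, i.e. every field
    that should inherit its sensitivity tag."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbor in graph[queue.popleft()]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Marking crm.email as sensitive tags everything connected to it.
tagged = propagate_tag("crm.email", lineage)
print(tagged)
```

A real data catalog would read this graph from its lineage store rather than an in-memory dict, but the traversal logic is the same.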
In many companies, compliance is often perceived as red tape - an obstacle. When imposed from the top down with no transparency, it can generate resentment and frustration.
In a data mesh, federated computational governance is more of a community function in which data engineering, data governance, and data producers collaborate on data quality and governance. Since policies are clearly set forth in code kept under source control, teams can work together on refining data governance policies to cover any unexpected edge cases.
How federated computational governance works
So how do you implement federated computational governance in practice?
As with the other components of a data mesh, federated computational governance isn’t a one-and-done project. It’s an ongoing effort involving trial-and-error that grows in both scope and quality with each iteration. It’s also one that grows in tandem with the other components of your overall data mesh architecture.
Here’s a brief roadmap on how to get started.
Build data governance into a self-serve data platform
We’ve written before about how the self-serve data platform acts as the technological framework for a data mesh implementation. Federated computational governance is a critical part of the self-serve data platform.
A self-serve data platform will usually provide all or most of the components below to support federated computational governance:
- Data catalog. A data catalog acts as a single source of truth for all data within a company. It stores information such as a dataset’s sources, its metadata, and its lineage (the relationships between a data asset, its producers, and its consumers). It enables easy discovery of data, enrichment of metadata, and tagging and classification of sensitive information.
- Data transformation tools. A data mesh often also provides support for a data transformation layer. Data transformation tools, such as dbt, can act as a common intermediary layer for all of the data stores in your modern data stack. In particular, dbt provides support for data governance via data lineage generation, rich documentation and metadata support, tests, model contracts, and model versions.
- Data governance policy software. This is the toolset that interprets and runs automated policies on your data. It may be a built-in part of your data catalog or an extension of it.
- Access control framework. The platform will also specify a set of role-based rules for who can access what data, particularly data marked as sensitive - e.g., a user’s Personally Identifiable Information, or PII. This may utilize one or more security mechanisms, including Microsoft Active Directory, OAuth, SAML-based identity frameworks, Amazon Web Services (AWS) Identity and Access Management (IAM), and others.
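A role-based access check like the one the last component describes can be sketched as follows. The roles and grant table are invented; a real deployment would delegate these decisions to an identity provider (IAM, Active Directory, a SAML-based framework) rather than an in-memory dict.

```python
# Illustrative column-level, role-based access check. COLUMN_GRANTS is a
# made-up policy table: analysts cannot read PII columns such as email,
# while support agents can.

COLUMN_GRANTS = {
    "analyst":       {"order_id", "order_total"},
    "support_agent": {"order_id", "order_total", "email"},
}

def can_read(role: str, column: str) -> bool:
    """Return True if `role` is granted read access to `column`.
    Unknown roles get no access (deny by default)."""
    return column in COLUMN_GRANTS.get(role, set())

print(can_read("analyst", "email"))        # False: email is PII
print(can_read("support_agent", "email"))  # True
```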
Design and implement data governance policies and access controls
The data governance and data engineering teams work together to define the policies that will govern all data in the organization. These become the rules to which all data domain teams must adhere when they publish new data products. The data engineering team builds these into the self-serve data platform as automations that run against the data described in the data catalog.
A company might implement numerous types of policies, including:
- Security policies - who can access what data, both at a tabular and columnar level.
- Compliance and privacy policies - which field formats indicate potentially sensitive data (e.g., address, national ID number, birth date) and should be labeled as such.
- Data quality policies - standards for the format of common fields, such as dates, as well as standards for accuracy (e.g., decimal-point precision in financial values; use of common measurement units), duplicate data, and freshness.
- Interoperability policies - policies that define standard representations for common data objects; policies requiring the use of data contracts and other assets that define data products and make it easier for one team to leverage another team’s data.
- Documentation and metadata policies - which metadata should accompany all data assets and how teams should document assets to convey their business purpose and proper usage.
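To make one of the policy types above concrete, here is a sketch of a data quality policy enforcing ISO-8601 dates and two-decimal precision on monetary fields. The field names and specific rules are illustrative, not a prescribed standard.

```python
# Sketch of an automated data quality policy. Each violation is returned
# as a human-readable string so it can feed a governance dashboard.

from datetime import date
from decimal import Decimal

def check_row(row: dict) -> list[str]:
    """Return a list of quality violations for one record."""
    problems = []
    # Dates must be ISO-8601 (YYYY-MM-DD).
    try:
        date.fromisoformat(row["order_date"])
    except ValueError:
        problems.append(f"bad date: {row['order_date']!r}")
    # Monetary values must have at most two decimal places.
    if -Decimal(row["amount"]).as_tuple().exponent > 2:
        problems.append(f"too many decimals: {row['amount']}")
    return problems

print(check_row({"order_date": "2023-11-22", "amount": "19.99"}))   # []
print(check_row({"order_date": "11/22/2023", "amount": "19.999"}))  # 2 violations
```

In practice, checks like these often live in the transformation layer (e.g., as dbt tests) so they run automatically on every build.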
Register for data governance monitoring and respond to issues
With this framework in place, data domain teams register their data, data contracts, and other assets (such as dbt projects) with the data catalog. The central data governance policies then run periodically against the team’s data, looking for potential compliance violations.
If the data governance platform detects a potential violation, it can send a notification to the data owners on the data domain team. One or more team members responsible for compliance can navigate to a dashboard to see a list of all outstanding issues, along with a due-by date for resolution. The compliance point person can then work with their team to deploy fixes for these issues that will be picked up in the next scan.
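The scan-to-resolution workflow above might record violations as issues with an owner and a due-by date. This is a hedged sketch; the issue structure and the fourteen-day grace period are invented, not a specific vendor's model.

```python
# Turning raw policy violations into actionable dashboard issues,
# each assigned to the owning domain team with a resolution deadline.

from datetime import date, timedelta

def file_issues(violations: list[str], owner: str,
                grace_days: int = 14) -> list[dict]:
    """Record each violation as an open issue with a due-by date."""
    due = date.today() + timedelta(days=grace_days)
    return [{"owner": owner, "detail": v,
             "due_by": due.isoformat(), "open": True}
            for v in violations]

issues = file_issues(["orders: email column missing 'pii' tag"],
                     owner="checkout-team")
print(issues)
```

The compliance point person on each team would work through their open issues before the next scheduled scan.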
Data domain teams may also have additional programmatic requirements for compliance that they must meet. As an example, a team may need to supply some method (e.g., a REST endpoint) to help the company comply with right to erasure requests under GDPR.
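An erasure handler of the kind a domain team might expose behind such a REST endpoint could look like this. The in-memory `store` stands in for the team's real data stores, and the user IDs are made up; a production version would also have to cascade deletions through downstream systems.

```python
# Hypothetical handler for a GDPR right-to-erasure request. A real
# implementation would sit behind an authenticated REST endpoint and
# delete from durable storage, not an in-memory dict.

store = {
    "user-123": {"email": "a@example.com", "orders": [42, 43]},
    "user-456": {"email": "b@example.com", "orders": [44]},
}

def handle_erasure_request(user_id: str) -> dict:
    """Delete all records for `user_id` and report the outcome."""
    if user_id not in store:
        return {"status": 404, "body": "unknown subject"}
    del store[user_id]
    return {"status": 200, "body": f"erased {user_id}"}

print(handle_erasure_request("user-123"))  # erased
print("user-123" in store)                 # False
```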
A data mesh architecture can help your organization overcome scaling issues and take its data operations to the next level. But without a solid plan for governance, a data mesh can quickly become a data anarchy.
Federated computational governance balances scalability with oversight. By combining a distributed, decentralized data architecture with centralized, automated monitoring, it lets organizations maintain a high level of data quality and compliance without sacrificing speed or growth.
Last modified on: Nov 22, 2023