Why we need a universal semantic layer
In the modern business environment, the ability to make effective data-driven decisions is a competitive advantage.
This, as it turns out, is easier said than done. To really make the most out of data, we need to have a consistent, unified, and understandable view of it, one that is easily accessible to both technical and non-technical stakeholders. More often than not, businesses face hurdles with this.
Data consumers who don’t have context on the underlying data need to put in a lot of work to understand the nuances of data sets, and finding this information is rarely easy. The interfaces (SQL, BI tools) can be prohibitively complicated for users without modeling experience. This creates a new requirement (and a bottleneck) for meaningful analysis: a data producer has to act as a translation layer.
It’s also no surprise that different teams across a company use different tools, and when critical metric calculations produce different answers depending on the tool they are calculated in, it’s easy to get caught up debating whose version of reality is right. These inconsistencies ultimately lead to distrust in data.
Distrust in data sounds nebulous, but it has a very real cost. It looms over organizations, preventing employees from focusing on higher-leverage work and reducing the quality of decisions.
You might have your own story about how distrust in data has affected you. I can share a personal anecdote. Once, I reported a metric that I had retrieved from an analytics tool to the CEO of a previous company I worked for. At the time, I didn’t realize the number had multiple caveats attached to it and wasn’t exactly correct. It caused some alarm within the leadership team and led to decisions being made with unnecessarily high priority. When I discovered that the metric was not quite what it seemed, I felt as if I had lost the confidence to gather information independently, without the help of someone on the data team.
The promise of a universal semantic layer
A universal semantic layer serves as a translation layer between data and language. As I found out with my unfortunate story, translating can be complicated, and it requires a significant amount of knowledge about the underlying data and the nuances around how the data was created.
At its core, the job of a semantic layer is to serve as the source of truth for all of an organization’s metrics. It allows data producers and others with context to define metrics once and cement them as canonical.
But for a semantic layer to be genuinely meaningful, it needs to incorporate not just raw metrics, but also the context around them. This context is vital because it maintains awareness of how a number was calculated, and by whom. Just as importantly, it sheds light on data’s historical trajectory. For instance, conveying the meaning of a single, unexpected drop-off last holiday season alongside a revenue metric is essential to prevent misinterpretations that could lead to erroneous decisions or unwarranted concern. (Bitter? Me? Of course not.)
It should also simplify complex tasks, such as bridging entities across a semantic graph, thereby removing the need for pre-aggregation or the construction of OLAP cubes.
For a semantic layer to actually work, it has to sit between the source of data and the tools where end users consume it.
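To make the "define once, keep the context" idea concrete, here is a minimal sketch in Python. It is not how the dbt Semantic Layer is implemented; the class names, the `revenue` metric, the owner, and the dated note are all hypothetical and exist only to illustrate a single canonical definition that carries its own context.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Annotation:
    """A dated note explaining an unusual movement in a metric."""
    day: date
    note: str


@dataclass
class MetricDefinition:
    """One canonical definition of a metric plus the context around it."""
    name: str
    expression: str   # how the number is calculated
    owner: str        # who defined it
    description: str
    annotations: list[Annotation] = field(default_factory=list)


class MetricRegistry:
    """Holds exactly one definition per metric name (hypothetical helper)."""

    def __init__(self) -> None:
        self._metrics: dict[str, MetricDefinition] = {}

    def define(self, metric: MetricDefinition) -> None:
        # Refuse a second definition so the first one stays canonical.
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} is already defined")
        self._metrics[metric.name] = metric

    def get(self, name: str) -> MetricDefinition:
        return self._metrics[name]

    def annotations_between(self, name: str, start: date, end: date) -> list[Annotation]:
        """Return the notes a consumer should see next to values in this window."""
        return [a for a in self.get(name).annotations if start <= a.day <= end]


# Illustrative data only: the metric, owner, and note are made up.
registry = MetricRegistry()
registry.define(
    MetricDefinition(
        name="revenue",
        expression="SUM(order_total)",
        owner="finance-data",
        description="Gross revenue from completed orders.",
        annotations=[Annotation(date(2022, 12, 24), "One-off drop: checkout outage.")],
    )
)

holiday_notes = registry.annotations_between(
    "revenue", date(2022, 12, 1), date(2022, 12, 31)
)
```

Anyone reading the December revenue figure would get the outage note alongside it, which is the kind of context that would have saved me from my own reporting mishap.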
The benefits of a universal semantic layer are numerous. Here are some of the most significant ones:
- Consistency: A universal semantic layer provides consistency in the definitions of metrics across the organization. It eliminates the need for business users to define metrics on their own, which can result in inconsistencies and errors.
- Flexibility: A semantic layer is deliberately agnostic to where end users consume data. You don’t need to be locked into a BI vendor or learn a new tool that isn’t useful for you; instead the generic interfaces of a semantic layer should enable multiple consumption endpoints.
- Reusability: Data teams only need to create and update metrics in one place, which saves time and reduces the risk of errors and downstream impacts. Additionally, if they decide to change data platforms or downstream BI tools, a semantic layer that supports multiple data platforms and takes an agnostic approach to ecosystem tools makes that migration much easier.
- Reduced cost / compute: Without a semantic layer, a data platform often ends up holding many duplicate slices and dices of the same data, and managing these in a warehouse is difficult. Consumers also often run overlapping data transformations from their consumption interfaces. Together, these can account for a large portion of a company’s compute budget and easily the majority of its duplicated work.
- Governance & auditing: Using a universal semantic layer leads to an auditable record of changes and clear ownership. It also makes it easier to stipulate who can and can’t define new metrics.
- Reduce data inequality: Data inequality is the uneven distribution of access to the data people use to support their arguments or beliefs. It leads directly to lost learnings and missed opportunities, because groups without access cannot advocate as effectively as groups that have it. An accessible semantic layer helps eliminate this.
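The consistency and flexibility points above can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the dbt Semantic Layer's API: `METRICS`, `compile_query`, the `orders` table, and its columns are hypothetical. The point is that every consumer asks for a metric by name, and the calculation is filled in from one shared definition at request time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # the single, shared calculation
    table: str       # hypothetical source table


# One definition, shared by every consuming tool.
METRICS = {"revenue": Metric("revenue", "SUM(order_total)", "orders")}


def compile_query(metric_name: str, group_by: list[str]) -> str:
    """Turn a (metric, dimensions) request into SQL at request time."""
    if metric_name not in METRICS:
        raise KeyError(f"unknown metric: {metric_name}")
    metric = METRICS[metric_name]
    select_cols = group_by + [f"{metric.expression} AS {metric.name}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {metric.table}"
    if group_by:
        sql += f" GROUP BY {', '.join(group_by)}"
    return sql


# A dashboard and a notebook ask for different slices of the same metric.
dashboard_sql = compile_query("revenue", ["region"])
# SELECT region, SUM(order_total) AS revenue FROM orders GROUP BY region
notebook_sql = compile_query("revenue", ["order_date"])
# SELECT order_date, SUM(order_total) AS revenue FROM orders GROUP BY order_date
```

Both queries share the same `SUM(order_total)` expression, so the two tools cannot drift apart, and neither slice had to exist as a pre-built table beforehand.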
What are the alternatives?
Now let’s explore the alternatives to a semantic layer, listed in order from least to most effective (in my opinion).
- A wiki or document that contains a query or calculation for given metrics: This approach may work for a small set of metrics and users. However, it will not scale. Ultimately, these documents will become outdated, different people will have different versions of this logic, there is no code or approval process, and users will demand more dynamic ways to get the information they need.
- Metric logic in BI tools: In this approach, each independently used tool has its own business logic. While this may work for a small set of users and metrics, at scale, this approach runs the risk of creating inconsistency, duplication of work, and ultimately, distrust because each tool can report different results. Plus, what do you do if you ever want to change BI tools?
- Creating tables in the data warehouse/lakehouse based on metric logic: While this option is an improvement over the first and second alternatives, it still may not be sufficient at scale. Metrics are seldom useful on their own; they often require dimensions and attributes that business users can use to slice and dice the data. If you create a table with a metric, you’ll need to create numerous other tables derived from that table to show the desired metric sliced by the desired dimension. Mature data models have thousands of dimensions, so you can see how this will add up quickly, and lead to a lot of duplication and added cost.
- OLAP cubes: This is similar to the previous approach, but it attempts to join and pre-aggregate data ahead of time across all the dimensions that can be joined to the model. Some of the dimensions in these cubes are rarely used, which adds unnecessary cost. The result is static and restrictive, and it leads to the same maintenance problem: users will want more slices of data, which requires additional work to join in those dimensions or groups of metrics and dimensions.
All in all, some of these approaches could work, but each has its own limitations that become increasingly painful at scale.
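To see how quickly pre-built tables add up, here is a rough back-of-the-envelope calculation in Python. It assumes the worst case, where you materialize one table per metric for every non-empty combination of dimensions; real teams build far fewer, but the growth rate is what matters. The function name and the sample counts are illustrative.

```python
def pre_aggregated_tables(num_dimensions: int, num_metrics: int = 1) -> int:
    """Upper bound on tables: one per metric per non-empty dimension combination."""
    # A set of d dimensions has 2**d subsets; drop the empty one.
    return num_metrics * (2 ** num_dimensions - 1)


for dims in (3, 10, 20):
    print(dims, "dimensions ->", pre_aggregated_tables(dims), "tables")
# 3 dimensions -> 7 tables
# 10 dimensions -> 1023 tables
# 20 dimensions -> 1048575 tables
```

Even 20 dimensions is tiny compared with a mature data model, which is why both derived tables and OLAP cubes end up covering only a fraction of the slices people ask for.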
Why the dbt Semantic Layer?
Here’s why I believe the dbt Semantic Layer is the best way to solve the problems we laid out here.
dbt already allows for exactly the creation of clean, well-defined data sets that are crucial for a well-functioning semantic layer. And as a ubiquitous transformation solution, it’s used by more than 20,000 organizations today, across all sizes and industries.
It also seamlessly integrates with several data platforms, including BigQuery, Databricks, Redshift, and Snowflake. This enables it to easily access and build on data stored in these platforms, making it a perfect place to define and manage a semantic layer. Just as importantly, dbt’s framework natively supports metric definition in code and version control using git, enabling smoother collaboration and the ability to roll back problematic changes.
Our long-term vision is that everyone should be able to dynamically access metrics that are critical to their work. They should be able to understand what each metric means and where it came from using metadata from the Semantic Layer, and take action from any downstream tool.
To aid in fulfilling this vision, dbt Labs recently acquired Transform, which provided a sophisticated metric framework, based on MetricFlow. We’re currently working on incorporating MetricFlow into the dbt Semantic Layer. This blog post highlights what we’ve been up to since the acquisition and what you can expect in the coming months towards this vision.
Our Semantic Layer will allow you to define your semantic models and metrics on top of your dbt models in code. At query time, the dbt Semantic Layer creates a semantic graph of all the metrics and dimensions you defined and that can be joined in from your semantic models. These will be accessible seamlessly in downstream tools using robust APIs.
This will enable your organization, regardless of scale, to have confidence that everyone is aligned on key metrics, and that data consumers feel empowered to answer their own questions using the tools they want.
The new-look dbt Semantic Layer, powered by MetricFlow, is almost ready for you to take for a spin. We’ll be launching a beta program in the coming month, and encourage you to talk to a dbt Labs sales representative if you’re interested in taking part.
Last modified on: Dec 6, 2023