The dbt Semantic Layer: what’s next
Hello dbt Community! It’s been a couple of months now since Transform was acquired by dbt Labs, and whether you’re already using the Semantic Layer today or just evaluating it, chances are you want to know more about what’s coming next.
We’ve been hard at work on defining a concrete plan to incorporate MetricFlow into the dbt Semantic Layer. Today we’re here to share our first major update on this integration journey with you.
Where things stand today
Effective decision-making based on KPIs is crucial for any organization, particularly in today’s data-rich environment. However, with numerous tools available, end-users can easily define business logic differently in each one, leading to inconsistency and frustration. This, in turn, undermines trust in data, creating a significant challenge for organizations and causing friction in interactions between data producers and data consumers.
We believe that a universal semantic layer is the ideal solution to this problem. A semantic layer acts as a translator between data and language; metrics serve as the primary unit of translation.
The dbt Semantic Layer is one such universal translation layer, and it aims to allow end users to access metrics and their contextual information wherever they may be.
The dbt Semantic Layer launched in Public Preview last October with a robust suite of partner integrations that we are continuing to prioritize. But as Tristan mentioned in his post on the acquisition, while integration coverage is very important, it’s insufficient on its own to allow the Semantic Layer to meet the needs of many organizations.
Functional coverage matters a great deal too. The Semantic Layer has to be powerful and flexible enough to make working with metrics a frictionless experience at scale. Transform was acquired to accelerate the process of making that happen. (Transform created and maintained MetricFlow, an advanced metric query engine.)
We believe that the following abilities are crucial for a well-rounded Semantic Layer product, and we’re working toward bringing them all into dbt:
- Ability to express complex business logic
- Interfaces to query (and enable downstream integrations)
- Broad data platform connectivity
- Low latency
- Robust security and governance
What’s coming next
The integration of MetricFlow is poised to make dbt’s Semantic Layer considerably more sophisticated, enabling more complex metric definition and querying to be done highly efficiently, at scale.
The problems that MetricFlow will help dbt solve are some of the hardest technical challenges in the semantic layer space, and solving them will make a huge difference to the day-to-day experience of using the Semantic Layer.
The following product improvements will be part of the upcoming first re-release of the dbt Semantic Layer with MetricFlow, available in Beta in Q3 2023 and in Preview in Q4.
1. Join navigation
dbt will soon be able to create a semantic graph of your data through support for joins across tables. That means it will be able to infer the appropriate traversal paths between multiple tables when generating a metric, making all valid dimensions available for your metrics on the fly.
Take the following example of a simple transactions table and products table that you’ve built using dbt.
Let’s assume you’ve defined a metric that calculates revenue per customer on the transactions table.
Now let’s say an end user wants to dive deeper into an interesting trend and see customer revenue by product category. In the past, this would have been impossible for them to do without some up-front work, because category is defined in a different table.
With its knowledge of your semantic graph, the dbt Semantic Layer will now be able to traverse that join to dynamically execute a query to return this data, on the fly.
Of course, this is a simple example with two tables that happen to be directly adjoining. When your data model spans hundreds of tables, with many potential “hops” between them, this capability can be a huge time saver. It means data consumers will no longer have to think about the combinations of dimensions that they can slice a metric by – the Semantic Layer will be able to do the heavy lifting to abstract that away.
This has likely been the dbt Semantic Layer’s single biggest functional gap thus far. MetricFlow will allow dbt to solve it elegantly, while avoiding the tricky problems that come with fan-out and chasm joins.
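To make the idea of traversing a semantic graph more concrete, here is a minimal Python sketch. It is not how MetricFlow is implemented; the table names, join keys, and the `find_join_path` helper are assumptions made up for illustration. It runs a breadth-first search over a set of declared joins to find the hops between the table a metric lives on and the table that holds a requested dimension.

```python
from collections import deque

# Hypothetical join graph: each edge records the key that links two tables.
JOINS = {
    ("transactions", "products"): "product_id",
    ("transactions", "customers"): "customer_id",
    ("products", "suppliers"): "supplier_id",
}


def neighbors(table):
    """Yield (other_table, join_key) for every declared join touching `table`."""
    for (left, right), key in JOINS.items():
        if left == table:
            yield right, key
        elif right == table:
            yield left, key


def find_join_path(start, target):
    """Breadth-first search for the shortest chain of joins from start to target."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, path = queue.popleft()
        if table == target:
            return path
        for other, key in neighbors(table):
            if other not in seen:
                seen.add(other)
                queue.append((other, path + [(table, other, key)]))
    return None  # no join path exists between the two tables
```

In this toy graph, `find_join_path("transactions", "products")` returns a single hop on `product_id`, while a request for a dimension on `suppliers` resolves to two hops through `products`. A real semantic layer also has to decide which joins are safe to take (for example, to avoid fan-out), which this sketch does not attempt.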
If you’re familiar with the Semantic Layer today, you might be wondering how this affects the spec for metrics — stay tuned, we cover that below!
2. Expanded data platform support
We want you to be able to benefit from the value of consistently defined metrics regardless of your data platform of choice. That’s why in addition to Snowflake, we’re pleased to announce that we’re also adding support for the other major dbt Cloud supported platforms: BigQuery, Databricks, and Redshift will be supported soon by the Semantic Layer, with Starburst to follow shortly after.
3. More optimized query plans and SQL generation
Part of the job of any semantic layer is to generate SQL code that runs against data platforms in order to return the results of a metric calculation. But generating efficient SQL code dynamically can be tricky!
MetricFlow does a lot of work under the hood to make sure the generated query is performant, and legible enough for you to understand what’s going on.
Consider a query for a “bookings” metric that requires a join to a slowly-changing dimension. Generated without any optimization, this not-entirely-unusual metric query can easily run to more than 150 lines. After MetricFlow’s optimizations, it’s roughly one third of that length and significantly easier to understand.
4. More complex metric types out of the box
In the past, the dbt Semantic Layer has natively supported a relatively limited number of metric types. As part of the MetricFlow integration, however, dbt will soon natively support more complex metrics to support a broader array of business scenarios.
For example, the following will now be first-class metric types:
- Expressions (e.g., transactions – cancellations)
- Ratios (e.g., revenue per customer)
- Cumulative metrics (e.g., weekly active users)
- New aggregation types (such as sum_boolean and percentile)
Importantly, by introducing the construct of “measures” that act as the building blocks to a metric, dbt will allow you to easily define complex calculations that can rely on individual measures without having to repeat them, leading to “DRYer” code.
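As a rough illustration of why measures lead to DRYer definitions, the Python sketch below defines each aggregation once and then builds an expression metric and a ratio metric on top of them. This is not the dbt metric spec: the sample rows, the `MEASURES` dictionary, and the function names are all hypothetical.

```python
# Hypothetical rows from a transactions table.
ROWS = [
    {"customer_id": 1, "amount": 40.0, "is_cancelled": False},
    {"customer_id": 1, "amount": 10.0, "is_cancelled": True},
    {"customer_id": 2, "amount": 50.0, "is_cancelled": False},
]

# Measures: named aggregations defined once and reused by several metrics.
MEASURES = {
    "revenue": lambda rows: sum(r["amount"] for r in rows),
    "transactions": lambda rows: len(rows),
    # sum_boolean-style aggregation: count the rows where a flag is true.
    "cancellations": lambda rows: sum(1 for r in rows if r["is_cancelled"]),
    "customers": lambda rows: len({r["customer_id"] for r in rows}),
}


def measure(name, rows):
    """Evaluate a named measure over a set of rows."""
    return MEASURES[name](rows)


def net_transactions(rows):
    """Expression metric: transactions minus cancellations."""
    return measure("transactions", rows) - measure("cancellations", rows)


def revenue_per_customer(rows):
    """Ratio metric: revenue divided by the number of distinct customers."""
    return measure("revenue", rows) / measure("customers", rows)
```

Both metrics reuse the same measure definitions, so a change to how `revenue` or `cancellations` is aggregated only has to be made in one place.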
5. Semantic Layer GraphQL API
In addition to the existing JDBC interface, we will also be exposing a GraphQL API.
GraphQL is a strongly-typed and self-documenting interface. The dbt Semantic Layer’s API will allow you to query metric data from various data applications, and will help enable richer integration experiences in third party tools.
6. Metric quality checks and local validation
By adding checks related to your metrics into your CI pipeline, dbt will be able to help you find problems as early as possible.
After building your semantic models and metrics, dbt will be able to test them to help ensure they’re correct. Some examples of things that dbt Cloud will now check for include:
- Metrics you created point to valid and defined measures
- Semantic models actually exist in your data platform
- Entities exist in more than one semantic model to ensure joinability
- Data sources have only a single primary entity (i.e., one primary key)
- Measure names are unique across your semantic graph
This keeps everything organized, helps maintain the quality of your data, and ultimately builds more trust in your work.
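To give a feel for what a few of these checks involve, here is a small Python sketch. It is only an approximation under stated assumptions: the dictionary layout for semantic models and metrics is invented for this example and is not the dbt or MetricFlow configuration format, and the `validate` function is hypothetical.

```python
from collections import Counter


def validate(semantic_models, metrics):
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []

    # Check: measure names are unique across all semantic models.
    counts = Counter(m for model in semantic_models for m in model["measures"])
    for name, n in counts.items():
        if n > 1:
            problems.append(f"measure '{name}' is defined {n} times")

    # Check: no semantic model declares more than one primary entity.
    for model in semantic_models:
        primaries = [e for e, kind in model["entities"].items() if kind == "primary"]
        if len(primaries) > 1:
            problems.append(
                f"model '{model['name']}' has {len(primaries)} primary entities"
            )

    # Check: every metric points to a measure that is actually defined.
    for metric_name, measure_names in metrics.items():
        for measure_name in measure_names:
            if measure_name not in counts:
                problems.append(
                    f"metric '{metric_name}' references undefined measure '{measure_name}'"
                )
    return problems
```

A function like this could run in CI and fail the build whenever the returned list is non-empty. Checks that need the data platform, such as confirming that a semantic model’s table exists, are outside what this sketch covers.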
Additionally, you will be able to download the MetricFlow CLI locally to test your configurations when committing metrics, and run simple queries to ensure everything is working as expected. You’ll be able to run commands like the following to validate that your metric definitions are in good shape.
# Check on your metrics
mf list-metrics
# Run a simple query to validate the output
mf query --metrics revenue --dimensions user__country
7. Unified permissions
In addition to setting up highly privileged deployment credentials for a project, dbt Cloud administrators will now also be able to configure another set of less privileged project credentials for users of the Semantic Layer.
These organizationally-scoped credentials will be used when data consumers are running queries in downstream tools. This approach is not only more easily auditable, but also ensures that everyone with access to query the Semantic Layer has the same privileges, reducing the possibility of actions that might compromise the integrity of the data.
What’s coming after that
Beyond the improvements listed above, which will come with the first re-release of the Semantic Layer, the integration of MetricFlow will also make some very important functionality easier for us to add in the medium term.
These product improvements will follow in the future, after the initial re-release:
1. Caching
You’ll soon be able to get results faster, with less warehouse spend, using intelligent caching. dbt will initially support declarative caching for the metrics you manually specify as most important. At a later date, we’ll build more sophisticated caching that’s automatically informed by which metrics are queried most frequently, so you can avoid repeatedly querying the data warehouse from scratch.
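The declarative flavor of caching can be pictured with a short Python sketch. This is an illustration only, not a description of how dbt will implement it: the `MetricCache` class, its constructor arguments, and the `run_query` callable are assumptions made for the example.

```python
class MetricCache:
    """Cache results only for metrics explicitly declared as cacheable."""

    def __init__(self, run_query, cached_metrics):
        self._run_query = run_query  # callable that actually hits the warehouse
        self._cached_metrics = set(cached_metrics)
        self._store = {}

    def query(self, metric, dimensions=()):
        """Return a metric result, reusing a stored one when the metric is cacheable."""
        if metric not in self._cached_metrics:
            # Not declared as cacheable: always go to the warehouse.
            return self._run_query(metric, dimensions)
        # Sort dimensions so that their order does not create duplicate entries.
        key = (metric, tuple(sorted(dimensions)))
        if key not in self._store:
            self._store[key] = self._run_query(metric, dimensions)
        return self._store[key]
```

Here only the metrics named up front are ever stored, so repeated requests for them skip the warehouse. The sketch leaves out invalidation, which any real cache would need once the underlying data changes.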
2. More metric types and query flexibility
We know there are a lot of different flavors of business logic and approaches to data modeling. We plan to add support for the following scenarios, with the goal of making it easier for all organizations to define all their metrics in dbt.
- Conversion metrics are typically used to determine whether a web strategy is achieving its desired outcomes with visitors (e.g., of all users that visited a website, how many became customers?).
- Experimentation metrics are crucial for product-led growth strategies. A common example is A/B testing, in which you might compare metrics on two or more versions of a web page to determine which has the desired impact.
- Slowly Changing Dimensions store and manage both current and historical data over time. MetricFlow has some support for Type II natively, but there are more scenarios we would like to enable.
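As a simple picture of what a conversion metric computes, the Python sketch below answers the question "of all users that visited, how many became customers?" The event shapes and the `conversion_rate` function are hypothetical, and a real conversion metric would typically also constrain the time window between the two events.

```python
def conversion_rate(visits, purchases):
    """Share of distinct visitors who also appear as purchasers."""
    visitors = {v["user_id"] for v in visits}
    if not visitors:
        return 0.0  # avoid dividing by zero when there were no visits
    buyers = {p["user_id"] for p in purchases}
    # Only count buyers who were also visitors.
    return len(visitors & buyers) / len(visitors)
```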
In addition, we also plan to build abstractions in the Semantic Layer to power more complex analyses. These analyses, which would typically be lengthy SQL for an end user, would be written in just a few lines. For example, we want to enable users to perform cohort analysis, to create and compare forecasts and goals against their metrics, perform offset time comparisons, as well as choose to trim and fill incomplete data.
3. Even more interfaces
Inevitably, developers building integrations with the dbt Semantic Layer have their own preferences on what interfaces to use. In addition to JDBC and GraphQL, we also plan to add ODBC and REST interfaces to give application developers as much flexibility as possible.
What’s going to change (or break)
Unfortunately, as much as we’d like it to be, this will not be an entirely seamless transition for current users of dbt metrics.
The metric spec
First and foremost: the integration of MetricFlow with the dbt Semantic Layer will result in significant changes to the current dbt metric spec. Unfortunately, that means if you’re using dbt metrics in production today, refactoring will be required. (We will provide tools and migration guides to support you.)
We understand that’s likely to be quite inconvenient, but we think the long term payoff is going to be very worth it, and wanted to rip off the band-aid as early as reasonably possible. We definitely appreciate your willingness to bear with this.
⚠️ We will be deprecating the dbt_metrics package. If you’re using dbt metrics today, you will have to migrate to the new spec if you want to use the new functionality. ⚠️
We’ll be providing more communication about this as we get closer to the deprecation date, and will also provide tools and migration guides. More on why we’ve chosen to make this decision will be posted on our Developer Blog soon.
You can read more about the new metric spec, see it in action, and share your thoughts in this GitHub issue.
As part of deprecating the metrics package, we will shift our focus to dynamic semantic layer capabilities. As such, the dbt Semantic Layer will not support materializing metric queries as static database objects. Instead we’ll focus on caching functionality that increases performance without reducing capability.
JDBC querying syntax
We’re introducing a new querying syntax that will allow you to run queries in your downstream tools. The syntax will be based on Jinja and will be Python-like (similar to the current querying interface). It will include changes that reflect new abstractions in the metric spec, such as the ability to express join paths through the use of “dunders” (double underscores). Stay tuned for more documentation on this.
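The final syntax is not documented here, so the following Python sketch only illustrates the general idea of a dunder-separated name such as `user__country`. It assumes that every segment before the last names an entity on the join path and that the last segment names the dimension. The `parse_dimension` helper is hypothetical and is not part of dbt or MetricFlow.

```python
def parse_dimension(qualified_name):
    """Split 'entity__...__dimension' into (join path entities, dimension name)."""
    *entities, dimension = qualified_name.split("__")
    return entities, dimension
```

Under that assumption, `user__country` reads as "the `country` dimension, reached through the `user` entity", and a name with no dunder refers to a dimension with an empty join path.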
Connections to external integrations
We’ll be working closely with existing integration partners to make sure their integrations are compatible with the upcoming updates to the Semantic Layer.
However, in order to benefit from the updates to the Semantic Layer, you will have to redeploy any existing connections you’ve created to downstream tools. This will be a one-off activity; look out for more information soon on what it may entail.
Initially, we will not support secondary calculations as they are defined today, but we plan to reintroduce them in the future. Note that some of these calculations can be translated to first-class metric objects with the new spec.
Timelines and next steps
There is a lot of hard implementation work ahead of us, and we couldn’t be more excited to do it. We expect to have the first iteration of the dbt Semantic Layer with MetricFlow in the hands of users, in a usable beta state, in Q3 of this year. We are planning to then re-launch the new-look Semantic Layer back into Public Preview in Q4.
Our next wave of improvements, including metric caching, more metric types, and more interfaces will follow over the course of this year and the next. We will definitely keep you posted as we get closer to launch and have more to share with you!
In the meantime, if you have any questions or suggestions, feel free to hop into the #dbt-cloud-semantic-layer channel on dbt Slack, or DM us at @roxi.pourzand or @Nick.
Last modified on: Nov 29, 2023