Table of Contents

Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt

Experienced Consultant with a BS in Information System focused in Information Systems from University of Utah - David Eccles School of Business

Originally presented on 2021-12-10

More and more organizations are beginning to see how building data products and machine learning platforms can add significant value to their business. As such, data teams are under more pressure than ever to ensure that these platforms, and the data backing them, are both highly trustworthy and highly flexible to rapid development of new features and use cases.

In this workshop, we will share the key insights and design patterns that we have identified in helping numerous data teams build flexible, trustworthy customer-facing data experiences and machine learning platforms backed by dbt. In particular, we’ll highlight how dbt’s robust testing, Exposures, and artifacts can be used to bolster the trustworthiness and value of downstream consumers of your data warehouse.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Joel Labes: Thank you for joining us at Coalesce. If we haven’t met yet, my name is Joel Labes. I use he/him pronouns and I’m part of the community team at dbt Labs focusing on developer experience. Welcome to Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt. We’re very lucky to have Kelechi Erondu and Angie Brown from 4 Mile Analytics presenting today. Kelechi led the creation of the analytics engineering practice at 4 Mile and loves creating teams.

Angie loves geeking out about dbt and when she’s not working on dbt models, you’ll find her camping or hiking with her family in Colorado, which means that they might have all collided with Lauren at some point while she was in the wilderness.

All chat conversation today is taking place in the Coalesce Getting Meta channel of dbt Slack. If you’re not part of the chat, you have time to join right now, visit getdbt.com/community and search for Coalesce Getting Meta when you enter the space. We as always encouraged you to ask other attendees questions, to make comments or to react at any point in the channel. Bring the [00:01:00] memes. After the session Kelechi and Angie will be available in the Slack channel to answer your questions and we’ll try and do some live if we’ve got time. So keep the questions flowing in throughout the session. And if you throw a red question mark at the start of your message, it’ll be easier to make sure it gets an answer. You can also upvote other questions by putting the plus emoji on.

Let’s get started. Over to you, team.

[00:01:24] Kelechi Erondu: Hello. Hello. Hello. Hope everyone is doing well this evening, this afternoon, this morning. Thank you for the introduction. We are excited to share this virtual space with you all and share experiences in the analytics engineering world. Specifically, we want to use this time to have meta conversations and we want to walk you through how we explored using dbt and metadata to generate data trust within organizations.

My name is Kelechi Erondu, senior manager at 4 Mile Analytics.

[00:01:58] Angie Brown: And I’m Angie Brown. I’m [00:02:00] a senior analytics here at 4 Mile, and I’m stoked to show you what you can do with dbt and metadata.

[00:02:06] Kelechi Erondu: I am stoked as well, so let’s see what’s on the First we want to get related and share more of ourselves and who we are in the world.

I want to introduce 4 Mile Analytics to you all. And we want to take some time to get to know each of you in this virtual space. And then we’ll introduce a common data problem and what I’m deeming the old world of data experiences. And then Angie will introduce a solution. She looked up in the new paradigm of data experiences and then we’ll have a Q &A session and then we’ll be out of here.

[00:02:35] About 4 Mile Analytics #

[00:02:35] Kelechi Erondu: All right. So let’s get related. First, I want to introduce 4 Mile Analytics. is headquartered in Santa Cruz, California. It’s composed of 52 human beings distributed across the US, Canada and Australia. We serve clients large and small around the world who delve into all kinds of data of different industries but who all share the same desire to understand [00:03:00] their business and their operations with the other data. At 4 Mile, we apply a human first approach.

And what I mean by that is, we take care of ourselves in support of our team and support of our clients to understand their needs and we explore their possibilities and finally implement solutions that yield sustained and powerful data experience. For example, we might deploy tailored modern data stacks for our clients.

And what that looks like is, we integrate data into a data warehouse, we apply governance and modeling to their data. And we an integrate tools that elicit good experiences for our clients, that our clients crave, including creating custom digitalizations that might meet their particular use case.

Okay. All right. So I want to share a little bit about our ourselves here. I’m again, my name is Kelechi Erondu. I’m based in Brooklyn, New York. I was raised in [00:04:00] Augusta, Georgia. I enjoy flow moments like playing soccer, listening to, talking about, or mixing music. As Joel alluded to, I really enjoyed building teams and teamwork.

One thing I’m working on is keeping my shoes clean. I’ve struggled a bit, just compared to these pearly white loafers that you see in this picture. I’m still working through how to, just keep my shoes clean, my sneakers. It’s all a work in progress.

[00:04:26] Angie Brown: And I’m Angie Brown. I’m located in Broomfield, Colorado. If you’re familiar with Colorado, Broomfield sits between Boulder and Denver, and I love to go biking, hiking, camping, playing soccer, and my two young girls, they keep me young at heart.

[00:04:42] Kelechi Erondu: Excellent. All right. So now we want to get to know who’s in this virtual room. Emma should be sharing a poll right now just to get a sense of who’s with us right now.

[00:04:53] Our 4 audience categories #

[00:04:53] Kelechi Erondu: We’ve categorized the audience into four potential groups. We want to see who’s here. So you might [00:05:00] be an analytics engineer. These are people who are using dbt every day and are fully bought into the analytics engineering mindset. You are deep into the weeds of creating models to effectively transform your data on behalf of your end users.

Speaking of end users, you might be a business user. A business user might not necessarily be in the weeds of course, of your company data but and how it’s transformed and how it’s consolidated and transformed, but you have pristine knowledge of the business and operational functions.

Given, now every company is essentially a data company, your work is dependent on the work of your data team. And thus you work closely with analysts and engineers, to get those answers to your business questions. You might be a team lead. This person runs the data team, you’re whipping up implementing project plans, creating epics so that you can achieve some sort of shorter longterm project initiative.

You’re responsible of ensuring those initiatives are achieved in a timely manner. Or you might be a decision maker. We’re thinking of CTOs, [00:06:00] founders. You’re making the call in terms of what technologies would be really useful for the business today and for the future for years down the road.

Okay. So yeah, I’m just going to check the chat to see who’s here with us, right?

[00:06:21] Angie Brown: Lots of analytics engineers here.

[00:06:22] Kelechi Erondu: All right. Hello fellow analyst engineers. Great, right and we have one decision maker. Awesome.

[00:06:35] A common data problem #

[00:06:35] Kelechi Erondu: Great. All right. Now that we’ve got to know each other, I want to introduce a common data problem. I’m going to show a Slack message. This is a Slack message that I received earlier this year, and I’m very curious to see who’s, for the folks in the room, has experienced this or received a message like this, or maybe [00:07:00] sent a message like this. So the message reads, "Kelechi, can you review the signup numbers in the self-serve dashboard? These do not match our reports that we sent out on 6/29th." So yeah, very curious to see if anyone has experienced this type of issue or problem. Let’s see, I’m gonna check the Slack again. Yeah, I like it, everyone. I like that. Thank you, Joel.

So I would assert that this is a very common occurrence for data teams and teams that work with data teams around the world today. It’s not the most ideal occurrence, in the context of who’s in this room. Again, thinking about the analytics engineers. You start your workday with an idea or agenda of what to work on, and when you receive a message like that, you essentially have [00:08:00] to stop what you’re doing and help troubleshoot. If you’re the business user who sent that message, right? You might grow skeptical of the dashboards or the data products that your data teams are creating for your analysis.

And, it just creates some anxiety, especially if you have an upcoming meeting with leadership and you’re sharing metrics that you might not trust. You might be a team leader who’s disempowered. You have your project plan, and given that your resources are having to stop what you know is intended for them to work on, to solve for these issues might put your data, your project plan at risk. And thinking about decision-makers, where you made the call in terms of investing in those data tools to meet in some sort of initiative and come frustrated in terms of, Hey, did I make the right decision?

Cause right now we’re dealing with some unforeseen problems in terms of being stopped.

[00:08:50] Essential questions about data #

[00:08:50] Kelechi Erondu: Our essential questions are these, is there a way to be more proactive in solving data issues? Can we make sure that our source tables are up to date? [00:09:00] Can we ensure that we test the data to ensure that it’s robust and quality? Can we visualize, essentially, the health of our data stack and visualize everything that, I’ve described in these questions above in a single place.

Angie, what do you think? Is it possible?

[00:09:19] Angie Brown: What if I told you, you can be more proactive, that you can ensure your sources are fresh and you can ensure your data is accurate all in one place. It’s a whole new world with dbt. Let me show you.

[00:09:32] An example of a meta metadata dashboard #

[00:09:32] Angie Brown: Here is our meta metadata dashboard. This Looker dashboard gives you the proactive visibility into the accuracy of your dbt project.

We built this as a proof of concept to show you what’s possible and showcase what dbt can do for you. Of course, we want to cater your dashboard to what’s important to you. Let’s take a look.

So here we have the most recent run and whether it was a success or failure. If I click on this link, it will take me directly [00:10:00] into the logs of this run.

So if there was a failure, I could go and look real quickly. We have the date the last time that this was run in the history thereof. We have our source freshness warnings. Oh, it looks like we have one source that’s giving us a warning. There’s 31 that passed. So that’s good. Next we have our test failures or our tests that have passed in the history. With dbt, you’re also able to see your overall model status. So here I have my dim customer, my fact order and their tests accordingly, and whether or not they passed or failed. If you’re worried about query optimization, you can actually even get the execution time of your model. Exposure status. If you’re not familiar with exposures, don’t worry.

I’ll be going over that later on in this presentation. What you can gather from this is that I was able to make an exposure, ABC test, which is a [00:11:00] dashboard. And that dashboard is utilizing our dim customer, our fact order and our order source. And these are the unique tests that we set up for each of those models and whether or not they were successful or failures.

And it looks like we have a warning on our source of orders. So we’ll want to make sure that we fix that before anyone consumes this data. Additionally with dbt, you’re able to get your model paths. So here I have dim customer, stage customer, and then I have my source, all that make up my dim customer and their relative node times.

And then we have our executions over time. If you want to see if there’s any trends of things failing at a certain time every week this is a good place to go. And then we have our source freshness over time. So as we mentioned, our orders is giving us a warning. It looks like we set up a test to make sure that the sources refresh daily.

And if I look at my max loaded that date, uh [00:12:00] oh, it hasn’t been refreshed since March 23rd, 2021. That’s a big problem. And in the real world, we would want to take care of that right away. So now that you can see how you can get visibility into the meta metadata by creating your own dashboard, let me show you how we’re able to accomplish this.

Let’s take a deeper dive on how we’re able to create that robust dashboard with this implementation strategy. We’ll be going over each of these in depth so you can be empowered to implement the dbt tools in your project. We’re going to go over our dbt tests, they’re schema, generic and singular tests.

[00:12:37] Implementation strategy #

[00:12:37] Angie Brown: We’re going to go over our source freshness checks and then our exposures. As I promised within our implementation strategy, we first set up our schema tests. What are schema tests? They’re configured within your YAML file and come out of the box with dbt. They are the unique, not null, accepted values, relationship tests that are configured within your YAML file. So in this [00:13:00] example, we have our fact order and our primary key on that table is our order ID. So we want to ensure that it’s always unique and it’s not null. Additionally, we have our customer ID. We want to make sure that for every order, we have a customer.

So we want to ensure that’s not null. Additionally, if there’s a value in that customer ID, we want to make sure that it lives in our dim customer. And that’s where this relationship test comes into play. For this presentation, we didn’t set up any generic tests, but I wanted to introduce them to you.

These are assertions that you can make about your data or even regression type tests that can be reused over and over again. In essence, you can parameterize a query as a template. So in this example, you just create a macro and here we have a model and column name. And then you can create your validation check. For this validation, we’re just checking whether or not it’s even or odd.

And if it returns anything it’s going to [00:14:00] fail. So you only want to write these queries so that they’re returning failing rows. Once you set that up, all you have to do is within your YAML file is put within the test parameter the name of your macro.

Singular tests are similar to the generic tests except for these, are one case scenarios. And to set these up within your dbt project demo file, you’ll need to set up your test paths. So here, I just use the default, which is test, but you can name it whatever you want. And then you write your SQL query to return failing rows, and you save it as a SQL file within that directory.

So in this example, we have our shipped quantity and our received quantity, and we’re checking whether or not they match. If they don’t match, we want to be notified so that we can take care of that ASAP.

Source freshness. This is the acceptable amount of time between the most recent record and now for your table to be considered fresh.

So within your [00:15:00] sources YAML file, you set it up just as you normally would with your database schema, but then you can add this freshness. And you can set up the warn after and you can also set up an air after. You don’t have to do both. And then the next thing you want to do is you want to set up your loaded app field.

So what dbt does is it’s going to take the max value of this column and check if it’s a fresh within the last day. One thing to keep in mind is that this is high-key and that if you set it up this sources level, it’s going to check all of the tables that you’ve listed below for this freshness check.

Now you probably don’t have every single source that’s going to be refreshed on a daily cadence. So you can actually override that table by table using these same parameters. When you run a source freshness check, it’s going to run the SQL queries for you in your database. So it’s going to pull the max of whatever column you put in that parameter, and then it’s going to compare it to the current time.[00:16:00]

So exposures. As promised, these define and describe your downstream use of your dbt projects, such as your dashboard application, or even a data science pipeline. So here we have an exposure. I have set up an application of Looker. We have the type of application, and then I set up a date, a dashboard. So ABC tests, like we saw on our actual Looker dashboard and you just specify the type and what it depends on.

So dim customer effect order and our sources, orders and customers. Now we can visualize from source to dashboard. So if you’re familiar with DAGs and dbt, if you set up your sources they will show up as green. So here we have our Jaffle shop customers and orders. We have our light transformation of stage customer and orders, and then our fact and dim tables all being consumed by our ABC test dashboard.[00:17:00]

[00:17:03] Using dbt artifacts #

[00:17:03] Angie Brown: So now that we’ve implemented the building blocks to create our meta metadata dashboard with tests, source, freshness, and exposures, how do we get the data into dim and fact tables to then create our own dashboard? With every invocation, dbt generates and saves one or more artifacts as JSON files. We’re going to go over the manifest, the run results and the sources, JSON files.

So first is the manifest file. This is a full representation of your dbt project. So it’s going to have your models, your tests, and your exposures over here. I have an example. So in your manifest, you’ll be able to get your job ID, you’ll get your generated act date. And then all of the nodes thereof. One thing to keep in mind is your manifest file will be created and it will replace the previous manifest file.

Every time you compile, do a dbt run test [00:18:00] or generate your documentation, this is going to come into play when we do our job orchestration. So keep that in mind. Next we have our run results JSON file. This is the timing and status of every node within your project. So again, we have our generated update and then we have our node of stage customer, whether or not it was a success and the execution.

Again, we have the stage order. It was a success. And then the execution time, again, it’s important to understand that whenever you do a dbt run, a test seed, snapshot, or generate documentation, it’s going to override and save. It’s going to save a new run results JSON file. And the last file I want to go over as the sources, this is going to list your sources and your freshness checks.

So here I have my Jaffle shop orders and I have the snapshot of that date. So this is the date that I ran this, which was [00:19:00] back in October. So that was the current day. And then I have my max load of that date. And what dbt does is it’s going to compare these two dates and check if it’s within this parameter that you defined.

And if it’s not, it’s going to give you the status of warn. The sources JSON file is going to be produced whenever you do the dbt snapshots freshness or dbt source freshness commands.

[00:19:29] The process of getting meta about metadata #

[00:19:29] Angie Brown: So let’s get meta about metadata. How do we get these JSON artifact files into dim and fact tables so then can be consumed by our dashboard. First we’re going to create a macro to extract the JSON files. Then we’re going to clean up and parse the JSON into our own model files. We’re going to set it up in our jobs to make sure that we get all of the new information into our dim and fact tables.

Then we’re going to create an update dashboards instead alerts accordingly. If we haven’t done so yet we will be sharing a link in the Slack channel [00:20:00] that will direct you to a public GitHub repo with a full macro that we used. We want to empower you to build your own metadata dashboard. For this demonstration, we’ve used Snowflake, so the JSON extraction macro and parsing our data is warehouse specific. So keep that in mind. This is just to show you what’s possible with the artifacts that dbt provides. So I just picked certain snippets from the macro. Again, you’ll be getting the macro but the first thing that we set is the source dbt artifacts then we created a stage that is the type of JSON and then within a for loop, we put the file into the stage right here, and then we copy it into a table and voila, we have our JSON files in Snowflake. That easy. So now that we have them in Snowflake, we can flatten that JSON and create our stage model files.

And then we begin our metadata transformation. [00:21:00] You can see that we were able with those three files of the manifests, the run results and the sources we’re able to create stage files that give us our exposures, our models, our model execution, run results, sources, tests, and test executions. You may have noticed that we also put our stage files into their own folder called dbt artifacts.

This not only helps you organize the model files, it makes it easy to add a tag in your dbt project YAML file, which allows you to run the job in your job orchestration a little bit easier. After creating the stage model files, we build out our dim and fact tables. The dim and fact tables are what are going to be consumed by your Looker dashboard to create that one place stop to ensure your data is fresh and valid while giving visibility to your various teams.

Again, you may notice that we put our dim and fact tables in their own folder. This is done for the [00:22:00] organization and for the tag management. From the Lineage graph, we can see everything we’re able to build. We start with our dbt stage, our stage dbt artifacts that contain the JSON files, and then we have our stage files that are flattening that JSON and parsing it.

Then we have our dim and fact tables and these are what are going to be consumed by your Looker dashboard.

Now that we have our metadata objects, we need to orchestrate some of the jobs to pick up a JSON and transform it on a regular cadence. In this example, I start with checking my source freshness. Then I’m invoking the macro we created. I know when I run the source freshness command, it generates this sources JSON file.

So I add that command as my argument. And you may remember we tag these models within the dbt project demo files. So the last step is just to [00:23:00] run that tag of dbt artifacts.

There are a variety of ways to configure your dbt job, to run your models. If you’re doing dbt run and dbt test, it’s important to run your macro twice, once to pick up the run and then again, to pick up the test results. So let me show you. In the dbt run, I’m doing a tag of core, which is going to populate my dim customer and my fact order tables.

Then I’m going to invoke my macro to pull my manifest file and my run results to see if everything was successful. Next I’m going to do my test. And I know that my dim customer and my fact order all come from Jaffle shop. And so one great trick is to do source Jaffle shop with a little plus sign, and it’s going to run everything from the source all the way to my dim and fact tables.

Once I do that, I’m going to invoke my macro again, because I know that when I run, my test [00:24:00] is going to override my run results now with my test results. And then the last step is I want to update my metadata dim and fact tables. So I’m going to do a dbt run of my tag of dbt artifacts. If you’re using the build, the run results are great because there it includes both of the run and test results.

So you only have to run the macro once. So here, I’m going to do my dbt build, select my core. I’m going to invoke the macro, pull my manifest and run results. And then I’m going to build my dbt artifact.

So now we have the building blocks to create our own dashboard. Yay. And because we’re using Looker, we can set up a data group trigger to refresh the cache of our dashboard. So one easy way to do that is to utilize the SQL trigger. And as I mentioned in artifacts, the JSON artifacts, we have that artifact generated at date, so you can just grab the max date to make sure [00:25:00] that your Looker dashboard is up to date. Looker also has alerts where you can specify conditions in your data that when met or exceeded, it can trigger a notification. And that notification can be either an email or even a Slack message. So if we think back to the dashboard where I had that source of freshness warning, this would be a great example where we could set up.

[00:25:28] The end result of all these #

[00:25:28] Angie Brown: So what’s our end result? Our analytics engineers, they no longer need to spend their valuable development time resolving issues in a pressured environment and could focus further on the development of their dbt project. Business users will embrace the data tool, knowing the health of their data is accurate.

Team leads, they’ll be empowered to lead their team, to get the work done instead of finding those data inaccuracies. And decision-makers are confident they are making the right decisions due to the fact that the dashboards they’re using are accurate.[00:26:00]

Oops. Before I conclude, I wanted to go over some considerations. By using the dbt Cloud jobs, we converted all of the tests to warnings and that’s because when the severity is not at a warning, it will stop. It will fail. And it won’t pick up the JSON files. If that’s not ideal for you, there are the options available such as Airflow and the dbt API.

And then again, for this demonstration, we use Snowflake. So the JSON extraction macro and parsing our data warehouse specific. So in conclusion, I hope you found this presentation helpful. You now have the tools and code in hand to utilize dbt’s robust testing, exposures, and artifacts to bolster the trustworthiness and value to your downstream consumers. You have any questions?

[00:26:56] Joel Labes: All right. Wow. Thank you both so much. As I [00:27:00] was talking about in the chat, but I haven’t seen that snuff like functionality. And so I was delighted to know that was possible, especially in a cloud context, which is not something that you normally think about being able to do that sort of behavior just natively.

[00:27:15] Q&A #

[00:27:15] Joel Labes: So there’s a couple of questions in the chat. Maybe Vince, we might want the slide deck back up just for Janessa asking what was what were we seeing on that dashboard at the stat? Was that, the data on a whole dbt project or just a single model? Yeah, that one.

[00:27:29] Angie Brown: Yeah, so I created a very simple project that only had two tables, dim customer and fact order. So it was based off of that. And then of course I’m consuming all of those dim and fact tables that I created within the metadata to give us all of this information.

[00:27:49] Joel Labes: Perfect. And what’s your preferred way to enforce the tests and the documentation and the source freshness have been applied to the project? Kelechi, I know you answered it a little bit in the chat, [00:28:00] but just to pop it on record a bit.

[00:28:05] Angie Brown: I’m sorry, could you ask that one more time? I’m not sure about the question.

[00:28:08] Joel Labes: How do you make sure that the tests and the documentation have been applied to the project? Do you do any laser through code review or do you have any automated process?

[00:28:17] Angie Brown: Yeah, definitely. So the projects I’ve been on, definitely, before we implement a new source into the project, we want to make sure that when we’re doing that pull request, that someone reviews it and maybe has like things that they’re checking for.

So the things that they should be checking for is in their sources YAML file that they have set up a source freshness checks, right? Whether it’s a day, seven days, a month it’s a really important thing to make sure that you’re utilizing. Again, also, you want to check for any of your dim and fact tables that you’ve set your primary keys, at least with those out of the box tests that are not null and unique.

[00:28:57] Kelechi Erondu: Yeah to add, we typically have [00:29:00] just like our best practices and your kind of bread and butter tests that you want to make sure that we apply to every project. And then I think just getting in communication with the clients that we work with in terms of understanding their business needs and converting some of those needs from like a data integrity standpoint into a custom tasks and integrating in pair.

Cool. This gives very secure as it’s very quickly, but do you have a freshness test on your artifact tables?

[00:29:28] Angie Brown: You could. I didn’t choose to do that for this presentation, but I think that would probably be an important check as well to include for sure.

dbt Learn on-demand

A free intro course to transforming data with dbt