Table of Contents

The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness

Kevin Hu is co-founder and CEO at Metaplane, a data observability startup focused on helping every team find and fix data quality problems with as little setup as possible.

Originally presented on 2021-12-08

As modern data stacks click into place around the world, data teams enable their organizations to answer a universe of questions.

Despite progress, some questions still haunt even the savviest of data teams: Is this dashboard broken? What is impacted by this change? Is this data trend normal? These questions, which are not about data per se and rather more about data awareness, are challenging to answer without data about data.

In previous generations of data tooling, metadata of this sort was elusive. But with the introduction of dbt, and specifically the dbt Cloud API, teams now have an accessible way to gain awareness about their data.

This session is a deep dive into three ways the dbt Cloud API can be used to foster a culture of greater awareness: through real time monitoring, impact analysis, and historical understanding. For each of the three use cases, we state the problem in relation to practical workflows, walk through a concrete solution (with working open sourced code! ) using the Cloud API, and present a scenario in which a real team uses this in practice.

To close, we discuss approaches for taking each solution to the next level. Viewers of the session can expect to leave with new concepts with which they can frame problems of data awareness, solutions to which they can map their own situations, and code that they can start tinkering with that same day.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Julia Schottenstein: Welcome everyone to your next session at Coalesce, The Endpoints Are the Beginning: Using the dbt Cloud API to Build a Culture of Data Awareness. I’m Julia Schottenstein and I’m part of the product team at dbt Labs. And I’ll be your host for this talk. Over the next 30 minutes, we’re going to hear from Kevin Hu, the co-founder and CEO of Metaplane.

He’ll share more about how to use the dbt Cloud API and the dbt metadata API, to add real-time monitoring and impact analysis to your data teams’ capabilities. I love this talk so much because Kevin built some awesome things using our APIs. Over the next year, we’re going to invest a lot more into our APIs so the possibilities of what you can build and do with dbt Cloud will grow significantly.

I invite you to contribute to the conversation in our dbt Slack channel Coalesce dbt Cloud API, and Kevin will answer your questions at the end of his talk. So without further [00:01:00] ado over to you, Kevin.

[00:01:03] Kevin Hu: Hi everyone. Thank you, Julia, for the great introduction. Hope everyone has had an awesome time in Coalesce so far. I know that I’ve learned a lot. I’m Kevin, I’m the co-founder of a data observability startup called Metaplane, but I’m not here to talk about Metaplane. I’m here to talk about the coolest piece of technology I’ve encountered in the past few years and one that we use a lot at Metaplane and our customers use a lot, which are the dbt Cloud and metadata API. And specifically this talk is going to show you how you can use those APIs to help build a culture of data awareness.

Now, the session is structured very simply. It’s like a sandwich. First we’re going to establish some common ground by talking about what is the problem, right?

What is the current state of data awareness? And just to skip ahead, it’s like debugging the dark. But then we’ll go into, what are the technology shifts that made us such that we have a new attack on this problem today? And the main substance of this talk is going to [00:02:00] be in the solution, or we’re going to jump right into a Jupiter notebook and talk about how you can accomplish three specific jobs using the cloud and metadata APIs.

And lastly, we’re going to wrap up with a little bit of discussion of okay, what can you do now. And also one takeaway from this talk is that you will have a link to a GitHub repo with a notebook that you can plug your API into, run it from the top and see how it looks for your organization.

[00:02:28] Debugging in the dark #

[00:02:28] Kevin Hu: So we’ve heard a lot about data becoming a product, right, data moving beyond the main vein of powering business intelligence and reporting applications, which are still very important, to powering all sorts of other use cases from machine learning and artificial intelligence, to being plumbed back into your go-to-market tools, using Reverse ETL tools to powering in product experiences. And this is awesome. More use cases, more data, more stakeholders. I’m definitely not [00:03:00] complaining but there is a drawback which is as we collect more and more data, there’s more and more surface area to manage.

Now you probably have hundreds, if not thousands or tens of thousands of tables and dashboards, and ain’t nobody going to audit all of that. So the surface area gets larger and larger, and the opportunities for breakage increase, which leads to lost trust throughout the organization. And then the question is, okay, how do people currently get awareness of data today?

And there’s this great state of the art method called EDT or executive driven testing. So if you’re a data practitioner, you’re probably no stranger to Slack alerts like this one from a head of marketing saying, Hey, data team, why does the revenue dashboard look broken? And, once you receive this alert, you know what happens: poor decision-making by stakeholders, you lose trust with the [00:04:00] organization, you have to start having a bunch of one-on-ones to start regaining that trust and you have lost time for the data organization. So whatever you had planned for the afternoon, yeah, see you later.

But let’s dig in a little bit deeper and ask, why is this question difficult to answer?

To systematically answer this question, we need to know the answer to three sub questions. What’s going wrong right now? What’s the impact? How can I fix it? Is this revenue dashboard actually broken? Is that some upstream table not get refreshed. And then how can I fix that? And to us, there’s a huge irony here, which is that as data practitioners, we provide the data for other teams to make informed decisions.

[00:04:49] Metadata is data for the data team #

[00:04:49] Kevin Hu: But ironically, we often don’t have the data to make our own informed decision. And that data about data is metadata. And one takeaway [00:05:00] from this talk is that metadata is like data for the data team, right? Without metadata or information about how your data systems are running about the state of your data, it’s hard to know what exactly to work on and what the impact of your work is and whether or not everything is in the state that you want. And in previous generations of data tooling, this metadata was pretty difficult to get your hands on. I’ll take what I can get, but it was hard to accomplish the jobs that we want to accomplish until recently.

[00:05:31] dbt and the cloud and metadata APIs #

[00:05:31] Kevin Hu: And dbt, of course, we all know and love dbt but you may not know that there are two amazing APIs that dbt provides. One of them is the cloud administrative API and it’s like a restful API for doing CRUD operations on jobs, runs, run artifacts. You can imagine starting or deleting a run. And then there’s a metadata API and the metadata API is super exciting.[00:06:00]

I know Julia mentioned that there’s a lot of work going in this direction where it’s a one-stop shop, a GraphQL API for metadata for every type of nodes: models, sources, seeds, tests, you name it. I’m very excited to see how this unfolds in the coming months. And so to tie it back, right? Those three questions correspond to three jobs that we want to accomplish.

One of them is real-time monitoring. The second one, what is impact analysis? And the third is how do I fix it? Historical understanding. And with that, let’s jump into my handy dandy notebook and let’s code. I’ll give you all a second to, type in mediplan.dev/coalesce, or you can go straight to the GitHub repo.

[00:06:43] Let’s code! #

[00:06:43] Kevin Hu: It’s a very humble repo. There’s only one notebook but I just want to make sure that, you don’t have to frantically take notes or look at the code that I’m writing. You can follow along by yourselves. So I’ll give you a couple of seconds to do that.

[00:07:00] Okay, let’s do it. So here we are in my trusty notebook we’re going to start from the top and say, okay, import some packages I’m going to be using later. I’ll go into depth about what these packages are, but to summarize it’s tools for retrieving information from an API, manipulating data and visualizing data.

Now we’ve gone to the setup. These are two very important endpoints. We just tie them to the cloud API base here, all the metadata API base URL. And here is the auth section. Now I didn’t put my own key, but if you do want to retrieve your own key, it’s very simple. And at least for a cloud user like me go here, going to dbt, my profile, API access. There’s your API key.

And I’m not going to show you mine. Just take that, plug it in here and you’re good to go. Here’s a little bit of code to make sure that our graphs look pretty and we’re going to [00:08:00] jump right into the first job, which is real-time monitoring. What is going on right now. So the setup is you have dbt jobs running periodically in the cloud, whether it’s for your own development purposes or you want to include, some metadata into a pipeline that programmatically answer questions like what’s the status of my jobs right now.

And what we’re going to build is this, a pretty simple, a heat map of the status of all of your jobs by the date, for different hours and the status of the job. So like when it’s canceled, a couple are in progress. And the second question is what are potential model bottlenecks?

And we’re going to build this really nice Gantt chart showing model execution times lined up so you can help find some of those bottlenecks.

Let’s dive in. So there’s a couple of API calls that we want to get out of the way. First, the first one is the called API slash accounts. Like we need to get our account [00:09:00] ID basically. So we hit the slash accounts with our auth headers and we got a list, here’s one trial account, but the main one is the 6494, the Metaplan ID.

I signed it to my account ID. Let’s get the jobs for that account. I run that and we have two jobs running on Metaplane. One of them is an hourly dbt job run. The second is a daily job for this talk. We’re going to be focusing on the hourly job. So we’re just going to pull out that job ID and assign it to this variable.

And then here, we’re going to get all the runs in that particular job. So we have a /accounts /runs and note that here we pass in this data payload, which says, okay, give me the runs for this one job ID order of by the created time created at time. So reverse chronological order and limit it to let’s say 150 runs, [00:10:00] arbitrarily.

Here we go. We have, this is just the first five of 150 where every single run has an ID. We know like when it was run, when it started the duration of the run and the status of the run, for example, whether it was canceled or complete, or if it errored out. And here, we’re going to do a little bit more manipulation.

And finally visualize the run statuses over time. So I want to pause here for one second and point out that I’m using this Python library called Altair. It’s a beautiful library for declaratively creating visualizations. I’ll go into a little bit of detail here, where we passed in a data frame with the run statuses and across the X axis are the different hours. Across the Y axis is the date, and the color [00:11:00] is determined by the status. I don’t have to do this procedurally, it’s all boom, it’s a json. And here is the visualization and we can see okay, for this PA for the past 150 runs some were canceled, some errored out, but five of them are in progress, which leads me to think, perhaps there’s some queuing behavior going on. Whether my Snowflake warehouse has a long running job that’s slowing down the rest of the queries, or I made a mistake and, pushing bad code to the DAG. One followup question might be okay if it is the latter case, what could be going wrong with this particular job.

And here. We’re going to get all of the models for that one job. And we’re using the metadata API for the first time. And what’s super cool about it is that it’s a GraphQL API, meaning that instead of having slash runs to get the [00:12:00] runs slash products, to get projects, you pass in a json object, that specifies exactly what obviously you want to get back and the attributes within those objects.

So for us, we care about the models. And for this one job, the name, the unique ID, the execution time and so on of those models. So let’s get that and we can see, okay, here’s five of the models within my job as some aliases and so on and let’s visualize it. So this is good code just to manipulate the data a little bit and organize it, and there you go. Let’s zoom out a little bit where you have across the y-axis all of the models within our DAG. And it’s a pretty small DAG and across the X axis is the execution start. And you can see when it started, when it ended and the total execution time, and maybe, this best guest mapping model [00:13:00] is taking a suspiciously long time, just the place to start jogging some thoughts.

Just to recap what we did in that section. We used the dbt Cloud API to retrieve metadata about your account jobs and job runs. We displayed job run statuses in a heat map, and we used a metadata API to create the data for this nice Gantt chart. Some future work that you can do here is one, you can store that information back into your data warehouse, so you no longer have to call the API.

You have all the data there, ready. You can display it in a BI tool and you can maybe programmatically surface some bottlenecks, but that’s the first job to answer the question of what’s going on right now. For job number two, of impact analysis. We know that when you’re writing code, it can sometimes feel like punching out a block on a Jenga, when you’re playing Jenga. I don’t [00:14:00] know what’s going on, but let’s just hope for the best. And one particular question is, okay, you’re changing existing models and a dbt DAG.

You might want to know what the downstream impact is of those things. What we’re going to do is use the cloud API again, to retrieve the manifest, to reconstruct your DAG programmatically, and then say, okay, for any given model, can I get all of the successors of downstream nodes or all of the predecessors and the upstream nodes.

And given that we model the DAG programmatically, let’s say that you’re trying to refactor your jobs. Warehouse consumption is increasing for some reason. I don’t know. Where are some good places to start? And from there you might want to get models organized by the number of downstream dependencies.

And this is what this histogram is showing you, but let’s pop in there. So here, we’re just creating a little widget. Let’s select the latest run ID. [00:15:00] And to give you a sense, right, now we’re using the /account /run. So for one given run, give me all of the artifacts. And I love this endpoint because here, you have all of the artifacts that you want a list of them, your catalog, the compiled SQL files, your manifest.

I feel like Indiana Jones right now. I can pick whatever I want. And in particular, we care about the manifest by json, so same endpoint, but you get one particular manifest. And within that data we have for one job, all of the nodes, all of the sources and critically the map of parent-child relationships Gantt chart within.

What we’re going to do is, runs for this code and create a network X. Network X is a Python library for creating and manipulating networks. And we’re going to put it into your diagraph, a directed graph, [00:16:00] iterate through there, and here we’re just going to visually. So here is our humble DAG at Metaplane where in green, our sources and in blue are some of our models.

And you can see how everything is laid out. What’s at the beginning or what’s at the end and okay. To be fair, you can get this for free and much better in your dbt docs. But the benefit is that now we can manipulate it programmatically to get the successors and predecessors, like I mentioned before. So let’s create these functions to say, okay, give him my graph.

And one selected model, mark all of the downstream dependencies or mark the upstream dependencies. Does it create that? But let’s make another widget. I love widgets and pick tags to exclude. So here we might want to care about what are the successes. [00:17:00] And there you go. What we just did was we took the model that we just selected tags to exclude, it’s marked in magenta.

And now we’re highlighting all of the downstream dependencies, right? Not just one hop, but all the way down and all the way down to the final users model that’s being used within RPI dashboards. We could flip that around a little bit and say, okay, what are all the predecessors of a downstream. And let’s say that we care about final groups, which is right here.

We run that again. And there you go. Here’s the final groups like the central focal node and all of the upstream dependencies. So this is particularly useful when you have a pretty hairy DAG that you can’t like kind of eyeball and you want to know, okay. Just give me that. And I will be like do my due diligence to make sure that everything is in sound shape after I make some changes.[00:18:00]

The last thing is visualizing the number of descendants. Where are we doing? A couple of more manipulations. And here’s a list of all my models. Green is source, blue is a model, sorted by the number of descendants. If I want to start, refactoring, let’s say that the DAG is slow again. Here is a good place to start because if I make an optimization here, they’ll propagate all the way down.

But just to recap, we use the dbt API is to retrieve the DAG of a specific job. And we visualize as a network, cut the descendants, cut the predecessors. Use that network to understand a number of descendants of each model. What’s next. You can go so ham on some deeper network analysis here, if you want to calculate centrality is different nodes.

Now you can integrate other APIs to programmatically, create exposures, and maybe you can parse core [00:19:00] history to understand it. But just to wrap it up. Here’s the third and final job. We go beyond understanding what is happening right now to understanding the DAG to understanding what happened over time.

So as you increase the size of resource tables and the complexities, your DAG, our job is starting to take longer. We’re going to create this, job run duration chart over time. And two, if job runs are increasing, which specific models are taking more time. And we’re going to create this. So let’s jump ahead a little bit because we’re using all of the APIs that we already got to know a little bit here.

There you go, where here’s a line chart showing for the past hundred runs of this hourly job. The run duration increased from about 10 minutes to 15 minutes, and we don’t like that. And the followup question might be okay. How do I understand [00:20:00] which particular models in this job contributed to that?

Here this end point takes a little bit long. So I already ran it for us, but we were, we use the artifacts endpoint and the run results by json to retrieve the model execution start times and execution times for the past 100 runs and where we see from there. Okay. Now this is like the Gantt chart that we saw up top, where at one point in time, we got the waterfall of different models here. We’re visually not a waterfall kind of overtime for the past 100 runs. And we picked out the five longest running models on average, just to keep it clean and we can see the, of those five. Okay. This best guess mapping model really hiked up from around like less than 10 seconds to over a minute and that’s increasing [00:21:00] over time. And the downstream model, which we saw before a final users also increased, not too much, but there’s something yet. I’m not sure what’s going on there, with those three clues, you can sit down in your thinking chair and we can figure it out.

But to recap that section we use the dbt Cloud API to understand job run durations over time both by on the top level and broken down by model. What’s next. You can start adding some anomaly detection to understand. Okay. Can I get an alert every time that there’s a spike in the job run duration?

And maybe I can introduce, integrate those alerts with existing BI and alerting systems, but yeah, that’s the end of the notebook. Hopefully y’all tried to dial in GitHub but let’s pop back to the slides to recap. What’s going wrong right now. Real-time monitoring. What’s the impact. Impact analysis, how do I fix it?

Historical understanding. And all of that was just to give you a little taste [00:22:00] of what you can do with the API. You can think of it more of giving you a place to jump off from when you did decide to explore for yourself. That’s for what’s next. There are so many use cases. It’s very exciting. One common one is change management, whether you’re migrating from one ETL tool to another, or you’re migrating from warehouses from one warehouse to another. You want the metadata to understand that your migration is going smoothly. Two is a data discovery. Now you have all the metadata about your DAG. Perhaps you want to integrate that into a third party or a home homegrown data discovery tool like a catalog.

Three is you might want to trigger other workflows or automation. And for now, this is still within your team, right? The whole organization can’t be using this notebook, but if you store that metadata back into your warehouse and use the visualization tools that we already know and love you can provide awareness of the product.[00:23:00]

We also, today only went through basically the Git requests. But there’s plenty that you can do by way of creating, updating, triggering, and deleting jobs. And now you can experiment for yourself. So just to summarize, ironically data teams frequently don’t have the data to answer our own questions.

The dbt Cloud and metadata APIs have that metadata. And you can use that metadata in useful ways to increase data awareness. Here’s the GitHub repo. I’ve loved every talk at Coalesce so far, but one in particular by Jonathan and Kevin from Snapcommerce yesterday about observability within dbt.

I learned a lot from this talk and it’s a great pairing with what you just heard. And with that, I’d like to thank the dbt team for organizing yet another amazing Coalesce, a special shout out to Barr, my speaker advocate and PM on the metadata team. I only use the API, she’s building it. She and her team are doing an amazing job.

I’ll definitely follow [00:24:00] her. And with that, we use the dbt APIs, our customers use them and you should also use them. So I’m happy to answer any questions in the Slack. Thank you for tuning in.

dbt Learn on-demand

A free intro course to transforming data with dbt