Building On Top of dbt: Managing External Dependencies
Originally presented on 2021-12-08
This session will help people understand how they can integrate dbt and the data warehouse into the very core of their product and business.
By extending dbt’s functionality with declarative metadata, teams and companies can leverage the power of dbt to build data-intensive applications confidently.
Browse this talk’s Slack archives
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript
[00:00:00] Joel Labes: Thank you for joining us at Coalesce. My name is Joel Labes. I use he/him pronouns and I’m part of the community team at dbt Labs focusing on developer experience. Welcome to Day 3. Hasn’t it been a fun week so far? The session is called Building on Top of dbt: Managing External Dependencies.
And we’re going to hear from Teghan Nightengale at Community.com. Without stealing all of his thunder, Teghan is going to show us the results of an internal hack week project that helps make it easier for people who don’t directly use dbt to benefit from its automated tests and well-governed data. All the chat conversation today is taking place in the Coalesce dbt external dependencies channel of dbt Slack.
The channel name should be showing on your screen at the moment. If you are not part of the chat, you have time to join right now: visit getdbt.com/community and search for Coalesce dbt external dependencies when you enter the space. We encourage you to ask other attendees questions, to make comments, to react on the channel at any point throughout the [00:01:00] presentation. This is the highest density of people who will appreciate your bad data jokes that you’re going to find anywhere, so make the most of it. After the session, Teghan will be available in the Slack channel to answer your questions, but again, get them flowing in throughout the session. If you throw a red question mark at the start of your message, it’ll be easier to make sure that it gets an answer. And we’ll probably do some live Q&A at the end of the session if we’ve got time. So let’s get started. Over to you, Teghan.
[00:01:24] Teghan Nightengale: Thank you so much, Joel. Really appreciate that introduction. I’m happy to be here. Happy to be chatting to everyone today. A little bit about what we’ve been doing at our company, Community, and how it’s been going. So the title of today’s talk is Building on Top of dbt: Managing External Dependencies. And let’s just dive right in here. Or actually before we get further, I just want to give more of a quick background on myself.
[00:01:54] Defining the dependency problem
[00:01:54] Teghan Nightengale: Joel and I are a bit of a yin and yang in some ways, because he works with the community team at dbt Labs, [00:02:00] and I work on the dbt team at Community.com. Community.com is a one-to-many texting platform for influencers and celebrities that allows them to expose a public phone number and then get inbound and outbound text messages and get clustering on their inbound messages, so they can hold a lot of conversations at the same time. And myself, I’m a data engineer by trade and we’ve done a lot of great work with dbt and we love it at Community. So we’re excited to share what we’ve been thinking about in this problem space and give you guys more context.
So I wanna start by outlining the problem that we’re working on that we’re seeking to solve, which we call the dependency problem, which can basically be outlined as companies are increasingly using dbt to define their data transformation layer. And it’s becoming the source of truth for [00:03:00] a lot of their business and their analytics.
And it’s typically exposed in BI tools outside of the data warehouse. But then there’s this desire and this need and this emerging space, at least at first, to export some of those transformations and some of those metrics into external platforms like Salesforce or other sorts of SaaS platforms.
And that typically introduces a little bit of a risk, because when you define models in dbt, they’re lovely transformations. You have a final, beautiful layer of fact tables and tasks or maybe very wide tables. And you’re excited about all the crunching of the numbers that you’ve done to get that into that state.
And you know that it’s a source of truth for your business. Now you want to build something else on top of it using the information that is stored in it. By default dbt doesn’t really [00:04:00] understand that you’re building anything on top of it. So say you have a use case, and this is what I’m actually gonna show later where you’re building an internal Slack client.
You want to send very specific configured alerts to certain people in your company when a dbt test fails. And so if you’re building that and you’re maybe just going about it in a naive way, just taking a first pass at it, you might think, oh I guess I’ll have to define in the service, what kind of tests?
I want to get alerts on when they fail. But this is exactly what we’re talking about when we say the dependency problem. Say one day an analyst goes in and removes a test that they think is no longer necessary or was failing too often. All of a sudden, if your downstream tool or downstream client was hard-coded to be configured off of that particular test it’s gonna break and your engineering team’s gonna get alerts and they’re gonna be like, why is this service down?
And they’re going to have to go talk to the analyst, [00:05:00] and the analyst had no idea. They’re blameless. They’re just trying to optimize their day-to-day work. And this is a very specific example, but the problem really takes all forms, whether it’s tests or models. The things that you’re defining in your dbt transformation layer, you want to expose those things in other downstream clients and tools. But dbt again by default is not aware that there are these downstream dependent tools.
So if we think about how this problem is being solved today, the answer is that it’s actually being solved very rapidly for specific use cases by software-as-a-service platforms and other wonderful tools, like Lightdash and Hightouch, just to mention a few, that are finding ways to integrate more tightly with dbt in order to define the behavior of their product.
They are basically situating themselves downstream of dbt to [00:06:00] varying degrees and using different features of dbt. But fundamentally they recognize that it’s a very powerful sort of configuration language for the data transformation layer, and they are finding ways to tap into it in order to ensure that their product is behaving as expected when the transformation layer changes, like when you alter a model, alter tests, anything like that.
So this is very much an emerging pattern that is being seen right now for paid products. But the wonderful thing is you can also bring these principles into your own work when you are building things internally or open-source. So an example is really the power of this particular config that exists in the dbt YAML called the meta config. So the meta config basically allows anyone to embed an arbitrary JSON/YAML schema on a variety of objects in dbt, [00:07:00] including a test, including a model, including a column. So it’s very powerful in a lot of ways. And so what you’re seeing here is a little example from our Salesforce staging layer in our repo at Community, where basically we have this unique test, and this is a problem that was arising because sometimes there are data generation processes, like something in Salesforce, something in a sheet, that are feeding into your transformation layer. And when a test fails, it’s not enough to just know that it fails. You need someone to take specific action and you don’t want to have your analysts have to manually message people all day: Hey, can you take a look at this sheet? See what was entered, please fix it. So we decided we needed a better solution.
So this is an example of us attempting to configure a Slack client that will send out specific notifications to specific people with custom messaging based on this embedded portion here that you see under the [00:08:00] meta config.
So we’re calling this a configuration schema and you can see, it’s for us, it’s just this custom keyword and then two keys with their values, a list of recipients, and a message for the potential message. So I want to just reiterate everything that I just mentioned on the meta config, make sure you guys really are dialed into how powerful it is.
[00:08:26] Use the meta config
[00:08:26] Teghan Nightengale: So it is an arbitrary config that can hold essentially anything you want. We are calling the things that we’re putting into it configuration schemas, and a configuration schema is basically a codified contract of a specific structure that a downstream tool can ingest and take action based on. And if you’re not already aware, the amazing thing about putting in this meta config and embedding configuration schemas into it is that these structures will be [00:09:00] compiled into the JSON artifacts, at least the manifest JSON artifact that dbt provides. And so if you’re already familiar with that, you might be getting really excited, but you might also have some questions.
Okay, I put a schema under this config and okay. I compile my dbt project and now I have this JSON. How is that helping me? Like where does a downstream tool access that configuration? How is it using it? And so that’s where this hack week project that we had comes into play.
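As a rough illustration of where the embedded schema ends up, the sketch below reads a heavily trimmed, made-up manifest and collects the `meta` block from each node's `config`. The node names, the `slack_test_alert` key, and its two fields are hypothetical stand-ins for the schema shown on the slide; a real `manifest.json` carries many more keys per node.

```python
import json

# A heavily trimmed stand-in for target/manifest.json. The node names and the
# "slack_test_alert" schema are hypothetical, chosen to mirror the talk.
MANIFEST_TEXT = """
{
  "nodes": {
    "test.my_project.unique_stg_accounts_id": {
      "resource_type": "test",
      "config": {
        "meta": {
          "slack_test_alert": {
            "recipients": ["@sales-ops"],
            "message": "Duplicate account IDs found, please check Salesforce."
          }
        }
      }
    },
    "model.my_project.stg_accounts": {
      "resource_type": "model",
      "config": {"meta": {}}
    }
  }
}
"""


def collect_meta(manifest: dict) -> dict:
    """Return {node_id: meta} for every node with a non-empty meta config."""
    found = {}
    for node_id, node in manifest.get("nodes", {}).items():
        meta = node.get("config", {}).get("meta") or {}
        if meta:
            found[node_id] = meta
    return found


manifest = json.loads(MANIFEST_TEXT)
meta_by_node = collect_meta(manifest)
for node_id, meta in meta_by_node.items():
    print(node_id, "->", sorted(meta))
```

In practice the text would come from the artifact file dbt writes after a run rather than from an inline string.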
[00:09:37] Accessing a configuration schema
[00:09:37] Teghan Nightengale: So we built a nice little searchable JSON store that we’re calling Metahub, and we’re excited to be open sourcing it in the near future. But essentially it’s a way to parse through the JSON artifacts that dbt is creating for you and pull out specific things like those configuration schemas that can configure any kind of [00:10:00] client or tool built in anything you want, whether it’s Python, just Bash, whatever. With that configuration schema, you can build a client off of it to do what you need it to do. And then of course, the other amazing option for people who are already on dbt Cloud is that they offer the metadata API. And so that is essentially a much easier to spin up kind of solution for a paid customer to search through these artifacts and maybe pull out that particular configuration schema that they need to use. So for example, you could build a client that pings the metadata API, pulls out what it needs, and takes action based on that.
[00:10:48] The Metahub project
[00:10:48] Teghan Nightengale: But I want to just share a little bit more about this Metahub project, which really came together during a hack week a couple of weeks ago. And it’s basically a very light store, it’s dockerized, it’ll be [00:11:00] easy to spin up with a Postgres backend, and it’s nice because it makes it very easy to extract only the particular configuration schemas that you’re interested in extracting for a given client. If you’ve ever looked at the manifest JSON artifact you’ll know that it’s massive. If you’ve opened it in an editor, you might’ve crashed your editor a few times. It can be quite big. Rather than storing things in AWS or just pulling in the entire JSON artifact, this is a nice way to just push it to a service and then have a query language that lets you pull out the very particular configuration schemas that you’re interested in. And the language that we’re exposing to do that is a JSONPath query interface, which is very standardized and well-documented in the Postgres documentation.
So it’s very very easy to get going. This is a little screenshot of just a quick make command, spinning it up locally on Docker, and then checking it. [00:12:00] So I want to give you guys some more details on this example and revisit our earlier point about the Slack client that we configure.
So if you’re running dbt Core, this is an example of what our run script essentially could look like, or what your run script could look like. So you do a run to refresh your project. You have a one-liner here where you’re basically pushing the artifact that’s been generated as a result of that run, in this case, the manifest, which stores the entire state of the configuration.
And then you can essentially, this is a locally served option, but you can of course serve and expose Metahub anywhere that you would serve a dockerized app. And so this is just an example. Simple. It’s very easy at the end of your run to just load your manifest, your artifact. And by the way, however you want to structure this data store and fetch things out of it, it’s very easy to do. This is simply just pushing that manifest into [00:13:00] that API so that it can be searched over later and extracted by clients. And so then this is an example of the search path syntax that makes it really easy to fetch out exactly what you want. Like I said, the manifest can be fairly large.
And so this is an example of what we saw earlier: that configuration in the YAML was put into the manifest artifact, pushed to Metahub, the API data store. And then this is a simple REST API. So it’s just a GET and you’re giving essentially a path. Oops, sorry. You’re giving a path. So basically, this is the path language.
And so it’s very easy to search through all the intricacies of the artifact and extract, say, this particular thing. So what we’re doing, because there’s only one artifact, it’s a very simple path, but we’re going into nodes, all of them, into their config, and then we’re finding the [00:14:00] meta tag, and we’re pulling out what’s under that. In this case, it’s the Slack alert for this particular example, but you could say, find me the tag with this configuration schema, like Slack test alert.
So as you can see, it’s very easy to extract exactly what you want to use for your downstream clients. And then of course this makes it very easy to configure their behavior.
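The path described here (into nodes, all of them, into their config, then meta) can be approximated in plain Python. This is a minimal sketch of a wildcard path walker, not Metahub's query engine, which has not been published; the dotted path syntax, the `walk` helper, and the `slack_test_alert` key are all assumptions made for illustration. In Postgres's SQL/JSON path language the same idea would be written roughly as `$.nodes.*.config.meta.slack_test_alert`.

```python
def walk(document, path):
    """Return every value reached by a dotted path; '*' matches all keys.

    Hypothetical helper: node IDs contain dots themselves, so this simple
    syntax can only address nodes through the '*' wildcard.
    """
    matches = [document]
    for step in path.split("."):
        next_matches = []
        for value in matches:
            if not isinstance(value, dict):
                continue  # cannot descend into scalars or lists
            if step == "*":
                next_matches.extend(value.values())
            elif step in value:
                next_matches.append(value[step])
        matches = next_matches
    return matches


# Trimmed, made-up manifest with one configured test and one plain model.
manifest = {
    "nodes": {
        "test.my_project.unique_stg_accounts_id": {
            "config": {
                "meta": {
                    "slack_test_alert": {
                        "recipients": ["@sales-ops"],
                        "message": "Duplicate account IDs found.",
                    }
                }
            }
        },
        "model.my_project.stg_accounts": {"config": {"meta": {}}},
    }
}

alerts = walk(manifest, "nodes.*.config.meta.slack_test_alert")
print(alerts)
```

Nodes without the schema simply drop out of the result, so the caller only sees the blocks it knows how to act on.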
[00:14:21] Teghan Nightengale: So on the front end, what this looks like for us is we have a small Slack client Python script that we basically run. And this now is able to be configured off of this schema, embedded into the model YAML and the test YAML. And this is really powerful because usually, if you have an engineer build something like this then, one, you have the dependency problem.
So if something changes underneath them and the service is looking for a model or a test that doesn’t exist anymore, all of a sudden it’s broken. [00:15:00] But if you build your services and your clients to be configured off of these sort of configuration schemas that you embed in dbt, it makes it really flexible to empower your analysts to change the behavior of those clients based simply on what they are doing.
So for example, an analyst could embed this Slack test alert configuration schema on a new test and say, I want this message to be sent with it, and I want this person to receive it. And they can do that in a way that they’re very familiar with, with just a few lines of YAML. And this really allows empowering your analysts to the next level, because they now understand that they can control what’s happening downstream of dbt and they’re completely aware of it too. So they’re like, oh, I see. This test is being used by the Slack alert. I know that if I delete this YAML file, [00:16:00] then that’s not going to be sent anymore. So they’re very in tune with all the things that you have in your stack that are starting to rely on the transformations that you’re doing in dbt.
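A client driven this way might look something like the sketch below. It is not Community's Slack script, which is not shown in the transcript; it only illustrates the idea that the client reads its behavior from the embedded schema, so a removed or unconfigured test produces no alert instead of an error. The `slack_test_alert` key, the `Notification` type, and the node names are hypothetical, and nothing is actually sent to Slack.

```python
from dataclasses import dataclass


@dataclass
class Notification:
    recipient: str
    text: str


def build_notifications(manifest, failed_test_ids):
    """Turn failed test IDs into notifications, using only what dbt declares."""
    notifications = []
    for test_id in failed_test_ids:
        node = manifest.get("nodes", {}).get(test_id)
        if node is None:
            # The test was removed from the project: nothing to send, nothing breaks.
            continue
        meta = node.get("config", {}).get("meta") or {}
        alert = meta.get("slack_test_alert")
        if not alert:
            continue  # the analyst did not ask for alerts on this test
        for recipient in alert["recipients"]:
            notifications.append(
                Notification(recipient, f"{test_id} failed: {alert['message']}")
            )
    return notifications


# Made-up manifest: one test with the schema embedded, one without.
manifest = {
    "nodes": {
        "test.my_project.unique_stg_accounts_id": {
            "config": {
                "meta": {
                    "slack_test_alert": {
                        "recipients": ["@sales-ops", "@data-team"],
                        "message": "Duplicate account IDs, please check Salesforce.",
                    }
                }
            }
        },
        "test.my_project.not_null_stg_accounts_id": {"config": {"meta": {}}},
    }
}

failed = [
    "test.my_project.unique_stg_accounts_id",
    "test.my_project.not_null_stg_accounts_id",
    "test.my_project.removed_last_week",
]
notifications = build_notifications(manifest, failed)
for n in notifications:
    print(n.recipient, "|", n.text)
```

Because the recipients and message live next to the test definition, deleting the test or editing its YAML changes what the client does without touching the client's code.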
So we’re very happy with this. We’re very excited to empower our analysts in this way and give them the optionality to embed this kind of stuff. And if we think about what’s next in this area I think that what I just showed about this configuration schema is great for a simple Slack client with maybe two fields like a recipient and a message, but very quickly you could see how it’s going to be a huge headache if you have downstream tools and clients that are expecting a certain data structure from this JSON that they’re fetching out and they try to parse it and it doesn’t match the contract of what it should be, it’s going to break the client as well.
So I think, an emerging area and something that [00:17:00] I’m excited to look into more is how can we validate these custom sort of structures and schemas that we might want to embed into our dbt metadata, into our artifacts. I think that yeah, there’s already some amazing like standards and tools for doing this, it’s just a matter of building a little bit of some lightweight CLI stuff on top of them in order to apply it specifically for the dbt case.
So I think, yeah, it’ll be very important if you want to continue to rely on this pattern in order to de-risk your dependencies, that you can assert that the configuration schemas you’re embedding match the exact sort of contract that the downstream clients are expecting.
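One way to picture that assertion step is a small check that runs before the manifest is published and reports every way an embedded block departs from the contract. The talk does not name the standards it has in mind (JSON Schema would be one candidate), so this sketch uses plain Python; the `slack_test_alert` contract of a non-empty `recipients` list plus a non-empty `message` string is an assumption based on the two fields described earlier.

```python
ALLOWED_KEYS = {"recipients", "message"}


def validate_slack_test_alert(alert):
    """Return a list of problems; an empty list means the block matches the contract."""
    if not isinstance(alert, dict):
        return ["slack_test_alert must be a mapping"]
    errors = []
    recipients = alert.get("recipients")
    if not isinstance(recipients, list) or not recipients:
        errors.append("recipients must be a non-empty list")
    elif not all(isinstance(r, str) and r for r in recipients):
        errors.append("every recipient must be a non-empty string")
    message = alert.get("message")
    if not isinstance(message, str) or not message.strip():
        errors.append("message must be a non-empty string")
    unknown = sorted(set(alert) - ALLOWED_KEYS)
    if unknown:
        errors.append("unknown keys: " + ", ".join(unknown))
    return errors


good = {"recipients": ["@sales-ops"], "message": "Please check the sheet."}
bad = {"recipient": "@sales-ops", "message": ""}  # misspelled key, empty message

good_errors = validate_slack_test_alert(good)
bad_errors = validate_slack_test_alert(bad)
print(good_errors)
print(bad_errors)
```

Run in CI against every node's meta block, a check like this would surface a misspelled key at review time rather than when the downstream client fails to parse it.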
[00:17:47] Let’s recap
[00:17:47] Teghan Nightengale: Okay. So if we want to summarize what we’ve talked about today, I think that at a high level, more and more tools are being built on top of [00:18:00] dbt and that could be internal, it’s also external. And when they are being built, they need to understand and be connected with changes in the underlying configuration of dbt.
That’s really what we started with, the dependency problem. And a lot of different tools that exist right now are doing this in different ways. One way that’s very easy and straightforward from what we’ve seen is the meta config, and that is just so flexible. You can really put any kind of structure that you want into it, including, I think, putting a macro as the structure, right? You don’t even have to put all that in your YAML. You can just define a macro with the arguments and embed that in your YAML as well. So I think there’s a lot of ways that it can be very clean, very powerful for alerting and configuring the downstream clients of changes.
And then obviously when you create a [00:19:00] configuration schema and you are using it to configure a client, you need some way for that client to fetch it and to actually take action on it. So the dbt Labs metadata API is a wonderful tool for this. And then there’s the open source solution of Metahub as well that I mentioned where we’re very excited to get towards releasing that for the community use.
We’ve really enjoyed it. So yeah, very excited. And oh, we’re glimpsing into the future. So I just gotta put on my little future glimpsing glasses here. They help me see a little bit further and maybe we’ll just, yeah, I dunno, we’ll just get the vibe going, bring it up a notch.
This really ties into a lot of what we’ve been talking about in this conference so far, where we know that this is a big wave, right? We heard Tristan and Martin’s talk. And I think that really, what they’re touching on is it’s big because dbt has given [00:20:00] us, the community, the foundations to build the next generation of tools, because they are essentially providing a configuration language for your data structure. And what we’ve talked about today, solving the dependency problem, is really just a way to tap into that configuration language and to de-risk the tools that you want to start building as we continue expanding off of the wonderful work that they’ve done. It’s pretty awesome. I cannot say enough how grateful I am for all the wonderful work that everybody at dbt Labs puts in every day. And the fact that dbt Core is such an amazing open source product still, the fact that it exposes all these artifacts in order to make it easy. They are really empowering the community to take the next steps and keep building more and more on top of this. And so I hope today’s talk has been [00:21:00] useful for you to get a little insight into the future and understand that this is a great sort of option to solve that dependency problem and to understand how you can keep building and keep growing the wave.
So thank you so much. I’ve really enjoyed chatting with you guys today. I think we’re going to do a few questions and have you up on the channel.
[00:21:24] Joel Labes: Yeah, thank you so much, Teghan. That was phenomenal. The shades are getting a great reaction in the chat. But alongside those, we do actually have some really good questions.
So we’ll see how many of them we can get through. And please forgive me if I mangle any names. But first off, why not use the jq library or PowerShell’s ConvertFrom-Json, as opposed to sending the manifest to the cloud and querying it from a database?
[00:21:51] Teghan Nightengale: Yeah. Great question. You’re definitely free to do that. I would say, if you want to [00:22:00] use jq over AWS, if you’re already setting up artifacts in a cloud storage bucket and you’re comfortable using jq and that syntax, then that’s great. We found that the syntax provided by the Postgres backend, the, what is it, JSONPath SQL, was really dynamic and powerful and easy and nice for searching not just within an artifact, but across artifacts over time. So you could say, fetch me all the failed tests in the past week, and really cool stuff like that. So it’s just a nice little lightweight layer for us rather than saying, okay, we have to push this to AWS and then try to search through multiple files in AWS for particular things.
It was just like a nice convenience. And I guess if we open source it, you can decide if you want to use it.
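The "failed tests in the past week" query can be sketched without any store at all, which also shows what a store saves you from writing. The example assumes a list of already-loaded `run_results.json` documents, each with a `metadata.generated_at` timestamp and a `results` list of `unique_id` and `status` entries; the sample data and the `failed_tests_since` helper are made up for illustration and are not part of Metahub.

```python
from datetime import datetime, timedelta, timezone


def failed_tests_since(run_results_history, now, window=timedelta(days=7)):
    """Return the sorted IDs of tests that failed in any run inside the window."""
    cutoff = now - window
    failed = set()
    for artifact in run_results_history:
        stamp = artifact["metadata"]["generated_at"].replace("Z", "+00:00")
        if datetime.fromisoformat(stamp) < cutoff:
            continue  # run is older than the window
        for result in artifact["results"]:
            if result["status"] == "fail":
                failed.add(result["unique_id"])
    return sorted(failed)


# Made-up history of three runs: one old failure, one recent failure, one clean run.
history = [
    {
        "metadata": {"generated_at": "2021-11-20T09:00:00Z"},
        "results": [{"unique_id": "test.my_project.old_failure", "status": "fail"}],
    },
    {
        "metadata": {"generated_at": "2021-12-06T09:00:00Z"},
        "results": [
            {"unique_id": "test.my_project.unique_stg_accounts_id", "status": "fail"},
            {"unique_id": "test.my_project.not_null_stg_accounts_id", "status": "pass"},
        ],
    },
    {
        "metadata": {"generated_at": "2021-12-07T09:00:00Z"},
        "results": [
            {"unique_id": "test.my_project.unique_stg_accounts_id", "status": "pass"}
        ],
    },
]

now = datetime(2021, 12, 8, tzinfo=timezone.utc)
recent_failures = failed_tests_since(history, now)
print(recent_failures)
```

With artifacts sitting as separate files in a bucket, this loop would also have to list and download each one, which is the overhead the answer above is contrasting with a single query against a store.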
[00:22:57] Joel Labes: That’s the magic really, isn’t it? So Simon [00:23:00] Patatski asks how this differs from exposures, and then Blake Birch leans into that a bit further and asks, are you able to build similar downstream functionality of doing a POST request to a URL through exposures, or does it require the setup you showed with the meta field?
[00:23:17] Teghan Nightengale: I guess I’ll take that in two parts. So first with exposures, I didn’t want to touch on exposures earlier, just cause I didn’t want to get people maybe derailed or less excited about this direction, or maybe confuse anyone who’s new and hasn’t heard of them. But exposures certainly are a great tool for alerting or really conveying to analysts, I think, what relies on specific things. But the reason that we weren’t so excited about using them, or that they didn’t work for the example that I gave today, is because they really just exist at the model and source level. So you’re declaring like a [00:24:00] dependency of this, rather than saying, I want to embed on this test or this column or this model. So in that way they didn’t give us all the optionality. I do believe they actually do have a meta field now and you can put meta stuff in them, but again, we were like, why do we need this additional thing that feels like work to declare and really doesn’t change when anything else changes? An exposure just feels like a log to keep analysts apprised of what’s going on with particular dependencies, but I don’t think you need to necessarily have that extra step. I think you can just embed your dependencies directly on the transformations that you’re doing and avoid having that additional downstream layer.
That was just like my impression I think like I’m always curious. It’s definitely like in the direction. And so what the point of this was is that it’s not like this [00:25:00] is the only solution to this problem. The broad problem is the dependency problem of configuring downstream tools, and there’s lots of different ways to decide to do that. If you want to do it in an exposure through something, if you want to do it in a meta config, I know like some of the SaaS products that I mentioned they do it in different ways. Like they might have their own configuration files that they want you to add your model names to.
So it really totally varies. This is the way that we were excited about solving the problem and that we think really touches on a lot of.
[00:25:34] Joel Labes: Yeah, helpful. I think the column- and test-level stuff is really important because you wouldn’t get that with an exposure.