Table of Contents
- No silver bullets: Building the analytics flywheel
- Identity Crisis: Navigating the Modern Data Organization
- Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- Down with 'data science'
- Refactor your hiring process: a framework
- Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- To All The Data Managers We've Loved Before
- From Diverse "Humans of Data" to Data Dream "Teams"
- From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- New Data Role on the Block: Revenue Analytics
- Data Paradox of the Growth-Stage Startup
- Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- Keynote: How big is this wave?
- Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- The Future of Analytics is Polyglot
- The modern data experience
- Don't hire a data engineer...yet
- Keynote: The Metrics System
- This is just the beginning
- The Future of Data Analytics
- Coalesce After Party with Catalog & Cocktails
- The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- Built It Once & Build It Right: Prototyping for Data Teams
- Inclusive Design and dbt
- Analytics Engineering for storytellers
- When to ask for help: Modern advice for working with consultants in data and analytics
- Smaller Black Boxes: Towards Modular Data Products
- Optimizing query run time with materialization schedules
- How dbt Enables Systems Engineering in Analytics
- Operationalizing Column-Name Contracts with dbtplyr
- Building On Top of dbt: Managing External Dependencies
- Data as Engineering
- Automating Ambiguity: Managing dynamic source data using dbt macros
- Building a metadata ecosystem with dbt
- Modeling event data at scale
- Introducing the activity schema: data modeling with a single table
- dbt in a data mesh world
- Sharing the knowledge - joining dbt and "the Business" using Tāngata
- Eat the data you have: Tracking core events in a cookieless world
- Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- Batch to Streaming in One Easy Step
- dbt 101: Stories from real-life data practitioners + a live look at dbt
- The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- Implementing and scaling dbt Core without engineers
- dbt Core v1.0 Reveal ✨
- Data Analytics in a Snowflake world
- Firebolt Deep Dive - Next generation performance with dbt
- The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- dbt, Notebooks and the modern data experience
- You don't need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- Git for the rest of us
- How to build a mature dbt project from scratch
- Tailoring dbt's incremental_strategy to Artsy's data needs
- Observability within dbt
- The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- So You Think You Can DAG: Supporting data scientists with dbt packages
- How to Prepare Data for a Product Analytics Platform
- dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- Upskilling from an Insights Analyst to an Analytics Engineer
- Building an Open Source Data Stack
- Trials and Tribulations of Incremental Models
Building On Top of dbt: Managing External Dependencies
This session will help people understand how they can integrate dbt and the data warehouse into the very core of their product and business.
By extending dbt's functionality with declarative metadata, teams and companies can leverage the power of dbt to build data-intensive applications confidently.
Follow along in the slides here.
Browse this talk's Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
[00:00:00] Joel Labes: Thank you for joining us at Coalesce. My name is Joel Labes. I use he/him pronouns and I'm part of the community team at dbt Labs, focusing on developer experience. Welcome to Day 3; hasn't it been a fun week so far? The session is called Building on Top of dbt: Managing External Dependencies.
And we're going to hear from Teghan Nightengale at Community.com. Without stealing all of his thunder, Teghan is going to show us the results of an internal hack week project that helps make it easier for people who don't directly use dbt to benefit from its automated tests and well-governed data. All the chat conversation today is taking place in the Coalesce dbt external dependencies channel of dbt Slack.
The channel name should be showing on your screen at the moment. If you are not part of the chat, you have time to join right now: visit getdbt.com/community and search for Coalesce dbt external dependencies when you enter the space. We encourage you to ask other attendees questions, to make comments, and to react in the channel at any point throughout the [00:01:00] presentation. This is the highest density of people who will appreciate your bad data jokes that you're going to find anywhere, so make the most of it. After the session, Teghan will be available in the Slack channel to answer your questions, but again, get them flowing in throughout the session. If you throw a red question mark at the start of your message, it'll be easier to make sure that it gets an answer. And we'll probably do some live questions at the end of the session if we've got time. So let's get started. Over to you, Teghan.
[00:01:24] Teghan Nightengale: Thank you so much, Joel. Really appreciate that introduction. I'm happy to be here, happy to be chatting to everyone today a little bit about what we've been doing at our company, Community, and how it's been going. So the title of today's talk is Building on Top of dbt: Managing External Dependencies. And let's just dive right in here. Or actually, before we get further, I just want to give a quick background on myself.
[00:01:54] Defining the dependency problem #
[00:01:54] Teghan Nightengale: Joel and I are a bit of a yin and yang in some ways, because he works with the community team at dbt Labs, [00:02:00] and I work on the dbt team at Community.com. Community.com is a one-to-many texting platform for influencers and celebrities that allows them to expose a public phone number, get inbound and outbound text messages, and get clustering on their inbound messages, so they can hold a lot of simultaneous conversations. Myself, I'm a data engineer by trade, and we've done a lot of great work with dbt and we love it at Community. So we're excited to share what we've been thinking about in this problem space and give you some more context.
I want to start by outlining the problem that we're seeking to solve, which we call the dependency problem. It can basically be outlined like this: companies are increasingly using dbt to define their data transformation layer, and it's becoming the source of truth for [00:03:00] a lot of their business and their analytics.
And it's typically exposed in BI tools outside of the data warehouse. But then there's this desire, this need, this emerging space, at least at first, to export some of those transformations and some of those metrics into external platforms like Salesforce or other SaaS platforms.
And that typically introduces a bit of risk, because when you define models in dbt, they're lovely transformations. You have a final, beautiful layer of fact tables, or maybe very wide tables, and you're excited about all the crunching of the numbers that you've done to get it into that state.
And you know that it's a source of truth for your business. Now you want to build something else on top of it using the information that is stored in it. But by default, dbt doesn't really [00:04:00] understand that you're building anything on top of it. So say you have a use case, and this is what I'm actually going to show later, where you're building an internal Slack client.
You want to send very specific, configured alerts to certain people in your company when a dbt test fails. If you're building that and you're maybe just going about it in a naive way, just taking a first pass at it, you might think: oh, I guess I'll have to define in the service what kinds of tests I want to get alerts on when they fail. But this is exactly what we're talking about when we say the dependency problem. Say one day an analyst goes in and removes a test that they think is no longer necessary or was failing too often. All of a sudden, if your downstream tool or downstream client was hard-coded to be configured off of that particular test, it's going to break, your engineering team's going to get alerts, and they're going to be like, why is this service down?
And they're going to have to go talk to the analyst, [00:05:00] and the analyst had no idea. They're blameless; they're just trying to optimize their day-to-day work. This is a very specific example, but the problem really takes all forms, whether it's tests or models. The things that you're defining in your dbt transformation layer, you want to expose those things in other downstream clients and tools. But dbt, again, by default is not aware that there are these downstream dependent tools.
So if we think about how this problem is being solved today, the answer is that it's actually being solved very rapidly for specific use cases by software-as-a-service platforms and other wonderful tools like Lightdash or Hightouch, just to mention a few, that are finding ways to integrate more tightly with dbt in order to define the behavior of their product.
They are basically situating themselves downstream of dbt to [00:06:00] varying degrees and using different features of dbt. But fundamentally, they recognize that it's a very powerful configuration language for the data transformation layer, and they are finding ways to tap into it in order to ensure that their product is behaving as expected when the transformation layer changes, like when you alter a model or alter a test, anything like that.
So this is very much an emerging pattern that is being seen right now for paid products. But the wonderful thing is you can also bring these principles into your own work when you are building things internally or open source. An example is the power of this particular config that exists in the dbt YAML called the meta config. The meta config basically allows anyone to embed an arbitrary JSON/YAML schema on a variety of objects in dbt, [00:07:00] including a test, including a model, including a column. So it's very powerful in a lot of ways. What you're seeing here is a little example from our Salesforce staging layer in our repo at Community, where we have this unique test. This is a problem that was arising because sometimes there are data generation processes, like something in Salesforce or something in a sheet, that are feeding into your transformation layer. And when a test fails, it's not enough to just know that it failed. You need someone to take specific action, and you don't want your analysts to have to manually message people all day: hey, can you take a look at this sheet, see what was entered, please fix it. So we decided we needed a better solution.
So this is an example of us configuring a Slack client that will send out specific notifications to specific people, with custom messaging, based on this embedded portion here that you see under the [00:08:00] meta config.
We're calling this a configuration schema, and you can see that for us it's just this custom keyword and then two keys with their values: a list of recipients, and a message. So I want to reiterate everything that I just mentioned about the meta config and make sure you really are dialed into how powerful it is.
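The slide itself isn't reproduced in this transcript, but the kind of YAML being described here, a custom keyword holding a list of recipients and a message, might look roughly like the sketch below. The `slack_alert` keyword, model name, and column name are illustrative placeholders rather than Community's actual schema, and exact config placement can vary by dbt version.

```yaml
# Illustrative sketch only: a configuration schema embedded under the meta
# config of a unique test. Keyword and key names are placeholders.
models:
  - name: stg_salesforce__accounts
    columns:
      - name: account_id
        tests:
          - unique:
              config:
                meta:
                  slack_alert:
                    recipients:
                      - "@data-team"
                      - "@salesforce-admin"
                    message: "Duplicate account IDs found - please check the source sheet."
```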
[00:08:26] Use the meta config #
[00:08:26] Teghan Nightengale: So it is an arbitrary config that can hold essentially anything you want. We are calling the things that we're putting into it configuration schemas, and a configuration schema is basically a codified contract of a specific structure that a downstream tool can ingest and take action on. And if you're not already aware, the amazing thing about embedding configuration schemas in the meta config is that these structures get [00:09:00] compiled into the JSON artifacts that dbt provides, at least the manifest JSON artifact. If you're already familiar with that, you might be getting really excited, but you might also have some questions.
Okay, I put a schema under this config, I compile my dbt project, and now I have this JSON. How is that helping me? Where does a downstream tool access that configuration? How does it use it? And that's where this hack week project that we had comes into play.
[00:09:37] Accessing a configuration schema #
[00:09:37] Teghan Nightengale: So we built a nice little searchable JSON store that we're calling Metahub, and we're excited to be open sourcing it in the near future. Essentially it's a way to parse through the JSON artifacts that dbt is creating for you and pull out specific things, like those configuration schemas, that can configure any kind of [00:10:00] client or tool built in anything you want, whether it's Python or just Bash; whatever you're doing with that configuration schema, you can build a client off of it to do what you need to do. And then of course, the other amazing option for people who are already on dbt Cloud is that dbt Labs offers the metadata API. That is essentially a much easier-to-spin-up solution for a paid customer to search through these artifacts and pull out the particular configuration schema they need to use. So for example, you could build a client that pings the metadata API, pulls out what it needs, and takes action based on that.
[00:10:48] The Metahub project #
[00:10:48] Teghan Nightengale: But I want to share a little bit more about this Metahub project, which really came together during a hack week a couple of weeks ago. It's basically a very light store: it's dockerized, it'll be [00:11:00] easy to spin up with a Postgres backend, and it's nice because it makes it very easy to extract only the particular configuration schemas that you're interested in for a given client. If you've ever looked at the manifest JSON artifact, you'll know that it's massive. If you've opened it in an editor, you might've crashed your editor a few times; it can be quite big. Rather than storing things in AWS or pulling in the entire JSON artifact, this is a nice way to just push it to a service and then have a query language that lets you pull out the very particular configuration schemas that you're interested in. And the language that we're exposing to do that is a JSONPath query interface, which is very standardized and well documented in the Postgres documentation.
So it's very, very easy to get going. This is a little screenshot of just a quick make command, spinning it up locally on Docker, and then checking it. [00:12:00] So I want to give you some more details on this example and revisit our earlier point about the Slack client that we configured.
If you're running dbt Core, this is an example of what our run script essentially looks like, or what your run script could look like. You do a run to refresh your project, and you have a one-liner here where you're pushing the artifact that's been generated as a result of that run, in this case the manifest, which stores the entire state of the configuration.
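A minimal sketch of that end-of-run push, written here in Python: Metahub isn't released yet, so the ingest endpoint and payload shape below are assumptions for illustration.

```python
# Sketch: refresh the project, then push the generated manifest to a locally
# served Metahub-style JSON store. The ingest endpoint is an assumption.
import json
import subprocess

import requests

subprocess.run(["dbt", "run"], check=True)  # regenerates target/manifest.json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Hypothetical ingest route on a locally running Metahub container.
requests.post("http://localhost:8000/artifacts/manifest", json=manifest).raise_for_status()
```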
And then, this is a locally served option, but you can of course serve and expose Metahub anywhere that you would serve a dockerized app. So this is just a simple example: it's very easy at the end of your run to just load your manifest, your artifact. And by the way, however you want to structure this data store and fetch things out of it, it's very easy to do. This is simply pushing that manifest into [00:13:00] that API so that it can be searched over later and extracted by clients. And then there's the search path syntax that makes it really easy to fetch out exactly what you want, because like I said, the manifest can be fairly large.
So this is an example of what we saw earlier: that configuration in the YAML was put into the manifest artifact and pushed to Metahub, the API data store. And then this is a simple REST API, so it's just a GET where you're giving essentially a path. Oops, sorry. You're giving a path; basically, this is the path language.
And so it's very easy to search through all the intricacies of the artifact and extract, say, this particular thing. Because there's only one artifact, it's a very simple path: we're going into the nodes, all of them, into their config, and then we're finding the [00:14:00] meta tag and pulling out what's under it. In this case it's all the Slack alerts for this particular example, but you could also say, find me the tags where this configuration schema is, like the Slack test alert.
So as you can see, it's very easy to extract exactly what you want to use for your downstream clients. And then of course this makes it very easy to configure their behavior.
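As a rough illustration of that extraction step, a client-side GET with a JSONPath expression could look something like this. The endpoint, query parameter, and `slack_alert` key are assumed for the sake of the example; the real Metahub interface may differ.

```python
# Sketch: pull only the slack_alert configuration schemas back out of the
# stored manifest via a JSONPath-style query. All names here are assumptions.
import requests

response = requests.get(
    "http://localhost:8000/artifacts/manifest/latest/search",
    params={"path": "$.nodes.*.config.meta.slack_alert"},
)
response.raise_for_status()

for schema in response.json():
    print(schema["recipients"], schema["message"])
```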
[00:14:21] Teghan Nightengale: So on the front end, what this looks like for us is a small Slack client Python script that we run, and it is now able to be configured off of this schema embedded in the model YAML and the test YAML. And this is really powerful, because usually if you have an engineer build something like this, then, one, you have the dependency problems.
So if something changes underneath them and the service is looking for a model or a test that doesn't exist anymore, all of a sudden it's broken. [00:15:00] But if you build your services and your clients to be configured off of these configuration schemas that you embed in dbt, it becomes really flexible, and it empowers your analysts to change the behavior of those clients based simply on what they are doing.
So for example, an analyst could embed this Slack test alert configuration schema on a new test and say, I want this message to be sent with it, and I want this person to receive it. And they can do that in a way that they're very familiar with, with just a few lines of YAML. This really takes empowering your analysts to the next level, because they now understand that they can control what's happening downstream of dbt, and they're completely aware of it too. They're like, oh, I see, this test is being used by the Slack alert; I know that if I delete this YAML file, [00:16:00] then that alert is not going to be sent anymore. So they're very in tune with all the things in your stack that are starting to rely on the transformations that you're doing in dbt.
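For context, a bare-bones version of such a Slack client could be as simple as the sketch below, matching failed tests from `run_results.json` against the embedded configuration schemas in the manifest. The `slack_alert` key and the webhook URL are illustrative assumptions, not Community's actual implementation.

```python
# Sketch: after `dbt test`, alert the recipients configured on each failed
# test via its embedded slack_alert schema. Key names and webhook are assumed.
import json

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

with open("target/run_results.json") as f:
    run_results = json.load(f)
with open("target/manifest.json") as f:
    manifest = json.load(f)

for result in run_results["results"]:
    if result["status"] != "fail":
        continue
    node = manifest["nodes"].get(result["unique_id"], {})
    alert = node.get("config", {}).get("meta", {}).get("slack_alert")
    if alert is None:
        continue  # the analyst hasn't opted this test into alerting
    text = "{} dbt test `{}` failed: {}".format(
        " ".join(alert.get("recipients", [])), node.get("name"), alert.get("message", "")
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}).raise_for_status()
```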
So we're very happy with this. We're very excited to empower our analysts in this way and give them the optionality to embed this kind of stuff. And if we think about what's next in this area, I think what I just showed with this configuration schema is great for a simple Slack client with maybe two fields, like a recipient and a message. But very quickly you can see how it's going to be a huge headache if you have downstream tools and clients that are expecting a certain data structure from the JSON they're fetching, and they try to parse it and it doesn't match the contract of what it should be; that's going to break the client as well.
So I think an emerging area, and something that [00:17:00] I'm excited to look into more, is how we can validate these custom structures and schemas that we might want to embed into our dbt metadata and into our artifacts. There are already some amazing standards and tools for doing this; it's just a matter of building a little bit of lightweight CLI tooling on top of them in order to apply them specifically to the dbt case.
So I think it'll be very important, if you want to continue to rely on this pattern, to de-risk your dependencies by asserting that the configuration schemas you're embedding match the exact contract that the downstream clients are expecting.
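One plausible way to do that, sketched below, is to express the expected contract as a JSON Schema and validate every embedded configuration schema against it, for example as a CI check before the artifacts are pushed anywhere. The contract and the `slack_alert` key are made up for illustration.

```python
# Sketch: fail fast if any embedded slack_alert schema violates the contract
# that the downstream Slack client expects. Contract shape is illustrative.
import json

from jsonschema import ValidationError, validate

SLACK_ALERT_CONTRACT = {
    "type": "object",
    "properties": {
        "recipients": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "message": {"type": "string"},
    },
    "required": ["recipients", "message"],
    "additionalProperties": False,
}

with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, node in manifest["nodes"].items():
    schema = node.get("config", {}).get("meta", {}).get("slack_alert")
    if schema is None:
        continue
    try:
        validate(instance=schema, schema=SLACK_ALERT_CONTRACT)
    except ValidationError as err:
        raise SystemExit(f"{unique_id}: slack_alert violates the contract: {err.message}")
```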
[00:17:47] Let's recap #
[00:17:47] Teghan Nightengale: Okay. So if we want to summarize what we've talked about today: at a high level, more and more tools are being built on top of [00:18:00] dbt, and that could be internal as well as external. And when they are being built, they need to understand and be connected with changes in the underlying configuration of dbt.
That's really what we started with, the dependency problem. And a lot of different tools that exist right now are doing this in different ways. One way that's very easy and straightforward, from what we've seen, is the meta config, and that is just so flexible. You can really put any kind of structure that you want into it, including, I think, putting a macro as the structure, right? You don't even have to put all of that in your YAML; you can just define a macro with the arguments and embed that in your YAML as well. So there's a lot of ways that it can be very clean and very powerful for alerting and configuring the downstream clients of changes.
And then obviously, when you create a [00:19:00] configuration schema and you are using it to configure a client, you need some way for that client to fetch it and to actually take action on it. So the dbt Labs metadata API is a wonderful tool for this. And then there's the open source solution of Metahub as well that I mentioned, which we're very excited to get towards releasing for community use.
We've really enjoyed it, so yeah, very excited. And oh, we're glimpsing into the future, so I just gotta put on my little future-glimpsing glasses here. They help me see a little bit further, and maybe we'll just, yeah, I dunno, get the vibe going, bring it up a notch.
This really ties into a lot of what we've been talking about at this conference so far, where we know that this is a big wave, right? We heard Tristan and Martin's talk. And I think what they're touching on is that it's big because dbt has given [00:20:00] us, the community, the foundations to build the next generation of tools, because they are essentially providing a configuration language for your data transformations. And what we've talked about today, solving the dependency problem, is really just a way to tap into that configuration language and to de-risk the tools that you want to start building as we continue expanding off of the wonderful work that they've done. It's pretty awesome. I cannot say enough how grateful I am for all the wonderful work that everybody at dbt Labs puts in every day, and for the fact that dbt Core is such an amazing open source product and exposes all these artifacts to make this easy. They are really empowering the community to take the next steps and keep building more and more on top of this. And so I hope today's talk has been [00:21:00] useful for you to get a little insight into the future, to understand that this is a great option for solving the dependency problem, and to understand how you can keep building and keep growing the wave.
So thank you so much. I've really enjoyed chatting with you today. I think we're going to do a few questions and then hang out in the channel.
[00:21:24] Joel Labes: Yeah, thank you so much, Teghan. That was phenomenal. The shades are getting a great reaction in the chat. But alongside those, we do actually have some really good questions.
So we'll see how many of them we can get through, and please forgive me if I mangle any names. But first off: why not use the jq library, or PowerShell's ConvertFrom-Json, as opposed to sending the manifest to the cloud and querying it from a database?
[00:21:51] Teghan Nightengale: Yeah, great question. You're definitely free to do that. I would say, if it's a matter of wanting to [00:22:00] use jq over AWS, it depends: if you're already putting artifacts in a cloud storage bucket and you're comfortable using jq and that syntax, then that's great. We found that the syntax provided by the Postgres backend, the, what is it, SQL/JSONPath, was really dynamic and powerful and easy and nice for searching not just within an artifact, but across artifacts over time. So you could say, fetch me all the failed tests in the past week; really cool stuff like that. So it's just a nice little lightweight layer for us, rather than saying, okay, we have to push this to AWS and then try to search through multiple files in AWS for particular things.
It was just like a nice convenience. And I guess if we open source it, you can decide if you want to use it.
[00:22:57] Joel Labes: That's the magic really, isn't it. So Simon [00:23:00] Patatski asks how this differs from exposures, and then Blake Birch leans into that a bit further and asks: are you able to build similar downstream functionality, like doing a POST request to a URL, through exposures, or does it require the setup you showed with the meta field?
[00:23:17] Teghan Nightengale: I guess I'll take that in two parts. So first, with exposures: I didn't want to touch on exposures earlier, just because I didn't want to get people derailed or less excited about this direction, or confuse anyone who's new and hasn't heard of them. Exposures certainly are a great tool for alerting, or really for conveying to analysts what relies on specific things. But the reason we weren't so excited about using them, or that they didn't work for the example I gave today, is that they really just exist at the model and source level. So you're declaring a [00:24:00] dependency on this model, rather than saying, I want to embed on this test or this column or this model. So in that way they didn't give us all the optionality. I do believe they have a meta field now, and you can put meta stuff in them, but again, we were like, why do we need this additional thing that feels like work to declare and doesn't really change when anything else changes? An exposure just feels like a log to keep analysts apprised of what's going on with particular dependencies, and I don't think you necessarily need that extra step. I think you can just embed your dependencies directly on the transformations that you're doing and avoid having that extra, additional downstream layer.
That was just my impression; I'm always curious. It's definitely in the same direction. And the point of this is that it's not like this [00:25:00] is the only solution to the problem. The broad problem is the dependency problem of configuring downstream tools, and there are lots of different ways to decide to do that. You could do it through an exposure, or you could do it in a meta config; I know some of the SaaS products that I mentioned do it in different ways. They might have their own configuration files that they want you to add your model names to.
So it really varies. This is the way that we were excited about solving the problem, and that we think really touches on a lot of...
[00:25:34] Joel Labes: Yeah, helpful. I think the column- and test-level stuff is really important, because you wouldn't get that with an exposure.