Table of Contents

dbt 101: Stories from real-life data practitioners + a live look at dbt

Alexis was a data engineer that loved the teaching aspects of her previous roles, so she became an instructor. Now she is building training experiences at dbt Labs.

Jon (Natty) Natkins likes data of all shapes and sizes. He is a product-oriented and customer/community-focused bridge-builder. He’s a lifelong learner, and an empathetic engineer with a passion for helping customers and users make sense of and derive value from complex technology. He helped get the StreamSets Data Collector product off the ground and drove open-source adoption to hundreds of thousands of deployments and commercial adoption to north of $10M in ARR.

Coalesce is a conference for the dbt community! Most of our sessions assume you already have an understanding of dbt.

If dbt is new to you, join us in this session to get up to speed. We’ll go through the basics and give you all the information you need to get the most value out of this conference.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Travis: Hello again, and thank you for joining us at Coalesce. My name is Travis Marceau and I am the sales director covering big companies in the Midwest and mountain regions for dbt Labs. If you’ve ever thought about talking to sales and hesitated, I promise we have no interest in becoming a thorn in your side.

Most of my job is helping y’all figure out how to get your company to let you use this thing. Enough about me though. The title of this session is dbt 101: Stories from Real-Life Data Practitioners and a Live Look at dbt. Our first presenter, Alexis is one of the individuals responsible for what has been described to me by some of my cereal-loving friends as the best training and onboarding we’ve ever experienced hands down. And our other presenter, my good friend and regular presentation partner Natty will be working the keys and mouse as he leads us on a journey into the cloud. Natty [00:01:00] heads up our solutions architects group and the west, and is likely the person you actually wanted to talk to.

If you’ve ever ended up on the phone with me, all chat conversation is taking place in the Coalesce dbt 101 channel of the dbt Slack the link should be down here to join. If you haven’t already, and we encourage you to ask other attendees questions, make comments or react at any point in the channel and just generally keep my other friend Randy company in there.

I have no doubt. He’ll make it fun. After the session, the speakers will be available in the Slack channel to answer those questions. So feel free to say to your speaker questions or the end of the session. All right. Enough preamble. Let’s get started over to you, Alexis.

[00:01:49] Alexis Wong Baird: Thanks Travis. So thank you for joining us today.

We’re really excited for this session. I am Alexis. I’m an instructor at dbt Labs. I used to be a data [00:02:00] engineer or what I found out later was really an analytics engineer and I really loved the training and teaching aspect of my job. So I became an instructor and now I’m at dbt.

[00:02:10] Natty (Jon Natkins): And my name’s Jon Natkins, but I go by Natty. I’m a solutions architect here leading the enterprise west region.

But I actually got my start with dbt as a product manager, a couple of roles ago where I was using dbt to help me understand mobile SDK adoption. So Alexis I’m going to kick it back to you because I’d love to hear some stories.

[00:02:30] Same metric, different answer #

[00:02:30] Alexis Wong Baird: Great. So in this presentation, I’m going to take on the role of a data engineer or analytics engineer that doesn’t know about dbt which was true before I came here, actually, I didn’t know what dbt was.

And I’m going to tell some stories from my life as a data engineer, analytics engineer, some pain points. We had a lot of these stories I’ve heard from other data teams. So hopefully you can relate. So my first story is about when you go to that meeting and you’re trying to figure out [00:03:00] what’s going on with churn in our business, but different stakeholders have different numbers for that same metric. And you spend the entire meeting arguing about what the difference in the metrics is instead of figuring out how we can reduce churn. If you were like the company I was at a lot of these disagreements stem from the way our teams are structured. So our data are embedded in business and product teams, which is great.

So they could have context on what that team needed, but it also led to a lot of data silos. Each day, all has had their own pipelines and transformations to power, their own set of dashboards, which led to a lot of a disagreement on the calculation report metrics.

Each team was familiar with their own data analysts and their own set of metrics and dashboards, so they had a hard time trusting other teams’ metrics.

As I mentioned, there was a lot of meetings and emails and slack messages trying to get to the bottom of the differences. This became really apparent when we undertook cross cutting products, projects, like the migration of our email platform to a new tool.

Our email marketing platform, [00:04:00] I should say,while we were exploring the data going into and out of our email marketing tool, we realized there was a lot of stale data going around as well as conflicting sources of data. For example, we had four different ways of calculating which country a user came from, and a lot of times it conflicted with each other.

This project also involved collaboration between marketing and sales. And we figured out that each team had their own definition of what a qualified lead meant. Overall, it was a mess to try to figure out where the data was coming from and how the metrics were calculated.

We did our best to find all of our logic across Python scripts and various repos, Looker persistent derived tables, spreadsheets, and even Jupiter notebooks.

We were able to successfully make the migration to the new tool, but it was a painful process and exposed a lot of our issues with not being able to be aligned on data definitions and not being able to get a handle on where all our data was flowing through. Natty, does that sound like something you’ve heard from a lot of your customers?

challenges to overcome

[00:04:58] Natty (Jon Natkins): Yeah, and I really appreciate you [00:05:00] sharing that story, Alexis. There’s definitely a lot of topics that you raise that I hear in conversations all the time. And to just recap a little bit of what I think I heard. It sounds like part of the challenge that you were experiencing was that different parts of the organization were effectively working in silos.

You described each analyst working in isolation which really produces a lot of risk of rework talking about having four different ways of calculating a user’s country. It gets really confusing. And it introduces errors into the decision-making process and because users and people at your organization weren’t really collaborating on that actual work, they ended up actually redefining terms like qualified leads in their own contexts.

And there wasn’t really a central place where they could get aligned on what a qualified lead really was. So what I’m hearing is, there was a disconnection between the data sources powering these transformations and using different logic for things that really should have been [00:06:00] semantically the same.

And going a bit further than that, even it sounds like no one really knew who was responsible for specific metrics or how they come to be. And that really led to a lot of confusion around how to actually use them make decisions off of them. And I can imagine this was a really hard situation for you to navigate.

[00:06:17] dbt in Action! #

[00:06:17] Natty (Jon Natkins): I’m guessing it was a time suck as well as just being demoralizing for everybody involved. And so what I’d like to do is take our first look at dbt Cloud so that we can talk about how Alexis’s life might’ve been different with dbt. So this is the dbt Cloud IDE, which is a development environment that’s used for developing and collaborating on analytics projects.

On the left-hand side, you’re going to be seeing a directory tree where there’s various files that are in my project. And in the main area, you can see the file that I’m currently working on. This particular file is a SQL file. So hopefully it doesn’t look too confusing. And this SQL file contains a dbt model, which is the term that we use for a transformation in dbt.

[00:07:00] Every transformation is a SQL select statement, and we believe that that makes the code really accessible across the organization. You don’t have to know Python. You don’t have to know other programming languages to use dbt. And in this transformation, we’re actually defining how monthly recurring revenue is calculated.

You can see that going on predominantly in this case. And some of the nice things that dbt provides out of the box are the ability to preview the query results so that you can actually see what’s going on. You’ll be able to test and validate your assumptions. You can also see information like the data lineage, where this data’s coming from, how it’s being used throughout the different models in your project.

And underneath the covers, when you build a dbt model, dbt actually uses the SQL state. But the SQL select statement to materialize the model as of your a table. So I’ve, pre-run this just so you can see it, but you can see that the query that actually gets executed is creator a place to view. There’s no magic here.

We’re just separating out the business logic from the physical data representation and letting you [00:08:00] focus on a few lessons. And what are the things that I find really cool about dbt Cloud is that it doesn’t use any proprietary or non-human readable file formats. So everything can just be version controlled in a central place to make it easier for different people in your organization to work on the same project together.

I’ll talk a bit more about that in a little bit. But for folks who aren’t actually building the transformations, dbt also generates documentation that teams can use to to understand what’s in the project, what’s in their warehouse. So for example, I can very easily search for that MRR model. I can see the different types of columns there.

I can see what these things mean and all of this is kept up to date by dbt for you. So Alexis, what do you think, how does this look so far?

[00:08:46] Alexis Wong Baird: Well, it seems really simple, but even just with that file tree, it would be such a nice place to be centralized and say, oh, all of our transformations lived. I’m going to go look for the view named MRR.

[00:09:00] And I can tell Mr. SQL that’s where the logic lives. Just that, as I said, we were crawling through lots of different repos and file types spread out, and it was quite a mess. Also, could you go a little bit more into this lineage graph? It looks amazing. I’ve written these things out on whiteboards before. But does it, how does it stay up to date? How has it generated?

[00:09:22] Natty (Jon Natkins): Yeah, so dbt is actually looking at the code and it’s building this lineage graph based on the code that you’ve written. So for example, if I were to just remove some of this, this’ll actually stay up to date and refresh automatically.

So it’s really easy to get a sense for what’s going on. This lineage is also available in the documentation as well. So you can view it whether you’re actually a developer or you’re just someone trying to understand what data you’ve got access to.

[00:09:50] Alexis Wong Baird: Wow. So it stays, it crawls the code and keeps it up to date. That’s incredible. I mean, the size of the white boarding, we used to just at one point we [00:10:00] printed out all of our data flows on like a wall sized piece of paper. But as soon as you print it out, it’s out of date because stuff’s changing so fast. So that’s incredible. What about the doc site? How manual is that? It looks really helpful for our stakeholders.

[00:10:14] Natty (Jon Natkins): Yeah. So again, all of this is automatically generated by dbt. Information like the descriptions for columns, those are things that you put into your configuration files. So it’s all kept in a central repository. But it’s all pulled in, but things like the data types, the types of tests that are being executed on these different.

Tables and models. All of that’s scrap gathered by dbt automatically. And you don’t actually have to do any work here. And then this whole site is hosted by dbt Cloud for you. You don’t have to host it yourself.

[00:10:47] Alexis Wong Baird: Wow. There’s so much to unpack here. It seems so simple. Thanks, Natty.

[00:10:52] Natty (Jon Natkins): Absolutely. Alexis, any chance you’ve got another story to tell us about your experiences.

[00:10:55] The Dreaded Slack Message from the CEO #

[00:10:55] Alexis Wong Baird: Well, I’ve got another tricky that I’ve already made in the oven for ya. So. [00:11:00]

We used to have this red flashing siren that we would plug in whenever a critical data pipeline was down. And that siren was on too much. Before we had an off the shelf data pipeline tool or extract load tool.

Our data pipelines were all written in custom Python scripts. We didn’t have a sandbox or development environment; we did, but it was really clunky. Didn’t have all the data and it was hard to use. So we ended up doing a lot of our testing in the production warehouse when you make a copy of production table and you a pen test to the end of it.

But sometimes you forgot to get all the places in your code. So you actually end up editing the production data, which was really fun. You could also do testing just by manually scanning our tables and views. So we would scroll through and make sure everything looked good and maybe do a select distinct on a column to make sure all the values look as we expected.

We did have a testing framework, but it involved more SQL queries executed from Python scripts. For example, we tested to make sure that our keys were unique once [00:12:00] a week, but it was really clunky to use and we rarely use it honestly. So when things broke, we usually heard from our stakeholders about it, I would come to work and there’d be a Slack message.

There’s no new leads in Salesforce or the revenue that dashboard has been updated in the last hour. At that time, I was owning the revenue pipeline, which got updated every 20 minutes which a lot of our C-level executives just had pinned in a tab in their browser all the time. And if something was wrong with that dashboard, I would get a Slack message from our CEO or the occasional unannounced desk visit.

Sometimes there was a bug on our end, but just as often it was something like an external API being down since we didn’t have a good way of figuring out what was wrong unless we would turn on the siren and hunker down all, try to figure it out. That, and the never ending requests for data sources and new dashboards, it made my field.

My team feel like we were always in reactive mode and never able to keep up with the [00:13:00] demand. I remember seeing some of my coworkers around the office and kind of waving and saying, hi, I have a bunch of stuff I need to get to you. I wish I could help more. If only I could work harder on the weekends every night, I could get everyone the help and the data they need, but obviously that’s not sustainable. And honestly led to a lot of burnout.

[00:13:21] Natty (Jon Natkins): Well, Alexis, that sounds like it was a really frustrating and probably exhausting situation. And I’m sure that we can all relate to the idea that surprises, especially with critical data pipelines are never fun. And unfortunately what you’re describing is not that uncommon.

challenges to overcome

[00:13:38] Natty (Jon Natkins): I hear a lot of folks who say that their troubleshooting happens really reactively. They don’t really have an opportunity to test your data. And I mean, you’re probably ahead of the curve, even having something that you could have used, even if it didn’t get used very often. So, what sounded even worse to me and really stressful is, is being in those high pressure situations where there’s a [00:14:00] pipeline down, you’re trying to fix it, but you end up having to contend with angry stakeholders who want status updates.

It slows down your progress on actually making the fix. And it’s really just like this kind of cascading domino effect of badness. So yeah. I think it’s great that you had some testing outside of production, but again I hear that a lot of data teams don’t really have the bandwidth to do that.

And the other clear downside that I heard was that if this is happening frequently enough, that you need a spinning red light at your desks I’m guessing you’re quickly eroding the trust of your stakeholders. Those downstream data consumers, and that’s going to exacerbate the problem.

It’s going to lead to the data team feeling unappreciated. And you said it yourself, you’re burning out. So I want to go back into dbt to talk about where we might be able to improve on some of these situations. And what I think is really powerful about dbt is that it incorporates both a development environment for building data pipelines, as well [00:15:00] as providing an opinionated stance on the right process to use for building them.

So, one of the things that you may not have noticed as I was going through the first is that I’m not working in a production environment. I’m actually working in this dbt J napkins schema, which is my own personal development environment. And so even when I run the models, I create tables in the underlying warehouse, that work is happening in my environment and it’s not impacting anybody else’s work.

And so you can actually have multiple people working on the same environment or the same data model without actually impacting each other. And that enables me to move fast and to keep going on, on the work that I need to get done. I also mentioned that we have the ability to version control this code to get it out to the rest of the organization.

And that’s the central version control system that we use is called Git, if you’re familiar with that. And that’s the mechanism that we use to collaborate with others and get my work into production. So here I can actually commit to make a [00:16:00] change to my MRR model. And that’s actually going to send this code change into a Git repository, which enables me to go through a pull request and code review process and really share my work with the rest of the organization.

And so where things really start to tie back to the story that Alexis was telling is that when you’re getting your work ready for production, a big part of dbt is its built-in testing and data quality framework, which allows you to validate your work, feel confident that it’s doing what you think it does.

And the beauty of dbt is that tests just like models, are just SQL statements. So this particular SQL file it’s called test NRR sign. This is actually testing for invariants in my data. So in this particular test, I’m checking whether or not the numeric sign of a monthly recurring revenue change is correct.

If a change is due to something like in contraction or a turn, I’d expect that the sign should be negative. If it’s [00:17:00] net new business or expansion, it should be positive. Now the way tests work in dbt is that you’re really defining a query that selects out data that you don’t want to see, or that shouldn’t exist.

And so if a test returns any records, then it fails. This particular test illustrates a business rule that I might want to encode. And you can see that if it’s a new or upgrade a MRR change, then I’m saying if it’s less than zero, this is data that I don’t want to see. And that’s what that test is going to actually pull out.

So many tests that people want to put in place are actually more generic, like validating that a column has no nulls or it’s unique and more generic tests don’t actually even require writing SQL. They’re actually built into the platform. So for example, I can see here in this configuration, That I’ve specified for an ID column that I want to run a uniqueness or not an old test.

And I can run these in my development environment, just using the dbt test command. But as these are [00:18:00] executing, we’re actually getting to see all the green checks that show up. And if I actually dig into one of the ones that it’s actually already run, you can see them starting to run here. If I look at one of these not null tests that was configured in that configuration file, it’s still running a SQL select statement.

But I didn’t have to do it. So it allows me to move faster to take these really common types of tests and apply them to gain confidence that my modifications aren’t going to break down and production. And so, while I’m running these transformations and tests manually in the development environment to illustrate a point, all of these can actually be scheduled as jobs in dbt Cloud. They can be set up to run on a schedule. They can be triggered by another application using a rest API. There’s a lot of flexibility for any situation. And in fact, a very common way that people leverage jobs is to use them for continuous integration by triggering jobs based on pull requests.

When you want to go through code review and continuous integration tests allow organizations to test as [00:19:00] they’re getting ready to push that new code out to the rest of the organization. And it’s a great way to get notified about potential issues and new code before it ever impacts anybody testing in prod has always ended in tears for me.

So I’ve got to make sure that I’ve got processes in place to keep a smile on my stakeholders spaces. And one of the great things about dbt is that those job failures can also generate alerts, which can keep you informed about issues before your CEO is blowing up your Slack. So Alexis based on what I’ve said, do you think this could have made your life a little bit better?

[00:19:32] Alexis Wong Baird: Yeah. I don’t even know where to get started. Being able to add test to something with just a line of texts so easily is amazing. As I mentioned, we had a testing suite, but it was really hard to use and it’s really illustrating to me how tooling can make things better the lower, the barriers to using it.

Even those more custom tests that you were writing for business roles, I can imagine every time I edited the revenue pipeline, I knew what to look for. [00:20:00] And like I even had in my mind, a bunch of queries to run, to QA the data. But if I could make those queries that were running all the time, then we could make sure it didn’t break.

And other people who aren’t me could edit and we could collaborate and we would have so, so many fewer errors. That’s amazing.

The development schemas too, is a really interesting way of handling not having an easy sandbox environment, but I can imagine we’re all using the same raw data we have access to all the day, then not partial amounts of data like we did in our sandbox which didn’t really help us test the full pipeline. And I could see collaborating with somebody. I could check out their work and their schema before we pushed to production. It’s just, it’s a totally different way of thinking of development. So yeah, that’s really exciting.

[00:20:50] Natty (Jon Natkins): Yeah, it’s a bit different, but it’s easy to get started. And organizations are picking it up all the time and I think they’re they’re having a great time with it.

[00:20:58] dbt Ethos: Build Data Products Like Developers Build Apps #

[00:20:58] Natty (Jon Natkins): And I guess just [00:21:00] to bring this around a bit I’m really glad you found dbt so exciting and I tend to agree at the core of dbt is really the ethos that organizations should build data products and data apps. Just the same way that developers build their own applications. And we aim to achieve this through two main avenues.

The first is that we unite on sequel. We think this is really important because SQL is often the common thread between tooling and people, every warehouse under the sun speaks equal. And it’s hard to imagine any data team out there that doesn’t use SQL in some way, shape or form. And additionally, it’s an easy language to learn for folks that are new to data, but it’s also understood by data veterans.

And so it really enables organizations to involve more people in that data conversation. And the second piece of this is that we work like engineers. And what I mean by that is that a lot of the best practices that engineering teams view as kind of standard non-negotiable things these days like testing your code, providing documentation, or [00:22:00] using agile development methodology is like CI/CD those should all be brought to data product development.

And there’s really a reason that engineers follow these practices and it’s that it produces better outcomes, whether it be reduced time to market, less time dedicated to break, fix support, or just reducing rework. So Alexis, I’d love to hear from you a little bit of how about how dbt actually functions.

[00:22:25] dbt: Develop, Test, and Deploy Data Products in Your Warehouse #

[00:22:25] Alexis Wong Baird: Yeah, so to break it all down and I’ll give you an overview. We’ve got the raw data over here, then the dots, the colored dots on the left. And we are assuming that your raw data is already loaded into your data warehouse using a data pipeline tool or your extract tool. And dbt here in the middle is your platform.

That’s going to allow you to develop tests and document and deploy your transformations. So it’s important to know the dbt doesn’t actually perform any of the computation itself, but it sits on top of your warehouse and it pushes the SQL queries down to your warehouse. So it doesn’t do any of the compute.

It relies on the [00:23:00] warehouse for that. Then when you have your data sets that are all clean, ready for analysis, you can hook them up to your BI tools or to power your ML models or even for operational analytics. So hopefully this gives you an idea of how dbt fits into your modern data stack. You can also think of it as the T and the ELT (extract load transform) .

[00:23:19] Natty (Jon Natkins): Awesome. Well, thank you everybody for hanging out for a little while with me and Alexis today. I hope this has been really informative and given you a quick tour of dbt and dbt Cloud, and just to kind of round it up one more time. dbt is all about creating a really collaborative environment for analytics development and ensuring that you can test your data so that it’s reliable, documented so that other people can leverage it.

And just get it out into the hands of the rest of your organization so that you can make data informed decisions throughout the company. So to get started today, you can sign up yourself at cloud.getdbt.com/signup. And we’ve got a few links that Randy is going to be putting into [00:24:00] the Slack channel.

So we’re hoping that you’ll get started today. And Alexis, I don’t know if you want a quick plug on dbt Fundamentals, but I’ll let you do that.

[00:24:09] Alexis Wong Baird: Yeah, thanks, Natty. So we’re excited to announce that dbt Fundamentals, some of you may know it as our free online on demand course, we are announcing a new version of it.

So we’ve refilmed a bunch of the videos, added some, tightened it up so that it goes through deployment. And we haven’t announced this. The official announcement is tomorrow, but I wanted you to be the first people to know that is available and a quick teaser, if you finish, there’s a quiz now and a badge that you can get, that you can post all your socials to share that you now know your dbt Fundamentals. It’s possible that you might get a surprise piece of dbt swag. I can’t promise anything but quick teaser there. So check out dbt Fundamentals. We also have some other courses on our courses that get to bt.com. If you want to go further in the intermediate level. [00:25:00]

Natty, do you want to talk about the playbook?

[00:25:04] Natty (Jon Natkins): Yeah. So if you want to give a shot with the code that I was demoing today, you can go download the MRR playbook it’s stored in a Git repository. And if you want to understand a bit more about what we’ve been talking about and really get an understanding of the dbt philosophy I recommend you read the dbt Viewpoint.

It’s a great primer on what, why and how dbt operates and is really helpful if you’re just getting started. So, Travis, I’m gonna pass it back to you. But thanks again for taking the time to listen to us today.

[00:25:36] Travis: We’ve got a couple more minutes. Natty, do you want to stay on for a couple of questions or should I just go ahead and close it up.

[00:25:42] Natty (Jon Natkins): If there are quick questions, I’m certainly happy to help answer some of them.

[00:25:46] Travis: I’ve got a fun one. Mika asked, "As dbt newbie here for the CI test, does dbt support, like a pre-production environment? For example, I think our current setup is that once PR get pushed to master our production jobs will immediately pick it [00:26:00] up for the next run. You describe how you’ve seen other customers do that?"

[00:26:03] Natty (Jon Natkins): Yeah, that’s a great question. And yes, dbt supports pre-production environments. So dbt Cloud actually has a concept called environments. You can actually see it in the dbt Cloud product. And so what you can do is you can assign jobs to specific environments.

So you can have a CI job on a QA environment or on your production environment to make sure that as you’re moving code throughout the promotion, you can actually have it being tested at every stage of the game. So it’s really designed for those and a lot of our customers do use multi-stage environments to do a CI testing. So really great question, very relevant.

Last modified on: Apr 19, 2022

dbt Learn on-demand

A free intro course to transforming data with dbt