Table of Contents
- • No silver bullets: Building the analytics flywheel
- • Identity Crisis: Navigating the Modern Data Organization
- • Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- • Down with 'data science'
- • Refactor your hiring process: a framework
- • Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- • To All The Data Managers We've Loved Before
- • From Diverse "Humans of Data" to Data Dream "Teams"
- • From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- • New Data Role on the Block: Revenue Analytics
- • Data Paradox of the Growth-Stage Startup
- • Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- • Keynote: How big is this wave?
- • Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- • The Future of Analytics is Polyglot
- • The modern data experience
- • Don't hire a data engineer...yet
- • Keynote: The Metrics System
- • This is just the beginning
- • The Future of Data Analytics
- • Coalesce After Party with Catalog & Cocktails
- • The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- • Built It Once & Build It Right: Prototyping for Data Teams
- • Inclusive Design and dbt
- • Analytics Engineering for storytellers
- • When to ask for help: Modern advice for working with consultants in data and analytics
- • Smaller Black Boxes: Towards Modular Data Products
- • Optimizing query run time with materialization schedules
- • How dbt Enables Systems Engineering in Analytics
- • Operationalizing Column-Name Contracts with dbtplyr
- • Building On Top of dbt: Managing External Dependencies
- • Data as Engineering
- • Automating Ambiguity: Managing dynamic source data using dbt macros
- • Building a metadata ecosystem with dbt
- • Modeling event data at scale
- • Introducing the activity schema: data modeling with a single table
- • dbt in a data mesh world
- • Sharing the knowledge - joining dbt and "the Business" using Tāngata
- • Eat the data you have: Tracking core events in a cookieless world
- • Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- • Batch to Streaming in One Easy Step
- • dbt 101: Stories from real-life data practitioners + a live look at dbt
- • The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- • Implementing and scaling dbt Core without engineers
- • dbt Core v1.0 Reveal ✨
- • Data Analytics in a Snowflake world
- • Firebolt Deep Dive - Next generation performance with dbt
- • The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- • dbt, Notebooks and the modern data experience
- • You don’t need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- • Git for the rest of us
- • How to build a mature dbt project from scratch
- • Tailoring dbt's incremental_strategy to Artsy's data needs
- • Observability within dbt
- • The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- • So You Think You Can DAG: Supporting data scientists with dbt packages
- • How to Prepare Data for a Product Analytics Platform
- • dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- • Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- • Upskilling from an Insights Analyst to an Analytics Engineer
- • Building an Open Source Data Stack
- • Trials and Tribulations of Incremental Models
The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
We’ll show our method to automate source transformations. Aula’s growth and a proliferation of nonrelational sources required constant manual maintenance to our dbt project – which was sucking up the data team’s capacity. We’ll showcase functionality of the dbt ecosystem as we explain our solution to this challenge.
Follow along in the slides here.
Browse this talk’s Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
Julia Schottenstein: [00:00:00] Welcome everyone to your next session at Coalesce, "The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation." I’m Julia Schottenstein, and I’m part of the product team at dbt Labs and I’ll be your host for this session. In this talk you are going to hear from senior analytics engineers at Aula Education, Lewis Davies and Erika Pullum.
They’re going to share more about how they were in need of better automation to organize an increasingly complex customer data configuration. They turn to dbt packages to save the day, and in the next 30 minutes, they’ll share more about how dbt-utils and dbt audit helper made their processes far more automated.
Please join the conversation in the dbt Slack channel #coalesce-schema-changes, and Erika and Lewis will answer any questions you may have after their talk. I’m excited to hear some practical examples on how to use packages and macros. So [00:01:00] without further ado, over to you, Lewis and Erika.
Lewis Davies: Thank you very much.
Yeah, so I’m Lewis, I am here with Erika. Like the introduction said, we are senior analytics engineers. I’m currently at our education. Erika also works currently. She’s just switched to Hex though, big move for her. What I want to do first is just explain a little bit about why we wanted to do this talk at all.
The main problem that we had is that we have a series of separate, but similar environments. I’ll go into a bit more detail, obviously shorter, in a little while, but the main thing that we have is that a lot of institutions, each needing their own environments due to a range of regulatory reasons, and that kind of thing.
There were some challenges that we encountered there when we were trying to set up a dbt pipeline, that we felt were interesting to share, but also, even if that doesn’t generalize to everybody watching this talk, an understanding of the dbt, because the buildup is something that we think is really relevant to other people like, wherever you are in terms of what your journey with dbt is.
[00:01:57] About Aula #
Lewis Davies: First of all, I just want to give a bit of context about [00:02:00] what Aula is so that actually makes a bit more sense. Aula is a learning experience platform for higher education. Now, that means that we are a service where university students can go on and they get their content, and they get their lectures, and they get everything delivered to them.
We are focused on higher engagement practices. Now, the key thing for Aula is that we don’t want to just be white labeled "Google Drive" for a university setting. What we want to do is really have interaction between educators and students. In the example here that you see on the right, that’s being presented with a quiz element, there’s a chat on the side.
The key thing is that educators and students are in contact with each other, there were open channels, and they are able to ask questions and engage with each other in that way. That’s really important to us. The next part is just framing the problem. Let’s start with explaining how data works at Aula right now.
The Aula app itself: like I said, we have these [00:03:00] segregated customer environments, so each institution has their own virtual private cloud, and the database that powers the app is MongoDB. Each of those then are brought into our data warehouse. In this case, Snowflake, we then use metabase for internal analytics, but that metabase instance is also fed into the Aula app itself.
We provide not only analytics for everybody else in the company, but we also provide analytics directly to the product. So obviously that means that it’s really important that not only are they correct, but they’re highly available, they’re reliable, and basically that we have high quality data. Now here’s the major hint as to what the problem we’re presenting are, and it’s this part: so there’s the fact that we have these many customer environments. That is a problem in itself. Just bringing those all together and making sure that we can centralize them in order to do analytics, but also the fact that we use MongoDB. Now, [00:04:00] I know that not everybody will have used MongoDB, at least in my experience before Aula, I hadn’t really used it, but I think that it’s important to know the context about the differences between typical SQL.
[00:04:13] MongoDB: A Primer #
Lewis Davies: Now this is a very quick primer,but my assumption is that most people watching this have some familiarity with SQL. So if you are dealing with SQL database, your tables have a predefined structure. For example, if I have a user’s table, I know that it has a user ID, an email address, a username, and all those kinds of things.
I know that those columns exist. Meanwhile, MongoDB, because it’s not a relational model, it’s not SQL. The equivalent structure there is a collection. Now, the structure that is not predefined is determined by the contents. But, what does that mean? So in SQL knowing that those columns exist, if I take that user’s example, no way, for example, that the email column exists. [00:05:00] If it’s not been populated for any reason, and the user hasn’t entered it, I just get a null value. That’s fine. I still know the column is there. With MongoDB, if that has not been populated, it just doesn’t appear in that record. If [there are] no records of how they’re populated, that column doesn’t appear. So if for some reason, that column doesn’t appear in my data set yet, I have no way of knowing that they should be an email column there beyond an engineer telling me, so what this then converts to is that if we are bringing together this data from across multiple institutions, if we were working with SQL, I know that generally they’re all going to have the same structure. They’re all going to have the same columns. So if I go back to the Aula app, if I’m looking again at my users [table](https://docs.getdbt.com/terms/table), I know that they’re all going to have those columns of user ID, email, whoever else.
That’s not necessarily the [00:06:00] case with Mongo. If a column is not there, it will be inconsistent. That is a problem, and when you bring both of those problems together, [you] create some complications. This is my science-y bet now. So what we have are 42 environments and each of those can have up to 40 collections unless we’re keeping it’s up to the not always all present.
Each of those collections, depending on which one you’re looking at can have between three for our smallest, and 39 fields for our largest model. Now, if anybody’s quick with mental math, you will be able to realize that this worst case scenario is, yeah, it’s a big headache. So we had to change something.
[00:06:43] Introducing dbt-utils #
Lewis Davies: And the answer for us was dbt-utils, obviously on top of dbt itself, of course. Now, what we were doing before is we follow the best practices guide, and that guide says that when you bring in your data sources and you define them in YAML, you then have [00:07:00] them each sourced, directly linked to a staging model.
So we set everything up like that. This is an example. Again, I’ll continue with the users collection example. So this would be how we would get every single user’s table to combine into just one single user’s table. Now, if you are having to deal with up to 40 different versions of this, and then times 40, 42, for instance, the number of institutions that we have, you end up with a lot of models.
It’s not scalable and you end up with a ton of boiler plate that is manually set up and very hard to manage. This is the example. This is a summary of our first revelation early. What we did is we used two dbt Utils macros, and they are union relations and get relations by patent. These were the key for us.
So the first one with union relations that solved the problem of the inconsistency in the collections being present, [00:08:00] or whether the fields were present. That took that headache away, and the key thing there is, if for example, email wasn’t in my user’s table, union relations will just look at the contents of all of my different tables and it will say, "okay, that one doesn’t exist in one particular source, I’ll just fill it with a null." When these things can change, not just at this set of time, but they change over weeks as people ramp up and start using Aula more and more, and they use more features and more collections up here, that was a massive time saver for us. And then when we combine that with GAT relations, by pattern where you can use just standard sequel, like syntax, or just strings to define what tables and what relations you want to union, it makes onboarding way simpler. Like it says here, those new collections appear automatically. It was a bit of a lifesaver, and what we end up with is that entire column of staging [00:09:00] models removed. It doesn’t follow that best practice pattern, but for us, it was way easier to manage it this way. Now, at first, when we did this, our sources were not nicely connected to our union model, as you see here, but we’ll go into more detail on that later.
So if I revisit my science-y bet, my theorem of dbt complexity, what you end up with, if we put some more hard numbers to this, if you do this for all of these collections at once, is a 962 file pull request where you’re on the order of just under 9,000 line changes. Now, some people would say that it’s not reasonable to make your teammates review thisbut that’s their view. One thing that we did realize when we did this, is this work, this thing that we were patting ourselves on the back for, was just the start. What we needed to do was make sure that when we rolled it out, we didn’t break anything.
We needed to keep the quality [00:10:00] high and we needed to make sure that we maintain trust. This is where we had to look at how we ordered it these changes cause they were obviously very big changes. Now we were already doing automated poll request tests and that was enormously helpful for us. But what we then had to do is break that big change down into a series of smaller ones.
We had to think about what are the most important collections that we rely on that, that really warrant a deep dive. Like, how do we make sure that you roll it out without breaking anything and make sure that things are still consistent, and how do we make sure that if anything does change, we communicate it clearly to the rest of the business.
One thing that we found after we’ve broken it down and we started doing a series of smaller poll requests is that we wanted to simplify our feedback system and this was really helpful because rather than just leaving comments on poll requests and just saying, "have you considered this, or I’m not sure about this," we have a system where you have one of these three emojis that range from "this is a showstopper and I think it’s broken" [00:11:00] to "this is just something that I’m curious about and I want to check just for my own information." That really helps make our communication clearer. Finally one of the biggest helps was dbt audit helper.
This is another package and this is an example of how you use it. This is more or less adapted from the docs to suit our example. What you essentially do is you get one table, which is your production version, you get a dev version of that same table where you’ve made these changes, and then what Audit Helper does, once you’ve said,for example, [in] our last activity, depending on when we’ve done our last run, that might not match it. For example, we look at the primary key. We want to make sure that all the details are the same on that level. What audit helper does, and if you run this in the in the dbt cloud ID, [is] it generates it all for you. So that’s really helpful as well. So what it does is row by row, it tells you whether each column, or whether the columns on both sides [00:12:00] are present in the other.
And from there, you can then explore. This is actually where you compile the code and you go and look at the the compi led version. You’ll see that there’s actually a full breakdown available in there and different ways of looking at it, which is super handy. This will save you a lot of time if you ever have to do something similar.
[00:12:16] The New Way #
Lewis Davies: So what this meant for us, what our new way gave us, these new collections added themselves. Like I said, as institutions ramped up and they started using different features and more and more of these tables were populated, we didn’t have to check back over a period of weeks, on repeatedly making poll requests and often making mistakes with those poll requests, cause it’s very hard to compare at that level. Obviously the nice thing for us is that there’s less manual effort involved, and like I said, there’s way fewer errors. Our data is more complete. We’re not missing as much stuff. We went from 900 models to below 200 because of the sheer number of boiler plate models that we got rid of.
Obviously that then that had an impact on our run times as well. And they were reduced, I think on our first one, we probably shaved about 30 [00:13:00] minutes off, which was which was very nice, but that was far from the end of it. Erika is now going to talk about some of the other work that we did with our dbt project
Erika Pullum: All right, thanks Lewis for that great start. So everything is great. We got rid of the work. We don’t have to copy paste anymore. We’re kicking back in our separate kitchens and making some popcorn, and that’s when the calls start coming from inside the house. So this combination of get relations by pattern and union relations leaves you with a couple of problems.
If an institution starts using a new feature, the Mongo collection shows up. Stitch pulls it into the data warehouse, and now we have an untested collection that’s flowing into our unions. So the problem of these untested sources is really a problem of timing. You don’t want new data showing up in your warehouse faster than you can make interview PRs to get it covered.
And with Mongo, remember that new tables can pop up anywhere, any time. Chasing them all down [00:14:00] would take way too much time and energy. So the way out is to automate everything, and when I say everything, the entire first layer of the dbt project. So snapshots, sources, the unions that Lewis is talking about that were really the foundation of our dbt project.
[00:14:17] Custom Automation #
Erika Pullum: So let’s take a quick look at the tools that we’re going to use to do that. To solve this problem, we need to auto-generate everything, and the tools we chose to do this are all within the core dbt ecosystem. Vanilla Jinja is more or less perfect for solving this problem. Its core is a templating language and it happens to be easily understandable to people who are working with dbt.
Once things start getting complicated. We’re going to need some YAML to help keep things organized, more on that in a bit. And finally we use Python to help read, render, and write, a nd instead of using dbt’s Jinja engine, we’re using out of the box Jinja and Python’s Jinja to package.
[00:14:55] Jinja: A Primer #
Erika Pullum: So before we get going with the automation, let’s look at Jinja: A Primer.[00:15:00]
The dbt context can actually make Jinja a bit harder to learn. I had a hard time when I first started learning dbt Jinja because there’s dbt specific Jinja, as well as the more standard Jinja syntax at its simplest. Jinja is a little bit like Madlibs. So you have this template that has blanks and some rules, like a noun goes here and a verb goes there, and you can fill in the blanks and render it back out to a full story.
And that’s that flow you’re seeing on the left side of this slide. When you’re compiling Jinja templates in a Python context, you can actually pass an arbitrary dictionary into the template, and anything that matches up is going to be substituted. So we’re going to do a quick syntax primer before we dive in to the automation.
So the top two examples here should be familiar to people that have done significant work with Jinja in dbt. The first example is an expression just like Madlibs. We’re going to fill in the blanks inside the double curlies, substitute one thing for another. Next step, we have statements help us add logic so we can have ifs.
We can have four loops. And then [00:16:00] finally, if you’re going to make Jinja templates that then create resource files like a SQL file or a source that contains Jinja, you have to be able to escape sequences. Escaping tells the rendering engine when to substitute as it renders and when to make a string literal. This is really important because we need the output to be valid to dbt.
There are two ways to do it, and both of them are on the slide. There’s this statement raw, and nraw, which are really useful for long blocks. But if you want to do something like a ref, you want escaped and an escaped Jinja to co-exist on the same line, and the second method of scaping expressions is more readable in that instance.
What we’re doing is building up a string that includes escaped curlies inside of the expression. There’s this like outer wrapper here, and we’re actually intentionally nesting our curlies. It’s like a bit like crossing the streams. So I should note here that friends don’t let friends write this kind of Jinja without a live parser.
It’s really hard to keep this syntax straight. So Randy’s going to start dropping [00:17:00] some examples into the Slack channel that you can use to test out some of the syntax concepts above. Now that we’re grounded with some syntax, let’s take a look how this works for automation. Let’s start with the sources, cause this is a little bit simpler and that was the top priority for us.
Remember that we can have get relations by pattern pulling in the raw data before the sources themselves are tested. We’re going to use substitution and statements to render a list of tables into this source schema, and when things are pretty easy, like in this template, you can just have any logic you want in the template.
So at Olive, for example, we had some tables that started with an underscore and we had logic to add an identifier for the underscore users, so that we could just name the source users and not have to hit the underscore key, but it might not make sense to put all your logic into the Jinja template.
You’re going to eventually overload it. So what if you have a lot of different source types or you have a lot of, if then logic going into the template? [It] [00:18:00] gets overloaded pretty quickly. You might have an instinct to put some of the rules into the Python script that’s running the whole show for you, but that gets complicated too, because then you have to hold like what’s in the template versus what’s in the Python script. So we taught the dbt ecosystem here and we turned to YAML. So it made this config file that helps keep everything organized. It’s really nice because it’s human readable.
So we can have our template file and our file name and the output path where we want to put it for each dbt resource that we’re going to generate. So types of models, types of snapshot, types of unions—we’re going to do a lap through the flow on the left. Get the template, substitute the data and then output it to the correct location.
And this config, a YAML file, lets us define how all of that is going to work. We can also add special lists like the one in the source query section of that file, which lets us avoid hard coding logic into functions. And that’s really helpful for data inputs. Like we’re expanding the list of [00:19:00] tables on the previous slide.
I’ve got all our ingredients. We’ve got a template. We’ve got a way to write rules that humans can read, and it’s time to take a look at a high level at the Python script that ties all this. This is just a schematic diagram here, but to orchestrate all this, we have a Python script that is going to do laps through each resource defined in the configuration template.
We’re going to go to the information schema of the database, so we know what sources exist. What has popped up, what calls have come from inside the house and using the specifications in that configuration file, we’re going to build a list of dictionaries, and each dictionary in the list is going to represent a dbt resource that we want to build.
It’s compatible with a template. So let’s say I want to do three sources. I’m going to have a list of three dictionaries. Going back to our Madlibs example for Jinja, you can’t put a noun in a blank intended for a verb and end up with a good sentence, so this first definition step helps make sure that there’s a match between the templates and the data that we’re going to substitute.
[00:20:00] We then loop over that list of dictionaries, which renders a template into valid source or model code and write each file where it needs to go. Now that we have our scaffold for how this works, we’re going to go to a scary code example. So this is a union template that covers all the different kinds of Jinja syntax we discussed, and it will put this back on your deck. Lewis mentioned earlier that when you get relations by pattern and union relations in common, your sources get disconnected, so we figured out a way to put them back. This example is a little bit to take in, so Randy’s also going to drop a version of this into the Slack that you can plug into the live parser to try out.
So this template demonstrates all of the Jinja syntax that we covered a few slides ago. We have some expressions that are doing substitution, there’s some statements that are controlling logic, and we have both forms of escaping Jinja to ensure that it ends up in the rendered file. [If] you didn’t get relations by pattern and union relations means the dbt parser can’t properly resolve the dependencies of your union model.
So the [00:21:00] template shows how to use depends on to force them back onto the list. The new and improved way is the best of both worlds. We get all of the benefits that Lewis shared in the first part of the talk. We added a github action to run this for us, which solves the timing problem of the untested sources become tested.
And something that we really like about this solution is that it keeps to this core contract of dbt, which is that there’s this one model that relates to an object in your data warehouse. Everything is still version controlled. Everything is still traceable, and we can see what’s going on with our sources.
So of course, this is all really cool. And you might be asking should everyone automate? And the answer is probably not. You need to have the exact same kind of problem we had where you have similar environment [with] so much going on that manual maintenance is not possible. If you, or a loved one are using get relations by pattern and union relations a lot, you may be entitled to automation.
I’m sure there’s lots of other ways you could solve this problem. , but one thing [00:22:00] we really like about doing it this way is that the contract for dbt is this one file, one model, one table, and this lets everything stay traceable. Finally, of course we have some lessons learned along the way. You want to make the dbt ecosystem work for you so best practices are a great starting point, but you have to know the rules to know when to break them. Like Lewis said, the solution was only the start. Reviewing big changes takes time and effort. It’s like a stress test for your team’s working practices. You can probably guess in the chat which person submitted the giant PR and which person had to review it.
Finally, extraordinary changes require extraordinary justification. There were real business reasons that we had to do this and it supported what we were trying to deliver for Aula and all their customers. So with that, I’d like to thank everyone so much for coming to our talk today. Big thanks to Randy and the dbt Labs team for keeping everything organized and we’ll be in the Slack.
Last modified on: Oct 13, 2022