My first week with the new dbt
This post first appeared in The Analytics Engineering Roundup.
I’ve been writing dbt code for coming up on 8 years. I don’t get as much time to do that these days as I used to (forever a point of frustration), but I do roll my sleeves up periodically. And last week I did a bit of a desk-sweep—I cancelled everything non-essential and dug in on the new and improved dbt experience we shipped at Coalesce.
I truly believe that it is not possible to understand a product unless you use it yourself, and feel strongly that one of my most important roles as CEO is being a long-term advocate for the voice of the practitioner in the product development process. Doing this is a tremendous investment of time, but there are few things more important.
In fact, this is where much of our roadmap came from leading up to Coalesce 2023. As I shared in my keynote, I went through this same process roughly a year ago and found my experience to be…suboptimal. I realized the friction that project complexity was creating in the dbt user experience and committed to fixing that over the coming year. Most of our product launches from last month fell out of this in one way or another.
My experience this year was so good. I kid you not, this was the most fun I have had writing dbt code—potentially ever. It was also challenging at times, but it was the good kind of challenge, the one you experience when you are practicing a new piece on the piano and your fingers just don’t (yet) know how to do the thing they’re supposed to do. I was learning a lot, building muscle memory, and trying to update my mental model of the world.
Here’s an attempt to bottle up what I learned and share it.
First: Deferral is Awesome
I’m a CLI user. It’s hard to not be when most of my experience with dbt predates the dbt Cloud IDE. There was just too much muscle memory built into this flow of development, from `cmd-tab` to all of the VSCode-native functionality that one would expect.
In the past, that meant I wasn’t using dbt Cloud in my development process. With the Cloud CLI, that’s no longer true: every `dbt build` I execute on the command line is parsed and executed from within a VM living inside dbt Cloud’s infrastructure. This means I no longer have to do local version upgrades, a massive advantage for someone like me who honestly just doesn’t have time to f around with Python dependencies. The Cloud CLI installed with zero hiccups, auth’ing to our Cloud account was seamless, and it’s quite fast: I don’t perceive a difference vs. local execution on a project of 1,000 models.
The install/upgrade experience is nice, but that’s just the appetizer. My personal favorite part of the Cloud CLI is that I can take advantage of dbt Cloud’s statefulness via a feature called “auto-deferral”. Auto-deferral (now also available in the IDE) means that every time you execute a `dbt run`, dbt compares the state of your development project to the state of the production environment, and only executes code in your development environment where that code differs from production or is downstream of those changes. All upstream references point to prod.
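For dbt Core users, a rough approximation of this behavior is state-based selection combined with deferral; the artifact path below is hypothetical:

```shell
# Approximate auto-deferral with dbt Core's state selection.
# Builds only models whose code changed vs. production, plus everything
# downstream, while resolving unchanged upstream refs to the prod schema.
# (The artifacts path is a placeholder for your saved production manifest.)
dbt build --select state:modified+ --defer --state path/to/prod-artifacts
```

In the Cloud CLI this comparison happens automatically against the production environment, so none of these flags are needed.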
Just from hearing that description it’s hard to picture what a massive improvement this is. Let me try coming at this another way. A typical development flow in dbt goes like this:
1. create feature branch
2. `dbt build` to initialize your dev environment
3. write some code to implement a feature
4. `dbt build` the models you’ve changed
5. iterate steps 3 and 4 until you’re happy
6. submit PR
Previously, step #2 would take a long time for folks with serious dbt projects. Development environments are, by default, identical to production, so by default you’re running your entire project from scratch—500, 1,000, however many models and however much compute. Now you’re running some tiny % of that…the stuff that is actually relevant to the code you just changed. Save time, save $$, stay in flow state: win/win/win.
In my particular case, each of my dbt invocations was just a couple dozen models and took less than a minute to execute (the rest of the DAG referenced prod). That type of development experience is just worlds apart from what I had a year ago when I had to initialize my dev environment from scratch.
Second: Mesh isn’t just about Tech
We launched the dbt Mesh paradigm at Coalesce 2023. It touches nearly every aspect of the dbt experience and required real investments across nearly every codebase we manage. The thing I set out to do last week was to evaluate the current architecture of our dbt project and start to “meshify” it. (I don’t know if this verb will catch on but it’s now widely used within dbt Labs!)
Going through this process was fascinating. It wasn’t a simple push-button upgrade. Implementing Mesh felt more like doing a reorg and a refactor at the same time! And that’s because that’s exactly what it is. dbt Mesh is not just about how you organize your code—it is also a statement about who owns that code. Both are challenging, but when you put in the work things really click.
Implementing Mesh: the Technical Side
A lot of things just worked and are not worth talking about here. Cross-project `ref` was straightforward. `dependencies.yml` was straightforward. Etc.
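As a sketch of what that wiring looks like (the project and model names here are hypothetical, not our actual ones), the downstream project declares its upstream dependency and then references that project’s public models by name:

```yaml
# dependencies.yml in the downstream project
projects:
  - name: analytics   # the upstream project, as named in its dbt_project.yml
```

```sql
-- models/finance/fct_headcount.sql
-- The two-argument ref() resolves a public model in the upstream project.
select * from {{ ref('analytics', 'dim_employees') }}
```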
There were some things that I had to sort out. We decided to go multi-repo. We have several CI checks implemented in our primary internal analytics GitHub repo and I had to familiarize myself with how to set them up in this new project. This wasn’t incredibly complex, but it did require me to set up things that I personally hadn’t had to do before. I also had to connect our warehouse to the new project within dbt Cloud and navigate through the associated Okta config. Dig up passwords in dusty 1Password vaults. This kind of thing.

Perhaps the most challenging thing was setting up a private package of shared macros. This is something that has come up with many of our early mesh users—when you go from one project to many, you need a shared library of macros that all projects can reference. And likely you’ll want that shared code, like the rest of your project code, to be private. So instead of creating a single project for my meshified code I needed to create two projects—one for the new data product and one to act as a common code library. Again, this wasn’t too hard, but for me it required some mucking around with GitHub access tokens and the like. Once I had read all the docs I did get it to work on the first try though.
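A minimal sketch of installing such a shared-macro package, assuming a hypothetical repo name and a token stored as an environment-variable secret:

```yaml
# packages.yml in each consuming project.
# The repo URL and secret name are placeholders; dbt Cloud requires
# secret env vars to use the DBT_ENV_SECRET_ prefix.
packages:
  - git: "https://{{ env_var('DBT_ENV_SECRET_GITHUB_TOKEN') }}@github.com/example-org/dbt-shared-macros.git"
    revision: "0.1.0"   # pin to a tag or commit for reproducible installs
```

After a `dbt deps`, the shared macros are callable from any model or macro in the consuming project.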
All-in, this setup work took me a couple of hours. If you knew what you were doing and had all of the credentials available at your fingertips you could do it in 15 minutes or less.
Implementing Mesh: the Organizational Side
The org side—people, process, ownership—was the fun part, and it’s where mesh truly shines.
I believe that in any organization there are people who are being well-served by the current process and tooling surrounding dbt and there are others who are being less well-served. At dbt Labs, we have an internal data team of ~10 people and we also have a finance team that is extremely dbt-native.
For reasons that are purely historically contingent, it’s the folks on the internal data team who have built a large majority of our dbt models and all of our dbt infrastructure. They set the rules, from coding conventions to support rotations to code review processes. The finance folks just have to play by them. This has worked ok, but it has limited the finance team’s ability to own and operate their own data products in the way that made the most sense for them.
What’s particularly interesting is that there are situations in which the two teams’ work doesn’t actually overlap that much. The internal data team is primarily concerned with data that focuses on customers and users, which the finance team uses heavily. But the finance team is also concerned with data about internal operations, often focused on employees as the primary entity. (Mark Matteucci spoke about the finance team’s work at Coalesce London this year.) Not only are these two domains only slightly related in our DAG, the privacy requirements for them are very different.
It was the separate nature of those workloads that made them such a natural first target for our internal mesh architecture. In order to make it happen, though, we needed to do a bunch of negotiation. Who is going to provide ongoing support? What standards will new projects need to live up to? Etc. In opening up the DAG to new ownership, it’s important to make sure that new stakeholders understand and are ready to live up to their responsibilities as stewards of code.
The end result of this decision has been very positive. The finance folks involved in writing a ton of dbt code have been given their own space to operate in and have begun to migrate existing functionality there. They have set up their own `CODEOWNERS` file and other processes, and are much more able to operate independently. They’re excited and empowered. I look forward to doing a retro with them in a quarter to see how things have evolved.
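For illustration (the paths and team handles are hypothetical), a `CODEOWNERS` entry that routes review of finance-owned code to the finance team looks like:

```
# .github/CODEOWNERS — GitHub requests review from the matching team
# whenever a PR touches these paths (org/team names are placeholders).
/models/finance/   @example-org/finance-analytics
/macros/finance/   @example-org/finance-analytics
```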
Third: Explorer is slick
dbt Explorer was another of our major launches at Coalesce last month. It’s a complete overhaul of dbt Docs, giving you the ability to visualize your entire dbt investment—one project or many—but now with a lot more under the hood.
Ok, Explorer is a cool product and I’m excited about it. But how did it actually fit into my workflow? How did it feel to use?
My answer today: very good. Not perfect (yet!) but very good.
- It’s fast. In our internal project of 1,200 models and over 2,500 total DAG nodes (including sources / metrics / exposures), it can load the whole thing, but the experience is way more reasonable if you specify a sub-DAG. As long as you are looking at a reasonable sub-DAG, the interface is snappy. This was a MAJOR pain point with dbt Docs’ static web page—graphs of this size would just crash your Chrome tab.
- It has a lot more info. dbt Docs just knew about your code; it didn’t really have any more information to show you. Explorer also knows about job and test execution, can trend runtimes, can tell you when data was last loaded, and more. After a few hours with it, I can now feel myself gravitating to it as the place to learn about what’s going on inside our warehouse. I don’t know what the future holds here but I find myself wanting to embed content from Explorer into other interfaces. Like…I want a Notion plugin!
- Problem #1 is something we’re aware of and working on right now: Explorer doesn’t support all dbt selectors today. I found this to be frustrating at times; you may or may not notice it. We’ll sort this soon.
- Problem #2 is that Explorer isn’t yet available for dev environments, only prod environments. So as I’m writing code I can’t iteratively see the changes visually. Again, this is on the roadmap and we’ll get there soon, but as a user I notably felt its absence.
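For context on the selector gap, dbt’s node selection syntax is what lets you carve out a sub-DAG on the command line; the model name and tag below are hypothetical:

```shell
# dbt node selection syntax — the same grammar Explorer aims to support.
dbt ls --select +fct_orders     # fct_orders plus all of its upstream nodes
dbt ls --select tag:finance+    # finance-tagged models plus their downstream nodes
```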
Short answer: if you used Docs, you’ll love Explorer. It fits right into the workflow in the same way. And over the coming few months during public preview it’ll only get better—we’re at the very beginning on this thing.
Night and Day
These are the biggest changes in my personal dbt workflow for a very long time—multiple years. Easier install, faster (and cheaper!) development, more mature programming constructs, better interactive visualization.
2023 was the first year we started the year with a ramped engineering team, and it’s fantastic to see what we could do. Thanks to everyone who put in the work to make this experience come to life.
After making a big hire, I anticipate having more time to write dbt code over the coming year. I cannot wait to continue to push the envelope on every single thing we ship.
The next issue I write I plan to return to more industry-wide content. With Coalesce and our recent launches I’ve had a lot of specifically dbt-related stuff I wanted to get out there. More soon.
Last modified on: Nov 16, 2023