dbt Core v1.5: model governance of, by, & for people

in societate vivimus, fortasse collaboramus ("we live in a society; perhaps we collaborate")

Today, we released dbt Core v1.5. The release includes features to help govern critical dbt models, across dozens or hundreds of data practitioners. These features are here to help you manage complexity—the complexity of people. We want dbt to be a framework for clarifying ownership, communicating across teams, and managing change.

You can read our docs update for all the details.

Where we’re headed: Soon, you’ll be able to share & access dbt models across projects and across teams, just by writing ref. Multiple teams actually own their code & their data—and they are equipped to serve public, contracted, mature data models for the benefit of all.

Why we built this

There are many large companies using dbt today, so large as to be unimaginable to me. Hundreds of people contributing code to dbt projects, spread across dozens of divisions and domains. This is a good thing. We’ve enabled those with the most context, the closest-to-the-business people, to own the logic of the business. At the same time, that’s a lot of cooks to have in one kitchen.

This has been humbling for me, as a person who started using dbt as a team of one, a team of less-than-one, a consultant on part-time loan, by the sprint and by the hour, getting a whole lot of jobs done, and then going home.

The challenges that come at this level of scale go beyond workflow ergonomics or engine optimization. They are challenges of people. The most important problem becomes offering the right context to the right person—not too much, not too little—so they can get their job done.

It’s also about instilling the sense of responsibility, and humility, that comes from providing a service in production. To know that other people are counting on you. It’s a mistake to privilege the producer of that context over the person who badly needs it for their modeling, their reporting. If a model has the optimal structure, and no one can trust it enough to query—what’s the point, really?

It won’t work unilaterally, with one person or one team as central arbiter and bottleneck, and it won’t work in a vacuum, each team starting from scratch. It only works with a common framework, with common standards, on a common platform.

There’s great tension here, and also great possibility: the potential for collaboration. Such is the premise and promise of governance. If we can agree on some ground rules, acknowledge what we owe each other, a lot of things are possible.

I depend on the models of others, and others depend on mine

Imagine: I’m an analytics engineer at a big organization, bigger than anywhere I (Jeremy) have ever worked.

I know that I cannot go read the source code for every model that’s needed to power all analytics. Those models make total sense to someone, but however clever I may think I am, there’s just too much context for me to be that someone all the time. I cannot magically borrow the full context of 100 business analysts, nor can I load the source code of 5k models from disk into local human memory. I need to know what’s worth paying attention to, and what’s someone else’s something-along-the-way. I also need to know that someone else won’t come along and, without any indication they shouldn’t, take a dependency on my internal building block that only makes sense in context.

How do other programming languages solve this? With access modifiers (some methods are public, others private), which we’ve taken as inspiration for model access. Some models are meant for sharing. A lot more are for handling complexity along the way: not interface but implementation detail, private to their owner.
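
As a rough sketch of how that split might look in a properties file: the `finance` group, its owner, and both model names below are hypothetical, made up for illustration. Only the `groups`, `group`, and `access` keys are dbt's own.

```yaml
# Hypothetical group and model names, for illustration only.
groups:
  - name: finance
    owner:
      name: Finance Analytics       # hypothetical owner
      email: finance@example.com    # hypothetical address

models:
  # An internal building block: only models in the same group may ref it.
  - name: int_payments_pivoted
    group: finance
    access: private

  # A mature model that is meant for sharing beyond the group.
  - name: fct_revenue
    group: finance
    access: public
```

A model that sets no `access` at all gets the default, `protected`, which means it can be referenced from anywhere within the same project. If a model outside the `finance` group tried to `ref` the private model here, dbt would raise an error rather than quietly allow the dependency.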

How about the models that are meant for sharing? I’m a well-meaning analytics engineer, so I have some understandable pet peeves, like inconsistent conventions for column naming and typing. Two columns start with has_; one’s a boolean and one’s a 0/1 integer. That’s annoying! It would be a lot easier for me if they were named and typed the same. But wait: That model is being used downstream, powering things in Excel I wouldn’t believe. Tableau extracts off the shoulder of Orion, PowerBI reports glittering in the dark near the Tannhauser gate. Soon—other teams’ dbt projects. When a model is that important, it deserves a model contract—the more explicit the better, especially when someone else is taking my word for it.

That’s not to say I can never make a much-needed change. I just cannot flip a magic switch and expect to migrate everyone overnight. Nor can I go break them all on a whim. They’re other people: the customers of my data product, the people all of it is for (that number, that column, all those models, all that SQL). Those renames could still be the right thing, but I need to make them with fair warning, with a migration path, and on a predictable cadence. I bump my model version; soon, I add a deprecation date; dbt helps me keep track, and helps with the communication along the way.

I treat the people relying on my models the way I’d want to be treated if I were relying on theirs. That’s it; the rest is commentary, go and read it.

Summary

We’re introducing three new constructs in dbt Core v1.5:

  • Model access. Specify where a given dbt model can be referenced. Designate models as “private” when they’re an internal implementation detail, a modular building block for your eyes only. Make them “public” when they’re mature & stable assets that should be used & reused elsewhere, including other teams’ dbt projects.
  • Model contracts. Explicit is better than implicit, especially when other people are counting on you. Catch and prevent changes (renaming, removing, retyping a column) that could break downstream uses of your data—the people relying on you, the whole point of doing the work.
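models:
  - name: dim_customers
    access: public  # may be referenced outside its group, including from other projects
    description: "Everything you need to know about our customers"
    config:
      contract: {enforced: true}  # the build fails if the model's columns or types differ from this spec
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: customer_name
        data_type: string
      # ...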
  • Model versions. Communicate and manage breaking changes to mature & stable models. You do not need a new version, defined in YAML, every time you change and redeploy a model in any way. This is for when you said a thing would be one way, and now it needs to be different. It’s a tool for managing changes and maintaining credibility, across an organization with divergent stakeholders.
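
To make the third construct concrete, here is a minimal sketch of a versioned model. It reuses `dim_customers` from the example above and assumes a hypothetical breaking change (dropping `customer_name` in a second version); it is an illustration under those assumptions, not a recommended migration.

```yaml
models:
  - name: dim_customers
    latest_version: 1           # unpinned refs keep resolving to v1 for now
    config:
      contract: {enforced: true}
    columns:
      - name: customer_id
        data_type: int
      - name: customer_name
        data_type: string
    versions:
      - v: 1                    # the original contract, unchanged
      - v: 2                    # hypothetical breaking change: customer_name is dropped
        columns:
          - include: all
            exclude: [customer_name]
```

By default, dbt looks for one SQL file per version (`dim_customers_v1.sql` and `dim_customers_v2.sql`). A downstream model can pin a specific version with `ref('dim_customers', v=2)`, while a plain `ref('dim_customers')` resolves to `latest_version`. Moving `latest_version` to 2 is then a deliberate, announced step rather than a surprise.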

Where we’re headed

dbt provides a powerful solution for managing data within a single project. Soon, it will have the ability to receive stateful input from other projects. This will unlock seamless multi-project collaboration for the first time. Conceptually, it will feel similar to how project state works for Slim CI. It will be a feature of dbt Cloud to make that experience seamless. Always the latest state, coordinated across projects, handled behind-the-scenes.

This is different from how packages work. To use another project as a package, you need full access to that project’s source code, and it’s up to you to reconfigure & run those models. They’re your models. Your dependency on the other project is as library code, with all its attendant flexibility & complexity, rather than an API that provides you with a pointer to the published dataset, no more & no less.

That’s the vision: Use another team’s public model, in your project, with a set of guarantees for its stability, and visibility into its planned migrations. Just write ref.

Coda: why I care

If you’ve read this far, and you know me…

Every morning, I wake up and keep doing this (rewarding, taxing, gratifying) job. I believe two things in my heart:

  1. dbt is the right tool for the job—for the people who need it, for now, and (I hope) for a long time into the future.
  2. dbt isn’t done yet. It deserves to be better, and we are capable of making it better.

The day that I stop believing either one of those things, it will be my last day working on dbt.


dbt has always been a mix of idealism and pragmatism. Dogmatic on some things, flexible on others. We believe that analytics code is code, deserving of testing and version control. Even if you’ve never worked this way before, we think you should. We also believe in building a tool that helps real people get their jobs done, warts and all.

All happy paths are alike; we all want to walk the happy path. Each unhappy path is unhappy in its own ways. I believe that the most magical thing about dbt is the number of un-magical things it does, letting you focus on the stuff that’s hardest about this work: other people.

We should keep optimizing the happy paths, of course. It’s the path we all want to walk on. Faster, more cost-efficient. Iterate on modeling logic with faster feedback. Don’t rebuild what doesn’t need rebuilding. These things are important, and modern data platforms continue to roll out an impressive set of features that dbt can help you leverage.

In the happy path, I’m alone, sitting at my computer, watching one set of numbers turn into another set of numbers at blazing speed. The starting point is the individual, the first mover, the singular data engineer at the control board with all the knobs. My pipelines shall run eternal, bronze and silver and gold.

The fact is, most teams stop working and feeling like this pretty quickly. Codebases become complex, and their outputs serve a variety of competing needs and personas. Managing those tensions, those divergent needs, can turn the happy path into a forked road, full of potholes. Enterprises of great pitch and moment, their currents turn awry, and lose the name of action.

In v1.5, we are taking aim at the unhappy paths. We’ve added into dbt a new set of primitives, building blocks, both opinionated and flexible. This feels like additional complexity for users, but it’s complexity because people are complex. You know your people better than we do, and they probably know things you don’t, too.

We believe in teams actually owning their data. Actually owning. Not owning as a treat, on long-term lease, with the small-print provision that a central data team can revoke those rights at any point. The models are theirs to define. The compute is billed to their data warehouse/lakehouse account/project. (And they’re given the tools to avoid building and billing what they don’t have to.) At the same time: Not operating in a silo. Making it accessible for people to enter in, to find what they need.

If you’re a lone analytics engineer slash data person, wearing a lot of hats at a small startup, you probably don’t have this problem. You have a lot of other ones. You’re a member of a fledgling band, first-name-basis, working hard to get off the ground. A six-month window for deprecating model versions? Who knows what your company will even look like in six months? Some of the features that are coming to dbt this year may not be for you, not for a while, and that’s okay! Know that it’s a framework that can scale, if and when you need it to.


dbt is for people, and people aren’t perfect. We’re going to keep building this imperfect thing, and finding ways to build it together.

dbt deserves to be better. For the team of one, and the many teams of many. We are capable of making it better, and so we will.

Tell me what’s missing from dbt, what needs to be better. I’m here to listen: at a meetup, on GitHub, in my DMs.

Last modified on: Apr 27, 2023