Scaling knowledge

The hardest thing about scaling a company is communication.

The more people that work at a company, the more nodes in the network, the more links between them. In a completely decentralized network, the number of connections is proportional to the square of the number of nodes. In a hierarchical network, each node is only connected to X peers, which limits complexity but also significantly decreases the flow of information. The reason why messaging platforms like Slack are so important today is that they reduce communication friction, allowing companies greater flexibility and/or greater throughput in how they design their communications network architectures. Or, in people-speak: their org charts.
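To put rough numbers on that, here is a minimal sketch comparing the two network shapes. The function names, the headcounts of 20 and 500, and the cap of 5 peers per person are all assumptions picked for illustration, not figures from any real org chart.

```python
def full_mesh_links(n: int) -> int:
    """Links when every person is connected directly to every other person."""
    return n * (n - 1) // 2


def capped_links(n: int, peers_per_person: int) -> int:
    """Upper bound on links when each person is connected to at most X peers.

    Each link is shared by two people, so the total is halved.
    """
    return n * min(peers_per_person, n - 1) // 2


# Hypothetical headcounts and peer cap, chosen only to show the trend.
for headcount in (20, 500):
    print(
        headcount,
        full_mesh_links(headcount),
        capped_links(headcount, peers_per_person=5),
    )
```

Under these assumptions, 20 people have 190 possible direct links and 500 people have 124,750, while the capped network grows only linearly with headcount. That gap is the information flow a hierarchy gives up in exchange for lower complexity.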

If you’ve ever been at a company that has grown from 20 to 500 employees, you will have felt this process in your lived experience. What were once impromptu conversations turn into scheduled meetings. Your calendar starts to come under pressure. You simply spend a greater percentage of your time saying or typing words to other people.

There are three types of knowledge that people within a company need to communicate to one another.

First, there is tactical knowledge. What is the status of this project? Who is working on that activity? When will that dependency un-block me? What are our financial goals this quarter? There is a huge industry of software products (e.g. Trello) and a collection of methodologies (e.g. Scrum) that are dedicated to collecting and disseminating this kind of knowledge.

Second, there is cultural knowledge. Culture isn’t a mission statement or a set of values: culture is the way a group of people make decisions in support of the company’s values. Founders can’t be present for every decision, but the best ones build cultures that allow their employees to make decisions from the same core values that they would.

The best post I’ve ever read on this topic is by Brian Chesky, CEO of Airbnb, called Don’t Fuck Up the Culture. The whole post is beyond phenomenal, but the payload is here:

The stronger the culture, the less corporate process a company needs. When the culture is strong, you can trust everyone to do the right thing. People can be independent and autonomous. They can be entrepreneurial.

People can be independent and autonomous because they all share the same knowledge about how to make decisions.

The third type of knowledge, and ultimately the subject of this post, is factual knowledge. How many shirts did we sell yesterday? How many months of cash do we have in the bank? How are our user retention cohorts trending?

Factual Knowledge

As companies grow, there is tremendous pressure to produce more factual knowledge. Growth begets bigger budgets, new people, new processes, new technology, all aligned around new and bigger initiatives. Each initiative brings with it the same set of questions: How much X did we produce? Why? How do we double it?

Undertaking new initiatives without supporting factual knowledge doesn’t necessitate failure, but it significantly lowers your chances. Growing a company is like a poker tournament: you might win a couple of lucky hands early, but ultimately you’re going to have to make good decisions if you want to win the whole thing. And good decisions are founded on a solid understanding of facts.

With new and cheaper fact-generating technologies like data pipelines, data warehouses, and BI tools, companies today are generating more facts than ever. But what we’ve come to realize over the past two years is that almost no companies are good at disseminating factual knowledge. What’s more, most companies don’t even realize they suck at it. I’m even going to go out on a limb and suggest — begging your pardon — that your company probably sucks at disseminating factual knowledge.

Most companies suck at disseminating factual knowledge. Yours probably does too.

If you’re a data analyst or engineer, you might not be aware of this problem, specifically because your entire job is to be a librarian of facts. Of course you know where to find them! And even if you don’t, you know the arcane internal systems you have to use to look them up. But you don’t make day-to-day decisions in product, marketing, finance, or ops. And most of your counterparts who do (a.k.a. “business users”) have a common experience repeated multiple times a day:

  1. They look at a report you’ve built for them
  2. They have a question about the report: there is some nuance that they want to understand about how it was produced. Did you remember to exclude refunded orders? How did you define “channel”? Etc.
  3. They spend a couple of minutes trying to answer their own question. They fail.
  4. They either a) ask you for clarification, a request that goes into a long backlog of other such requests, or b) shrug and move on with their day, totally ignoring your report.

This is a communication failure. In order to disseminate factual knowledge, it is insufficient to simply disseminate data. Factual knowledge must include the data themselves as well as the knowledge about how those data were produced. This is why scientists don’t just upload CSVs; they write lengthy papers describing how other scientists can reproduce their results. Otherwise, why would anyone trust them?
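One way to picture "data plus how it was produced" is a number that carries its own definition with it. The sketch below is purely illustrative: `DocumentedMetric`, `net_revenue`, and the shape of the order records are hypothetical names invented for this example, not part of dbt or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentedMetric:
    """A value bundled with the notes a reader needs to trust it (hypothetical)."""

    name: str
    value: float
    definition: str
    caveats: list[str] = field(default_factory=list)


def net_revenue(orders: list[dict]) -> DocumentedMetric:
    # Assumed record shape: {"amount": float, "refunded": bool}.
    kept = [o for o in orders if not o["refunded"]]
    return DocumentedMetric(
        name="net_revenue",
        value=sum(o["amount"] for o in kept),
        definition="Sum of order amounts, excluding refunded orders.",
        caveats=[f"{len(orders) - len(kept)} refunded order(s) excluded."],
    )


# Made-up sample data for illustration only.
orders = [
    {"amount": 30.0, "refunded": False},
    {"amount": 20.0, "refunded": True},
    {"amount": 50.0, "refunded": False},
]
metric = net_revenue(orders)
print(metric.value, "|", metric.definition, "|", metric.caveats)
```

A business user who receives `metric` rather than a bare number can answer "did you exclude refunded orders?" without filing a request.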

The answer that hierarchical companies give their employees to the trust question is: because I said so. Have you ever heard the phrase “Are those the blessed numbers?” Who blessed them — a corporate data guru?! Blessed means “You don’t have to worry about how these facts were produced, just trust us.”

This answer wouldn’t support the distributed, entrepreneurial culture that Brian Chesky wanted at Airbnb, so they’ve worked hard to build something better.

Scaling Knowledge at Airbnb

Airbnb has been thinking about this problem since at least early 2016, when they released the post Scaling Knowledge at Airbnb. The post, like Brian’s, is singular — it described the problem with incredible clarity. Here’s my favorite line:

Low quality research manifests as an environment of knowledge cacophony, where teams only read and trust research that they themselves created.


The post also demonstrates the investments that the company had made in solving their own knowledge scaling problem. At the time, Airbnb’s solution was a collection of R, Jupyter, Markdown, and Git, and the result was a Shiny app.


That project, Knowledge Repo, has matured since then, and all of its code is available under the Apache 2.0 license. Its two core features, a) data and narrative colocated and versioned together and b) universal knowledge search, are prescient. But unlike other Airbnb data projects like Airflow and Superset, Knowledge Repo hasn’t taken off. My personal guess is that it’s too idiosyncratic to Airbnb’s particular needs.

Other companies with impressive data credentials have worked on this problem as they’ve experienced it themselves. LinkedIn (originator of Kafka) built WhereHows. Uber just released Databook. Companies scaling fast are experiencing this problem and they are devoting real resources towards solving it. But none of their solutions has yet broken out.

Introducing dbt 0.11.0

(need a refresher on dbt? see this.)

dbt 0.11.0 includes a massive new feature called Docs. Docs is the most requested feature from our community ever. It’s a browser GUI that helps users — any users — understand the provenance of their data. We’ve invested heavily in its UX and in the past month during the beta we’ve gotten wildly positive reactions. If you want to see what the fuss is about, check out Drew’s detailed release post.

I wanted to write this post, though, to plant a flag in the ground. dbt today is used by 250 companies to transform raw data in their data warehouses. But most downstream users of this data don’t have any access to the knowledge of the workings of this data transformation process. dbt is opaque to them. To say that another way:

To date, dbt has been contributing to the knowledge scaling problem I’ve described above.

With dbt 0.11.0, we’re saying that this is a problem — dbt needs to promote, not hinder, knowledge dissemination. Transforming raw data into useful information is a part of that process, but it’s not sufficient on its own. dbt Docs explicitly promotes transparency throughout the entire DAG, helping data engineers and analysts expose the provenance of data to others on their team and throughout the broader organization.

With Docs, users can understand where your data comes from.

dbt Docs

Just the Beginning

We have big goals for dbt Docs, many of them shaped by the smart folks at Airbnb, LinkedIn, and Uber who have trodden this path before us. The version we’ve released today is a huge step forwards, but it’s just the beginning. As more users get their hands on Docs we’ll be learning and evolving with you, so please: the more feedback the better.

One final note, because I know you’ll ask: we’ve released Docs under the Apache 2.0 license, making all of its code fully open source. This is a big deal to us. There are a lot of proprietary tools out there that purport to be your “single source of truth.” Well, I don’t want anyone else to own my truth. If your pricing strategy impacts my ability to access basic factual knowledge about my business, that’s a risk I’m just not OK with.

You already invest tens of thousands of person-hours every year maintaining this knowledge base; it should live within software that you control.

Thanks for reading. I’ll see you in Slack.

Last modified on: Nov 22, 2023
