Data cataloging

This piece is adapted from Jeremy Cohen's presentation at a private dbt Learn training in May of 2020.

Data documentation is much more than a few lines of code comments that we gloss over.

It helps us write better analytics code in four ways:

  1. Empowers the writer to refer back to their thinking
  2. Enables the reader to find their own answers
  3. Speeds up onboarding, and
  4. Reduces redundant communication

Our Favorite Metaphor

Analytics engineering is really the organization of an organization’s information.

Analytics engineers are data librarians, and documentation is our Dewey Decimal System for cataloging information.

Someone may not know precisely where to begin on a project, so they’ll ask the analytics engineer, aka the librarian, to point them in the right direction.

And at other times, they’ll know exactly what they’re looking for and delve directly into the Dewey decimal card catalog (aka the documentation) themselves.

Either way, both the analytics engineer and the documentation are here to help.

Relying 100% on an analytics engineer is inefficient, yet documentation alone can never cover every question. Having both empowers people to find the information they need in the most direct, efficient way possible.

Our Strong Opinions on Documentation

The argument for documentation revolves around four crucial points we hold dear:

1. People only use code they trust.

Testing and documentation provide the coverage the code needs to gain others’ trust.

Those who can understand your code and view the tests performed will use it.

Code that has no documentation will never be used, resulting in wasted time and wasted code.

2. Reliable documentation is key to scaling a data team.

Unrecorded experiential knowledge and outdated documentation are a recipe for an ineffective team, where individuals ask the same questions repeatedly and waste each other’s time.

Documented data takes the knowledge out of people’s heads and arranges it so everyone has access.

Whether you’re on a small team or a team of 40, you’ll all be on the same page.

3. Automation is key.

One of the reasons data documentation so often fails is that it relies on someone manually entering explanations and notes, possibly one of the most boring things in the world to do.

dbt automates most documentation by making inferences about your codebase, which removes the chore of manually maintaining a Google Sheet or Confluence article.

Some manual input is still needed, for example when a column’s name alone can’t convey everything someone needs to know about it.

If you’ve ever looked in Snowflake explorer and not understood what a column means, you know why manual documentation is still necessary on some level.
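
As a rough illustration of that manual layer, here is a minimal sketch of a dbt properties file that attaches descriptions to a model and its columns. The file path, the model name `customers`, and the column names are hypothetical, chosen only for this example; `description` and `tests` are standard dbt properties, though the exact layout of your project may differ.

```yaml
# models/schema.yml (hypothetical path and names, for illustration only)
version: 2

models:
  - name: customers
    description: "One row per customer, built from raw orders and payments data."
    columns:
      - name: customer_id
        # A short note that the column name alone cannot convey
        description: "Primary key for this model."
        tests:
          - unique
          - not_null
      - name: first_order_date
        description: "Date of the customer's first completed order. Null if they have never ordered."
```

Running `dbt docs generate` followed by `dbt docs serve` builds and serves the documentation site, where hand-written descriptions like these appear alongside the information dbt infers on its own. Because the file lives in the same repository as the models, a description can be updated in the same change as the transformation it describes.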

4. The documentation process shouldn’t be a game of catch-up.

Imagine an inchworm making its way along a leaf: just as its back end catches up, its front end moves forward a little more. Both parts of the body are moving, but the tail is always one step behind.

Documentation should always step in tandem with the work; it must be a part of the same workflow as the transformations it seeks to describe.

That can’t happen if the two are kept separate; the documentation will always lag behind.

Learn more about dbt’s data documentation functionality in the, er, dbt docs (we’re big on documentation around here).