class: title, center, middle

# dbt Project Design

---

# So, I have raw data.

.grey[_What do I do with it?_]

???
[Teacher Note]: You can read these out or pose these questions to the class - read the audience. If they don't seem interactive but you want some interactivity, you can also pose these questions to fellow teachers.

This flows best by referring to what happened in previous sessions before asking the questions on the slides: "In previous sessions we saw that we have raw data in the data platform."

--

Model it!

--

.grey[_How?_]

--

With dbt!

--

.grey[_Cool._]

--

???
[Teacher Note]: We need to know what we mean by "model it". It's a broad term!

---
# Project Design | Focus

- Name the three main interfaces for a dbt project
- Build intuition for naming conventions for database objects
- Build intuition for how to use dbt and BI tools together
- Refactor code by creating staging models
- Create new directories & subdirectories
- Change materializations in a project
---

# "Model it!" actually means:

**1. There is a set of final data artifacts I want to share with stakeholders.**

- "data model design," "data platform design," "warehousing"
- Tradeoffs will be team & org specific

--

**2. I need a reliable, scalable, intuitive process for generating those final artifacts.**

- With dbt!

---

class: subtitle, center, middle

# 1. Final data artifacts

### What is the table or view I'm trying to build?

???
[Teacher Note]: This is referring to the first set of bullet points on the previous slide.

---

## What is the table or view I'm trying to build?

Our view is that there is no one right answer, but there are good heuristics.

As modern data practitioners, we are inheritors of a decades-long history. It is our task to determine which of those concepts are still relevant today.

--

Let's break down the process...

---

# Modeling New Payment Data

```
Hey there,

We just got our payment data added to the data platform. If you could put
together a report that shows me how much we have made from each order, that
would be great.

I also remember you making a model representative of each customer; could you
go ahead and add some info about the lifetime value of each customer?

Thanks!
Mr. Schmo
```

---

# The New Data
---

# The desired output:

| customer_id | first_name | last_name | first_order_date | most_recent_order_date | number_of_orders | lifetime_value |
|-------------|------------|-----------|------------------|------------------------|------------------|----------------|
| 1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33 |
| 2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23 |
| 3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65 |
| 4 | Jimmy | C. | NULL | NULL | 0 | 0 |
| 5 | Katherine | R. | NULL | NULL | 0 | 0 |

???
[Teacher Note]: This is our desired output - you can see that we've got a table with one row per customer, with their order history and an aggregated lifetime value.

--

What is the query you (the analyst) will write to get these results?

???
[Teacher note]: You should actually pause here for a few seconds to allow everyone to think about this!

--

This depends on your models!

---

# Pre-dbt
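Before dbt, the answer is usually a single monolithic query. A sketch of what that might look like, assuming hypothetical raw tables `jaffle_shop.customers`, `jaffle_shop.orders`, and `stripe.payment` (all table and column names here are illustrative):

```sql
-- Pre-dbt: one big query, rebuilt from scratch by every analyst who needs it
with customer_orders as (

    select
        o.user_id as customer_id,
        min(o.order_date) as first_order_date,
        max(o.order_date) as most_recent_order_date,
        count(distinct o.id) as number_of_orders,  -- distinct: an order can have many payments
        sum(p.amount) as total_amount
    from jaffle_shop.orders as o
    left join stripe.payment as p
        on p.orderid = o.id
    group by 1

)

select
    c.id as customer_id,
    c.first_name,
    c.last_name,
    co.first_order_date,
    co.most_recent_order_date,
    coalesce(co.number_of_orders, 0) as number_of_orders,
    coalesce(co.total_amount, 0) as lifetime_value
from jaffle_shop.customers as c
left join customer_orders as co
    on co.customer_id = c.id
```

Every analyst who needs these numbers ends up re-writing (and re-debugging) some variation of this query, which is exactly the pain point that modeling addresses.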
---

class: subtitle, center, middle

# 2. The process is the product

### (and that's more important)

???
[Teacher Note]: This is where we'll talk about the process and how we'll create our artifacts in a scalable, reliable, and intuitive way.

---

class: subtitle, middle

_The natural state of the universe is chaos: entropy tends to increase in closed systems, and there's really nothing we can do about that. So too with your dbt project: it will drift toward disorder unless action is taken to maintain it._

---

# What we really mean to say:

No matter what final artifacts you're building, be effortful in how you organize them.

These artifacts show up in three places:

1. The dbt DAG
2. The repo
3. The data platform

These are how people at your organization will conceptualize, interact with, and benefit from data (your product), more than any one table or dashboard.

---

# Interface #1: The dbt DAG

"Where are those sales numbers coming from?"

→ _Which data sources, processes, and assumptions do our sales calculations depend on?_

The DAG tells the story:
Source → Staging → Dimensions/Facts
???
[Teacher Note]: You can see here that our sales model combines order and payment information from jaffle_shop and stripe, cleans the data, and finally comes together in fct_sales.

---

# Interface #1: The dbt DAG

For your final artifacts, consider which sources will power this project:
(More on this later)

???
[Teacher Note]: To transition, you can word this slide like: "In order to plan out our flow of data, let's whiteboard our fct_sales example. We'll start by considering the sources that power our end result. The data sources we need are orders, products, customers, and stores."

---

# Interface #1: The dbt DAG

Then add staging models: "The shape you wish your data came in"
???
[Teacher Note]: Further explanation on the whiteboarding: "Once we have our sources, we create our first layer of models which cleans and standardizes our raw data so that it's easier to work with downstream. We'll learn later on about what fills in our leftover question mark."

---

# Interface #1: The dbt DAG

### Model prefixes

.dense-text[
.left-column-33[
### stg_
- 1:1 with raw data tables
- **Source**-conformed
- Renaming, type casting, coalescing + nullif'ing
]
.right-column-66[
.left-column[
### dim_
- "Nouns"
- Core **business** entities
- Customers, products, apartments, providers, employees, ...
- Some "slowly-changing" attributes (e.g. address, email)
]
.right-column[
### fct_
- "Verbs"
- Core **business** processes
- Orders, cases, plays, listings, comments, ...
- Often built on top of an immutable event stream
]]]

---
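# Interface #1: The dbt DAG

A staging model is usually a thin, source-conformed view. A sketch of what one might look like, assuming a hypothetical raw table `raw.jaffle_shop.customers` (the table, model, and column names are illustrative; referencing raw tables via dbt's `source()` function is covered in a later session):

```sql
-- models/staging/jaffle_shop/stg_jaffle_shop_customers.sql
-- Light-touch cleanup only: rename, cast, standardize.
with source as (

    select * from raw.jaffle_shop.customers

),

renamed as (

    select
        id as customer_id,                         -- rename to an unambiguous key
        first_name,
        last_name,
        cast(created_at as date) as created_date   -- type cast
    from source

)

select * from renamed
```

No joins, no aggregation: staging gives every downstream model the same clean starting point.

---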
source → [base] → stg → [intermediate, lookup, utils] → {fct + dim}
???
[Teacher Note]: This explanation plays on the leftover question mark from whiteboarding earlier, which isn't covered explicitly in any other slide: "Here's an example of what a real DAG might end up looking like. You can see that we have:
- Our sources (the green nodes)
- Our staging models
- Some intermediate steps happening: Remember our leftover question mark when we were whiteboarding our DAG? We can identify which processes are the "stepping stones" to our end result and modularize those as intermediate models. You don't need these, but they make the overall flow more understandable and allow you to reuse the outputs of those processes!
- and our final dimensions and facts"

[Demo:] Draw circles/boxes around the different "model groups" (sources, staging, intermediate, lookups, final fact/dim). Comment on how they can be grouped together in all three interfaces.

---

# Interface #1: The dbt DAG

Make sure that for each model, you ask:

- What does one record represent? Let's document this!
- What assumptions about our data are we making? We should test those!
- Are we using consistent naming patterns? We should be!

???
[Teacher Note]: You can position this by saying something like: "Some good practices to go along with building your DAG are..."

---

# Interface #2: The repo

.left-column[
- **Within files:** adhere to a style guide
- **Between files:** better file organization = easier onboarding + better collaboration
]
.right-column[
.dense-text[
```
├── README.md
├── dbt_project.yml
└── models
    ├── marts
    │   └── core
    │       ├── core.yml
    │       ├── dim_product.sql
    │       ├── dim_store.sql
    │       └── fact_sales.sql
    └── staging
        ├── postgres
        └── shopify
            ├── shopify.yml
            ├── stg_shopify_customers.sql
            ├── stg_shopify_order_items.sql
            ├── stg_shopify_orders.sql
            └── stg_shopify_products.sql
```
]
]

???
[Teacher Note]: Our developers are doing most of their work within the files of the repository.
We can keep this organized by separating our modeling layers into dedicated folders and having naming conventions for our files.

Point out:
- marts / staging separation
- subfolders within each
- configuration files that describe the contents of the folder they're in
- naming conventions of the models

---

# Interface #3: The Data Platform

.left-column-66[
- **Nomenclature.** How do we name our views + tables so they are consistent, easy to find, and easy to infer contents from?
- How do we organize our output objects? Same schema or different?
- What about the intermediate stuff we _don't_ want to show?
- Who is your end user? Is it a person writing SQL, or a BI tool pointed at a production schema?
]
.right-column-33[
]

???
[Teacher Note]: Point out how:
- naming has helped organize the models into easy-to-find groups
- the naming helps infer the contents of the objects
- depending on the usage of these objects, we could further customize by configuring them to build into schemas by use case, and setting permissions on those schemas based on who or what should use the objects within them.

---

class: subtitle

# Hands-On (10 min)
**Organize:**
- Create a `staging` subfolder in the `models` directory
- Create a `marts` subfolder in the `models` directory
- Rename `customers.sql` to `dim_customers.sql`
- Move `dim_customers.sql` to the `marts` folder

**Refactor:**
- Build a staging model for each CTE in `dim_customers.sql` (2)

**Materialize:**
- `stg_` models as views

**Replace:**
- CTEs in `dim_customers.sql` with references to the `stg_` models
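For reference, the refactor might end up looking something like this sketch (the staging model names and columns are assumed from a typical jaffle_shop-style `dim_customers.sql`; your actual CTEs may differ):

```sql
-- models/staging/stg_customers.sql
-- Materialized as a view via an in-model config
-- (setting materialized: view for the staging folder in
-- dbt_project.yml accomplishes the same thing).
{{ config(materialized='view') }}

select
    id as customer_id,
    first_name,
    last_name
from raw.jaffle_shop.customers
```

```sql
-- models/marts/dim_customers.sql
-- The inline CTE bodies are replaced with refs to the staging models.
with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customer_orders as (

    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1

)

select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders
    on customers.customer_id = customer_orders.customer_id
```

Using `ref()` is what lets dbt build the DAG: it infers that `dim_customers` depends on the staging models, and builds them in the right order.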
---

class: subtitle

# Checkpoint:

- What are some helpful actions to take in the staging layer? Why?

--

## Questions?

---

class: subtitle, center, middle

# dbt + BI

## What BI tools do you use?

???
[Teacher Note]: Now we're going to talk about dbt artifacts and business intelligence tools. Our end products typically get consumed by these, so it's important to understand the practices surrounding this part of the stack.

---

class: center

# What BI tools do you use?
???
[Teacher Note]: There are a lot of BI tools at our disposal - tools such as Sisense, Metabase, Redash, Mode, Looker, Tableau, R, Chartio, etc.

*Interactivity*: have the class shout out in the Slack channel which of these tools they've used!

---

# Which part of this should be in dbt versus the BI tool?
???
[Teacher Note]: Throw back to our question from earlier

---

# Which part of this should be in dbt versus the BI tool?

For most orgs, the final rollup query should be in the BI tool: it allows for "drill down" functionality / further exploration.

--

What if?

- This is a KPI your company reports on regularly?

--

Maybe you want it in a dbt model as well.

???
[Teacher Note]: There's nothing wrong with moving important calculations into dbt, but you might want to create a prefix for them to inform others that they are metrics aggregated from the major ideas.

--

- You're building a prototype model or field?

--

Potentially build it in the BI tool first, then "promote" it.

---

# What should you use your BI tool for?

???
[Teacher Note]: In order to put more emphasis on why this is important, you can position this like: "Queries and transformations can move around between dbt and the BI tool, so how do we stay effortful in what we decide to put where?"

---

## The things they've always been good at!

.left-column-33[
- User Interface
  - Point-and-click
  - Self-service filters, drill-downs
  - Accessible to users who may be less technical
]
.right-column-66[
.left-column[
- Viz!
  - Out-of-the-box
  - Color palettes ("my brand!")
  - Frameworks/libraries
]
.right-column[
- Nice-to-haves
  - Dynamic user input
  - Scheduling
  - Alerting
  - PDF'ing
  - CSV'ing
]
]

???
[Teacher Note]: Dynamic user input: allowing your users to dynamically change the inputs of a query. The BI tool would need to have the capabilities for this, but it allows you to set up "choices" the user can pick from, and lets you template those choices into the query. For example, a query joins a customers table to some other table, and the user chooses what that is.
---

## The things they're less good at

- Multilayered data transformation with complex ordered dependencies
- Enforcing a single source of truth
- Defining, documenting, centralizing business logic
- Maintaining separate data environments
- Testing
- Version control
- Running a lot of non-performant SQL concurrently
- Managing user access to sensitive data

---

# What goes where?

.left-column[
## dbt layer
- Critical business logic
- Best agreed-on version of a model
- Complex SQL
- No wasted code! Everything is multipurpose*
- Anything worth version control
]
.right-column[
## BI layer
- Aggregations, calculated on the fly
- Joins qualified by user input
- Ad hoc queries and proto-models
- Select-star SQL that feeds R, Python, JS (custom viz)
- Nothing you can't take with you
]

???
[Teacher Note:] You can think of dbt as our assembly line, which produces expected outputs every time. You can think of the BI layer as the place where we take the items produced by our assembly line and customize them to meet our stakeholders' needs.

---

class: subtitle

# Checkpoint:

- What are the three interfaces into your dbt project?

???
[Teacher Note]: DAG, repo, and data platform

--

- What are some key distinctions between the dbt layer and BI layer that you want to keep in mind?

???
[Teacher Note]:
- dbt should have our solidly defined business logic and metrics
- The BI layer should be kept thin and flexible

--

## Questions?

---

class: subtitle

# Knowledge check

When creating a dbt project:

- First decide on the final artifact, and then sketch out the process for modeling it.
- Use the three interfaces to organize your models.
- Stage your raw source data before transforming it.
- Every model should have a primary key.
- Decide as an organization how to keep your BI layer thin and flexible.
---

# Resources

* [How we structure our projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview)
* [On the Importance of Naming](https://docs.getdbt.com/blog/on-the-importance-of-naming)
* [dbt_project.yml](https://docs.getdbt.com/reference/dbt_project.yml)

---

class: subtitle

# Zoom Out
- dbt Learn Norms
- Why dbt?
- Modeling
- Project Design

--

- Testing

--

- Sources
- Docs
- Deployment