class: title, center, middle

# dbt Project Design

---

# So, I have raw data.

.grey[_What do I do with it?_]

???
[Teacher Note]: You can read these out or pose these questions to the class - read the audience. If they don't seem interactive but you want some interactivity, you can also pose these questions to fellow teachers.

This flows best by referring to what happened in previous sessions before asking the questions on the slides: "In previous sessions we saw that we have raw data in the data platform."

--

Model it!

--

.grey[_How?_]

--

With dbt!

--

.grey[_Cool._]

--

???
[Teacher Note]: We need to know what we mean by "model it". It's a broad term!

---
# Project Design | Focus

- Name the three main interfaces for a dbt project
- Build intuition for naming conventions for database objects
- Build intuition for how to use dbt and BI tools together
- Refactor code by creating staging models
- Create new directories & subdirectories
- Change materializations in a project
---

# "Model it!" actually means:

**1. There is a set of final data artifacts I want to share with stakeholders.**

- "data model design," "data platform design," "warehousing"
- Tradeoffs will be team & org specific

--

**2. I need a reliable, scalable, intuitive process for generating those final artifacts.**

- With dbt!

---

class: subtitle, center, middle

# 1. Final data artifacts

### What is the table or view I'm trying to build?

???
[Teacher Note]: This is referring to the first set of bullet points on the previous slide.

---

## What is the table or view I'm trying to build?

Our view is that there is no one right answer, but there are good heuristics.

As modern data practitioners, we are inheritors of a decades-long history. It is our task to determine which of those concepts are still relevant today.

--

Let's break down the process...

---

# Modeling New Payment Data

```
Hey there,

We just got our payment data added to the data platform. If you could put
together a report that shows me how much we have made from each order, that
would be great.

I also remember you making a model representative of each customer; could you
go ahead and add some info about the lifetime value of each customer?

Thanks!
Mr. Schmo
```

---

# The New Data
---

# The desired output:

| customer_id | first_name | last_name | first_order_date | most_recent_order_date | number_of_orders | lifetime_value |
|-------------|------------|-----------|------------------|------------------------|------------------|----------------|
| 1 | Michael | P. | 2018-01-01 | 2018-02-10 | 2 | 33 |
| 2 | Shawn | M. | 2018-01-11 | 2018-01-11 | 1 | 23 |
| 3 | Kathleen | P. | 2018-01-02 | 2018-03-11 | 3 | 65 |
| 4 | Jimmy | C. | NULL | NULL | 0 | 0 |
| 5 | Katherine | R. | NULL | NULL | 0 | 0 |

???
[Teacher Note]: This is our desired output - you can see that we've got a table with one row per customer, with their order history and an aggregated lifetime value.

--

What is the query you (the analyst) will write to get these results?

???
[Teacher note]: You should actually pause here for a few seconds to allow everyone to think about this!

--

This depends on your models!

---

# Pre-dbt
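Before dbt, the answer is usually a single monolithic query. A sketch of what that might look like, assuming hypothetical raw tables `jaffle_shop.customers`, `jaffle_shop.orders`, and `stripe.payment` (all table and column names here are illustrative):

```sql
-- Pre-dbt: one big query, rebuilt from scratch by every analyst who needs it
with customer_orders as (

    select
        o.user_id as customer_id,
        min(o.order_date) as first_order_date,
        max(o.order_date) as most_recent_order_date,
        count(distinct o.id) as number_of_orders,  -- distinct: an order can have many payments
        sum(p.amount) as total_amount
    from jaffle_shop.orders as o
    left join stripe.payment as p
        on p.orderid = o.id
    group by 1

)

select
    c.id as customer_id,
    c.first_name,
    c.last_name,
    co.first_order_date,
    co.most_recent_order_date,
    coalesce(co.number_of_orders, 0) as number_of_orders,
    coalesce(co.total_amount, 0) as lifetime_value
from jaffle_shop.customers as c
left join customer_orders as co
    on co.customer_id = c.id
```

Every analyst who needs these numbers ends up re-writing (and re-debugging) some variation of this query, which is exactly the pain point that modeling addresses.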
---

class: subtitle, center, middle

# 2. The process is the product

### (and that's more important)

???
[Teacher Note]: This is where we'll talk about the process and how we'll create our artifacts in a scalable, reliable, and intuitive way.

---

class: subtitle, middle

_The natural state of the universe is chaos: entropy tends to increase in closed systems, and there's really nothing we can do about that. So too with your dbt project: it will drift toward disorder unless action is taken to maintain it._

---

# What we really mean to say:

No matter what final artifacts you're building, be effortful in how you organize them.

These artifacts show up in three places:

1. The dbt DAG
2. The repo
3. The data platform

These are how people at your organization will conceptualize, interact with, and benefit from data (your product), more than any one table or dashboard.

---

# Interface #1: The dbt DAG

"Where are those sales numbers coming from?"

→ _Which data sources, processes, and assumptions do our sales calculations depend on?_

The DAG tells the story:
Source → Staging → Dimensions/Facts
???
[Teacher Note]: You can see here that our sales model combines order and payment information from jaffle_shop and stripe, cleans the data, and finally comes together in fct_sales.

---

# Interface #1: The dbt DAG

For your final artifacts, consider which sources will power this project:
(More on this later)

???
[Teacher Note]: To transition, you can word this slide like: "In order to plan out our flow of data, let's whiteboard our fct_sales example. We'll start by considering the sources that power our end result. The data sources we need are orders, products, customers, and stores."

---

# Interface #1: The dbt DAG

Then add staging models: "The shape you wish your data came in"
???
[Teacher Note]: Further explanation on the whiteboarding: "Once we have our sources, we create our first layer of models which cleans and standardizes our raw data so that it's easier to work with downstream. We'll learn later on about what fills in our leftover question mark."

---

# Interface #1: The dbt DAG

### Model prefixes

.dense-text[
.left-column-33[
### stg_
- 1:1 with raw data tables
- **Source**-conformed
- Renaming, type casting, coalescing + nullif'ing
]
.right-column-66[
.left-column[
### dim_
- "Nouns"
- Core **business** entities
- Customers, products, apartments, providers, employees, ...
- Some "slowly-changing" attributes (e.g. address, email)
]
.right-column[
### fct_
- "Verbs"
- Core **business** processes
- Orders, cases, plays, listings, comments, ...
- Often built on top of an immutable event stream
]]]

---
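# Interface #1: The dbt DAG

A staging model is usually a thin, source-conformed view. A sketch of what one might look like, assuming a hypothetical raw table `raw.jaffle_shop.customers` (the table, model, and column names are illustrative; referencing raw tables via dbt's `source()` function is covered in a later session):

```sql
-- models/staging/jaffle_shop/stg_jaffle_shop_customers.sql
-- Light-touch cleanup only: rename, cast, standardize.
with source as (

    select * from raw.jaffle_shop.customers

),

renamed as (

    select
        id as customer_id,                         -- rename to an unambiguous key
        first_name,
        last_name,
        cast(created_at as date) as created_date   -- type cast
    from source

)

select * from renamed
```

No joins, no aggregation: staging gives every downstream model the same clean starting point.

---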
source → [base] → stg → [intermediate, lookup, utils] → {fct + dim}
???
[Teacher Note]: This explanation plays on the leftover question mark from whiteboarding earlier, which isn't covered explicitly in any other slide: "Here's an example of what a real DAG might end up looking like. You can see that we have:
- Our sources (the green nodes)
- Our staging models
- Some intermediate steps happening: Remember our leftover question mark when we were whiteboarding our DAG? We can identify which processes are the "stepping stones" to our end result and modularize those as intermediate models. You don't need these, but they make the overall flow more understandable and allow you to reuse the outputs of those processes!
- and our final dimensions and facts"

[Demo:] Draw circles/boxes around the different "model groups" (sources, staging, intermediate, lookups, final fact/dim). Comment on how they can be grouped together in all three interfaces.

---

# Interface #1: The dbt DAG

Make sure that for each model, you ask:

- What does one record represent? Let's document this!
- What assumptions about our data are we making? We should test those!
- Are we using consistent naming patterns? We should be!

???
[Teacher Note]: You can position this by saying something like: "Some good practices to go along with building your DAG are..."

---

# Interface #2: The repo

.left-column[
- **Within files:** adhere to a style guide
- **Between files:** better file organization = easier onboarding + better collaboration
]
.right-column[
.dense-text[
```
├── README.md
├── dbt_project.yml
└── models
    ├── marts
    │   └── core
    │       ├── core.yml
    │       ├── dim_product.sql
    │       ├── dim_store.sql
    │       └── fact_sales.sql
    └── staging
        ├── postgres
        └── shopify
            ├── shopify.yml
            ├── stg_shopify_customers.sql
            ├── stg_shopify_order_items.sql
            ├── stg_shopify_orders.sql
            └── stg_shopify_products.sql
```
]
]

???
[Teacher Note]: Our developers are doing most of their work within the files of the repository.
We can keep this organized by separating our modeling layers into dedicated folders and having naming conventions for our files.

Point out:
- marts / staging separation
- subfolders within each
- configuration files that describe the contents of the folder they're in
- naming conventions of the models

---

# Interface #3: The Data Platform

.left-column-66[
- **Nomenclature.** How do we name our views + tables so they are consistent, easy to find, and easy to infer contents from?
- How do we organize our output objects? Same schema or different?
- What about the intermediate stuff we _don't_ want to show?
- Who is your end user? Is it a person writing SQL, or a BI tool pointed at a production schema?
]
.right-column-33[
]

???
[Teacher Note]: Point out how:
- naming has helped organize the models into easy-to-find groups
- the naming helps infer the contents of the objects
- depending on the usage of these objects, we could further customize by configuring them to build into schemas by use case, and setting permissions on those schemas based on who or what should use the objects within them.

---

class: subtitle

# Hands-On (10 min)
**Organize:**
- Create a `staging` subfolder in the `models` directory
- Create a `marts` subfolder in the `models` directory
- Rename `customers.sql` to `dim_customers.sql`
- Move `dim_customers.sql` to the `marts` folder

**Refactor:**
- Build a staging model for each CTE in `dim_customers.sql` (2)

**Materialize:**
- `stg_` models as views

**Replace:**
- CTEs in `dim_customers.sql` with references to the `stg_` models
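For reference, the refactor might end up looking something like this sketch (the staging model names and columns are assumed from a typical jaffle_shop-style `dim_customers.sql`; your actual CTEs may differ):

```sql
-- models/staging/stg_customers.sql
-- Materialized as a view via an in-model config
-- (setting materialized: view for the staging folder in
-- dbt_project.yml accomplishes the same thing).
{{ config(materialized='view') }}

select
    id as customer_id,
    first_name,
    last_name
from raw.jaffle_shop.customers
```

```sql
-- models/marts/dim_customers.sql
-- The inline CTE bodies are replaced with refs to the staging models.
with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

customer_orders as (

    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from orders
    group by 1

)

select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders
    on customers.customer_id = customer_orders.customer_id
```

Using `ref()` is what lets dbt build the DAG: it infers that `dim_customers` depends on the staging models, and builds them in the right order.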
---

class: subtitle

# Checkpoint:

- What are some helpful actions to take in the staging layer? Why?

--

## Questions?

---

class: subtitle, center, middle

# dbt + BI

## What BI tools do you use?

???
[Teacher Note]: Now we're going to talk about dbt artifacts and business intelligence tools. Our end products typically get consumed by these, so it's important to understand the practices surrounding this part of the stack.

---

class: center

# What BI tools do you use?
???
[Teacher Note]: There are a lot of BI tools at our disposal - tools such as Sisense, Metabase, Redash, Mode, Looker, Tableau, R, Chartio, etc.

*Interactivity*: have the class shout out in the Slack channel which of these tools they've used!

---

# Which part of this should be in dbt versus the BI tool?
???
[Teacher Note]: Throw back to our question from earlier

---

# Which part of this should be in dbt versus the BI tool?

For most orgs, the final rollup query should be in the BI tool: it allows for "drill down" functionality / further exploration.

--

What if?

- This is a KPI your company reports on regularly?

--

Maybe you want it in a dbt model as well.

???
[Teacher Note]: There's nothing wrong with moving important calculations into dbt, but you might want to create a prefix for them to inform others that they are metrics aggregated from the major ideas.

--

- You're building a prototype model or field?

--

Potentially build it in the BI tool first, then "promote" it.

---

# What should you use your BI tool for?

???
[Teacher Note]: In order to put more emphasis on why this is important, you can position this like: "Queries and transformations can move around between dbt and the BI tool, so how do we stay effortful in what we decide to put where?"

---

## The things they've always been good at!

.left-column-33[
- User Interface
  - Point-and-click
  - Self-service filters, drill-downs
  - Accessible to users who may be less technical
]
.right-column-66[
.left-column[
- Viz!
  - Out-of-the-box
  - Color palettes ("my brand!")
  - Frameworks/libraries
]
.right-column[
- Nice-to-haves
  - Dynamic user input
  - Scheduling
  - Alerting
  - PDF'ing
  - CSV'ing
]
]

???
[Teacher Note]: Dynamic user input: allowing your users to dynamically change the inputs of a query. The BI tool would need to have the capabilities for this, but it allows you to set up "choices" the user can pick from, and lets you template those choices into the query. For example, a query joins a customers table to some other table, and the user chooses what that is.
---

## The things they're less good at

- Multilayered data transformation with complex ordered dependencies
- Enforcing a single source of truth
- Defining, documenting, centralizing business logic
- Maintaining separate data environments
- Testing
- Version control
- Running a lot of non-performant SQL concurrently
- Managing user access to sensitive data

---

# What goes where?

.left-column[
## dbt layer
- Critical business logic
- Best agreed-on version of a model
- Complex SQL
- No wasted code! Everything is multipurpose*
- Anything worth version control
]
.right-column[
## BI layer
- Aggregations, calculated on the fly
- Joins qualified by user input
- Ad hoc queries and proto-models
- Select-star SQL that feeds R, Python, JS (custom viz)
- Nothing you can't take with you
]

???
[Teacher Note:] You can think of dbt as our assembly line, which produces expected outputs every time. You can think of the BI layer as the place where we take the items produced by our assembly line and customize them to meet our stakeholders' needs.

---

class: subtitle

# Checkpoint:

- What are the three interfaces into your dbt project?

???
[Teacher Note]: DAG, repo, and data platform

--

- What are some key distinctions between the dbt layer and BI layer that you want to keep in mind?

???
[Teacher Note]:
- dbt should have our solidly defined business logic and metrics
- The BI layer should be kept thin and flexible

--

## Questions?

---

class: subtitle

# Knowledge check

When creating a dbt project:

- First decide on the final artifact, and then sketch out the process for modeling it.
- Use the three interfaces to organize your models.
- Stage your raw source data before transforming it.
- Every model should have a primary key.
- Decide as an organization how to keep your BI layer thin and flexible.
---

# Resources

* [How we structure our projects](https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview)
* [On the Importance of Naming](https://docs.getdbt.com/blog/on-the-importance-of-naming)
* [dbt_project.yml](https://docs.getdbt.com/reference/dbt_project.yml)

---

class: subtitle

# Zoom Out
- dbt Learn Norms
- Why dbt?
- Modeling
- Project Design

--

- Testing

--

- Sources
- Docs
- Deployment