class: title, center, middle

# dbt Project Design

---

# So, I have raw data.

.grey[_What do I do with it?_]

???

[Teacher Note]: You can read these out or pose these questions to the class - read the audience. If they don't seem interactive but you want some interactivity, you can also pose these questions to fellow teachers.

This flows best by referring to what happened in previous sessions before asking the questions on the slides: "In previous sessions we saw that we have raw data in the warehouse."

--

Model it!

--

.grey[_How?_]

--

With dbt!

--

.grey[_Cool._]

--

???

[Teacher Note]: We need to know what we mean by "model it". It's a broad term!

---
# dbt Project Design | Focus

- Understand the trade-offs of various model designs
- Name the three main interfaces for a dbt project
- Build intuition for naming conventions for database objects
- Build intuition for how to use dbt and BI tools together

--- # "Model it!" actually means: **1. There is a set of final data artifacts I want to share with stakeholders.** - "data model design," "data warehouse design," "warehousing" - It's a careful balance between several trade-offs - Could fill a book (or several) on this topic. It's a fascinating topic that we'll cover _very_ briefly ??? [Teacher Note]: There's two main things we could mean when we say "model it": The first set of bullet points are saying: - This is also known as these three terms - The tradeoffs will be specific to how our team and org functions - And there's numerous methodologies and extensive conversation to be had on this topic, but we're not going to dive too deeply in to it. -- **2. I need a reliable, scalable, intuitive process for generating those final artifacts.** - With dbt! - The _real_ interest of this presentation --- class: subtitle, center, middle # 1. Final data artifacts ### What is the table or view I'm trying to build? ??? [Teacher Note]: This is referring to the first set of bullet points on the previous slide. --- ## What is the table or view I'm trying to build? Our view is that there is no one right answer, but there are good heuristics. As modern data practitioners, we are inheritors of a decades-long history. It is our task to determine for the present day which concepts are still relevant. -- Let's break down the process... --- # It's 1998. You've got mail. π¬ ``` Dear Analyst, I would like to know how many widgets we sold in 1997 (last year). Just widgets in our TV category, that is. Oh, and if you could slice that number by the brand name of the widget, and dice it by the country where we sold them, that'd be rad. Thanks! Tristan SVP Marketing Analytics ``` --- # The desired output: | Brand | Country | Units_Sold | |-------|---------|------------| | GE | AU | 120 | | GE | UK | 180 | | GE | US | 300 | | LG | AU | 75 | | LG | UK | 100 | | LG | US | 20 | ??? [Teacher Note]: This is our desired output - You can see that we've got a table with one row per brand and country combination, with an aggregate of the units sold. -- What is the query you (the analyst) will write to get these results? ??? [Teacher note:] You should actually pause here for a few seconds to allow everyone to think about this! -- This depends on your models! --- ### Before: snowflake schema (ca. 2000s) ```sql SELECT B.Brand, G.Country, SUM(F.Units_Sold) FROM Fact_Sales F INNER JOIN Dim_Date D ON F.Date_Id = D.Id INNER JOIN Dim_Store S ON F.Store_Id = S.Id INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id INNER JOIN Dim_Product P ON F.Product_Id = P.Id INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id WHERE D.Year = 1997 AND C.Product_Category = 'tv' GROUP BY B.Brand, G.Country ``` ??? [Teacher Note]: In 1998, it would have been typical to see a query like this, written on an underlying database which uses a snowflake or star schema. We have a central fact table, which we're then joining out to dimensions to get the information we need to answer Tristan's question. --- ### Before: snowflake schema (ca. 2000s)
???

[Teacher Note]: This is what that schema looks like - again, you can see we have a central fact table, and in order to get information like country or brand, we actually need to join through other tables in between to get there.

---

## Related data modeling practices:

- Kimball
- Inmon
- Data Vault

???

[Teacher Note]: A good way to transition to this slide: "You might have heard about related data modeling practices such as..."

---

### Today: denormalized model

```sql
select
    brand,
    country,
    sum(units_sold) as total
from analytics.fct_sales
where date_trunc('year', sale_date) = '1997-01-01'
  and product_category = 'tv'
group by 1, 2
```

???

[Teacher Note]: Our query today would look something like this. There's a lot less going on here - notably:

- no joins
- using a `date_trunc` instead

Why no joins? It's because we're working with denormalized data, meaning that all of the relevant information is in the same model. We'll get to the `date_trunc` explanation later on.

---

### Today: denormalized model

`fct_sales`: one record per sale to a customer

.denser-text[
| sale_id | sold_at    | units_sold | store_id | store_number | state_province | country | ean_code      | product_name | brand_name  | product_category_name |
|---------|------------|------------|----------|--------------|----------------|---------|---------------|--------------|-------------|-----------------------|
| 1       | 1997-01-01 | 1          | 1234     | 12           | PA             | US      | 901234-123457 | Widget       | Widgetastic | tv                    |
| 2       | 1997-01-01 | 2          | 1234     | 12           | MA             | US      | 901234-123457 | Widget       | Widgetastic | tv                    |
| 3       | 1997-01-01 | 1          | 1234     | 12           | NSW            | AU      | 901234-123457 | Widget       | Widgetastic | tv                    |
]

???

[Teacher Note]: This is what the underlying data looks like for our denormalized model. You can see that important data such as brand, country, units_sold, the sale date, and product category are all located in this model.

---

### Today: denormalized model (2020)
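
As a minimal sketch of what that looks like in dbt (staging model and column names here are illustrative, not the actual course project), the final model mostly assembles already-cleaned pieces with dbt's `ref()`:

```sql
-- fct_sales.sql (hypothetical): join staged models into one wide, denormalized table
with sales as (
    select * from {{ ref('stg_sales') }}
),

stores as (
    select * from {{ ref('stg_stores') }}
),

products as (
    select * from {{ ref('stg_products') }}
)

select
    sales.sale_id,
    sales.sold_at,
    sales.units_sold,
    stores.country,
    products.brand_name,
    products.product_category_name
from sales
left join stores on sales.store_id = stores.store_id
left join products on sales.product_id = products.product_id
```
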
???

[Teacher Note]: This is the underlying flow of our data. You can see that we still have all of the same underlying tables, but instead of joining all of it at once, we've modularized it into various flows which finally come together to give us a table of all sales. We'll end up querying from this to get Tristan's answer via filters and aggregations.

---

class: middle

## Sidebar

Should that final "roll up" query live in dbt?

(We're going to come back to this discussion!)

???

[Teacher Note]: You can position this by saying something like: "But before we continue, everyone should think about that final rollup query with our filters and aggregations - should this live in dbt? Think about it - we'll come back to this later!"

---

### Before: snowflake schema (ca. 2000s)

```sql
SELECT
    B.Brand,
    G.Country,
    SUM(F.Units_Sold)
FROM Fact_Sales F
INNER JOIN Dim_Date D ON F.Date_Id = D.Id
INNER JOIN Dim_Store S ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P ON F.Product_Id = P.Id
INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id
INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
WHERE D.Year = 1997
  AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
```

???

[Teacher Note]: This and the next slide recap the two queries to set up why we're able to design the way we do today. You can position this like: "So we have our query from 1998, and we have our query from today."

---

### Today: denormalized model

```sql
select
    brand,
    country,
    sum(units_sold) as total
from analytics.fct_sales
where date_trunc('year', sale_date) = '1997-01-01'
  and product_category = 'tv'
group by 1, 2
```

---

# What has changed?

???

[Teacher Note]: You can pose this question like: "What's changed? Why didn't we just denormalize in 1998 or use date_trunc instead of having dedicated date tables?"

--

Some design decisions of the past were driven by technology:

- Storage was expensive → don't duplicate data
- Joins were cheap → join at report-time
- SQL functions were expensive → don't `date_trunc`, create a table that has already done it
- Updating data was expensive → design models so that changes are isolated to one table
- Data modeling was a different skill set from analysis → project-plan the whole thing!

???

[Teacher Note]: Namely, in the way that row-oriented and column-oriented databases store and process data. Because we use columnar storage for our analytics processing today, it allows us to normalize or denormalize as we'd like, comfortably use SQL functions, and start closing the gap between modeling and analysis.

---

# But what has stayed the same?

--

- A "final" model should represent a _business_ process or entity.

--

**Example 1:** Let's say you change payment providers → to your finance team, these are still a "payment". It might be useful to track whether it was processed via `stripe` or `braintree` (i.e. include a `payments_provider` column), but ultimately, the `payments` table should include both in a standardized form.

---

# But what has stayed the same?

- A "final" model should represent a _business_ process or entity.
- Each model should have a clear "grain" → "one record represents a ..."

--

**Example 2:** Let's say you're Rideshare Inc. Every time someone splits a trip, the `trips` table has three entries: one parent trip, and two child trips.

| trip_id | parent_id | .. |
|---------|-----------|----|
| 1       | NULL      | .. |
| 2       | 1         | .. |
| 3       | 1         | .. |
| 4       | NULL      | .. |

How many trips occurred?
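
--

A quick sketch of why the grain matters here (illustrative SQL; each definition of "a trip" gives a different answer):

```sql
-- one "trip" = any record in the table
select count(*) from trips;                             -- 4

-- one "trip" = a record with no parent (ignore the split legs)
select count(*) from trips where parent_id is null;     -- 2

-- one "trip" = each record with no parent, plus each split counted as one more trip
select
    count(case when parent_id is null then 1 end)
    + count(distinct parent_id) as trip_count
from trips;                                             -- 3
```
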
???

[Teacher Note]: We could say that:

- two trips occurred, if we're only counting trips without a parent trip
- three trips occurred, if we're counting those without a parent trip as a trip, and each split trip as one trip
- or four trips occurred, if we're counting each trip_id

--

Well, what's a trip?

???

[Teacher Note]: In this example, we had to explain our trips by saying: "each record represents a trip, *but* ...".

---

# But what has stayed the same?

- A "final" model should represent a _business_ process or entity.
- Each model should have a clear "grain" → "one record represents a ..."
- Naming is important. In particular, consistency is _extremely_ valuable

--

- Consider the final use case when designing your final artifacts

???

[Teacher Note]: We can do this by asking questions like:

- Who will need to use our final artifacts?
- Which questions do we need to answer related to these data sets?
- How well will the tools consuming our data products handle how we propose to model it?

---

### OK, but _really_ what is the final artifact that I should build?

Should I denormalize everything into one wide table? Should I use Kimball design patterns?

--

The answer is: it depends!

- What's your BI tool? Does it support joins in the BI layer (e.g. Looker), or does it prefer flat datasets (e.g. Tableau)?
- How big is your data? How does data get updated? E.g. do you have millions of orders that don't update after they are complete, but product names that update without warning? (Maybe join those in the BI tool.)
- What data models are your stakeholders used to? E.g. do you work at an enterprise company that is used to working with Data Vault models?

--

And dbt lets you choose! It's all just SQL.

---

## So let's say you have an idea of what you want to build
Once you have an intuition of what you want to build, how do you do that in dbt?

---

class: subtitle, center, middle

# 2. The process is the product

### (and that's more important)

???

[Teacher Note]: This is where we'll talk about the process and how we'll create our artifacts in a scalable, reliable, and intuitive way.

---

class: subtitle, middle

_The natural state of the universe is chaos: entropy tends to increase in closed systems, and there's really nothing we can do about that. The same is true of your dbt project, unless action is taken to maintain order._

---

# What we really mean to say:

No matter what final artifacts you're building, be effortful in how you organize them.

These artifacts show up in three places:

1. The dbt DAG
2. The repo
3. The warehouse

These are how people at your organization will conceptualize, interact with, and benefit from data (your product), more than any one table or dashboard.

---

# Interface #1: The dbt DAG

"Where are those sales numbers coming from?"

→ _Which data sources, processes, and assumptions do our sales calculations depend on?_

The DAG tells the story:
Source → Staging → Dimensions/Facts

???

[Teacher Note]: You can see here that our sales model combines order and payment information from jaffle_shop and stripe, cleans the data, and finally comes together in fct_sales.

---

# Interface #1: The dbt DAG

For your final artifacts, consider which sources will power this project:
(More on this later)

???

[Teacher Note]: To transition, you can word this slide like: "In order to plan out our flow of data, let's whiteboard our fct_sales example. We'll start by considering the sources that power our end result. The data sources we need are orders, products, customers, and stores."

---

# Interface #1: The dbt DAG

Then add staging models: "The shape you wish your data came in"
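
A minimal sketch of what one of those staging models might look like (source, table, and column names here are illustrative assumptions, not the course project):

```sql
-- stg_orders.sql (hypothetical): rename, cast, and lightly clean the raw source
with source as (
    select * from {{ source('shop', 'orders') }}
)

select
    id as order_id,
    store_id,
    cast(ordered_at as date) as ordered_at,
    nullif(status, '') as order_status
from source
```
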
???

[Teacher Note]: Further explanation on the whiteboarding: "Once we have our sources, we create our first layer of models, which cleans and standardizes our raw data so that it's easier to work with downstream. We'll learn later on about what fills in our leftover question mark."

---

# Interface #1: The dbt DAG

### Model prefixes

.dense-text[
.left-column-33[
### stg_
- 1:1 with raw data tables
- **Source**-conformed
- Renaming, type casting, coalescing + nullif'ing
]
.right-column-66[
.left-column[
### dim_
- "Nouns"
- Core **business** entities
- Customers, products, apartments, providers, employees, ...
- Some "slowly-changing" attributes (e.g. address, email)
]
.right-column[
### fct_
- "Verbs"
- Core **business** processes
- Orders, cases, plays, listings, comments, ...
- Often built on top of an immutable event stream
]]]

---
source → [base] → stg → [intermediate, lookup, utils] → {fct + dim}
???

[Teacher Note]: This explanation plays on the leftover question mark from whiteboarding earlier, which isn't covered explicitly in any other slide: "Here's an example of what a real DAG might end up looking like. You can see that we have:

- Our sources (the green nodes)
- Our staging models
- Some intermediate steps happening: remember our leftover question mark when we were whiteboarding our DAG? We can identify which processes are our "stepping stones" to our end result and modularize those as intermediate models. You don't need these, but they do make the overall flow more understandable and allow you to reuse the outputs of those processes!
- and our final dimensions and facts"

[Demo]: Draw circles/boxes around the different "model groups" (sources, staging, intermediate, lookups, final fact/dim). Comment on how they can be grouped together in all three interfaces.

---

# Interface #1: The dbt DAG

Make sure that for each model, you ask:

- What does one record represent? Let's document this!
- What assumptions about our data are we making? We should test those!
- Are you using naming patterns? You should!

???

[Teacher Note]: You can position this by saying something like: "Some good practices to go along with building your DAG are..."

---

# Interface #2: The repo

.left-column[
- **Within files:** adhere to a style guide
- **Between files:** better file organization = easier onboarding + better collaboration
]

.right-column[
.dense-text[
```
├── README.md
├── dbt_project.yml
└── models
    ├── marts
    │   └── core
    │       ├── core.yml
    │       ├── dim_product.sql
    │       ├── dim_store.sql
    │       └── fct_sales.sql
    └── staging
        ├── postgres
        └── shopify
            ├── shopify.yml
            ├── stg_shopify_customers.sql
            ├── stg_shopify_order_items.sql
            ├── stg_shopify_orders.sql
            └── stg_shopify_products.sql
```
]
]

???

[Teacher Note]: Our developers are doing most of their work within the files of the repository. We can keep this organized by separating our modeling layers into dedicated folders and having naming conventions for our files.

Point out:

- the marts / staging separation
- the subfolders within each
- the configuration files that describe the contents of the folder they're in
- the naming conventions of the models

---

# Interface #3: The Warehouse

.left-column-66[
- **Nomenclature.** How do we name our views + tables so they are consistent, easy to find, and make their contents easy to infer?
- How do we organize our output objects? Same schema or different?
- What about the intermediate stuff we _don't_ want to show?
- Who is your end user? Is it a person writing SQL, or a BI tool pointed to a production schema?
]

.right-column-33[
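As a sketch (schema names here are illustrative), consistent naming might leave the warehouse looking like:

```
analytics.dim_product
analytics.dim_store
analytics.fct_sales
analytics_staging.stg_shopify_customers
analytics_staging.stg_shopify_orders
analytics_staging.stg_shopify_products
```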
]

???

[Teacher Note]: Point out how:

- naming has helped organize the models into easy-to-find groups
- the naming helps infer the contents of the objects
- depending on the usage of these objects, we could further customize by configuring them to build into schemas by use case, and by setting permissions on those schemas based on who or what should use the objects within them

---

class: subtitle, center, middle

# The way we roll

---
.center[[Read it!](https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355)]

???

[Teacher Note]: You can position this like: "Here's an article we wrote on how we organize and structure our dbt projects. It's worth a read if you want to get some ideas for creating your own conventions!"

---

class: subtitle

# Checkpoint:

- What are some helpful actions to take in the staging layer? Why?

--

## Questions?

---

class: subtitle, center, middle

# dbt + BI

## What BI tools do you use?

???

[Teacher Note]: Now we're going to talk about dbt artifacts and business intelligence tools. Our end products typically get consumed by these, so it's important to understand the practices surrounding this part of the stack.

---

class: center

# What BI tools do you use?
???

[Teacher Note]: There are a lot of BI tools at our disposal - tools such as Sisense, Metabase, Redash, Mode, Looker, Tableau, R, Chartio, etc.

*Interactivity*: have the class shout out in the Slack channel which of these tools they've used!

---

# Which part of this should be in dbt versus the BI tool?
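
Recall the final "roll up" query from earlier - the filters and aggregations we run on top of `fct_sales`:

```sql
select
    brand,
    country,
    sum(units_sold) as total
from analytics.fct_sales
where date_trunc('year', sale_date) = '1997-01-01'
  and product_category = 'tv'
group by 1, 2
```
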
???

[Teacher Note]: Throwback to our question from earlier.

---

# Which part of this should be in dbt versus the BI tool?

For most orgs, the final rollup query should be in the BI tool → it allows for "drill down" functionality / further exploration.

--

What if?

- This is a KPI your company reports on regularly?

--

Maybe you want it in a dbt model as well.

???

[Teacher Note]: There's nothing wrong with moving important calculations into dbt, but you might want to create a prefix for them to signal to others that they are metrics aggregated on top of the main models.

--

- You're building a prototype model or field?

--

Potentially build it in the BI tool first, then "promote" it.

---

# What should you use your BI tool for?

???

[Teacher Note]: In order to put more emphasis on why this is important, you can position this like: "Queries and transformations can move around between dbt and the BI tool, so how do we try to stay effortful in what we decide to put where?"

---

## The things they've always been good at!

.left-column-33[
- User Interface
  - Point-and-click
  - Self-service filters, drill-downs
  - Accessible to users who may be less technical
]
.right-column-66[
.left-column[
- Viz!
  - Out-of-the-box
  - Color palettes ("my brand!")
  - Frameworks/libraries
]
.right-column[
- Nice-to-haves
  - Dynamic user input
  - Scheduling
  - Alerting
  - PDF'ing
  - CSV'ing
]
]

???

[Teacher Note]: Dynamic user input: allowing your users to dynamically change the inputs of a query. The BI tool would need to have the capabilities for this, but it allows you to set up your own "choices" that the user can pick from, and to template those choices into the query. For example, a query joins a customers table to ___, and the user chooses what that is.

---

## The things they're less good at

- Multilayered data transformation with complex ordered dependencies
- Enforcing a single source of truth
- Defining, documenting, and centralizing business logic
- Maintaining separate data environments
- Testing
- Version control
- Running a lot of non-performant SQL concurrently
- Managing user access to sensitive data

---

# What goes where?

.left-column[
## dbt layer
- Critical business logic
- Best agreed-on version of a model
- Complex SQL
- No wasted code! Everything is multipurpose*
- Anything worth version control
]
.right-column[
## BI layer
- Aggregations, calculated on the fly
- Joins qualified by user input
- Ad hoc queries and proto-models
- Select-star SQL that feeds R, Python, JS (custom viz)
- Nothing you can't take with you
]

???

[Teacher Note]: You can think of dbt as our assembly line, which produces expected outputs every time. You can think of the BI layer as the place where we take the items produced by our assembly line and customize them to meet our stakeholders' needs.

---

class: subtitle

# Checkpoint:

- What are the three interfaces into your dbt project?

???

[Teacher Note]: DAG, repo, and warehouse

--

- What are some key distinctions between the dbt layer and the BI layer that you want to keep in mind?

???

[Teacher Note]:

- dbt should have our solidly defined business logic and metrics
- The BI layer should be kept thin and flexible

--

- What naming conventions is your team already using?

???

[Teacher Note]: Use your intuition on the interactivity of the group. If they're less outspoken, ask them to shout it out in the Slack channel. Otherwise, you can pause for a moment here to allow others to say what conventions they use!

--

## Questions?

---

class: subtitle

# Knowledge check

When creating a dbt project:

- First decide on the final artifact, and then sketch out the process for modeling it.
- Use the three interfaces to organize your models.
- Stage your raw source data before transforming it.
- Every model should have a primary key.
- Decide as an organization how to keep your BI layer thin and flexible.

---

class: subtitle

# Zoom Out
- Distributed dbt Learn norms
- Who is an analytics engineer?
- dbt Fundamentals
- Working Group 1
- dbt Project Design

--
- Working Group 2
- Testing
- Sources
- Docs
- Deployment
- Working Group 3

---

class: subtitle, middle, center

# Appendix: Words to the wise

### from one data modeler to another

???

This is a very special section, for anyone in the room who feels like the presentation thus far has been too _descriptive_ ("what's hard about modeling?") and not _prescriptive_ enough ("ok great, now _what should I do?_").

---

1. **Nomenclature** is _hard_, and it's worth your time to be consistent. It doesn't matter if you prefer `fct_` or `fact_`; it's important that you choose one. (What's not hard? Every model/object name is snake case + case insensitive.)

---

1. .grey[**Nomenclature.**]
2. **Public vs. Private.** Models prefixed with a single underscore represent external/usable assets; a double-underscore prefix marks internal/intermediate assets.

---

1. .grey[**Nomenclature.**]
2. .grey[**Public vs. Private.**]
3. **Modular model files:** Individual models should max out at a half-dozen CTEs / a few hundred lines of SQL. The CTEs should be named such that they narrate the model's transformation. If one model file is getting too long, split it up into multiple ephemeral models.

---

1. .grey[**Nomenclature.**]
2. .grey[**Public vs. Private.**]
3. .grey[**Modular model files.**]
4. **Macros in moderation:** Model code should seek to be _both_ [DRY (Don't Repeat Yourself)](https://docs.getdbt.com/terms/dry) _and_ easily readable. Use Jinja + macros strongly and sparingly.

---

1. .grey[**Nomenclature.**]
2. .grey[**Public vs. Private.**]
3. .grey[**Modular model files.**]
4. .grey[**Macros in moderation.**]
5. **Reusable assets:** Final models (facts + dimensions) should be multipotent, such that each can power an entire genre of potential queries. Your end user or BI tool should be able to answer 80% of its most common questions by querying **one (1)** table. Of the remaining 20%, no regularly running query should require more than two join "steps." As soon as an ad hoc query is run often, it should be refactored and modeled accordingly.

---

1. .grey[**Nomenclature.**]
2. .grey[**Public vs. Private.**]
3. .grey[**Modular model files.**]
4. .grey[**Macros in moderation.**]
5. .grey[**Reusable assets.**]
6. **Documentation, documentation, documentation:** Your data models are your product; the BI/query layer is your user experience. Having an untested, undocumented model is worse than having no model at all.
_Corollary:_ Uncommented, unvetted, unreadable SQL (however clever it may be) is _much_ worse than having no code at all.

---

1. .grey[**Nomenclature.**]
2. .grey[**Public vs. Private.**]
3. .grey[**Modular model files.**]
4. .grey[**Macros in moderation.**]
5. .grey[**Reusable assets.**]
6. .grey[**Documentation, documentation, documentation.**]
7. **Be your own best advocate:** Your dimensional modeling is your colleague's feature engineering is your stakeholder's reconciliation of business definitions. Analysts, BI users, business partners, and data scientists all can and should be beneficiaries of the foundational work you're doing.
_Corollary:_ If they expend effort defining the same concepts on their own, it's not only repeated work, but fertile ground for entropy at your organization.

???

I think about join "steps" as degrees of Kevin Bacon.

---

1. .grey[**Nomenclature.**]
2. .grey[**Public vs. Private.**]
3. .grey[**Modular model files.**]
4. .grey[**Macros in moderation.**]
5. .grey[**Reusable assets.**]
6. .grey[**Documentation, documentation, documentation.**]
7. .grey[**Be your own best advocate.**]

# YMMV!

Take it in good health, and with many grains of salt.