class: title, center, middle

# Setting the scene

???
[Instructor note]

* Have the training dbt project and training Snowflake account open in another tab

---

# Scenario

I'm an analyst working at "The Jaffle Shop", where we use our own app for order management:

.center[
]

???
[Instructor note]
Say: We'll be working with sample data from the Jaffle Shop. Any guesses on what a jaffle is?

--

### My goal: Create a BI visual representing total orders per customer

--

### Why and how should I transform my data?

???
[Instructor note]
Hopefully answers like:
- clean up the data
- create new fields
- build trust / test it
- stop repeated work

---

# My project

* Transform to:
    * Create clarity (e.g. what is `id`?)
    * Build new fields (e.g. create `number_of_orders` by customer)
* Create a new table named `dim_customers` for the transformed data
* Connect the BI tool to the transformed data

---

# How to do this manually

Run a `create table as` statement in my data platform!

```sql
create table dbtlabs_jdoe.dim_customers as (
    ...
)
```

--

What if...

.dense-text[
* the schema doesn't exist yet?
* the schema does exist, and I need to recreate it?
* the table already exists?
* I want to parameterize the schema?
* this depends on another transformation?
]

---

# So, I have raw data.

.grey[_What do I do with it?_]

???
[Instructor Note]:
You can read these out or pose these questions to the class; read the audience. If they don't seem interactive but you want some interactivity, you can also pose these questions to fellow instructors. This flows best by referring to what happened in previous sessions before asking the questions on the slides: "In previous sessions we saw that we have raw data in the data platform."

--

Model it!

--

.grey[_How?_]

--

With dbt!

--

.grey[_Cool._]

--

???
[Instructor Note]:
We need to know what we mean by "model it". It's a broad term!

---

class: title, center, middle

# Data Modeling with dbt

---
# Data Modeling

### Focus

* Confirm dbt Cloud account, data platform connection, and project initialization
* Explain the connections between company goals/questions and transforming data
* Explain how dbt transforms data
* Explain models, materializations, and dependencies in the context of a dbt project
* Build intuition for naming conventions for database objects
---

# To use dbt you need:

--

1. A data platform

???
[Instructor note]
* We're using Snowflake; your org might be using BigQuery, Redshift, Databricks, or Spark
* Flip to the Snowflake console and show the data

--

2. A dbt project

--

    * created with the `initialize` button
    * a directory with a `dbt_project.yml` file + other `.sql` and `.yml` files

.center[
]

???
[Instructor note]
* Go to dbt Cloud and show the file explorer

---

# What is `.yml`?

- a human-friendly way to write dictionaries and arrays
- used for configuration files (e.g. tests, documentation, etc.)
- prevalent throughout dbt projects
- whitespace management can be difficult
- copy + paste is encouraged

???
[Instructor note]
* `version: 2` is required at the top of every properties `.yml` file to indicate the version of dbt's properties spec (not the version of YAML itself).
* If you're using dbt version 1.5 or higher, you do not need `version: 2` at the top.

---

# To use dbt you need:

1. A data platform
2. A dbt project
    * created with the `initialize` button
    * a directory with a `dbt_project.yml` file + other `.sql` and `.yml` files
    * a project contains **models**

---

## What are models?

???
[Instructor note]
* See if anyone wants to explain what a model is

--

* Models are `select` statements that transform data.
* Stored in the `models` directory, with a `.sql` extension
* There is a 1:1 relationship between a model and a data artifact (a table or view)

.center[
]

--
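For example, a minimal model is just a `select` statement saved as a `.sql` file. A sketch (the column names and the `raw.jaffle_shop.customers` location are assumptions, not taken from the training project):

```sql
-- models/customers.sql (illustrative sketch)
-- dbt wraps this select in the DDL it needs to build a view or table
select
    id as customer_id,
    first_name,
    last_name
from raw.jaffle_shop.customers
```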
demo
???
Demo Time:
- Jump over into the dbt Cloud Group Training project: https://cloud.getdbt.com/#/accounts/26712/projects/43345/develop/
- Look at the model `customers.sql`
- Show the SQL select statement.
- Hit 'Preview' to show that it outputs data

---

# To use dbt you need:

1. .grey[A data platform]
2. .grey[A dbt project]
3. A connection to your data platform

--

* contains credentials to connect to your data platform and a **target schema**

???
[Instructor note]
* Explain why this lives outside your project:
    * creds are sensitive and shouldn't be version controlled
    * target schemas should be different for each user, so shouldn't be hardcoded in a project
* Explain what a target schema is
    * The target schema is specific to the environment. In dev, each user has their own dev schema; in prod, there is a dedicated prod schema
* For CLI users: they might want more info on targets/profiles
    * CLI users will need a profile matching the `profile:` in the `dbt_project.yml` file

---

# To use dbt you need:

1. .grey[A data platform]
2. .grey[A dbt project]
3. .grey[A connection to your data platform]
4. A command

--

* an instruction, issued from the dbt IDE
* e.g. `dbt run`

???
[Instructor note]
* Run `dbt run`!
* Explain what dbt is doing (wraps the select statement in DDL and runs it)
* Show where to see the logs

Good knowledge check questions:
* How did dbt connect to the warehouse? A: connection / profile
* How did dbt know what schema to use? A: the target schema
* What happens if you rerun dbt? A: no downtime (this looks different on each warehouse)

---

# To use dbt you need:

1. A data platform
2. A dbt project
3. A connection to your data platform
4. A command

---

# What happens when you run the command: `dbt run`?

1. dbt connects to your **data platform** (via a **profile**/**connection**)
2. dbt parses your **dbt project**
3. dbt wraps your **models** in the appropriate DDL/DML (e.g. `create table as`)
4. dbt executes this code to build your models in your **target schema**

???
[Instructor note]
- Behind the scenes, tons more is happening: checking if the schema exists, etc.

---

.center[
]

---

class: subtitle

# Checkpoint:

- What are the four pieces you need to use dbt?

???
[Instructor note]
1. A data platform
2. A dbt project
3. A connection to your data platform
4. A command

--

## Good so far??

- Questions??

---

class: subtitle

# Hands-On (5 min)
* **Find:** the `customers.sql` file in your project
* **Build:** use the command that builds your models in your data platform (see below)
* **Find:** the `customers` view that was built in Snowflake
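If you're unsure which command that is, here's a sketch (in recent dbt versions, the `--select` flag narrows a run to specific models):

```bash
dbt run                      # build every model in your project
dbt run --select customers   # build only the customers model
```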
???
[Instructor note]
- If some learners didn't finish the set-up, now is the chance to finish that.
- Limit to 5 minutes. Max 10.

---

### You might be wondering:

--

**How can I see what's happening?**

--

* Check out the `target/` directory and the `logs/` directory
* **dbt Cloud:** Select the "Details" view within a run

--

**How does dbt know the right SQL to use?** (`create table as` vs. `create view as`)

--

* The way that dbt builds a model depends on the **materialization** you **configure** for a model.

--

**What's a materialization? And a configuration?**

---

# Materializations

--

* `materializations` are build strategies
    * i.e. the SQL that your `select` statement gets wrapped in
* Models can be materialized as `views` and `tables` (we'll learn more later!)
* You need to **configure** a model's materialization
* If you weren't using dbt, this would look like writing a lot of DDL in the Snowflake console

???
[Instructor note]
- There are 4 different materializations in dbt:
    - view (default)
    - table (transient by default on Snowflake)
    - ephemeral (like a view, but doesn't exist in the database)
    - incremental (append/merge/delete+insert records instead of recreating the table)

---

# Configurations

--

* `configurations` are model settings
* They can be set in your `dbt_project.yml` file, _and_ in your model file.
* Configurations are applied _hierarchically_: more specific settings take precedence.
* Example configurations:
    * `materialized: view`
    * `tags: ['nightly', 'hourly']`
    * `enabled: True`

--
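As a sketch of both levels (the project name `jaffle_shop` and the folder layout are assumptions, not taken from this project's files):

```yaml
# dbt_project.yml -- applies to every model under models/staging/
models:
  jaffle_shop:
    staging:
      +materialized: view
```

```sql
-- models/staging/stg_customers.sql -- a config block inside the model file
{{ config(materialized='table') }}

select * from raw.jaffle_shop.customers
```

Because configurations apply hierarchically, the in-file `config()` block is more specific and wins for that one model.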
demo
???
[Instructor note]
Demo Time:
- "Let's see configurations in action!"
- Show the `dbt_project.yml` materialization
- Run `dbt run`
- Show the staging model DDL stating `view`

Now let's do an example that highlights the hierarchical nature, showing that more specific configurations take precedence:
- Add a config block in a file with a different materialization (e.g. a staging file materialized as a table)
- Run `dbt run`
- Show the staging model DDL stating `table`

---

class: subtitle

# Hands-On (5 min)
* **Materialize:** all models as `table`, configured in one file (see the sketch below)
* **Build:** run the command that builds your models
* **Check:** verify that all models materialized correctly
* **Find:** the `customers` table that was built in Snowflake
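One way to do that first step, sketched (assumes your project is named `jaffle_shop` in `dbt_project.yml`):

```yaml
# dbt_project.yml
models:
  jaffle_shop:
    +materialized: table   # every model in the project now builds as a table
```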
---

# What happens when you run the command: `dbt run`?

1. dbt connects to your **data platform** (via a **profile**/**connection**)
2. dbt parses your **dbt project**
3. dbt wraps your **models** in DDL/DML based on the **materialization** you **configure** for a model (e.g. `create table as`)
4. dbt executes this SQL to build your models in your **target schema**

--

### You might be wondering: If I have multiple models, how does dbt know which order to run models in?

--

* Use the `ref` function to create dependencies

---

# Creating dependencies

* Use the `ref` function to build dependencies between models
    * `{{ ref('model_name') }}`
* The `ref` function _also_ compiles to the full table/view name, using the correct schema for your environment
* These dependencies form a **Directed Acyclic Graph (DAG)**
* dbt uses this DAG to understand the order to run models in
* This makes it easy to split long queries up into separate models

.center[
]

???
[Instructor note]
* Don't demo `ref` yet
* We'll come back to it after we create a staging model

---

# DAGs

## A real-life example:

.center[
]

---

# The dbt DAG

"Where are those sales numbers coming from?" → _Which data sources, processes, and assumptions do our sales calculations depend on?_

The DAG tells the story:
**Source** → **Staging** → **Dimensions/Facts**
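In dbt, that story is written with `ref`. A hypothetical sketch of the final step (the model and column names are illustrative, not the course's actual code):

```sql
-- models/marts/fct_sales.sql (hypothetical)
with orders as (
    select * from {{ ref('stg_jaffle_shop__orders') }}
),

payments as (
    select * from {{ ref('stg_stripe__payments') }}
)

select
    orders.order_id,
    orders.customer_id,
    payments.amount as sale_amount
from orders
left join payments
    on orders.order_id = payments.order_id
```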
???
[Instructor Note]:
You can see here that our sales model combines order and payment information from jaffle_shop and stripe, cleans the data, and finally comes together in fct_sales.

---

# The dbt DAG

For your final artifacts, consider which sources will power this project:
(More on this later)

???
[Instructor Note]:
To transition, you can word this slide like: "In order to plan out our flow of data, let's whiteboard our fct_sales example. We'll start by considering the sources that power our end result. The data sources we need are orders, products, customers, and stores."

---

# The dbt DAG

Then add staging models: "The shape you wish your data came in"
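"The shape you wish your data came in" usually means renaming and recasting. A hypothetical sketch (the raw location and column names are assumptions):

```sql
-- models/staging/stg_jaffle_shop__orders.sql (hypothetical)
select
    id as order_id,                        -- clarify ambiguous names
    user_id as customer_id,
    cast(order_date as date) as order_date,
    status
from raw.jaffle_shop.orders
```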
??? [Instructor Note]: "Once we have our sources, we create our first layer of models which cleans and standardizes our raw data so that it's easier to work with downstream." --- # Interface #1: The dbt DAG ### Model prefixes .dense-text[ .left-column-33[ ### stg_ - 1:1 with raw data tables - **Source**-conformed - Renaming, type casting, coalescing + nullif’ing ] .right-column-66[ .left-column[ ### dim_ - “Nouns” - Core **business** entities - Customers, products, apartments, providers, employees, ... - Some "slowly-changing" attributes (e.g. address, email) ] .right-column[ ### fct_ - “Verbs” - Core **business** processes - Orders, cases, plays, listings, comments, ... - Often built on top of an immutable event stream ]]] ---
source → [base] → stg → [intermediate, lookup, utils] → {fct + dim}
??? [Instructor Note]: "Here's an example of what a real DAG might end up looking like. You can see that we have: - Our sources (the green nodes) - Our staging models - Some intermediate steps happening: Remember our leftover question mark when we were whiteboarding our DAG? We can identify what processes are our "stepping stones" to our endresult and modularize those as intermediate models. You don't need these, but it does make the overall flow more understandable and allows you to reuse the outputs of those processes! - and our final dimensions and facts" --
demo
??? "" Demo: * Break first customers' CTE into stg_jaffle_shop__customers * Demo the ref function, how it builds the DAG, and compile to show target schema * Explain the naming convention: stg_source_name__table_name "" --- # Folder structure .left-column[ - **Files:** follow naming convention - **Within files:** adhere to a sql style guide - **Between files:** better file organization = easier onboarding + better collaboration ] .right-column[ .dense-text[ ``` ├── README.md ├── dbt_project.yml └── models ├── marts └── core ├── core.yml └── dim_product.sql ├── dim_store.sql └── fact_sales.sql └── staging └── postgres └── shopify ├── shopify.yml ├── stg_shopify_customers.sql ├── stg_shopify_order_items.sql ├── stg_shopify_orders.sql └── stg_shopify_products.sql ``` ] ] ??? [Instructor Note]: We can keep organized by separating our modeling layers in to dedicated folders and having naming conventions for our files. Point out: - marts / staging separation - subfolders within each - configuration files that describe the contents of the folder they're in - naming conventions of the models --- class: subtitle #Hands-On (10 min)
* **Organize:**
    * Create a `staging` subfolder in the `models` directory
    * Create a `marts` subfolder in the `models` directory
    * Rename `customers.sql` to `dim_customers.sql`
    * Move `dim_customers.sql` to the `marts` folder
* **Refactor:** build a staging model for each CTE in `dim_customers.sql` (2)
* **Materialize:** `stg_` models as views
* **Replace:** the CTEs in `dim_customers.sql` with references to the `stg_` models (see the sketch below)
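A sketch of where `dim_customers.sql` should land (the staging model names follow the `stg_source_name__table_name` convention from the demo; the exact column logic should mirror your existing CTEs):

```sql
-- models/marts/dim_customers.sql (sketch)
with customers as (
    select * from {{ ref('stg_jaffle_shop__customers') }}
),

orders as (
    select * from {{ ref('stg_jaffle_shop__orders') }}
)

select
    customers.customer_id,
    count(orders.order_id) as number_of_orders
from customers
left join orders
    on customers.customer_id = orders.customer_id
group by 1
```

At compile time each `ref` resolves to the full table/view name in the correct schema for your environment, and dbt uses these edges to decide the run order.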
---

class: subtitle

## Checkpoint:

.center[
]

- What models does `customers_with_attribution` depend on?

???
`sessions` & `raw_customers`

--

- What models depend on `customers_with_attribution`?

???
`customer_acquisition_by_channel` & `revenue_by_channel`

--

- Questions??

---

# Knowledge check

You should be comfortable with these words:

.left-column[
* dbt project
* models
* target schema
* Directed Acyclic Graphs (DAGs)
]
.right-column[
* configurations (configs)
* materializations
* references (refs)
]

---

# Knowledge check

You should be able to:

* Create new models (that depend on other models)
* Run your models (all of them / one model at a time)
* Check the SQL that's running
* Configure your models (from the `dbt_project.yml` file / within a model)

🙋‍♀️ Please let us know if you want to go over any of these

???
Instructor notes:
* Use this as a checklist that you've covered everything
* Emphasize that we're going to launch off from here; use the lunch time to make sure that everyone is following so far.
* For those who do understand, there's a working session

---

# Resources

* [Configs and Properties](https://docs.getdbt.com/reference/configs-and-properties)
* [dbt Command Reference](https://docs.getdbt.com/reference/dbt-commands)
* [dbt Selection Syntax](https://docs.getdbt.com/reference/node-selection/syntax)
* [dbt Ref macro](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)

---

class: subtitle

# Zoom Out
* dbt Learn Norms
* Why dbt?
* Modeling

--

* Sources

--

* Testing
* Docs
* Deployment