class: title, center, middle # Setting the scene ??? Note to teacher: - This presentation should be accompanied by re-doing the tutorial live. - It comes after a number of lessons that are one-sided - Ask the audience some questions / encourage participation. There will be some students who know all this already, so give them the opportunity to participate by helping answer questions. - If people ask questions that are related to future lessons, jot them down on a shared place so you can cross them off throughout the rest of the course. --- # Scenario I'm an analyst working at "The Jaffle Shop" where we use our own app for order management: .center[
] -- ### My goal: Create a BI visual representing total orders per customer -- ### Why and how should I transform my data? ??? Hopefully answers like: - clean up the data - create new fields - build trust/test it - stop repeated work --- # My project * Transform to: * Create clarity (e.g. what is `id`?) * Build new fields (e.g. create `number_of_orders` by customer) * Create new table named `dim_customers` for transformed data * Connect BI tool to the transformed data --- # How to do this manually Run a `create table as` statement in my data platform! ```sql create table dbtlabs_jdoe.dim_customers as ( ... ) ``` -- What if... .dense-text[ * schema doesn't exist yet? * schema does exist, and I need to recreate it? * table already exists? * I want to parameterize the schema? * this depends on another transformation? ] ??? --- class: title, center, middle # Data Modeling with dbt ---
Data Modeling | Focus
Explain the connections between company goals/questions and transforming data
Explain how dbt transforms data
Explain models, materializations, and dependencies in the context of a dbt project
Confirm dbt Cloud account, data platform connection, and project initialization
--- class: subtitle #Hands-On (10 min)
Access:
Ensure you have can log on to dbt Cloud (check e-mail for invite)
Ensure you can log on to Snowflake
Snowflake Username & Password are in your e-mail
Add your Snowflake credentials to your Coalesce 2022 Training - Fundamentals project
Create:
Make a working branch in the IDE with a unique name (e.g. dbtlabs_bhipple)
--- # To use dbt you need: ??? To use dbt, you need four things: -- 1. A data platform ??? * We're using Snowflake, your org might be using BigQuery / Redshift (or heaven forbid, spark) * Flip to Snowflake console and show the data -- 2. A dbt project -- * created with the `initialize` button * a directory with a `dbt_project.yml` file + other `.sql` and `.yml` files .center[
] ??? Create the project again/show off the models Quick side step into models though... If using the CLI use `init` command to create the project --- # What is `.yml`? - a human-friendly way to write dictionaries and arrays - used for configuration files (e.g. tests, documentation, etc.) - prevalent throughout dbt projects - whitespace management can be difficult - copy + paste is encouraged --- # To use dbt you need: 1. A data platform 2. A dbt project * created with the `initialize` button * a directory with a `dbt_project.yml` file + other `.sql` and `.yml` files * a project contains **models** --- ## What are models? ??? See if anyone wants to have a go at explaining a model -- * Models are `select` statements that transform data. * Stored in the `models` directory, and have a `.sql` extension .center[
] --
demo
??? Demo Time: - Jump over into dbt Cloud _Group Training project: https://cloud.getdbt.com/#/accounts/26712/projects/43345/develop/ - Look at the model `customers.sql` - Show the SQL select statement. - Talk about SQL & Jinja working together - Hit `Compile` to show Jinja compiles to SQL - Hit 'Preview' to show that it outputs data --- # To use dbt you need: 1. .grey[A data platform] 2. .grey[A dbt project] 3. A connection to your data platform -- * contains credentials to connect to your data platform and a **target schema** ??? * Explain why this lives outside your project: * creds are sensitive and shouldn't be version controlled * target schemas should be different for each user, so shouldn't be hardcoded in a project * Explain what a target schema is * For CLI users — they might want more info on targets/profiles * CLI users will need a profile matching the 'profile:' in the 'dbt_project.yml' file --- # To use dbt you need: 1. .grey[A data platform] 2. .grey[A dbt project] 3. .grey[A connection to your data platform] 4. A command -- * an instruction, issued from the dbt IDE * e.g. `dbt run` ??? * Run `dbt run`! * Explain what dbt is doing (wraps the select statement in DDL and runs it) * Show where to see the logs Good knowledge check questions: * How did dbt connect to the warehouse? A: connection / profile * How did dbt know what schema to use? A: the target schema * What happens if you rerun dbt? A: no downtime (this looks different on each warehouse) --- # To use dbt you need: 1. A data platform 2. A dbt project 3. A connection to your data platform 4. A command --- # What happens when you run the command: `dbt run`? 1. dbt connects to your **data platform** (via a **profile**/**connection**) 2. dbt parses your **dbt project** 3. dbt wraps your **models** in the appropriate DDL/DML (e.g. `create table as`) 4. dbt executes this code to build your models in your **target schema** ??? - Behind the scenes, tons more happening: checking if the schema exists etc. --- .center[
] --- class: subtitle # Checkpoint: - What are the four pieces you need to use dbt? ??? 1. A data platform 2. A dbt project 3. A connection to your data platform 4. A command -- ## Good so far?? - Questions?? --- class: subtitle #Hands-On (5 min)
Find:
`customers.sql` file in your project
Build:
Use the command that builds your models to your data platform
Find:
Find the `customers` view that was built in Snowflake
--- ### You might be wondering: -- **How can I see what's happening?** -- * Check out the `target/` directory and the `logs/` file * **dbt Cloud:** Select the "Details" view within a run -- **How does dbt know the right SQL to use?** (`create table as` vs. `create view as`) -- * The way that dbt builds a model depends on the **materialization** you **configure** for a model. -- **What's a materialization? And a configuration?** --- # Materializations -- * `materializations` are build strategies * i.e. the SQL that your `select` statement gets wrapped in * Models can be materialized as `views` and `tables` (we'll learn more later!) * You need to **configure** a model's materialization * If you weren't using dbt, this would look like writing a lot of DDL in the Snowflake console --- # Configurations -- * `configurations` are model settings * They can be set in your `dbt_project.yml` file, _and_ in your model file. * Configurations are applied _hierarchically_ * Example configurations: * `materialized: view` * `tags: ['nightly', 'hourly']` * `enabled: True` --
demo
??? Demo Time: - "Let's see configurations in action!" - show dbt_project.yml materialization - run `dbt run` - show staging model DDL stating view Now let's do an example that highlights hierarchical nature, showing that more specific configurations take precedence - add a config blocks in a file with a different materialization (e.g. staging file materialized as a table) - run `dbt run` - show staging model DDL stating table --- class: subtitle #Hands-On (3 min)
Materialize:
Materialize all models as `table` in one file
Build:
Run the command that builds your models
Check:
Verify that all models materialized correctly
Find:
Find the `customers` table that was built in Snowflake
--- # What happens when you run the command: `dbt run`? 1. dbt connects to your **data platform** (via a **profile**/**connection**) 2. dbt parses your **dbt project** 3. dbt wraps your **models** in the based on the the **materialization** you **configure** for a model (e.g. `create table as`) 4. dbt executes this SQL to build your models in your **target schema** -- ### You might be wondering: If I have multiple models, how does dbt know which order to run models in? -- * Use the `ref` function to create dependencies --- # Creating dependencies * Use the `ref` function to do build dependencies between models * `{{ ref('model_name') }}` * The `ref` function _also_ compiles to the full table/view name, using the correct schema for your environment * These dependencies form a **Directed Acyclic Graph (DAG)** * dbt uses this DAG to understand the order to run models in * This makes it easy to split long queries up into separate models .center[
] --- # DAGs ## A real life example: .center[
] --- background-color: #004C61 class:subtitle ## A slightly more complex DAG .center[
] --- background-color: #004C61 class:subtitle ## Unlocked DAG Potential .center[
] --- class: subtitle #Hands-On (10 min)
Organize:
Create `staging` subfolder in `models` directory
Create `marts` subfolder in the `models` directory
Rename `customers.sql` to `dim_customers.sql`
Move `dim_customers.sql` to the `marts` folder
Refactor:
Build a staging model for each CTE in `dim_customers.sql` (2)
Materialize:
`stg_` models as views
Replace:
CTEs in `dim_customers.sql` with references to `stg_` models
--- class: subtitle ## Checkpoint: .center[
] - What models does `customers_with_attribution` depend on? ??? +sessions & raw_customers -- - What models depend on `customers_with_attribution`? ??? customer_acquisition_by_channel & revenue_by_channel -- - Questions?? --- # Knowledge check You should be comfortable with these words: .left-column[ * dbt project * models * target schema * Directed Acyclic Graphs (DAGs) ] .right-column[ * configurations (configs) * materializations * references (refs) ] --- # Knowledge check You should be able to: * Create new models (that depend on other models) * Run your models (all of them / one model at a time) * Check the SQL that's running * Configure your models (from the `dbt_project.yml` file / within a model) 🙋♀️ Please let us know if you want to go over any of these ??? Teacher notes: * Use this as a checklist that you've covered everything off * Emphasize that we're going to launch off from here — use the lunch time to make sure that everyone is following so far. * For those who do understand, there's a working session --- # Resources * [Configs and Properties](https://docs.getdbt.com/reference/configs-and-properties) * [dbt Command Reference](https://docs.getdbt.com/reference/dbt-commands) * [dbt Selection Syntax](https://docs.getdbt.com/reference/node-selection/syntax) * [dbt Ref macro](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) --- class: subtitle #Zoom Out
dbt Learn Norms
Why dbt?
Modeling
--
Project Design
--
Testing
Sources
Docs
Deployment