class: title, center, middle

# Setting the scene

???
Note to teacher:
- This presentation should be accompanied by re-doing the tutorial live.
- It comes after a number of lessons that are one-sided.
- Ask the audience some questions / encourage participation. There will be some students who know all this already, so give them the opportunity to participate by helping answer questions.
- If people ask questions that are related to future lessons, jot them down in a shared place so you can cross them off throughout the rest of the course.

---

# Scenario

I'm an analyst working at "The Jaffle Shop". The Jaffle Shop uses its own app for order management:

.center[
]

I have a copy of this data replicated into my warehouse:

```sql
select * from raw.jaffle_shop.customers
```

--

### Why should I transform my data?

???
Hopefully answers like:
- clean up the data
- create new fields
- build trust / test it
- stop repeated work

---

# My project

* I'm going to transform my data to:
    * Clean up some dirtiness in my data (e.g. `user_id`)
    * Build some useful fields about my customers (`number_of_orders` etc.)
* This transformed data is going to live in a table named `fct_customers`
* I'll then connect my BI tool to the transformed data

---

# How to do this manually

Run a `create table as` statement!

```sql
create table dev_claire.fct_customers as (
    ...
)
```

--

Uh oh...

.dense-text[
* What if the schema doesn't exist yet?
* What if the schema does exist, and I try to recreate it?
* What if this table already exists?
* What if I want to parameterize the schema? (And I should!)
* What if this depends on another transformation?
]
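--

Handling even the first few of these by hand means wrapping the query in boilerplate like this (a sketch; the Snowflake-style DDL and the schema name are illustrative):

```sql
-- handle "schema doesn't exist" and "table already exists" manually
create schema if not exists dev_claire;

create or replace table dev_claire.fct_customers as (
    ...
);
```

...and that still leaves parameterization and dependencies unsolved.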
---
class: title, center, middle

# dbt Fundamentals

---

# dbt Fundamentals | Focus

* Review the basics from the Getting Started tutorial
* Explain models, materializations, and dependencies in the context of a dbt project
---

# To use dbt you need:

???
To use dbt, you need four things:

--

1. A data warehouse

???
* We're using Snowflake; your org might be using BigQuery or Redshift (or, heaven forbid, Spark)
* Flip to the Snowflake console and show the data

--

* that can run SQL
* with data already loaded into it

.center[
]

???
Questions to prompt:
- How does data get into the warehouse?

---

# To use dbt you need:

1. .grey[A data warehouse]
2. A dbt project

--

* created with the `initialize` button
* a directory with a `dbt_project.yml` file + other `.sql` and `.yml` files (an example layout is sketched below)
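A minimal project might look like this on disk (directory and file names are illustrative):

```
jaffle_shop/
├── dbt_project.yml
└── models/
    └── customers.sql
```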
.center[
]

???
Create the project again / show off the models

Quick side step into models though...

If using the CLI, use the `init` command to create the project

---

# What is a `.yml` file?

YAML is a (relatively) human-friendly way to write dictionaries and arrays. It's often used for configuration files (🙋‍♀️).

We use YAML a lot, so it's worth making sure you have a good handle on it (especially the importance of indentation!)
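For example, a made-up snippet in the shape of part of a `dbt_project.yml` (keys and values are illustrative):

```yml
# top-level keys form a dictionary
name: jaffle_shop

# indentation creates nested dictionaries
models:
  jaffle_shop:
    materialized: view

# dashes create an array
tags:
  - nightly
  - hourly
```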
---

# To use dbt you need:

1. A data warehouse
2. A dbt project

* created with the `initialize` button
* a directory with a `dbt_project.yml` file + other `.sql` and `.yml` files
* a project contains **models**

---

## What are models?

???
See if anyone wants to have a go at explaining a model

--

* Models are `select` statements that transform data.
* Stored in the `models` directory, and have a `.sql` extension

.center[
]

--
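For example, a model might look like this (a sketch; the file name and column names are illustrative, the raw table is the one from our scenario):

```sql
-- models/customers.sql
select
    id as customer_id,
    first_name,
    last_name
from raw.jaffle_shop.customers
```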
demo
???
Demo Time:
- Jump over into the dbt Cloud Group Training project: https://cloud.getdbt.com/#/accounts/26712/projects/43345/develop/
- Look at the model `customers.sql`
- Show the SQL select statement.
- Talk about SQL & Jinja working together
- Hit `Compile` to show Jinja compiles to SQL
- Hit `Preview` to show that it outputs data

---

# To use dbt you need:

1. .grey[A data warehouse]
2. .grey[A dbt project]
3. A connection to your warehouse

--

* contains credentials to connect to your warehouse and a **target schema**

???
* Explain why this lives outside your project:
    * creds are sensitive and shouldn't be version controlled
    * target schemas should be different for each user, so shouldn't be hardcoded in a project
* Explain what a target schema is
* For CLI users: they might want more info on targets/profiles
* CLI users will need a profile matching the `profile:` in the `dbt_project.yml` file

---

# To use dbt you need:

1. .grey[A data warehouse]
2. .grey[A dbt project]
3. .grey[A connection to your warehouse]
4. A command

--

* an instruction, issued from the dbt IDE
* e.g. `dbt run`

???
* Run `dbt run`!
* Explain what dbt is doing (wraps the select statement in DDL and runs it)
* Show where to see the logs

Good knowledge check questions:
* How did dbt connect to the warehouse? A: connection / profile
* How did dbt know what schema to use? A: the target schema
* What happens if you rerun dbt? A: no downtime (this looks different on each warehouse)

---

# To use dbt you need:

1. A data warehouse
2. A dbt project
3. A connection to your warehouse
4. A command

---

# What happens when you run the command: `dbt run`?

1. dbt connects to your **data warehouse** (via a **profile**/**connection**)
2. dbt parses your **dbt project**
3. dbt wraps your **models** in the appropriate DDL/DML (e.g. `create table as`)
4. dbt executes this code to build your models in your **target schema**
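--

For a model built as a table, the SQL dbt executes looks roughly like this (a sketch; the database, schema, and model names are illustrative):

```sql
create table analytics.dev_claire.fct_customers as (
    -- your model's select statement goes here
    select ...
);
```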
???
- Behind the scenes, tons more happening: checking if the schema exists etc.

---

.center[
]

---
class: subtitle

# Checkpoint:

- What are the four pieces you need to use dbt?

???
1. A data warehouse
2. A dbt project
3. A connection to your warehouse
4. A command

--

## Good so far??

- Questions??

---

### You might be wondering:

--

**How can I see what's happening?**

--

* Check out the `target/` directory and the `logs/` directory
* **dbt Cloud:** Select the "Details" view within a run

--

**How does dbt know the right SQL to use?** (`create table as` vs. `create view as`)

--

* The way that dbt builds a model depends on the **materialization** you **configure** for a model.

--

**What's a materialization? And a configuration?**

---

# Materializations

--

* `materializations` are build strategies
    * i.e. the SQL that your `select` statement gets wrapped in
* Models can be materialized as `views` and `tables` (we'll learn more later!)
* You need to **configure** a model's materialization
* If you weren't using dbt, this would look like writing a lot of DDL in the Snowflake console

---

# Configurations

--

* `configurations` are model settings
* They can be set in your `dbt_project.yml` file, _and_ in your model file.
* Configurations are applied _hierarchically_: more specific settings override more general ones
* Example configurations:
    * `materialized: view`
    * `tags: ['nightly', 'hourly']`
    * `enabled: True`

--
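For example, in `dbt_project.yml` (the project and folder names are illustrative):

```yml
models:
  jaffle_shop:
    materialized: view      # default for every model in the project
    staging:
      materialized: table   # more specific: overrides the default for models in staging/
```

A config set inside a model file, e.g. `{{ config(materialized='table') }}`, is more specific still and takes precedence over both.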
demo
???
Demo Time:
- "Let's see configurations in action!"
- show the `dbt_project.yml` materialization
- run `dbt run`
- show the staging model's DDL creating a view

Now let's do an example that highlights the hierarchical nature, showing that more specific configurations take precedence
- add a config block in a file with a different materialization (e.g. staging file materialized as a table)
- run `dbt run`
- show the staging model's DDL creating a table

---

# What happens when you run the command: `dbt run`?

1. dbt connects to your **data warehouse** (via a **profile**/**connection**)
2. dbt parses your **dbt project**
3. dbt wraps your **models** in DDL/DML based on the **materialization** you **configure** for each model (e.g. `create table as`)
4. dbt executes this SQL to build your models in your **target schema**

--

### You might be wondering:

If I have multiple models, how does dbt know which order to run models in?

--

* Use the `ref` function to create dependencies

---

# Creating dependencies

* Use the `ref` function to build dependencies between models
    * `{{ ref('model_name') }}`
* The `ref` function _also_ compiles to the full table/view name, using the correct schema for your environment
* These dependencies form a **Directed Acyclic Graph (DAG)**
* dbt uses this DAG to understand the order to run models in
* This makes it easy to split long queries up into separate models (see the sketch below)
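For example (model names are illustrative):

```sql
-- models/fct_customers.sql
-- the ref tells dbt to build stg_orders first, and compiles to that
-- model's full table/view name in your target schema
select
    customer_id,
    count(order_id) as number_of_orders
from {{ ref('stg_orders') }}
group by 1
```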
.center[
]

---

# DAGs

## A real-life example:

.center[
]

---
background-color: #004C61
class: subtitle

## A slightly more complex DAG

.center[
]

---
background-color: #005E7A
class: subtitle

# A typical dbt DAG

.center[
]

---
class: subtitle

## Checkpoint:

.center[
]

- What models does `customers_with_attribution` depend on?

???
sessions & raw_customers

--

- What models depend on `customers_with_attribution`?

???
customer_acquisition_by_channel & revenue_by_channel

--

- Questions??

---

# Knowledge check

You should be comfortable with these words:

.left-column[
* dbt project
* models
* target schema
* Directed Acyclic Graphs (DAGs)
]

.right-column[
* configurations (configs)
* materializations
* references (refs)
]

---

# Knowledge check

You should be able to:

* Create new models (that depend on other models)
* Run your models (all of them / one model at a time)
* Check the SQL that's running
* Configure your models (from the `dbt_project.yml` file / within a model)

🙋‍♀️ Please let us know if you want to go over any of these
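--

As a quick reminder, the commands look something like this (the selection flag varies by dbt version: older versions use `-m`/`--models`, newer ones `--select`):

```
dbt run                         # run all models in the project
dbt run --models fct_customers  # run a single model
```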
???
Teacher notes:
* Use this as a checklist to confirm you've covered everything
* Emphasize that we're going to launch off from here; use the lunch time to make sure that everyone is following so far.
* For those who do understand, there's a working session

---
class: subtitle

# Zoom Out

* Distributed dbt Learn norms
* Who is an analytics engineer?
* dbt Fundamentals

--

* Working Group 1
* dbt Project Design
* Working Group 2
* Testing
* Sources
* Docs
* Deployment
* Working Group 3
---