class: title, center, middle # Building a Modern Data Stack with dbt and Databricks ## Data and AI Summit 2022 --- class: title, center, middle # Welcome --- # Welcome - Introductions - Course Overview + Agenda - How to get the most out of the course - Architecture of Databricks and dbt Cloud --- ## Introductions .left-column[
] .right-column[**Kyle Coapman** Head of Training dbt Labs ] --- ## Introductions .left-column[
] .right-column[**Fei Lang** ISV Partner Solutions Architect Databricks ] --- ## Introductions Turn and talk with those near you - Name - Role - Company - Favorite potato based food --- ## Course Overview You will learn how to: - Leverage dbt to manage data transformations in Databricks through the SQL warehouse - Transform data while building dependencies and managing environments - Apply tests to transformations to ensure data quality - Write and generate documentation of data models - Orchestrate the transformation of data on a schedule or using the API - Analyze your results using SQL queries on the data lake ---
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
---

## Agenda with timing

| Topic                       | Time (min) | Time (PT) |
|-----------------------------|------------|-----------|
| Welcome                     | 15 min     | 8:00a     |
| Setup                       | 15 min     | 8:15a     |
| Databricks SQL Part 1       | 15 min     | 8:30a     |
| Sources                     | 30 min     | 8:45a     |
| Break #1                    | 15 min     | 9:15a     |
| Models                      | 45 min     | 9:30a     |
| Testing and Documentation   | 30 min     | 10:15a    |
| Break #2                    | 15 min     | 10:45a    |
| Deployment                  | 30 min     | 11:00a    |
| Databricks SQL Part 2       | 15 min     | 11:30a    |
| Wrap up, Survey, Next Steps | 15 min     | 11:45a    |

---

## How to get the most out of the course

.center[
.left-column-66[.left-column[
🎯
## All in ] .right-column[
🙋🏽‍♀️
## Curiosity ]] .right-column-33[
👩🏻‍💻🧑🏾‍💻👨🏼‍💻
## Community ] ] --- # 4 Tabs to have open - Databricks query editor* - dbt Cloud develop interface* - Slide deck (your choice of format) - Slack channel *We will set this up in a moment if you haven't already --- class: subtitle, center, middle # Two Methods of Learning Content Based & Practice Based --- # Content-based - Focus on concepts and quick demonstrations - Content questions in #build-a-modern-datastack.... channel - Tech support in #ta-help channel - _Not expected to_ code along, we will have time for practice too! - **Checkpoints** - **Pausing for Questions** --- class: subtitle # Checkpoint: - _Do you prefer trailing commas or leading commas and why?_ -- ### Questions?? ??? * The checkpoints will look like this with a couple of questions. * Option A: Type your answer in the chat. * Option B: Raise your hand to give a verbal response. --- # Practice-based - **Milestones** will indicate clear steps to complete in your own project - Code snippets and steps are covered in the slide deck - Timer set at the front of the room - Lean on fellow developers near you - Escalate questions to TAs - If we notice a trend, we will pause and bring it back --- # Milestones 1. Configure sources 2. Develop staging models 3. Develop marts model 4. Add testing and documentation 5. Deploy your dbt Project --- # Practice-based Finish early?? Couple of options: - Keep leveling up by assisting a fellow learner - Look for the bonus exercises to build your project out even further* * Important note: Bonus exercises will not be pre-requisites/blockers for later parts of the course. --- ## Architecture of Databricks and dbt Cloud .center[
] --- ## What we are building in dbt .center[
] --- class: title, center, middle ## Questions before we jump in?? --- class: title, center, middle ## Are you ready to learn?? ---
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
---

# Pre-course set up

1. Access the lab environment in Databricks
2. Use partner connect to set up dbt Cloud
3. Load sample data
4. Initialize your dbt Project

---

## Access the lab environment in Databricks

**
Link: Databricks Workspace Registration
** Step 1: Complete the registration form to create the workspace Step 2: Click `launch lab` and allow 5-10 minutes for the environment to load Step 3: After the environment has been prepared, copy-paste the provided Databricks Workspace URL into an incognito/private window Step 4: Sign in with the provided username and password from the Environmental Details page. Step 5: Arrive in your Databricks Workspace. You should land in the SQL workspace. **
Link: PDF Workspace Registration Guide
** --- ## Use partner connect to set up dbt Cloud - In the Databricks workspace, on the left-side console: click on `Partner Connect`. .center[
]

- Select the dbt tile under `Data preparation and transformation`.

---

## Use partner connect to set up dbt Cloud

- Click on `Next` when prompted to `Connect to partner`. This action will create a service principal, a PAT token for that service principal, and a SQL Warehouse for the dbt Cloud account to use. This means you will have two SQL Warehouses at your disposal: one from the previous step and one created by Partner Connect.

.center[
] --- ## Use partner connect to set up dbt Cloud - Click on `Connect to dbt Cloud`. .center[
] --- ## Use partner connect to set up dbt Cloud - After the new tab loads, you will see a form. Enter your account name (e.g. Acme Labs or Kyle's Sandbox) and password. **Save this username/password if you want to access dbt Cloud after this training.** (Note: Collapse the welcome prompt in the bottom right) .center[
] - After you have filled out the form and clicked on `Complete Registration`, you will be logged into dbt Cloud automatically. --- ## Load sample data Head back to your Databricks Workspace Change your SQL warehouse from XS to 2XS - Navigate to SQL Warehouses on the left hand menu - Click on the triple dots to edit `DBT_CLOUD_ENDPOINT` - Change the cluster size from `X-Small` to `2X-Small` - Change the auto stop from 10 minutes to 45 minutes - Click save in the top right --- ## Load sample data - In the left side menu, choose SQL editor - Select the `DBT_CLOUD_ENDPOINT` from the dropdown menu .center[
]

---

## Load sample data

Create a schema for sample source data

- Run the following lines* of code in your SQL editor. This may take a few moments for the warehouse to start.

```sql
-- the default schema will be where our production data lives
DROP SCHEMA IF EXISTS retail cascade;
DROP SCHEMA IF EXISTS dbt_user cascade;
CREATE SCHEMA retail; -- our raw sample data will go here
CREATE SCHEMA dbt_user; -- our development data will go here
GRANT ALL PRIVILEGES ON SCHEMA retail TO users;
GRANT ALL PRIVILEGES ON SCHEMA dbt_user TO users;
USE SCHEMA retail;
```

* In Databricks SQL, you can select all the lines you want to run and choose run selected (shift + enter)

**
Link to raw code
**

---

## Load sample data

Load sample source data

_Run the following three lines of code in your SQL editor_

```sql
CREATE TABLE retail.customers USING csv OPTIONS (path "/databricks-datasets/retail-org/customers/customers.csv", header "true");
CREATE TABLE retail.loyalty_segments USING csv OPTIONS (path "/databricks-datasets/retail-org/loyalty_segments/loyalty_segment.csv", header "true");
CREATE TABLE retail.sales_orders USING json OPTIONS (path "/databricks-datasets/retail-org/sales_orders/part-00000-tid-1771549084454148016-e2275afd-a5bb-40ed-b044-1774c0fdab2b-105592-1-c000.json", header "true");
```

The schema browser on the left should now show three tables in the retail schema.

**
Link to raw code
**

---

## Load sample data

Query loaded sample source data

_Make sure you can access the tables that you just created by running the following lines_

```sql
SELECT * FROM retail.customers limit 10;
SELECT * FROM retail.loyalty_segments limit 10;
SELECT * FROM retail.sales_orders limit 10;
```

**
Link to raw code
**

---

## Test your dbt Cloud connection

- In the Databricks interface, navigate to `User Settings` in the left-hand menu
- Select the tab that says `Personal Access Tokens`
- Create a new token with the comment `DBT_USER` and click generate*
- Leave the token up in that tab and head over to your dbt Cloud tab

*We will be using this to set up our development connection.

---

## Test your dbt Cloud connection

Within dbt Cloud, do the following:

- Click on your profile icon in the top right corner
- Under credentials on the left, click your project on the left-hand side titled 'Databricks Partner Connect Trial'
- Click edit to adjust your **development credentials**
- For **token**, enter the personal access token from the Databricks workspace
- For **schema**, enter `dbt_user`
- For **threads**, you can leave this at 6.
- Click save in the top right.

---

## Test your dbt Cloud connection

- In dbt Cloud, navigate to the left-hand menu through the hamburger menu. Select **Develop**. It might take a few minutes for your project to spin up for the first time, as it establishes your git connection, clones your repo, and tests the connection to the warehouse.
- Above the file tree to the left, click **Initialize your project**. This builds out your folder structure with example models.
- Make your initial commit by clicking **Commit**. Use the commit message `initial commit`. This creates the first commit to your managed repo and allows you to open a branch where you can add new dbt code.

---

## Test your dbt Cloud connection

- Head back into your dbt Cloud account
- Now you should be able to **directly query data from your warehouse** and **execute dbt run**. In "Scratchpad 1", delete any existing text and paste in the following code. Click `Preview`.

```sql
select * from retail.customers
```

- In the command line bar at the bottom, type `dbt run` and click **Enter**. We will explore what happens in the next section of the course.

---

# Pre-course set up

At this point in the course, you should have the following windows open from this milestone:

- Your Databricks workspace with access to the SQL editor
- Your dbt Cloud workspace with access to the develop interface

---

# 4 Tabs to have open

- Databricks query editor
- dbt Cloud develop interface
- Slide deck (your choice of format)
- Slack channel

---
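## Optional: sanity-check the sample data

Before moving on, you can confirm that all three sample tables loaded by counting their rows in the Databricks SQL editor. A quick sketch (exact counts will depend on the dataset version):

```sql
SELECT 'customers' AS table_name, COUNT(*) AS row_count FROM retail.customers
UNION ALL
SELECT 'loyalty_segments', COUNT(*) FROM retail.loyalty_segments
UNION ALL
SELECT 'sales_orders', COUNT(*) FROM retail.sales_orders;
```

---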
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
--- class: title, center, middle # Databricks SQL --- class: title, middle, center # What is Databricks --- class: title, middle, center # Databricks' vision is to enable data-driven innovation to all enterprises --- # Databricks Architecture .center[
]

---

class: title, middle, center

# Why use Databricks SQL?

---

## A new home for data analysts

.left-column[
Enable data analysts to quickly **perform ad-hoc and exploratory data analysis**, with a new SQL query editor, visualizations and dashboards.

Automatic alerts can be triggered for critical changes, allowing teams to respond to business needs faster.
]

.right-column[
]

---

## Simple administration and governance

.left-column[
**Quickly set up SQL / BI optimized compute** with SQL warehouses. Databricks automatically determines instance types and configuration for the best price/performance.

Then, easily manage usage and perform quick auditing and troubleshooting with query history.
]

.right-column[
] --- # Databricks SQL Demo - Let's see it in action - Landing Page - Sidebar Menu and App Switcher - SQL Editor - Queries - Dashboards - Data - SQL Warehouses - Query History ---
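## A sample query for the demo

If you want something to run while exploring the SQL editor, a sketch like the one below counts customers per loyalty segment. It assumes the raw `customers` table has a `loyalty_segment` column that matches `loyalty_segments.loyalty_segment_id` - check the schema browser first.

```sql
-- customers per loyalty segment (column names are assumptions; adjust to your schema)
SELECT
  ls.loyalty_segment_description,
  COUNT(*) AS customer_count
FROM retail.customers AS c
JOIN retail.loyalty_segments AS ls
  ON c.loyalty_segment = ls.loyalty_segment_id
GROUP BY ls.loyalty_segment_description
ORDER BY customer_count DESC;
```

---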
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
--- class: title, center, middle # Sources --- # If you've ever had this happen... - I'm switching my email marketing platform. I want to know which final models depend on that source data. -- - The team has changed the schema where they are loading raw tables. -- - Someone asked me "where does this data _really_ come from?" -- - Nothing in this world is perfect. My data loading tools experience occasional hiccups that result in delayed or duplicated data. -- - Something that was once true about my source data is no longer true (e.g. a new payment method was introduced) --- # Sources may be right for you! - Easily declare and update the location of raw tables - `schema` for each source - `name`/`identifier` for each table -- - Test data integrity of raw tables - Built-in: `unique`, `not_null`, `accepted_values`, `relationships` - Custom: whatever you want! - More on this later! -- - Test the freshness of loaded data, based on thresholds you define --- ## What we are building in dbt .center[
]

???
Call back to what we are building to connect the dots

---

# Defining sources

Sources are defined in `.yml` files in your `models/` directory.

```yml
version: 2

sources:
  - name: retail
    schema: retail
    tables:
      - name: customers
        identifier: customers
      - name: loyalty_segments
        identifier: loyalty_segments
      - name: sales_orders
        identifier: sales_orders
```

---

# Selecting from sources

Use the `{{ source() }}` function to select from a source in a model:

.left-column[
```sql
with source as (

    select * from {{ source('retail', 'customers') }}

),

renamed as (

...
```
]

--

.right-column[
```sql
with source as (

    select * from retail.customers

),

renamed as (

...
```
]

--
demo
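--

The same pattern works for any source you have defined. A skeleton for a `sales_orders` staging model might start like this (column selection intentionally left out):

```sql
with source as (

    select * from {{ source('retail', 'sales_orders') }}

)

select * from source
```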
???
Demo Time:
- in the stg_customers model, replace the direct table references with the source macro
- run `dbt run`
- this fails! Why does it fail? (because we haven't added the sources.yml file. Do that now.)
- add sources.yml (use text on Defining Sources slide) "Once you have defined a source in your .yml file, you can now utilize the source Jinja function to select from that source in a model."
- Preview data to show it works "The source function here will compile to the full object name in the database, so select star from source('retail', 'customers') will compile to retail.customers, which is **clicks back** how we defined it in our yml file."
- show Compiled sql to prove it translates to the schema.object expected
- Ask learners how to update the other staging models
Then jump back to the slides to talk about source-freshness

---

# Sources are rendered as green nodes

They help users understand where data comes from (lineage)
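If you ever need to answer "which models depend on this source?" outside of the docs site, dbt's node selection syntax can list everything downstream of a source. A sketch you can run from the command bar:

```bash
# the + suffix selects the retail source tables plus everything downstream of them
dbt ls --select source:retail+
```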
---

# Feeling fresh?

Optionally, specify a `loaded_at_field` and `freshness`:

.denser-text[
```yml
version: 2

sources:
  - name: retail
    schema: retail
    tables:
      - name: customers
        identifier: customers
      - name: loyalty_segments
        identifier: loyalty_segments
      - name: sales_orders
        identifier: sales_orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after:
            count: 12
            period: hour
          error_after:
            count: 24
            period: hour
```
]

Then run `dbt source freshness`

???
- Now that we've defined our sources and we know how to test them ** jump back to intro slide **
- Has that ever happened to you before? And if it has happened to you, do you have a reliable way of capturing that?
- dbt has a command called source freshness that will allow you to check the freshness of your source tables. All you need to do is add a loaded_at field to your source definition, and under that add a freshness block.
- The loaded_at field should be a column from the source data that reliably captures when that row was loaded into the warehouse. This is usually something like an updated_at field.
- Your freshness blocks will then define the acceptable amount of time between the most recent record, and now, for a table to be considered "fresh". In this example, we have configured a warning after 12 hours and an error after 24 hours.

---

## Snapshotting freshness

.denser-text[
```txt
$ dbt source freshness
Running with dbt=1.1.0
Found 5 models, 20 tests, 0 snapshots, 0 analyses, 130 macros, 0 operations, 0 seed files, 3 sources

17:29:17 | Concurrency: 4 threads (target='learn')
17:29:17 |
17:29:17 | 1 of 1 START freshness of retail.sales_orders......................... [RUN]
17:29:20 | 1 of 1 WARN freshness of retail.sales_orders.......................... [WARN in 2.91s]
17:29:21 | Done.
```
]

### How does dbt check freshness?

--

.denser-text[
* dbt runs a SQL statement

.left-column[
```sql
select
  max({{ loaded_at_field }}) as max_loaded_at,
  {{ current_timestamp() }} as snapshotted_at
from {{ source }}
```
]

.right-column[
```sql
select
  max(_etl_loaded_at) as max_loaded_at,
  convert_timezone('UTC', current_timestamp()) as snapshotted_at
from retail.sales_orders
```
]

* Then it uses Python to determine whether the delta is within the right range
]

--
demo
??? Demo Time: - "Let's see that in action!" - go to statement tab and run `select * from raw.jaffle_shop.orders` to show the metadata columns added by the loading tool (_etl_loaded_at) - add the freshness block to the sources.yml file - run `dbt source freshness` - highlight pass/warn/error - bonus: highlight sql and copy into statement tab to run raw outputs --- .left-column-33[ ## In dbt Cloud: Adding this step to a job helps you understand if one of your data sources is out of date. ] .right-column-66[ ```bash dbt source freshness ```
]

???
Behind the scenes, dbt will calculate the difference between the max of the loaded_at field, so the most recent row uploaded, and the time at which the snapshot is being run. The results of this query are used to determine whether the source is fresh or not.

**Demo:** Use this as an opportunity to demonstrate the source-freshness feature. Do the following over in dbt Cloud:
- Copy the freshness configuration from the `Feeling Fresh?` slide
- Run `dbt source freshness`
- Show the code that is run against the warehouse to execute this
- Tweak the criteria to 1 hour warn and 2 hour error to show how that changes the results of the command

The dbt Cloud source freshness page is accessible to dbt Cloud developers but also to those with read-only seats - providing a friendly self-service UI for stakeholders to check on the freshness of source data. (Which hopefully means fewer emails asking whether data from [insert your favorite/most problematic source system] is up to date in the BI tool.)

---

class: subtitle

## Checkpoint

Given the following YML file, what will the following select statements compile to?

.left-column[
`sources.yml`
```yml
version: 2

sources:
  - name: event_data
    schema: snowplow
    tables:
      - name: events
        identifier: snowplow_events
```
]

.right-column[
`SQL`
```sql
select * from {{ source('event_data', 'events') }}
```
]

--

.right-column[
**compiles to:**
```sql
select * from snowplow.snowplow_events
```
]

???
**Check for Understanding:** Ask the group to predict what the code on the right will compile to based on the YML file on the left. Learners will have to pay close attention to the YML configuration
- Slide 1: Compiles to `snowplow.snowplow_events`
- Slide 2: Compiles to `payment_data.stripe_payments`

---

class: subtitle

## Checkpoint

Given the following YML file, what will the following select statements compile to?

.left-column[
`sources.yml`
```yml
version: 2

sources:
  - name: payment_data
    tables:
      - name: payments
        identifier: stripe_payments
```
]

.right-column[
`SQL`
```sql
select * from {{ source('payment_data','payments') }}
```
]

--

.right-column[
**compiles to:**
```sql
select * from payment_data.stripe_payments
```
]

---

class: subtitle

## Checkpoint

### Questions??

???
Pause for questions on sources based on what was covered.

---

class: subtitle

# Knowledge check

You should be familiar with how to:

* Define sources
* Select from a source in a model
* Bonus: Check the freshness of your source

---

# Milestone 1 - Create sources

Within the dbt Cloud Develop Interface (IDE)

1. Open a development branch called `dbt-databricks-dais-22`
2. Delete the example folder in the models directory
3. Create a new folder in models called `staging`
4. Create a new folder in staging called `retail`
5. Create a new file in retail called `_sources.yml`
6. Configure sources for `customers`, `loyalty_segments`, and `sales_orders`
7. Use the source macro to select from sources in the Scratchpad
8. Commit your work with the commit message `configure sources`

---

## Milestone 1 - Create sources (1/4)

Within the dbt Cloud Develop Interface (IDE)

- Open a development branch called `dbt-databricks-dais-22`

.left-column[
] .right-column[
]

---

## Milestone 1 - Create sources (2/4)

Within the dbt Cloud Develop Interface (IDE)

- Delete the example folder in the models directory
- Create a new folder in models called `staging`
- Create a new folder in staging called `retail`
- Create a new file in retail called `_sources.yml`

---

## Milestone 1 - Create sources (3/4)

Configure sources for `customers`, `loyalty_segments`, and `sales_orders`

_Copy the snippet below into your _sources.yml file and update to configure sources for the three tables that we loaded_

```yml
version: 2

sources:
  - name: jaffle_shop
    tables:
      - name: orders
      - name: customers
```

**
sources documentation
**

---

## Milestone 1 - Create sources (4/4)

Use the source macro to select from sources in the Scratchpad

_In a scratchpad in dbt Cloud, run the following snippet to demonstrate selecting from the sources you configured_

```sql
select * from {{ source('retail','customers') }}
```

- Commit your work with the commit message `configure sources`

---

# Milestone 1 - BONUS!

- Check out the docs on [source freshness](https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#snapshotting-source-data-freshness)
- Add a source freshness test to the sales_orders model
  - Hint: You will need to use a timestamp column
- Run the freshness test with `dbt source freshness`

---

# Milestone 1 - Create sources

At this point in the course, you should have:

- Added the following path to your dbt project `models/staging/retail`
- Created a new file called `_sources.yml`
- Configured sources for `customers`, `loyalty_segments`, and `sales_orders`
- Verified that you can select from these sources in dbt Cloud
- Committed your work!

---
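## Going further: testing sources directly

Earlier we mentioned that sources can be tested just like models. If you want to try it, tests attach to source columns in the same `_sources.yml` file. A minimal sketch, assuming the raw `customers` table has a `customer_id` column:

```yml
version: 2

sources:
  - name: retail
    schema: retail
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
```

You would run these with `dbt test --select source:retail` - more on testing later in the course.

---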
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
---

class: title, center, middle

# Models

---

# Scenario

We now have all the raw data in the retail schema. Let's just point our business intelligence tool at the raw data and handle all the joins and aggregations in the reporting layer.

**Discuss:** What could go wrong? What impacts would that have on the business?

---

### What could go wrong?

- Repeated code >> hard to update in multiple places
- Diverging business logic >> different numbers reported
- Wasted compute >> $$$

---

# Historically, data has been transformed with lots of coordination of DDL/DML statements.

Run a `create table as` statement!

```sql
create table dbt_user.fct_customers as (

    ...

)
```

--

Uh oh...

.dense-text[
* What if the schema doesn't exist yet?
* What if the schema does exist, and I try to recreate it?
* What if this table already exists?
* What if I want to parameterize the schema? (And I should!)
* What if this depends on another transformation?
]

---

# Models in dbt

- Models are `select` statements that transform data.
- Stored in the `models` directory, and have a `.sql` extension
- A model represents a single logical step of transformation logic

---

# What does dbt do with models?

dbt will build your models into your **target schema** based on your environment

--

dbt will handle how that model is built (table/view/etc) through **materializations**

--

We can build dependencies between models with the **ref function**

---

## What we are building in dbt

.center[
] --- ## Architecture of Databricks and dbt Cloud .center[
]

---

# Target Schema

The **target schema** is the _schema_ that dbt will build your models into.

- This is set at the user profile level for development (e.g. dbt_kcoapman, dbt_flang, dbt_user)
- This is set at the environment level for deployment (e.g. analytics, dbt_prod, default)
- *However there are ways to write specific models to specific schemas...wait for it...

--

This enables a shared code base that can be applied to multiple environments without having to alter the code.

---

# Materializations

**materializations** are build strategies - i.e. the SQL that your `select` statement gets wrapped in

- Models can be materialized as `views` and `tables`
- You need to **configure** a model's materialization
- Otherwise, this process could look like writing a lot of DDL/DML in the SQL Editor or another interface.

---

# Configurations

**configurations** are model settings

- They can be set in your `dbt_project.yml` file, _and_ in your model file.
- Configurations are applied _hierarchically_
- Additional configurations:
  - `tags: ['nightly', 'hourly']`
  - `enabled: True`
  - `schema: marketing`

---

# What is a `.yml` file?

YAML is a (relatively) human-friendly way to write dictionaries and arrays. It's often used for configuration files.

We use YAML a lot, so it's worth making sure you have a good handle on it (esp. the importance of indentation!)

---

# Some example YAML

```yml
sandwiches:
  - name: Cuban
    description: A ham and cheese sandwich that likely originated in Tampa or Key West.
    ingredients:
      - name: Cuban Bread
      - name: Ham
      - name: Swiss Cheese
      - name: Yellow mustard
  - name: Peanut Butter and Jelly
    description: A simple sandwich with just PB&J, popular in elementary schools.
  - name: Italian Sub
    description: Long bread roll with salami, mortadella, capocollo and ham with a lot of other stuff!
  - name: Jaffle
    description: An Australian take on a grilled cheese with sealed edges.
```

---

# ref function

- Use the `ref` function to build dependencies between models
- The `ref` function _also_ compiles to the full table/view name, using the correct schema for your environment
- These dependencies form a **Directed Acyclic Graph (DAG)**
- dbt uses this DAG to understand the order to run models in
- This makes it easy to split long queries up into separate models

.center[
] --- ## What we are building in dbt .center[
] --- background-color: #005E7A class:subtitle # A typical dbt DAG .center[
]

---

# What happens when you run the command: `dbt run`?

1. dbt connects to Databricks through the **SQL Warehouse**
2. dbt parses your **dbt project**
3. dbt wraps your **models** in the appropriate DDL, based on the **materialization** you **configure** for a model (e.g. `create table as`)
4. dbt executes this SQL to build your models in your **target schema**

---

class: subtitle

## Checkpoint:

1. How does dbt know what schema to build my models in?
2. How does dbt know to build the model as a table or view?
3. How do we build a dependency between models?
4. How do we build a dependency between a model and a source?

--

Target schema!

--

Materialization config!

--

The ref function!

--

The source function!

---

class: subtitle

## Checkpoint:

### Questions??

---

# Two of the common _types_ of models

.left-column[
### Staging Models:

- The shape you _wish_ the data came in
- Source conformed
- 1:1 with raw data tables
- Prefixed with `stg_`
- Renaming, type casting, coalescing + nullif'ing
]

.right-column[
### Marts Models:

- Core business entities and processes
- Business conformed
- Typically involve joins and aggregations
- Prefixed with `dim_` or `fct_`
]

---

## What we are building in dbt

.center[
] --- # Knowledge check You should be comfortable with these words: .left-column[ * models * target schema * materializations * configs ] .right-column[ * refs * DAGs * Staging models * Marts models ] --- # Milestone 2 - Create staging models 1. Create 2 new files in models/staging/retail, `stg_loyalty_segments.sql` and `stg_customers.sql` 2. Copy SQL code for staging customers in `stg_customers.sql` 3. Write SQL code for staging loyalty_segments in `stg_loyalty_segments.sql` 4. In dbt_project.yml, configure all staging models to be views 5. Preview the data for each model 6. Execute `dbt run` to build the models 7. Commit your work with the commit message `create staging models` --- ## Milestone 2 - Create staging models (1/6) - Create 2 new files in retail, `stg_loyalty_segments.sql` and `stg_customers.sql` - Copy SQL code for staging customers into `stg_customers.sql`
link to SQL snippet for stg_customers.sql
---

## Milestone 2 - Create staging models (2/6)

Write SQL code for staging loyalty_segments in `stg_loyalty_segments.sql`

```sql
with source as (

    select * from {{ source('retail','loyalty_segments') }}

),

staged as (

    select
        loyalty_segment_id,
        loyalty_segment_description,
        unit_threshold,
        from_unixtime(valid_from, 'yyyy-MM-dd') as valid_from_date,
        from_unixtime(valid_to, 'yyyy-MM-dd') as valid_to_date
    from source

)

select * from staged
```
link to SQL snippet for stg_loyalty_segments.sql
---

## Milestone 2 - Create staging models (3/6)

In dbt_project.yml, configure all staging models to be views

```yml
models:
  my_new_project:
    staging:
      +materialized: view
```

---

## Milestone 2 - Create staging models (4/6)

Preview the data for each model

.center[
] --- ## Milestone 2 - Create staging models (5/6) Execute `dbt run` to build the models .center[
] --- ## Milestone 2 - Create staging models (6/6) - Commit your work with the commit message `create staging models` --- ## Milestone 2 BONUS! - Try creating your own staging model for the sales_orders source. - This should also be in models/staging/retail --- # Milestone 2 - Create staging models At this point in the course, you should have: - Created two new SQL files in the `models/staging/retail` directory - Written transformation code for both staging models - Configured staging models as views - Previewed the data - Built the models in Databricks with `dbt run` - Committed your work! --- # Milestone 3 - Create dim_loyalty_segments 1. Create a new folder in models called `marts` 2. Create a new folder in marts called `core` 3. Create a new file in core called `dim_loyalty_segments.sql` 4. Copy SQL code for modeling loyalty_segments in `dim_loyalty_segments.sql` 5. Update the `from` clauses to leverage the `ref` function. 6. Configure marts models to be materialized as tables 7. Execute `dbt run` to build the models 8. Commit your work with the commit message `create dim_loyalty_segments model` --- ## Milestone 3 - Create dim_loyalty_segments (1/3) - Create a new folder in models called `marts` - Create a new folder in marts called `core` - Create a new file in core called `dim_loyalty_segments.sql` --- ## Milestone 3 - Create dim_loyalty_segments (2/3) - Copy SQL code for modeling loyalty_segments in `dim_loyalty_segments.sql` - Update the `from` clauses to leverage the `ref` function. - Execute `dbt run` to build the models - Commit your work with the commit message `create dim_loyalty_segments model` **
link to dim_loyalty_segments.sql code
** **
ref function documentation
**

---

## Milestone 3 - Create dim_loyalty_segments (3/3)

In dbt_project.yml, configure marts models to be materialized as tables (and keep staging models as views)

```yml
models:
  my_new_project:
    marts:
      +materialized: table
    staging:
      +materialized: view
```

---

# Milestone 3 BONUS!

- Try adding the schema config to a model in a config block (a sketch follows at the end of this section)
- [model configuration documentation](https://docs.getdbt.com/reference/model-configs)
- [schema config documentation](https://docs.getdbt.com/reference/resource-configs/schema)
- See if you can figure out how this changes the behavior of dbt

---

# Milestone 3 - Create dim_loyalty_segments

At this point in the course, you should have:

- Created a new directory `models/marts/core`
- Created a new file called `dim_loyalty_segments.sql`
- Added and updated transformation code to dim_loyalty_segments
- Configured marts models to be materialized as tables
- Rebuilt all models with `dbt run`
- Committed your work!

---
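## Sketch: a config block for the Milestone 3 bonus

Configurations can also live at the top of a model file in a Jinja config block. A sketch for `dim_loyalty_segments.sql` (the `marketing` schema name is only an example, and the select is a placeholder for your existing model SQL):

```sql
{{
    config(
        materialized='table',
        schema='marketing'
    )
}}

-- placeholder: keep your existing dim_loyalty_segments SQL here
select * from {{ ref('stg_loyalty_segments') }}
```

Note that, by default, dbt appends a custom schema to your target schema, so in development this would build into something like `dbt_user_marketing`.

---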
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
---

class: title, center, middle

# Tests and Documentation

---

class: title, middle, center

# _Why are testing and documentation important?_

## What makes it challenging?

---

# Tests and Documentation

Tests and documentation can be configured alongside each other.

.left-column[.dense-text[
### Tests

**Tests** are assertions you make about your **models**. When your assertion is true, your tests pass.

You configure tests in YML, in the `models` directory.

Behind the scenes, tests are just `select` statements!
]]

.right-column[.dense-text[
### Documentation

**Documentation** refers to a site that will capture a ton of information about your data.

This is automated and will pull information from your project and data platform. You can also manually enter information about models and sources.
]]

---

# Examples of Tests:

.left-column[
```yml
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - not_null
```
]

???
"" Here's an example of a test we configured in one of our .yml files which ensures that our customer_id key in our customers model is not null. ""

--

.right-column[
```sql
select count(*)
from dbt_user.dim_customers
where customer_id is null
```
]

???
"" Here's the underlying SQL that runs for the not_null test using results of our customers table. It returns the number of records in our customers table where our customer_id is null, which will result in a failure if rows are returned. ""

---

# Examples of Tests:

.left-column[
```yml
version: 2

models:
  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
```
]

???
"" Now we're going to add a test to our configuration to ensure that every customer_id in the customers table is unique. ""

--

.right-column[
```sql
select count(*)
from (
    select customer_id
    from dbt_user.dim_customers
    group by customer_id
    having count(*) > 1
)
```
]

???
"" Here's the underlying code - you can see that it returns all customer_ids that have more than one record associated with that id. If any rows return, then the test fails. ""

---

# dbt's built-in tests:

.left-column[
* **`unique`**
* `not_null`
* `accepted_values`
* `relationships`
]

.right-column[
Every value in this column is unique.
]

???
"" dbt comes packaged with four out-of-the-box tests. Unique: which, as we saw in our example, tests that each value in the specified column has no more than one row in the results. ""

---

# dbt's built-in tests:

.left-column[
* `unique`
* **`not_null`**
* `accepted_values`
* `relationships`
]

.right-column[
Every value in this column is not null.
]

???
"" Not null, which tests that every record has a value for the specified column. ""

---

# dbt's built-in tests:

.left-column[
* `unique`
* `not_null`
* **`accepted_values`**
* `relationships`
]

.right-column[
Every value in this column is a value from a given list.

```yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values:
                - placed
                - shipped
                - completed
                - returned
```
]

???
"" Accepted values, which tests that the distinct values in a column are within a specified list. ""

[Teacher Notes:]
Example: If our orders table has a status for each order placed and in a model after this we've pivoted our data so we can see total number of orders per status, then we'd want to be notified if an order status outside of our four expected values pops up. We'd need to add any additional values as a column in our pivot to avoid under-counting orders.

Note: If there is a NULL in a column which has an accepted_values test applied, the test will fail.
In order to get around this, you can COALESCE() your value or write a custom test. --- # dbt's built-in tests: .left-column[ * `unique` * `not_null` * `accepted_values` * **`relationships`** ] .right-column[ Each value in this column exists in the column of another table. ```yml version: 2 models: - name: stg_orders columns: - name: customer_id tests: - relationships: to: ref('stg_customers') field: customer_id ``` ] ??? "" and relationships, which ensures referential integrity. In other words, it tests that a column in your model has a related record in a different table. "" --- # Additional Tests?? - **Singular Tests:** Custom SQL to test a specific model - **Custom Generic Tests:** Custom parameterized tests (like unique, etc.) that you write in dbt - **Packages:** Plenty of packages with tests written by other dbt developers! --- # How do we run tests? -- Run all tests: `dbt test` -- Run specific tests: `dbt test --select one_specific_model` ??? "" After we've implemented our tests, how do we run them? We just need a command - dbt test. If we wanted to only run tests for a specific model, we could use selection syntax with -s to specify which model we want to test. "" [Teacher Notes:] You can exemplify this in dbt Cloud. If you've already exemplified this, just give a recap here. β --- class: subtitle ## Checkpoint .left-column[ ```yml version: 2 models: - name: payments columns: - name: payment_id tests: - not_null - unique ``` ] .right-column[ 1. What test(s) are being run in this project? 2. What are they running on? ] -- .right-column[ **Answers:** - not null and unique! - On the payment_id column on the payments model ] --- class: subtitle ## Checkpoint .left-column[ ```yml version: 2 models: - name: payments columns: - name: payment_method tests: - accepted_values: values: - credit_card - paypal - ach_transfer ``` ] .right-column[ 1. What test(s) are being run in this project? 2. What are they running on? ] -- .right-column[ **Answers:** - accepted_values! - On the payment_method column on payments ] ??? "" What about this one? "" [Teacher Notes:] Prompt the class to answer these questions. Answers: - The test being ran is the accepted_values test. - It's testing the payment_method column. - If it fails, it tells us that the column contains a value that is not credit_card, paypal, or ach_transfer. --- class: subtitle # Knowledge check You should be starting to see: * Add tests to your models * Run your tests (all of them / one model at a time) * Explain how dbt tests work * Find the compiled SQL for tests --- # Generating documentation .denser-text[ When dbt generates documentation it takes into account: * Everything it knows about your project: * Descriptions (ofc!) * Model dependencies * Model SQL * Sources * Tests * Additional things about your warehouse (from the information schema) * Column names and types (even if they aren't documented!) * Table stats This information is used to populate a documentation website. 
] --- ### Adding descriptions .dense-text[ Descriptions can be added to models, sources, and columns (in your existing `.yml` files) ] .left-column[ .denser-text[ ```yml version: 2 models: - name: stg_orders description: One record per order columns: - name: order_id tests: - unique - not_null - name: status description: "{{ doc('order_status') }}" tests: - accepted_values: values: ['placed', 'shipped', 'completed', 'return_pending', 'returned'] - name: amount description: Amount in USD ``` ] ] .right-column[ .denser-text[ ```yml version: 2 sources: - name: jaffle_shop description: A replica of the postgres database database: raw tables: - name: orders description: One record per order columns: - name: id tests: - unique - not_null - name: status description: "{{ doc('order_status') }}" tests: - accepted_values: values: ['placed', 'shipped', 'completed', 'return_pending', 'returned'] ``` ] ] ??? Demo in dbt cloud. -- .dense-text[ What's the `{{ doc('order_status') }}`? ] --- ### `docs` blocks `docs` blocks allow you to write long-form descriptions. Docs blocks are added to `.md` files in your `models/` directory. .denser-text[ ```md {% docs order_status %} One of the following values: | status | definition | |----------------|------------------------------------------------------------| | placed | Order placed but not yet shipped | | shipped | Order has been shipped but hasn't yet been delivered | | completed | Order has been received by customers | | return_pending | Customer has indicated they would like to return this item | | returned | Item has been returned | {% enddocs %} ``` ] ??? - Docs blocks are great for special formatting but also allow for more modular documentation. A given .md file can be referenced and used in multiple descriptions without having to copy and paste any text. ??? Demo Time: Regular docs: - "Let's see that in action!" - add regular model and/or column descriptions to your yml file - run `dbt docs generate` - show where those end up in the docs Doc blocks: - add a doc block file `order_status.md` - paste in the above markdown code ensuring to add newline white spaces - add description: `"{{ doc('order_status') }}"` - run `dbt docs generate` - within docs, find the doc block and talk about it - talk through all the other data shown (metadata, columns, types) - talk through DAG / lineage graph and model selection syntax --- ## dbt Cloud - Development .left-column[ _Generate_
] .right-column[ _Serve_
] ??? - No need to run the 'dbt serve' command in dbt cloud in order to view dbt docs. --- ## dbt Cloud - Deployment .left-column[ _Generate_
] .right-column[ _Serve_
]

---

class: subtitle

## Checkpoint

- What is the command for running tests?
- What is the command for generating documentation?
- What can you add descriptions for on models?
- What can you add descriptions for on sources?

--

`dbt test`

--

`dbt docs generate`

--

Models and Columns

--

Sources, Tables, and Columns

--

---

class: subtitle

## Checkpoint

### Questions??

---

class: subtitle

# Knowledge check

You should be familiar with how to:

* Add descriptions to your dbt project
* Generate documentation for your project

---

# Q: Why might a test fail?

1. The SQL in your model doesn't do what you intended.

--

2. An assumption about your source data is wrong.

--

3. A previously-true assumption about your underlying source data is no longer true

---

# Q: What should you test?

* Good rules of thumb:
  * Test primary keys (`unique`, `not_null`)
  * Test fields that you use in downstream models

---

# Q: What should you document?

Our rule of thumb: anything that is not immediately obvious. For example:

* Sources:
  - How the source data came to be in the warehouse
  - Known caveats
  - The definitions of poorly named fields and tables
* Final models:
  - The grain of the model. Potentially a description of how the model was built.
  - Column definitions, especially for columns that include business logic
* Intermediate models
  - As required

---

# Q: When should you run tests?

* Manual: When you first run a project
* Manual: During development
* Automated: When you run dbt on a schedule
* Automated: When you want to merge your code

???
"" There are two places where we should manually test:
- when we first run a project or before making changes - this ensures failures aren't caused by anything you changed.
- when we change things in development

Then, there are two places where we should automate tests:
- When we're running dbt on a schedule - this ensures we're continually checking for any bad code that was merged. With good checks beforehand this should be infrequent, but a good check to have in place nonetheless.
- When we open a pull request - this ensures that if a developer didn't do their due diligence, that we have an automated process that can catch things. ""

---

# Milestone 4 - Add testing and documentation

1. In the models/staging/retail folder, create a new file called `_schema.yml`
2. Copy YML code into `_schema.yml`
3. Adjust `_schema.yml` to add unique and not_null tests to primary keys on both staging models
4. Add descriptions to models in `_schema.yml` to document both staging models
5. Add descriptions to columns in `_schema.yml` to document 3 or more columns in `stg_customers.sql`
6. Execute `dbt test` to test your models
7. Execute `dbt docs generate` to generate your documentation
8. Click the link to docs to view your documentation
9. Browse around the docs site to find your documentation
10. Commit your work with the commit message `add tests and documentation`

---

## Milestone 4 - Add testing and documentation (1/3)

In the models/staging/retail folder, create a new file called `_schema.yml`

---

## Milestone 4 - Add testing and documentation (2/3)

- Copy YML code below into `_schema.yml`
- Adjust `_schema.yml` to add unique and not_null tests to primary keys on both staging models
- Add descriptions to models in `_schema.yml` to document both staging models
- Add descriptions to columns in `_schema.yml` to document 3 or more columns in `stg_customers.sql`

```yml
version: 2

models:
  - name: orders
    description:
    columns:
      - name: order_id
        description:
        tests:
          - unique
          - not_null
```

**
testing documentation
** **
docs documentation
** --- ## Milestone 4 - Add testing and documentation (3/3) - Execute `dbt test` to test your models - Execute `dbt docs generate` to generate your documentation - Click the link to docs to view your documentation - Browse around the docs site to find your documentation - Commit your work with the commit message `add tests and documentation` --- ## Milestone 4 BONUS! - Add a doc block to one of your columns - [doc block documentation](https://docs.getdbt.com/docs/building-a-dbt-project/documentation#using-docs-blocks) - Add a singular test to your project - [singular tests documentation](https://docs.getdbt.com/docs/building-a-dbt-project/tests#singular-tests) --- # Milestone 4 - Add testing and documentation At this point in the course, you should have: - Created a new file called `_schema.yml` - Added tests to the primary key in the staging models - Added descriptions to both staging models - Added descriptions to 3 columns on `stg_customers.sql` - Run your tests by executing `dbt test` - Generated and viewed your documentation - Committed your work! ---
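## Sketch: a singular test for the Milestone 4 bonus

A singular test is just a SQL file in your `tests/` folder; if it returns any rows, the test fails. A sketch using the staging model we already built (the business rule here is only an example):

```sql
-- tests/assert_valid_loyalty_segment_dates.sql
-- any rows returned are treated as failures
select
    loyalty_segment_id,
    valid_from_date,
    valid_to_date
from {{ ref('stg_loyalty_segments') }}
where valid_from_date > valid_to_date
```

It runs alongside everything else when you execute `dbt test`.

---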
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
---

class: title, center, middle

# Deployment

---

# We've built our first models... now what?

--

- How will our team members access them? They are currently in a schema named `dbt_user`

--

- What happens if our underlying data gets updated? (It should!)

--

- What happens if we add new models, or change existing ones?

---

# Solution:

--

- Run dbt on a schedule
- Use a "trusted" schema for this run

Also known as:

- Running dbt in production
- Deploying your dbt project

---

## What we are building in dbt

.center[
] --- ## Architecture of Databricks and dbt Cloud .center[
]

---

# Deployment

The scheduler in dbt Cloud is purpose-built for running dbt jobs on a schedule.

- ✅ Always pulls the latest version of your project (continuous deployment)
- ✅ Easy to build on Pull Request (and tear down the schema afterwards - continuous integration)
- ✅ Easy to generate and share documentation with users
- ✅ Secure access model
- ✅ Email and Slack notifications if something goes wrong, with rich logs
- ✅ Persist run artifacts for later comparison and analysis

---

# Environments

- Environments are the different places that you run your dbt project
- Two types: Development and Deployment

---

# Environments Overview

.center[
]

---

# Development Environment

This is where we have been working throughout the course thus far

- We can toggle between multiple branches
- We manually run commands while we develop
- Everything is built in our own development target schema

---

# Deployment Environment

This is where we will promote our code to run

- Each environment is associated with a branch
- Command(s) are set to run on a schedule, an API call, or manual kick-off
- Everything is built in a designated deployment schema

---

# Jobs

- Jobs are a set of commands and settings that are used to run dbt on a schedule
- These are associated at the environment level

---

# Runs

Runs are instances of jobs

- You can monitor these while they are running
- Once they have finished you can see the results and grab artifacts from the run

---

# `dbt build` command

The `dbt build` command will...

- run models
- execute tests
- run snapshots*
- seed seeds*

in DAG order for selected resources or the entire project

*These are two features that we did not touch on in this training.

---

# The commands you should run as part of your dbt job

.left-column[
#### At a minimum:

```txt
$ dbt build
```
]

--

.right-column[
#### Our preferred setup:

```txt
$ dbt source freshness
$ dbt test -s source:*
$ dbt build --exclude source:*
```
]

---

# Additional strategies

Recommended:

- Separate "slim" CI job → fast runs, clear notifications
- Weekly full refresh job → more on this later!

Seen in the wild:

- Separate jobs for different parts of your project

---

class: subtitle

# Knowledge check

You should start to be familiar with how to:

- Explain why you need to deploy your dbt project
- Deploy your dbt project with dbt Cloud

---

# Milestone 5 - Run dbt in production

1. Merge your development branch to main.
2. Navigate to jobs and find the Databricks Partner Connect Trial Job
3. Click on the job and click Run now to kick off the job
4. Once the job has finished, review the results!

---

# Milestone 5 - Run dbt in production (1/4)

- Merge your development branch to main

.center[
] --- # Milestone 5 - Run dbt in production (2/4) - Navigate to jobs and find the Databricks Partner Connect Trial Job - Click on the job and click Run now to kick off the job .center[
] --- # Milestone 5 - Run dbt in production (4/4) - Once the job has finished, review the results! .center[
]

---

# Milestone 5 BONUS!

- Read up on Environment Variables in dbt Cloud
- [Environment variables docs](https://docs.getdbt.com/docs/dbt-cloud/using-dbt-cloud/cloud-environment-variables)
- Try using an environment variable in development and deployment in a SQL comment on a model and see how it works (a sketch follows on the next slide)

---

# Milestone 5 - Run dbt in production

At this point in the course, you should have:

- Merged your development branch to main
- Found the Databricks Partner Connect Trial job and kicked it off
- Reviewed the results of the run

---
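## Sketch: an environment variable for the Milestone 5 bonus

dbt's `env_var()` function reads environment variables at runtime, and dbt Cloud requires custom variables to be prefixed with `DBT_`. A sketch, where `DBT_MY_ENV_NAME` is a made-up variable you would define yourself in both your development and deployment environments:

```sql
-- Jinja is rendered before the SQL runs, so the value shows up in the compiled code
-- built in environment: {{ env_var('DBT_MY_ENV_NAME', 'not set') }}

-- placeholder model body
select * from {{ ref('stg_customers') }}
```

Compare the compiled SQL from a development run and a job run to see the different values.

---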
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
--- class: title, middle, center # Databricks SQL Part 2 ---
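## Try it: query what the job built

One thing to explore in this part is querying the models your production job just built, straight from the Databricks SQL editor. A sketch, assuming your deployment environment builds into the `default` schema (check the job's target schema if yours differs):

```sql
-- dimension table created by the scheduled dbt job
SELECT *
FROM default.dim_loyalty_segments
LIMIT 10;
```

---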
# Agenda

- Welcome
- Pre-course set up
- Databricks SQL, Part 1
- Sources
- Models
- Testing
- Documentation
- Deployment
- Databricks SQL, Part 2
- Wrap Up, Survey, Next Steps
--- class: title, center, middle # Wrap Up, Survey, Next Steps --- # Wrap up **Congratulations on completing the course and learning the fundamentals of dbt + Databricks!** You have just leveled up your data skills with: - Leveraging dbt to manage data transformations in Databricks through the SQL warehouse - Transforming data while building dependencies and managing environments - Applying tests to transformations to ensure data quality - Writing and generating documentation of data models - Orchestrating the transformation of data on a schedule or using the API - Analyzing your results using SQL queries on the data lake --- class: title, middle, center # Survey - 5 minutes ###
Click for Survey
---

# Next Steps

.left-column[
- Continue learning...
  - Databricks SQL on [Databricks Academy](https://databricks.com/learn/training/home)
  - dbt on [dbt Learn](https://courses.getdbt.com)
- ...then earn credentials with...
  - [Databricks certification](https://databricks.com/learn/certification/data-engineer-professional)
  - [dbt Certification](https://www.getdbt.com/certifications/analytics-engineer-certification-exam/)
]

.right-column[
.center[
] ]