class: title, center, middle # Sources ??? **Continue the story:** Make a connection to content that was just covered and how we might build off it. Example: "Excellent! We just concluded testing with Alexis and I'm excited to roll out sources with you all. You may have noticed that we have been referencing raw tables in some of our models - we are going to figure out why that might be a problem for the future and what we might do about it!" --- # If you've ever had this happen... - I'm switching my email marketing platform. I want to know which final models depend on that source data. ??? **Hook:** Use this an an opportunity to drive engagement and get folks invested in this lesson. Examples: - Give me an emoji reaction on zoom for each statement that resonates with you. *narrate how popular/common each responses is.* - For a line that gets a lot of responses, call on 1-2 students to share their story of that case. -- - I'm switching from Redshift to Snowflake. All of my source data will be in a different database and schema. -- - Someone asked me "where does this data _really_ come from?" -- - Nothing in this world is perfect. My data loading tools experience occasional hiccups that result in delayed or duplicated data. -- - Something that was once true about my source data is no longer true (e.g. a new payment method was introduced) ??? Teaching Note: Think of a good story to share about a specific situation when dbt source functionality would have really helped avoid a painful mistake or burdensome work. --- # Sources may be right for you! - Easily declare and update the location of raw tables - `database` + `schema` for each source - `name`/`identifier` for each table -- - Validate the freshness of loaded data, based on thresholds you define ??? Read aloud the text on the slides: ---
Sources | Focus
Configure sources
Build dependencies between sources and models
Configure freshness snapshots
Build intuition for why defining sources is helpful
??? **Focus Slide:** This is your opportunity to bring focus to the rest of the session. You can do this by... - slowing down just a little bit and reading each line with a little context OR - asking a learner to read these to the room --- # Defining sources Sources are defined in `.yml` files in your `models/` directory. ```yml version: 2 sources: - name: jaffle_shop database: raw schema: jaffle_shop_original tables: - name: customers identifier: customers - name: orders identifier: orders ``` -- * The `identifier` field should match the table name in _your_ database * This provides you an opportunity to rename tables that may be poorly named * The `identifier` field is optional; if you're happy with your table names, you can simply put them in `name` ??? - Sources are defined in `.yml` files in your `models/` directory, where they have to be nested under a source key. You will have to provide a source name, which is usually the name of the schema of your raw table. Snowflake users, you can also provide a database name and then after that you can list your table names. - If one of your source tables is badly named, there is also an opportunity for you to provide an alias using the identifier key. --- # Selecting from sources Use the `{{ source() }}` function to select from a source in a model: .left-column[ ```sql with source as ( select * from {{ source('jaffle_shop', 'customers') }} ), renamed as ( ... ``` ] -- .right-column[ ```sql with source as ( select * from raw.jaffle_shop_original.customers ), renamed as ( ... ``` ] --
demo
??? Demo Time: - in the stg_customers model, replace the direct table references with the source macro - run `dbt run` - this fails! Why does it fail? (because we haven't added the sources.yml file. Do that now.) - add sources.yml (use text on Defining Sources slide) "Once you have defined a source in your .yml file, you can now utilize the source Jinja function to select from that source in a model."" - Preview data to show it works "The source function here will compile to the full object name in the database, so select star from source-jaffleshop will compile to raw.jaffleshop.customers, with the database name raw, which is **clicks back** how we defined it in our yml file."" - show Compiled sql to prove it translates to the database.schema.object expected - Asking learners for how to update the stg_orders and stg_payments model Then jump back to the slides to talk about source-freshness 🚀 --- # Sources are rendered as green nodes They help users understand where data comes from (lineage)
??? Demo: Still on the stg_customers model, open the lineage tab and identify the source and model nodes. --- # Feeling fresh? Optionally, specify a `loaded_at_field` and `freshness`: .denser-text[ ```yml version: 2 sources: - name: jaffle_shop database: raw schema: jaffle_shop_original tables: - name: customers identifier: customers - name: orders identifier: orders loaded_at_field: _etl_loaded_at freshness: warn_after: count: 12 period: hour error_after: count: 24 period: hour ``` ] Then run `dbt source freshness` ??? - Now that we've defined our sources and we know how to test them ** jump back to intro slide ** - Has that ever happened to you before? And if it has happened to you, do you have a reliable way of capturing that? - dbt has a command called source freshness that will allow you to check the freshness of your source tables. All you need to do is add a loaded_at field to your source definition, and under that add a freshness block. - The loaded_at field should be a column from the source data that reliably captures when that row was loaded into the warehouse. This usually something like an updated_at field. - Your freshness blocks will then determine define the acceptable amount of time between the most recent record, and now, for a table to be considered "fresh". In this example, we have configured checks to run essentially every 12 hours. --- ## Snapshotting freshness .denser-text[ ```yml $ dbt source freshness Running with dbt=0.16.0 Found 5 models, 20 tests, 0 snapshots, 0 analyses, 130 macros, 0 operations, 0 seed files, 3 sources 17:29:17 | Concurrency: 4 threads (target='learn') 17:29:17 | 17:29:17 | 1 of 1 START freshness of jaffle_shop_original.orders......................... [RUN] 17:29:20 | 1 of 1 WARN freshness of jaffle_shop_original.orders.......................... [WARN in 2.91s] 17:29:21 | Done. ``` ] ### How does dbt check freshness? -- .denser-text[ * dbt runs a SQL statement .left-column[ ```sql select max({{ loaded_at_field }}) as max_loaded_at, {{ current_timestamp() }} as snapshotted_at from {{ source }} ``` ] .right-column[ ```sql select max(_batched_at) as max_loaded_at, convert_timezone('UTC', current_timestamp()) as snapshotted_at from raw.stripe.payment ``` ] * Then it uses python to determine whether the delta is within the right range ] --
demo
??? Demo Time: - "Let's see that in action!" - go to statement tab and run `select * from raw.jaffle_shop_original.orders` to show the metadata columns added by the loading tool (_etl_loaded_at) - add the freshness block to the sources.yml file - run `dbt source freshness` - highlight pass/warn/error - bonus: highlight sql and copy into statement tab to run raw outputs --- ## In dbt Cloud: Adding this step to a job helps you understand if one of your data sources is out of date. ```bash dbt source freshness ```
??? Behind the scenes, dbt will calculate the difference between the max of the loaded_at field, so the most recent row uploaded, and the time at which the snapshot is being run. The results of this query are used to determine whether the source is fresh or not. **Demo:** Use this an opportunity to demonstrate the source-freshness feature. Do the following over in dbt Cloud: - Copy the freshness configuration from `Feeling Fresh?` slide - Run dbt source-freshness - Show the code that is run against the warehouse to execute this - Tweak the criteria to 1 hour warn and 2 hour error to show how that changes the results of the command The dbt cloud source freshness page is accessable to dbt cloud developers but also to those with read-only seats - providing a friendly self service UI for stakeholders to check on the freshness of source data. (Which hopefully means fewer emails asking whether data from [insert your favorite/most problematic source system] is up to date in the BI tool. --- class: subtitle ## Checkpoint Given the following YML file, what will the following select statements compile to? .left-column[ `sources.yml` ```yml version: 2 sources: - name: event_data database: raw schema: snowplow tables: - name: events identifier: snowplow_events ``` ] .right-column[ `SQL` ```sql select * from {{ source('event_data', 'events') }} ``` ] ??? **Check for Understanding:** Ask the group to predict what the code on the right will compile to based on the YML file on the left. Learners will have to pay close attention to the YML configuration - Slide 1: Compiles to `raw.snowplow.events` - Slide 2: Compiled to `raw.stripe.payments` --- class: subtitle ## Checkpoint Given the following YML file, what will the following select statements compile to? .left-column[ `sources.yml` ```yml version: 2 sources: - name: payment_data database: raw schema: stripe tables: - name: payments identifier: stripe_payments ``` ] .right-column[ `SQL` ```sql select * from {{ source('payment_data','payments') }} ``` ] --- class: subtitle ## Checkpoint - How does using sources, instead of directly referencing tables, impact our work as analytics engineers? ??? **Check for understanding:** Ask for 1-3 learners to share how they might use sources in their future workflow. -- ## Questions?? ??? Pause for questions on sources based off what was covered. --- class: subtitle #Hands-On (8 min)
Create:
`_sources.yml` file in `staging` folder
Sources YAML structure (copy & paste from docs.getdbt.com)
Link:
Staging layer to sources (use source function)
Find:
Source nodes in DAG
Define:
Source freshness on one of your tables
Test:
Source freshness using the appropriate command
--- class: subtitle # Knowledge check You should be able to: * Define sources * Select from a source in a model * Check the freshness of your source --- # Resources * [dbt Source](https://docs.getdbt.com/docs/build/sources) * [Only run source tests](https://docs.getdbt.com/faqs/tests/testing-sources) * [Source Configurations](https://docs.getdbt.com/reference/source-configs) * [Source Freshness - properties](https://docs.getdbt.com/reference/resource-properties/freshness) * [Source Freshness - exclude](https://docs.getdbt.com/faqs/project/exclude-table-from-freshness) * [Source Freshness - jobs](https://docs.getdbt.com/docs/deploy/source-freshness) --- class: subtitle #Zoom Out
dbt Learn Norms
Why dbt?
Modeling
Project Design
Sources
--
Testing
--
Docs
Deployment