class: title, center, middle

# dbt Materializations

---

# What is a materialization?

--

1. A special kind of **Jinja macro** that wraps your model SQL
  - Includes all the boilerplate DDL/DML: `create`, `alter`, `drop`, ...

--

2. An abstraction on top of the **database**
  - Does this model already exist in the database?
  - What if it exists as a table, but I want it to be a view?
  - What if Redshift uses `alter table` for views, but Snowflake uses `alter view`? (It's true!)

---
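# The boilerplate, concretely

A sketch of roughly what dbt runs on your behalf, assuming a hypothetical model `models/my_model.sql` that contains only a `select` statement. (Simplified: the exact DDL varies by adapter and by what already exists in the database.)

```sql
-- materialized='view': dbt wraps your select in view DDL
create or replace view analytics.my_model as (
    select 1 as id
);

-- materialized='table': same select, different wrapper
create table analytics.my_model as (
    select 1 as id
);
```

You never write this boilerplate yourself; the materialization macro does.

---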
class: subtitle

# Focus

- Name and describe the four standard materializations in dbt
- Explain some basic use cases for views, tables, and ephemeral materializations
---

# Standard materializations

.left-column-66[
- Table
- View
- Ephemeral
- Incremental
- Snapshot*

### You can write your own!

- Just like you can write your own macros, schema tests, etc.
- E.g. ["insert by period"](https://github.com/dbt-labs/dbt-utils#materializations)
]

.right-column-33[
]

???

- Materializations are some of the trickiest macros we write

---

# Imagine: some people

| id | name    | favorite_color | is_creative |
|----|---------|----------------|-------------|
| 1  | Alice   | green          | false       |
| 2  | Barbara | blue           | true        |

---

# Imagine: a model

`models/creative_people.sql`

```sql
select
    id as person_id,
    name,
    favorite_color
from raw_people
where is_creative = true
```

---

# Table

In the database: `analytics.creative_people`

| person_id | name    | favorite_color |
|-----------|---------|----------------|
| 2         | Barbara | blue           |

--

If we run:

```sql
select * from analytics.creative_people
```

We will see:

| person_id | name    | favorite_color |
|-----------|---------|----------------|
| 2         | Barbara | blue           |

---

# View

In the database: `analytics.creative_people`

```sql
select
    id as person_id,
    name,
    favorite_color
from raw_people
where is_creative = true
```

--

If we run:

```sql
select * from analytics.creative_people
```

We will see:

| person_id | name    | favorite_color |
|-----------|---------|----------------|
| 2         | Barbara | blue           |

---

# Ephemeral

- Does not exist in the database
- Reusable code snippet
- Interpolated as a CTE into any model that `ref`s this model

--

If we run:

```sql
select * from analytics.creative_people
```

We will see:

```sql
Database Error
```
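---

# Ephemeral, compiled

What "interpolated as a CTE" looks like. A sketch, assuming a downstream model that does `select * from {{ ref('creative_people') }}` while `creative_people` is ephemeral; dbt compiles that downstream model to something like this (the exact CTE name is an internal dbt detail):

```sql
with __dbt__cte__creative_people as (

    select
        id as person_id,
        name,
        favorite_color
    from raw_people
    where is_creative = true

)

select * from __dbt__cte__creative_people
```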
---

# What if...

| id | name    | favorite_color | is_creative |
|----|---------|----------------|-------------|
| 1  | Alice   | green          | false       |
| 2  | Barbara | blue           | true        |

--

| 3  | Casper  | red            | true        |

--

### What would you do?

* **Table:** Rebuild the whole thing from scratch! Brute force always works.
* **View:** Recalculate every time I'm queried. Always up to date, often slow.
* **Ephemeral:** That's not really my problem, yanno? I just sorta go with the flow.

--

* **Incremental:** I've got an idea!

---

.left-column[
# The Big Idea

* Let's keep the old table, and we'll just _add_ the _new_ records
* Less wasteful and probably faster
* We just need to know:
    * What we mean by "new"
    * What we mean by "add"
* As long as we have good answers, we're in good shape
]

.right-column[
]
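---

# The Big Idea, by hand

A sketch of the idea in plain SQL, before dbt gets involved. Here "new" means "an `id` bigger than any we already loaded" (an assumption we're about to interrogate), and "add" means a plain `insert`:

```sql
insert into analytics.creative_people

select
    id as person_id,
    name,
    favorite_color
from raw_people
where is_creative = true
  and id > (select max(person_id) from analytics.creative_people);
```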
---
layout: true

# Who is new? And how do you know?

---

| id | name    | favorite_color | is_creative |
|----|---------|----------------|-------------|
| 1  | Alice   | green          | false       |
| 2  | Barbara | blue           | true        |
| .highlight[3] | Casper | red | true |

--

- The auto-increment ID from our source data system is greater than the previous greatest auto-increment ID
- We can put unquestioning trust in our source data systems, and have never been burned before

--

... loljk

---

| id | name    | favorite_color | is_creative | created_date |
|----|---------|----------------|-------------|--------------|
| 1  | Alice   | green          | false       | 2020-03-01   |
| 2  | Barbara | blue           | true        | 2020-03-02   |
| 3  | Casper  | red            | true        | .highlight[2020-03-03] |

- A date when the record was initially created
- _Much_ more reliable than an auto-increment ID
- Records can only be created once and never updated later

--

... right?

---

| id | name    | favorite_color     | is_creative | created_date | updated_date |
|----|---------|--------------------|-------------|--------------|--------------|
| 1  | Alice   | green              | false       | 2020-03-01   | 2020-03-01   |
| 2  | Barbara | .highlight[yellow] | true        | 2020-03-02   | .highlight[2020-03-04] |
| 3  | Casper  | red                | true        | 2020-03-03   | 2020-03-03   |

--

| 4  | Zelda   | indigo             | true        | .highlight[2020-03-02] | .highlight[2020-03-02] |

--

- Did you know _that_ could happen?
- Or that _that_ could happen?
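---

The workhorse answer, as a sketch in plain SQL. It assumes an `updated_date` we can _mostly_ trust, and that we keep `updated_date` on our target table:

```sql
-- "new" = created OR updated since the newest row we already loaded
select
    id as person_id,
    name,
    favorite_color,
    updated_date
from raw_people
where is_creative = true
  and updated_date > (select max(updated_date) from analytics.creative_people)
```

This catches Barbara's update, as long as we "add" via upsert on `id` rather than plain insert. It does _not_ catch late-arriving Zelda. Hold that thought.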
---
layout: false
class: subtitle

# After lunch

.left-column[
### Track 1: Talking
- Incremental models (demo)
- Modern data warehouses
- Open floor: data tooling and ecosystem
]

.right-column[
### Track 2: Doing
- More Jinja, packages, custom schema tests
- Case study: modeling from scratch
- Project therapy
]

---
class: title, center, middle

# Incremental models

`where updated_date >= '2023-05-26'`

---

# Why?

* It costs time (and money) to transform data
* Historical data doesn't (shouldn't) change
* If you can get away _without_ re-transforming historical data, you save time and money

### Good candidates

- Immutable event streams: tall + skinny table, append-only, no updates
- If there _are_ any updates, a reliable `updated_at` field

---

# Why not?

* It's tricky, and there are trade-offs!
* Which is costlier: database time ($$), or human analyst time ($$$$)?
* It's probably not worth it if:
    * You have small-ish data
    * Your data changes constantly: new columns, renamed columns, etc.
    * Your data is updated in unpredictable ways
    * Your transformation performs comparisons or calculations that require _other_ rows (table + view)

---

# And if it works?

---

# Example: Snowplow events

* Page views and pings on **learn.getdbt.com**! One was fired just now, as you're reading this. And... now :)
* Create a source like so:

```yml
version: 2

sources:
  - name: snowplow
    database: raw
    loaded_at_field: collector_tstamp
    freshness:
      error_after: {count: 1, period: hour}
    tables:
      - name: event
```

---

### We've got a model: events → page views

.denser-text[
```sql
with events as (

    select * from {{ source('snowplow', 'event') }}

),

page_views as (

    select
        domain_sessionid as session_id,
        domain_userid as anonymous_user_id,
        web_page_context.value:data.id::varchar as page_view_id,
        page_url,
        count(*) * 10 as approx_time_on_page,
        min(derived_tstamp) as page_view_start,
        max(collector_tstamp) as max_collector_tstamp

    from events,
        lateral flatten (input => parse_json(contexts):data) web_page_context

    group by 1,2,3,4

)

select * from page_views
```
]

---

# Tactics

- Why do we roll up events into page views and sessions?
- How do we want to materialize this model?

--

- Start with `view`
- When it takes too long to query, switch to `table`
- When it takes too long to build, switch to `incremental`

--

- How can we make this incremental?

---
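# Where does that materialization live?

A sketch, assuming a project named `my_project`: set a folder-level default in `dbt_project.yml`, then upgrade individual models in their own files as they get slow.

```yml
# dbt_project.yml
models:
  my_project:
    snowplow:
      +materialized: view  # later: table, and eventually incremental per-model
```

---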
class: middle

.center[[Read it!](https://discourse.getdbt.com/t/on-the-limits-of-incrementality/303)]

---

# "The big easy"

_Keep it simple:_ This model rolls up **events** into **page views**. A page view contains one or more events with the same `page_view_id`.

The full model SQL is on the next slide. It's very similar to the model code we had previously, with a few additions. Pay close attention to:

* `unique_key`
* `is_incremental()`

_What do these mean?_ ([docs](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models/))

---

.denser-text[
```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'page_view_id'
) }}

with events as (

    select * from {{ source('snowplow', 'event') }}

    {% if is_incremental() %}
    where collector_tstamp >= (select max(max_collector_tstamp) from {{ this }})
    {% endif %}

),

page_views as (

    select
        domain_sessionid as session_id,
        domain_userid as anonymous_user_id,
        web_page_context.value:data.id::varchar as page_view_id,
        page_url,
        count(*) * 10 as approx_time_on_page,
        min(derived_tstamp) as page_view_start,
        max(collector_tstamp) as max_collector_tstamp

    from events,
        lateral flatten (input => parse_json(contexts):data) web_page_context

    group by 1,2,3,4

)

select * from page_views
```
]

---

# Special Jinja variables

### `{{ this }}`

Represents the currently existing database object mapped to this model.

### `is_incremental()`

Checks four conditions:

.dense-text[
1. Does this model already exist as an object in the database?
2. Is that database object a table?
3. Is this model configured with `materialized = 'incremental'`?
4. Was the `--full-refresh` flag passed to this `dbt run`?
]

**Yes + Yes + Yes + No == an incremental run**

???

If all of the conditions are met, `is_incremental()` returns `true`.

---

# Try it out!

* What SQL does dbt use when first building the `fct_page_views` model?
* What SQL does dbt use for every run thereafter?
* How long does each run take?
* Convince yourself that it's working:

```sql
select * from {{ source('snowplow', 'event') }}
where collector_tstamp >= (
    select max(max_collector_tstamp) from {{ ref('fct_page_views') }}
)
```

---
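# Spoiler: what dbt runs

A sketch of the answer, simplified from what you'd see in dbt's logs (Snowflake-flavored; the details vary by adapter and strategy):

```sql
-- First build: is_incremental() is false, so no filter is applied
create table analytics.fct_page_views as (
    select ...  -- the full model SQL
);

-- Every run thereafter:
-- 1. stage only the new and updated records
create temporary table fct_page_views__dbt_tmp as (
    select ...  -- the model SQL, with the is_incremental() filter applied
);

-- 2. reconcile them with the existing table, matching on unique_key
merge into analytics.fct_page_views as dest
using fct_page_views__dbt_tmp as src
    on dest.page_view_id = src.page_view_id
when matched then update set ...   -- every column
when not matched then insert ...;  -- brand-new page views
```

---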
# Conceptual framework

How is this idempotent?

- At any point, you could `run --full-refresh` and get the "true" table.
- The goal of incremental models is to _approximate_ the "true" table in a fraction of the runtime.

---

# Under the hood

What's the DDL/DML that dbt is running?

1. Create a temp table of new records
2. Reconcile the existing table with the temp table, using one of:
    - `merge`
    - `delete` + `insert` ("upsert")
    - `insert overwrite` ("replace") _entire partitions_

Depends on:

- database support
- user input: `incremental_strategy` ([docs](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/configuring-incremental-models/#what-is-an-incremental_strategy))

---
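# Choosing a strategy

A sketch: `incremental_strategy` lives in the model config, next to `unique_key`. Which values are supported depends on your adapter.

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'page_view_id',
    incremental_strategy = 'delete+insert'  -- e.g. instead of merge
) }}
```

---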
class: subtitle, center, middle

# How things fall apart

---

## Take our logic a step further

.denser-text[
With the rest of the model as before, we're going to add a new CTE. Replace

```sql
)

select * from page_views
```

with

```sql
),

indexed as (

    select *,
        row_number() over (
            partition by session_id
            order by page_view_start
        ) as page_view_in_session_index

    from page_views

)

select * from indexed
```

and give it a run!
]

---

# What if...

What if a session is split across our automated runs? We end up transforming half of a session's page views in one incremental run, and half in another.

---

# What to do?

In the simple approach, we would calculate two sets of page views:

* Computed from the first half of events in the session: `page_view_in_session_index` is 1, 2, 3...
* Computed from the second half of events in the session: `page_view_in_session_index` is 1, 2, 3... all over again

Why does this happen?

- Running dbt in the middle of a session
- Late-arriving events

---

# Late-arriving events

.left-column-66[
_(Chart: distribution of events by lateness. X axis: days between firing and collection; Y axis: % of all events.)_
]

.right-column-33[
- Dataset: 2.5 months of Snowplow data, 285mm events
- 99.82% of events arrive within 1 hr of firing
- 99.88% within 24 hr
- .highlight[99.93% within 72 hr]
]

---

# "Close enough & performant"

- _Always_ recalculate the 3 previous days (sketch below)
- Empirically, this will work for 99.93% of all events on the first try
- Full refresh on a weekly basis
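--

A sketch of the lookback filter, swapping the strict `max` comparison for a 3-day window (`dateadd` here is Snowflake's syntax):

```sql
{% if is_incremental() %}
where collector_tstamp >= (
    select dateadd('day', -3, max(max_collector_tstamp)) from {{ this }}
)
{% endif %}
```

Because the model has `unique_key = 'page_view_id'`, the reprocessed page views replace their earlier versions instead of duplicating them.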
---

# Take it one step further

.dense-text[
```sql
indexed as (

    select *,
        row_number() over (
            partition by session_id
            order by page_view_start
        ) as page_view_in_session_index,
        row_number() over (
            partition by anonymous_user_id
            order by page_view_start
        ) as page_view_for_user_index

    from page_views

)

select * from indexed
```
]

---

# "Always correct & slow"

- If a user has a new event, recalculate _all_ page views for that user (sketch below)
- This works with window functions! But it's **_slowwwww_**...
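--

A sketch of that filter, replacing the timestamp comparison inside the `events` CTE: pull _every_ event for any user who has at least one new event.

```sql
{% if is_incremental() %}
where domain_userid in (

    select distinct domain_userid
    from {{ source('snowplow', 'event') }}
    where collector_tstamp >= (select max(max_collector_tstamp) from {{ this }})

)
{% endif %}
```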
---

# Too clever by half

- Whenever a user has a new session, pull the user's _most recent session only_, and perform relative calculations
- This takes some hard thinking! E.g. the [Segment package](https://github.com/fishtown-analytics/segment/blob/master/models/sessionization/segment_web_sessions.sql#L67)

---

# What about _truly_ massive datasets?

Bigger datasets (and Big Data technologies) are different cost-optimization problems.

- Always rebuild the past 3 days. Fully ignore late arrivals.
- Always replace data at the _partition level_.
- No unique keys.
- Targeted lookback? No way: too much extra data to scan.
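--

A sketch of that configuration (BigQuery-flavored, and assuming a hypothetical `collector_date` date column): `insert_overwrite` swaps whole partitions, with no `unique_key` at all.

```sql
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'collector_date', 'data_type': 'date'}
) }}
```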
We're going to talk about this next :)

---

# It's the modeling, silly

Keep the inputs and transformations of your incremental models as singular, simple, and immutable as possible.

- Slowly changing dimensions, like a `product_name` that the company regularly rebrands? Join from a `dim` table.
- Window functions for quality-of-life counters? Fix it in post (i.e. a separate model). This is how our Snowplow package calculates each user's `session_index`.

Rethink! Incrementality introduces a level of complexity (for you and for the database) that is **necessarily a trade-off**.

---
class: subtitle

# Knowledge check

You should be able to:

- Configure `incremental` models
- Identify whether data sources are good or bad candidates
- Understand the trade-offs of transforming data incrementally

### Questions?

---
class: subtitle

# Zoom Out

---