class: title, center, middle # Techniques of the trade ### The Final Countdown ??? We've already learned the tools of the trade (dbt), but now we are learning the techniques. We're going to go through a series of problems that analytics engineers often have to solve, and discuss how we would typically solve them. Feel free to jump in with suggested techniques at any point. Notes for teacher: * This presentation has _a lot_ in it. The goal of this lesson is to show students what they _can_ do with dbt/give them the resources to solve these problems in the future. * Also use this time to answer any unsolved problems from earlier in the course * Take a collaborative approach: for each problem, see if any students can suggest a solution ---
Techniques of the trade | Focus
Get a brief overview of Top 10 advanced dbt features and their use cases
Features include Seeds, Snapshots, Hooks, Operations, & more
--- class: subtitle, center, middle # Part I. More than models --- # 10. My source data keeps changing --- ### On Monday: | order_id | status | created_at | updated_at | |----------|---------|------------|------------| | 1 | pending | 2020-01-01 | 2020-02-20 | ### On Tuesday: | order_id | status | created_at | updated_at | |----------|---------|------------|------------| | 1 | shipped | 2020-01-01 | 2020-02-21 | ??? Imagine you have an orders table where the status field can be overwritten as the order is processed. One day you query the table and get back this result, but the next day you get this result. You can no longer answer: - When did that order ship? - How long did it take to change status? -- ## What we want: | order_id | status | updated_at | dbt_valid_from | dbt_valid_to | |----------|---------|------------|----------------|--------------| | 1 | pending | 2020-01-01 | 2020-01-01 | 2020-01-02 | | 1 | shipped | 2020-01-02 | 2020-01-02 | null | ??? We want to turn our **mutable** data into **immutable data** --- # ❓ How do I make my data immutable? | order_id | status | updated_at | dbt_valid_from | dbt_valid_to | |----------|---------|------------|----------------|--------------| | 1 | pending | 2019-01-01 | 2019-01-01 | 2019-01-02 | | 1 | shipped | 2019-01-02 | 2019-01-02 | null | -- ## ✅ Technique: Snapshots --- # Anatomy of a snapshot A `.sql` file in the `snapshots` directory: .dense-text[ ```sql -- snapshots/orders_snapshot.sql {% snapshot orders_snapshot %} {{ config( target_database='analytics', target_schema='snapshots', unique_key='id', strategy='timestamp', updated_at='updated_at', ) }} -- Pro-Tip: Use sources in snapshots! select * from {{ source('jaffle_shop', 'orders') }} {% endsnapshot %} ``` ] ??? - Snapshots are simple select statements which define a dataset. They live in the snapshots directory, and are enclosed in Jinja tags --- # Anatomy of a snapshot From the command line: ```bash $ dbt snapshot ``` ??? - Every time you run the dbt snapshot command, dbt will run your select statement and check if the dataset has changed compared to its last known state. - If the records have changed, then dbt will create a new entry in the snapshot table for the new state of the record. If there are net-new records in the source dataset, then dbt will create an initial entry for the record. -- .caption[ [Learn more about snapshots](https://docs.getdbt.com/docs/snapshots) ] --- # ❓ 9. I have a long case statement ```sql -- models/orders.sql select order_id, case when country_code = 'AU' then 'Australia' when country_code = 'UK' then 'United Kingdom' when country_code = 'US' then 'United States' ... end as country_name from {{ ref('orders') }} ``` -- ## ✅ Technique: seeds ??? Seeds are CSV files that you can `ref` in your project --- # Anatomy of a seed `data/country_codes.csv`: ```txt country_code,country_name AU,Australia UK,United Kingdom US,United States ``` -- Command: ```bash $ dbt seed ``` --- # Anatomy of a seed Downstream model: ```sql -- models/orders.sql select orders.order_id, country_codes.country_name from {{ ref('orders') }} as orders left join {{ ref('country_codes') }} as country_codes on orders.country_code = country_codes.country_code ``` ??? * Seeds live in the `data` folder as `.csv` files * Load them with the `dbt seed` command * You can `ref` them like any other model * Appropriate for version controlling business logic, not for loading huge datasets -- .caption[ [Learn more about seeds](https://docs.getdbt.com/docs/seeds) ] --- # Other use cases for seeds * Email addresses to exclude * List of employee accounts --- # ❓ 8: I have a query that I want to version control But I don't need (or want) to create a model! -- ## ✅ Technique: analyses --- # Anatomy of an analysis `analysis/mrr_dashboard.sql` ```sql select date_month, sum(is_active::integer) as customers, sum(mrr) as mrr from {{ ref('fct_mrr') }} group by 1 ``` .caption[ [Learn more](https://docs.getdbt.com/docs/analyses) ] --- # ❓ 7: Actually, I want to use that query as a test! (But I still don't want it to be a model.) -- ## ✅ Technique: singular tests --- # Anatomy of a singular test `tests/assert_total_payment_amount_is_positive.sql` ```sql -- Refunds have a negative amount, so the total amount should always be >= 0 -- Therefore return records where this isn't true to make the test fail select order_id, sum(amount) as total_amount from {{ ref('fct_payments' )}} group by 1 having not(total_amount >= 0) ``` Run with `dbt test` (or `dbt test --select test_type:singular`) .caption[ N.B. tests should return **rows**. The test command will summarize via `count(*)` allowing you to set thresholds in a config. [Learn more](https://docs.getdbt.com/docs/building-a-dbt-project/tests) ] --- # ❓ 6: I want to version control some KPIs -- ## ✅ Technique: metrics table --- # Anatomy of a metrics table 1. Standard structure: name, type, department, date, aggregation 2. One model for each metric 3. Union them together 4. Date spine (final grain: one record per day per metric) 4. Single model for comparison window functions | date_day | metric_channel | metric_name | metric_type | value | past_7 | past_14 | past_21 | ly_value | ... | target | |------------|----------------|-------------|-------------|-------|--------|---------|---------|----------|-----|--------| | 02/06/2019 | marketing | leads | count | 20 | 200 | 300 | 150 | 15 | ... | 15 | | 02/06/2019 | marketing | mql | count | 7 | 40 | 60 | 50 | 3 | ... | 10 | | 02/06/2019 | marketing | sqls | count | 2 | 10 | 5 | 20 | 2 | ... | 3 | | 02/06/2019 | accounting | revenue | amount | 150 | 3000 | 2000 | 1500 | 20 | ... | 100 | ??? They can get tricky to handle in the BI layer: * You lose the ability to drill into the data when you do this! * Further aggregation can mess things up, especially for non-additive metrics (e.g. conversion rates) --- # Pulling it all together * Metrics models are on far right of DAG, select from dim + fact **models** * Bring in goals, targets, forecasts, definitions as **seeds** * Take a **snapshot** of metrics table to identify historical anomalies * Compare to expected or historical values with a **data test** To ask questions like: - Is YTD revenue within 2 standard deviations of target? - Is MRR up 10% compared to this day last month? - Is last year revenue within 1% of where we calculated it to be yesterday? --- class: subtitle, center, middle # Part II. Quality of life --- ## ❓ Problem 5: I need to run _some_ of my models _sometimes_. - Some models hourly, other models daily - Just test my most critical models -- ### ✅ Technique: tags! --- .dense-text[ In dbt_project.yml: ```yml models: my_project: marts: product: +tags: ["hourly"] finance: +tags: - "daily" - "important" ``` In a model: ```sql {{config( materialized = 'table', tags = ["hourly"] )}} ``` In deployment: ``` dbt run -s tag:hourly ``` ] --- # ❓ Problem 4: My runs take too long when I'm developing -- ## ✅ Technique: Limit data in dev ??? * Especially useful for event-style data --- # `{{ target }}` * A dbt-specific Jinja variable containing information about your current target * `{{ target.name }}` * `{{ target.schema }}` * `{{ target.type }}` * `{{ target.user }}` * etc [Learn more](https://docs.getdbt.com/docs/target) ??? * These docs are dbt docs, not Jinja docs --- # Anatomy of limiting data in dev ```sql with pages as ( select * from {{ source('snowplow', 'pages') }} {% if target.name == 'dev' %} where event_date >= dateadd('day', -3, current_date) {% endif %} ), ... ``` -- - You can use `target` for _lots_ of environment-aware logic. - _Alternatives:_ [`var`](https://docs.getdbt.com/reference/dbt-jinja-functions/var/) or [`env_var`](https://docs.getdbt.com/reference/dbt-jinja-functions/env_var/). More flexible but also more finicky! 🤷 ??? Tradeoffs: * This doesn't work so well for relational data — you can lose relational integrity * Also how do you then test that your dashboards work as expected? --- # Problem 3: I want to use custom schemas in prod but not in dev. .left-column[ In production, control... ``` analytics.dim_customer intermediate.customer__daily_channels marketing.fct_customer_attribution ``` ] .right-column[ In development, convenience! ``` dev_kyle.dim_customer dev_kyle.customer__daily_channels dev_kyle.fct_customer_attribution ``` ] -- ## ✅ Technique: env-based schema logic --- # Anatomy of env-based schemas In `dbt_project.yml`: ```yml models: marketing: +schema: marketing intermediate: +schema: intermediate ``` In `macros/generate_schema_name.sql`: ```sql {% macro generate_schema_name(custom_schema_name, node) -%} {{ generate_schema_name_for_env(custom_schema_name, node) }} {%- endmacro %} ``` Based on [this code](https://github.com/dbt-labs/dbt-core/blob/dev/kiyoshi-kuromiya/core/dbt/include/global_project/macros/etc/get_custom_schema.sql), but you can write whatever logic you want. --- # Not just schemas dbt looks for three macros when figuring out what goes where: - database (project) naming --> `generate_database_name` - schema (dataset) naming --> `generate_schema_name` - aliasing --> `generate_alias_name` --- ## ❌ Alternative: bespoke Jinja statements In dbt_project.yml: ```yml models: marketing: +schema: "{{ 'marketing' if target.name == 'prod' else target.schema }}" intermediate: +schema: "{{ 'intermediate' if target.name == 'prod' else target.schema }}" ``` This is one environment-aware area where we don't recommend using `target`, `var`, or `env_var`. Things get very tricky very quickly. --- ## ❓ Problem 2: I can't query my stuff!! I am getting permission denied errors on relations that dbt creates: ```txt -- snowflake SQL compilation error: Object 'ANALYTICS.DBT_CLAIRE.MY_TABLE' does not exist or not authorized. -- redshift ERROR: permission denied for relation my_table ``` With so many environments, it's very confusing! -- ### ✅ Technique: version-controlled grant statements ??? We want to ensure that as soon as dbt creates a relation, my BI tool can `select` from it There's a few ways to do this, and it's highly warehouse dependent -- ```sql grant usage on schema dbt_claire to role reporter; grant select on dbt_claire.customers to role reporter; ``` ??? Essentially we want to run a statement _like_ this --- # Tool A: Hooks * Snippets of SQL that you can run: * `on-run-start`: at the beginning of a `dbt run` or `dbt seed` * `on-run-end`: at the end of a run * `pre-hook`: before a model runs * `post-hook`: after a model runs ??? Side step into hooks very quickly to look at it as a way to do this --- # Tool A: Hooks * Configured from the `dbt_project.yml` file or in a model ```yml # dbt_project.yml on-run-end: - "grant usage on {{ target.schema }} to role reporter" models: +post-hook: "grant select on {{ this }} to role reporter" ``` ```sql -- models/my_model {{ config(post_hook="grant select on " ~ this ~ "to role reporter")}} select ... ``` .caption[ [Learn more about hooks](https://docs.getdbt.com/docs/hooks/) ] --- # Tool B: Operations A _macro_ that you can run using the `run-operation` command ```sql -- macros/grant_select.sql {% macro grant_select() %} {% set grant_sql %} grant usage on schema {{ target.schema }} to role reporter; grant select on dbt_claire.customers to role reporter; {% endset %} {% do run_query(grant_sql) %} {% endmacro %} ``` Command line: ```bash $ dbt run-operation grant_select ``` .caption[ [Learn more about operations](www.getdbt.com/docs/using-operations) ] ??? Note that you have to run the query These won't work, but here's how we put it together --- # Implementation details Consider: * Are you using multiple schemas? You'll need to grant the `schemas` variable * Can you leverage default/future grants? * Can you abstract this logic into a macro instead of hooks? * Redshift: Are you granting to a single user, or an entire group? [Our preferred approach](https://discourse.getdbt.com/t/the-exact-grant-statements-we-use-in-a-dbt-project/430) --- # Other use cases for hooks & operations * Creating UDFs at the start of a run ([Discourse](https://discourse.getdbt.com/t/using-dbt-to-manage-user-defined-functions/18)) * Performing warehouse maintenance on Redshift ([package](https://hub.getdbt.com/dbt-labs/redshift/latest/)) * Staging external tables ([package](https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/)) --- class: subtitle, center, middle # Part III. Giant shoulders --- # 1. I can't have been the first person who... -- - ❓ Needs to create a table of all days - ❓ Wants to add a threshold to the uniqueness test - ❓ Needs to do an ETL migration (and check we didn't break anything) - ❓ Needs to generate a ton of sources -- ## ✅ Technique: use (or adapt) a package! - [dbt-utils](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/), [audit-helper](https://hub.getdbt.com/dbt-labs/audit_helper/latest/), [codegen](https://hub.getdbt.com/dbt-labs/codegen/latest/) - [package hub](https://hub.getdbt.com/) --- # I can't have been the first person who... - ❓ Needs to model subscription data - ❓ Needs to model customer attribution -- ## ✅ Technique: read a playbook! - [Monthly recurring revenue](https://github.com/dbt-labs/mrr-playbook) - [Marketing attribution](https://github.com/dbt-labs/attribution-playbook) --- # I can't have been the first person who... - ❓ Wants to use blue-green deployments via zero-copy cloning on Snowflake - ❓ Wants to create near-real-time models by mixing incremental + view - ❓ Needs to test nested BigQuery fields - ❓ Needs to determine whether to have one or several dbt projects across the organization -- ## ✅ Technique: search in [discourse](https://discourse.getdbt.com/top/all?order=views)! --- # Techniques - Snapshots - Seeds - Analyses - Data tests - Metrics - Tags - Env-based logic (`target`) - Env-based relation naming - Hooks & operations - Packages, playbooks, discourse, oh my! --- class: subtitle, center, middle # Fin! --- class: subtitle # Zoom Out
Polish your dbt project
Intro to Jinja
dbtonic Jinja
Packages and projects
dbt Materializations
Incremental Models
Techniques of the trade
--
Closing ceremony
---