class: title, center, middle

# Techniques of the trade - internal learn edition!

---

class: subtitle, center, middle

# Part I. Borrowing from software engineering

---

## ❓ How can we share files better?

We are having a _really_ challenging time sharing different versions of files with each other.

Alexis sent me an email with `dim_customers.sql`

Jeremy pinged me in Slack with `dim_customers_v1.sql`

I have `dim_customers_final.sql` on my desktop

---

# ✅ Version Control

All of the work that a developer does in dbt is written in code - typically SQL, YAML, or Markdown.

dbt takes advantage of version control - git specifically - to make collaboration way easier.

---

## ❓ We keep messing up each other's models in development

I'm developing some models and so is Alexis.

I run the code only to realize that Alexis just changed the underlying data in her workflow.

Or worse, I just made a change to the code and broke our KPI dashboard.

---

# ✅ Environments

dbt leverages environments to allow users to operate in their own "sandbox".

I'll build into the schema `dbt_kcoapman` and Alexis will build into the schema `dbt_awongbaird`.

* The _production_ models will build into their own schema, `dbt_prod`.

---

## ❓ I have to update the same code across multiple files

I have a bunch of long SQL files that all start with the same logic.

.left-column[
model_1
```
select
    orderid as order_id,
    paymentid as payment_id,
    date as order_date
    ...
    ...
    ...
```
]

.right-column[
model_2
```
select
    orderid as order_id,
    paymentid as payment_id,
    date as order_date
    ...
    ...
    ...
```
]

---

# ✅ Modularity with models

Developers can break up the SQL into modular chunks that build off each other.

These models can then be linked through the `ref` function.

This allows for less repeated code and automated dependency management.

---

## ❓ Is there a way to make SQL more dynamic?

SQL is by nature a declarative language.

If something about the underlying data changes, I sometimes have to update the downstream code by hand.

There are no loops or conditional statements.

---

# ✅ Jinja + Macros

Jinja brings Python-style logic directly into the SQL analytics workflow.

You can leverage loops, conditionals, and variables, and macros let you package that logic up for reuse.
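For example, a loop can generate repetitive SQL for you. A minimal sketch - the `payments` model and its `payment_method` values here are hypothetical:

```
-- a hypothetical model that pivots payment methods into columns
select
    order_id,
    {% for method in ['credit_card', 'bank_transfer', 'gift_card'] %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('payments') }}
group by 1
```

Adding a fourth payment method is a one-word change to the list, not another copy-pasted `sum(...)` line.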
---

## ❓ Is there a way I can make sure my data is _right_?

After I transform my data I often...

- scroll through the data to make sure it looks right
- run some queries against the data
- export the data to a CSV, put it in Excel, run some manual checks

---

# ✅ Tests

dbt has two types of tests built in:

- Schema tests = configurable, scalable
- Data tests = specific, ad hoc

---

## ❓ I found some dbt code I want to use - should I just copy and paste it in?

My colleague wrote some really cool code for generating model files.

A dbt community member wrote an excellent suite of testing macros that I want to use.

---

# ✅ Packages

The packages feature allows you to import models and macros from another codebase directly into your own project.

This lets you pull in the latest version of the package _instead_ of copy-pasting code from another repo and updating it manually.

---

## ❓ How can I make sure that my documentation of the data is up to date?

I have a really large Google Sheet that documents my data for my organization to self-serve.

This is often out of date, so I end up answering questions in chat/email all the time.

The most common question I get: where does this data come from?

---

# ✅ Documentation

Documenting your models in dbt happens directly in the codebase rather than in some external tool.

This is all automated into dbt docs, which includes:

- the DAG
- model and column descriptions
- some metadata (more to come)

---

# ❓ Can I refresh my models on a cadence?

Every Monday I run a report, attach it to an email, and send it to my boss.

This is the exact same process every Monday morning.

---

# ✅ Deployment

dbt Cloud empowers users to run their dbt project on a schedule.

This can be any sequence of dbt commands, typically including:

- `dbt run`
- `dbt test`

---

# ❓ How do we review and ship proposed changes safely?

My team is constantly editing the files in our project in their development environments. When they're ready to propose a change, they open a Pull Request in GitHub.

- I want to know that the change will work before merging, to avoid breaking production
- Once the change is merged, I want it in production!

---

# ✅ CI/CD

- Continuous Integration: test all proposed changes, in a separate environment, and report back with ✅ or ❌
- Continuous Deployment: always build the most recent "main" version of the code (and ideally, rebuild it any time it changes)

---

class: subtitle, center, middle

# Part II. Analytics Engineering tools (or cool dbt-specific stuff)

---

# ❓ How can I visualize raw tables in the DAG? Can I test my raw tables?

I can see all the models in the DAG as blue nodes. I'd love to know exactly where that data comes from, though.

Is it possible to test that data before I transform it, too?

---

# ✅ Sources

Sources are configured in YAML, and dependencies on them are built with the `source` function.

Sources can be tested and documented - just like models!

Bonus! There is a snapshot-freshness command to make sure that the raw data is up to date.

---

# ❓ Can I make dbt aware of BI dashboards?

Some of my dashboards depend on my final models, and I want to include that in my workflow.

I'd also love to build only the models that power a given dashboard.

---

# ✅ Exposures

Exposures allow you to make dbt aware of things that exist _outside_ of dbt but depend on it: BI dashboards, applications, reverse syncs, machine learning pipelines, ...

These will be rendered in the DAG in orange.

Combined with sources, you can now see the full lineage of your data warehouse.

If you use dbt Cloud, you can also check an exposure's status:

- Is the source data powering that exposure fresh?
- Are tests passing on the transformations upstream of that exposure?

---

# ❓ Can I change a model to be built as a table instead of a view?

Whenever I run `dbt run`, all my models are built as views.

I'd love to build these as tables instead.

---

# ✅ Materializations

Materializations control how a model is built in the warehouse - as a view, a table, and more.

They can be configured in two ways:

- `config` blocks, e.g. `{{ config(materialized='table') }}` at the top of a model
- the `dbt_project.yml` file

---

# ❓ How can I _only_ add the new records to a model?

If I configure a model as a table, it will be dropped and completely rebuilt every time I run `dbt run`.

I'd love to just add the new records instead of starting from scratch every time.

---

# ✅ Incremental Models

Incremental models can be written to just add the new records to a table.

This can reduce compute and build times for very large models.
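A minimal sketch, assuming a hypothetical upstream `events` model with an `event_timestamp` column:

```
-- a hypothetical incremental model
{{ config(materialized='incremental') }}

select
    event_id,
    event_timestamp,
    payload
from {{ ref('events') }}

{% if is_incremental() %}
-- on incremental runs, only select rows newer than the ones already in this table
where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```

On the first run, dbt builds the full table; on subsequent runs, the `is_incremental()` block filters the query down to just the new records.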
---

# ❓ How can I capture type 2 slowly changing dimensions?

On Monday:

| order_id | status  | created_at | updated_at |
|----------|---------|------------|------------|
| 1        | pending | 2020-01-01 | 2020-02-20 |

On Tuesday:

| order_id | status  | created_at | updated_at |
|----------|---------|------------|------------|
| 1        | shipped | 2020-01-01 | 2020-02-21 |

---

# ❓ How can I capture type 2 slowly changing dimensions?

What we want:

| order_id | status  | updated_at | dbt_valid_from | dbt_valid_to |
|----------|---------|------------|----------------|--------------|
| 1        | pending | 2020-02-20 | 2020-02-20     | 2020-02-21   |
| 1        | shipped | 2020-02-21 | 2020-02-21     | null         |

---

# ✅ Snapshots

Snapshots allow you to capture this type of change over time (a sketched snapshot block is in the appendix at the end of this deck).

The snapshot will only capture changes when it is run.

Unlike tables, these do not get dropped when you run `dbt snapshot`.

---

# ❓ How can I get a quick CSV that won't change into dbt?

I have a table of country code mappings or employee addresses.

I'd love to have these in my warehouse as a table.

---

# ✅ Seeds

The command `dbt seed` will upload any CSVs in your `data` folder into your warehouse.

These can be referenced with the `ref` function just like other models!

---

# ❓ Can I stash some SQL in my dbt project that I don't want to be built as a model?

---

# ✅ Analyses

The `analysis` folder can be used for storing SQL that you want in version control.

Use cases:

- ad hoc queries
- training queries
- auditing

---

# ❓ Love the deployment concept - can I be notified when things break?

If a test fails, I want to be the first to know about it so I can take action.

---

# ✅ Alerts

dbt Cloud currently supports Slack and email notifications.

---

class: subtitle, center, middle

# Part III. Contributing to the knowledge loop

---

# I can't have been the first person who...

- ❓ Wants to use blue-green deployments via zero-copy cloning on Snowflake
- ❓ Wants to create near-real-time models by mixing incremental + view
- ❓ Needs to test nested BigQuery fields
- ❓ Needs to determine whether to have one or several dbt projects across the organization

--

## ✅ [Discourse](https://discourse.getdbt.com/top/all?order=views)

---

# I can't have been the first person who...

- ❓ Needs to model subscription data
- ❓ Needs to model customer attribution

--

## ✅ Playbooks

- [Monthly recurring revenue](https://github.com/dbt-labs/mrr-playbook)
- [Marketing attribution](https://github.com/dbt-labs/attribution-playbook)

---

# I have an idea for dbt Core!

- ❓ How can I fix a bug?
- ❓ I have a new testing idea

--

## ✅ Contribute through GitHub

- Reach out to Jeremy
- Pick up an issue
- Contribute!

---

# I have some data questions/ideas...

- ❓ I'm hitting a bug with a package
- ❓ How should I model this complex data?
- ❓ I'd love to share a blog post with other analytics engineers
- ❓ I made this awesome meme about data

--

## ✅ dbt Community Slack

- Find a channel and jump in!

---

class: subtitle

## Internal Learn Pt. 2

.left-column[
### Track 1 - Jerco

- for the software engineer
- seeing analytics code
- thinking about dbt infrastructure
]

.right-column[
### Track 2 - Alexis

- for the analytics engineer
- writing analytics code
]
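---

## Appendix: snapshot sketch

As referenced on the Snapshots slide - a minimal sketch of a snapshot block that would produce the `dbt_valid_from` / `dbt_valid_to` table shown earlier. The `raw` source and its `orders` table are hypothetical:

```
-- snapshots/orders_snapshot.sql (hypothetical)
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'orders') }}

{% endsnapshot %}
```

Each `dbt snapshot` run compares the current rows to the previous run and closes out `dbt_valid_to` on any record whose `updated_at` has changed.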