class: title, center, middle

# Data Warehouses

---

# Why are we talking about data warehouses?

- The advent of Redshift (et al.) was **_the_** instigating event that made the modern data stack possible
- dbt is nothing without the supercomputer that runs its SQL
--- # Pragmatism .right-column[ Learning about your data warehouse is a surefire way to - raise the ceiling on your skills as an analytics professional - save a lot of money ] .left-column[
] ---
# Data Warehouses | Focus

- Explain the difference between a transactional and an analytical database
- Understand what databases are doing behind the scenes to run your queries
- Understand the differences between databases, warehouses, lakes, and lakehouses
- Explain the key features of a modern data warehouse
- Explain which data platforms are best in class and why
--- # What is a data warehouse? --
--- # "Data" is just - information - stored in files - with more structure or less - does it really matter how it's stored? --- ### CSV ```csv id,name,favorite_color,is_creative 1,Alice,green,false 2,Barbara,blue,true ``` ### JSON ```json {"id": 1, "name": "Alice", "favorite_color": "green", "is_creative": false} {"id": 2, "name": "Barbara", "favorite_color": "blue", "is_creative": true} ``` --- ### Parquet ```parquet 00000000 50 41 52 31 15 04 15 c8 ba 01 15 c8 ba 01 4c 15 |PAR1..........L.| 00000010 c6 0f 15 04 00 00 00 2a e9 6c f2 19 00 00 4e 7f |.......*.l....N.| 00000020 25 00 00 5e 30 d0 e1 37 00 00 4e 7f 25 00 00 2e |%..^0..7..N.%...| 00000030 f9 22 cb 03 00 00 4e 7f 25 00 00 f2 ba cd fb 01 |."....N.%.......| 00000040 00 00 4e 7f 25 00 00 4e be 04 ac 10 00 00 4e 7f |..N.%..N......N.| 00000050 25 00 00 c4 bc 95 26 18 00 00 4e 7f 25 00 00 a8 |%.....&...N.%...| 00000060 cd 63 00 1c 00 00 4e 7f 25 00 00 a4 0d 1f 37 16 |.c....N.%.....7.| 00000070 00 00 4e 7f 25 00 00 92 b3 57 b5 0c 00 00 4e 7f |..N.%....W....N.| 00000080 25 00 00 4e 7b 7e 8f 3c 00 00 4e 7f 25 00 00 94 |%..N{~.<..N.%...| 00000090 2e 7a 95 00 00 00 4e 7f 25 00 00 74 a7 38 2f 3b |.z....N.%..t.8/;| ``` --- # Databases .left-column[ - Interfaces into data! - Some language (SQL) is used to query the data - Data storage is abstracted - Machine-readable optimized file formats - On spinning disks, inside boxes, somewhere in Virginia/Oregon/Ireland - (Of course it matters how) ] .right-column[
] ??? We're talking almost exclusively about structured data for the rest of this presentation, but of course, Mongo et al are databases, too --- # Transactional Databases .left-column[ Optimized for: - Reads and writes - Thousands of concurrent queries In this category: - Postgres - MySQL - SQL Server - ... ] .right-column[ For instance: - Get one user’s email address - Add seven items to a user’s cart - Change a user’s last name ```sql select * from products where trending = true ```
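
The three bullets above are the heart of the transactional workload: tiny, indexed reads and writes. A minimal sketch, assuming a hypothetical `users` table:

```sql
-- Point lookup: fetch one user's email address by primary key
select email from users where id = 42;

-- Point write: change one user's last name
update users set last_name = 'Smith' where id = 42;
```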
]

---

# Analytic Databases

.left-column[
Optimized for:
- "Analytical" queries
- Importing and exporting data in batches

In this category:
- Amazon Redshift
- Snowflake
- Google BigQuery
- Azure SQL Data Warehouse (now Azure Synapse Analytics)
- ...
]

.right-column[
For instance:
- Number of distinct users in Ohio who are over 38 years old
- Rolling 14d average of pageviews per day for the last five years
]

---

# Database Comparisons*

| db | Latency | Concurrent Queries | Rows Scanned / Query |
|---------------|---------------|--------------------|----------------------|
| Transactional | ~milliseconds | ~thousands | ~thousands* |
| Analytic | ~seconds | ~dozens | >> millions |

---

# How does this affect analytics?

.left-column[
### Then (transactional)
- Storage and aggregations are expensive
- Joins are cheap
- Datasets are best normalized across dozens/hundreds of objects
]

.right-column[
### Now (analytic)
- Storage and aggregations are cheap
- Joins are expensive
- Massive datasets are updated in batches and queried in partitions
]

---

# It's 1998. You've got mail.

```
Dear Analyst,

I would like to know how many widgets we sold in 1997 (last year).
Just widgets in our TV category, that is.

Oh, and if you could slice that number by the brand name of the widget,
and dice it by the country where we sold them, that'd be rad.

Thanks!

Tristan
SVP Marketing Analytics
```

---

### Before: snowflake schema (ca. 2000s)
--- ### Before: snowflake schema (ca. 2000s) ```sql SELECT B.Brand, G.Country, SUM(F.Units_Sold) FROM Fact_Sales F INNER JOIN Dim_Date D ON F.Date_Id = D.Id INNER JOIN Dim_Store S ON F.Store_Id = S.Id INNER JOIN Dim_Geography G ON S.Geography_Id = G.Id INNER JOIN Dim_Product P ON F.Product_Id = P.Id INNER JOIN Dim_Brand B ON P.Brand_Id = B.Id INNER JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id WHERE D.Year = 1997 AND C.Product_Category = 'tv' GROUP BY B.Brand, G.Country ``` --- ### After: denormalized model (2020) .denser-text[ | sale_date | units_sold | store_id | store_number | state_province | country | ean_code | product_name | brand_name | product_category_name | |-----------|------------|----------|--------------|----------------|---------|---------------|--------------|--------------|-----------------------| |1997-01-01 | 7 | 1234 | 12 | PA | US | 901234-123457 | Widget | Widgetastic | tv | |1997-01-02 | 8 | 1234 | 12 | MA | US | 901234-123457 | Widget | Widgetastic | tv | |1997-01-03 | 9 | 1234 | 12 | NSW | AU | 901234-123457 | Widget | Widgetastic | tv | ] --- ### After: denormalized model (2020)
--- ### After: denormalized model (2020) ```sql select brand, country, sum(units_sold) as total from analytics.fact_sales where date_trunc('year', sale_date) = '1997-01-01' and product_category = 'tv' group by 1,2 ``` --- class: subtitle, center, middle # Transactional. Analytic. ### Which one is right? --- # Both! .img-center[
]

--

.left-column[.center[
Power your application with a transactional database
]]

.right-column[.center[
Analyze data with an analytical database
]]

---

# A modern approach

1. Replicate data into a warehouse
    - Transactional databases
    - SaaS products
    - Event tracking
2. Query the data in the warehouse
3. Be a cultural change agent for data in your organization
4. Profit

---

# P.S. What's in a name?

.left-column[
Transactional Database, a.k.a.:
- Relational Database
- RDMS or RDBMS
- Row-wise, Row-oriented, Row Store
- OLTP
- e.g. Postgres
]

.right-column[
Analytic(al) Database, a.k.a.:
- Data Warehouse
- Columnar Database
- Column-oriented, Column Store
- OLAP
- e.g. Redshift
]

???

- RDMS or RDBMS = Relational Database Management System
- OLTP = Online Transactional Processing
- OLAP = Online Analytical Processing

---
class: subtitle

# Checkpoint: What type of database should you use if...

- You are creating a site where users can write reviews of movies, books, and music?

--

- You want to combine data from Snowplow and Stripe to visualize attribution data in Looker?

--

## Questions?

???

---
class: subtitle, center, middle

# Data Warehouses

---
name: big-q

# Q: What's _fast_ in a data warehouse?

---
template: big-q
name: first-a

.left-column[
### A: "Analytical" queries

Columnar storage allows these databases to .highlight[**limit scanned data**] when aggregating across a subset of columns.
]

---
template: first-a
name: second-a

.right-column[
### A: Parallel processing

Data warehouses can benefit from .highlight[**horizontal and vertical scaling**].

- _horizontal = more computers_
- _vertical = better computers_
]

---
class: subtitle, center, middle

# So... why is my query slow?

---

# Mental model

- A query, written in SQL, is interpreted by the database optimizer
- The optimizer writes an execution plan in machine code
- The execution plan contains a sequence of tasks
- Some tasks are easy, some are more strenuous
- Some tasks can be performed in parallel, some need to wait for others to finish

| Task | Time | Time (Relative) |
|---------------------|-------|-----------------|
| 1 CPU Cycle | 0.3ns | 1 second |
| Memory Access | 120ns | 5 minutes |
| Disk read | 1ms | 1 month |
| Internet: SF to NYC | 40ms | 4 years |

---

# Conceptual framework

- Scanning data is **_slow_**
- Moving data around is **_slowww_**
- Warehouses are "logical" and "physical"
  - We like to think about the "logical" part of it
  - Sometimes it’s necessary to consider the "physical"
- Biggest takeaway?
  - Your time is valuable
  - Make choices that keep you focused on the "logical"

---

# In practice

.left-column[
Databases need to scan entire tables:
- Filters (`where`)
- ...unless your table is sorted / pre-partitioned / z-ordered
]

.right-column[
Databases need to scan _and move_ data:
- Uniqueness (`distinct`)
- Joins
- Window functions
]

**The optimizer is your friend!** Take the time to listen (see the sketch on the next slide).

- If the database optimizer does its job perfectly, you never need to think about sorting, distributing, clustering, shuffling, range joining, ...
- If you _only_ think about the "logical" and _never_ about the "physical," you cannot empathize with the database optimizer, and you will write non-performant SQL

???

- Joins implicitly require finding all unique values for match/comparison, which is why they're similar to getting `distinct` values.

Rabbi Hillel's questions, paraphrased:
- If you are only for the optimizer, who will be for you?
- If you are only for yourself, who are you?
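
---

# In practice: listen to the optimizer

A minimal sketch of how you might ask the warehouse what it plans to do before rewriting a slow model (the schema, table, and column names here are hypothetical):

```sql
-- Redshift / Postgres-style syntax shown; Snowflake and BigQuery expose
-- the same information through their query profile / execution details.
explain
select
    customer_id,
    count(*) as order_count
from analytics.fct_orders          -- hypothetical model
where order_date >= '2020-01-01'   -- can the warehouse prune on a sort key / partition?
group by 1;
```

In the plan, look for full-table scans, large joins, and steps that redistribute data across nodes: those are the "physical" costs hiding behind your "logical" SQL.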
---
class: subtitle

# Checkpoint:

- How can your data team benefit from scalable compute resources?

--

- How does knowledge of the 'physical' aspects of a data warehouse impact our work as analysts?

--

## Questions?

---
class: subtitle, center, middle

# Comparisons

---

# Apples to apples

What's required to **optimize** and **maintain** warehouse performance?

.denser-text[
.left-column-66[
.left-column[
### Redshift
- Sort keys: compound or interleaved
- Dist style: single column, even, or all¹
- Vacuum, analyze¹
- Column compression
- Read `explain` plans

.caption[¹ Recently improved]
]
.right-column[
### BigQuery
- Partitioning
- Clustering
- Nesting/arraying repeated records
- Read query plan
]
]
.right-column-33[
### Snowflake
- Clustering (~sorting)
  - Only for large tables
- Read `explain` or visually inspect query profile
]
]

???

Draw an arrow from left ("more work") to right ("less work"). A well-tuned, perfectly optimized Redshift cluster will execute queries _as quickly as_ a comparable Snowflake cluster, for a lower price. But a poorly tuned Redshift cluster is, well, a cluster.

---
class: subtitle, center, middle

# What about Data Lakes?

---
---

# Spark, Presto, Athena, ...

- Incredibly powerful and cost-effective
- Built _for_ petabyte-scale
- May support other languages!
  - data engineers can write Scala, Java
  - data scientists can write Python, R
- You _can_ run dbt on them!
  - Delta Lake (Databricks-hosted Spark): 98% feature parity

---

# Drawbacks

Harder to use, and harder to reason about:
- SQL dialects are often Hive-style, rather than Postgres-style
- Data storage is significantly less abstracted: file paths + formats matter

dbt supports these via "adapter plugins":
- Need additional packages to "shim" compatibility
- Depending on the database, some features are unsupported

---

# Challenge: modeling at scale

_Whether you're using Snowflake, BigQuery, or Spark:_

- All models are materialized incrementally
  - Replace discrete partitions, rather than merging/matching records*
  - Full refresh: never, or manually trigger
- Testing is relative
  - E.g. test that `unique` failures are < 1% of all rows
- Tables are "sharded" by a top-level identifier
  - region, account, client, etc.

---

# Review: bases, warehouses, lakes

| name | tech | stored as | storage / compute | optimization | semi-structured data | "external" data | intuition |
|----------|------------------------|-----------|-------------------|------------------------|----------------------|-----------------------|-----------|
| Postgres | transactional database | rows | same | indexes | some support | via extensions? | everything works the way you expect |
| Redshift | analytic database | columns | tightly coupled | sort + dist | minimal support | S3 via Spectrum | most things |
| Spark | data processing engine | files | separate | file formats, metadata | first-class support | all data is external | an acquired taste! |

---
class: subtitle, center, middle

# QED

---

# The Modern Data Warehouse

_a.k.a. Cloud Data Platform, Data Lakehouse, ..._

It offers the best of both worlds:
- Intuitive **user experience** of a database: users, groups/roles, authentication, permissions, information schema, everything is SQL
- Flexible and scalable **performance** of a data lake and distributed processing technology

It's what partners best with dbt!

---

# Conditions

- 90+% of core functionality is implemented as SQL/DDL
- Separation of compute + storage
- Support for semi-structured + unstructured data²
- Support for querying data from external file storage²

These are features that, five years ago, were only found in Apache Spark, at the cost of architectural challenges and a limited SQL dialect.

.caption[² See appendix for details]
---

# Superlatives

_Most likely to..._

- Scale without a hitch: **Snowflake**
- Save you money in the long run: **BigQuery**
- Reward past DBA experience: **Redshift**
- Have a little something for everyone: **Databricks**

Each of these is, to a greater or lesser extent, a data lake _posing as_ a data warehouse (or "lakehouse").

---
class: subtitle

# Checkpoint:

- Who are the biggest players in data warehousing?

--

- What data warehouse technologies are you using at your organization?

---
class: subtitle

# Knowledge check

You should understand:
- The foundational differences between transactional and analytical databases

You should be building intuition around:
- Why certain queries/models build more slowly
- Which dbt features depend on features of the underlying data warehouse
- Which cloud data platforms are best-in-class and why
- Why someone might opt for a data lake or lakehouse

## Questions?

---
class: subtitle

# Zoom Out

---

---
class: subtitle, center, middle

# Appendix
### Specific feature support in 2020

---

# A. Semi-structured data

It's common that 90% of your data is tabular/structured, but there's a little bit that makes more sense as JSON!

| bldg_id | location | occupancy | vacant_units | amenities |
|---------|-------------|-----------|--------------|-----------------------------------------------------------------|
| 1 | Rittenhouse | 54 | `[2, 3, 9]` | `{"doorman": true, "pets_allowed": true, "pet_deposit": 1000}` |
| 2 | Fairmount | 8 | `[]` | `{ "yard_size_sqft": 100 }` |
| 3 | Austin, TX | 5 | `[1]` | `{ "yard_size_sqft": 1000, "pets_allowed": true }` |

vs.

| bldg_id | location | occupancy | vacant_unit_1 | num_vacant | doorman | pets_allowed | pet_deposit | yard_size_sqft |
|---------|-------------|-----------|---------------|------------|---------|--------------|-------------|----------------|
| 1 | Rittenhouse | 54 | 2 | 3 | true | true | 1000 | |
| 2 | Fairmount | 8 | | 0 | | | | 100 |
| 3 | Austin, TX | 5 | 1 | 1 | | true | | 1000 |

---

# A. Semi-structured data

Can your database _read_ semi-structured data?

| JSON support | Redshift³ | Snowflake | BigQuery | Spark |
|------------------------|-----------------------------------|-----------------------------|--------------------------|------------------------|
| Unnesting dictionaries | Kind of: `json_extract_path_text` | `amenities:doorman` | `amenities.doorman` | `amenities['doorman']` |
| Flattening arrays | Not natively | `lateral flatten` | `cross join unnest` | `explode` |
| Loading data | Flatten + unnest first | Load as-is (`variant` blob) | Load as repeated records | Query as-is |

Can it _write_ columns of semi-structured data?

| JSON support | Redshift | Snowflake | BigQuery | Spark |
|------------------------|----------|--------------------------------|-------------------|----------------|
| Creating dictionaries | No | `object_construct` | `struct` | `map` |
| Creating arrays | No | `array_construct`, `array_agg` | `[]`, `array_agg` | `collect_list` |

.caption[³ Better Redshift support is on the horizon]
---

# B. External data

Can your database query files living in S3, Google Cloud Storage, Azure Blob Storage, HDFS, ...?

| External support | Redshift | Snowflake | BigQuery | Spark |
|-------------------------------|-----------------------------------|--------------------------------------|----------------------------------------|-----------------------|
| COPY from/to | S3 | S3/GCS/Azure | GCS + G Drive | S3/Azure/GCS/HDFS/etc |
| Read from external tables | Via Spectrum. Manual partitioning | All via DDL, automatic partitioning | Powerful schema inference, now in DDL! | Hive-style DDL |
| External-only transformation | Limited (cf. Athena) | Extensive | Limited | Complete |

.caption[Read more about [dbt + external tables](https://github.com/dbt-labs/dbt-external-tables)]
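
---

# B. External data: a concrete sketch

For a flavor of what "read from external tables" looks like in practice, here is a minimal sketch of a Snowflake external table over Parquet files in S3. The stage name, bucket, and column are hypothetical, and credentials/storage integration are omitted:

```sql
-- A stage pointing at the bucket (authentication omitted for brevity)
create stage raw_ext.events_stage
    url = 's3://example-bucket/events/';

-- An external table: the data stays in S3; Snowflake reads it at query time
create or replace external table raw_ext.events (
    event_at timestamp as (value:event_at::timestamp)  -- parsed from the raw variant
)
with location = @raw_ext.events_stage
file_format = (type = parquet)
auto_refresh = true;
```

Redshift Spectrum, BigQuery, and Spark express the same idea with their own DDL; the dbt-external-tables package linked above templates those differences for you.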