Table of Contents
- No silver bullets: Building the analytics flywheel
- Identity Crisis: Navigating the Modern Data Organization
- Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- Down with 'data science'
- Refactor your hiring process: a framework
- Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- To All The Data Managers We've Loved Before
- From Diverse "Humans of Data" to Data Dream "Teams"
- From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- New Data Role on the Block: Revenue Analytics
- Data Paradox of the Growth-Stage Startup
- Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- Keynote: How big is this wave?
- Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- The Future of Analytics is Polyglot
- The modern data experience
- Don't hire a data engineer...yet
- Keynote: The Metrics System
- This is just the beginning
- The Future of Data Analytics
- Coalesce After Party with Catalog & Cocktails
- The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- Built It Once & Build It Right: Prototyping for Data Teams
- Inclusive Design and dbt
- Analytics Engineering for storytellers
- When to ask for help: Modern advice for working with consultants in data and analytics
- Smaller Black Boxes: Towards Modular Data Products
- Optimizing query run time with materialization schedules
- How dbt Enables Systems Engineering in Analytics
- Operationalizing Column-Name Contracts with dbtplyr
- Building On Top of dbt: Managing External Dependencies
- Data as Engineering
- Automating Ambiguity: Managing dynamic source data using dbt macros
- Building a metadata ecosystem with dbt
- Modeling event data at scale
- Introducing the activity schema: data modeling with a single table
- dbt in a data mesh world
- Sharing the knowledge - joining dbt and "the Business" using Tāngata
- Eat the data you have: Tracking core events in a cookieless world
- Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- Batch to Streaming in One Easy Step
- dbt 101: Stories from real-life data practitioners + a live look at dbt
- The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- Implementing and scaling dbt Core without engineers
- dbt Core v1.0 Reveal ✨
- Data Analytics in a Snowflake world
- Firebolt Deep Dive - Next generation performance with dbt
- The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- dbt, Notebooks and the modern data experience
- You don't need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- Git for the rest of us
- How to build a mature dbt project from scratch
- Tailoring dbt's incremental_strategy to Artsy's data needs
- Observability within dbt
- The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- So You Think You Can DAG: Supporting data scientists with dbt packages
- How to Prepare Data for a Product Analytics Platform
- dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- Upskilling from an Insights Analyst to an Analytics Engineer
- Building an Open Source Data Stack
- Trials and Tribulations of Incremental Models
Firebolt Deep Dive - Next generation performance with dbt
Tune in with Kevin Marr from Firebolt to learn about current trends in data warehousing and how they will impact tomorrow's data workflows.
Follow along in the slides here.
Browse this talk's Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
Elize Papineau: [00:00:00] Hello, thank you for joining us at Coalesce. My name is Elize Papineau. I am a senior analytics engineer at dbt Labs, and I'll be the host of this session. The title of the session is Firebolt Deep Dive: Next generation performance with dbt, and we'll be joined by Kevin Marr and Cody Scharz from Firebolt, who just released some rad merch and, even more excitingly, today will be announcing a new dbt adapter and giving us an intro to Firebolt. Before we get started, I just want to shout out that all chat conversation is going to be taking place in the #coalesce-firebolt channel of dbt Slack. If you're not a part of the chat, you have time to join right now: please visit community.dbt.com and search for #coalesce-firebolt when you enter the space.
We encourage you to interact with other attendees, ask questions as they come up, make comments and react [00:01:00] throughout the session. After the session, the speakers will be available in the Slack channel to answer your questions, but we encourage you to post those questions as they come up during the session.
Let's get started. Over to you, Kevin and Cody.
Cody Scharz: All right. Thank you. Hello everyone. I'm Cody and I'm a product manager at Firebolt. And I am joined by Kevin, who is head of ecosystem product at Firebolt, where he's working on building out integrations with Firebolt across the stack, and specifically dbt. So what are we going to be talking about today?
First we want to spend some time talking about where the data warehouse market is, how far it's come over the last decade, and some of the challenges that are still present. With that understanding, we'll tell you more about Firebolt, how Firebolt is different, and why we're building a new data warehouse.
After that, Kevin will give you a [00:02:00] sneak peek into the Firebolt dbt adapter and show a demo of dbt running with Firebolt. So looking back, it was a great decade for cloud analytics. The industry made huge strides across a wide range of dimensions, which has led to an explosion in data usage and growth of the industry, and likely plays a big part in why many of us are working with data today.
[00:02:26] A great decade for cloud analytics #
Cody Scharz: We wanted to reflect on a few of the big steps. So Redshift was the first cloud data warehouse, released in 2012, and Amazon showed that data warehouses can be SaaS, such that setting them up is just a few clicks and not a multi-month project, and that the ability to scale the warehouse easily provides tremendous value. Databricks leveraged Spark and popularized distributed computing in the cloud.
Cody Scharz: It also made distributed analytics simpler and improved over the MapReduce era. BigQuery introduced a serverless data warehouse, followed [00:03:00] shortly by Athena, which simplified things even further and proved really well suited for some workloads like ad hoc analytics. And Snowflake popularized decoupled storage and compute with a great and easy-to-use experience that gave the user many of the right abstractions for managing a data warehouse.
And with all of that advancement, ELT became the preferred way to manage transformations. This was unlocked by all the work done to improve the scale of the data warehouse and by the rise of dbt. We're all here today because of how powerful dbt is, and how the pairing of a cloud data warehouse with dbt is a total game changer.
[00:03:36] Advances bring new challenges #
Cody Scharz: But all of these advances have also created new demands and new data challenges. We expect more from our analytics today, and use cases are becoming more operational in nature and customer-facing. These use cases rely on ever-growing data sets, and the terabyte is slowly but surely becoming the new gigabyte.
If that's not the case for you, just wait, as it will be. [00:04:00] Data velocity increases can quickly lead to a lot more data captured, and more data and more sources mean more types of workloads across ELT, analytics, and more. And then there's the rise of data apps. Not only are we trying to analyze more, but users are expecting faster data experiences. Gone are the days when waiting dozens of seconds for queries was acceptable. Standards have been set by the leading organizations in terms of performance and data availability, and customers expect the same experiences everywhere. A few seconds is too slow; sub-second to single-second experiences are a must and are the SLA standard across data apps, and users expect production-grade reliability and uptime from data apps.
And lastly is cost. Access to more data and limitless compute makes it easy to run up a [00:05:00] big bill, and compute costs in the cloud are not going down. So expanded data and limitless compute access can lead to greatly increased costs. To address the cost challenge, we need a new approach; more nodes is not always going to cut it.
[00:05:04] Firebolt #
Cody Scharz: So with that, Firebolt was born from a realization that we're just getting started with cloud data and what's possible, and we're here to build a modern data platform for builders of next-gen data experiences. So what do we mean by that, and how do we do it? First, Firebolt focuses on unbelievable speed and efficiency, designed to give data engineers the ability to architect sub-second analytics experiences at terabyte scale.
Firebolt does this through indexing and the most advanced query execution methodology; we'll drill into this more later. Second is modern cloud-native architecture. Recent cloud data warehouses set the standard and expectation here: you need decoupled storage and compute to get the most out of today's tools and experiences, and Firebolt is fully cloud native with decoupled storage and compute.
And Firebolt is built for developers and data engineers. The way we work with data has changed. We now have new [00:06:00] processes, tools, and approaches, and beyond that, different skill sets and backgrounds are getting involved with data. We're building a data warehouse that embraces these new practices and builds them natively into the Firebolt experience.
[00:06:13] Built for speed #
Cody Scharz: First, let's talk about speed and efficiency. The reason current cloud data warehouses are slow is that they have to move large partitions or blocks into their compute cache and have to scan them entirely, even when the query only needs a small subset of the data. So even though most queries only scan one to two percent of the data in those partitions or micro-partitions, data warehouses have to move and scan them entirely to find the small number of relevant records.
The way we were able to overcome this bottleneck is through sparse indexing. A sparse index points to small data ranges within files, which are much smaller than partitions or micro-partitions. These data ranges are the smallest unit that is moved and scanned, resulting in dramatically less data movement and fewer data scans compared to other query engines.[00:07:00]
So sparse indexing requires sorted data. This is what we call Triple F: on ingestion, Firebolt sorts, compresses, and applies sparse indexing. As a user, you don't have to worry about the details of the file format; it's all abstracted and managed for you behind the scenes. When you create a table in Firebolt, you specify the field or fields you want in your sparse index, and the files are physically sorted by them.
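For reference, this is roughly what that table DDL looks like; the table, columns, and index choice below are illustrative only, not taken from the demo.

```sql
-- Minimal sketch of a Firebolt fact table with a primary (sparse) index.
-- Table and column names are made up for illustration.
CREATE FACT TABLE events (
    event_date  DATE,
    campaign_id INT,
    user_id     BIGINT,
    spend       DOUBLE
)
PRIMARY INDEX event_date, campaign_id;
```

The columns named in PRIMARY INDEX determine how the files are physically sorted, which is what makes the sparse index effective.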
[00:07:23] Even faster with more index types #
Cody Scharz: See the primary index syntax in this code example here. And Firebolt gets even faster with more indexes. We're going to highlight two building blocks here: the aggregating index and the join index. So first, the aggregating index. Aggregating indexes are designed to be an easier way to use the same principles behind materialized views and other roll-up methods.
In Firebolt, aggregating indexes are another index type that gets applied to the table and optimizes query execution time. They're used when you have repeating query patterns and can prepare part or all of the results in advance. [00:08:00] You define them once, and can define however many you need on a table.
Then the optimizer will pick the best one at query time. Best part: there's no work for you to do; behind the scenes, these are automatically maintained by Firebolt. Next is the join index. Join indexes allow you to get denormalized scale and performance while using your favorite data modeling methodology. You no longer need to denormalize your tables to get good performance.
Much like aggregating indexes, you can build multiple join indexes on a table and the optimizer will pick the best join index. We can use join indexes to do query rewrites and really speed up a join. Furthermore, a join index helps replace full table scans with lookups and other operators that are much faster and less expensive.
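As a rough sketch of the DDL being described (the index, table, and column names here are hypothetical):

```sql
-- Aggregating index: precomputes aggregates for repeating query patterns.
CREATE AGGREGATING INDEX agg_spend_by_campaign ON events (
    campaign_id,
    event_date,
    SUM(spend),
    COUNT(DISTINCT user_id)
);

-- Join index: lets the optimizer replace a full scan of the dimension
-- table with fast lookups on the join key.
CREATE JOIN INDEX join_campaigns ON campaigns (
    campaign_id,    -- join key
    campaign_name   -- dimension columns read through the join
);
```

At query time the optimizer picks whichever matching index is cheapest; nothing in the query itself has to reference the indexes.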
So Firebolt is a modern cloud data warehouse architecture with decoupled storage and compute, designed to support all analytics workloads in a scalable and isolated manner. This means that, like other cloud data warehouses, you can load any data [00:09:00] once and reuse it across any compute, called engines in Firebolt. With a new use case, you can set up a new engine to support a new project without impacting other workloads.
[00:09:09] Built for data engineers and developers #
Cody Scharz: And this approach allows customers to start small and add new projects by spinning up a new engine. You can then right-size these engines for the workload based on compute needs. Now, as I mentioned earlier, we're focused on improving life for data engineers, analytics engineers, and developers, and Firebolt is designed to help with the challenges of modern data stacks.
First, we believe that data engineers and developers need and appreciate flexibility and control. In Firebolt you can do everything with SQL, including the management of engines, which gives you great control while keeping things as simple as SQL. Firebolt is also unique in giving you granular control over compute resources.
So you can more directly match your workload to its compute. Firebolt is also fully programmable via our REST APIs and our [00:10:00] Python SDK, with other SDKs coming soon, so you can programmatically create and manage a Firebolt environment with your orchestration tool of choice. The fact that you're a data engineer doesn't mean we want you to work too hard.
Firebolt is also about eliminating the grunt work. Scaling is just a few clicks or API calls away. Semi-structured data is natively supported with Lambda functions. Indexes are easily defined and managed behind the scenes automatically, so there's no manual intervention required. Queries are optimized so you don't have to do it. And lastly, you never have to worry about storage management and things like vacuuming.
[00:10:40] The future is data as code #
Cody Scharz: And so we believe the future is data as code, where we continue to push the limits on bringing the best practices from software development to data engineering. There are a few dimensions to embracing a data-as-code product philosophy. First is DataOps. What DataOps means to us is bringing DevOps best practices like [00:11:00] version control, automated testing, and first-class environments.
This allows teams to collaborate more effectively, experiment, and gain confidence before changing their pipelines. With data, as we all know, this gets tricky, and rollback is much harder than in software development. But a lot of these current challenges can be improved. This also includes better enabling CI and CD pipelines for data.
This is something dbt is really at the forefront of, but overall there's still a ways to go. dbt has inspired many teams to adopt version control, add testing, and use different environments for production and development work. With automated end-to-end testing, which helps you quickly identify the potential negative downstream impact of any change, a data warehouse that is more aware of your tests can help streamline your workflow and improve the experience for data engineers.
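The kind of test being described already exists in dbt today; a minimal sketch of a singular data test (the model and column names are hypothetical) looks like this:

```sql
-- tests/assert_no_negative_ltv.sql
-- A dbt singular test: it passes when the query returns zero rows.
-- Model and column names are illustrative, not from the talk.
select
    user_id,
    lifetime_value
from {{ ref('ltv') }}
where lifetime_value < 0
```

Running `dbt test` in CI before promoting a change is the workflow the speaker is pointing at; the suggestion here is that the warehouse itself could become more aware of these tests.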
And we believe the days where you manually provision your data warehouse infrastructure, objects, and roles will soon be [00:12:00] gone, fully replaced by declarative tools like Terraform or more imperative infrastructure-as-code tools like Pulumi. We are committed to making Firebolt the easiest data warehouse to deploy via code, with a native experience.
And last but not least, we believe in code modularity. Being able to define templates and reuse code elsewhere is a key building block of a long-standing platform. That's why our bet is on dbt. Code modularity helps in breaking down complex models, testing individual components, debugging in isolation, and reusing those units of work downstream.
So with that, I'll hand it over to Kevin to talk about the Firebolt dbt adapter.
[00:12:42] Firebolt's dbt adapter #
Kevin Marr: Thanks, Cody. So let's, I guess we'll switch it over to my screen share. Great, thanks again. It's with great pleasure that I get to introduce Firebolt's brand new [00:13:00] dbt adapter. The ecosystem team at Firebolt has been hard at work building this for the last couple of months, and we're just really excited to be able to contribute this to the dbt community and also, obviously, to provide the benefits of dbt to our own customers.
What I want to do with y'all is walk through the highlights of the adapter and also give you a sneak peek: jump into the actual product itself and do a live demo.
The goal with the dbt adapter for Firebolt wasn't just to check the box and then move on. We really want to continue to invest in this integration and this relationship so that we're materially enhancing the dbt experience when you're using it on top of Firebolt. And even today, we're just getting started, but there are three things specifically that I wanted to point out that we're really proud of.
Kevin Marr: First, and perhaps most [00:14:00] obviously, is that Firebolt is really fast, right? That's our primary differentiator. And if you think about it, every analytics engineer and every data engineer, their primary job is making sure they're giving the best data experience to their end users. And the speed at which queries get back to you is a really big part of that experience.
Second is how indexes are handled in the adapter. You saw from Cody's presentation that we have some novel indexes that help us get to those really fast queries, and we've baked those into the dbt adapter in a very native way; I'll talk more about that in the following slides. And then third, perhaps it's not a differentiator per se, or it's not unique to the Firebolt adapter, but we're really big fans of the dbt external tables package, because it actually extends dbt's scope of [00:15:00] operation beyond just, I'll say, intra-database transformation, right?
dbt external tables allows you to actually do the ingestion into the data warehouse as well, by reading the data from S3 and loading it into external tables. Firebolt itself also has external tables, so this was a really obvious integration for us, and we're really proud that we have this feature, which was not at all a hard requirement for our first release.
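In Firebolt, that ingestion path starts with an external table over S3; a rough sketch of the DDL (the bucket, file pattern, and column names are made up for illustration, and credentials are omitted) might look like:

```sql
-- Illustrative external table over Parquet files in S3.
-- In practice you would also supply CREDENTIALS for the bucket.
CREATE EXTERNAL TABLE ex_ltv (
    event_date DATE,
    user_id    BIGINT,
    spend      DOUBLE
)
URL = 's3://my-demo-bucket/ltv/'
OBJECT_PATTERN = '*.parquet'
TYPE = (PARQUET);
```

The dbt-external-tables package can then manage objects like this from within the dbt project, so ingestion and transformation live in the same repo.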
So I talked about indexes. Remember these from earlier: we have our aggregating indexes and join indexes. These are just examples of the DDL in Firebolt to create these things. And when you're using dbt, you can do it the dbt way. So if you're familiar with dbt, this should look familiar to you, but now we have these novel, proprietary index types baked right into the model config.
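A hedged sketch of what such a model config can look like is below; the exact option names are the adapter's to define, so treat the keys here as approximations rather than the documented interface, and the columns as hypothetical.

```sql
-- models/ltv.sql (illustrative; config keys are approximate, not verbatim)
{{ config(
    materialized  = 'table',
    table_type    = 'fact',
    primary_index = ['event_date', 'campaign_id'],
    indexes = [
        {
          'index_type':  'aggregating',
          'key_columns': ['campaign_id', 'event_date'],
          'aggregation': ['sum(spend)', 'count(distinct user_id)']
        }
    ]
) }}

select * from {{ source('s3_demo', 'ex_ltv') }}
```

The point is that the indexes live in version control right next to the model that owns them.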
So obviously, by having this version controlled, dbt will take care of all the DDL. It'll take care of all the lifecycle management of [00:16:00] these indexes when you drop the tables and rebuild them and all of that. But I think what's really interesting is the way we handle aggregating indexes.
So in Firebolt, an aggregating index is conceptually similar to, say, a materialized view with a query, right? But we decided not to expose this as an actual relation or an actual table. It's actually exposed as an index in Firebolt, which we think is conceptually better aligned with the way that you're thinking about the purpose of a materialized view.
If it's just to accelerate queries, indexes should be for accelerating queries. It shouldn't add noise to your overall data schema or the dimensional model that you have. And likewise, that concept is carried into the dbt experience as well. I don't have to create additional models to express aggregating indexes; instead, I just add them as indexes like any other index.
And my [00:17:00] DAG doesn't have these materialized views adding noise and taking away from the pure dimensional model of my data. Great. So with that, you might have noticed I'm sharing my whole screen. I want to walk through a demo, and I'm going to jump to my next tab over here, which is actually Looker, which is near and dear to my heart.
I used to work at Looker; I did product management for LookML, the data modeling layer, which I know a lot of dbt folks out there are familiar with. But I'm not here to do a demo of Looker, obviously. The reason I have it open is just to acquaint you all with the dataset that I'm going to be using, which is an ad tech dataset.
It's actually real data that we obfuscated; it was donated to us by one of our customers, and it's a pretty big dataset. And I have a dashboard here. In this demo, we've turned off all caching in Looker, so I can simply hit the refresh button and all these queries are going to run in Firebolt in real time; you can see them go from faded to full resolution.
They come into full clarity [00:18:00] as the queries refresh. I can go in here and I can change the filters. Again, these are obfuscated values, but say I want to slice this by an entirely different media source over this ad tech data; I can do that and it's updated, and I'm getting very low-latency feedback on this massive dataset, again all powered by Firebolt under the hood. Now I'll jump into the Firebolt UI itself.
We can take a look at this data a little more closely. So here I'm in the Firebolt app. On the right I can see my schema, and let's take a look at how big this LTV table is. So I'll run this. I always find it funny how we round to the nearest hundredth of a second.
So we get 0.00 seconds, but there are 32 billion rows in this table. And if I actually show the tables, you can see the size. So I'll run that and go to the bottom of this result. We look at our LTV table here at the very bottom of the screen, and you can see the size uncompressed and compressed. So it's 17 terabytes, or tebibytes, [00:19:00] uncompressed, and it compresses down to a terabyte when it is compressed. We're crunching through a good amount of data here. And again, this was just one of the queries, copied and pasted from Looker, which I can run here just to show you the actual measured runtime. And we're returning these queries in about 0.19 seconds.
This is all, again, powered by sparse indexes and all of the primary indexes, join indexes, and aggregating indexes that we have in Firebolt. Anyway, let's talk about dbt now for a second. So with all of this data that we have in here, here's the DDL that we used to populate this demo.
Of course, we ingest data using external tables from our S3 buckets. From there, we create fact tables and dimension tables and our indexes and all that. This is pretty standard-looking DDL. But if I hop into dbt, I'll actually show you our project; so here's [00:20:00] my dbt project for the demo.
And again, I can see these familiar tables that we were analyzing in Looker, now here in dbt. Here I have my account table with my join index. I'll jump to my LTV table, which is a fact table in Firebolt where I've defined my aggregating index. And we're also taking advantage of that dbt external tables package.
So here I have some sources defined: what buckets they come from, what the schema is. So let's actually go through all this. Let's run it; let's build it into Firebolt. Before I do, let's just jump back over to the Firebolt app in another tab. I have a brand new database here with nothing inside.
You can see on the right it's completely empty. So let's first just load in these external tables. So I'm in dbt, and I'll do my command to run the external tables.
[00:21:00] That's plugging away. There's one, there's another one, and there's the third. Great. So I'll jump back into Firebolt and refresh this page to get the updated schema view. Okay, and lo and behold, here are my external tables. I can go and take a look at these. Let's just verify we've got the data in there: select star from the account table (gotta love the autocomplete).
Great. So I've got my data flowing in, and of course an external table is not the fully optimized table format we want. We have to get these into fact and dimension tables according to our model. So let's jump back into dbt and I will do a dbt run.
And now we're taking those external tables and running through all the DDL needed to create the fact and dimension tables based on those external datasets.[00:22:00]
Great. So that's all done. And again, I'll jump back into the UI here in Firebolt and take a look.
Okay. So I've got my new tables, and you can see from my profile (which I didn't show you, since it actually has my password) that we've prefixed these table names, so you get developer isolation while you're all working on dbt side by side; you can see my initials, "cam", on all these tables. I can look at these; let's look at LTV so I can see some of the data that flows through there.
That's great. Let's look at the indexes that were created. Cool, so you can see all those things that we defined in our dbt project. And I can run something like that same Looker query I had; let's just run this guy. Cool. And again, we're getting that sub-second response, this time 0.10 seconds, to do some [00:23:00] analysis right on this data.
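For a sense of the query shape involved (the table and columns below are hypothetical, not the actual demo query), a dashboard-style aggregation over the fact table might look like:

```sql
-- Illustrative only: table and column names are made up.
SELECT
    media_source,
    SUM(spend)              AS total_spend,
    COUNT(DISTINCT user_id) AS users
FROM ltv
WHERE event_date >= '2021-01-01'
GROUP BY media_source
ORDER BY total_spend DESC;
```

When a matching aggregating index exists, the optimizer can answer a query like this from the precomputed aggregates instead of scanning the full 32-billion-row table, which is what makes the sub-second runtimes possible.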
So that's hopefully illustrative of the soup-to-nuts workflow of dbt on Firebolt. Thanks for watching.
Last modified on: Apr 19, 2022