Table of Contents
- No silver bullets: Building the analytics flywheel
- Identity Crisis: Navigating the Modern Data Organization
- Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- Down with 'data science'
- Refactor your hiring process: a framework
- Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- To All The Data Managers We've Loved Before
- From Diverse "Humans of Data" to Data Dream "Teams"
- From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- New Data Role on the Block: Revenue Analytics
- Data Paradox of the Growth-Stage Startup
- Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- Keynote: How big is this wave?
- Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- The Future of Analytics is Polyglot
- The modern data experience
- Don't hire a data engineer...yet
- Keynote: The Metrics System
- This is just the beginning
- The Future of Data Analytics
- Coalesce After Party with Catalog & Cocktails
- The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- Built It Once & Build It Right: Prototyping for Data Teams
- Inclusive Design and dbt
- Analytics Engineering for storytellers
- When to ask for help: Modern advice for working with consultants in data and analytics
- Smaller Black Boxes: Towards Modular Data Products
- Optimizing query run time with materialization schedules
- How dbt Enables Systems Engineering in Analytics
- Operationalizing Column-Name Contracts with dbtplyr
- Building On Top of dbt: Managing External Dependencies
- Data as Engineering
- Automating Ambiguity: Managing dynamic source data using dbt macros
- Building a metadata ecosystem with dbt
- Modeling event data at scale
- Introducing the activity schema: data modeling with a single table
- dbt in a data mesh world
- Sharing the knowledge - joining dbt and "the Business" using Tāngata
- Eat the data you have: Tracking core events in a cookieless world
- Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- Batch to Streaming in One Easy Step
- dbt 101: Stories from real-life data practitioners + a live look at dbt
- The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- Implementing and scaling dbt Core without engineers
- dbt Core v1.0 Reveal ✨
- Data Analytics in a Snowflake world
- Firebolt Deep Dive - Next generation performance with dbt
- The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- dbt, Notebooks and the modern data experience
- You don’t need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- Git for the rest of us
- How to build a mature dbt project from scratch
- Tailoring dbt's incremental_strategy to Artsy's data needs
- Observability within dbt
- The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- So You Think You Can DAG: Supporting data scientists with dbt packages
- How to Prepare Data for a Product Analytics Platform
- dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- Upskilling from an Insights Analyst to an Analytics Engineer
- Building an Open Source Data Stack
- Trials and Tribulations of Incremental Models
Firebolt Deep Dive - Next generation performance with dbt
Tune in with Kevin Marr from Firebolt to learn about current trends in data warehousing and how they will impact tomorrow’s data workflows.
Browse this talk’s Slack archives
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript
Elize Papineau: [00:00:00] Hello, thank you for joining us at Coalesce. My name is Elize Papineau. I am a senior analytics engineer at dbt Labs, and I’ll be the host of this session. The title of the session is Firebolt deep dive, next generation performance with dbt, and we’ll be joined by Kevin Marr and Cody Scharz from Firebolt. Most excitingly, today they’ll be announcing a new dbt adapter and giving us an intro to Firebolt. Before we get started, I just want to shout out that all chat conversation is going to be taking place in the #coalesce-firebolt channel of dbt Slack. If you’re not a part of the chat, you have time to join right now: please visit community.dbt.com and search for #coalesce-firebolt when you enter the space.
We encourage you to interact with other attendees, ask questions as they come up, and make comments and react [00:01:00] throughout the session. After the session, the speakers will be available in the Slack channel to answer your questions, but we encourage you to post those questions as they come up during the session.
Let’s get started. Over to you, Kevin and Cody.
Cody Scharz: All right, thank you. Hello everyone. I’m Cody, and I’m a product manager at Firebolt. I am joined by Kevin, who is head of ecosystem product at Firebolt, where he’s working on building out integrations with Firebolt across the stack, and specifically dbt. So what are we going to be talking about today?
First, we want to spend some time talking about where the data warehouse market is, how far it’s come over the last decade, and some of the challenges that are still present. With that understanding, we’ll tell you more about Firebolt, how Firebolt is different, and why we’re building a new data warehouse.
After that, Kevin will give you a [00:02:00] sneak peek into the Firebolt dbt adapter and show a demo of dbt running with Firebolt. So looking back, it was a great decade for cloud analytics. The industry made huge strides across a wide range of dimensions, which has led to an explosion in data usage and growth of the industry, and likely plays a big part in why many of us are working with data today.
[00:02:26] A great decade for cloud analytics
Cody Scharz: We wanted to reflect on a few of the big steps. Redshift was the first cloud data warehouse, released in 2012, and Amazon showed that data warehouses can be SaaS, such that setting them up is just a few clicks and not a multi-month project; giving you the ability to scale the warehouse easily provides tremendous value. Databricks leveraged Spark and popularized distributed computing on data in the cloud,
Cody Scharz: and also simplified and improved distributed analytics over the MapReduce era. BigQuery introduced a serverless data warehouse, followed [00:03:00] shortly by Athena, which simplified things even further and proved really well suited for some workloads, like ad hoc analytics. And Snowflake popularized decoupled storage and compute with a great, easy-to-use experience that gave the user many of the right abstractions for managing a data warehouse.
With all of that advancement, ELT became the preferred way to manage transformations. This was unlocked by all the work done to improve the scale of the data warehouse, and by the rise of dbt. We’re all here today because of how powerful dbt is, and how the pairing of a cloud data warehouse with dbt is a total game changer.
[00:03:36] Advances bring new challenges
Cody Scharz: But all of these advances have also created new demands and new data challenges. We expect more from our analytics today, and use cases are becoming more operational in nature, with customers expecting these use cases to perform. We’re relying on ever-growing data sets, and the terabyte is slowly but surely becoming the new gigabyte.
If that’s not the case for you, just wait, as it will be. [00:04:00] Increases in data velocity can quickly lead to a lot more data captured, and more data and more sources mean more types of workloads across ELT, analytics, and more. And then there’s the rise of data apps. Not only are we trying to analyze more data, but users are expecting faster data experiences. Gone are the days when waiting
dozens of seconds for queries was acceptable. Standards have been set by the leading organizations in terms of performance and data availability, and customers expect the same experiences everywhere. A few seconds is too slow; sub-second to single-second experiences are a must, SLAs are standard across data apps, and users expect production-grade reliability and uptime from data apps.
And lastly there’s cost. Access to more data and limitless compute makes it easy to run up a big bill. Compute costs in the cloud are not going down, so expanded data and limitless compute access can lead to greatly increased costs. To address the cost challenge, we [00:05:00] need a new approach; more nodes is not always going to cut it.
[00:05:04] Firebolt
Cody Scharz: So with that, Firebolt was born from a realization that we’re just getting started with cloud data and what’s possible, and we’re here to build a modern data platform for builders of next-gen data experiences. So what do we mean by that, and how do we do it? First, Firebolt focuses on unbelievable speed and efficiency, designed to give data engineers the ability to architect sub-second analytics experiences at terabyte scale.
Firebolt does this through indexing and using the most advanced query execution methodology; we’ll drill into this more later. Second, a modern cloud-native architecture. Recent cloud data warehouses set the standard and expectation here: you need decoupled storage and compute to get the most out of today’s tools and experiences, and Firebolt is fully cloud native, with decoupled storage and compute.
And third, Firebolt is built for developers and data engineers. The way we work with data has changed. We now have new [00:06:00] processes, tools, and approaches, and beyond that, different skill sets and backgrounds are getting involved with data. We’re building a data warehouse that embraces these new practices and builds them natively into the Firebolt experience.
[00:06:13] Built for speed
Cody Scharz: First, let’s talk about speed and efficiency. The reason current cloud data warehouses are slow is that they have to move large partitions or blocks into their compute cache and scan them entirely, even when the query only needs a small subset of the data. So even though most queries only scan one to two percent of the data, and even when those partitions are micro-partitions, data warehouses have to move and scan them entirely to find the small number of relevant records.
The way we were able to overcome this bottleneck is through sparse indexing. A sparse index points to small data ranges within files, which are much smaller than partitions or micro-partitions. These data ranges are the smallest unit that is moved and scanned, resulting in dramatically less data movement and fewer data scans compared to other query engines.[00:07:00]
Sparse indexing requires sorted data. On ingestion, Firebolt sorts, compresses, and applies sparse indexing to the data in its own file format. As a user, you don’t have to worry about the details of the file format; it’s all abstracted away and managed for you behind the scenes. When you create a table in Firebolt, you specify the field or fields you want in your primary index, and the files are physically sorted by them.
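As a sketch of what that looks like (the table and column names here are made up, and the exact DDL may differ across Firebolt versions), the primary index is declared as part of the table definition:

```sql
-- Hypothetical fact table. Firebolt physically sorts the underlying
-- files by the PRIMARY INDEX columns, which is what lets the sparse
-- index prune down to small data ranges instead of whole partitions.
CREATE FACT TABLE rankings (
    country   TEXT,
    domain    TEXT,
    pageviews BIGINT
) PRIMARY INDEX country, domain;
```

Queries that filter on `country` (or `country` and `domain`) can then skip straight to the matching data ranges.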
[00:07:23] Even faster with more index types
Cody Scharz: You can see the primary index syntax in the code example on the slide. And Firebolt gets even faster with more index types. We’re going to highlight two building blocks here: the aggregating index and the join index. First, the aggregating index. Aggregating indexes are designed to be an easier way to use the same principles behind materialized views and other roll-up methods.
In Firebolt, aggregating indexes are another index type that gets applied to a table to optimize query execution time. They’re used when you have repeating query patterns and can precompute part or all of the results. [00:08:00] You define them once, and you can define however many you need on a table,
and then the optimizer will pick the best one at query time. Best part: there’s no work for you to do. Behind the scenes, these are automatically maintained by Firebolt. Next is the join index. Join indexes allow you to get denormalized scale and performance while using your favorite data modeling methodology. You no longer need to denormalize your tables to get good performance.
Much like aggregating indexes, you can build multiple join indexes on a table, and the optimizer will pick the best join index. We use join indexes to do query rewrites and really speed up a join. Furthermore, a join index helps replace full table scans with lookups and other operators that are much faster and less expensive.
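As a rough illustration of these two index types (names are hypothetical, and the exact syntax may vary by Firebolt version), the DDL looks something like:

```sql
-- Hypothetical aggregating index: precomputes roll-ups so repeated
-- aggregation queries can be answered without scanning the fact table.
CREATE AGGREGATING INDEX rankings_agg ON rankings (
    country,
    SUM(pageviews),
    COUNT(*)
);

-- Hypothetical join index: keeps dimension columns keyed by the join
-- column, so joins become lookups instead of full table scans.
CREATE JOIN INDEX domains_join ON dim_domains (
    domain_id,
    domain_name
);
```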
Firebolt has a modern cloud data warehouse architecture with decoupled storage and compute, designed to support all analytics workloads in a scalable and isolated manner. This means, like other cloud data warehouses, you can load data [00:09:00] once and reuse it across any compute, called engines in Firebolt. With a new use case, you can set up a new engine to support a new project without impacting other workloads.
[00:09:09] Built for data engineers and developers
Cody Scharz: And this approach allows customers to start small and add new projects by spinning up a new engine. You can then right-size these engines for the workload based on compute needs. Now, as I mentioned earlier, we’re focused on improving life for data engineers, analytics engineers, and developers, and Firebolt is designed to help with the challenges of modern data stacks.
First, we believe that data engineers and developers need and appreciate flexibility and control. In Firebolt, you can do everything with SQL, including the management of engines, which gives you great control while keeping things as simple as SQL. Firebolt is also unique in giving you granular control over compute resources,
so you can more directly match your workload to its compute. Firebolt is also fully programmable via our REST APIs and our [00:10:00] Python SDK, with other SDKs coming soon, so you can programmatically create and manage a Firebolt environment with your orchestration tool of choice. The fact that you’re a data engineer doesn’t mean we want you to work too hard:
Firebolt is also about eliminating the grunt work. Scaling is just a few clicks or API calls away. Semi-structured data is natively supported, with Lambda functions that give you full flexibility. Indexes are easily defined and managed behind the scenes automatically, so there’s no manual intervention required. Queries are optimized so you don’t have to. Lastly, you never have to worry about storage management and things like vacuuming.
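To make the semi-structured point concrete, here is a hypothetical query using a Lambda expression over a nested array column (the table, columns, and function usage are illustrative; check Firebolt's documentation for the exact array functions it ships):

```sql
-- Hypothetical events table with an ARRAY(TEXT) column of tags.
-- The Lambda passed to TRANSFORM is applied to each array element
-- in place, without flattening the data into a separate table first.
SELECT
    visit_id,
    TRANSFORM(t -> UPPER(t), tags) AS normalized_tags
FROM events;
```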
[00:10:40] The future is data as code
Cody Scharz: And so we believe the future is data as code, where we continue to push the limits on bringing the best practices from software development to data engineering. There are a few dimensions to embracing a data-as-code product philosophy. First is DataOps. What DataOps means to us is bringing DevOps best practices, like [00:11:00] version control, automated testing, and first-class environments,
to data, allowing teams to collaborate more effectively, experiment, and gain confidence before changing their pipelines. With data, as we all know, this gets tricky, and rollback is much harder than in software development. But a lot of these current challenges can be improved. This also includes better enabling CI and CD pipelines in data,
something dbt is really at the forefront of, but overall there’s still a ways to go. dbt has inspired many teams to adopt version control, add testing, and use different environments for production and development work. Automated end-to-end testing helps you quickly identify potential negative downstream impacts of any change, and a data warehouse that is more aware of your tests can help streamline your workflow and improve the experience for data engineers.
And we believe the days where you manually provision your data warehouse infrastructure, objects, and roles will soon be [00:12:00] gone, fully replaced by declarative tools like Terraform or more imperative infrastructure-as-code tools like Pulumi. We are committed to making Firebolt the easiest warehouse to deploy via code, with a native experience.
And last but not least, we believe in code modularity. Being able to define templates and reuse code elsewhere is a key building block of a long-standing platform. That’s why our bet is on dbt: code modularity helps in breaking down complex models, testing individual components, debugging in isolation, and reusing those units of work downstream.
So with that, I’ll hand it over to Kevin to talk about the Firebolt dbt adapter.
[00:12:42] Firebolt’s dbt adapter
Kevin Marr: Thanks, Cody. So let’s switch it over to my screen share. Great, thanks again. It’s with great pleasure that I get to introduce Firebolt’s brand new [00:13:00] dbt adapter. The ecosystem team at Firebolt has been hard at work building this for the last couple of months, and we’re just really excited to be able to contribute this to the dbt community, and also, obviously, to provide the benefits of dbt to our own customers.
What I want to do with y’all is walk through the highlights of the adapter and also show you a sneak peek: jump into the actual product itself and do a live demo.
The goal with the dbt adapter for Firebolt wasn’t just to check the box and then move on. We really want to continue to invest in this integration and this relationship so that we materially enhance the dbt experience when you’re using it on top of Firebolt. Even today we’re just getting started, but there are three things specifically that I wanted to point out that we’re really proud of.
Kevin Marr: First, and perhaps most [00:14:00] obviously, is that Firebolt is really fast, right? That’s our primary differentiator. And if you think about it, the primary job of every analytics engineer and every data engineer is giving the best data experience to their end users, and the speed at which queries get back to you is a really big part of that experience.
Second is how indexes are handled in the adapter. You saw from Cody’s presentation that we have some novel indexes that help us get to those really fast queries, and we’ve baked those into the dbt adapter in a very native way; I’ll talk more about that in a moment. And then third — perhaps it’s not a differentiator per se, or not unique to the Firebolt adapter — we’re really big fans of the dbt external tables package, because it extends dbt’s scope of [00:15:00] operation beyond just, I’ll say, intra-database transformation.
dbt external tables allows you to do the ingestion into the data warehouse as well, by reading the data from S3 and loading it into external tables. Firebolt itself also has external tables, so this was a really obvious integration for us, and we’re really proud that we have this feature, which was not at all a hard requirement for our first release.
So I talked about indexes. Remember these from earlier: we have our aggregating indexes and join indexes, and these are examples of the DDL in Firebolt to create them. When you’re using dbt, you can do it the dbt way. If you’re familiar with dbt, this should look familiar to you, but now we have these novel, proprietary index types baked right into the model config.
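As a sketch of what that model config can look like (the model body and exact config keys here are illustrative; check the dbt-firebolt adapter's documentation for the current options):

```sql
-- models/fact_ltv.sql (hypothetical model)
{{ config(
    materialized = 'table',
    table_type = 'fact',
    primary_index = ['customer_id', 'event_date'],
    indexes = [
        {
            'type': 'aggregating',
            'key_columns': ['customer_id'],
            'aggregation': ['sum(revenue)', 'count(*)']
        }
    ]
) }}

select customer_id, event_date, revenue
from {{ ref('stg_events') }}
```

With the index definitions living in the model config, they are version controlled alongside the model and recreated whenever the table is rebuilt.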
Obviously, by having this version controlled, dbt will take care of all the DDL and all the lifecycle management of [00:16:00] these indexes when you drop the tables and rebuild them, and all of that. But I think what’s really interesting is the way we handle aggregating indexes.
In Firebolt, an aggregating index is conceptually similar to a materialized view with a query, right? But we decided not to expose it as an actual relation or an actual table; it’s exposed as an index in Firebolt, which we think is conceptually better aligned with the purpose of a materialized view.
If it’s just there to accelerate queries, it should be an index, because indexes are for accelerating queries. It shouldn’t add noise to your overall data schema or the dimensional model that you have. Likewise, that concept is parlayed into the dbt experience as well: I don’t have to create additional models to express an aggregating index; instead, I just add them as indexes, like any other index,
and my [00:17:00] DAG doesn’t have these materialized views adding noise and taking away from the pure dimensional model of my data. So with that — you might have noticed I’m sharing my whole screen — I want to walk through a demo, and I’m going to jump to my next tab over here, which is actually Looker, which is near and dear to my heart.
I used to work at Looker; I did product management for LookML, the data model, which I know a lot of dbt folks out there are familiar with. But I’m not here to do a demo of Looker, obviously. The reason I have it open is just to acquaint you all with the dataset that I’m going to be using, which is an ad tech dataset.
It’s actually real data — we obfuscated it; it was donated to us by one of our customers — and it’s a pretty big dataset. And I have a dashboard here. In this demo we’ve turned off all caching in Looker, so I can simply hit the refresh button and all these queries will run in Firebolt in real time. You can see them go from faded to full resolution,
to full clarity, [00:18:00] as the queries refresh. I can go in here and change the filters. Again, these are obfuscated values, but say I want to slice this by an entirely different media source across this ad tech data: I can do that, it’s updated, and I’m getting very low-latency feedback on this massive dataset, again all powered by Firebolt under the hood. Now I’ll jump into the Firebolt UI itself,
and we can take a look at this data a little more closely. So here I’m in the Firebolt app. On the right I can see my schema. Let’s take a look at how big this LTV table is. I’ll run this — I always find it funny how we round to the nearest hundredth of a second,
so we get 0.00 seconds — and it’s 32 billion rows in this table. And if I actually show the tables, you can see the size. I’ll scroll to the bottom of this result and look at our LTV table here at the very bottom of the screen. You can see the size: it’s 17 terabytes (tebibytes, really) [00:19:00] uncompressed, and it compresses down to a terabyte. So we’re crunching through a good amount of data here. And again, this is just one of the queries copied and pasted from Looker, which I can run here just to show you the actual measured runtime: we’re returning these queries in about 0.19 seconds.
This is all powered by the sparse indexes — the primary indexes, join indexes, and aggregating indexes that we have in Firebolt. Anyway, let’s talk about dbt now for a second. With all of this data that we have in Firebolt, here’s the DDL that we used to populate this demo.
Of course, we ingest the data using external tables reading from our S3 buckets. From there, we create fact tables and dimension tables and our indexes and all that. This is pretty standard-looking DDL. But now let me hop into dbt — I’ll show you our project. So here’s [00:20:00] my dbt project for the demo,
and again, I can see the familiar tables that we were analyzing in Looker, now here in dbt. Here’s my account table with my join index. I’ll jump to my LTV table, which is a fact table in Firebolt where I’ve defined my aggregating index. And we’re also taking advantage of that dbt external tables package,
so here I have some sources defined: which buckets they come from, what the schema is. So let’s actually go through all this: let’s run it and build it into Firebolt. Before I do, let’s just jump back over to the Firebolt app in another tab. I have a brand new database here with nothing inside;
you can see on the right it’s completely empty. So let’s first load in these external tables. I’m in dbt, and I’ll run my command to stage the external tables.
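For context, a source definition for the dbt external tables package targeting Firebolt might look roughly like this (the bucket, file format, and column names are placeholders, and the exact properties may differ by adapter version):

```yaml
# models/staging/sources.yml (hypothetical)
version: 2

sources:
  - name: s3_raw
    tables:
      - name: ltv_external
        external:
          url: 's3://my-demo-bucket/ltv/'   # placeholder bucket
          object_pattern: '*.parquet'
          type: 'parquet'
        columns:
          - name: customer_id
            data_type: bigint
          - name: revenue
            data_type: double
```

The external tables are then created with the package's run-operation (`dbt run-operation stage_external_sources`), after which a regular `dbt run` builds the fact and dimension models on top of them.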
[00:21:00] That’s plugging away — there’s one, there’s another one, and there’s the third. Great. So I’ll jump back into Firebolt and refresh this page to get the updated schema view. Okay, and lo and behold, here are my external tables. I can go and take a look at these. Let’s just verify we’ve got the data in there: select star from the account table (gotta love the autocomplete).
Great, so I’ve got my data flowing in. Of course, an external table is not the fully optimized table format; we want to get these into fact and dimension tables according to our model. So let’s jump back into dbt, and I will run dbt run.
And now we’re taking those external tables and running through all the DDL needed to create the fact and dimension tables based on those external datasets.[00:22:00]
Great, so that’s all done. Again, I’ll jump back into the UI here in Firebolt and refresh.
Okay, so I’ve got my new tables. You can see — from my profile, which I didn’t show you because it actually has my password — that we’ve prefixed these table names, so you can get developer isolation while you’re all working on dbt side by side. I’ve added my initials, “cam,” to all these tables. I can look at these — let’s look at LTV, so I can see some of the data that flows through there.
That’s great. Let’s look at the indexes that were created. Cool — you can see all those indexes that we defined in our dbt project. And I can run something like that same Looker query; let’s just run this guy. Cool. And again, we’re getting that sub-second response, this time 0.10 [00:23:00] seconds, to do some analysis right on this data.
So that’s hopefully illustrative of the soup-to-nuts workflow of dbt on Firebolt. Thanks for watching.
Last modified on: Nov 29, 2023