Batch to Streaming in One Easy Step
In this talk we will demonstrate how to get started with streaming analytics using the same analytics engineering skillset that keeps you productive in batch analytics.
Traditionally, streaming analytics has required a knowledge base, tooling, and ecosystem separate from batch analytics, and a comparatively poorer one, with far less tooling.
We believe that the future of streaming is the same as that of batch analytics - using dbt and SQL for modeling and working with your data. In this talk, we’ll walk through how Materialize helps you build a SQL-first streaming pipeline, as well as go through real-life examples of how Materialize users are building streaming pipelines in production with the very same dbt models they built for their batch data.
Browse this talk’s Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
Amy Chen: [00:00:00] Hello, everybody. Welcome to the second half of Day 2 of Coalesce. Hopefully you’re all feeling very energized about all the spicy talks or maybe slightly called out for your job title. I’m Amy Chen. I’m a senior partner engineer at dbt Labs. I use they/them pronouns and I will be your host for the session, which is, Batch to Streaming in One Easy Step.
The topic at hand is something very near and dear to my heart, so I hope you'll make some noise for Arjun, who is the CEO of Materialize, and Emily, who is the data platform lead at Drizly. They'll be showing us how Materialize makes streaming analytics accessible to analytics engineers.
So quick housekeeping before we get started. All of the chat conversations will take place in the Coalesce Materialize channel in dbt Slack. If you are not already part of the Slack, you have plenty of time to join. Visit community.getdbt.com and search for coalesce-materialize. When you enter the space in the Slack [00:01:00] channel, we welcome you to ask questions, comment, and react at any point.
If there are not any spicy memes about batch versus streaming in Slack, why are you even here? Kidding. Kidding. After the session our two speakers will be available in the Slack channel to answer your questions. However, we highly encourage you to ask any questions at any point during the session. Now let’s get right to it, over to you, Arjun.
Arjun Narayan: Thank you so much, Amy. I’m really glad to be here. Hi everyone. I’m Arjun Narayan, the co-founder and CEO of Materialize. Materialize is a streaming database for real-time applications and analytics, built on top of a next-generation stream processor called timely dataflow. And today I’m excited to show you how you can move from batch to streaming in just one easy step, all with just SQL (and love).

Now, until now, if you’ve been familiar with streaming, it has not seemed very easy or approachable. But with modern streaming technology, I’m here to tell you that streaming can be [00:02:00] just as easy to use as batch. In fact, it’s so easy that you can move from batch to streaming in just one easy step. Now, before I convince you, I’ll go over a little bit of the history and the differences between batch and streaming, how and why we started using batch processing, and how new streaming technologies like the one at the heart of Materialize unlock real-time production use cases for analytics.
And at that point, if you’re still a little skeptical, I’m going to hand the mic over to Emily Hawkins, the data platform lead at Drizly. Emily is going to present her firsthand experience of moving Drizly’s abandoned cart reactivations from a batch process to a streaming process using just SQL and Materialize.
[00:02:43] Defining Batch Processing #
Arjun Narayan: Now, to make sure we’re all on the same page before I get started, let’s work with some definitions. Let’s define batch processing and stream processing. Batch processing is a process that reads some amount of input data (it could be very large), performs some calculations on that data (they could be very complex), and then outputs [00:03:00] some specific results. That result might take a minute to compute, an hour to compute, or a whole night.

The important thing to note here is that data processing does not really occur between batches. Batch processes produce discrete groups of outputs: a single input, a single run, a single output. For those of you who are already familiar with dbt, every time you dbt run your models, you’re triggering a batch transformation of your input data to produce a single set of outputs.
[00:03:28] Defining Stream Processing #
Arjun Narayan: Your data will not be transformed between dbt runs. You’ve got to click it again. Now, stream processing, unlike batch processing, processes data continually and on no particular schedule. Anytime new data streams into your database, those rows are automatically transformed and update the output. You don’t wait for the next dbt run to know how these new values affect your process’s output.
Instead, that output is continually transformed whenever new data arrives. So with batch processing, you are responsible for triggering or scheduling transformations. [00:04:00] But with stream processing, your streaming database is responsible for transforming your data in real time. That means if you use a streaming database with dbt, you start off by dbt running your models, but that’s it.
Then it happens continuously, forever after. From these definitions alone, it’s pretty clear to me that we should all want to work with streaming data.
From an orchestration standpoint, just think of all the time and effort spent figuring out what needs running again, or reasoning about which rows were included in each of your micro-batches. Instead, as new data arrives, your streaming database will automatically transform each new row for you. From a product standpoint, there’s almost no use case that wouldn’t benefit from correct real-time data.

We may have talked ourselves out of wanting real-time data for reasons of difficulty, but it’s hard to deny that we would be better off with it. To prove it, let’s [00:05:00] explore a few different use cases through the lens of batch and stream processing.
[00:05:08] Spectrum of Use Cases #
Arjun Narayan: All right, we’re good to go. On one end of the spectrum, let’s start with batch. There are a few use cases where batch might always be good enough. Batch is probably good enough to calculate payroll every two weeks or crunch numbers for quarterly board reports, but we quickly exhaust the set of use cases that are perfectly served by batch processing.
On the other end of the spectrum, there are use cases where streaming is a hard business requirement where you cannot wait for the next batch of calculations to run, to get some results. Think of financial use cases or fraud detection use cases. Companies do not make trades on hours old information or detect fraud hours after it’s happened.
They need to do that when it happens. Those use cases have typically already been serviced by expensive, specialized streaming teams building and maintaining custom [00:06:00] microservices; that hasn’t always been accessible to analytics engineers. Now, between these two extremes, there is a vague middle ground.
These are use cases where your business and your customers would really be much better off with correct real-time data, but where we’ve been using batch because streaming hasn’t always been easy. Imagine you want to run an abandoned cart reactivation campaign. My guess is you’re sending those out a few hours after the customer has abandoned the cart.
Wouldn’t it be better to send that email precisely when it’s the right time to do so, perhaps even immediately? Imagine you have a business-critical dashboard that’s only updated once every few hours. Wouldn’t it be better if the metrics your team uses to organize their day’s workflow were always up to date?
When we take a closer look, there are very few cases where batch processing is the perfect way to process the data forever into the future. Most of our use cases actually fall into this middle ground, where they would be better served [00:07:00] by correct, up-to-date streaming data but are currently served by stale batch data, because that’s what’s easy to use and that’s what we have available.
If our use cases would benefit from streaming data, why are we all still stuck on batch? At Materialize, we believe that many of us have been stuck with batch because existing streaming systems simply have not been good enough to warrant making the switch. In fact, we believe that if existing streaming systems were good enough, we’d all be streaming by now.
They’ve been too complicated, too expensive, and too restrictive. So what have they gotten wrong? I think the answer is pretty simple. We think the biggest failure of existing streaming systems is that they just don’t support SQL. Some streaming systems don’t support any SQL at all and require you to write custom microservice code. Other systems, even the ones that do say they support SQL, only support some small subset of the standard. And they definitely don’t support the really good stuff that you get in batch warehouses, like multi-way joins and subqueries, stuff that [00:08:00] really is crucial to modeling the business logic that is important to your specific process. And that’s a huge problem. The first issue is that reduced or non-existent SQL support makes it incredibly difficult to move from batch to streaming.
You probably have years of carefully crafted batch queries written in SQL, and you would have to dedicate a lot of engineering time to rebuild and refactor pipelines that frankly already work. And those engineers are going to have to understand the nuanced semantics of your new streaming system. For most companies and most streaming products,
[00:08:35] Existing Streaming Systems Don’t Support SQL #
Arjun Narayan: This means hiring an entirely new team and forming entirely new processes, just to move to streaming. So even though most of your use cases could benefit from moving to streaming, you’re probably going to have a hard time actually getting that off the ground, given the amount of effort required. Because existing streaming systems don’t support SQL, this situation is just too common.[00:09:00]
And finally, because existing streaming systems don’t support SQL, they exclude a very large group of folks who would benefit from them: the analytics engineering community. Analytics engineers are folks who apply software engineering best practices to data and analytics codebases.
They are folks who are expected to own their organization’s entire data stack, from data integration to the data warehouse, to the SQL logic that lives in the warehouse, to the final reports and outputs that get produced. The primary language of analytics engineers is SQL, and SQL alone has been good enough to own the entire data stack for an organization, except for one thing: streaming.
I expect many of you in the audience today identify as analytics engineers. You are responsible for, and completely trusted to own, your organization’s data stack. And my question to you really is: why should you be locked out of the power of streaming? So if you’re on board and you want to stream with just SQL, we should probably [00:10:00] start by defining what we mean.
[00:10:04] Streaming with Materialize #
Arjun Narayan: When I say streaming, I don’t mean micro-batch. I don’t mean lambda views. I don’t mean anything that would result in more than a second of latency. I mean true, real-time streaming that you can use with the SQL that you have. And today I’m excited to tell you that this streaming system exists. Materialize is a streaming database for real-time applications and analytics.
And with Materialize, you can stream using just SQL. Now you’re probably wondering: how does it work? It starts when you tell Materialize about your upstream data sources. This could be a Kafka topic, a Postgres table, or a Kinesis stream. All you need to do is execute a SQL statement that points Materialize to your upstream data source.
And then you’re set. You’re already streaming. Materialize ingests all of your existing data and then continues to ingest new and updated data as it flows from your upstream data sources. So for instance, as new records arrive in your Kafka topic, or [00:11:00] as new rows get added, updated, or deleted in your Postgres table, Materialize will automatically ingest those changes.
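As a rough sketch of what that one statement might look like, here is an illustrative Kafka source definition (broker address, topic name, and schema registry URL are placeholders, and the exact syntax varies by Materialize version; consult the Materialize docs):

```sql
-- Point Materialize at an upstream Kafka topic (illustrative values).
CREATE SOURCE add_to_cart_events
FROM KAFKA BROKER 'kafka:9092' TOPIC 'add_to_cart'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE UPSERT;
```

After this statement runs, Materialize continuously consumes the topic; no scheduler or separate ingestion job is involved.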
Once your streaming data is available in Materialize, you can transform it in real time by creating materialized views. That’s the whole point of the name. Because Materialize is a SQL database, you just run SQL statements to create your views, just like you would in any other data warehouse. By creating materialized views, you are transforming your streaming data in real time using just SQL.
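For example, assuming a source named `orders` has already been created (the table and column names here are illustrative, not from the talk), a real-time transformation is just an ordinary view definition:

```sql
-- A view Materialize keeps up to date as new orders stream in.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, SUM(amount) AS total_revenue
FROM orders
GROUP BY region;
```

The select statement is the same SQL you would run in a batch warehouse; the difference is that Materialize maintains its result incrementally.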
Remember earlier when I said that with streaming data, it’s your database’s responsibility to keep your results up to date? That’s what Materialize does. As Materialize ingests new rows from your upstream sources, it keeps the results of your materialized views up to date.
You don’t refresh the materialized views. You don’t run dbt run to run a transformation again. Materialize automatically calculates the results of your views as the new data arrives, without any human intervention. [00:12:00] Once your streaming data is being ingested by Materialize, and once you’ve transformed that data, you can interact with it in a few different ways. First, because Materialize is PostgreSQL-compatible, you can query it using just SQL.
You can even use your familiar SQL shell. There are a few other ways that you can interact with your transformed data, which I believe will be particularly interesting to the analytics engineers in the audience. First, you can use Materialize to power real-time business intelligence dashboards.
This means that each time a user refreshes your dashboard, they will always see consistent, up-to-date results. Second, you can use the results of your queries to power real-time applications, such as e-commerce notifications, fraud detection in financial data, or any of these other middle-ground use cases that we explored earlier. You can do all of this either by querying your database (it’s SQL, after all, and the rest of your systems most likely do speak that), or by sending it back out to another streaming system, like Kafka, to be [00:13:00] part of the rest of your streaming infrastructure. Now, at this point, you might be wondering: why do I need to use Materialize? Can’t I just create a materialized view in the data warehouse that I’m using already?
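To sketch both interaction modes, assume a materialized view named `revenue_by_region` exists (all names and the sink syntax are illustrative and version-dependent; see the Materialize docs for exact forms):

```sql
-- 1. Query the view directly, as you would any Postgres table:
SELECT * FROM revenue_by_region;

-- 2. Or stream its changes back out to Kafka via a sink:
CREATE SINK revenue_updates
FROM revenue_by_region
INTO KAFKA BROKER 'kafka:9092' TOPIC 'revenue-updates'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';
```

The first mode serves dashboards and ad hoc queries; the second pushes updates to downstream services without polling.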
The answer is that Materialize was purpose-built to handle streaming data, and so our materialized views are a little different under the hood from what you may be used to. In fact, our materialized views are a superpower. Let’s talk about them. Each time you query a database, you incur some cost.
This can be measured in the wall-clock time to get your results back, the actual CPU time used, and the actual dollars that you’re billed. Now, if you want to repeatedly ask some question, like how much revenue your sales team has generated, you wouldn’t want to pay the same price each time you ask that question.
Materialized views reduce all types of query costs by maintaining the results of select statements. There’s only one catch: the materialized views need to be updated when the underlying data changes. In traditional [00:14:00] databases, particularly with more complex views, this typically requires a full refresh, meaning that the same select query, with the same query costs, will be executed under the hood.
This is where Materialize is very different. When you create a materialized view in Materialize, Materialize is responsible for maintaining correct, up-to-date results for you without any human intervention. Materialize always does an incremental refresh and maintains the result. It does this by using a modern, incremental stream processor as its underlying execution engine.
So when you create a materialized view, what Materialize is doing is compiling that SQL down into a running dataflow. For those of you who aren’t familiar, timely dataflow is a battle-tested, next-generation stream processor. It’s a lot to cover in a single talk, and I encourage you to read more about it on our website if you’re interested, but the real magic of Materialize is that you don’t actually need to know any of that.
You just need to run SQL, and then you’re set: you’ve created an incrementally maintained [00:15:00] materialized view that automatically updates in milliseconds as new data flows in upstream. Unlike traditional materialized views, Materialize’s views ensure that the results of your select statement are always up to date. The dataflow underpinning your materialized view updates with each single row.
[00:15:17] dbt + Materialize #
Arjun Narayan: Each time you query it, you get correct, up-to-date results. And finally, because Materialize exposes a SQL interface to its users, we were able to build an adapter to connect your existing dbt projects to Materialize. Download our dbt-materialize adapter, point your dbt project at Materialize, and you’re ready to transform your streaming data in real time using the SQL you already have. What does this look like in practice? Because Materialize is a SQL database, you define your models just like you would with any other underlying batch data warehouse, using a select statement. But when you choose your model’s materialization type, you indicate a materialized view.
And that creates an incrementally maintained [00:16:00] materialized view in Materialize. And because it’s a SQL database, you can simply reuse your existing, working batch queries to transform your data in real time. And the best news, which I’ve saved for last: because we are responsible for maintaining correct results to your queries for you, you will only ever need to dbt run your models once. Never again. I want to repeat this to let it sink in. Using dbt and Materialize, you can dbt run your models once and never again. You don’t have to figure out a schedule. You don’t have to set up triggers. The database is responsible for transforming the data in real time and keeping the results up to date.
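In a dbt project, such a model might look like the sketch below. The model and column names are illustrative, and the materialization name shown reflects the dbt-materialize adapter at the time of this talk; check the adapter docs for the exact configuration in your version:

```sql
-- models/revenue_by_region.sql
{{ config(materialized='materializedview') }}

SELECT region, SUM(amount) AS total_revenue
FROM {{ ref('stg_orders') }}
GROUP BY region
```

A single `dbt run` creates the incrementally maintained view in Materialize; subsequent runs are only needed when the model’s SQL itself changes.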
Every data analyst listening to this talk could get started streaming today with one easy step using Materialize. I want to take another moment to highlight that every single person in this audience who knows SQL is capable of bringing streaming data to their organization today. I hope I’ve been able to convince you.
But in case you’re still not convinced, I brought some backup. I’m going to hand the mic over to Emily Hawkins, data [00:17:00] platform lead at Drizly. Earlier this year, Emily led the effort of moving Drizly’s abandoned cart notifications from a batch process to a streaming one using Materialize. Over to you, Emily.
[00:17:09] Using Materialize in Drizly #
Emily Hawkins: Thanks, Arjun. I’m super excited to be here and share how Drizly is using Materialize and dbt. Before we get into that, I want to quickly talk about Drizly. For those who haven’t heard of Drizly, we are the largest online marketplace for alcohol, currently in the US and Canada. We have thousands of retailers on our network.
And if you haven’t heard, we were recently acquired by Uber this year, and the deal closed just a couple of months ago. I also have a little special promo code for everyone to use: Drizly dbt. Feel free to use that to get some Coalesce happy hour drinks going. There’s the [00:18:00] cocktail session later tonight, so feel free to use it for that.
Cool. So to start off with how we were originally dealing with abandoned carts: we were doing that in our normal batch dbt process in Snowflake. First, I want to talk a little bit about what an abandoned cart is. Someone comes to Drizly, they’re shopping for some items, and they add them to their cart.

But maybe they get distracted and forget to check out, so they have abandoned their cart. I’m sure you get these notifications from a bunch of different online retailers. But our process before Materialize was done on a 24-hour cadence: our dbt project in Snowflake runs once a day.
So in that case, we wouldn’t know about an abandoned cart until the following day, [00:19:00] and that’s when the notifications would get sent out. And just with the nature of Drizly’s business, by the next day people probably weren’t in need of those items they added to their cart the previous day.
And some people might be asking: why didn’t you just run your dbt models more often? That’s a valid question. For one, our project really just isn’t set up for that; it’s set up for a 24-hour cadence, so that would require a big effort. These are all clickstream, incrementally built models.
And we also just didn’t want to force Snowflake to do something that it’s not meant to do. We really wanted to build a strong streaming platform, and that’s what Materialize helped us do. So with Materialize, the process is [00:20:00] pretty much the same. We have dbt managing all of our SQL, and as Arjun mentioned, we have materialized views that are updated incrementally as new data flows in.
And now we’re able to send out notifications on a 30-minute cadence, and this is just a time range that we decided on as a team. So someone adds something to their cart; if there’s no corresponding checkout event within 30 minutes, we’re going to send them that abandoned cart notification nudging them to finish their purchase.
So as Arjun mentioned, we really got to use SQL the same way we already were in our existing dbt project and Snowflake. At Drizly, we’ve been using dbt for a couple of years now, so we already have that strong dbt skill set on the team. It was really great that we were able to continue [00:21:00] managing our transformations in that way.
We were able to keep that SQL logic within the data team. As Arjun mentioned, again, we didn’t have to rebuild a bunch of our logic with engineering time. And going forward, as we build more things out in Materialize, we’re really excited that the data team can continue to own that transformation logic, the same way that we own all of our dbt models in Snowflake.
So touching on what our architecture looks like in Materialize. We have our data, and in this case we can keep with the abandoned cart example. So we have a Kafka topic for add-to-cart events and a Kafka topic for checkout events, and those are flowing into Materialize into source objects.
And [00:22:00] again, all of these things in the box are managed within dbt, so we get to use all the same great dbt functionality that you know and love: macros, post-hooks, run-operations, all of that stuff we can continue to leverage and take advantage of.
Then we have our materialized views. Again, we can keep the same best practices that we have for our batch process. So we have our models that read from the sources, our staging models; this is where we can read in the source data, extract the JSON, and do any sort of casting and renaming, making the data super usable.
And then we can do all of the fancy joins and logic that we need to do to be able to say, yes, this user has an abandoned cart. Once we have that, we can then send the data out using a sink object. This is where we can send the data back out [00:23:00] of Materialize into another Kafka topic, which then can be read into our outbound service.
So in this case, our CRM tool where then the notifications can be triggered and sent out to our customers.
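As a conceptual sketch of the pipeline Emily describes, assuming the staging models above (all object, table, and column names here are assumptions, not Drizly’s actual schema, and the time-based filter is simplified; Materialize’s temporal filters use version-specific functions such as `mz_now()`):

```sql
-- Carts with an add-to-cart event but no checkout within 30 minutes.
CREATE MATERIALIZED VIEW abandoned_carts AS
SELECT a.user_id, a.cart_id, a.added_at
FROM stg_add_to_cart a
LEFT JOIN stg_checkout c ON a.cart_id = c.cart_id
WHERE c.cart_id IS NULL
  AND mz_now() >= a.added_at + INTERVAL '30 minutes';

-- Send qualifying rows back out to Kafka for the CRM tool to consume.
CREATE SINK abandoned_cart_notifications
FROM abandoned_carts
INTO KAFKA BROKER 'kafka:9092' TOPIC 'abandoned-carts'
FORMAT JSON;
```

The view’s membership changes on its own as events arrive and as time passes, which is what lets the notification flow run with no scheduler at all.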
So to wrap it up for me: again, we got to keep using SQL for all of our data transformation. We got to keep using dbt, which was super amazing for us; we love dbt, obviously. And we were able to do the whole data transformation process in dbt and Materialize without needing a whole bunch of engineering work and custom code. So I’ll hand it back to Arjun to wrap this up.
Arjun Narayan: Thank you so much, Emily. I was super excited that Emily agreed to present her use case in this talk because it’s a use case I’m particularly excited about. She was able to quickly transform a batch process to a streaming one without any loss of [00:24:00] ease or functionality.
It’s a great example of the types of production level use cases you can build with just a bit of engineering effort using Materialize.
And now I believe we have many questions from the audience, so Amy, I would love to get into those.
[00:24:18] Q&A #
Amy Chen: Of course. Yeah. Let me grab the first one. I think this is a great one. Where are the workshops run? What? Oh, wait, no. Someone already answered that. Your Materialize team is really on point today.
Someone else asks: I feel that the workflow being applied here is blurring the lines between what should be in the engineering realm versus data. How do you decide which flows/notifications, like cart abandonment, should be tackled by data versus engineering?
Arjun Narayan: I do think that one problem in current organizational structures is this complete silo between data and engineering. Part of [00:25:00] what’s nice about bringing dbt to the streaming use cases that have traditionally been far away from the analytics side is that people can start speaking a shared language again, right?
What’s nice about this is people are defining the actual underlying metrics using the same common models, and that means that engineering and analytics are back in sync again. I think it has to move to a more collaborative space, but a large part of that is bringing everybody back to SQL. And I think you’ll see a lot of love for SQL on the production side.
Engineers use OLTP databases all the time in production. This doesn’t necessarily have to be something that is split into two communities; the communities can collaborate on building production workflows.
Amy Chen: Emily, did you want to add anything or no?
Emily Hawkins: No, I agree with all that. And we, as the data team, are the most familiar with our data. So for us, it makes sense for us to own that transformation logic versus having to explain it [00:26:00] to another team, have them implement it.
So it was nice for us to be able to own that.
Amy Chen: Awesome. The last question has a lot of pluses so let’s go with that one. Does this mean you have two data warehouses, a Snowflake and Materialize and subsequently two dbt projects?
Arjun Narayan: Emily, I’ll give that to you.
Emily Hawkins: Yes. Yeah, we have Snowflake, our dbt project for Snowflake, and then we have a separate streaming dbt project on Materialize.
Amy Chen: Cool. Does anyone feel up for one more question? Since we have two minutes? Cool. Since Josh was the creator of the initial Materialize adapter, I’ll take his question. What do deploys look like for the streaming service with dbt plus Materialize? Is it a blue-green thing?
Arjun Narayan: Emily, I’ll throw that to you since you are the production user.
Emily Hawkins: So this is very early on; this just went into production, so we don’t [00:27:00] have anything too fancy yet. But as Arjun mentioned, you only need to dbt run once, so really the only time we’d have to run something again is if we changed something. So far we haven’t had to do that yet. I guess once we get into a more mature state, we’ll definitely want to set up some CI/CD and make sure that nothing’s going to break after we merge a change.
Arjun Narayan: I can speak for some of our customers as well, who are running a blue-green set up where they essentially hydrate a new Materialize instance, switch over the production traffic, and then decommission the old instance. It is in our roadmap to make that one click seamless in our cloud product early next year.
Last modified on: Apr 19, 2022