dbt + Materialize: Streaming to a dbt project near you
Materialize’s new dbt adapter helps analysts create streaming SQL pipelines.
Streaming data hasn’t always been accessible to analytics engineers. In most data stacks, it looks like either data engineers working in Scala or data scientists working across several notebooks.
Well, take heart, AE friends, the future of analyst-accessible streaming is now (or soon, as it were):
Together, we’re opening the door for analysts to become first-class creators and users of streaming analytics. This integration will push the boundaries of the analytics workflow and improve streaming capabilities across our shared ecosystem.
Or, said more succinctly: “You’re bringing streaming analytics to the people!” That was Guillermo Sánchez Dionis, Analytics Engineer at GoDataDriven, after testing the plugin.
What’s wrong with how real-time analytics is handled today?
Batch-based tooling for real-time use cases is hard to maintain at scale.
Ok wait, before we start, let’s make sure we’re all on the same page: batch-based processes are fine for most analytics production use cases (fight me).
But while real-time use cases are less common, they often carry higher stakes. Think of anomaly/risk detection or even inventory management. Unfortunately, we don’t have great options to address these needs today.
And trust me, I’ve tried. In this article I describe implementing lambda views to achieve something close to real-time analysis. That works for some use cases but can be limiting in others, especially at higher data volumes, because there’s an upper bound on how much we can optimize SQL before performance starts to suffer.
I hit my limit at around 1TB in that scenario, and I’m sure you can imagine how useful a real-time dashboard reading a 1TB view would be.
As data volumes scale up, we need tooling built for streaming.
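If the lambda view idea is new to you, here is a rough sketch of the pattern, using Python's built-in sqlite3 module as a stand-in warehouse. The table names, columns, and the cents-to-dollars transformation are all made up for illustration; this is not the code from the article mentioned above. The view unions rows that a scheduled batch run has already transformed with raw rows that arrived since, transformed on the fly.

```python
import sqlite3


def create_schema(conn):
    """Create a raw table, a batch-built table, and a lambda view over both."""
    conn.executescript(
        """
        CREATE TABLE raw_events (
            id INTEGER PRIMARY KEY,
            amount_cents INTEGER NOT NULL,
            loaded_at INTEGER NOT NULL
        );

        -- Rows already transformed by the last scheduled batch run.
        CREATE TABLE events_historical (
            id INTEGER PRIMARY KEY,
            amount_dollars REAL NOT NULL,
            loaded_at INTEGER NOT NULL
        );

        -- Lambda view: stored historical rows, plus raw rows the batch
        -- has not seen yet, transformed at query time.
        CREATE VIEW events_lambda AS
            SELECT id, amount_dollars, loaded_at FROM events_historical
            UNION ALL
            SELECT id, amount_cents / 100.0 AS amount_dollars, loaded_at
            FROM raw_events
            WHERE loaded_at > (
                SELECT COALESCE(MAX(loaded_at), -1) FROM events_historical
            );
        """
    )


def run_batch(conn):
    """Simulate a scheduled run: transform and store rows newer than the last run."""
    (high_water_mark,) = conn.execute(
        "SELECT COALESCE(MAX(loaded_at), -1) FROM events_historical"
    ).fetchone()
    conn.execute(
        """
        INSERT INTO events_historical (id, amount_dollars, loaded_at)
        SELECT id, amount_cents / 100.0, loaded_at
        FROM raw_events
        WHERE loaded_at > ?
        """,
        (high_water_mark,),
    )
    conn.commit()


def demo():
    """Return (rows in the historical table, rows visible through the lambda view)."""
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.executemany(
        "INSERT INTO raw_events (id, amount_cents, loaded_at) VALUES (?, ?, ?)",
        [(1, 1250, 100), (2, 300, 110)],
    )
    run_batch(conn)
    # A new event lands after the batch run has finished.
    conn.execute("INSERT INTO raw_events VALUES (3, 999, 120)")
    historical = conn.execute("SELECT COUNT(*) FROM events_historical").fetchone()[0]
    combined = conn.execute("SELECT COUNT(*) FROM events_lambda").fetchone()[0]
    conn.close()
    return historical, combined
```

The catch is visible in the view definition: every read re-runs the transformation over whatever has not been batched yet, so the cost of "fresh" grows with the volume of unbatched data.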
Current streaming tooling is inaccessible to analysts
How do analysts work with today’s tools for streaming data?
Well, they don’t. Analysts aren’t usually well-practiced in setting up Kafka, jumping into Kafka Streams, and writing transformations in Scala or Java, nor should they be.
Instead, they lean on data engineers to set up pipelines and manage transformations. As a result, today’s analysts can’t really contribute to streaming data transformations where we know they could provide immense value.
Making streaming analytics more accessible to anyone who knows SQL can reduce data team silos, increase speed of analysis, and improve data quality.
dbt + Materialize
Materialize has just given dbt users the keys to the candy shop. By removing the barrier to streaming data modeling, Materialize’s dbt adapter enables users to own real-time transformation workflows, exactly like they’d own batch-based alternatives.
But depending on whether you’re coming from dbt or Materialize, you might have different questions about how it works.
Materialize + dbt for dbt users
Materialize creates materialized views that are built for streaming. Unlike the materialized views in more traditional data warehouses, these are not cached results; they exist as queries whose results are continually updated.
If you’ve previously used batch-based processes with dbt, it might blow your mind a little to learn that Materialize’s plugin enables you to dbt run just once. 🤯
In the above example, you can see a materialized view that dbt creates on Materialize, updating without manual input.
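To make the difference between "rerun the batch" and "continually updated" a bit more concrete, here is a toy Python sketch. It is only a mental model under simplified assumptions, not a description of how Materialize works internally; the function and class names are hypothetical. The batch function rescans every row whenever you want a fresh answer, while the maintained view keeps its result around and adjusts it as each change arrives.

```python
from collections import defaultdict


def batch_totals(orders):
    """Batch style: rescan every (customer, amount) row each time a result is needed."""
    totals = defaultdict(int)
    for customer, amount in orders:
        totals[customer] += amount
    return dict(totals)


class MaintainedTotals:
    """Streaming style: keep the result and apply each change as it arrives."""

    def __init__(self):
        self._totals = defaultdict(int)
        self._row_counts = defaultdict(int)

    def apply(self, customer, amount, diff=1):
        """Apply one change: diff=+1 inserts a row, diff=-1 retracts one."""
        self._totals[customer] += diff * amount
        self._row_counts[customer] += diff
        # Drop customers with no remaining rows so reads match the batch result.
        if self._row_counts[customer] == 0:
            del self._totals[customer]
            del self._row_counts[customer]

    def read(self):
        """Return the current result without touching the input rows again."""
        return dict(self._totals)


if __name__ == "__main__":
    orders = [("ada", 10), ("bob", 5), ("ada", 7)]
    view = MaintainedTotals()
    for customer, amount in orders:
        view.apply(customer, amount)
    print(view.read() == batch_totals(orders))
```

In this toy version, defining the view once and feeding it changes plays the role of running dbt just once: the work per update depends on the size of the change, not on the size of the history.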
Materialize + dbt for Materialize users
If you’re already on Materialize, you might need a bit of help understanding dbt. I like to think I have a pretty good idea, but Andy, Head of Community at Materialize, said it best:
“dbt is the best way to manage your Materialize deployment. With dbt, you can optimize your workflow with version control, macros, testing, and documentation.”
Josh Arenburg from Datalot seems to agree. As one of the first to use dbt + Materialize in production, Josh shared with me how he scaled his Materialize pipelines by using macros to parameterize and deploy views. Rather than writing out a dozen CREATE SOURCE statements separately, he used a Jinja loop in dbt to generate them from a single template.
His data engineers and data analysts now collaborate in Git across multiple data warehouses for a much more streamlined (and documented, and tested) transformation workflow.
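Josh's actual macros aren't reproduced here, but the loop idea can be pictured with a small Python analogue. In a dbt project this would be a Jinja for loop inside a macro or model; the sketch below uses Python's string.Template instead. The topic names and the src_ prefix are hypothetical, and the statement body is a placeholder rather than real Materialize syntax.

```python
from string import Template

# Placeholder statement text: NOT real Materialize syntax. The connection
# details that a real CREATE SOURCE statement needs are deliberately left out.
SOURCE_TEMPLATE = Template(
    "CREATE SOURCE $name /* connection details for topic '$topic' go here */;"
)


def render_source_statements(topics, template=SOURCE_TEMPLATE):
    """Render one statement per topic from a single template."""
    statements = []
    for topic in topics:
        # Derive an identifier-friendly name from the topic (hypothetical convention).
        name = "src_" + topic.replace("-", "_").replace(".", "_")
        statements.append(template.substitute(name=name, topic=topic))
    return statements


if __name__ == "__main__":
    # Hypothetical topic names, for illustration only.
    for statement in render_source_statements(["orders", "page-views", "billing.invoices"]):
        print(statement)
```

The point is less the string formatting than the maintenance story: one template under version control, and adding a source means adding one entry to a list.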
The future of streaming analytics
Where will the new super duo of streaming analytics and analytics engineering take us? I’ve had a few conversations about what’s possible with Andy Hattemer, Head of Community at Materialize, and Drew Banin, Co-Founder and Chief Product Officer at Fishtown Analytics. Here are a few of the use cases they’re most excited about:
- Real-time dashboards, both internal and user-facing: Empowering data analysts to own transformation processes that power real-time insights for internal and external customers.
- Operational use cases: Batch-based processes work well for updating email lists, but real-time use cases like spam detection require a streaming approach.
With dbt + Materialize, these use cases are now accessible to anyone who knows SQL. But they’re also just the beginning!
There are so many questions not being asked in today’s batch paradigm because they can’t be answered with it. We hope that by making streaming workflows more accessible to every member of the data team, we’ll see answers to questions we didn’t even know we had.
Getting started with dbt + Materialize
If this post got you thinking about what streaming analytics might look like at your organization, I’d love for you to join the beta! Or if you’re looking for a little more guidance, join ~200 of your closest friends in the #db-materialize Slack channel. Ask questions, get advice, and connect with others preparing to make the same transition.
Special thanks to Josh Wills (WeaveGrid), who created the first Materialize dbt adapter, and Jessica Laughlin (Materialize), who shaped the adapter into the state it is today.
Last modified on: Nov 29, 2023