You don’t need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)

Reynold oversees technical contributions to Apache® Spark™ at Databricks, initiating efforts such as DataFrames and Project Tungsten.

Originally presented on 2021-12-06

Reynold Xin is a technical co-founder and Chief Architect at Databricks. He’s also a co-creator and the top contributor to the Apache Spark project.

In this casual conversation with Drew Banin, co-founder and Chief Product Officer at dbt Labs, the two will be discussing the data infrastructure trends they find most interesting.

There’s a lot happening in the data world right now. Where does the lakehouse go from here? Have we gone too far with database specialization? Will Drew ever be invited to lecture on databases at Carnegie Mellon again? Join us as Reynold and Drew share their true feelings on these topics and more.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Barr Yaron: [00:00:00] Hi everyone. And thank you for joining us at Coalesce. I’m having an amazing time so far. My name is Barr. I work on product here at dbt Labs and I’m thrilled to kick off You Don’t Need Another Database: A Conversation with Reynold Xin and Drew Banin. There’s a lot happening in the data world right now.

That’s an understatement, and one of the reasons we’re all here today. In this session, we’re lucky to pick the brains of Drew, co-founder and chief product officer of dbt Labs and Reynold, co-founder and chief architect at Databricks. We will learn about emergent trends in the data space, the future of data lakes, and what might just be hype.

First, some housekeeping, all chat conversation is taking place in the Coalesce Databricks channel on dbt Slack. If you’re not part of the chat, you have time to join right now. Search for Coalesce Databricks when you enter the dbt community Slack channel.[00:01:00]

Drew, whom you’ll be hearing from today, makes things happen. He is fluent in English, very fluent in YAML, and pretty good at French, I’ve heard. I work closely with and for Drew, and he wears many hats. He has been the number one open source maintainer of dbt and a community champion, and he knows the ins and outs of our cloud product better than anyone, like the back of his hand.

Every person who works with him, myself included, learns so much from him. And one of those lessons is to constantly push the envelope and the vision of what’s possible in data, and to do that in service of empowering analysts. You’ll hear about that today, and you can hear even more about how he thinks about the world in his keynote on Wednesday. Reynold, I have not had the chance to work with, but his reputation precedes him.

He’s the Databricks co-founder as well as the co-creator and number one contributor to the Apache Spark project, the most popular [00:02:00] and de facto framework for big data processing. He began working on Spark back when he was a PhD student at UC Berkeley. At Databricks he’s led the development of Spark and has been behind many game-changing initiatives, including the new Databricks SQL offering.

Maybe we’ll hear about that. After the session, both of them will be available on the Slack channel to answer all of your questions. However, this is not like an in-person event where you’re supposed to stay quiet during the talk. We encourage you to ask questions, make comments, and react at any point in the Slack channel, if you are so moved.

Let’s get started and I’ll pass it over to you, Drew and Reynold.

Drew Banin: Hey, thanks so much, Barr, that was very kind. And Reynold, thank you so much for being here with us today. Really appreciate you. And maybe hit the unmute.

Reynold Xin: Yeah. Really happy to be here.

Drew Banin: So Barr gave a great overview, but I’m wondering, just because Spark has been so influential and you’ve played such a big role in it, can you tell us a little bit [00:03:00] about your background working on Spark, how that started, and what shape it’s taken over the years?

Reynold Xin: Yeah. So this is actually a fairly interesting story that a lot of people don’t know about Spark itself. Most people know it started at UC Berkeley as a research project and over the years it’s taken over the world, but initially it had a very humble beginning.

It had to do with one million dollars, which is the Netflix Prize. So back in 2009, 2010, Netflix had this Netflix Prize competition. The idea is that it’s incredibly important for Netflix to understand what movies people would like watching, so they can procure those movies or produce those movies, like Stranger Things, which I think Amy mentioned on the Slack channel.

So they did this open competition by anonymizing all the movie rating datasets, and then anybody in the world could participate. You could get a handle on the dataset and come up with a better machine learning model than their baseline. Whoever came up with the best model would win. Lester was one of the students at the [00:04:00] UC Berkeley AMPLab and he was a very competitive person.

And plus, as a PhD student back in Berkeley, he was making $2,000 a month, and the award for the competition was actually a million dollars. So Lester really wanted to compete, but he ran into a big problem, which is that it was a pretty sizeable dataset. He didn’t have the tools to transform it and he didn’t have the tools to express the machine learning algorithms on it.

They wouldn’t work in MATLAB or in a single-node Python program. So he talked to Matei, who at the time was a systems PhD student, and said, "Hey, why don’t you help me build some tool, and we can split the money?" That’s actually how Spark was born.

The first version of it was dedicated specifically to this task of, "Hey, give it a bunch of data, get it into a shape that’s ready for building machine learning models. And then how do we build machine learning models quickly out of it?"

It was only 600 lines of code, very simple. And then obviously it grew as more and more people realized, "Hey, it seems like the ability to handle larger amounts of data quickly and [00:05:00] being able to iterate quickly is pretty important. It’s not just something for maybe a PhD student trying to work on the Netflix Prize." And then everything just happened from there.

I joined the project about a year later. When I got to Berkeley at first, I actually didn’t work on this side of the house. Over the years I specialized more in databases and data warehousing, and I brought a lot of the classic database ideas and implementations into the Spark ecosystem.

It started with very strong roots in, like, how do we do massive parallelization to handle machine learning data? And then over the years I brought in a lot of how do we just process data in general, leveraging three or four decades of database knowledge in academia.

Drew Banin: That’s an incredible story. What was the state of the art back in 2009, 2010? If, say, in a parallel universe, Spark wasn’t invented, how would folks do it back then?

Reynold Xin: Yeah, so I think it was incredibly difficult. There were two camps. There was the data warehouse camp, where [00:06:00] you could dump all the data in and then run dashboards.

Even back in 2009, 2010, you could probably run dashboards pretty well for business data. And then there’s the Hadoop camp, which is the big data camp. And the big data camp had one incredible improvement over the data warehouse camp, which is you could dump anything you want into it, and it’s super cheap.

But it was almost impossible to do anything. If you wanted to write a MapReduce program to just run a very simple query, you’d have to write like tens of thousands of lines of code. It’s impossible to do anything. And that’s where I think the world got stuck.
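
To make the point about simple queries concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases for an "average rating per movie" query. It is not Hadoop code and it skips everything a real cluster job would need (job configuration, serialization, distribution, fault tolerance); the `ratings` data and the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical ratings as (movie, stars) pairs; values are illustrative only.
ratings = [
    ("Movie A", 5),
    ("Movie B", 3),
    ("Movie A", 4),
    ("Movie C", 2),
    ("Movie B", 5),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for movie, stars in records:
        yield movie, stars


def shuffle_phase(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single average."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


average_by_movie = reduce_phase(shuffle_phase(map_phase(ratings)))
print(average_by_movie)
```

In SQL the same question is a single statement along the lines of `SELECT movie, AVG(stars) FROM ratings GROUP BY movie`, which is the gap in effort being described here.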

Drew Banin: Yeah. Was that like Pig back then? Was that the state of the art? Or does that come later?

Reynold Xin: Pig and Hive were still just getting out the door. Very few people use Hive anymore and nobody is using Pig anymore. There were a lot of tools invented that were trying to attack the low-hanging fruit, and over the years most of them became irrelevant. But yeah, definitely, it was a very different time from now.

Drew Banin: Okay. Spark has surely evolved an incredible amount since that [00:07:00] original Netflix challenge, but this concept…

Reynold Xin: Somebody might be curious what the hell happened to the Netflix challenge. He actually got the model roughly 10% better than the baseline model, which is a massive deal for any machine learning algorithm, and he tied for number one with somebody else, but his entry was submitted 20 minutes late. So he lost a million dollars. If Spark had been invented 20 minutes earlier, Lester would have been a million dollars richer.

Drew Banin: That’s pretty incredible. It just speaks to the importance of performance and interactive query response time, getting those results back. The fundamental thing that you touched on is the idea of unstructured or semi-structured data living in this data lake and being accessible for analytics.

I’ll just say, when we started building dbt it was on more traditional data warehouses, and it was probably 2018 or 2019 when we first started investigating integrating dbt with Spark. Back then, I think it was just vanilla Spark we started with. Today is a momentous day [00:08:00] because you and the folks at Databricks have something to announce, is that right?

[00:08:03] dbt-Databricks adapter #

Reynold Xin: Absolutely. I think the announcement actually went out very early this morning. There is the new dbt-Databricks adapter that dbt Labs and Databricks have actually worked on together. And it’s a big deal in our mind because it signals a very strong intention to collaborate here and to create, I think, one of the best ways to run dbt on Databricks.

And it’s not just an adapter. It’s not just, "Hey, we built another connector, and so what?" It also signals the beginning of us looking at, "Hey, what are the types of queries dbt is generating? How can the platform run those queries better once we have gotten those queries onto Databricks?"

As we were talking about just off stage, our team that works on the core engine is actually looking at the types of queries that are generated and optimizing for that specific pattern, and the incremental data load queries will actually run eight times faster in the next release of the Databricks platform. Customers don’t have to do anything; they just get automatically updated.[00:09:00]

Drew Banin: That kind of integration is really valuable, certainly to us and the folks that use dbt on Databricks. You mentioned the incremental approach. Is it the merge statement in particular that gets a speedup, or something else?

Reynold Xin: I have to look into the details of it, but I’m actually pulling up the query here right now. It’s a very long query that gets generated, about three U.S. letter pages long, with an MD5 expression in it. I personally don’t understand a lot of the queries that are generated.

But it’s actually the specific combination of expressions in it that I think could be optimized.

Drew Banin: Okay. That’s a really powerful thing for folks to…

Reynold Xin: I don’t think it’s for merge. It’s for just incrementally getting data, with maybe a marker of, "Hey, this is the latest data that’s come in, and more data can come in the future."
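
The pattern described here, loading only rows that arrived after a marker and tagging them with an MD5 expression, can be sketched in plain Python as follows. This is an illustration of the general idea only, not the SQL that dbt generates or the optimization Databricks made; the column names (`user_id`, `event_type`, `loaded_at`) and the helper functions are hypothetical.

```python
import hashlib


def row_key(row):
    """Build an MD5 surrogate key from the columns that identify a row."""
    raw = "|".join(str(row[col]) for col in ("user_id", "event_type", "loaded_at"))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def incremental_load(target, source):
    """Append only source rows newer than the latest row already in target.

    Dates are ISO-8601 strings, so plain string comparison orders them correctly.
    """
    high_water_mark = max((row["loaded_at"] for row in target), default="")
    new_rows = [row for row in source if row["loaded_at"] > high_water_mark]
    for row in new_rows:
        target.append({**row, "row_key": row_key(row)})
    return len(new_rows)


target_table = []
batch_1 = [
    {"user_id": 1, "event_type": "click", "loaded_at": "2021-12-01"},
    {"user_id": 2, "event_type": "view", "loaded_at": "2021-12-02"},
]
# The second run sees the old rows again plus one new row.
batch_2 = batch_1 + [
    {"user_id": 3, "event_type": "click", "loaded_at": "2021-12-03"},
]

first_run = incremental_load(target_table, batch_1)
second_run = incremental_load(target_table, batch_2)
print(first_run, second_run, len(target_table))
```

The second run touches only the one row past the marker, which is why this style of load is cheaper than rebuilding the whole table each time.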

Drew Banin: Got it. Okay. So it’s really cool to hear things like that, because back in 2019 I remember very well trying to integrate dbt with Spark, and there was a lot missing. We were trying to build the incremental [00:10:00] paradigm and we were working with vanilla Spark back then.

And it just was not a great experience to spin up clusters, and we were missing the data catalog that dbt needed to generate its queries. But a lot of that’s changed in the past couple of years, specifically around the work that you’re doing at Databricks. Can you give us a little bit of a survey of how things have changed inside of Databricks, from maybe who you were building for two or three years ago versus what kinds of use cases you’re thinking about today?

[00:10:26] Databricks use cases: today vs. two or three years ago #

Reynold Xin: Yeah, absolutely. So it actually, in many ways, mirrors the Spark journey. By the way, as you said earlier, dbt grew out of necessity, and it was very similar for Spark: there was a concrete problem that didn’t have good solutions. And so Matei and Lester invented Spark together through that.

And I think in your case the concrete problem probably was, "Hey, to use the new term, I’m an analytics engineer working with this modern data stack. I want to bring rigor to my data pipelines or data engineering jobs. I want to make sure things are correct. [00:11:00] I want to make sure I can test them. I want to make sure I can roll things out in a way that’s safe." Those are software engineering principles that software engineers deal with all the time, but once you get to data, it becomes very ad hoc. It’s not clear what the standard way is. And you guys created a way to do this, and it actually resonates with a lot of the data analysts that are just using SQL. I think that’s a big deal. So maybe before I go into Databricks… sorry, I’m pretty excited about the dbt stuff.

Drew Banin: So on that topic of necessity and dbt: we started the company five and a half years ago, almost as a consultancy. And the really funny thing was we worked in, like, sprints. There were no time-based contracts. It was all based on work delivered. And so the equivalent to the Netflix challenge for a million dollars was, for us, like a $3,000 statement of work to deliver marketing attribution in two weeks.

So different kinds of stakes at play there. But it was the same necessity: we wanted to do this work, we wanted to know that it was reliable, and we wanted to understand what had changed at what point in time, because we were working with different clients and different kinds of engagements.

And so very [00:12:00] much like the thing we built in dbt came out of that environment of wanting to make analytics more repeatable and like scrutable in a way.

Reynold Xin: I think I see a lot of great products and projects evolve this way, which is that some of the creators of the project had some concrete problems they had to solve.

And it turned out the concrete problem is shared by everybody in the world. In that regard, Spark, Databricks, and dbt share very common roots. All right, so now I’m coming to how the workloads were evolving on top of Databricks and Spark. Initially, I think a lot of the workloads on Databricks and the Apache Spark project primarily targeted…

I would say two… I’ll call them dimensions. One is that the users of the product or project tended to be fairly technical, in the sense that they were writing Scala code or Python code. A lot fewer of them were actually using vanilla SQL.

The other dimension is that there was a lot of focus, because of the roots of the project I was talking about (it was invented to build machine learning models), on data [00:13:00] science and machine learning. So a lot of the Databricks workloads initially were focused on those. But by the way, as part of data science and machine learning, obviously you have the whole job of data science and machine learning.

It’s not only about building models. A lot of it is about cleaning data, getting data into the right shape or the right form. So if you look under the hood, they are very similar to the type of data transformations you’d be doing in data warehouses, but the way you express them is very different. The actual job is very different.

And then over the course of, I would say, the past two or three years, we noticed… We always had some SQL support because, as I said, "Hey, the core operations under the hood are very similar. A join is a join. It doesn’t matter if you’re using Python or using SQL, you’re joining data."
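
As a small illustration of "a join is a join", the sketch below expresses the same inner join once in plain Python and once in SQL, using the standard library’s `sqlite3` module as a stand-in engine. It is not Spark or Databricks code, and the `users` and `orders` tables are hypothetical.

```python
import sqlite3

# Hypothetical tables: users as (id, name), orders as (id, user_id, amount).
users = [(1, "Ada"), (2, "Grace")]
orders = [(10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0)]

# The join in plain Python: build a lookup on the key, then match each order.
names_by_id = dict(users)
python_result = sorted(
    (names_by_id[user_id], amount)
    for _, user_id, amount in orders
    if user_id in names_by_id
)

# The same join in SQL, run against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER, user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO users VALUES (?, ?)", users)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
sql_result = conn.execute(
    "SELECT u.name, o.amount "
    "FROM orders AS o JOIN users AS u ON u.id = o.user_id "
    "ORDER BY u.name, o.amount"
).fetchall()
conn.close()

print(python_result == sql_result)
```

Both forms produce the same rows; what differs is who tends to write each one, which is the shift in users being described here.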

So we always supported SQL, but we didn’t invest massively in it. Then over the course of the last two or three years, we noticed, "Hey, our SQL usage is growing organically very fast, as a matter of fact faster than pretty [00:14:00] much any of the other programming languages." And then we started interviewing customers, trying to understand why and what’s going on.

And many of them told us a consistent message, which is, "Hey, I’m already building my data engineering pipelines and doing a lot of data science on this. A lot of data is produced by Databricks, and typically we ETL a subset of it that we consider more business-critical into a data warehouse. But then inevitably some analysts come back and say, 'Hey, I want access to that raw dimension of data,' or, 'I need the actual code that hasn’t been ETL’d yet.'"

So they’re faced with a dilemma. Do we disrupt our backlog, which could be six months or a year long, to create a new pipeline to ETL that? Or do we just give our customers, which is our customer’s customers, the analysts, direct access to Databricks to run SQL, even though we know SQL on Databricks didn’t work that well?

So many actually ended up slowly giving more access, and that’s how it organically happened. And then we started [00:15:00] looking at the problem a lot more and we felt, "Hey, if there’s already a clear desire here from our customers, despite us not pushing for it, maybe we should actually look into investing heavily."

So for the past two or three years, we’ve reoriented almost the entire company to look at how we can actually deliver a really good SQL experience on this massive amount of data in the data lakes that’s already produced by Databricks. And I would say we left no stone unturned.

So going from all the way at the bottom of the stack, like how do we launch compute in less than a second, to how do we rewrite our actual execution engine to be able to process data much faster for specific SQL types of workloads, all the way to the UI: how do we create a SQL editor so that when analysts come in, they feel like, "Oh, this is very similar to everything else I’ve used before." I’m happy to get into more details on any of that, but I don’t know if we have time.

[00:15:59] Reynold’s role behind the creation of DataFrames in Spark #

Drew Banin: Yeah. Well, let’s [00:16:00] go just one layer deeper. I understand that you were a driving force behind, like, the creation of DataFrames in Spark, is that right?

Reynold Xin: Yeah, I wrote most of the initial code and the design… I should be clear, Spark is a massive open source project with thousands of people contributing, so I can’t claim full credit, but I wrote the first maybe few hundred lines of code that defined the DataFrame API.

Drew Banin: Yeah. So as the originator of the DataFrame in Spark, when it’s time to do an analysis, do you reach for Scala or similar, or is it SQL these days?

Reynold Xin: I would do both, and I think it’s actually important to have both. The reason is there are very different personas, and there are different jobs to be done.

I think SQL is incredibly powerful. And probably if you just count by the number of individuals using it, SQL by far eclipses Python and Scala. But what we do see a lot, because we’re coming from a totally different perspective, is that a lot of the [00:17:00] workloads there don’t fit that well into traditional data warehouses. People sometimes try to shoehorn them in, but they don’t work that well. Like when you have something extremely complicated that would actually take maybe a couple of thousand lines of Python to express, which a lot of data cleaning workloads and domain-specific applications do.

It becomes incredibly awkward to express all that in SQL. So we think differently there. And by the way, many of those are some of the most compute-heavy workloads, and also some of the most important business workloads, because they move the needle so much. So I think having both is actually pretty important.

And we’re seeing the world now also converging towards that: even in the past few years, every traditional data warehouse is trying to bolt on some sort of additional programming language besides SQL.

Drew Banin: Sure. It’s an interesting kind of arc. You talked about how maybe earlier in the Spark lifecycle, you’d see folks do a lot of ETL work in Databricks, then maybe load it into a more traditional warehouse, and that was used for BI.

And you’re saying that [00:18:00] these days that’s changing. And so there are kind of two things to talk about. The first one I wanna talk about is that we’ve seen a big proliferation of different databases for different use cases. And I think the title of this session is You Don’t Need Another Database, right?

[00:18:12] What does the future look like: fewer or more databases? #

Drew Banin: So can you tell us a little bit about where you’re coming from in that arc and what you think the future looks like if it’s fewer or more databases fundamentally?

Reynold Xin: Yeah, I think the world will slowly go towards consolidation. And the reason is that for data workloads, at the end of the day, data is the common shared asset, and having multiple systems…

It is true that if you specialize for any particular workload, you might be able to make progress much sooner and be able to support that type of workload, especially in performance, much better. But in practice, there is an extremely high cost to having to run a lot of different systems, and an extremely high cost to maintain them.

I think one of our customers’ admins was recently interviewed, and they talked about this issue. It’s just, if I run two different systems [00:19:00] with the same copy of data, in practice they’re not the same, because access control might be different. They might not be perfectly synchronized.

And then I have different consumers of that data coming up with similar metrics, or the same definition of the metric, but with different results. And then sooner or later, the rest of the organization stops trusting the data because they don’t know which one is correct. And that is a huge issue I see.

So I definitely can see the appeal of, "I need, for example, a graph database," or, "I need a data warehouse specifically for serving BI workloads." But I think over time we’ll converge on fewer copies of the data, most likely one copy of the data, and then there are extremely niche cases that really nothing else can satisfy. They’re so important that I’m going to create some special thing on the side to support them.

Drew Banin: Yeah. So Martin and Tristan touched on this a little bit this morning in terms of the data space and its analogs [00:20:00] with sort of the application development space. And if you’re building an application, it’s reasonable to have Redis and PostgreSQL and maybe some sort of like key value database, but I guess I’m wondering, how to translate that into the data world. Does it look more like centralized compute with different types of like data applications on top of it? Or like, how do you think about that convergence?

Reynold Xin: I think the most likely way, and we are seeing it actually happening slowly in a lot of the larger enterprises.

But I think a lot of it’s actually driven by maybe the FAANG companies early on. In terms of tech, it’s not true that you can just take whatever Google built internally and have the rest of the world use it. But if you look at the high level, like the 20,000-foot level, it’s actually a decent view of where the future might go.

If you look at Google, Facebook, Microsoft, Amazon, Netflix, all of these top companies, the way they do this is, "Hey, let’s create…" They might not call it a data lake. It’s like, [00:21:00] here’s a repository of data. In Google’s case it’s Colossus, a file system, but if you look at it, it’s just a data lake, and you dump all of your data in it.

It doesn’t matter what it is. And then you have some open spec to define what the storage is, and then you can actually create different engines for processing it. Like in the case of Google, they have F1 for serving pretty much all the queries, and they added this new system called Napa for serving materialized views.

But then you have all of those systems sitting on top of open data. In the case of Facebook, they use Presto and Spark: Spark for a lot of ETL and a lot of the longer-running SQL, and Presto for the interactive SQL. And basically they created this architecture that we call the lakehouse.

Which is, you store all of the data in data lakes, and then you run all of your workloads directly against the data lake. There are still organizational issues and politics you have to solve in the organization, just like everything else, but from a tech point of view all of these companies have standardized on, okay, [00:22:00] here’s more or less the lakehouse architecture, which is you dump all the data in the data lake with an open format and open spec, and then you run different workloads from it.

Drew Banin: Okay. Yeah, it certainly feels like a compelling model, to be able to, just from a governance perspective, manage all that data and who has access to it.

Reynold Xin: It’s extremely important because it substantially simplifies what you have to think about if you have just the one place.

Drew Banin: Sure. I must tell you, I am peeking at the Slack channel. And there’s one question that I do want to bubble up to here, and we can certainly answer others in the chat, but the one big one here is from Josh Devlin and he says he doesn’t have that much experience with Databricks instead, focusing more on traditional warehousing options.

And he’s wondering what the case is for choosing Databricks over some of the other leading data warehouses in the market today. How would you argue for Databricks?

Reynold Xin: Yeah. So I think a good way to think about this is that Databricks started on the more data processing side and then [00:23:00] focused a lot on data science and machine learning.

And we are now obviously taking on what we call the lakehouse architecture, which is, hey, you should be able to run all workloads on Databricks. Now, specifically when it comes to the data warehousing style workloads, a lot of what didn’t run so well in the past, which I think Julia talked about, is there was a lack of maybe a catalog, and a lack of low-latency queries for BI.

We’re working on a lot of those, and you can actually find a lot of benchmarks. I think we have state-of-the-art performance in a lot of them. As a matter of fact, we just took the number one spot on the official TPC-DS benchmark recently, for the ability to process complex SQL queries super fast. And through governance you can do fine-grained access control on all of your data.

So that’s what we worked on a lot in the past couple of years. Now, if you think about it from the view of consolidation, the advantage you would get is basically a single layer that can serve the data warehousing workloads as well as the machine learning and the data [00:24:00] science workloads, which are becoming increasingly important.

You can run workloads from super small scale to super large scale. I think if you look at it more closely, there’s really only one platform right now on the planet that could effectively do all of these workloads on one platform.

Drew Banin: Okay, got it. It’s a strong argument. I’ll tell you, we are impressed and grateful overall for the pace of innovation and development in the data warehousing space. And just broadly, we totally see the convergence that you’re talking about. It’s happening in different ways in different places: it could look like AI or data science capabilities in some data warehouses, or other languages in other data warehouses too.

It’s certainly an exciting time to be following along. I guess coming up on the end here, and then we can jump into the Slack to discuss with others. I do want to take a moment to get a hot take out of you. There’s a lot of hype in general, around the sort of modern data ecosystem.

[00:25:00] What in your opinion is over-hyped right now?

Reynold Xin: I think the modern data stack is actually solving some really concrete, real problems. And we’re really excited to be part of that. There are a lot of individual concepts that might be over-hyped. I think for example, DataOps, I think people have been doing that for a long time and people are trying to do that.

And I think, maybe tying back to what we talked about already, which is whether you need another database or not, a lot of domain-specific databases are probably very much over-hyped. Just because Facebook has to build something to serve data in specific ways doesn’t mean you have that problem. For most problems, when they’re not as specialized and not as large scale, you can perfectly well solve them by just using some existing implementation.

And then it actually goes back to the point of consolidation. It doesn’t bring you that much good to bring in a lot of specialized systems everywhere.

Drew Banin: Okay, great. I will tell you, I read a white paper out of Google about a database called Procella, and if I ever get a couple of months off, I want to go implement it. [00:26:00] And it is a little Google student.

Reynold Xin: One of the main co-creators of that project actually works at Databricks on Databricks SQL right now. He’s trying to not create a specialized system, but instead a more general one. He’s just trying to improve the general infrastructure, so we can now do super low-latency queries.

Drew Banin: So if you’re able to share the inside scoop, is it as cool as the white paper?

Reynold Xin: I think for the problem it solves, Procella is really cool. Yeah. But the problem is it’s yet another system, with a whole team operating and building it, which I don’t think is very realistic for every other company to be doing.

If you have tens of billions of dollars of revenue hinging on that one thing, yeah, absolutely. But I don’t know how many enterprises would actually have that problem, with a $10 billion product depending on that one system.
