Table of Contents

Building an Open Source Data Stack

Never heard of data pipelines? Building your first data pipeline? Already have a data pipeline and just want to scope out the new trends? This is the workshop for you!

We’re going to be explaining:

  1. why data pipelines are important
  2. what are the steps in a data pipeline
  3. which open source tools you can use to build your pipeline

By the end of the workshop, you should have the context you need to build your own analytics stack using only open-source tools.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Jeremy Cohen: [00:00:00] Hello and thank you for joining us at Coalesce. My name is Jeremy Cohen. I’m a product manager at dbt Labs. I’ll be your host for this session, which is titled building an open source data stack. We’ll be joined this morning early afternoon by Katie Hindson who is head of product and data at Lightdash. Product and data.

Two things. I like a whole lot. I’m particularly excited for this session, selfishly, because Katie and I have a few things in common. We spend a lot of time thinking about tools that can help analysts be more effective. We believe in the power and necessity of putting open source software at the center of the modern data stack.

We were both early employees at our respective companies and we live at either end of now honing counting tributaries. So all chat conversation is taking place in Slack, in the [00:01:00] #coalesce-open-source-data-stack channel. If you’re not part of the chat and you have time to join right now, go to getdbt.com/community.

And then once you’re in the space search for #coalesce-open-source-data-stack , this is different from an in-person conference. We encourage you to talk through. You can ask other attendees questions, make comments or react at any point in the channel. After the session, Katie will be sticking around to answer your questions in the Slack channel.

So we encourage you to ask throughout the session we’ll pin questions and she’ll follow up in the 10, 15 minutes, right after. With no further ado, I’m really excited for this one. Let’s get started. Over to you, Katie.

Katie Hindson: Cool. Yeah. So hello everyone. As Jerry mentioned, I’m going to be talking a bit about building an open source data stock and which tools [00:02:00] you can actually use to do this.

So I hope that you’re excited for this presentation because it’s going to be. Stacked with fund data stuff. Yeah, as Jeremy already gave a really good introduction, but I’ll just do it again. Cause it’s feels socially awkward not to, but my name is Katie and I work at Lightdash, which is a BI tool that connects directly to your dbt project.

And I’m here today to tell you that the modern data stack it’s great. So yeah, for those of you who have never seen the modern data stack, I’m going to give you an overview of what the modern data stack is and why it’s awesome. And for those of you who have a modern data stack and already know what it is, I’m just going to give you the content that you need to explain it to other people and to really get them on board, if you need.

So before Lightdash, I actually used to work at a bank, but don’t worry. It was like a modern, cool apple new bank in the UK. But we had the Ferrari of a data [00:03:00] stack. It was super modern. It was really sleek. It was really effective. But we started to hire more and more data people who’d worked at more traditional banks, their whole careers.

And we realized that this was like a massive change in the tools and the way that they worked with data in the past. So they felt actually held back by this modern data stack more than enabled by. And the more that I ended up talking to them, the more I realized that no one had really explained to how this new system worked or why we’d actually pick the tools that we did.

So without any of this context, the data infrastructure that we had felt a bit more like the image on the left, like really chaotic, really overwhelming, and really not that fun. So I decided to go out and spread the love for the modern data stack or more specifically to really spread the understanding of the modern data stack.

And I thought that this was a great idea because for me it sucks when I don’t understand things, I feel really left out and really frustrated. And [00:04:00] I’m way more likely to use something if I actually understand why it’s useful. And I also can’t really help improve a system if I don’t understand how the system works.

So in my mind, by, helping this new team to understand the data stock, this was like a no brainer. And I decided that I wanted to do the same thing at Cola. So I want to spread my beliefs that the modern data stack is way better at getting us from raw data to insights than what we used to have.

[00:04:27] Let’s set the scene #

Katie Hindson: And also open source software, lets us build a stack that’s modifiable, extensible, and reusable. So I’m going to be focusing on open source tools that allow us to actually build this stack. To understand like the power of the modern data stock. I want to set the scene of a company that we all might share some experiences, or maybe some traumas with, sorry about that in advance, but don’t panic.

Let’s imagine that we’re at a company without a modern data stack. Okay. So things happen in our app. [00:05:00] Sometimes they get adjusted and they end up in BigQuery because Francesca and finance has built some web hook that gets triggered anytime important payments happen, but sometimes it fails and no one really knows when that happened.

And last month, Francesca accidentally ordered 20% less stock than they needed because they’d stopped tracking orders coming in from iOS devices. And just, nobody had noticed now your CEO looks at the customer growth model using the model that one of the product managers built for them. And the model is like this really complex thing in Excel with plenty of view look ups.

And the sheet that it’s built on is called company growth scenarios V five final do not. The newest intern added a column a month ago and accidentally duplicated all the rows and it actually took a week for anyone to notice. So someone added do not edit to the sheet names since that happened.

Now there was also a release that happened last night, which ended up changing a big part of onboarding.

[00:06:00] And some people were worried that like it might affect user growth. You have actually the product manager that updates his user growth model every morning. And they use information from Stephanie. Who’s one of the engineers that just happens to know a bit of SQL and only Stephanie knows how to figure out how many users there are each day.

Stephanie happens to be off sick today. So there are no updates on the impact of the new release on the product.

[00:06:26] Enter…the modern data stack #

Katie Hindson: Now, this is a scenario where no one at the company really knew what a good data pipeline looked like. And even if they did, none of them knew how to build one themselves. So they’ve ended up in a situation where data informed decision-making just, isn’t an option. But you might ask, how do they fix this? Like surely this isn’t the past, the present and the future for them. And this is where the modern data stack comes in. So to solve at least part of this problem, you need to build a better data pipeline. And so for the rest of my time here, I’m going to talk about kind of three steps that make up [00:07:00] most of the foundation of the modern data stack.

And I want to talk a bit about how we can use them to actually build a data pipeline. So I’m going to cover some of the biggest problems that we have in setting up an effective data infrastructure, and I’m going to keep it all open source. Yeah. So here are the three steps that I am going to talk about for another 20 ish minutes.

The first one is going from app land to data land, or the extract and load step. So here, I’m going to talk a bit, how, a bit about how we get raw data into a data warehouse using rhetoric. The second step then is going to be finding America’s next top data model. I hope you’re enjoying these titles as much as I hope you are.

But yeah, this is where we’re going to do talk a bit about transformation and how we organize raw data using data models in dbt. And then the third step is keeping our data dashing, which is undeniably my favorite step. We’re going to talk about visualizing and analyzing data and how we do this using like that.

I [00:08:00] know some of this might be in a different order for you or some things might be missing in your very strong opinion about what the data stack is. But in my opinion, this is the backbone of the modern data stack extract and load transform, and then visualize and analyze. So what I’m going to do is go through the previous scenario and walk through all the problems and talk about how parts of the modern data stack actually solves each.

[00:08:24] From app-land to data-land #

Katie Hindson: So let’s kick it off by talking about how we stream raw data into our data warehouse using Rudderstack or as it’s more friendly, known from Atlanta to data land. Yeah. So in an ideal world, when a user does something in the app, you know about it. So if a user interacts with something in the app, we’ve built our features of, they actually spit out messages, that log information about what the user is doing.

And Rudderstack is an app that actually lets us do this. So Rudderstack lets us easily track actions that a user takes in the app.

Now, the way they do this is Rudderstack actually [00:09:00] gives you tools that you need to capture event data, then send it to your entire stock with just a few lines of code. So with Rudderstack, your engineers can build features that log information when a user interacts with them and then they use the Rudderstack SDK.

To move that information between your app and where you want it to go. So for example, Wani, you might want to know, when a user ran a query in your app. All you have to do is add some code from Rudderstack that generates an order process to event. Any time that this happens in your app or query process event.

And then Rudderstack actually lets you send this data to wherever it needs to go. So here you can see an output of one of these events loaded into big query. So we added this run query button clicked event in our code and then the output we put into. So when I say that Rudderstack lets us send our data wherever we want it to go.

It like really lets us send our data to where we want it to go. So once you started logging all of this, like juicy, new, exciting event data, [00:10:00] Rudderstack supports more than 150 out of the box integrations. So you can stream your events to like a ton of tools in your stock. So with Rudderstack now all of the data in the app gets logged or extracted from our app.

And then we use one of its tons of integration to load this data into our data warehouse. So now all of our beautifully standardized events are saved in our raw event, data storage. And that is how Rudderstack lets us extract and load our data or act as this bridge between app land and data land.

Okay. Cool. So we’ve got our data into our data warehouse and the next thing that we need to do is actually make some sense of it, so that it’s useful for everyone else at the company. And this is where data transformations come in. So I’m going to talk a bit about how you organize raw data using data models in dbt.

Yeah like I said, all of our raw data is now in our data warehouse and this is really great, like step in the right direction. But joining an analyzing a bunch of raw data like user [00:11:00] update profile completed. Isn’t nearly as nice and friendly as looking at transform data like a users table. So how do we actually go from this pile of raw data into all of the pretty sparkles and transformed data land?

And this is a great question. And the answer is that we use data models to standardize relationships between raw events. So we define abstractions on top of our analytics events, which we call data models. And these data models are basically SQL queries that transform the analytics events that we get from our.

Into reusable, multipurpose tables that describe product or business processes or entities. Now the output of these models are the tables that we know in love, like users or accounts.

And the way that we do this, as we actually write our data models using dbt and their framework. So dbt owns all of the transformation in general, happy [00:12:00] sparkle, lauding between our raw events, data and our transform data.

And when I introduced this stuff to the people where I used to work this was the one where I got a lot of pushback and the two questions that I kept getting asked. Why not just use equal and some version control, or isn’t this what Looker does. And surely implementing something like dbt is way too complicated for just moving models from SQL into something else.

And the way that I answered these questions or the way that I convinced people of the use of dbt is by talking about what would happen if we didn’t have. So that’s go back to a problem where we added a new column to the customer growth model and just try using SQL.

Okay. So there’s nowhere to document what this new column does or means.

So all the context is lost, but I guess we could create a document of all the columns and their descriptions and stuff and save them somewhere. So that’s like solvable. And then as we talked about it, we accidentally duplicated [00:13:00] every row in our database. Okay. Yeah. So I guess we could add some like SQL testing files to check if we’ve accidentally broken any of our major models.

But now we have to build some sample data sets. If we want to test our changes or we have to run it on and tied on our entire data set, which could take a really long time if you have terabytes or huge data sets, you’re running the song. But I guess you could build some code that auto-generate sample data, so you could test your changes.

Yeah, in reality, you might be able to fix all of these things individually, but dbt fixes all of these things at once. So it, and it’s like way, way more scalable of an approach to solving these problems. So dbt is the tool for data transformation and it makes it easy to manage relationships between models using this ref keyword.

Visualize relationships using their kind of lineage and documentation, and also assuring the quality of the transformations through testing and model validate. Yeah, we write data models using dbt and [00:14:00] their framework, which is basically SQL and some stuff I know it’s way more than that, but for the sake of this it’s that so this means that we use dbt.

Anytime that we add to an existing, or we build a new data model. So dbt wraps itself around all of our data transformations and it pulls in all the raw data that we need to build our data models. And then it helps execute all the data models and then it organizes this and sends it to our data warehouse so that it’s ready for analysis.

[00:14:29] Keeping our data dashing #

Katie Hindson: Okay. So all of our data is now back into our data warehouse and it’s looking on like sparkly and transformed, got a glow up and it’s ready for a data analysis. And this is where the last tool in our stock comes in, which is Lightdash. So I’m going to talk a bit about how we let everyone at our company explore our data and share their insights using light LinkedIn.

So we have all of our nicely transformed data and we actually want to let other people use this data to answer their own questions. [00:15:00] Now, I think that there’s really three ways that we can let people answer questions using data.

[00:15:06] Give everyone access to all the data #

Katie Hindson: So the first way is that we give everyone access to all the raw data and are like, yeah, you know what?

Just take it from here. Now this approach is really flexible. And if everyone at your company knows SQL, it’s actually incredibly self-serve and it can be pretty effective. But in reality, not everyone knows SQL. This is a ton of information. So you get a bit of information overload. So you end up with lost context, duplicated work, there’s bound to be errors cause people don’t understand relationships between data or something.

[00:15:37] Give complete answers to everyone #

Katie Hindson: And what you’ve done is you’ve made everyone into an analyst. So yeah, maybe not the best use of everyone’s time. Now the second way of doing this is to give complete answers to everyone. So this is actually how a lot of businesses traditionally share data insights. You have analysts and you ask them questions and then they just give you the answers.

The [00:16:00] great thing about this is that there are no data skills needed, except for the specialist data team. You have fewer errors because again, only the data specialists are interacting with the data. But the problem with this is that it really doesn’t scale. An analyst really quickly become a bottleneck.

And to be honest, it sucks doing this job, like not answering your own questions and only ever responding to other people’s questions. Isn’t super fulfilling and it’s not the best experiences now. And then we have this third approach, which is where we give people useful building blocks, or you build this semantic layer.

[00:16:37] Give people useful building blocks #

Katie Hindson: And this gives people just enough self-serve to answer their own data questions. Now, there are some downsides to this. It, you need a data team and these kinds of systems need maintenance, but the upsides to this, as compared to the raw data approach, there are fewer errors. And instead of everyone needing to know SQL, you just need some SQL experts.

And then if we compare it to the complete answers [00:17:00] approach, it’s way more scalable. It’s way more self-serve and you’ve ended up leveraging your analysts. So they become force multipliers in your business instead of being a bottleneck. So those are the three ways that we think that you can give data to people to answer their own questions.

And I’m guessing that you can probably. I guess which one we picked out like dash. So yeah, Lightdash gives users meaningful building blocks to answer their own data questions. So this makes it easier for them to self-serve and really answer their own questions using the data at their company. So within these kind of meaningful building blocks what do we mean by this?

And we think that there’s four key building blocks for answering questions. The first ones are these entities or kind of these curated tables like users or emails, which contain all of the data related to a business entity.

Then within these entities, you have metrics which are things like weekly active users and define aggregations or computations [00:18:00] over the data in your entities.

And these metrics have some unified definition. And then the third thing is that you can actually filter data so you can filter them into relevant data subsets using any of the metrics or dimensions in your table. And then finally, we want to be able to group things so grouping data into segments and giving other people access to these segment rules.

The great thing about Lightdash is all of these building blocks are actually defined in your dbt project.

If you imagine that you’re someone at an e-commerce business who has just jumped on the trend of selling these cool things called jaffles you could be at your Jostle shop asking questions, like what is the total order count split by status and with Lightdash, we’re actually giving you the building blocks that you need to answer this question with just a few clicks.

So you have your models defined in dbt and any model in your dbt project with the schema file just gets pulled into like dash automatically and becomes a table. You can explore. Then within [00:19:00] these tables, we have dimensions. So these are just columns that you’ve defined in your dbt schema file. And we pull in their names, their descriptions, and we automatically detect all the types and everything through your data warehouse.

Then you can also add metrics directly into your dbt project. So again, these are within your model schema files, and these just get pulled into Lightdash when you sync your project. So yeah, like dash stays in sync with the business logic in your dbt. And the idea is that once we have all the dimensions and metrics that we need to answer a question, we just click on a few fields that we want and kind of Lightdash creates this beautiful plot for us.

So with the question what is the total court order count stuff by status, we would get this pot as our answer. Yeah, no, I’m not sure how many of you folks were at Drew’s talk yesterday about dbt map. But I’ve never actually been so stoked, but a history lesson. Yeah, but for those of you who weren’t there, the really short version of it is that with dbt version 1.0, we’re going to get access to dbt [00:20:00] metrics, which is really exciting.

[00:20:02] Lightdash now supports dbt metrics #

Katie Hindson: So last week we actually decided to upgrade like dash to also support dbt metrics. So these metrics that I talked about before they can be defined either in the Meditech or using dbts metrics tag. Now this is really early stages, but we’re super excited to see how kind of the DBC metrics layer develops and how much more fire power we’re going to get by bringing this, to exploring your data in like dash.

Yeah. So to summarize Lightdash gives you these. Users your user is meaningful building box to answer their own data questions. So you have these curated sets of metrics and dimensions that you can use to cut your data in a bunch of different ways. And then you just connect Lightdash dbt project and all this stuff you have in your dbt project is just pulled into like dash And it keeps all of your business logic in one place.

And the other great thing is because we’re plugged into your transformation layer, we’re able to pull in all this rich metadata and kind of add it to dash to help you build trust in your data. So like dash [00:21:00] hooks up to our dbt project and our transform data and gives people the building blocks that they actually need to answer their own data questions using your company’s data.

Okay. So we have our data pipeline now, and we’re going to go back to her original scenario and talk about how much better it is with this new data stack. So now things happen in our app. And one of the engineers has used Rudderstack to make it to that. Anytime a payment is completed, it triggers an event.

This event gets loaded into our data warehouse destination also, thanks to Rudderstack and Francesca also set up the connection to Slack so that when critical errors happen, Rudderstack sends, this sends the critical error to Slack where Francesca gets notified and can deal with it. So that means there are no more payment problems, which is great.

Next. We actually use dbt to build data models that transform our raw event data into these beautiful transformed data sets. So now when the intern adds a new column to the user’s growth model, they can test their [00:22:00] changes and actually discover that they’ve duplicated a bunch of rows. So they go back and fix it.

Easy peasy, no worries, no breakages. They also add a description to the new column to the schema file. So when the CEO sees it, he knows exactly which growth scenario. Now the output of the customer growth model is saved in a table in our transformed data set and like dash connects directly to the customer growth model where Stephanie has added a metric that counts the total number of new users.

It also like dash pulls the most recent data entries from our database. So now the product manager has access to the growth model in Lightdash, and they don’t need to know SQL to see the trend in the total number of users. Over time, they build a chart, they save it and then they share it with the two.

And that’s it, the modern data stack makes everything great. Yeah. So after all that, I hope that you’ll agree with me that the modern data stack has all the main bits. We actually need to go from raw data to [00:23:00] insights and it’s way better than what we used to have, but it’s worth mentioning that I really send this down and only talked about three parts of the data stock.

When in reality, there’s like a ton of other tools and tasks you might actually end up needing at your company. Things like Reverse ETL or tools for doing notebook type analyses for data scientists. But basically what I’m trying to say here is that there are a lot of options and moving parts. So the question then becomes with all these moving parts and with the modern data stack just keeps getting bigger and bigger.

How do we actually integrate all of these modular parts together to create this overarching data experience? Now as data builders, we feel like super empowered by all of these tools because we understand each of their individual uses deeply. And we remember what it was like before these tools made our lives so much easier.

Yeah, we feel like we have this Ferrari engine powering our analytics. So we talk about all the improvements [00:24:00] that we’ve made to our systems uh, testing these super new and exciting ETL tools. The speed. But from the outside, a lot of people still see data that breaks a lot or they can’t find what they’re looking for.

So they end up going back to their old ways of kind of more spreadsheets and these keep cropping up more and more. And then this idea of self-serve becomes so much more of a distance fantasy. Yeah, it doesn’t matter if we have Ferrari engine, the only people who notice out of the data, people who are looking at the engine regularly, what matters is that our data stock actually comes together and feels like a Ferrari.

So until we’re actually able to bring all of these pieces together, all the improvements that we’ve made to our data stack still look and feel really academic to end-users. So they literally can’t appreciate how much more powerful it is than what we used to have. It’s like we’ve put a Ferrari engine inside this sort of mismatched janky looking car.

So we need to start thinking more about the next thing, which is [00:25:00] the more modern data stack or kind of more, how do we integrate all of these incredible parts together to create this equally incredible data experience? Yeah. I don’t have the answer if I did in the last three minutes, that would be pretty cool.

But it’s definitely something that I’m thinking a lot about and I want to push everyone else to start thinking about it too.

So on that note that’s the end. And if you’re interested in learning more about Lightdash or you have any questions or you just want to chat we actually have a channel in dbt it’s #tools-lightdash. We’re on Twitter and we’re in Git hub because we are open source.

Yeah. Thanks so much for listening and I hope that you learned something and that’s it for me.

Last modified on: Apr 19, 2022

dbt Learn on-demand

A free intro course to transforming data with dbt