Table of Contents

Smaller Black Boxes: Towards Modular Data Products

Stephen is a researcher with diversity of interests. He has expertise in image processing, statistics, and data visualization, and extensive experience in Python and AWS ecosystems.

Apps, dashboards, catalogs, metrics, syncs, streams, dumps, dictionaries, reports, alerts, pipelines. .. the “modern data stack” may take 30 minutes to spin up, but the work certainly doesn’t end there.

The greater availability of data and innovative tooling has created dozens of new ways for data teams to provide value, but it has also increased the complexity of managing it. The result may understandably be perceived by many stakeholders as a modern “black box” that does “data things”.

And even when we implement more precise terminology – such as data products – it can be a challenge to align across the company, much less market, monitor and govern the actual resources. Yet, making modular data products is one of the key challenges to scaling a data platform.

Tightly coupled systems can’t scale, and teams can’t separate responsibilities without product boundaries. In this talk, I’ll present some of the challenges that our team at Immuta faced when scaling a “monolithic” platform: New team members struggling to navigate a complex dbt project; data consumers not being aware of the many tools and assets available to them; executive leaders not having a clear insight into the accomplishments and concrete roadmap of the team.

I will then detail the steps we took to re-organize our operations to create better walls around data products, from ingestion pipelines to dbt models to the consumption layer. I’ll outline how we fleshed out our team’s product catalog and set metadata expectations for each product, such as ownership, regulations, and quality. Attendees will walk away with some recommendations for creating their own data product strategy that will be useful no matter what scale they are operating on.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Amada Echeverría: [00:00:00] Welcome everyone. And thank you for joining us at Coalesce 2021. My name is Amada Echeverría. I use she/her pronouns and I’m a developer relations advocate on the community team at dbt Labs. I’m thrilled to be hosting today’s session, Smaller Black Boxes: Towards Modular Data Products, presented by Stephen Bailey. Stephen is the director of data and analytics at Immuta where he strives to implement privacy best practices while delivering business value from data, he loves to teach and learn on just about any subject.

He holds a PhD in educational, cognitive neuroscience from Vanderbilt and enjoys reading philosophy. Stephen has three boys under five and says he has been upping his hobby game during the pandemic by learning to play guitar and to home brew. Over the course of the next 30 minutes, Stephen will outline how his team flushed out their product catalog and set metadata expectations for each product such as ownership, regulations, [00:01:00] and quality attendees will walk away with some recommendations for creating their own data, product strategy.

That will be useful no matter what scale they are operating. Before we jump into things, some recommendations for making the best out of the session. I’ll chat conversation is taking place in the coalesce-immuta channel of dbt Slack. If you’re not yet a part of dbt Slack community, you have time to join now.

Seriously, go do it. Does it get dbt.com/community and search for Coalesce Immuta when you joined Slack. We encourage you to set up Slack and your browser side by side. In Slack, I think you’ll have a great experience. If you ask other attendees questions, make comments share means or react in the channel at any point during Stephen’s session.

To kick us off, our chat champion Azzam Aijazi, senior product marketing manager at dbt Labs has started various threads and shared many dad jokes to break the ice and thank you all for [00:02:00] chiming in and adding in your own. So let us know where you’re calling in from and contribute to the Slack in other ways. After the session, Stephen will be available to answer your questions.

Let’s get started. Over to you, Stephen.

Yeah. Gotcha. We’re lucky Slack is very fun right now. So hopefully I think that’ll get us through until we have Stephen again. Thank you.[00:03:00]

Okay. There we go. Thank you for saving me from an embarrassing art tour, Stephen, right on time.

Stephen Bailey: Sorry. I think the outage was coming after me. Like my thing, my internet, my house internet went down, so it’s just, it’s everywhere. I think we’re good now.

Amada Echeverría: Thanks again. And we’ll hand it over to you.

Stephen Bailey: Awesome. All right. Thank you everybody. For bearing with the outage and for joining me today, I’m really excited to share the content of this talk. It’s a topic that I’m passionate about and that we at Immuta and on the data teamthat I’ve learned a lot about the past year or so as we’ve really started to think about it [00:04:00] more vigorously.

I’m going to have some Slack prompts throughout the the broadcast and would love for you to like hop in there and provide some of your opinions on this question. So to start with. How big is your dbt graph? Like when you think about all of the nodes in it, like how big does it get? Is it in the tens? Is it in that hundreds and thousands? Let’s just think about how big of a platform we’re maintaining. Amanda already introduced me. We’re going to break this talk into three sections. There’s the the first section, which is what is the problem that we’re talking about. I call it the modern black box problem.

[00:04:32] The Modern Black Box Problem #

Stephen Bailey: And we’ll talk about the data product thesis, which is out there too which I believe is a solution, a potential remedy to the black box problem. And then finally, how our team at Immuta really started taking this product centered thinking and reorganizing our data stack to just support that thinking.

To motivate this problem, I want you to think to yourself, what would other people in your organization say that the data team does? I [00:05:00] know, I think the data team does, I think the data team creates this platform that can support a community of data-driven insights and activities and can create this bastion of intellectual freedom in a desolate wasteland of like bad data.

I think the work that we are doing is just with the modern data stack and with the stakeholders that we collaborate with, it’s exciting. And one reason that I personally see so much promise in the current modern data ecosystem is I have seen it done poorly. I’ve seen it from, when we first started building out the data team at Immuta and it was just spreadsheets and PowerPoint presentations and a laptop.

To implementing the monitor data stack and how it was such a breath of fresh air for myself and the other folks, working with data to have a clear pattern and pipeline and great tools, and then producing powerful visualizations so quickly. [00:06:00] And it scales. So six months after you implement it, it’s still going strong.

The operational concerns are not too heavy. It just feels like you’re you start spreading the love to new systems and adding new components to it that make it more powerful. You start integrating more communications channel and it’s just, you feel good about it. And even a year later, as you start bringing on the team, you’ve got good workflows, you’ve got good patterns.

And all of a sudden this data stack that started out so humbly is now able to accommodate multiple ETL sources. It’s able to use all of the different nodes in the dbt graph. You’re able to do integration, testing and slack channel and Reverse ETL. And pretty soon we can do metrics layers, and we’ve got a product analytics second, and just as a data professional, it’s just amazing. It’s so exciting and cool to work in this ecosystem where it feels like there’s so much potential for data to do incredible things for the business. But there is something you got to keep in mind. And then [00:07:00] I always have to be reminded of is nuts. Most people in our organization are not data professionals.

Our estimates, I think go from, 70% of the people in the company to 90% of them don’t use data in a sort of power user way. They don’t read Ben Stansell thought leadership articles on Friday night. I don’t think they should obviously, but they aren’t us righ, us data geeks. But surely the value of the amount of data the second sort of intrinsically obvious to everyone, no, it’s not intrinsically obvious to everyone.

Stephen Bailey: And I was reminded of that earlier this year I had, we had a great new executive come on the team was really excited to meet him. We start having this great chat and then about 15 minutes into it, he says he pops well, Stephen what exactly do you do here? Like when should I come to you for an answer to something?

And I hesitated a second because what don’t we do on the data team? We’re in everything. But what ended up coming out of my mouth was [00:08:00] something like very vague. Like we end up taking data from over here and we extract the value from it and we put it over here but we do it in like the best possible way. It’s amazing the way we do it. It’s so modern. It’s great. You can see the wheels turn with this executive and he sound like, ah got you. Dashboards.

Stephen Bailey: And that’s it’s so frustrating for a data professional to have that sort of conversation, but you also can’t blame them, right?

Because the data world is chaos and there’s no concept of a data. The thing that other than a dashboard that goes from company to company. But there’s this missed potential because we envisioned something that can be reality. But then when we hear other people talk about it, it seems to come out like their perception seems to be more of like this BI monolith that has the swirl of technologies that no one can even pronounce and understand. And it’s all, supporting dashboards. It’s not good for anybody. It leads a [00:09:00] lot of value on the table. But I think we as data professionals are sometimes part of the problem because we love our tools. I love our tools. I think they’re great. And there’s innovative people who are doing, creating great new tools every week. But if 90% of the company are not data people then talking about the stack, having conferences on the stack does not bring them into the conversation. Instead, it actually pushes them out.

Stephen Bailey: it creates an opaque layer between what the data team does and the value that the business needs to get out of data. That opacity, I think it manifests itself in a one-sided relationship with stakeholders, and it’s vulnerable to dysfunction. We have our tool, it gets bigger and bigger over time.

It’s sophisticated, it’s technical. It has logic in it. It generates good things, but it’s easy for there to be value like [00:10:00] a misalignment on what is the value of the things that are being. Who owns it, do the lines of business on it, does the data team own it. And then what is actually happening in there and are we sure that it’s happening at the level of accountability and up to the standards that we’ve really needed to account to?

So I think there’s a gap. And I think one of the fundamental reasons for that gap is this inability to communicate. There’s no concept that we can share between the professionals and the business who wants to engage with the professionals. Fortunately, I think the community is really starting to orient around an answer to this.

[00:10:36] The Data Product Thesis #

Stephen Bailey: And I think that answer is the data product, right? I think the data product thesis is fairly approachable. It’s that, data professionals create data products. And data products deliver business value. And so if you hire more data professionals, you can create more data products. If you have more data products, you should be getting more business value out of them.

There’s an [00:11:00] abstraction here and it’s not the data professionals themselves that are delivering the business value. By way of creating these systems that can then continue to deliver business value while the data professionals build more things. What it really amounts to is this shared concept that the data team can produce something the the business can consume it and understand it and then take ownership over it. And so if you can, if we can break down that big monolithic black box into pieces that we can hand over and talk about and individually, then we can start building alignment on data ROI. How much did this data product ROI, this data product delivers. Who owns this data product?

Can the data team produce something and then truly hand it off to the line of business? Can the line of business actually come over and start making data product themselves? And then finally, can we build meaningful standards that say, Hey, a mature dash-boarding data product[00:12:00] has these five things. All of the data products should adhere to GDPR regulations.

[00:12:06] What is a Data Product? #

Stephen Bailey: We require this sort of shared concept in order to answer those questions, but when it comes to reality, there is a wrinkle. What is a data product? There are lots of good answers, lots of Medium articles. I could come up with one you could come up with one. There’s no shortage of definitions of data products, but when it comes to your organization and implementing a framework of data products at your organization, what matters is that what you produce on your team and what your business wants is aligned and is understood. So we think about data product, I think there is consensus that it’s a thing. It uses data in some way it’s bounded.

So a data product is not all data in the world. But it also has a purpose. So you build a data product in a certain way because you’re trying to meet a certain [00:13:00] end goal. And it’s also categorizable. There are different kinds of data products and data products within a certain category probably share similar traits.

So it’s standardized to a degree. But ultimately, when we think about implementing a framework at your organization, you’ve got to make the decision and call of what your team produces. At Immuta, we don’t build huge ML models that we’re then deploying to end users. So that’s not the data product we create.

[00:13:27] The Immuta Data Team’s Product Catalog #

Stephen Bailey: It could be a data product that someone else creates. So in the last year we did an exercise to go through and think about the last two years of data effort. And we took all of the projects that we did and we split them into two classes. The first class was platform work and the second was product work.

And then we said, all the platform work is considered to be a supporting element of the data product creation process. And then what was leftover and the product work was eight categories. And [00:14:00] those categories broke out roughly actually exactly into this data replication jobs like extract and load processes, data integration, jobs like Reverse ETL syncs, data as a product, data products like, final dbt models and marts, those tables that we’re exposing and we put a lot of thought into organizing, interactive applications like Looker Explores and key dashboards like Looker dashboards, like our company status report that’s really maintained by the data team and has a high SLA. And then data alerting and then strategic analysis.

So when we get commissioned to do a deep dive into a certain problem in the business, to analysis, then finally on demand consulting services that might call anyone from our team. So the goal of this exercise and really of adopting this framework was to take this technology focused black box and instead, turn it into a product focused catalog.

And you can [00:15:00] see here there’s some similarities, right? All of these products are powered by tools in our stack. That’s great, but there’s a higher alignment to the output of the work than it is to just the system and like all of these arrows between between technologies.

And in fact, I could just take away most of the technology logos on this data product catalog. And we’re fine, right? Because the only thing that matters to the business and to getting business value is where the data is exposed in the end. That Looker dashboard that someone can actually go to immediately, as soon as they learn about the application. That table and Snowflake that they go to and query, as soon as they learn about the existence of it.

Those are where the the data stack in the business meet. And by focusing our work on that part of the generation, the stack, we can drive better alignment. [00:16:00] So if you believe in the data product thesis, that data teams create data products, then really, everyone in the company should know how a data product is defined by your team, how it’s managed throughout its life cycle.

And then also what standards it adheres to. If you have GDPR requirements or CCPA requirements. Can you look at a set of tables and Snowflake and know that the correct access controls are in place, that the correct masking policies are in place? If it just lives in a black box, it’s really hard to get that sort of product level insight.

All right. So now I want to share some learnings from the last from the last year of really trying to reframe our stack in this product focused wins.

[00:16:45] The Product Centered Stack #

Stephen Bailey: I’m going to share three of them yet about challenges that we faced before that we’re able to be ameliorated by this product driven strategy and really thinking in terms of what data products are recreating, creating, and maintaining. [00:17:00] And the first one is misalignment between your user stories and the actual data on the team.

Tell me if you see yourself in this, it’s the start of a new fiscal quarter. Every department is figuring out what their priorities are and the data team I’m waiting to hear, like what’s going to come down the pipe. What can we help? And the people team executive comes over and they say Hey, we’ve decided that we want to really get a dashboard together that can give us insight into our KPIs.

And so we want to have one place to go to for an up-to-date view into hiring trends, attrition, diversity initiatives. Can you do that? And so me as the team leader, I’m thinking, yes, this is perfect. It’s high priority. There’s a clear artifact that’s coming out of it. It’s timely and it aligns well with how we do work.

So we get to work. What does doing this user story, creating this dashboard look like in practice? It looks like a requirements. And then, because the HR software isn’t in stitch [00:18:00] yet we write a Python package for data extraction. We recreate some dbt models on top of that, we write some data quality tests. We schedule all those pipelines. We create a LookML view and then Looker Explore and then write integration tests. And then we explore the data for issues and trends, CD quality issues to create a draft dashboard. Great. And then we review that and g et some feedback. We implement it.

We register those assets in the data catalog because we want to make sure they’re usable in the future. We apply sensitive data tags and access control policies, and maybe even have to go and figure out what is our access control policy, because we’ve never had this type of data in it before you could actually implement that and use a platform management.

We write a blog post about it. We release the dashboard. We celebrate it’s exciting. And then we maintain it indefinitely forever. Now my point here is not that building a dashboard is hard. My point is that it’s important to build it in the right way so that it’s scalable and then it’s [00:19:00] robust long-term.

But when we look at the alignment between the user story and the work that’s actually done, there’s a discrepancy. The user story is just capturing the dashboard. That’s all that it’s talking about. But we are doing way more than just building a dashboard. And the stakeholder doesn’t really have any insight into why we’re doing all of these things this way.

And like, why there’s a disconnect between the work that needs to be done and user story. And it could lead to frustration if it’s taking long. But observe how this changes if you take it from a product focus lens. So we have this conversation, we say, yes, let’s do this. But instead of just going back and start like starting the tinker, we say, Hey, so this is a great idea but actually what you’re asking for is not a dashboard. You’re asking for four distinct data products, and we’ve got to build with all four in order to deliver that final dashboard. [00:20:00] The work will probably get done and get green-lighted. But now the executive kind of understands that, Hey, this is a bigger ask than what they originally thought.

But also, if you look at this breakdown of, for example, the amount of time each of these products is going to take to make, you see that there’s a disconnect between how much time it’s going to take to actually just get the data into the warehouse, versus how much time the final product is going to take to create.

So really the engineering work is the longest part of the cycle. But it’s also the furthest from the actual user story that we’re describing here. Secondly the long-term value of the source data and the application data is actually the highest value part of this project. Yet it’s not represented at all in the user story.

The HR dashboard is gonna be great for the HR team, but everyone else in the business can benefit from a really high quality employee profile. We can enrich Salesforce, we can enrich it workflows off of it. It’s [00:21:00] awesome. It’s awesome to do this. But if you don’t pull out that work and make it distinct, it’s easy to take shortcuts as you try and get that dashboard out.

So pulling it out helps build alignment between the work done at the engineering level, at the application model level, at the explore level and the actual value that you’re producing for the company.

All right. The second challenge that we started to have as we grew was tangled dependencies between sets of dbt models. I’ll say that the whole platform, benefited from it, product focus, and really trying to modularize everything in the platform. The loose coupling is just, it makes it for a more scalable platform, but dbt in particular had some issues as we grew.

So our dbt project started our small. It’s a very complex business. So we grew from a hundred nodes to more than a thousand. And during that process, challenges arose. What models can individuals build on top of. I remember one day I came in and I started looking at the graph of the new [00:22:00] PR and someone had made a decision to build on top of a model that I was planning.

There was no principle out there that said they couldn’t do that. But we hadn’t have a conversation around what should we be building on top of what’s stable and what’s not stable. How do we organize individual projects when we have new projects? Should that just go through and face staging final should go all the way through like that.

How do we orient new users to the project? So if someone’s coming in and even if they’re technically proficient, just seeing that thousand node dag is not it does not tell you that what is happening in the project. The way we had originally organized the dbt project was through this space staging final framework.

And that was great, but it treats the whole dbt package, the whole dbt project as a monolithic thing. So it says, there’s this input layer, and then there’s anything can happen in the staging layer. And then you’re going to have an output layer that is going to be consumed. Overall, that’s a great pattern.

But if you think about [00:23:00] end-user say in sales, hitting that data mark, if these, like the Salesforce opportunities mark model is actually blended with other systems, by the time it reaches Looker, then they start to have questions about where, what is actually happening in the state?

I don’t know that I fully trust it because the lineage gets mixed up in the middle. The logical lineage I should say, gets mixed up in that staging layer. So what we did instead was we said, all right, what are all of our data products that are actually existing in our dbt project? And let’s insulate those.

So let’s take all the logic and push it down as close to the source data product as possible. So Salesforce became its own sort of virtual package. No logic from outside of Salesforce was put into Salesforce. Pardot, same thing, Segment, our internal CRM. And then anything that we wanted to build on top of an integrate across sources.

That became its own data product too. And so within that product, it was its own module. The unified contacts, unified product analytics. [00:24:00] Packages became their own modules. And the rule that we implemented was each of those packages has a final layer, and you can only select from that final layer.

So it allows the logic to start being modularized at the whole project level. And once great side effect of that is, if a new person comes in, they can focus on one data product and they just have to understand the logic for that data product. At the Looker level, if I say something is Salesforce opportunities, we can be confident that we don’t have other systems leaking into that into that exposed model.

So the lineage becomes clearer at a sort of meta level when you can modularize these products.

And finally as we scaled, we started having more systems come in, more systems, doing things and operating in our data that introduces some systemic complexity. But even if you have a really good observability model, it can be challenging to to prioritize what is a real failure [00:25:00] and what is not a real failure.

And what I mean by that is tables are not the only thing in your data graph. You’ve got the SQL processes that run and generate models. You have tests, you have data loading, you have Reverse ETL things that are sending data from your data graph into other systems. You have maybe policy or sensitive data detection, processes running and tagging data.

You might have model training and other more complex data science workflows that go through. There are ways of introducing some observability and framework on top of that, so that you can integrate so that you can look at the complete data graph with those jobs. We chose to use something called open lineage.

It’s a great open source specification, very simple and accessible. The idea behind data open lineage is just that you have your data and every data source that is connected to another one is connected via a job, some sort of process. So a dbt model that’s compiled SQL can be considered a job.

And then you run that job multiple times in order to refresh the graph and these jobs can fail. They can [00:26:00] be retried, etc. Here’s what it looks like in practice. So here’s a Meltano data loading job. It loads three tables, and this is considered a data replication data. We might also have dbt tests on that, on those three tables and some Immuta policies that are doing access control workflows on these three tables.

And so all of these are data products.

If we extend the lens a little bit maybe we have a dbt model that’s creating a sort of aggregation between table one and two. That would be exposed as the data as a product, data product. Then we might have a data integration of sync with Hightouch and maybe a Looker Explore connected to table four.

We’ve got some more dbt tests. You’ve got Immuta policies, you’ve got Spectacles tests. So you can see this whole, when we take the big picture lens of what’s happening in the platform, it can become quite complex, even in a graph with only four tables in it. If we extend that lens to the size of our platform, we have a little over a thousand tables and our dbt project also.

[00:27:00] If one of these jobs fails, we know what data sources might be affected immediately. And we also know what data products can be affected because we know which tables are in a data product. So it’s useful from an observability standpoint to have a model like this. If we extend the lens to a larger set of of models, maybe your whole dbt graph, you might get something like this.

If you summarize it all up, you’ve got 1100 data sources. I’ve got 1800 different processes of all different types going on that. And then if I come in the morning, I look at what failed. I’m going to get a number like 16. It’s never going to be zero and it’s probably not going to be like a thousand.

It will always be in this gray, annoying gray area. And you as a data engineer or data platform lead have to ask, like, how do I feel about this. Do I care about these 16 failed jobs, the 16 jobs in the forest where no one can hear them or are these priorities. Now, if you take a tech focused perspective [00:28:00] on it you might split these jobs out by the technology.

What type of thing is happening? And you might see, okay, we’ve got 16 failed dbt tests, but you haven’t answered the question. Do I care about this? Is this affecting business value? Because these dbt might be failing that, you don’t care about. But if you take a product lens to it and you say, oh I see that these are, I’ve got four failures on my Salesforce replication table.

I’ve got some 12 failures on the Salesforce data as a product table. I know that my revenue operations team relies on these data products. Well, I’m going to prioritize that. If there was something that was not important and that’s where the failures were, you might deprioritize it or you might just send a note and say Hey, we know about this. We’re going to address it tomorrow.

So it’s all about aligning what’s happening in the system with the business value so that you can make better decisions. And you can better [00:29:00] communicate with the stakeholders who are relying on your data products. That’s what the product lens is meant to do.

And we’ve become fans of this on our team because we found that building backwards from the shared data product concepts helps us market what we do, helps us to maintain it better and build on it better. It helps us to monitor better. And it helps us to just understand what’s happening in this complex data product environment.

It also makes us open to adding new things. If we want to add a new metrics layer, we might add a new product. We can associate tables with those individual data products that we create that are in that metrics category. And we don’t have to spend a lot of time. Hey, this is what we’re doing.

All right. So I started the talk with this question of what do other organizations and or what do other people in your organization think you do? We covered the fact that we do complicated things. Oftentimes we aren’t able to succinctly [00:30:00] tell others exactly. We often fail to get across the complexities of what we do and also the possibilities of what we do to others.

[00:30:08] What Data Professionals Envision #

Stephen Bailey: But it’s really important that we prioritize being able to. I really do believe in this vision of building a data platform and a community around data that enables us to make better decisions, to learn more, to learn daily to operate and communicate better. And I do think that the tools that we have are allowing us to lay that foundation. But when it comes to practice, what we are lacking right now is we just have to be able to build alignment on the value of data and exactly what is being delivered between the system and the business.

We need to be able to facilitate shared ownership of the data, and we need to have confidence. That what we’re doing in this complex platform. When we add a new tool in what we’re doing [00:31:00] is adhering to our legal and ethical standards that we are expecting of ourselves. And this last point is one that is really near and dear to my heart.

We had Immuta spent a lot of time thinking about how do we take this complex regulatory environment that we’re in and that’s changing all the time and map it into what is actually happening in the modern data stack. I think it’s not enough to just have the regulations and pass them down to the data team.

We have to be able to integrate these two worlds and build concepts and ways of talking about. , ways of operationalizing the concepts that actually get enforced when we’re doing data stuff. And one of the problems with making that real is there’s just a conceptual gap.

And we as data professionals have to be able to lead that conversation around the responsible use of data product. I think it all starts with everyone [00:32:00] understanding what we do and building a shared understanding. Around what does the operational life cycle look like? And as we add new things, how are we ensuring that we’re going to do that, deploy them in the right way that everyone can understand them.

All right. Thank you all very much. I appreciate the patience on the front end on the technical issues. And it was really my pleasure, really, an honor to be here today. Thank you for your time.

Last modified on: Apr 19, 2022

dbt Learn on-demand

A free intro course to transforming data with dbt