
Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines

Francisco is Head of Product at Monte Carlo.

Originally presented on 2021-12-09

We all know data systems are becoming increasingly complex, which means data issues can surface in a variety of ways across code, source data, or even operational environment issues.

By leveraging best practices from DevOps and software engineering, we’ve found that data teams can resolve broken pipelines faster, and long before stale dashboards, null values, or duplicate data issues impact downstream data consumers.

During this workshop, we’ll share a five-step process analytics engineering teams can use to conduct root cause analysis in a collaborative, quick, and effective manner.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Barr Yaron: Hi everyone, and thank you for joining us at Coalesce. My name is Barr and I work on product here at dbt Labs. I'll be hosting this session. The title of this session is Stay Calm and Query On: Root Cause Analysis for Your Data Pipelines, and we'll be joined by Francisco Alberini, who is head of product at Monte Carlo.

First, some housekeeping. All chat conversation is taking place in the #coalesce-root-cause-analysis channel of dbt Slack. If you're not part of the chat, you have time to join right now: visit our Slack community and search for #coalesce-root-cause-analysis once you've joined. I'm so excited to hear Francisco discuss this topic today.

He was previously senior product manager at Segment, where he worked for five and a half years. Now at Monte Carlo, he focuses on leveraging best practices from DevOps [00:01:00] and software engineering to surface and solve data issues faster, something we all care about here. We encourage you to ask other attendees questions, make comments, or react at any point in the channel. After the session, Francisco will be available in the Slack channel to answer all of your questions. I'm going to pass it over to you and let's get started. Thank you, Francisco.

[00:01:23] Francisco Alberini: Awesome. Thank you so much, Barr, for a great introduction. I'm also super excited to be here and tell you a little bit about what we've been learning from the conversations we're having with our customers at Monte Carlo.

So just a little bit more on myself. I'm currently head of product, and I would say the majority of my time is spent talking to our customers about the data quality issues they're having, how to improve the quality of data over time, how to reduce what we call data downtime, and the idea of increasing observability to help data and data platform teams.

A preface to that: before Monte Carlo, I was [00:02:00] at Segment, where I was leading a product called Protocols, all around data quality and how to improve the quality of data coming into Segment. So I spent many years talking about the same issues that I'm talking about now with our customers. I'm also the father of two boys; we just had a son two weeks ago.

[00:02:16] Root cause for a crying baby #

[00:02:16] Francisco Alberini: And so you'll see my presentation might be slightly tailored towards what's top of mind for me. And I moved to Brooklyn last year, so I'm a hipster in training. So yeah, the idea of root cause for a crying baby. You might be thinking, this is weird. Why is Francisco talking about crying babies?

But one, it’s half of mine for me, it’s something that I’ve been thinking a lot of. And I think there’s a lot of analogies into determining why baby’s crying around how to actually determine what’s going wrong with her. So normally if my baby’s crying there’s, like a laundry list of things that I go through and maybe it’s diaper is dirty, maybe he’s cold and maybe he needs food.

And then of course there's a long [00:03:00] tail of other things that could be going on. But I think the majority of the time that list is enough to figure it out and help the baby stop crying. And I think it's really similar in the data world, actually. The gist of what this presentation will be about is understanding that there is usually just a framework or a model that you can put in place.

[00:03:21] Good pipelines, bad data #

[00:03:21] Francisco Alberini: It doesn't have to be super complicated. It doesn't have to be extremely intense to be able to determine the root cause for your data. At Monte Carlo, we like to say good pipelines, bad data. I'm proposing that we change it to good parents, bad baby, and that sometimes, yeah, maybe we're good parents.

We have the best intentions, but we don't always figure out the issue. But anyways, let's jump into a little bit of context setting around data quality and how we think about things here at Monte Carlo. I'll probably start by saying the most obvious statement of 2021, and perhaps one of the more obvious things that we've heard a lot in terms of [00:04:00] themes here.

[00:04:02] Everyone relies on data #

[00:04:02] Francisco Alberini: Everyone in the company is relying on data more and more, from finance to marketing, to product, engineering, success, sales, etc. The real key differentiator among companies today is their ability to use data. And I think that's just a helpful framing, because when that data goes wrong, a lot of times I think we're left feeling like, I don't know what to do.

If you’re a data engineer, analytics engineer feeling like I’m not sure where to start, especially if you’re new to the team. This is a really key differentiator that we’ve seen a lot in that I talked to a lot of data engineers that are, have the entire data platform in their head. So you might ask them, okay, there’s an issue over here or this particular table wasn’t updated if they know immediately where to go.

But as data teams grow, we're thinking data engineer three through ten, or the twelfth analytics engineer on the team, they're not as familiar with these systems. So it's much harder for them to know and determine the root cause, especially if they're on call.[00:05:00]

So I think all of these systems are predicated on that idea of, what do I do? Where do I start? And then I'll note here that this whole concept, just from a vocabulary perspective, is what we call data downtime. I think it's helpful to name it and to say, okay, yeah, this is how we think about it.

[00:05:19] Data downtime #

[00:05:19] Francisco Alberini: The idea of data being partial, erroneous, or missing, and it's a real problem. Cool. Enough of talking about babies and setting the stage here; let's jump into the meat of the presentation and what we're going to go through. So I think one of the things that I've found to be really helpful is looking at the DevOps world and site reliability engineering as providing a lot of context on how to think about this. Fortunately for us, this is an industry that's been around for close to 20 years, thinking about how do we improve the quality of software and of the systems that we are using to [00:06:00] actually run that software.

[00:06:02] What can we learn from DevOps and SRE #

[00:06:02] Francisco Alberini: And there are all these different frameworks and models and great lessons that we can pull into the data world and borrow. Things like source control or version control: super common among the teams that we talk to in the data world, which are obviously using code, so it makes sense that we would want to borrow these concepts.

Really simple anomaly detection models: this is super common in the DevOps world and something that we spend a lot of time here at Monte Carlo thinking about. Then there's the idea of incident severity. This is something that I've seen less commonly used among data teams, but it's just defining that an incident has a severity level.

So if something is a SEV0, everyone in the company should probably drop what they're doing and figure out how to solve that issue. Whereas if something is a SEV4, we can say, okay, maybe we don't need to fix that now, or maybe we don't ever need to fix it. Just defining severity levels is actually a really helpful way of prioritizing when to fix something, or how urgently something should be fixed.
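As an illustration, a severity ladder like that could be written down directly in code; the level descriptions and response targets below are made-up examples for the sketch, not an industry standard or Monte Carlo's model:

```python
from dataclasses import dataclass

# Illustrative severity ladder; descriptions and response targets are
# assumptions for this sketch, not a standard.
@dataclass(frozen=True)
class Severity:
    level: int
    description: str
    response: str

SEVERITIES = {
    0: Severity(0, "Company-wide data outage", "drop everything, all hands"),
    1: Severity(1, "Key dashboard or model broken", "fix within the day"),
    2: Severity(2, "Non-critical table stale", "fix this sprint"),
    3: Severity(3, "Cosmetic or low-impact issue", "fix when convenient"),
    4: Severity(4, "Known quirk, no consumer impact", "may never fix"),
}

def triage(level: int) -> str:
    """Return the suggested response for an incident's severity level."""
    return SEVERITIES[level].response
```

Even a tiny table like this gives a team shared language for how urgently something should be fixed.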

Not something I’ve seen [00:07:00] commonly used, but I think it’s a really helpful for model to, to incorporate testing frameworks. CICT the ability to actually deploy code continuously and having the, kind of the safety and confidence to do that is super important. And then lastly, the measurement model.

So there’s, a lot of great models and frameworks out there for how do we measure the quality of our software over time? So thinking about SLIs, SLOs, SLAs, we’ll touch more on that. But these are all really amazing concepts and frameworks that we can borrow from in, in in in the data world.

Cool. And another thing I'll touch on here is somewhat of a complicated concept, but I'll try to explain it really well: the idea that in software engineering, most issues can be attributed to either the code or the operational environment. So either something in the code is broken, or something in the operational environment broke.

[00:08:00] And because of those two things, what we can say is that if there is an incident, let's say in the operational environment of one team, the really clear swim lanes that we've created in the software engineering world mean the blast radius of a particular incident doesn't expand too broadly beyond just that one team.

So if we have this incident, we can say, okay, the actual issue itself is contained, because we have these clear swim lanes. We have these APIs, we have all these modularized models for how code is developed and deployed. And that makes for kind of a nice containment zone for any of these issues.

However, in data, for all of the efforts that we have around data mesh and treating data as a product, it's really hard to maintain that same level of rigor in defining these different swim lanes and containment zones for issues. So if we look at this diagram here, whereas [00:09:00] before we had each team with their set of APIs and their code running in their operational environments, here we have three different data teams. Let's say they also have code, and they also have operational environments that they're accessing to run their models, or whatever it may be. They're all interacting, though, with different schemas within, usually, a central data warehouse. Yes, this is a full abstraction.

I realize that this is different for every team and a lot of teams don't follow this exact model, but from what we've seen in the data mesh world, the ideal case is that we have these really clear schemas for each team and clear areas of ownership. But what I've seen is that this often starts to break down, because one team can say, actually, I'm going to pull one table from this third schema, maybe read data from there and write to my schema, and now we've created a little bit of a rat's nest here that's difficult to contain. And then of course we add in the analytics team, which now, with things like dbt, is able to transform data too. So they're creating a little bit more complexity, more [00:10:00] abstraction, more new tables being generated. And then we have the BI layer.

So we have all these different data consumers, either through BI tools or maybe directly hitting the data warehouse, using data to do their job. So similar to the earlier example from the software world, we might have an incident in this one particular team's operational environment.

And we may think, oh yeah, of course, that team only interacts with schema four, so we know that only the BI users are going to be impacted by this. But because we have these references across schemas, that incident could be impacting every single team. Basically it becomes organization-wide. There's a really great blog post, which I'll be happy to share afterwards in the Slack channel, that explains this concept really clearly.

But I think it’s so important to understand that we’re not just dealing with code and operational environment Root Cause we’re also dealing with data and I understand. The complexity of data is increasing the complexity of this problem pretty significantly. And I [00:11:00] think that a blog post I mentioned is from locally optimistic by Ryan.

Again, I’ll share that in the notes after. So hopefully that sets the stage. I think one other example that I want to share, I was talking to one of our customers. They’re a big media company. And he was basically saying it’s like the, challenge that I deal with and it is crazy. Is that we have for example, a data science team, they’re now in your power to go and create their own data sets.

And yes, they might create those data sets in their own namespaced schema. But because it's in the data warehouse, someone else can very easily come along and reference that table, and now build their own sub-process off of that, with their own set of new tables and schemas. And the sprawl of these tables and the data itself becomes very hard to manage.

And his problem was, I can't control that, yet I'm responsible for it. When that table that the data scientist created for a specific use case breaks, and someone else is referencing [00:12:00] it, they come to me and say, hey, what's wrong with the data here? And I have no way to control that.

I didn’t create that table, that my team is not responsible for it. And I, equate this to asking a dev ops engineer to say, okay, not just the master branch. Now it needs to be tested and validated and correct. But every other branch that anyone has ever created. In the code repo also needs to be vetted and it’s that would be insane.

That’s very difficult, but we’re somewhat in a way asking data, platform teams to do that to basically vet every piece of data that has been ever into or out of the data warehouse. And that the complexity of that is really quickly. Cool. So again, hopefully that’s helpful context setting.

I find that concept just super fascinating, and I think a lot about how we help our customers navigate it. But yeah, let's actually talk about what we do about this. One option, of course: we could always just try harder. That never really [00:13:00] works, unfortunately. So let's jump in with an example.

So let’s say you get this wonderful Slack message. I’m sure many of you who have seen this, if you have like a, an analytics channel, it’s the, okay. I’m I have to run this report there’s some, seeing a huge number of a huge increase in volume for this particular table is super important.

I need this fixed right away. So there's a graph here; this is something that you would see in Monte Carlo. Okay, we have this big jump in volume. The way we like to think about root cause analysis is to start to break down the problem into these three categories. This is why I spent so much time describing these things.

What can we look at in the data, what can we look at in the code, and what can we look at in the operational environment to very quickly determine the root cause? So the first step on the data side, if we dig in, is to profile the data. So actually understanding: what about these new [00:14:00] anomalous rows, these extra anomalous rows, is unique? Is there a particular field that has a decrease in uniqueness? Are we seeing additional fields that have a lot of null values? These types of clues are really helpful for you to determine what the root cause is. Say we see that there's a huge decrease in the uniqueness of a particular field.

[00:14:16] Profile your data #

[00:14:16] Francisco Alberini: Oh yeah. It’s probably likely that duplicate values or duplicate records were added. That gives me a first Cleveland to determine the root cause. And I usually have a better sense of where to go, but say, if there is no distinctness issue, if there’s no Knowles, then okay, then let’s move on to the next thing.

But this is a great place to start. And then another thing here is actually going one step further beyond that and looking at the correlation between those extra rows and the field values themselves. So I think we've all seen this. I started my career in marketing, and I think this happened many times where we would run a particular campaign and then see this huge increase in volume from a particular source, let's say Facebook in this case.

So you might [00:15:00] see that all those extra records have come through, and we know that we ran a marketing campaign. So then we can say, actually, yeah, that was fully expected. It wasn't an anomaly. It was different from the past, but it was expected because of the actions that we took.

So this is a slightly more sophisticated analysis that you can run, but it's something you can do to better understand the anomalous records and see if there are any correlated values there. A lot of this can be done through SQL, just running very basic SQL queries; I've seen some customers do that.
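The kind of basic profiling described here, percent nulls, percent uniques, and a breakdown of which field values the extra rows concentrate in, can be sketched with plain SQL. This example uses Python's built-in sqlite3; the table, columns, and duplicate-row scenario are invented for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, source TEXT)")
# Normal rows plus a burst of duplicate rows from one source.
rows = [(i, "organic") for i in range(15)] + [(1, "facebook")] * 5
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Percent null and percent unique for a field: a drop in uniqueness
# hints that duplicate records were loaded.
pct_null, pct_unique = conn.execute(
    """SELECT 100.0 * SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) / COUNT(*),
              100.0 * COUNT(DISTINCT id) / COUNT(*)
       FROM events"""
).fetchone()

# Which field values are the extra rows correlated with?
by_source = conn.execute(
    "SELECT source, COUNT(*) FROM events GROUP BY source"
).fetchall()
```

Here the uniqueness of `id` drops to 75%, and the group-by shows the extra rows all share one `source` value, exactly the kind of clue described above.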

And in addition to that, Monte Carlo does a lot of this within our platform for you. And then the other piece here that I think is really helpful to look at is the idea of lineage: understanding how tables are referenced across your pipelines. This is a graphical representation of that. What we're seeing here is that this events table at the very end, on the right side, has an anomaly, or has a bunch of incidents associated with it. [00:16:00] And if we look upstream, we can also see that there were changes, there were issues, that were effectively causing this downstream issue. And I see this a lot with, for example, a schema change.

Let’s say I have a dbt model where I have a where clause filter that references another table someone comes along, removes the, reference to field basically deprecates that particular field or column. And now my dbt model fails or changes because of the fact that, where clauses no longer active, it’s no longer actually filtering the results set of records in the table and the downstream table.

So understanding and being able to look at your lineage, understanding whether there were changes upstream that could have caused this ripple effect downstream, is a really important part of the data root cause analysis. Okay, so just a recap of the data piece. We have profiling of anomalous records, so looking at percent nulls, percent uniques, and any changes there; then evaluating the field [00:17:00] values among anomalous records.

[00:17:01] Recap #

[00:17:01] Francisco Alberini: So doing a kind of deeper level of analysis to look at the correlation of those field values with the anomalous records. And lastly, looking at lineage itself: what happened upstream that may have caused this? Like I said, a lot of these are things that we support in Monte Carlo, but a lot of it can also be done in your own environment.

You can run the SQL queries, you can define which queries should be run as part of a run book or playbook, and then look at things like lineage even within dbt, Airflow, wherever. Cool. Hopefully that's helpful and clear. Let's move on to the code. So we're now systematically building our checklist, similar to what I do with the crying baby.

The next item on the checklist is to actually look at the code itself. So where were there particular changes? A lot of the time, the nice thing about data is that the code, in many cases, at least [00:18:00] specifically with dbt, abstracts into SQL, and SQL will usually provide a lot of clues around what happened.

[00:18:07] Look at your code #

[00:18:07] Francisco Alberini: Did something change? Did someone add a new filter, a new where clause, on a particular query that's populating this table? Or did someone change maybe the Python code in a particular Airflow job, and that caused this ripple effect? These are the table stakes for understanding where code may have caused the actual issue itself.

And then another piece here that I found really interesting, a kind of heuristic that we came upon just in our own internal dogfooding: being able to actually map the query character length and look at it on a graph is a really helpful and quick way to rule a query change in or out as the cause of a particular incident.

So let’s say that we we have this dbt user executing, a dbt job that runs every day. It’s consistent, then all of a sudden we [00:19:00] see in this example here with these little triangles represent that the three yellow dots here. Now we added maybe like 15 characters to a particular query dbt query that ran.

And that can actually be the root cause. We had, like I mentioned, a dogfooding example of this, where one of our engineers added basically a where clause, or a limit I believe it was, to a particular dbt query and accidentally pushed that into production.

And then all of a sudden we see the volume of this particular table go from 20 or 30 thousand, which was expected, to about 500. Immediately, of course, you're thinking: 500? Yeah, maybe that was the trigger to say, that's a weird number for the records to suddenly decrease to; it must be the limit.

But the fact that the query changed was actually the indicator, the reason the issue happened. So I think this is something that's really helpful, a really good heuristic [00:20:00] for determining if your code is actually the cause. Great. So from a code perspective, recapping: we're looking at the SQL code or the Python code, the recent changes in the code that generated the data itself.

And then also that trick of graphing the query character length to determine if changes actually did occur. For the most part, these queries run very frequently, so any small query changes are usually a good place to start in your root cause analysis. Okay. Cool.
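That character-length heuristic is easy to reproduce: record the length of each run's query text and flag runs whose length jumps relative to the previous run. The threshold below is an arbitrary assumption for the sketch:

```python
def flag_length_changes(query_lengths: list[int], threshold: int = 10) -> list[int]:
    """Return indices of runs whose query length differs from the
    previous run by more than `threshold` characters."""
    return [
        i for i in range(1, len(query_lengths))
        if abs(query_lengths[i] - query_lengths[i - 1]) > threshold
    ]

# A steady daily dbt job, then roughly 15 extra characters appear:
# a likely accidental edit, like a stray limit clause.
history = [842, 842, 842, 857, 857]
```

Running `flag_length_changes(history)` flags the fourth run, the one where the query text grew, as the place to start diffing the code.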

[00:20:35] Look at your operational environment #

[00:20:35] Francisco Alberini: And then lastly, we'll touch on the operational environment piece. Within your operational environment, the idea here is looking at errors: what errors could have occurred within the jobs that ran? Were there permission changes? We see this all the time: in a warehouse, the permissions on a particular table or schema are changed, causing issues for the dbt user, or [00:21:00] whichever user is being used to actually execute the queries. It could be network changes, port issues, scheduling changes. Let's say that someone goes in and changes the schedule of an upstream job, unknowingly creating a downstream ripple effect. The upstream job was expected to be completed at 10:00 AM.

Therefore my secondary job starts at 11:00 AM, giving me one hour of buffer; but someone changes the schedule, and all of a sudden we have this ripple effect of issues. And of course, performance: if there are performance issues, query execution times taking longer than expected, these are all things that can cause outages in your operational environment. And just going through the list is a really helpful way to figure out if that is the cause.

And just going through the list is a really, helpful way to to figure out if that is the. And then another piece here that I think is incredibly helpful and it’s a, it’s an interesting one. The idea of just documenting what actually happened within the incident itself. I think we see [00:22:00] this a lot in and retros within the DevOps world or a software engineer.

[00:22:06] Leverage your peers #

[00:22:06] Francisco Alberini: And I think the reality, at least in my experience as a product manager, is that these are usually pretty painful to do, because you're like, oh, all this work, I have to go and document minute by minute what happened and what the steps were. And in the moment, it's hard to see the value of that, because in the moment there's no value, right?

Like, the fact that I'm documenting this, I'm not getting value from that. It really only becomes valuable the next time that issue happens again, when you're saying, oh, of course, this happened two weeks ago, and I remember exactly what we did. Great, I don't have to reinvent that; we don't all have to go through all of the root cause pieces that we talked about earlier.

So I think this is just an incredibly important part to build into your workflows as a data team: to say, okay, how do we build some process around documenting what happened, why it happened, and what steps were taken to resolve it? And going back to my earlier point around the growth of data [00:23:00] teams, it's not just that first or second data engineer that has the entire data platform in their head. For all of the subsequent folks that join the team, this is an incredible amount of value, to be able to go and reference it and say, oh, look, I see this particular type of incident has happened before.

There have been 10 incidents in the last week; they all seem to be related to this one crumbling piece of infrastructure that we need to resolve. Great, that allows us to focus our efforts on resolving that particular thing. So documentation is just a really important piece of the puzzle.

[00:23:34] Set SLAs, SLOs, SLIs for data #

[00:23:34] Francisco Alberini: And then there's the beauty of the SLA, SLO, SLI model. It's something that we're really excited about; I think we're seeing a lot of customers start to adopt this. This is actually a screenshot of something that we've seen one of our customers build. The idea here is starting with the SLA, which is a service level agreement.

I wouldn't say this is the holy grail piece, and I don't think it's something that people should start with. I would usually [00:24:00] say start with the SLIs, which are basically the indicators. What is an indicator of a system being broken? What can you programmatically measure and start to build graphs and metrics upon?

Okay, is this particular part of the system broken? Once you've defined those indicators, move on to the SLO, or the objective. So now, internally as a team, you can say, I want to create an objective that this particular system has X amount of uptime. And then once you've internally set your objectives and been able to start hitting them,

then you can go to the rest of the organization and say, here's the SLA that we agreed to, which basically becomes a contract between you and all of the data consumers, saying that this data will be X, Y, and Z and meet these particular specifications in terms of uptime and reliability. But again, I'm not inventing anything new here. I'm just saying that there's a lot of value in applying this model to the data world and spending the [00:25:00] time to think through how it can be applied.

Cool. And then another piece here that I think is helpful: a very simple metric, thinking about data downtime. I touched on this earlier.

But one way to measure that is to actually just count the number of incidents that you've had, and then how long it took to detect that the issue occurred and how long it took to resolve that particular issue. We focused this presentation primarily on resolution, so the time it actually takes to resolve an issue.

But time to detection is a big piece, and I would say that's actually the area where you likely have the most opportunity to improve. Think of our original example, where we talked about a data consumer saying, this table looks off, I see this increase in volume in this particular table. That is usually weeks late, and it can happen weeks after a particular incident has started.

What we found is that if you can reduce that to hours, so you're actually able to detect that an issue happened within an hour, this particular data downtime goes from [00:26:00] days, weeks, months, it goes on and on, to just a handful of hours, which is a massive improvement.
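That downtime arithmetic can be written down directly as incidents times (time to detect plus time to resolve). The numbers below are illustrative, not from a real deployment:

```python
def data_downtime(incidents: int, hours_to_detect: float,
                  hours_to_resolve: float) -> float:
    """Total data downtime in hours: N * (TTD + TTR)."""
    return incidents * (hours_to_detect + hours_to_resolve)

# A consumer notices a stale table two weeks in, versus monitoring
# catching the issue within an hour (made-up numbers for illustration).
before = data_downtime(incidents=3, hours_to_detect=14 * 24, hours_to_resolve=8)
after = data_downtime(incidents=3, hours_to_detect=1, hours_to_resolve=8)
```

With the same three incidents and the same resolution time, cutting detection from two weeks to an hour drops total downtime from over a thousand hours to about a day, which is the shape of the improvement described here.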

And I think the beauty of being able to measure this is also internal: when you're trying to advocate for resources, to say we need to invest in these systems, you can say, my expectation is that our data downtime today is basically a month, let's say, which is massive.

If we can reduce that to a number of hours, I think any executive sponsor is going to say, yeah, of course, that sounds amazing, that's a huge improvement. So I think these are really helpful tricks and things that we've seen work well with our customers. So we'll recap here quickly. Data again: profiling anomalous records and evaluating field values.

So looking at correlations, and looking for upstream data issues; that's our data bucket. Then our code bucket: looking at the actual code, the code changes, and then some kind of heuristic checks there. And then lastly is the operational environment. So we like sequentially going through our checklist [00:27:00] here, looking for log errors, network and permission issues, and performance issues.

Then documenting our postmortems, and defining those clear metrics. I realize that's not strictly part of the RCA, but I think it's a helpful piece to include within the model. Putting it all together: this is what I've mentioned a couple of times throughout. We spend a lot of time at Monte Carlo thinking about how to do exactly what we just described, but from a software perspective.

So being able to understand lineage, being able to create these postmortems and have them be somewhat auto-documented: what actually happened in the incident, when did it start, what other tables downstream were impacted by it. All of this documentation comes together in Monte Carlo. And to recap, here are the key takeaways.

[00:27:49] Key takeaways #

[00:27:49] Francisco Alberini: Modern data pipelines are messy, and that's good. I think this is actually one of the things that makes data so interesting: we're empowering everyone to use data, to actually manipulate it, to change it, to [00:28:00] cleanse it. But it means that we have to be really thoughtful about how we ensure that the quality stays high.

This is a really big problem, perhaps one of the biggest problems, at least the one I'm most excited about, in the data world. And RCA gives us a lot of tools; it's really just about defining a clear checklist. It doesn't have to be super complicated or super fancy.

For the most part, if you have four or five items in your checklist, you're probably going to figure out the majority of your issues and be able to determine what happened. And breaking it down by data, code, and operational environment is also really helpful. It creates a clear, concise model in your head and also helps you communicate across teams as well.

So it's like, oh yeah, this was a data issue. Great, that becomes the shared language. And of course, measurement: it's always helpful to measure. It helps you validate, and it helps you advocate for more resources. That is all. I really hope that this talk was helpful. It's just been an incredible experience.

Being able to watch a lot of these talks at [00:29:00] Coalesce, and I am so excited to be a part of it. I'm looking forward to the Slack chats, and I'm happy to share anything else that we've learned from our conversations. Thanks again, and have a great rest of the week.
