Culture & teams

Data futures

Frameworks & best practices

Tool deep dives

Workshops

Coalesce 2021 Frameworks & best practices How dbt Enables Systems Engineering in Analytics

How dbt Enables Systems Engineering in Analytics

Jorge is an engineer and analyst, with specializations in data analytics, data visualization, databases, and statistical modeling. He is passionate about understanding open-ended problems, and creating implementable solutions for their component parts. He has exercised data-driven analytics projects within the energy, shipping, airline, bioinformatics, commercial banking, and sustainability fields.

How I was able to adopt dbt coming from an industrial engineering background, and how it can be leveraged by end users with a quality control mindset to solve or diagnose common business problems.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Kyle Coapman: [00:00:00] Hey there dbt community. Thank you again for joining us for Coalesce. My name is Kyle and I’m the head of training over at dbt Labs. And I’m really excited to host this session. We’re going to talk a lot about testing. And so the title of the session is How dbt Enables Systems Engineering in Analytics.

And I’ll be joined by Jorge Cruz Serralles and all chat conversations is taking place in a Coalesce dbt systems engineering and dbt Slack. If you’re not in dbt Slack, be sure to go to community.getdbt.com and then look for the channel Coalesce dbt systems engineering. And when you enter that space, we encourage you to connect with other attendees, ask questions, make comments, of course, share memes, all the things.

And after the session we’ll be available to answer your questions in the Slack channel. So feel free to throw them up in our chat and we’ll pin them so that we can raise those towards the end and in the live session right now, if we have time. And so let’s get started. Jorge over to you.[00:01:00]

Jorge Cruz: All right. Hi everyone. I am Jorge and thank you for coming to my talk on how dbt enables systems engineering. Before I start, I would like to thank Erica Louie of dbt Labs for her mentorship and guidance in this process. I’d also like to thank Chris for flavoring the sub and the rest of the data team at Updater for everything it taught me about how to think about data.

Now about me. I was born and raised in San Juan, Puerto Rico. After that, I went to Georgia Tech for my bachelor’s degree and studied industrial and systems engineering. While in college I entered into in data roles for Delta.

A lot of analogies that I use in this talk are inspired by my experiences at Delta. Currently I’m a data analyst for the Updater data team. You might ask, what’s Updater? [00:02:00] Updater’s a company that offers different B2C and B2B solutionsthat are moving really tossing less? In other words, we’re in the moving space.

Currently we have a double digit market share of us moves every year on our platform. We have more than 8,000 properties, about 2 million housing units and over a hundred thousand brokerage agents. To keep up with this growth, and so right now, we’re currently at about 200 employees after having doubled in 2021.

And we expect to double again to 400 in 2022. My team, the data team has gone from six to 15 employees this year. Speaking very abstractly, the data team’s recent of being is improved business outcomes by using data. We do it through data democratization, which entails arming everyone at the company with the education, the guidance and the tools that they need to [00:03:00] extract insights from this data.

We also do those through driving company strategy, which for example, could be done by developing data products or by analyzing user behavior in order to draft product or conversions. Most importantly is building and maintaining the data platform, which consolidates all of our business data and insights to facilitate data analysis and exchange.

[00:03:26] The Data Platform #

Jorge Cruz: Right here. You have a diagram of what our data platform looks like on the left. You have a series of data in quotes which are captured using Fivetran and Kinesis, which are then transformed, loaded into Snowflake using dbt in order to be used in different data products. And so coming from a systems engineering background myself, I saw this diagram and I was like, wow, this is a lot like a piece of physical infrastructure.

A lot, like an airplane. And so speaking of airplanes, let me [00:04:00] ask you a question. Would you you want to fly a plane that hasn’t been thoroughly tested?

I think the answer is no, you would definitely not because it could simply fall out of the sky. Likewise people won’t want this data product that hasn’t been tested thoroughly, either it’s like with flying it’s all or nothing. And so when people run into issues of the data product, it can have negative consequences.

It can lower people’s trust in your data platform. It can lead to lower usage of it. People may begin to reverts off and people may begin to revert to obsolete. And so how do we stop this from happening? How do we prevent our plane from falling out of the sky? The [00:05:00] solution is systems engineering, which is a general framework.

I will explain it and discussing it. Specifically, we’ll go over what systems engineering is a general framework for. How it’s imperative in building and maintaining a large project. How to think about testing, how to approach testing at scale and how to reduce the human impact outside monitoring.

[00:05:34] What is Systems Engineering #

Jorge Cruz: What is systems engineering? It’s basically an engineering field, not more of a mindset really. Let’s focus on deciding, integrating, managing, and iterating over complex system. This right here is a V diagram, common to systems engineering on the left. You have your design elements, what you want your device to do, what external requirements [00:06:00] are, how its different components work together and how they’re linked on the right.

You have the subset map on the things like say validating that your airplanes wheels are correctly. That the wiring is working correctly and that in the end, your airplane and the people in it can get safely from A to B. Does this sound familiar? This is what the waterfall model in software development is based off of when you’re building a dbt project. You were using the same thought process that you would when you’re building a physical system.

[00:06:40] The Paper Airplane #

Jorge Cruz: As I just said, building a dbt project is like building an airplane. At first, your environment starts out like a paper airplane. It is manageable, lean, and easy to follow. It’s the aerospace equivalent of a simple dbt project [00:07:00] right here. We have a simple dbt project. This right here is the data engineering equivalent of a paper airplane.

It is the minimum viable product that runs well and is easy to build, but it’s limited in functionality. Systems thinking may not seem as important when your data environment looks like this. You can get away with not having many tests, because there are a few repercussions. If things. And it’s easy to fix.

However, over time, things can get in the way between you and your dbt project as tech that accumulates your business grows and priorities change. The innards of your airplane can begin to look like this as well, your data [00:08:00] environment. As the saying goes, complex systems build complex systems. And as their systems become increasingly complex, our errors and bugs, likewise become more complex.

And so how do we go from paper airplane to jumbo jet. To go from paper airplane to jumbo jet, we must continuously test and validate that our components and sub-components are performing as intended. dbt is what enables us to do that.

[00:08:35] Broken Components, Broken Trust #

Jorge Cruz: At scale, to give more context, we would sometimes find ourselves catching different miscellaneous errors, such as broken integrations or seal credentials or broken workflows, impacting user experience.

It would often take us a couple of days, then manually catch it. But then the damage had been done. Whenever this happened, we would, [00:09:00] of course, try to find a way to prevent these issues from happening. Again, that being said, one of our focuses has been to make catchings areas faster and easier

dbt isn’t able to do it by making it easier to test a component. dbt macros make it very easy to create generic repeatable units so that you can use over and over at Updater. We had a problem of missing or misfiring events. And so what we did is we built a macro that takes a table name, a column name, a date offset, and a count threshold, which you can see on the left.

And so for example, on the right, all three of these tests are set to air if less than 2000 records are generated. Within our repository, we use this single generic test more than 30 times across tables that are updated [00:10:00] on daily basis. Now that we’ve built these tasks, how do we scale them?

Before we implemented dbt at Updater, we had no testing. We simply an Airflow executed a very large SQL files sometimes, 900 to 1000 bots. And in our previous state, as we migrated to dbt what we did is we basically split up these very large SQL state statements into separate bottles.

And we had one DAG per function. So for example, a TV-related DAG or an insurance related DAG or a move-related DAG, etc. And this previously is when we first introduced testing. In our current state, we basically shifted from having a Docker function to a Docker reschedule. And so let me tell [00:11:00] you, how you can use that game of tag to scale your testing. So we basically built tooling that made it possible for our job or trader Airflow to automatically schedule and execute a new dbt model and its associated tests and advise you tribal tasks by simply tagging a model with the appropriate schedule.

This allows us to schedule tests, model runs and other tasks without ever having to touch Airflow code directly because things are automatically scheduled. They’re more reliable. And so in the screenshot, you can see some of our different data models and the different schedules of their tags. Here, we can see the standard schedule, which is just a regular daily schedule. An hourly schedule, which is not really scheduled, and late schedule, which is basically the same thing as the standard schedule, but it just runs on a 3:00 AM BTC. [00:12:00] And so this is what one of our new schedule bags looks like on Airflow.

We actually have custom operators that enable Airflow to interact with other applications more easily here. You can see the two custom operators that we have defined, dbt run operator and dbt test operator.

[00:12:26] Smart Alerting #

Jorge Cruz: Once you have enough of these tests in place, managing them can begin to get quite difficult at Updater. We made a Slack bot running on a Flaska, but allows us to basically interact directly with our field from within Slack, allowing us to see when the test ran, what it’s compiled SQL was, and even allowing us to snooze the tasks, if need be. This has made it so much easier for [00:13:00] us to run and validate our field to us personally, I think snoozing is a super useful feature and lot it lets you deprioritize noisy tasks.

And so it’s a really low effort way of being able to ignore something for, or being able to ignore an alert for a couple of days without having to edit go directly. We’ve also leveraged the same Slack bot to create alerts for freshness tasks. And so from within a repository, now that we’ve implemented it, it’s as easy as adding these warrant after, or air after blocks and your sources file. And so once you have that set or once we have that set, we’re able to re receive Slack notifications that alert us of when a data source has gone stale. And so [00:14:00] this makes it much easier to know when our data sources have gone stale or broken in some for m. And so something that I think is worth noting as that, my team uses emojis on these alerts. We use the eyes emoji to denote that someone is looking into the alert. A green check mark, like the one that you can see in the screenshot basically means that someone has looked into the alert and wrap up their investigation.

The main takeaway, that I would love for all of you to leave this presentation with is to think of your dbt project as a jumbo jet. If anything fails, it could negatively impact a lot of people. Thank you. Thank you so much for listening. Here’s my contact info and I’m now saving the rest of the time here for any questions.[00:15:00]

Kyle Coapman: Thank you. Awesome. Thank you, Jorge. I really appreciated the showing off actual code that you’ll use in production, particularly that custom generic tests. It’s really cool to see that. We’ve got a few questions. I’ve queued up. I’ll drop them in our private chat too. .

[00:15:12] Q&A #

Kyle Coapman: From a business perspective, how do you justify, especially with execs or business leads the costs via in dev hours, new hires or SAS of developing such a test system? It seems like you could spend a lot of time on testing. How do you justify that? And more often than not, especially at smaller companies, this type of extremely necessary endeavor we put on the back burner or sent to the backlog where data tasks go to die. What do you think?

Jorge Cruz: In terms of justifying all the testing that you’ve implemented? You have to get buy-in for that? No, not really. The way I think about it is it’s if you work in oil company or you work in an airline, it’s something that you absolutely need to do. If you’re developing a product that, tens or hundreds of people, or even thousands of people are touching every day, it’s an absolute necessity to be able to conclusively say [00:16:00] that all of your data is correct that all of your data’s passing tests and that everything that you’re showing is a source of truth.

Kyle Coapman: Excellent. How do you on a, implementation level, how do you make sure your team is doing that? Is there like a check points using a poll review process, a pull request process. How do you make sure you do have testing moving forward? Are you ready, Jorge?

Jorge Cruz: Sorry about that, thank you. Yeah, so my team yeah, it’s basically started pull request process. What we’ve done in order to clamp down and tighten on a lot of these tests is we’ve implemented a lot of different circles, CI tests in order to basically force the developer to include some of those.

Kyle Coapman: Oh, excellent. So you’re testing on pull requests also before you merged that code in. Beautiful. I had a similar, so you mentioned the emoji reactions. We use those internally here at dbt Labs. Eyes, I’m looking at this, check. This has been addressed. I’m [00:17:00] not sure how large your team is, but how do you, once you get that notification, what is your triage process look like? You’re monitoring. How do you make sure that gets covered at the right time?

Jorge Cruz: Yeah. So over at Updater, we have a team of roughly 12 to 15 people now. And so we have people spread out over multiple different time zones. We have people in the East Coast and Central, Mountain, and Pacific.

The way that we’ve usually split it out so far is usually whichever your person is the most closely tied to that model or to that data source that, aired or notified, during that time if they’re online, then they touch it or basically check on it. Otherwise if it’s urgent, somebody else will check up on it. If it’s not a super urgent alert, we might wait until the next morning for someone who’s more familiar with it to check upon.

Kyle Coapman: Excellent. Great. That makes sense. Go to the person who wrote the code for it. You a hinted at going back to your presentation, you hinted that migrating a bunch of SQL files from your [00:18:00] previous state pre dbt.

Could you tell us more about that process? How did you tackle that? Because I imagine a lot of people are moving from a previous system into dbt.

Jorge Cruz: Yeah. I came in right after most of that work was done. So I’m not the most, tied to it or experience with it. But yeah, it’s all about being able to take these bigger models and breaking them down into more usable units of code. For example, one heuristics that we stick to is we try to make all of our models at most, maybe a hundred, 120 lines of code. And just sorry, I think that’s a pretty good baseline. If you have a model and it’s 600 lines of code, it should probably be split up into three or four.

Kyle Coapman: Yeah, totally. And then I’m curious to use a femoral models at all or different materializations on that front. We’re getting a little way from testing.

A new, fresh question in from [00:19:00] Jacob. Is dbt the best tool for testing data. What is done when a few rows of data fail validation? Does the whole pipeline stop?

Jorge Cruz: Yeah, I do think dbt is best tool out there for testing data. Yep. So for example, let’s say you’re running a data pipeline with five models on stacked on top of each other.

If most upstream models fails or, okay. Sorry. If that most model are running and I guess the test attached with fails because he says or attached to the model and just run right after the model. As soon as that first model feels that, or as soon as that test associated with that first model fails, that’s something’s wrong in your data pipeline.

And it doesn’t stop, if it is something like a Slack alerting system, a chat attached to this, then as soon as that the first test that feels associated with that model fails, you’ll be notified immediately. And so the [00:20:00] pipeline will continue running, but you’ll know as soon as that first model and that first sees the tests associated with it are done running that something’s wrong.

Kyle Coapman: Yeah, definitely. Somebody pointed out for folks that might be due to dbt is slightly different from unit testing, like it’s build my data sets or my models. And then you test on top of it. So if you’re coming from a software engineering perspective, just a little bit different. A nother question that came to mind for you.

You’re testing sources with source freshness, you’re testing models. Are you doing anything to test it? Like the BI layer downstream from dbt.

Yeah.

Jorge Cruz: So it’s interesting that you asked that. We have been looking into doing that. We haven’t implemented anything correctly on the dbt layer, but so our PI layer is Looker. And so we’ve begun implementing a lot of different cleanups. And change it as to how we work with Looker in order to tighten down on that.

We are looking into [00:21:00] using Spectacles, I believe to clamp down tighter on data quality.

Kyle Coapman: Awesome. Yeah. I know Dylan Baker. I’ve taught a few courses with him. Cool stuff happened over there at Spectacles. Anything else you’d like to just tell folks about how to get started with testing? Like a final why is it so important? Close out with some wisdom from you. Okay.

Jorge Cruz: Yeah, I guess my really my biggest piece of advice, if you’re beginner is documentation on dbt Labs website is incredible. Very easy to follow, very readable. That’s where I got started and I felt like it was great. And also a lot of last year’s Coalesce talks are super interesting. Very good. A lot of people either demo it to presenting, new, interesting ways of working dbt or using dbt or yeah, I really just recommend doing a deep dive of everything that’s out there and getting your feet wet.

Kyle Coapman: Totally. I think you’re speaking to [00:22:00] like the power of community, right? So many folks like yourself are willing to share what they’ve learned. And you learn so much in your specific context. Sorry to cut you off.

Jorge Cruz: No, sorry. Last thing I was going to say is join the Slack. Slack is great and you can search on it into if you can think of a question, somebody has probably already asked it. It’s all a matter of being able to, tailor your query in the Slack chat to find what you need.

Last modified on:

Previous: Optimizing query run time with materialization schedules

Next: Operationalizing Column-Name Contracts with dbtplyr

Table of Contents