So You Think You Can DAG: Supporting data scientists with dbt packages

Emma is a Data Scientist at Civis Analytics.

Originally presented on 2021-12-07

A dbt package is not only a way to make a particular process easier or more robust or more efficient – it can also be a way to productionize/share data engineering expertise with non-data engineers.

Follow along in the slides here.

Browse this talk’s Slack archives

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript

Amada Echeverria: [00:00:00] Welcome everyone, and thank you for joining us at Coalesce 2021. My name is Amada Echeverria. I use she/her pronouns, and I’m a developer relations advocate on the community team at dbt Labs. I’m thrilled to be hosting today’s session, So You Think You Can DAG: Supporting data scientists with dbt packages, presented by Emma Peterson.

Emma began her data science career several years ago at Civis Analytics on the applied data science team, working on client projects, and has since moved to Civis’s survey product team, where she works on survey methodology and infrastructure. Beyond data, Emma is a former public relations specialist with degrees in political science, as well as computational analysis and public policy.

How did Emma get started with dbt? She inherited a pipeline that somehow used SQL, Python, and R, and she knew there had to be a better way. And perhaps the most interesting part about her: she has become somewhat of a beauty influencer on Instagram, sharing photos [00:01:00] of her eye makeup for her own page and for the occasional brand. Over the course of the next 30 minutes, Emma will convince even the fiercest of data scientists to apply the software engineering best practices of the dbt ethos and start to think a little bit more like a dbt user.

Before we jump into things, some recommendations for making the best out of this session. All chat conversation is taking place in the #coalesce-dbt-packages channel on dbt Slack. If you’re not part of the dbt Community Slack, you have time to join now. Seriously, go do it. Visit getdbt.com/community and search for #coalesce-dbt-packages.

When you arrive, we encourage you to set up Slack and your browser side by side. I think you’ll have a great experience if you’re asking other attendees questions, making comments, sharing memes, and reacting in the channel at any point during the session. To kick us off, our chat champion, Leslie Green, solutions architect at dbt Labs, [00:02:00] has started a thread to have you introduce yourself, let us know where you’re calling in from, and tell us whether you think you most identify as a data scientist, data engineer, or data.

After the session, Emma will be available in the Slack channel to answer questions. Let’s get started. Over to you, Emma.

Emma Peterson: Cool. Thanks so much. Hello, everyone. Welcome to So You Think You Can DAG: Supporting data scientists with dbt packages. My name is Emma Peterson and I’m a data scientist at Civis Analytics, working on our survey tools and methodology.

A bit more about me before I jump in: my journey up to this point has been somewhat meandering. Like Amada said, I studied political science in undergrad and then worked in public relations for a hospital for a couple of years before realizing that data science seemed interesting.

So I went back to school and then joined Civis, first as an applied data scientist working [00:03:00] directly with clients, and later moved into my current role on our survey product team. And in that time I’ve found that I’ve really gravitated more toward kind of the engineering side of data science. And like Amada also mentioned, I had inherited a pipeline that used SQL, R, and Python, and I definitely don’t want to take full credit for refactoring that.

I actually worked with a data engineer to transition that pipeline to dbt, and that experience has really informed a lot of what I’ll be talking about. Final note here before I get going: I wanted to give a shout-out to my colleague, Evelyn. She helped me come up with this wonderful title, and she’s also speaking later today about inclusive design in dbt. That’s at 3:15 Central time, so I would definitely recommend checking that out.

So just to set the stage here, I should clarify that despite the title of the talk, I’m not going to be [00:04:00] focusing specifically on DAGs. So You Think You Can DAG was more of a play on the idea that data scientists are sometimes called upon to do work that is data-science-adjacent, like building a data pipeline.

And there’s a really wide range in the type of work that data scientists do, and obviously a lot of overlap between data science and data engineering and even software engineering. But for many of us, building a data pipeline might mean stepping into a role that’s new and maybe a little uncomfortable.

So I’ll go into a bit more detail about that problem and discuss one particular solution to it which involves taking advantage of dbt packages. Then I’ll show an example of that solution and how it’s working for us. And I’ll wrap up with a few tips for implementation.

[00:04:49] Data scientists often have to do some data engineering before they can do data science

Emma Peterson: So again, the problem that I’ll be focusing on is that data scientists often have to do some data engineering before they can do data science.

And particularly [00:05:00] for data scientists who are operating as consultants, either internally or externally, this can sometimes mean that a data pipeline is a bit of an afterthought. It’s not the deliverable that your stakeholder is expecting, but rather is just something that you have to get done before you can start on the real data science that they are expecting, maybe a model or some kind of.

And so because of that sort of incentive structure, a data pipeline that’s built by a data scientist in that situation is maybe more likely to be built with data-science-friendly tools, for example, relying heavily on R and Python. It may be less robust or lacking in validation, and it may be less efficient.

And as a reminder, I’m not trying to bully data scientists. Again, I am a data scientist, just highlighting some gaps in experience that may exist for some of us. But it can be really tricky to get data engineering tools and best [00:06:00] practices into the hands of data scientists without creating a significant burden for them.

On that note, I wanted to share a quick anecdote. My dad works in motion graphics, and he was telling me about a time when a coworker of his was explaining to another coworker how they could automate a particular task in the software that they were using. But the second coworker wasn’t familiar with the approach.

He decided to stick with what he knew and forgo the more efficient option, and kind of infamously said, “I’ll just do it the hard way. It’ll be easier.” And that really resonated with me as someone who works with data, because I find that often the more efficient approach is actually significantly less efficient at first.

But I think there are some ways that we can remove barriers to taking the more efficient approach. At this point, some of you may be wondering whether we should [00:07:00] hand this type of data pipelining work over to a data engineer or analytics engineer, someone who has the right expertise for that type of task.

And my answer to that is maybe, slash sometimes. But you may not have people in these roles at your organization or on your team, or you may not have enough people in these roles. And perhaps more importantly, I think data scientists can bring really important context about what the data is and how it’s going to be used, and that is super valuable when designing a pipeline. So it’s important to find ways to bring data scientists into this process, because they are often the primary users of the data.

[00:07:42] One Solution

Emma Peterson: One way to address this is by packaging up an opinionated view on how a particular category of pipeline should work. And we can use dbt packages to do that.

And with this approach, a dbt package is not just a way to make a particular [00:08:00] process easier or more robust or more efficient, but it also can be a way to share data engineering expertise with non-data engineers by building that expertise into the package. In particular, it is possible to build a dbt package for non-dbt users, and in doing so, you’re able to say, “You can configure this with YAML and SQL,” which is a lot more palatable for many data scientists than saying, “You’re going to have to learn dbt if you want to contribute to this world.”

To make this a bit more concrete, I’m going to talk about a particular dbt package that I built for processing longitudinal survey data as to. And while this is not a talk about longitudinal surveys specifically, I did want to take a quick detour here to cover what longitudinal surveys are and why the data can be tricky to process, to set the stage for why I [00:09:00] wanted to build a dbt package.

So a longitudinal survey is one that is fielded on some regular cadence over time, for example, weekly or monthly. Depending on the client at Civis, we may be tracking things like political opinions, brand sentiment, likelihood to get vaccinated, etc. Because you’re tracking that information over time, a lot of the content of these surveys is consistent, but other pieces of content may be added, removed, or changed. And it’s that sort of constantly shifting nature of the content that can make the data surprisingly difficult to process. There are some specific needs that a pipeline for a longitudinal survey has to address.

Some are pretty common and others are pretty specific to longitudinal surveys. I’m not going to get too into those details, but in general, the needs are centered around pipeline efficiency over time, flexibility as client needs [00:10:00] shift, preparing data for both analysis and reporting, and preventing data discrepancies.

And each of these is something that a data scientist with no data engineering experience might struggle to implement or to fully consider, or they may not have the time to gather an understanding of the best practices. But at Civis, our survey users are data scientists, so we wanted to find a way to make it easier for them to manage their own survey data.

There are two key problems related to processing longitudinal survey data. And again, this is not really a talk about longitudinal surveys, but I wanted to give a couple of examples here to help make this a little bit less ambiguous. The first problem is related to question labeling. So here we see a question that was fielded three times in three slightly different ways, but the underlying meaning of the question didn’t change. This can happen because of human error. [00:11:00]

And if anyone here has set up a survey in Qualtrics, you know how easy it is to make a mistake there. Or this may be because of requests from a client. But either way, in this example, we want to treat these three versions as the same question for the sake of trending analysis, and we also want a record of the fact that they were technically different.

And if there isn’t some mechanism for handling this built into the pipeline, it can become extremely time-consuming, error-prone, and just difficult to track and manage over time.
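The question-labeling problem can be sketched in a few lines of Python. This is only an illustration of the idea, not how the package does it: the question wordings, the `vaccine_likelihood` label, and both function names are invented for the example. Each wording maps to one canonical label for trending, while the original text is kept on every row as a record that the versions differed.

```python
from collections import defaultdict

# Hypothetical lookup: each wording seen in the field -> one canonical label.
QUESTION_LABELS = {
    "How likely are you to get vaccinated?": "vaccine_likelihood",
    "How likely are you to get a vaccine?": "vaccine_likelihood",
    "How likely are you to get vaccinated": "vaccine_likelihood",
}


def label_responses(rows):
    """Attach a canonical label to each (wave, question_text, answer) row.

    The original question text is kept so the variants stay on record.
    """
    labeled = []
    for wave, question_text, answer in rows:
        labeled.append({
            "wave": wave,
            "question_label": QUESTION_LABELS[question_text],
            "question_text": question_text,
            "answer": answer,
        })
    return labeled


def variants_by_label(labeled):
    """Group the distinct wordings recorded under each canonical label."""
    variants = defaultdict(set)
    for row in labeled:
        variants[row["question_label"]].add(row["question_text"])
    return dict(variants)


rows = [
    (1, "How likely are you to get vaccinated?", "Very likely"),
    (2, "How likely are you to get a vaccine?", "Somewhat likely"),
    (3, "How likely are you to get vaccinated", "Very likely"),
]
labeled = label_responses(rows)
```

All three waves now share one `question_label`, and `variants_by_label(labeled)` still lists the three distinct wordings.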

The second issue with longitudinal survey data is related to derived fields. A derived field, as opposed to a raw survey field, is one that we generate by transforming the survey data in some way. So here we have a derived field called age bucket, which is just taking the respondent’s age and bucketing it into a few groups.

So these [00:12:00] transformations are usually pretty straightforward, like this one. But there can be hundreds of these over time, and some are more complicated than others. For example, you may need to join in an additional table in order to perform some of the transformations. Additionally, when the logic behind the derived field changes, your previously collected data also needs to change to reflect that.
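A derived field like age bucket might look roughly like the following, shown here as SQL run through Python's built-in `sqlite3` module so that it is self-contained. The bucket boundaries, table name, and column names are assumptions; the talk does not give the actual cut points.

```python
import sqlite3

# Hypothetical bucket boundaries; the real cut points are not given in the talk.
AGE_BUCKET_SQL = """
    CASE
        WHEN age < 35 THEN '18-34'
        WHEN age < 65 THEN '35-64'
        ELSE '65+'
    END
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (respondent_id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO responses VALUES (?, ?)", [(1, 22), (2, 47), (3, 70)])

# The field is computed in the query rather than stored by hand, so a change
# to the bucket logic applies to previously collected rows on the next run.
query = (
    f"SELECT respondent_id, {AGE_BUCKET_SQL} AS age_bucket "
    "FROM responses ORDER BY respondent_id"
)
buckets = conn.execute(query).fetchall()
```

Keeping the expression in one place is what makes it practical to change the logic later and have older waves pick up the new definition.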

[00:12:24] Example

Emma Peterson: And all of that needs to be easy, efficient, and version controlled. So getting back to the package itself, my goal was to solve those problems once and make that solution accessible for future projects in the form of a dbt package, with data scientists in mind as the primary users of that package.

I’ve called this package longitudinal dbt, which I’m not in love with. If anyone here can think of something that’s maybe a little bit more fun while also being descriptive, I would definitely love to hear some ideas in the Slack channel. With this [00:13:00] package, our data science users can easily set up a pipeline that looks like this, with all of the structure, permissioning, and validation built in and ready to go.

The prototype of this pipeline was the result of a collaboration between me and a data engineer working on one particular pipeline for one particular client. Each of us brought different but useful context to the process. And then I took that and generalized it into something that would be useful across projects.

So with this package setup, we keep decisions about things like pipeline structure and data validation in the hands of data engineers or engineering-focused data scientists, while the project-specific code stays in the hands of the data scientists who know the data and the business context best. They are then enabled to take ownership over the pipeline and maintain it and build upon it over time. [00:14:00]

I also wanted to show what this looks like in the standard dbt graph format that many of you may be familiar with. You can see here that it’s not an enormously complex pipeline in terms of having a bunch of different data sources, but there’s enough complexity here that it’s been useful to define this structure once within the package, instead of putting data scientists in the position of having to come up with a structure or replicate it on their own for each project.

Some of the benefits of using this package include faster run times. Our pilot project ran about 12 times faster after transitioning to this. It also means less duplication of code. And I don’t know about anyone else here, but I personally love deleting code. So it was very cool to see one project go from 2300 lines of pipeline code written and maintained by data scientists to just 40 lines [00:15:00] and the rest of all of that logic and data modeling and everything is just built into the package.

The package also makes it easier and faster to spin up a pipeline for a new longitudinal survey project, and we’ll see that process in a moment. It also makes it easier to collaborate when each longitudinal survey pipeline is basically the same.

In terms of what this looks like in practice, users run two scripts in Civis Platform to get set up.

Civis Platform is our data science software. The pipeline that you saw a few slides ago can be up and running with no new code, and then project-specific customization can be done with just YAML and SQL. We’ll get into that in a moment.

The first script that users run generates the three YAML files that are required to run their pipelines.

So here users specify the name of their project, a list of users and groups that should [00:16:00] have access to the resulting tables, and the location of some project-specific survey metadata. And when they run this, it’ll give them these three files with all of the required information already filled in and formatted correctly, ready for them to add to whichever repository makes the most sense for their project.
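A file-generating script of this kind could be sketched as follows. Everything specific here is an assumption: the talk does not name the three files or show their contents, so this sketch renders only two standard dbt files (`dbt_project.yml` and `packages.yml`), and the variable names, the placeholder git URL, and the function `render_project_files` are all hypothetical.

```python
def render_project_files(project_name, grantees, metadata_table):
    """Return {filename: YAML text} for a new survey project.

    Hypothetical layout: the real script's files and keys are not shown in the talk.
    """
    grantee_lines = "\n".join(f"    - {grantee}" for grantee in grantees)
    dbt_project = (
        f"name: {project_name}\n"
        "version: '1.0.0'\n"
        "config-version: 2\n"
        f"profile: {project_name}\n"
        "vars:\n"
        f"  survey_metadata_table: {metadata_table}\n"
        "  grantees:\n"
        f"{grantee_lines}\n"
    )
    packages = (
        "packages:\n"
        "  - git: https://example.com/longitudinal_dbt.git  # placeholder URL\n"
        "    revision: main\n"
    )
    return {"dbt_project.yml": dbt_project, "packages.yml": packages}


files = render_project_files(
    project_name="brand_tracker",
    grantees=["analyst_group", "client_reader"],
    metadata_table="surveys.brand_tracker_metadata",
)
```

The point of generating the files is that users supply three inputs and never have to hand-format YAML or remember which keys are required.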

Next, users run a second script, and this one does two things. It creates a script that they can use to run their pipeline, so they’ll use that every time they field their survey and need to process new data. And it also creates a service in Platform that hosts the documentation that’s auto-generated by dbt.

So they’re able to take advantage of some really comprehensive documentation for their pipeline without having had to write it or figure out how to access it.

So I wanted to make sure that users could get a pipeline and its [00:17:00] documentation up and running really quickly and easily, in order to encourage people to actually use this package. The setup process I described on the previous few slides can be done in about five minutes. But as anyone who works with clients will know, every client has some slightly different needs, and many projects are going to require some level of customization.

In this case, that might mean, for example, joining in additional tables, adding project-specific tests, and so on. And because I knew some level of customization would be necessary, I wanted to find a way to enable that within the structure of the package, so that users wouldn’t end up doing ad hoc transformations outside of dbt after their pipeline has run.

So to address that, the structure of the pipeline is designed to allow certain specific types of customization while also maintaining some level of standardization across projects. And then the structure of the package [00:18:00] code itself is set up in such a way that users can implement those customizations without really learning dbt.

To do that, I really took advantage of the dbt project YAML file, in particular the variables. This is where users can define their derived fields and specify any additional tables they need to join in or project-specific tests they want to run. And another nice thing about having this in a package is that it allows us to share derived field definitions across projects, so that the SQL for something like age bucket, for example, doesn’t need to be written over and over.
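To illustrate how configuration-driven derived fields can work, here is a minimal Python sketch. The `project_vars` dictionary stands in for a `vars` block in the project YAML file, and its keys, the field definitions, and `build_select` are all hypothetical. In dbt itself, a model would presumably read such a variable with the Jinja `var()` function and loop over it; plain string formatting plays that role here.

```python
import sqlite3

# Stand-in for a `vars` block in the project YAML file. The keys and layout
# are hypothetical, not the package's real schema.
project_vars = {
    "derived_fields": {
        "age_bucket": "CASE WHEN age < 35 THEN 'under_35' ELSE '35_plus' END",
        "is_weekly_shopper": "CASE WHEN shop_freq = 'weekly' THEN 1 ELSE 0 END",
    },
}


def build_select(source_table, derived_fields):
    """Render one SELECT that appends every configured derived field."""
    columns = ["*"] + [f"{sql} AS {name}" for name, sql in derived_fields.items()]
    return f"SELECT {', '.join(columns)} FROM {source_table}"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses (respondent_id INTEGER, age INTEGER, shop_freq TEXT)"
)
conn.execute("INSERT INTO responses VALUES (1, 29, 'weekly')")

sql = build_select("responses", project_vars["derived_fields"])
row = conn.execute(sql).fetchone()
```

Users only edit the dictionary of name-to-SQL pairs; the query-building logic lives in one shared place, which is also what lets a definition be reused across projects.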

All of this is done via YAML and SQL, which are already familiar to our users. And that goes back to the point from the beginning, where the more efficient thing is often much less efficient at first. This is definitely not how I [00:19:00] would structure this for a single pipeline, and it’s not how the prototype pipeline was structured, because enabling all of this customization via YAML means that some of the modeling code is actually a lot more complicated than it really needs to be.

But the trade-off is that the barrier to entry to use this package is pretty low, because users for the most part can just live in this one YAML file.

[00:19:26] Tips

Emma Peterson: Finally, I wanted to wrap up with a few tips and things that I learned as I was setting up this package. The first is to be opinionated and to avoid creating too much flexibility.

Data scientists, myself included, generally know where their data engineering skills fall short, and they’ll appreciate it if you build your expertise into the package, even if that means they need to make some adjustments on their end. And similarly, flexibility can be good, but it also means increased complexity for the user.

It’s an increased number of things that they [00:20:00] need to learn or decisions that they need to make as they’re getting set up. And again, many data science users are going to be perfectly happy to trade complete control over every aspect of their pipeline in exchange for all of the other benefits that your package provides.

As an example of being opinionated, the longitudinal dbt pipeline has an incremental model containing all current and past survey data for the project, and we chose to store that data in long format rather than wide. There’s an illustration here of long versus wide data featuring me and my dog, Louise. This long format can make the data harder to query, but it is significantly more efficient when the pipeline runs.
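For readers without the slide, the difference between the two layouts can be sketched as below. The column names and values are invented for illustration, and `wide_to_long` is a hypothetical helper, not part of the package.

```python
def wide_to_long(wide_rows, id_column="respondent_id"):
    """Turn one-row-per-respondent records into one row per (respondent, question)."""
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            long_rows.append(
                {id_column: row[id_column], "question": column, "answer": value}
            )
    return long_rows


# Wide format: one column per question (invented example values).
wide = [
    {"respondent_id": "emma", "species": "human", "favorite_snack": "popcorn"},
    {"respondent_id": "louise", "species": "dog", "favorite_snack": "cheese"},
]

# Long format: a new survey question becomes extra rows, not a new column,
# so appending a new wave of data needs no change to the table's schema.
long_rows = wide_to_long(wide)
```

One plausible reason the long layout suits an incremental model is exactly that schema stability: questions come and go between waves, but the table keeps the same three columns.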

The previous iteration of this table was in wide format because the format had been optimized for analysis rather than pipeline performance. So we chose to be opinionated here and use the more efficient structure, even though that [00:21:00] means users have to adjust the way they query the data. And of course, as promised since I mentioned Louise, I wanted to pause for a moment so that we can all just appreciate her cute little face.

Emma Peterson: Okay. Sadly, moving on from Louise. The next tip is to provide an example project so that data scientists can see the package in action. This can be a really helpful addition to any written documentation that you have. We actually have two example projects: one that shows the sort of bare minimum longitudinal dbt setup, which is just the three YAML files, and one that uses all available options.

And these example projects have also been really useful as a testing tool. Ours are set up to run weekly just to make sure that everything in the package is working as expected.

My next tip is to make it easy for data [00:22:00] scientists to run their pipelines. This is not true for all data scientists, but it’s something to keep in mind that they may or may not be super comfortable with the command line, or they may just not have the time to learn the right syntax. So it’s important to recognize your users’ comfort level at the command line, and you may want to build some accompanying scripts.

In our case, we built one to generate the required files and one that allows users to run their pipelines and access their documentation. Many of our users are actually quite comfortable with the command line, but enabling this setup within our Platform UI just meant that they didn’t need to learn anything new in order to get started.

My final tip is to find the right level of detail when it comes to documentation. I found this pretty tricky because I wanted to give users a sense of what dbt is and why we’re using it without overwhelming them with a bunch of unnecessary [00:23:00] information about how it works. So my recommendation here would be to make sure you’re documenting enough that your users are able to set up and run their pipelines, but try to avoid getting too into the weeds about dbt.

And one way to avoid overwhelming your users is by keeping any documentation around how and why the package works the way it does completely separate from your user-facing docs. And because you’re not getting too into the weeds about how dbt works, it’s a good idea to include some really thorough troubleshooting advice, since they are likely to encounter dbt errors that will be difficult to interpret without knowing much about dbt.

I would also recommend taking extra advantage of the documentation that dbt autogenerates. I think that’s pretty much true for any pipeline, but particularly within a package like this, every user is going to be able to see and use that [00:24:00] documentation about what each table is and what each column in each table is, what tests are being run and so on without having had to write any of that themselves.

Finally, I think it may be useful to give a training in addition to your written documentation. When we initially released our package, I put together some slides and delivered a training, and then those slides and the recording of the training are now linked in the package README for future users.

That training covered some background and context around the problem, an overview of the package, and a very brief overview of dbt. And then we showed an example walking through setting up a project using the package from start to finish. I also shared some troubleshooting tips there and described the limitations of the package so that users are able to make good decisions about whether it’s the right fit for their projects. [00:25:00]

Before I wrap up, I want to clarify that although I’ve spent a lot of time here talking about the longitudinal dbt package specifically, it is unfortunately only available internally at Civis at the moment. But if it’s something that would be useful to you in your work, definitely feel free to reach out to me on Slack.

So that is it for me. Thank you so much for joining. I’m excited to see some of your responses in the #coalesce-dbt-packages Slack channel, and I’ll be hanging out over there for the next 15 to 20 minutes or so. Thanks.
