Table of Contents
- No silver bullets: Building the analytics flywheel
- Identity Crisis: Navigating the Modern Data Organization
- Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- Down with 'data science'
- Refactor your hiring process: a framework
- Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- To All The Data Managers We've Loved Before
- From Diverse "Humans of Data" to Data Dream "Teams"
- From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- New Data Role on the Block: Revenue Analytics
- Data Paradox of the Growth-Stage Startup
- Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- Keynote: How big is this wave?
- Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- The Future of Analytics is Polyglot
- The modern data experience
- Don't hire a data engineer...yet
- Keynote: The Metrics System
- This is just the beginning
- The Future of Data Analytics
- Coalesce After Party with Catalog & Cocktails
- The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- Built It Once & Build It Right: Prototyping for Data Teams
- Inclusive Design and dbt
- Analytics Engineering for storytellers
- When to ask for help: Modern advice for working with consultants in data and analytics
- Smaller Black Boxes: Towards Modular Data Products
- Optimizing query run time with materialization schedules
- How dbt Enables Systems Engineering in Analytics
- Operationalizing Column-Name Contracts with dbtplyr
- Building On Top of dbt: Managing External Dependencies
- Data as Engineering
- Automating Ambiguity: Managing dynamic source data using dbt macros
- Building a metadata ecosystem with dbt
- Modeling event data at scale
- Introducing the activity schema: data modeling with a single table
- dbt in a data mesh world
- Sharing the knowledge - joining dbt and "the Business" using Tāngata
- Eat the data you have: Tracking core events in a cookieless world
- Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- Batch to Streaming in One Easy Step
- dbt 101: Stories from real-life data practitioners + a live look at dbt
- The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- Implementing and scaling dbt Core without engineers
- dbt Core v1.0 Reveal ✨
- Data Analytics in a Snowflake world
- Firebolt Deep Dive - Next generation performance with dbt
- The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- dbt, Notebooks and the modern data experience
- You don't need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- Git for the rest of us
- How to build a mature dbt project from scratch
- Tailoring dbt's incremental_strategy to Artsy's data needs
- Observability within dbt
- The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- So You Think You Can DAG: Supporting data scientists with dbt packages
- How to Prepare Data for a Product Analytics Platform
- dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- Upskilling from an Insights Analyst to an Analytics Engineer
- Building an Open Source Data Stack
- Trials and Tribulations of Incremental Models
Implementing and scaling dbt Core without engineers
Originally presented on 2021-12-08
A story of how 2 motivated actuaries implemented a solution to fix our broken actuarial data pipelines using the dbt CLI, without engineering or developer resources, and (so far) have had 20 people contribute to the project.
Browse this talk’s Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
[00:00:00] Amada Echeverria: Welcome everyone and thank you for joining us at Coalesce 2021. My name is Amada Echeverria and I’m on the community team at dbt Labs. I’ll be your host for today’s session, Implementing and scaling dbt Core without engineers, presented by Elliot. Elliot is an actuary on the actuarial modernization team at Symetra focused on improving actuarial data.
Prior to dbt, Elliot used Excel so much early in his career that he once had the Excel grid burned into his computer monitor. Can you believe it? He has since reformed his excessive Excel habits. If you're curious about what brings him here today, it is in part due to his discovery of Coalesce in 2020, which led to his organization adopting dbt.
Fun fact, Elliot and his family live on a half acre farm with 17 ducks, five pigs, two geese, one frog, and exactly zero dogs and cats. [00:01:00] When I asked him which animal was his favorite, he said, it's Willow, a neighbor's cow. Perhaps his most important contribution to the dbt community, Elliot has been credited with discovering how to put gifs into documentation, which he did upon realizing that dbt docs use Markdown, and you can in fact embed gifs in Markdown. Over the next 30 minutes, Elliot will tell the story of how two motivated actuaries implemented a solution to fix their broken actuarial data pipelines using the dbt CLI without engineering or developer resources, and so far have had 30 people contribute to the project.
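For the curious, that gif trick works because dbt doc blocks are written in plain Markdown, so a gif can be embedded with standard image syntax. A minimal sketch (the doc block name and gif path below are hypothetical, not from Elliot's project):

```markdown
{% docs stg_policies %}
Staging model for policy transactions, refreshed nightly.

![demo of the refresh process](assets/refresh-demo.gif)
{% enddocs %}
```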
Before we jump into things, some recommendations for making the best of the session: all chat conversation is taking place in the #coalesce-dbt-core-without-engineers channel of dbt Slack. If you're not yet a part of the dbt Slack community, you have time to [00:02:00] join now.
Seriously, go do it. Throughout Elliot's presentation, you can and should make comments, share memes, and throw emojis on other people's responses. Have a good time with it. To kick us off, we want to hear about you. Our chat champion, Mike Feeler, a solutions architect at dbt Labs, has started a thread to have you introduce yourself and let us know where you're calling in from. When this presentation wraps up, Elliot will be available in the Slack channel to answer questions.
Let’s get started. Over to Elliot.
So we cannot hear you. I believe you're muted at your mic…
[00:02:44] Elliot Wargo: Thank you. I was just criticizing your pronunciation of gifs. Hi, I'm Elliot. I'm an actuary and I'm the lead of analytics engineering at Symetra on the data analytics team. [00:03:00]
And today, as Amada said, I want to tell you a story.
I'll tell you a story about how we started a project just over a year ago to fix our broken actuarial data pipelines, and I'll go over four things: the problem we were trying to solve, the strategy that we used to come up with a solution, the solution we actually implemented, and the outcomes we've seen. One of those outcomes was that this initially small project has greatly expanded in scope and become central to how we manage data, not only in the actuarial department, but also in the data groups.
And also, I think interestingly enough, it was the reason for creating this analytics engineering team a couple months ago in the first place.
[00:03:45] What are actuaries? #
[00:03:45] Elliot Wargo: But first, what is an actuary and why are we doing analytics? So I'm an actuary by training. I've been at Symetra my entire career. The first several years, I was in different actuarial departments doing traditional actuarial [00:04:00] work, and the last couple of years I've been on the data analytics team, really focused on analytics engineering type of work. To give you an idea of the scale here, Symetra is about 2,000 employees. We have about 100 actuaries and there are about 40 people on the data analytics team, so quite a few people doing data type work.
My favorite way of explaining what an actuary does is this: actuaries make sure that insurance companies can keep their promises. In other words, that insurance companies are financially able to pay out claims as they come in. So for example, a couple of years ago, I bought a life insurance policy. It's a 30 year term life insurance policy, which just means that if I die anytime in the next 30 years, that insurance company will pay out my family a quite significant amount of money to support them. Now, what is my hope in that, besides [00:05:00] not dying in the next 30 years? My hope is that if I do die, that insurance company is still around and still has the money available to pay out that claim, and actuaries do a lot of modeling and a lot of analytics to make sure that can happen.
And what do you need for all of this analytics and modeling? You need data. Now, someone in chat mentioned there's a lot of preparation to become an actuary. There's a lot of training and studying, and having gone through that training myself, I know that most of that content is on the analytics and the modeling: the background, the methods, and the techniques.
There is not very much training on how to get the data required for all of that, and I like to think that there is some magical world out there where data fairies put together clean, tested, well-documented data sets and [00:06:00] serve them up to you on a silver platter. We don't live in that world at Symetra.
I’m guessing you don’t either, and so what do we have to do? We get messy data and we have to clean it up, and how do actuaries take messy data and clean it up? Ah, that brings us to our problem. The problem is that actuaries are very creative, and not having the training, they come up with all sorts of different ways to clean up that data and get it ready for their modeling and analysis.
[00:06:36] The problem #
[00:06:36] Elliot Wargo: And my basic takeaway from spending a lot of time looking at this is that the way we did that, to put it bluntly, was just broken. The way we worked with data did not work well and needed to be fixed, so let me show you three examples, and I think you'll understand why.
First [00:07:00] example: this is a screenshot of an Excel file that has 68 tabs, and each tab has a SQL query. Those SQL queries are used for generating a data set that's used for doing an analysis on our annuity lapses. An annuity is a type of insurance product, and a lapse is just someone voluntarily ending the policy. To run this data prep, you go to the first tab, copy the query, paste it into SQL Server Management Studio, and run it. Then you go to the next tab, copy it, paste it into SQL Server Management Studio, and run it. And yeah, you go back and do this little dance: copy-paste-run, copy-paste-run. I asked the person that runs this, "well, what happens if the underlying data changes?"
And he said, oh, you just have to run it all over again. And actually, my favorite part about this whole thing is [00:08:00] that each of these SQL queries is parameterized by an input tab, which further complicates the model. Second example: these are a set of Access databases that generate the data needed to do something very important in the insurance industry: calculate the reserves for our life insurance policies. The reserves are the amount of money that an insurance company keeps on hand to pay out those future claims I mentioned. Now, when I look at this example, my favorite thing to ask is "which Access database is it? Is it the 'copy 2'? Is it the v5? Is it the v7? And why were the instructions last updated six years ago?" And interestingly enough, this is one of the processes that we've reworked into dbt, and when we dug into this, what we found was dozens of queries and macros and just [00:09:00] tons of business logic embedded in those Access databases.
And honestly, nobody really knew where it came from, who created it, or exactly what it was supposed to be doing.
And then third example, this is a diagram one of my colleagues put together after looking through a process that one of our employees runs to get data together, to do analysis on our assets. This is actually just one of six pages, and this is this employee’s full-time job to run this process. It’s a monthly process. It takes her weeks and after she completes it at the end, she has to do it all over again the next month.
There's something interesting about these three examples. When we've gone through these and talked with some of the people that use them internally, we've heard the argument that these processes are not [00:10:00] broken, and the argument there is that these processes do create the data that you need for these analyses and models.
They do get the data together. However, in my mind, they are still broken, and here’s my analogy why: so a couple of weeks ago, I went out to this closet at my house that has our water heater in it, and I opened it up and there was a drip coming down. I didn’t even know about the drip because our water heater was still working.
Like, I can go take a shower and I can wash my dishes and I get hot water. Everything seems fine, but there's that drip in the closet. As a quick fix, I stuck a bucket under it, and every couple of days I have to go out and empty the bucket. So technically my water heater is still working, but it is also still broken.
And honestly, I don't know what I'm going to do to fix it. So in a similar way, [00:11:00] actually in the same way, I think our data pipelines technically work, but they are also broken and need to be fixed. And to get a little more specific, here are four things that we identified as the primary problems with these. First, these data processes are not transparent.
Using the Access database as an example, there's so much heavily embedded business logic in there that when we redid this and put it in dbt, it actually took two people 300 hours to rework and document and test all of that logic. Number two, it's not scalable. When we add new products, or we add new features, or when we have to make updates, it is very difficult to do. Also, I think it's clear that this is high risk. Having undocumented logic and untested logic can result in, and I've seen it result in, financial impacts to the company or just bad errors. And then, maybe most importantly, this is just very unsatisfying [00:12:00] work. As an example, we have new employees coming in that have PhDs in statistics, and then we're sitting them down at a computer and saying "hey, can you copy and paste this stuff over and over again?"
That is not satisfying.
And to give you an idea of the scope here, we estimate that we probably have 10 to 20 actuaries doing this type of manual data work full time. That's out of a hundred, and that's a significant amount of time that we're spending on this. That doesn't make them better actuaries.
[00:12:38] The strategy #
[00:12:38] Elliot Wargo: How do you go about tackling this problem? Where do you start? There's actually a multi-year effort at Symetra to rethink how actuaries do their work, and data is a big piece of that. And there are different ways you can tackle this problem. Actually, we had consultants come in a couple of years ago, and they looked at this problem and came up with a solution, and their [00:13:00] proposed solution is what I like to call the "big bang solution."
They proposed bringing in external consultants, dedicating internal actuarial and IT resources, bringing in a new set of technology, and just reworking all of your data pipelines at once. That is one way to go about it. However, those talks fizzled out because that is not only a big monetary cost, but it also takes a lot of time from individuals.
So what do you do? How do you go about tackling that sort of thing? So a year ago, a colleague and I were talking about this. We both had a couple hours a week that we could spend on trying to tackle one piece of the solution, and we were trying to decide what should we do. And so what we decided to do is just simply write down what we had available, what our constraints were, and then we came up with a strategy. And the reason we came up with a strategy before a solution is that we [00:14:00] did want this to be something that we could build on long term. We didn’t just want to do a couple little fixes here and there. We really wanted to implement something that really addressed some of the core issues of these data pipelines.
And here's what we wrote down. First, what we had available. We actually had some really good things available to us. A lot of our actuaries were already familiar with SQL. They were querying databases. They were updating their SQL. Several actuaries had really driven this and even done some training sessions.
We also had a lot of data available on a SQL server, loaded there by data management, and we did have support from actuarial management to think long term: not just to do some quick fixes, but to really do something that could support us for many years. On the other hand, we did have some constraints. Although we have a hundred people in our [00:15:00] actuarial group, and some of them are very technically skilled, we don't have any IT or engineering resources in our group, and we could get a little bit of help from other IT teams, but not much. Also, this was a small team. This started as a side project with myself and my colleague, and we didn't really have any budget for new tools. We weren't going to bring in some fancy ETL tool.
And so once we wrote that down, here's the strategy that we came up with: start small, focus, and then learn and iterate. The idea here, at least our thought, was that if we could just choose one little thing to solve and do that one little thing really well, then that should be something that we can build on, add more things to, or pivot off of if we need.
And a pattern we had been noticing is that while actuaries were using and writing SQL, oftentimes [00:16:00] those SQL scripts were living in files on their computers, or in Excel files, or in Access databases. And so the one small idea we came up with was "hey, if we could deploy SQL scripts as views to a production database, then this suddenly opens up a whole bunch of possibilities."
We could version control those views. We could document those views. We could share those views between multiple people. And if you're familiar with dbt, yes, this is starting to sound an awful lot like dbt. And so that is what we went ahead and implemented. I'd be very curious: if anybody in Slack has other strategies that they've used to tackle similar problems, let me know.
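That one small idea can be sketched as a single dbt model file configured as a view, so that `dbt run` creates or replaces the view in the target database. The model, source, and column names below are hypothetical, not Symetra's actual code:

```sql
-- models/staging/stg_annuity_lapses.sql (hypothetical example)
{{ config(materialized='view') }}

select
    policy_number,
    product_code,
    lapse_date
from {{ source('policy_admin', 'lapse_transactions') }}
where lapse_date is not null
```

Because the model is just a file in a git repository, version control, documentation, and sharing all come along for free.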
[00:16:51] The solution #
[00:16:51] Elliot Wargo: Okay. So let me, oh, one note: we did look at several different solutions besides dbt Core. Specifically, [00:17:00] we looked at a solution from our data management team for how they deployed views. We looked at DACPACs from Microsoft, which are a way to version control database objects, and we looked at dbt Core. DACPACs were essentially out from the beginning, because they're more of an application database backup format.
Our data management approach was really built for data engineers, and it wasn't something that we could feasibly teach to our analysts and actuaries, so we went with dbt Core. So let me talk through the solution. As we've gone through this, I've found it helpful to see how other people have implemented it.
There's a lot of this in Slack, but also on Discourse. I want to talk about it in two pieces: the technology piece and the people piece. Because yes, there's the technology piece of actually implementing this open source project, but then there's training people how to use it, and I think the people piece, training people how to use it, is actually the harder part. [00:18:00] Before we actually implemented it, thinking that this was going to be a long-term solution, we asked around to make sure that we could use this. dbt Core is an open source project, and we wanted to make sure we could use that at our company. Insurance is a highly regulated industry. All of our data is pretty sensitive. So we asked around, and we found three teams that we needed to ask: data management, because we were using their databases to deploy to; architecture, who was interested in thinking through "is this open source package sustainable long term?"; and information security, who wanted to make sure we were meeting all of our relevant company policies.
[00:18:44] The tech stack #
[00:18:44] Elliot Wargo: Then we went ahead and implemented this, and here are some details on our tech stack. Something I would note here is that this was all existing technology that we had available internally, [00:19:00] besides dbt Core. Our intent here was that since we didn't have budget or resources to really implement anything new, we wanted to build off technology we already had available. So specifically, we have three Microsoft SQL Servers, and we use git for version control, hosted on a Microsoft service called Azure DevOps. Azure DevOps also has this concept of pipelines, which is similar to GitHub Actions, and that's where we actually run our production dbt deployments.
For someone working on this on their computer, they're typically using SQL Server Management Studio to write scripts, and then Visual Studio Code to interact with dbt. And then we finally did, a couple months ago, get our docs site hosted on S3, and that's been a great way for people to look at our project.
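A production deployment along these lines could be sketched in an Azure Pipelines definition roughly as follows. This is an assumption-laden sketch, not Symetra's actual pipeline; `dbt-sqlserver` is the community-maintained adapter for SQL Server:

```yaml
# azure-pipelines.yml (hypothetical sketch of a dbt deployment)
trigger:
  branches:
    include: [main]        # deploy when changes land on main

pool:
  vmImage: ubuntu-latest

steps:
  - script: pip install dbt-sqlserver
    displayName: Install dbt and the SQL Server adapter

  - script: |
      dbt deps
      dbt build --profiles-dir .   # run models and tests together
    displayName: Deploy and test models
```

Credentials for the target database would normally come from pipeline variables or a secret store rather than being checked into the repository.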
Now on the people side, this is the [00:20:00] typical workflow that an analyst using our project goes through. Of course, first they set up: they're installing Python, they're installing git, they're installing VS Code, they gain access to all the required groups, and then they go through this software development workflow.
And I think what's interesting here is that this is a very different workflow from the workflow that these analysts and actuaries are used to. The typical analyst or actuary at our company is used to the Microsoft Office tools, some actuarial modeling software, and SQL. And so this workflow is very different.
I won’t go through all of this in detail, but I will note here that this actually isn’t specific to dbt, so creating a development branch, modifying code locally, and running it in the command line, documenting, committing those changes, creating [00:21:00] a pull request, and then automatic deployments, all of that is not actually dbt specific.
And so this is actually something that we've reused elsewhere: we have a parallel Python project, and we're trying to move actuarial modeling into that project. A couple other notes here. One of the most valuable things we found is number three, the documentation: incorporating documentation into the same place that people are writing code, and having people write their documentation at the same time they're writing their code.
Honestly, I think that's the only way that documentation is going to get there. And then on number six, just a couple of notes on things that we check for. We are pretty specific and honestly have pretty high requirements for what we check for: we're checking for descriptions on all of our models, tests on all of the key columns, and owners defined for all models.
And we're [00:22:00] also checking against a style guide that we have implemented.
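Checks like these (descriptions on every model, tests on key columns, a defined owner) live in a dbt properties file. A hypothetical sketch of what that can look like; the model name, owner value, and column are made up for illustration:

```yaml
# models/schema.yml (hypothetical example)
version: 2

models:
  - name: stg_annuity_lapses
    description: One row per policy lapse transaction.
    meta:
      owner: analytics_engineering   # assumed convention for recording owners
    columns:
      - name: policy_number
        description: Unique identifier for the policy.
        tests:
          - unique
          - not_null
```

A pull-request check can then fail the build if any model is missing a description, tests, or an owner.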
All right. So now onto the outcomes, and again, I want to talk about this in terms of the technology learnings and the people learnings, with one particular focus. Yes, this effort, this project, has led to great outcomes for us, but here I want to focus even more on the things that we've learned and the new challenges that we've come across while doing this, because I hope that by sharing those, this can be helpful to other people that are in a similar spot.
Okay. What have we learned about technology? The good and the challenges. On the good side, we think dbt Core was the right choice. It did what we initially wanted it to do, deploy SQL scripts as views, but then it's done a whole lot more, and we're using a [00:23:00] whole lot more of its functionality.
Of course, it's materialized some of our data pipelines. I mentioned documentation. This has been big. Previously, a lot of times when these processes were created, there just was not a place to do documentation. Now there is, in a way that we hope is sustainable long term. Also, having a standard framework for managing our data.
Recently, our data management team purchased Snowflake. We were on SQL Server, and we are using Snowflake for some new use cases. We already had a dbt project, so it was really easy to just copy that and set it up to work with Snowflake, and from the analyst's perspective, they're used to using dbt with SQL Server, and using it with Snowflake is essentially the same thing.
Now, on the other side, we have had some new challenges. We have these Azure DevOps pipelines that run our production [00:24:00] deployments. Those are definitely not a set-it-and-forget-it type of thing. They break probably a couple of times a week, and we have to go fix them. Now, I will note that a lot of those breakages are dbt tests failing.
And honestly, I think those failing tests are just things we weren't checking before. Those are things that would have just flown under the radar; we wouldn't have seen them. We've also run into some runtime issues. At some point, our pipeline was taking up to two hours. We did implement Slim CI, which lets you deploy only new or changed objects, and that has greatly helped.
Again, we initially started with everything as views. Obviously, as you add layers of views on top of views, eventually you reach problems there, and you have to materialize something to tables. We have not figured out a great strategy for our table [00:25:00] or incremental load materializations.
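For reference, one common dbt pattern here (though not necessarily the strategy Symetra settled on) is the incremental materialization, which on each run processes only rows newer than what is already in the table. The model, column, and ref names below are hypothetical:

```sql
-- models/marts/fct_policy_reserves.sql (hypothetical example)
{{ config(
    materialized='incremental',
    unique_key='policy_valuation_id'
) }}

select
    policy_valuation_id,
    policy_number,
    valuation_date,
    reserve_amount
from {{ ref('stg_policy_reserves') }}

{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than what's already loaded
  where valuation_date > (select max(valuation_date) from {{ this }})
{% endif %}
```

The first run builds the full table; subsequent runs insert or merge only the new rows, which is usually the step that tames multi-hour pipelines.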
And then, needing more data. Once people have a great way of working with data, of course they want more, and we're still working with our data management team on what it looks like to get data quickly and efficiently into our databases. Now, on the people side, we've done a lot of training on how this new way of working goes, and we found some things that have gone well and, again, some things that have not. What's gone well are one-on-one or one-on-two training sessions. We've tried some large group training sessions, and those just don't work. People sit there, they hear us talk about this thing, and then they just go and forget it. If you don't use it, you're not going to remember. Also, using actual use cases, with data that people use in their work, has been much better than using example use cases or fake data. And then one of the most interesting things that I found [00:26:00] is that some of the people that you would assume would be the best at using dbt, or the best at doing analytics engineering: those assumptions are probably wrong. So there have been people where I'm like, "oh, this is perfect for them, they are going to be a big contributor to this project." And they weren't.
On the other hand, we have one person in our finance department who we did not intend this project to be used for, but he’s actually been the largest contributor outside of the core group. In using this project, he’s deployed dozens of views.
And we've done a whole bunch of other efforts. We've done branding: we named our project 'Swiftmart.' Our company logo is a swift, and we're creating data marts. We thought we were very clever. We've partnered with other teams: we contribute some hours, and we train them up on how to use it.
I hold weekly [00:27:00] office hours. I send out a newsletter every couple of weeks. I love documentation, so we've written an extensive wiki on how to use this project. I'd be very curious, actually, if anybody else has ways they've taught people to use dbt Core. If you have other ideas, feel free to post those in Slack. And we've had over 30 contributors.
And this includes not only actuaries, but all sorts of other people. My favorite is that we had a project manager actually contribute a pull request.
[00:27:41] What we haven’t figured out yet #
[00:27:41] Elliot Wargo: And then finally, I want to leave you with this. We started out with a small project, and it turned into a big project. This is actually now my full time job, essentially, to maintain this and keep scaling it out, and as we've built it out, we keep coming across problems. [00:28:00] Pretty much, we've been able to figure all those out. Right now, here are some of the things that we're still trying to figure out. If you have any brilliant ideas, please share them with me. Feel free to reach out. But again, on the technology and the people side. So on the technology side, I mentioned this earlier, but how do we consistently and quickly get the data we need loaded to databases?
Our data management team has been built around loading data from applications. For analytics, often you need data from all different sources, and much more quickly than we are used to. In terms of the pipelines, we define tests and those tests fail, which is great. That means we’re checking our data, but how do we effectively prioritize those?
How do we effectively address them? There are many different places that there could be a problem that caused that test to fail. And then what is the right level of documentation? As I mentioned earlier, we have pretty high standards. [00:29:00] Some cases we’ve had to discuss, should we lower those standards to move more quickly?
And then on the people side, again, this new workflow is very different from what people are used to, and we continue to have challenges training people how to use it and convincing people to use it. So we think a lot about how we make this more of a core part of people's workflows, rather than just something they contribute to once, try out, and then don't use again. Related to this, who should be using dbt? We have a couple analysts that say, "hey, my time is better spent doing analysis rather than learning this new workflow." That could be a valid point. How do we think through that? And then, how do we get people to move from beginner to expert? We have a lot of beginners and just a few experts.
How do we get more people to move up there and become bigger [00:30:00] contributors? So that’s our story. I hope that you can learn something from it, and I will mention that we are hiring. I think that’s obligatory to say here. If this sounds interesting, feel free to reach out to me. You can do that in Slack and I’d be happy to talk.
I'd also be happy to hear about your experience. If you've implemented dbt Core, how are you using it, how are you training people, what challenges have you come across? Just reach out. Thanks.