Table of Contents
- • No silver bullets: Building the analytics flywheel
- • Identity Crisis: Navigating the Modern Data Organization
- • Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- • Down with 'data science'
- • Refactor your hiring process: a framework
- • Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- • To All The Data Managers We've Loved Before
- • From Diverse "Humans of Data" to Data Dream "Teams"
- • From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- • New Data Role on the Block: Revenue Analytics
- • Data Paradox of the Growth-Stage Startup
- • Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- • Keynote: How big is this wave?
- • Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- • The Future of Analytics is Polyglot
- • The modern data experience
- • Don't hire a data engineer...yet
- • Keynote: The Metrics System
- • This is just the beginning
- • The Future of Data Analytics
- • Coalesce After Party with Catalog & Cocktails
- • The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- • Built It Once & Build It Right: Prototyping for Data Teams
- • Inclusive Design and dbt
- • Analytics Engineering for storytellers
- • When to ask for help: Modern advice for working with consultants in data and analytics
- • Smaller Black Boxes: Towards Modular Data Products
- • Optimizing query run time with materialization schedules
- • How dbt Enables Systems Engineering in Analytics
- • Operationalizing Column-Name Contracts with dbtplyr
- • Building On Top of dbt: Managing External Dependencies
- • Data as Engineering
- • Automating Ambiguity: Managing dynamic source data using dbt macros
- • Building a metadata ecosystem with dbt
- • Modeling event data at scale
- • Introducing the activity schema: data modeling with a single table
- • dbt in a data mesh world
- • Sharing the knowledge - joining dbt and "the Business" using Tāngata
- • Eat the data you have: Tracking core events in a cookieless world
- • Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- • Batch to Streaming in One Easy Step
- • dbt 101: Stories from real-life data practitioners + a live look at dbt
- • The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- • Implementing and scaling dbt Core without engineers
- • dbt Core v1.0 Reveal ✨
- • Data Analytics in a Snowflake world
- • Firebolt Deep Dive - Next generation performance with dbt
- • The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- • dbt, Notebooks and the modern data experience
- • You don’t need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- • Git for the rest of us
- • How to build a mature dbt project from scratch
- • Tailoring dbt's incremental_strategy to Artsy's data needs
- • Observability within dbt
- • The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- • So You Think You Can DAG: Supporting data scientists with dbt packages
- • How to Prepare Data for a Product Analytics Platform
- • dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- • Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- • Upskilling from an Insights Analyst to an Analytics Engineer
- • Building an Open Source Data Stack
- • Trials and Tribulations of Incremental Models
Upskilling from an Insights Analyst to an Analytics Engineer
Have your insights analysts ever wished that they could just “code the table themselves”? Do you ever think that there’s got to be a better way than sending requirements back and forth between the data engineering and insights analyst teams?
Learn how Degreed is upskilling their insights analysts with analytics engineering skills! We’ll talk through example topics to start your team on their upskilling journey, as well as hear about the many benefits of having an upskilled team, such as moving from an in-series process to a more in-parallel approach to model building.
Follow along in the slides here.
Browse this talk’s Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
[00:00:00] Amada Echeverría: Welcome everyone. And thank you for joining us at Coalesce 2021. My name is Amada Echeverría and I’m on the Community team at dbt Labs. I’ll be your host for today’s session, Upskilling from an Insights Analyst to an Analytics Engineer presented by Brittany Krauth. Brittany is the manager of Analytics & Insights supporting Degreed’s upscaling platform.
Brittany’s passionate about building a company-wide data-driven culture. She has worked in various analytical roles from a focus on predictive analytics to data visualization, to process improvements. In addition, she holds a bachelor of science in industrial engineering from Georgia Tech in her spare time.
Brittany competes in dog agility and runs a donkey rescue. Have your insights analysts ever wished that they could just code the table themselves? Do you ever think that there’s got to be a better way than sending requirements back and forth between the [00:01:00] data engineers and insight analyst teams? In this 30 minute session, you will learn how Degreed is addressing these topics by upskilling its insight analysts with analytics engineering skills.
Before we jump into things, some recommendations for making the best out of this session. All chat conversation is taking place in the coalesce-analyst-to-ae channel of dbt Slack. If you’re not yet a part of the dbt Slack Community, you have time to join now. Seriously, go do it. Visit getdbt.com/community and search for Coalesce Analysts to AE when you arrive. We encourage you to set up Slack and your browser side-by-side in Slack. I think you’ll have a great experience if you ask other attendees questions, make comments, share memes, or react in the channel at any point during Brittany’s session.
And if you’re new to the dbt Community, or this is your first [00:02:00] time at Coalesce, please don’t be too shy. It’s very important to hear from your voices. So please chime in if you feel comfortable doing so. To kick us off, our chat champion, Sanjana Sen, analytics engineer at dbt Labs, has started a thread to invite you to share photos of your pets.
Don’t have one? If you could have any pet in the world, what would it be? And I see someone already dropped in a pony called Little Sebastian. So this is going well. After the session, Brittany will be available in Slack to answer questions. Let’s get started. Over to you, Brittany.
[00:02:36] Brittany Krauth: Hi, I’m Brittany Krauth. Today we’re going to be talking about upskilling from insights analysts to analytics engineers.
So as that great introduction was, I am Brittany Krauth. I am a donkey trainer. I’m a dog agility handler, but most of all, I love data. I work at Degreed. I am a manager of analytics. So at Degreed, if you have a [00:03:00] question that you want to solve with data, there are two teams that you’re going to be working with.
One is the data engineering team. They’re the ones who have your data warehouse. They do orchestration, your access, and then you’ve got the analytics and insights team. So that’s my team. We’re the ones who are building your dashboards, doing key performance indicators and anything ad hoc. When you want to do an AB test or a regression analysis, you’d come to us, the analytics and insights team.
And so before dbt, there was this disconnect between the raw data coming in from the pipelines and the datasets that we needed to power the dashboard. And once we started working in dbt, we realized that this middle ground wasn’t quite so complicated. All we needed was a model to build that curated dataset and be that bridge between the raw data and the dashboard.
So our data engineering team [00:04:00] had just as much work to do bringing in that raw data. They still had a ton of pipelines that they were building up and building all those lovely raw datasets. But now we needed to also create curated datasets. And so the analytics and insights team said, Hey, can we learn to do that? Can we learn to code these models? We know what we want. We know what format the data needs to be in to make the dashboards work. So what upskilling can we do to build these models? Now, of course, it’s a little bit of a gap on the dashboarding side. So we lean heavily on the data champions around our organization.
So working with departments to help them increase their data literacy so they can start building their own data. But what we’re going to focus on today is really this middle portion. How do we upskill a team of analysts to be comfortable in dbt building models and becoming [00:05:00] analytics engineers.
[00:05:03] Creating an individual experience #
[00:05:03] Brittany Krauth: So I want to start off with something that I think is incredibly important, which is as you start this upskilling journey, it’s a very individual experience.
First, you’re going to have to work with each member of your team and understand what are their career goals? What are their dreams? What are they looking to accomplish? And if you’re on this journey yourself, you might ask yourself these same questions. Where do you see yourself headed? And you can combine that with, what skills do you need to get there?
What are the critical skills to be able to support your team as analytics engineers, support your company and support your personal career. So today we’re going to be going through a lot of skills. So this list of critical skills, I’m going to give you a great starting point today of specific skills that you can learn.
Once you have the list of skills, and you’ve talked about [00:06:00] career goals, there’s going to be gaps. You’re not going to be perfect on every single skill. And we got to figure out how to fill those skills. And while we could take you to of course at a college or get you a certificate, what really seems to work the best is something more in the flow of work, finding projects and shadowing and upskilling opportunities within your day to day role.
So today when I go through my list of skills, we’re going to do it in a little bit of a fun way. Because we’re going to talk about them in the flow of work, through an example project. And as we step through this project, you’ll see all the different skills that you’re going to need to finish this project from start to finish.
[00:06:48] Example project: New dataset for issue tracking #
[00:06:48] Brittany Krauth: And our example project is pretty much just a generic new dataset. You need a new dataset for your stakeholder to be able to answer their business question. How we’re going to walk [00:07:00] through this project is we’re going to start by talking about the skills associated to answering the request from the stakeholder, what do they need and when do they need it. Then we’re going to talk about all of the skills associated with the research you put in upfront. You’re going to have to do some research on what tools make sense to use in this situation. Can we build a dbt model from this, working with analysts to understand the business logic?
How does this all fit together? But then most of your time is probably spent on your recommendation, building a dataset that you think will work to solve their business issue. Now, and with this recommendation, this is something that is incredibly important for analytics engineers, because if a stakeholder comes to us and they want a new dataset with five fields, doesn’t always mean I’m going to give them back five.
I might look at it and say, Hey, I [00:08:00] recommend that you also need these other two fields. Hey, I recommend that you consider these filters. The analytics engineers are really serving as that recommendation and that teacher of knowledge to help them grow and not just providing them a dataset. Okay. And then finally are skills associated with the final review.
So we got to make sure our dataset is absolutely perfect and is what we want to be showing. And so there’s a lot of things to get it over that final finish line. Oh, one thing you might’ve noticed is I didn’t break it out into documentation model, building and validation. Oftentimes that’s the way you see us breaking out how we’re doing things because in the old way, we were very in series.
You document as much as you can upfront, and then you have a handoff to someone else and you lose momentum that way. With analytics engineering, you’re probably getting more and [00:09:00] more used to a parallel development where you’re able to jump between documentation, doing some validation, updating your model, very in the flow. And so that’s why, instead of breaking this out linearly, we’re going to break it out this way linearly because the request happens first, then the research recommendation and filing the review. So let’s dive in to some stuff.
So request arrives, number one, communication and specifically communicating between a business stakeholder who’s very familiar with probably the business terminology and trying to translate that into the database terms. What tables are you going to need? What fields actually represent what they’re speaking about. So acting as that bridge between the business and the more technical database jargon. And our next skill is storytelling.
So becoming that [00:10:00] storyteller and knowing your audience. So the kind of information that a sales rep needs is probably different than what the chief revenue officer needs. And so understanding how summarized, how deep you should go is really important. And then finally project management, because at the end of the day, they’re probably going to ask you what’s the date? When can I get this by? And you have to summarize all of the deliverables and the blockers and dependencies on other teams and somehow provide that date.
And so to go with this project management, I want to flow into something that’s a little bit more software engineering, which is agile methodology. So this is something that your analyst is probably not very familiar with, but they’re probably already doing it a little bit. And we could talk for days about just agile methodology.
So I’m just going to focus on their sprint work, their iteration, this idea of taking an overall project and splitting it out into smaller [00:11:00] sections. So we might say, Hey, I know you want three dashboards, but we’re going to do one dashboard at a time and put those into chunks. And it might be split up because there’s different deadlines.
Maybe the blue dashboard needs to be done in two weeks, but the yellow one can wait and be done next year. That’s a logical place to break out this project.
[00:11:21] Brittany Krauth: So we break out our projects through Jira. So every time we take our project and break it out into sections. We’ll put it into epic. So each epic will be, maybe it’s a section, maybe it’s a version. Maybe you have a version 1 and a version 2 of the project. And then that allows us to break out that project section into the individual tickets, the individual pieces of work, the issues or stories that we have to do to be able to accomplish that project. And so oftentimes you have a lot of priorities, both as an analyst and as an analytics engineer and with all these competing priorities, sometimes it gets [00:12:00] hard to say which one’s more important.
And so I always pushed to say, what are the quick wins? What’s easy. What data can we put out today, or as quickly as possible to drive change in the business. What actually will cause action to happen so we can make the business more money.
[00:12:21] Research begins #
[00:12:21] Brittany Krauth: So requests came in. We’ve got it figured out, but now I need to do some research. How are we actually going to do this? And this all starts with dbt. So dbt is the core of what you need to be able to get started. So instead of some really complicated store procedures, dbt lets your analytics engine really be able to dive in and feel comfortable very quickly. One of the things that makes dbt so awesome is the deployment environments. So when you’re starting to do your research, you can dive into dbt right away and start fiddling with models in your local environment. So we have [00:13:00] local environments for every single one of our ends.
And so we can feel comfortable having a safe sandbox that we’re not going to break a dashboard or break a finished dataset. We also have review environments. Our review environments are automated in Git, they just come with the pull request. They are fantastic. It allows us to do really detailed data validation on exactly what we’re going to push to production.
So when we do push something to production, we’re happy with it. And we know that it’s good to go.
So SQL is such a core skill. But there’s so much to SQL. So I want to call on just three pieces of SQL that I think is the most important areas to learn. First of all, is select statements. You could talk about create tables and update tables and temporary tables, but at the end of the day, what do you use the most? Select statements.
Select statements are what gonna drive your dbt models and select statements are what’s going to drive your data. [00:14:00] Validation is what’s going to drive your ad hoc. So become good at those select from where group by statements. Next, when you’re moving from an analyst, that’s just querying once or twice to moving to an analytics engineer, that’s creating something that’s sustainable.
You’re going to have to start thinking about making your SQL as efficient as possible. So should you use union or union all. How early in the flow can you apply that filter to take the million rows down to half a million rows? Having more efficiencies is going to be a big win, both monetarily because you’re not running the code as long, but also you can get data out quicker every morning.
And then the third part of SQL I want to emphasize is having a consistent style. So if I always use snake case and you use camel case, and I put the comments in the front and you put the comments in behind, the more differences we get, the [00:15:00] tougher it’s going to be for us to switch and edit each other’s models and still feel comfortable.
So we’re trying to build out a style guide for our SQL to say, these are our expectations internally with the thought that lines of code are cheap, but brainpower is so expensive. So I want to be able to take a look at your model and know instantly what it’s doing and how data is coming through. So comments, extra lines, extra CTEs, that’s what I really look for. So there’s a lot of tools that you’re going to need to be familiar with and learn in analytics engineering.
Where’s my data warehouse at? Ours is in Snowflake. Who are you using for orchestration. Where are you actually typing in your code? And then your BI tool because even as an analytics engineer, you might not be building as many dashboards, but you’re still gonna need to know what format that data needs to be in, to work for a dashboard.
[00:16:00] And I would say what’s important on this is less to be familiar with anything in specific. For example, you could say, oh, I need to know Domo, or I need to know Tableau. I need to know a specific BI tool. And I would say that’s not necessarily the case. What you need to know is broadly, what do BI tools like? So pick one, whichever your company uses, dive into that, understand it.
And it’s very easy for you to transfer that knowledge then to other creators of those tools. So I’ve talked through quite a few really technical analytics engineering skills. So I wanted to mention teaching strategies right now. And I think what’s important about teaching strategies is that you have a lot of options.
So maybe you have a setup guide to get them all set up, but then you have a separate first-time walkthrough. Maybe you have articles and you have videos. You have internal, you have external, maybe you do send for some formal courses, but then maybe you also do some pair programming. I know pair programming is extremely effective, [00:17:00] especially if the person driving is the one with less experience.
We’re also really focused on having internal documentation. Things like frequently asked questions. If I keep being asked a question I should be able to send you a link to where it’s written down with the answer. We try to reduce the number of times that we’re just sharing things that are in my brain, word of mouth, and instead moving more to documenting everything. And of course, great teaching strategies, attending conferences like today. So you’ve already got a leg up just by being here. Okay.
[00:17:34] Making a recommendation #
[00:17:34] Brittany Krauth: So now we’re going to pivot to building your recommendation and we’re going to drop right back to talking about dbt a little bit because when we’re creating our dataset, our model we have three files that we actually use.
So you might be familiar with the SQL file. That’s where you have your configuration. That’s where your actual query exists to build the model. But we also require two other files to come with it. One is your MD file with your [00:18:00] model description, and then the other is your YAML with column descriptions and tests.
We’ll talk a little bit about how your model description feeds into documentation in a bit. We’ll talk more about testings and a better as well. So now we’re going to pivot completely to something that’s a skill that’s much more ambiguous. So creative problem solving is one of those skills that you were like, yeah, I need to be a creative problem solver.
How do you teach someone to be a creative problem solver? And so we’re gonna walk through a little mini example of one of the ways that I like to teach this through something like a dbt macro. So say you have a raw dataset, it’s a timestamp. And if it doesn’t know what date it is, it’s going to throw a 1900 data in there. Now, so your BI tool can’t handle that. The timestamps going to look all wonky. You want it as a date field, and that 1900 day’s just going to throw the chart way off. So you want it to be no. And so you [00:19:00] have these two changes that you need to make. And so you, the analyst might come back with something that looks like this, a case when statement that takes care of both of those issues. And that’ll work for a single field, but maybe you have lots of fields that have this same formatting issue.
Maybe you have lots of fields in this model, but also in other models that it keeps coming up. And so every time you copy them, there is a chance that it might get a little off or something might not work correctly. And that’s when our peer review process really shines. Because you can look at peer review from a, does it work perspective? Is it okay? Technically, yes, that would work, but can we make it better? And it’s oh we could create a macro that would make it super consistent. It would make everything better. And so we’re constantly pushing our analysts through our peer review [00:20:00] process to think of things as okay, to good to great.
And sometimes the suggestions in peer review are things that don’t make sense. They are so far that there’s a version 2, version 3, version 20 and that’s okay. Because it gets us always pushing our standards higher and higher. And so through peer review, we might put a comment on this and say, oh, I think there’s a better way with the macro.
Go ahead and take this away. Do some research, do some creative problem solving and see if you can come up with macro that will work for this. And they’ll probably come back with something like this. So very basic macro, just taking that case when statement, but now every time we have a timestamp that we have to format to that kind of date, we can just call that macro and it makes the code really easy to read which is also one of the things that I love because I can very quickly understand exactly what’s happening to each of those times.[00:21:00]
[00:21:00] Review the dataset #
[00:21:00] Brittany Krauth: Okay. So we’ve got our research, we’ve got our recommendation. And now we’re starting to validate and review this final dataset. So one of the things that often comes up, when do we release things? When do we get to publish and say something’s done? And so we’re really big on not having specific release dates.
So we don’t say, data comes out on the first of the month. Instead we ship it when the data is ready. And so I’m really careful about using this term "ready" here, because what I think is perfect is just, understood. Perfect doesn’t mean there’s no holes or data’s always going to be messy.
It’s never going to be clean. And if we understand the data, if we understand that, Hey, there’s no data for April 1st. And that’s just the way it is. We can document it. We can label it and the company can still drive decisions on the May [00:22:00] data or the June. And that’s okay. And so in like software engineering, data engineering, they’ll talk about the CI/CD about how things are shipped.
And so being familiar generally with how this works and how it integrates into analytics engineering is good.
So data governance is probably one of my favorite topics because I think there’s a lot to growing people’s confidence and data quality. If you have a button in a product, in an app and it doesn’t work and then the next release you fix it. So it works. People are going to be like, yeah, awesome. I’ll click the button now.
If you have a bug in a report, so the data is just not right, people lose confidence very quickly and they’re likely to just never run that report again. So having high standards on everything that you’re putting out is really important. And we do that by having consistent sources, having things come from the same raw datasets, having consistent terminology and consistent business logic.[00:23:00]
So for example, macros is a great way to have consistent business logic because you’re not copy pasting it. The multiple models, you can have one source of truth for that business. We also build confidence through having different guides. So we have our data dictionary which lists every field, every model. Everything that’s driving a published date dashboard in our BI tool is in the data dictionary.
And if you’re using dbt Docs today, then having those MD files and those YAML files filled out will give your dbt Docs a great start to your data dictionary. We’re also big on style guides and mentioned the SQL style guide. We also have a BI style guide. Our data engineering team has style guides for Python.
We have lots of style guides that it really helps create that consistency and confidence within our own team.
[00:23:52] The importance of data validation #
[00:23:52] Brittany Krauth: So I put data validation here at the end because I think it’s such a core part of the routine. But I typically [00:24:00] start my process with data validation. So when I’m doing the beginning research on, do we even have the raw data to get started? I’ll start my data validation then. And so sometimes this is just querying and your data warehouse.
Sometimes this involves Excel. So being familiar with Excel is super helpful. Sometimes you gotta dive into Python. And so while knowing Python is probably not necessary for every company’s analytics engineers, your company might be one of those that it is. So that’s probably a good example of a skill that may or may not be critical to your team.
And when we’re validating data, I think what’s most important is looking at, does the dataset meet expectations? So if you have a left join instead of an inner join or vice versa, you’re going to get something that doesn’t match, probably what you documented. And so clearly understanding, this is what I wrote down in the data dictionary, this is what I wrote down in our initial requests with the stakeholder, is our [00:25:00] stakeholder going to like this. And part of this is going to be documenting how you cleaned the data. As I said, if April 1st has no data, what are you going to do? Are you going to just not include that date?
Are you going to include some kind of average on that date so the month still looks whole? I think there’s lots of options and there’s no cut and dry, yes or no answers for cleaning data. So practicing it, understanding what your stakeholder wants, and then documenting that. For example, if you have some rows that are such bad outliers, do you filter them off and not include them?
Which could be good, but also could hide something and could potentially be hiding an issue. Or maybe just mask them, something like a not applicable or not available. So understanding what makes the most sense to do is something that just doing it over and over is going to help you grow this.
One thing that an analyst is probably not as familiar with is taking your testing and getting it [00:26:00] automated. So moving from being very reactive to data quality issues to proactive, I would say most of the time I’m notified that a dashboard has an issue by a user saying, Hey, I tried to pull this and it didn’t work.
And so in our BI tool we’re able to add some alerts on dashboards and they’re like, oh, this is so much better. Now we can finally know at 9:00 AM and the dashboard refreshed is something. And then once we started moving in analytics engineering, we said, wait a minute, dbt tests are even better.
Now we’re upstream. Now I know at 1:00 AM if the data doesn’t load. And so we can really get a jump on updating and identifying the issues long before the users ever noticed.
[00:26:45] Question the data #
[00:26:45] Brittany Krauth: So the last skill I wanted to cover today is around questioning the data. It’s another one that’s a little bit ambiguous. Some people call this just data curiosity because finding someone who’s really curious by [00:27:00] nature feels ambiguous. And I think this is something that can be taught. So whether it’s a new dataset or you’re investigating a bug, pushing your analyst to spend time with the data. So they might look at it and say, Hey, this dashboard is missing some rows, okay. The dashboard is missing some rows. What are we going to do about it? What does that affect? It doesn’t give us enough information to get started. So pushing the analyst to say, just go do some queries, go do some research, ask questions, find the patterns, where are the irregular.
And so tracing back, oh, I realize it’s not the whole dashboards missing rows. It’s a specific table. And I figured in that specific table, it seems to be missing data for just April 1st. And I wonder if anything happened in April 1st, let me hop in my orchestration tool and see if there were any errors that day and there are.
And so now suddenly my analyst is taking this two, three steps deeper, [00:28:00] and we can understand how to tackle this issue. And it really grows their skill of creating that curiosity just to brew, also solving the bug.
So we’ve talked about so many skills today from the initial request with the stakeholder to starting your research, working in dbt using those local environment that you’ve got set up, you built your recommendation, peer review is such a big part of our team. And at the end, it’s really important to do your reviews.
So adding your automated testing, adding data governance is all super important. But I want to flip back to what we talked about in the beginning about creating this individual experience, where the upskilling focuses on each person. And so taking the skills that we talked about today and understanding what does each of my individuals need for their career growth and for our team, find those skill [00:29:00] gaps.
And then hopefully the next time that you have some projects, you can identify what makes sense, to help, not just for the company and the project, but also for their upskilling journey. Awesome. Thank you. My name is Brittany Krauth and I’m happy to answer any questions in Slack.
Last modified on: Oct 13, 2022