Table of Contents
- No silver bullets: Building the analytics flywheel
- Identity Crisis: Navigating the Modern Data Organization
- Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- Down with 'data science'
- Refactor your hiring process: a framework
- Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- To All The Data Managers We've Loved Before
- From Diverse "Humans of Data" to Data Dream "Teams"
- From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- New Data Role on the Block: Revenue Analytics
- Data Paradox of the Growth-Stage Startup
- Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- Keynote: How big is this wave?
- Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- The Future of Analytics is Polyglot
- The modern data experience
- Don't hire a data engineer...yet
- Keynote: The Metrics System
- This is just the beginning
- The Future of Data Analytics
- Coalesce After Party with Catalog & Cocktails
- The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- Built It Once & Build It Right: Prototyping for Data Teams
- Inclusive Design and dbt
- Analytics Engineering for storytellers
- When to ask for help: Modern advice for working with consultants in data and analytics
- Smaller Black Boxes: Towards Modular Data Products
- Optimizing query run time with materialization schedules
- How dbt Enables Systems Engineering in Analytics
- Operationalizing Column-Name Contracts with dbtplyr
- Building On Top of dbt: Managing External Dependencies
- Data as Engineering
- Automating Ambiguity: Managing dynamic source data using dbt macros
- Building a metadata ecosystem with dbt
- Modeling event data at scale
- Introducing the activity schema: data modeling with a single table
- dbt in a data mesh world
- Sharing the knowledge - joining dbt and "the Business" using Tāngata
- Eat the data you have: Tracking core events in a cookieless world
- Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- Batch to Streaming in One Easy Step
- dbt 101: Stories from real-life data practitioners + a live look at dbt
- The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- Implementing and scaling dbt Core without engineers
- dbt Core v1.0 Reveal ✨
- Data Analytics in a Snowflake world
- Firebolt Deep Dive - Next generation performance with dbt
- The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- dbt, Notebooks and the modern data experience
- You don’t need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- Git for the rest of us
- How to build a mature dbt project from scratch
- Tailoring dbt's incremental_strategy to Artsy's data needs
- Observability within dbt
- The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- So You Think You Can DAG: Supporting data scientists with dbt packages
- How to Prepare Data for a Product Analytics Platform
- dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- Upskilling from an Insights Analyst to an Analytics Engineer
- Building an Open Source Data Stack
- Trials and Tribulations of Incremental Models
From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
Originally presented on 2021-12-09
Let’s do a quick thought experiment. Suppose you join a young SaaS startup that would like to improve its data game, with the vision of building a data-informed culture. Suppose further that the only existing pieces of “data infrastructure” are hundreds of scattered Excel sheets and that you are given a shoestring budget of 500 USD per month.
What would you do?
As it turns out, this was the situation the nascent Data Team at Slido found itself in about three years ago. Our answer at the time was to build a pretty standard ETL pipeline based on AWS Athena, orchestrated by Airflow, and visualized via Superset. As there was not much Athena-specific ETL tooling available at the time, we built in-house tools for managing materialized views, testing the validity of the resulting datasets, and even ensuring that the "test coverage" of our materialized views doesn't decrease over time.
This setup was way better than nothing, but not without its issues. There was no ability to do templating, so having separate staging and production environments was not feasible. We invested significant effort in documentation, but keeping it up to date in a separate tool (Coda.io) proved very difficult. And while we knew what data was available, we had very little visibility into where it ended up being used (i.e. what visualizations and decisions it powered).
During all this time we had been keeping an eye on dbt, as it seemed like an answer to most of our troubles. Using it with AWS Athena was not possible for quite some time, but with the emergence of externally provided adapters, there was little reason not to switch. The introduction of the Athena adapter also coincided with the formulation of our internal data vision, in which everyone will need to analyze data as part of their job, whether they realize it or not.
To reach that vision, however, we needed answers to at least a few questions:
- How do we make sure there is a single source of truth for all our data?
- How do we ensure the data documentation stays up-to-date?
- How do we keep track of where the datasets prepared by the Data Team are getting used?
In our talk we would like to walk you through our journey from our custom-built transformations to using dbt as the framework and platform for all our data needs. We would also like to present some of the dbt-enabled tooling we built along the way, which we believe may be interesting for the Analytics Engineering community at large, such as tools that report documentation and test coverage, or that automatically link the usage of various dbt models to their respective Superset charts/dashboards via Exposures.
Browse this talk’s Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
[00:00:00] Sara Leon: Hello and thank you for joining us at Coalesce. My name is Sara and I’m a customer support engineer at dbt Labs. I’ll be hosting this session, which is entitled From 100 spreadsheets to 100 data analysts: the story of dbt at Slido. We’ll be joined by Daniela, Michal, and Andrej from the Slido team. The team will take us through their journey from a small group, with a dream of cultivating a data-informed culture within their spreadsheet-dependent startup, to a data team deploying a full suite of tools, including dbt. Before handing it over to the team, I wanted to call out a few things. All chat conversations are taking place in the Coalesce dbt Slido channel in dbt Slack. If you’re not part of the community already, you have time to join now: just go to getdbt.com/community and search for the Coalesce dbt Slido channel once you enter that space. We encourage you to ask other attendees questions, make comments, or react to any point in the channel.
After the session, the speakers will be available in the Slack channel to answer your [00:01:00] questions. However, we encourage you to ask those questions at any point during the session. I also wanted to call out that the team will be using their application, Slido, to do live polls during the session. If you’d like to take part, have a device that can read QR codes ready, or go to Slido.com and enter the code dbt. With that, let’s get started. Over to you, Daniela.
[00:01:20] Daniela Barokova: Thank you, Sara. And thanks everyone for joining us today. As Sara mentioned, we’ll be speaking about the story of dbt at Slido and our journey from a hundred spreadsheets to a hundred analysts. I’m Daniela and I work as a data engineer at Slido, and I’m here today with my two colleagues, Michal and Andrej. Michal is working as an analytics engineer and Andrej works at Slido as a data engineer. Now, [00:02:00] before we begin, I would like to learn a little bit more about you, so could you please go to Slido.com and enter the code dbt? Alternatively, you can use the link that Sara shared on Slack, or scan the QR code on the right-hand side to join.
And we have a question for you: we would like to know what the current stage of your data infrastructure is. Are you in the dbt age, the spreadsheet age, or the data warehouse age? So let’s wait for some votes to come in. Michal, do you remember those times when we were in the spreadsheet age?
[00:02:51] Michal Kolacek: Unfortunately, it still haunts me in my dreams, I think. However, it’s been a while, so now I can [00:03:00] get some good sleep at night. There were some tough times back then.
[00:03:06] Daniela Barokova: Yeah, I see the votes are coming in. We have a pretty strong dbt community online, of course. For those of you in the spreadsheet age, I can relate, as can Michal, and I believe we can give you a little bit of hope during these times. For those in the data warehouse age who are thinking about switching to dbt, we’ll explain the impact dbt had on our organization. And for those in the dbt age, we have prepared what I believe is a pretty interesting demo of how you can leverage some open-source tooling that we’ve built along the way.
So I hope all of you will find something to relate to during the talk. Let’s go directly into it, and please keep your phones ready; we will have another poll [00:04:00] in a couple of minutes. Now, just to give you a little bit of background about Slido: Slido is an audience interaction platform that helps leaders better interact with their employees.
With Slido, it’s possible to crowdsource questions that drive meaningful conversations, but also to run live polls like the one you have just seen, to capture valuable feedback and insights. Slido was founded nine years ago, and last year we were acquired by Cisco. During these nine years, we’ve seen quite rapid growth.
Last year we crossed the milestone of 1 million events that have used Slido to create better interactions, and this year we have crossed another million. We had a chance to work with [00:05:00] clients and conferences like South by Southwest, Web Summit, Pulse, and Money 2020. From the conferences, Slido then made its way into companies’ meeting rooms and is now used at, I believe, at least 40 Fortune 100 companies. But of course, as you can imagine, it wasn’t all sunshine and rainbows. There were many challenges along the way, but today we are here to speak about the ones we faced as a data team.
Back in 2019, as Sara mentioned, we practically didn’t have any data analytics infrastructure, and all reporting sat in Google spreadsheets. In practice that meant a ton of tasks were done manually. Just to give you an example: to be able to work with data, we had to manually extract data from our CRM every Sunday. You might be asking why Sunday: [00:06:00] because we have our Monday morning meetings every Monday, so we needed to have fresh data there; but also, it wasn’t even possible to download bigger batches back then.
Another example was the tedious task of tracking ending subscriptions, knowing which customers’ subscriptions were ending, not just for the calculation of retention numbers, but also for running a continuous business with these clients. And of course, we had to do all of this manually.
These are just two examples, but when I reflect on them, I can hardly believe that we were able to survive those times. Anyway, as you can imagine, we knew that something had to change. We knew that in order to improve [00:07:00] our data game, we wanted to build a data-informed culture.
That was also the time when our current head of data, Meric, joined the team, and we were given a shoestring budget of $500 per month. Now get ready, here’s another question for you: what would you do if you were given a task like this? Would you start crying, would you take a short sabbatical, or would you roll up your sleeves and dive into the work?
I hope you have your phones ready and you can start voting. How about you, Andrej? What would you do?
[00:07:47] Andrej Svec: I don’t know. In this pre-Christmas time, I’m very tempted to say that I would take at least a one-month sabbatical, so that I could start working after [00:08:00] Christmas in January. So I guess I vote for a sabbatical, but only for one month.
[00:08:06] Michal Kolacek: And Andrej, if it was summer, would that also be a one-month sabbatical?
[00:08:14] Andrej Svec: The summer is not much better, actually, but if I remember correctly, I believe we did the very first data warehouse implementation during the summer. So I guess I’d roll up my sleeves; it was summer.
Yeah, this is actually a funny slide, but...
[00:08:34] Daniela Barokova: I believe that everyone would eventually accept the challenge.
So let’s have a look at what our answer was. Our solution at the time was to build a pretty standard ETL pipeline based on AWS Athena, orchestrated by Apache Airflow, and visualized by an open-source tool called Superset. As there was not [00:09:00] much Athena-specific tooling around at the time, we also built in-house tools for managing materialized views, testing the results, and even ensuring that the test coverage of our materialized views (our models, as we call them) doesn’t decrease over time. And if you’re asking what the distribution of those $500 was, here’s a rough rundown: $150 for a replica of the production database, $200 for storage and the query service, $100 for servers for Apache Airflow and Superset, and $50 for other miscellaneous things.
So that was it. We finally had something in our hands, a place where we could scale our data. And since Superset is an open-source platform, we had a chance to provide data to the whole Slido team. So it’s [00:10:00] pretty great, right? Compared to the spreadsheet era, at least.
But as you can imagine, although this was way better than what we had before, some problems came along the way and there were, again, some challenges. Let me give you a few examples. First of all, because it wasn’t possible to set up a staging environment, we were quite often encountering failures in production.
These were happening for basically two reasons. First, we were unable to locally test the changes we were introducing to the warehouse, and all changes were pushed directly to production; if we made a mistake, it was there. Another thing that happened quite often was that downloaded or exported data arrived corrupted in the warehouse, so it would break during the nightly jobs or when it was [00:11:00] refreshing. As a result of these failures, our CEO, and other colleagues too, would not get the fresh data digest the following morning. And even though he was very nice about it and didn’t make a big deal of it, you don’t want your CEO to be the alerting tool you have in place.
So that was the first one. Secondly, another set of issues came along with documentation. We wanted to share the knowledge, of course within the team, but also with the rest of the company. But it was difficult to keep it up to date and scale it when it was separated from the code itself and it was necessary to copy-paste all the information in multiple places.
It wasn’t version controlled and it wasn’t reviewed, so we were not able to make sure it was always accurate and updated. Last but not least, we were missing visibility into where and how our data was [00:12:00] used, which often led to broken dashboards. For example, upon a change in a table, there was no scalable way of knowing that some chart depending on this dataset needed to be updated. We were only able to find these issues manually, or a colleague would contact us that something was broken. So it was not scalable.
[00:12:30] dbt, the missing puzzle piece #
[00:12:30] Daniela Barokova: But during all this time, we had been keeping an eye on dbt, as it seemed like an answer to most of these struggles. Using it with AWS Athena wasn’t possible for quite some time,
but then, with the introduction of externally provided adapters, there was little reason not to switch. So we actually did; that was almost eight months ago. Now I would like to walk you through the impact it had on our data [00:13:00] team and analytics infrastructure here at Slido.
First of all, those production failures were addressed, thanks to staging environments. Running dbt locally against separate testing schemas enables the creation of a staging environment where we can safely fail and run standardized data tests. Thanks to this, we have higher reliability, and errors were reduced to a minimum. Another great benefit is that the data warehouse is much more stable; there is almost never downtime where the data would be corrupted. At worst, we have stale data from yesterday, which is not so bad in our case. Another set of issues I mentioned was documentation. Now with dbt, the documentation sits right next to the [00:14:00] model itself, it is version controlled, and it can be reviewed along with the pull requests and the changes you’re introducing to the model. Also, thanks to Jinja templating and the ability to reference columns in multiple places, it’s just much easier to keep it up to date and available for everyone. Plus, we were able to make sure that new columns that are undocumented or untested are automatically reported, and also to push the documented columns to Superset, our visualization tool.
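As an aside on the staging environments Daniela describes: in dbt, this kind of setup is typically expressed through targets in `profiles.yml`. A minimal sketch of what a two-target profile could look like with the community Athena adapter; the profile name, bucket, and schema names below are made up for illustration:

```yaml
slido:                       # hypothetical profile name
  target: dev                # everyday local runs default to the safe dev target
  outputs:
    dev:
      type: athena           # community-provided Athena adapter
      s3_staging_dir: s3://example-bucket/dbt-dev/
      region_name: eu-west-1
      database: awsdatacatalog
      schema: analytics_dev  # separate testing schema where it is safe to fail
    prod:
      type: athena
      s3_staging_dir: s3://example-bucket/dbt-prod/
      region_name: eu-west-1
      database: awsdatacatalog
      schema: analytics      # production schema, built only by the scheduled job
```

With a setup like this, a plain `dbt run` builds into the dev schema, while the orchestrated production job would explicitly pass `--target prod`.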
These are the tools that my colleagues will showcase in a minute. Now, with better documentation, people are less afraid to work with data, which is great. And last but not least, the broken dashboards I mentioned: thanks to dbt exposures and an internally [00:15:00] developed data lineage tool, Superset dashboards are now put into dbt docs as exposures, with the underlying dbt models referenced.
So now we have better visibility into where our data is used, and when a model is updated we can easily check whether we need to change something in the visualization layer as well. Now, I would like to give way to my colleague Andrej, who will show you how you can improve the coverage of your documentation and walk you through this part.
[00:15:37] Andrej Svec: So, having the documentation finally in one place with the actual queries, we also wanted to have more visibility into what is actually covered and what is not. For this we created a Python-based package called dbt-coverage, and the package can compute the documentation [00:16:00] and test coverage of the individual columns in your dbt project.
So without further ado, let’s take a look at how it actually works.
[00:16:11] dbt-coverage #
[00:16:11] Andrej Svec: Here I have a sample project called jaffle_shop; it’s the example project from dbt Labs. And I have the Python package installed. So let’s take a look at how we can do it. First of all, we have to materialize the individual models, which is done with dbt run.
When the models are created, we need to generate the documentation.
And once this is done, you can see that some files were generated. These are the standard dbt artifact files used inside the project, and [00:17:00] we can compute the coverage from them. So let’s take a look at how it works. We can run the dbt-coverage script; let’s first compute documentation coverage for this project.
So let’s run dbt-coverage compute, and we can see that it did something and then printed out this nice table telling you how many columns you actually have covered in your dbt project. We can see that the customers table, for example, has six out of seven columns covered.
But the payments columns, for example, are not covered at all. Since it’s very hard to document your whole project at once, we also wanted to make sure the tool gives you the possibility to continuously improve.
To demonstrate this, [00:18:00] let’s modify the customers query and add a new column. Imagine we need a new column here. We hit dbt run and, let’s say, we forget about the documentation; we just added the column and it didn’t come to our minds. So we materialize the model and run dbt docs generate. The documentation is generated, but the new column is missing from it. Now, let’s compute the coverage a second time and see if anything changed. We can put it into another file, not coverage.json but coverage-after.json, and then we can see that the coverage of the customers table has actually dropped: it was six out of seven, but now it’s six out of eight. And what the tool gives you [00:19:00] is another command, dbt-coverage compare, with which you can compare the coverage after the changes you made with the coverage before you made them.
Now let’s hit run, and you can see that this is a standard view in many coverage projects.
It tells you the change in the coverage: no tables have changed, but one column was added, and it is actually a missed column. From this summary alone it would be pretty hard to find which column you missed, which is why we also print the column that you forgot to document. We can see it was non-existent before and now it’s uncovered, so it’s a new miss, and this is what we want to report and avoid. Normally, this is something you would run in your pull request pipeline; by using this tool, you can make it much harder, or almost impossible, for anyone to merge anything without documentation.
[00:20:00] And since the documentation coverage is never decreasing, it’s either staying at the same level or improving, this way you can hopefully slowly get to perfection.
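The compute/compare workflow Andrej demonstrates can be illustrated with a small self-contained sketch. This is not the real dbt-coverage implementation; it only mimics the idea on a simplified structure, and the table and column names below are made up:

```python
# Illustrative sketch of the idea behind dbt-coverage (not the real code).
# Tables are modeled as {table: {column: description-or-None}}.

def compute_coverage(tables):
    """Documentation coverage per table: (documented columns, total columns)."""
    return {
        table: (sum(1 for d in cols.values() if d), len(cols))
        for table, cols in tables.items()
    }

def new_misses(before, after):
    """Columns undocumented in `after` that were documented (or absent) in
    `before` -- the "new misses" a pull-request check would flag."""
    misses = []
    for table, cols in after.items():
        prev = before.get(table, {})
        for col, desc in cols.items():
            if not desc and (col not in prev or prev[col]):
                misses.append((table, col))
    return misses

# Example mirroring the demo: add an undocumented column to `customers`.
before = {"customers": {"customer_id": "Unique customer id",
                        "first_name": None}}
after = {"customers": {"customer_id": "Unique customer id",
                       "first_name": None,
                       "loyalty_tier": None}}  # new, forgotten column

print(compute_coverage(after)["customers"])  # (1, 3)
print(new_misses(before, after))             # [('customers', 'loyalty_tier')]
```

Note that the pre-existing undocumented `first_name` is not reported as a new miss: the check only blocks coverage from decreasing, which is what lets a team improve gradually instead of documenting everything at once.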
[00:20:09] dbt-coverage is open source #
[00:20:09] Andrej Svec: All right. So we saw that, and the best part, as you already saw in Slack, is that this is open source and should hopefully work on your project too.
We would be very happy if you gave it a try; let us know in Slack, or anywhere else, whether you like it or not. We’d be happy if it serves you the same way it has served us.
And now, over to Michal for the second tool.
[00:20:41] Michal Kolacek: Thank you, Andrej. As you’ve heard already, we are using Superset as our visualization layer, and we wanted to make sure that dbt and Superset communicate with each other well.
[00:20:55] dbt-superset-lineage #
[00:20:55] Michal Kolacek: And that actually gave life to dbt-superset-lineage. [00:21:00] It’s a package that does two things. Firstly, we wanted to push column descriptions from dbt docs to Superset, to enable our end users to see those descriptions directly when creating charts.
Secondly, we wanted to include Superset dashboards in the data lineage in dbt docs to avoid those broken dashboards; if we make any mistakes, we can easily see where a deprecated column or deprecated table makes an impact.
How does it actually work behind the scenes? Similar to dbt-coverage, dbt-superset-lineage is a CLI application.
It works from the command line and leverages three things. Firstly, manifest.json, a special JSON file that contains all the information about your dbt project; it’s generated every time you hit dbt compile, which also [00:22:00] runs as part of dbt run or dbt docs generate. Secondly, dbt exposures, a feature designed as part of dbt to define the downstream dependencies of your dbt project.
And lastly, of course, we use the Superset API. But enough talking; let me show you how it works in practice. Coming to a setup similar to the one Andrej showed a couple of minutes ago, I have a jaffle_shop dbt project open, and in particular I’m looking at the customers table, just like Andrej. As he showed you, there are a couple of columns that I documented, and I also have the customers table in Superset.
What I’d like to do is take the documented columns from dbt [00:23:00] and put them into Superset. How do I do that? Provided that I have the dbt-superset-lineage package already installed, and that I have the access tokens saved in my environment variables, I can simply type dbt-superset-lineage; what I’m going to do is push descriptions, and the last thing I need to provide is the link to my Superset. I run that, and it takes a couple of seconds to contact the API. All done. So when I click on the customers table in Superset, I want to create a new chart, and just as an example, I’m going to create a new KPI. Let’s say I want to see how many customers we have, so I create a new [00:24:00] metric here, counting how many of them there are. Then, when I click on columns, I can see customer ID and, right next to it, a little "i" icon; when I hover over it, there’s a description.
"This is a unique identifier for a customer." That’s exactly what I want. I just click on it; I know what I’m doing, I know it’s safe. I can then save the chart, name it "customers," save, and go to the dashboard. And now the chart was automatically added to the dashboard. Amazing. So I used the descriptions from dbt docs directly in Superset to build a chart.
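Conceptually, the push-descriptions step pairs the column descriptions stored in dbt's manifest.json with the dataset's columns in Superset. A rough, self-contained sketch of that pairing: the manifest fragment below is hand-made (the real artifact holds far more fields), and the Superset payload shape is an assumption for illustration, not the exact API schema:

```python
# Hand-made fragment shaped like the "nodes" section of dbt's manifest.json.
manifest = {
    "nodes": {
        "model.jaffle_shop.customers": {
            "name": "customers",
            "columns": {
                "customer_id": {
                    "name": "customer_id",
                    "description": "This is a unique identifier for a customer",
                },
                "first_name": {"name": "first_name", "description": ""},
            },
        }
    }
}

def column_descriptions(manifest, model_name):
    """Collect the non-empty column descriptions for one model."""
    for node in manifest["nodes"].values():
        if node["name"] == model_name:
            return {col: meta["description"]
                    for col, meta in node["columns"].items()
                    if meta.get("description")}
    return {}

def merge_into_superset_columns(superset_columns, descriptions):
    """Fill dbt descriptions into Superset-style column payloads.
    The payload shape (a list of dicts with "column_name") is an assumption."""
    merged = []
    for col in superset_columns:
        updated = dict(col)  # avoid mutating the caller's payload
        if descriptions.get(col["column_name"]):
            updated["description"] = descriptions[col["column_name"]]
        merged.append(updated)
    return merged

docs = column_descriptions(manifest, "customers")
payload = merge_into_superset_columns(
    [{"column_name": "customer_id", "description": None},
     {"column_name": "first_name", "description": None}],
    docs,
)
print(payload[0]["description"])  # the dbt description now fills the field
```

The real tool would then send such a payload to the Superset API; the sketch stops short of the network call to stay self-contained.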
Now the other way around: I have a Superset dashboard, and I want to make sure it is reflected, with links, next to the customers table in dbt docs. I can go back to the command line and use a similar command: I write dbt-superset-lineage. This [00:25:00] time I want to pull dashboards. I just give it the link to my Superset, wait for it to run, and it tells me that the result was transferred to a YAML file under models/exposures. At first I can’t see anything, but if I click again, it opens up. If I open the file, there’s the customer dashboard, at this particular link, with a reference to the customers table. That’s great, but how does it look in dbt docs in production?
I can simply generate all the docs, and when it’s done, I can serve them on an available port.
Now it’s our dbt docs running locally, and when I go to the jaffle_shop models, [00:26:00] to customers, under "Referenced By" there’s the exposures part, and I can see the customer dashboard as the one that references it. When I open it up, it has a separate page with a direct link to this exposure. So I can even see which dashboard it’s referenced by.
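The pull-dashboards direction ends in an exposures YAML file like the one Michal opens. A small sketch of rendering such an entry; the dashboard name, URL, and owner below are invented for illustration, and the real tool writes a richer file:

```python
def exposure_yaml(name, url, ref_models, owner="Data Team"):
    """Render a minimal dbt exposure entry as YAML text.

    An exposure declares a downstream consumer (here, a dashboard) and which
    dbt models it depends on, so it shows up under "Referenced By" in dbt docs.
    """
    lines = [
        "version: 2",
        "",
        "exposures:",
        f"  - name: {name}",
        "    type: dashboard",
        f"    url: {url}",
        "    owner:",
        f"      name: {owner}",
        "    depends_on:",
    ]
    lines += [f"      - ref('{model}')" for model in ref_models]
    return "\n".join(lines) + "\n"

print(exposure_yaml(
    "customer_dashboard",
    "https://superset.example.com/superset/dashboard/1/",  # invented URL
    ["customers"],
))
```

Generating this file from the dashboard metadata returned by the BI tool is what keeps the lineage in dbt docs from drifting out of date.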
[00:26:23] dbt-superset-lineage is also open source #
[00:26:23] Michal Kolacek: Amazing. And since it’s an open-source project, similar to dbt-coverage, you can play around with it and hopefully also share some feedback with us. Thanks a lot, and back to you, Daniela.
Thank you guys.
[00:26:41] Daniela Barokova: I hope that some of you will actually be able to leverage some of these tools. So now let’s just wrap it up with some final thoughts from me.
Even though we are only midway through our journey from a hundred [00:27:00] spreadsheets to a hundred analysts, there are some lessons learned that we would like to share. dbt is a mature framework for analytics engineers; thanks to its adapters, even a non-standard processing solution like Athena can be supported and further expanded upon.
Second of all, automation is obviously the best way to reduce human error. As you’ve seen during the demos, even the best of us might forget about something, but automated tools like these can run those checks for us. And the most important lesson: a staging environment is indispensable for analytics engineers, and no analytics engineer can work without it. We have learned that the hard way. We believe we are heading to a future where everyone will need to analyze data, where everyone will need [00:28:00] to be a data analyst, and we will actually have those hundred analysts. There is a long journey ahead of us, with more data education and data democratization, and it’s all been enabled thanks to dbt. So thank you for that.