Table of Contents

Tailoring dbt's incremental_strategy to Artsy's data needs

At Artsy, dbt’s flexibility has made it easier for our analysts to contribute to performance wins in our data pipeline.

One of our biggest wins so far has been finding clever ways to incrementalize our longest-running tables, while still solving for 2 main pain points:

  1. improving our workflow, and
  2. modeling Artsy’s complex business model

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Amada Echeverria: [00:00:00] Welcome everyone. And thank you for joining us at Coalesce 2021. My name is Amada Echeverria . I use she / her pronouns and I’m a developer relations advocate on the community team at dbt. I’m thrilled to be hosting today’s session.: Tailoring dbt’s incremental strategy to Artsy’s data needs, presented by Abhiti Prabahar. Abhiti joined Artsy’s data team in November, 2018 and was immediately drawn to their incredible data pipeline, which powers decisions across the business.

In addition to working on the data pipeline, she also supports Artsy’s growth product and marketing teams. Before Artsy, Abhiti was a data engineer at Civis analytics where she mainly worked on improving the pipeline of one of their main data datasets. And prior to that, she graduated from from the university of Pennsylvania Philly with both her BSC and MSC in systems engineering. At Artsy dbt’s flexibility has made it easier [00:01:00] for their analysts to contribute to performance wins in their database.

Over the course of the next 30 minutes,Abhiti will delve into one of their biggest wins so far, which has been finding clever ways to incrementalize their longest running tables while still solving for two main pain points: improving their workflow and modeling artsies complex business model. So before we jump into things, some recommendations for making the best out of this.

All chat conversation is taking place in the #coalesce-Artsy-data-needs channel of dbt Slack. If you are not yet part of the dbt Slack community, you have time to join now. Seriously, go do it. Visit getdbt.com forward/community and search for #coalesce-Artsy-data-needs when you join.

We encourage you to set up your Slack and your browser side-by-side. In Slack we think you’ll have a great experience if you ask other attendees questions, make [00:02:00] comments, share memes, or react in the channel at any point during the BT session. So to kick us off our chat champion, Erica, Louie, the head of data at dbt Labs has started a thread to have you introduce yourself, let us know where you’re calling in from and break the ice.

After the session Abhiti will be available in Slack to answer quick. Let’s get started over to you at Abhiti.

Abhiti Prabahar: Thank you so much for that wonderful introduction. Amada. LikeAmada said, my name is Abhiti Prabahar and I’m a senior data analyst at Artsy, an online art marketplace based in New York city. Today. I’m excited to share with you, how will you or see you have tailored dbt’s , incremental strategy toward data needs and what we learned along the way.

First a little bit about me. Amada said, I studied systems engineering and then worked as a data engineer, automating data pipelines. I came to Artsy to become a data analyst because I was interested in working on product and business strategy while still being technical. But once I came to Artsy, I was super excited to [00:03:00] find that knowing the data pipeline and ecosystem was crucial to being good at my job as an analyst at Artsy.

This is just to explain why I’m so passionate about this topic. And here today, now a little bit about Artsy. As a two-sided online art marketplace, where anyone can buy and sell art, our mission is to expand the art market to support more art and artists in the world. An average transaction takes place with 3000 miles between the buyer and seller.

And we have 1 million visitors every week. We have nearly 1 million in demand, artworks for sale and Artsy by more than a hundred thousand leading international. And Artsy partners with 4,000 plus art businesses. And we are 260 employees and growing. And finally, here’s the team that I have the privilege of working with.

We’re a team of seven spread across New York City, LA and Berlin. Like some of you may have experienced these past two years, our Berlin team was hired entirely during the pandemic, which is why we still don’t have a full team picture yet, but hopefully soon. Now that I’ve done the [00:04:00] intros, I want to take you back a year and a half ago at Artsy.

Imagine this were a four person data science and analytics team with a pipeline that looked like this. Our daily data pipeline bill took around 16 hours to complete the pipeline powered our BI tool Looker, which served over half our 200 person company on a weekly basis. Our team of analysts had a hundred percent ownership of the pipeline, which meant that any time the pipeline failed, it would.

One or more analysts entire day. We had no testing in place, which meant that whenever we wanted to make changes to the pipeline, we’d just cross our fingers and merge the PR, hoping that it wouldn’t break the pipeline the next day. And we didn’t have the bandwidth or skillset to work on proactive pipeline.

As you can imagine, pipeline work was taking up way more time than we wanted it to. We were reactively responding to fires or ad hoc requests, detracting from high impact analysis projects that we wanted to work on, and that we had the skillset to tackle. So naturally for years, we begged to hire data engineers.

So in [00:05:00] May, 2020, when we finally hired our first data engineer, you can imagine how excited we were. The pipeline was finally going to get the attention it deserved and it did. And Emma’s first few months. They migrated the transformation part of the pipeline to dbt something we’d been talking about for years.

Then we, Emma introduced data quality and circle CI testing. So this was really exciting for us because this meant that we, when we made these changes to the pipeline, we could be confident that we weren’t going to blow up a table or a table wasn’t going to be empty the next morning. And sh they also introduced circle CI testing, which meant that we had end to end testing of our pipeline.

The pipeline wouldn’t break. When we accidentally forgot to add a column name to a table or something like that. And finally, Emma introduced transparency into our pipeline. Something that we had never really had before. So we had alerting in Slack to know when the pipeline completed successfully something we didn’t really know, unless we like went to go check Jenkins and they also introduced a dashboard for us to [00:06:00] just be able to monitor how different parts of the pipeline were performing. If the length of the pipeline was trending upwards and things like that. So the speed at which our new data engineering teammates were making improvements only made us more enthusiastic.

We felt like the pipeline was finally getting the love that it deserves. So after introducing all these changes to the pipeline, we were so excited to figure out what to tackle next with our org new data engineering expertise. So we went back to our top pain points to figure out what to work on. Next, looking back, we realized that we had already tackled most of our top pain points at the time.

[00:06:34] Artsy’s data ecosystem #

Abhiti Prabahar: And that the remaining struggle we had was with the length of the pipeline. So we decided that was the next thing to tackle. So now I’m going to take a step back and give you an overview of our data pipeline, how long it takes and how we solved for the length. Our data ecosystem takes in Denny data from many different sources.

So this is a simplified diagram of our pipeline, but we essentially take data from these three sources, click stream data. We use segment. [00:07:00] Domain data from our internal systems and then third-party data we extract and load that data into Redshift, where we do all of our transformations. All now managed by dbt.

So this is where you get, hundreds and thousands of lines of SQL to combine the data, enrich it. And then finally output tables that can be easily digested and used for another. So these tables are then directly exposed to Looker where we even do more business logic there to get the data into a format that people or across the company can use.

We also push some of that data to production. And then finally we pushed some of that data to marketing tools as well. The part of the pipeline that we focus on when we talk about the pipeline build time in general is actually this core ELT pipeline, which extracts our domain data into S3, then loads it into Redshift and transforms it within Redshift.

As you can see here with these screenshots I have from Jenkins, the extract step at the time of screenshot, which is October, 2020, the extra step takes was taking seven hours. The load step [00:08:00] taking an hour and a half, and the transport step was taking seven and a half hours. Obviously each step was taking a long time namely the extract in the transform steps, but because we were an established data analyst team with a clear understanding of our own pain points and what we were capable of improving, we had total tunnel vision for the transform steps specifically.

That’s where all the SQL lived. That’s what we felt like we had the power to actually improve. And so while extract stepped in the transform step. We’re taking around the same amount of time for months. We only focused on the transform stuff. And we got to a point, which I just wanted to point out now that where towards the end we started micro optimizing for the transform stuff without realizing, oh wait, there’s actually also this extra step that we should work on as well.

And I just want to point that out because it’s a clear reminder of how having a, the right tools to solve for your problems and the right people with the right skillset and makes such a difference in the impact you can have. But for now, I’m going to just focus in on the [00:09:00] transform step and how we we’re able to cut down the build time on that transform step.

So this is our transformed stuff. I, if you work with dbt already, you might be familiar with this document or this screenshot on the right. You can’t really see what’s going on, but this is actually the dependency graph that dbt docs generates for you. And the transform step is where the data team spends most of our time on the data pipeline and.

Most of our business logic lids. As of today, we have 316 models to the right here and it’s a crazy web and greater than 50% of the pipeline, bill time comes from building one of our most important concepts sessions. So now I’m going to go into what sessions are just to explain to you why sessions are so important to us and why it was important to us that we cut down the bill time on this part of the pipeline.

[00:09:53] What is a session? #

Abhiti Prabahar: So what does a session? We take the definition of a session straight from Google analytics. Actually, a session is a group of user [00:10:00] interactions with your website that take place within a given timeframe. For example, a single session can contain multiple page views, events, social interactions, and e-commerce transactions.

It’s essentially a container for all these activities. So for example, Artsy is a website of artists and artworks that you browse to find an artwork that you love and might want to purchase. You might come across Artsy through Google search and land on Andy Warhols artist’s page. That’s your first page view.

Then you move to an artwork page. Your second page view, you try to save the artwork, which is your first event. Then you follow the artist, your second event, share it with a friend, cause you love it so much. That’s your social interaction. And then you purchase it. That’s the transaction. A session allows us to grow group all of those activities together so that we can then attribute baby behaviors to each other.

So then we can say things that purchase at the end, came from Google search or that social interaction towards the end of that came from Google search to give a little more here’s an example of what sessions help us answer. So [00:11:00] what is the conversion rate to purchases for users who arrive at Artsy through search versus users who arrived through email?

Sessions allow us to be able to make these types of conclude or have these types of insights and analyze user behavior on the aggregate. So we can look at all sessions that arrive through search all sessions that arrived through email and calculate the conversion rate. So what percentage of those sessions see an artwork page or that see that actually have a purchase and then we can compare it to see which channel is doing better.

So it’s a really powerful tool that we use every day. Completely critical to how we analyze our business. And just a quick example of what this data looks like for those of us who are more visual. We have, this is a much narrower subset of what our sessions table actually looks like, but these are just the core components.

So what you see here is we have the unique session ID. Then we have the timestamp at which the session starts. Then we have a user ID. Then the marketing [00:12:00] channel that the user came from. Then we have all of these like activities or user activities rolled up. So artists page views, artwork, page views saved artworks place offers purchases.

[00:12:12] The Sessions Pipeline #

Abhiti Prabahar: And so as you can see here, it’s it can, you can roll. So much data and this table can get really wide. So we have to be really smart about what data we put into our sessions table and to make sure that it’s not too wide, but still has all the critical information we need for analysis. Now, I want to go deeper into how we build sessions to demonstrate why this part of the pipeline takes so long to build.

So the way the sessions pipeline works again, this is a simplified visual is we take first take our raw clickstream data, which is the, which consists of those page views, clicks, follow saves, etca. That’s mainly coming from segment. As you can imagine, we’ve been tracking our data for around six years now and we have a million visitors now every week.

So [00:13:00] this table is huge. It has 2 billion rows of data, which means that it can take a long time to build. Then the next step is we enrich that click stream data with information that we need for later on analysis, we do things like add a device type. We try to figure out what country the user’s coming from.

We have our own internal tool for deciding what page type it is so that we can aggregate the traffic a little better. So that remains at 2 billion rows again. Then we decide when a session starts, this is, this requires a little bit of logic that we actually mainly take from Google analytics and have adapted that definition to our own needs.

But we basically do time and campaign based logic to find when the start of the session is this is when we figure out when a session starts, which means that we are looking through all of this clickstream data to find the one event that we believe started the session. So this is where the table collapses.

And so we get actually just 260 million rows of data. [00:14:00] And then finally, now that we know when each session each container starts, this is when then we enrich those, that session starts data with the sessions information we want to know for our final tables. So then we take that domain data from our internal systems and we enrich the session, starts table with all the activity we want to know about the user today.

[00:14:22] Incrementalizing sessions data #

Abhiti Prabahar: What I said before, like page views, clicks, saves purchases, everything. So our solution for cutting down the build time on these tables has always been to incrementalize them actually. So you might be asking what is incrementalization? Incrementalization is the process of transforming only a subset of your data in your table.

Usually data from the past one day, rather than transforming the entire table. So if you remember from my last. We have some tables with 2 billion rows of data. A couple of years, a few years ago, we realized that wasn’t sustainable for us to constantly be dropping the [00:15:00] entire table and rebuilding it from scratch every day.

Cause it just took hours to rebuild it. So actually a few years. Prior to dbt when we were actually building our transformed stuff in Ruby, we had already been incrementalizing a couple of our largest tables and that incrementalization was just rebuilding just adding the last days worth of data to the table.

So before dbt, we had already tackled like a couple of our largest tables, but then with dbt as you can see what this code here dbt basically exposed to us as the data analyst team exactly what’s going on under the hood. It’s pretty much all written in SQL. And so it was really easy for us to understand, oh, we could change this from one day to 30 days.

We could decide which timestamp we want to rebuild based off of. And so it allowed us to get more creative in how we were incrementalizing our team. So the screenshot I have here is the incremental code and it’s mostly copied from the dbt docs actually. So if you’re a SQL user, it’s relatively easy to understand what’s going on.

And so the data engineer when [00:16:00] we decided, okay, we want to just expand this to the rest of the tables. It was very easy for the data engineer to go ahead and just call it. These three lines of code change some variables here and there, and then add it to the rest of our tables. So very quickly we were able to incrementalize the rest of our sessions pipeline.

[00:16:15] The final incremental hurdle #

Abhiti Prabahar: So we had the, our entire web sessions pipeline, and then also our entire iOS sessions pipeline just by copying and pasting this code. But of course not every story is that easy up until now. The data engine engineering team was able to quickly incrementalize the entire pipeline bringing the rest of our tables to under five minutes, but then we have this one last table in the sessions pipeline that we wanted to tackle.

So if you look at this graph, literally all of our tables went to under five minutes except this one final one was still at a half hour. That’s the final sessions table. After weeks of going back and forth with a data engineer trying to incrementalize this final table, we realized that there was something wrong and how we were tackling it because it [00:17:00] was taking too long to figure out what the solution should be.

And what we realized as the data team was that we actually had the business context needed to unlock us in this final step. So we took it on ourselves to find a solution. So why was it so hard for it to incrementalize our final sessions table? This is because we have two types of data, immutable and mutable.

Immutable data is data that we know is not going to change it’s data like the page views and clicks and follows that I’ve described earlier that have already happened, making it easy to incrementalize. So for example, you may have seen Andy Warhols artist’s page on June 1st. That’s never going to change that’s history.

And so it’s really easy to just rebuild every, or just add on. Every day’s worth of data because that’s never going to change mutable data. On the other hand is data that’s going to change it’s data. Like how many offers have been confirmed? How many auction purchases have there been, etca? So this is data [00:18:00] that is essentially for our purposes data that’s changing status over time, like purchases.

So till now we had been rebuilding one day’s worth of immutable data on the left. So with all these tables and the entire table of six years of mutable data. So going back to this diagram in a previous slide, I didn’t show you that we had already actually broken down the sessions table. But a couple of years ago what we realized one of our teammates had done was they had actually decided that they were going to.

Break out the immutable data from the sessions table and incrementally build that just for some incremental performance win, but then keep the sessions, mutable data as is rebuilding every day. So once we realized that there was this difference in the data, and we knew exactly how dbts incremental strategy code worked, we had to figure out how we knew that we could be clever about how we would rebuild our mutable data to do that.

We had to figure out what data points we needed to identify sessions with mutable data. This meant [00:19:00] that we needed to take another step back to understand how Artsy’s business model.

[00:19:06] Artsy’s complex business model #

Abhiti Prabahar: So the way Artsy works is we have two businesses. We have the buy side of our business and we have the sell side of our business on the buy-side. We have we partner with galleries and auction houses where they were the galleries and auction houses put their artworks on Artsy to sell. And then under each partner, we have different ways in which users can transact.

And each of these takes a different amount of time to change. For example, if you see the contact for price I’m in the yellow box under gallery, that is a transaction where you can start a conversation with the gallery and then continue a conversation. The like details about the artwork and ask them about like the background of the artwork.

And then finally it can result in purchase. Because that’s a whole conversation, the average amount of time it takes from a creation to purchase is 17 days. It takes a really long time to finally result in a purchase. So [00:20:00] essentially on average 17 days for which that data can change. Then on the sell side, we have the consignments part of our business where users can submit artworks and sell them on Artsy.

And so that takes even longer because you can submit, you have to submit the artwork. Then we at Artsy have to approve it. Then we have to find a gallery to host it, and then we have to put it up at auction. And then finally it has to be sold. Usually it takes like weeks or months to complete.

[00:20:28] Introducing the open_session model #

Abhiti Prabahar: And so we have these five types of transactions that all take different amount of time to change status. And because of this is the solution we ended up coming up with introducing the open session concept, naturally our solution to all this complexity was just a few lines of code. We developed a field called open session, which is built off a condition to only rebuild sessions or rows that had activity that might result in a change of state days or months.

If you look at the code here, we have a few columns here where we looking back at the business model I [00:21:00] showed before, these are all activities users can do that can change. That can change status weeks or months later. So what we did is we said, okay, if any of these activities exist in a session, call it an open session.

Or, and all of these activities had to have taken place in the past 180 days. This is this timestamp filter that we added was just because we didn’t want to rebuild all open sessions of all time. And because that would just take way too long. And we decided that it made sense to do 180 days because we knew that would probably tackle, over 90% of all of our sessions that had this type of active.

[00:21:40] Open_session in practice #

Abhiti Prabahar: This is what open session looks like in practice. I just included the code excerpt from the last slide here so that you can see what exactly is happening. But like I said before, this is just a narrower version of our table. But looking at this, the columns in yellow are the ones that are taken into account when determining what an open [00:22:00] sessions.

And we look at the session startup to see if it was in the previous 180 days. And then we look at columns like placed orders to see if there were any place orders. And then based off that, we decide if it’s an open session. So the top row doesn’t have a place order, but it is in the previous 180 days.

So we don’t need that session to be open. But the bottom row did have a place to order and it was in the past 180 days. So that is considered an open setup.

[00:22:26] The final incremental solution #

Abhiti Prabahar: This is what our final incremental solution looks like for this mutable data. We basically just say rebuild any row we’re open session is equal to one or any row that is in the past one to two days. And that just is, I was just a huge win for us, again, only a few lines of code, but it was such an exciting finding for us because we, as the data team realized that.

That we could actually make these performance wins in the pipeline.

[00:22:56] What our incremental strategy looks like today #

Abhiti Prabahar: So this is what our incremental strategy looks like today. Our entire [00:23:00] session’s pipeline is now incrementalized bringing our transformed step down to five and a half hours due to a combination of the. We have our immutable data, which we rebuild data for the past one day. So that’s just the straightforward data on, because we know that it’s not going to change.

[00:23:18] What I wish we knew 1.5 years ago #

Abhiti Prabahar: And then we have the mutable data where we rebuild all rows that are an open session or all data in the past one day. What I wish we knew a year and a half ago from this experience. First dbt makes it possible for data analyst teams to make performance wins in the pipeline. This was a really big finding for us because we didn’t realize that knowing the business logic meant that we had the best knowledge to make the changes that we needed and making performance wins in the pipeline was also really exciting for us because we couldn’t really do that much before.

The next thing we learned was that our [00:24:00] business logic is too complicated to transfer to another team who isn’t working directly with it. This is something that actually we realized. Through this project, but also something that we could apply to the rest of our work. Something that we’ve noticed with Looker is that with Looker and the capabilities that offers we’re able to allow anyone in the company to analyze data from their space in the company.

And so something I’ve always realized is that I’m not always the one that’s the best equipped, even though I’m the data person. To answer other people’s questions in their spaces. Sometimes I realized that other people in their parts of the business might know their data better and that’s because they know how the business works and what the data looks like to map out to that business.

So this was like a really fun learning for us to realize this is also true for us. We know some of this data more intimately than other teams do. And so sometimes we might be the best task, the best to be tasked with this type of work.

And finally, I [00:25:00] think it would have been really helpful to work with our data engineers, to map responsibilities in our pipeline, based off our skills and expertise with the data world changing as rapidly as it does. How everyone has like so many different types of skills, it would have been helpful to sit down with our data engineering teams say, Hey, this is what we do right now.

This is what we can do. This is what we want to do. And then hearing from the data engineering team, this is what they can do. This is what their skillset is. etca, and figuring out then what we should each own in the pipeline. So what does Artsy data practice look like today? We’re currently seven data analysts and five data engineers.

[00:25:39] Artsy’s data practice today #

Abhiti Prabahar: Our nightly data pipeline takes fewer than 11 hours to book. We now actually have split pipeline ownership where the data science and analytics team owns the business logic. While the data engineering team owns the infrastructure, our pipeline maintenance truly accounts for 20% of data science and analytics teams, data day-to-day work and data engineering proactively works on improving the data [00:26:00] pipeline.

With projects like decreasing the extract step from seven hours to two hours. So if you remember from the beginning of the presentation, I said that because we had the, as the data science and analytics team had total tunnel vision for the transform stuff, we totally never thought about the extract step and how to cut down the build time there.

In the extra step. The reason it used to take so long is because we’re extracting data from Mongo DB, which is really difficult to do, especially incrementally, but our amazing data engineers with their amazing skillset and their background. We’re able to do that and make this huge win to our pipeline, which was really exciting.

[00:26:36] What’s next #

Abhiti Prabahar: Again, proving that having the right people with the right skills tackling the right job is so important. So what’s next for us. Next is expanding our teams work to higher level problems across the organization. Now that we have, now that we maintain the pipeline more minimally than we used to.

And then finally further developing a relationship with data engineering. I think we’re in a unique [00:27:00] position. We have this very established data science and analytics team in like relative to how old Artsy is. But this brand new data engineering team coming in. And so we have a lot to figure out how to map responsibilities in a way that makes sense, given our history with the company.

And so we would love to further develop our relationship with data engineering, to improve workflow workflows for rebuilding data from scratch. So now of course we’ve been incrementally. Table all these tables. Whenever we like notice above or want to make a change to a table, we have to drop the entire table and rebuild it from scratch.

And so we could improve our workflow there. We could implement better alerting when parts of the pipeline are down so that we know when we need to step in as the data science and analytics team or when the data engineering team needs to step in. And then finally, to clarify ownership of our marketing data stack.

So that’s all I had today. Thank you so much for listening. I’d love to hear any questions or thoughts or feedback, or even how you’ve tackled problems like these at your org.[00:28:00]

Amada Echeverria: Thank you for a great presentation abhiti. Incremental materializations are not only powerful, but are also one of the most common areas on which we get questions from community members as they’re scaling. And so folks, luckily you don’t have to fight your way through a crowd to meet the speaker.

Last modified on: Oct 13, 2022

dbt Learn on-demand

A free intro course to transforming data with dbt