Table of Contents

The Modern Data Stack: How Fivetran Operationalizes Data Transformations

Nick Acosta is a Developer Advocate and Data Scientist at Fivetran, who enjoys helping developers automate data pipelines

Anna Barr is a product manager for Transformations at Fivetran

Originally presented on 2021-12-07

Transformations are a critical part of the ELT pipeline in the Modern Data Stack– join us to learn how Fivetran is taking advantage of dbt for in-warehouse machine learning and predictive modeling practices.

This session will cover our own analytical use cases, as well as features Fivetran is building to support dbt in the open-source community, like our Visual Studio Code integration, Fivetran-managed dbt packages and our Transformations product line.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Julia Schottenstein: [00:00:00] [00:01:00] Welcome everyone to your next talk at Coalesce, The Modern Data Stack: How Fivetran [00:02:00] Operationalizes Data Transformations. I’m Julia Schottenstein and I’m part of the product team at dbt Labs and I’ll be your host for this talk. We’re joined by Fivetran product manager Anna Barr and developer advocate, Nick Acosta.

And we’ll start by telling us how Fivetran is contributing to the dbt open source ecosystem and how they’re thinking about transformation within the Fivetran product experience. Nick will take us through an internal example. Fivetran uses dbt, plus the machine learning to forecast their ad performance.

if you’re not there already, please join the conversation in the dbt Slack channel Coalesce Fivetran. Anna and Nick will follow up with any questions you may have after the presentation. So without further ado, let’s get started. And Nick over to you.

Nick Acosta: Thanks, Julia. We’re here with The Modern Data Stack: How to Operationalize Your Transformations with Fivetran. I’m not seeing Anna on [00:03:00] with me. Hopefully she’ll be coming up in just a second.

Anna Barr: Hi, I’m here. Hi everyone. I’m Anna, as Julia said, I’m a product manager here at Fivetran and I’m here today with Nick. Let’s talk a little bit about transformations. So here’s what our agenda looks like for the next 30 minutes. I’ll speak briefly on what transformations are what the value of them are and what some of the common challenges are with building them and how we’ve designed tooling to solve those challenges both internally and for our customers.

I’ll touch on a few of those solutions before handing it off to Nick, to discuss how we use machine learning to get value out of the data we collect from variety of different sources within our ecosystem. And then we’ll go into Q&A. So let’s get into it.

[00:03:47] Transformations unlock key insights for your business #

Anna Barr: First data transformations are an incredibly important part of the modern data stack. So when we talk about transformations as a part of an ELT pipeline, we mean the process of taking [00:04:00] data from its raw form in the warehouse and cleaning it up, stealing it down, standardizing and aggregating it into a form that makes it valuable to your particular organization. So this enables you to make that data actionable and ultimately leads to reporting and insights that can really shine a light on how you operate your business. Doing this well and operationalizing your transformations in an effective way means that you can more easily make data informed decisions.

You can create measurable impact for your organization and ultimately make everyone’s job in your company faster. And so the value of transformations is very apparent. But its actual instrumentation is a huge challenge. Arguably the most challenging part of the data pipeline to manage. And there’s some cascading effects of those challenges.

So here are some of the stats that we’ve seen with our. 98% of companies say that their data pipelines are brittle. This is an enormous number. That’s everyone. Maintenance [00:05:00] is the biggest burden on an engineering or analytics team and broad data trust in an organization, which is very important, require us that engineers guarantee a constant flow of accurate and usable data, which is actually really hard to get.

Which leads us into the next figure, which is 68% of the engineers say building transformation is too time consuming to do as well as they’d like to. Because engineers are constantly fighting fires, building the data pipelines, they don’t have time to focus on value extraction. And the onus falls on the people downstream who wants to use the data to parse out meaningfulness.

This means that there can be many definitions of a particular metric floating around the org or a metric never gets built because no one knows where the right data lives or how to manipulate it, to build that insight. Which takes us to the next figure, which is 44% of companies say their most important data is unusable.

So when transformations are [00:06:00] built ad hoc by different people in the company, stakeholders have to rely on inconsistent or incomplete data, which means that those key business decisions that you can drive with data don’t have the proper foundation, which is very risky. Which takes us to the final number, which is that 59% of companies have lost, because of issues with data pipelines. So when you can’t make those key decisions, you lose money, you lose customers, you miss big opportunities to pivot your product and pivot your business. So there’s a massive impact, not handling transformation as well within your organization. So why do we see some of these issues?

Cause this is what your typical data architecture looks like. In this blue box here, you have custom black box logic. And only a few engineers in the company really know how to manage, which means you have huge bottlenecks in the system and much more breakage that is inherent to your data ecosystem.

[00:06:58] The solution: dbt #

Anna Barr: So the [00:07:00] solution that we’ve seen, that we are endorsing at Fivetran, is dbt which we see as the best the best framework for operationalizing your transformations. dbt allows you to expand that blue box out, focus more energy on building a framework that ensures that transformations are robust and reliable and accountable to themselves. And so with dbt, you can build all these really critical pieces of visibility into your data pipeline. So version control and review cycles, testing validation, and because there’s transparency now, and to that piece of the pipeline, many more people in the organization can contribute to generating transformations.

So data pipelines are not just relegated to engineering to manage. So we’ll take a very quick look at some of the features that we’re building at Fivetran. So first I knew VS Code extension that will be available to all dbt users, which will help accelerate your time to commit. Second, dbt packages [00:08:00] that accelerates time to analysis. And third, our dbt orchestration product that helps accelerate you into your insights.

So first our VS Code extension. We want to make it as easy as possible to build dbt models. So we’ve developed a language server that allows you to validate your models as you write them, rather than having to wait until compile and commit and merge to detect errors in your code. So this is something that has always existed for coding languages like Python and Java, but not for declarative SQL with dbt.

And so we’ve designed this extension for BigQuery initially. So you’ll get common, BigQuery functions, auto completed, and defined for you in your IDE. And it’ll use your dbt profile to reference table and column names for data validation as you code. So it auto compiles as you go. And you can right click to show the compiled SQL preview, which will expand any Jinja and table refs into the actual raw SQL that runs when you execute your [00:09:00] model.

We are very excited to get people testing this out and to build your features into this and expand it into other warehouses. And we’ll be launching this formally in the VS Code marketplace later this month. So from there, you’ll be able to see our source code as well and make requests and contributions directly in GitHub, which we are very much looking forward to.

Next we have our dbt packages. So we’ve been building these packages over the last 12 months or so, which are designed specifically for our Fivetran connectors schemas, but are available to anyone who uses dbt. So these packages contain common models for our most popular sources. So we have sales reports from Stripe and Shopify and financial summaries from NetSuite and ticket movement analysis from Zendesk and JIRA which allow you to get started analyzing that data much faster than if you were to build out those tables yourself. So we’ve seen this cut down on the time to get set up with dbt enormously for our [00:10:00] users, and we’re always building new packages and iterating on existing ones as those feature requests.

And just to reemphasize, we love the open source community here. Everyone in the dbt community has been really instrumental in helping us create these models and make them useful, not only for Fivetran customers, but also for the dbt community at large. So huge shout out and thank you to our 90 something contributors so far who have helped us develop these packages.

And then last but not least, we want to help you orchestrate your dbt models once you’ve built them. So with Fivetran transformations, you can schedule your models to run at the same frequency as your upstream connectors, which means lower data latency and improve performance on all of your transformations.

We’re making a lot of improvements on this product, around data observability and visualizations texture which I’m thrilled about. But if you have any questions about these tools, definitely reach out to me or Nick. And with that, I’ll hand it [00:11:00] over to Nick to talk about implementation.

[00:11:04] Fivetran, dbt, and machine learning #

Nick Acosta: Thanks, Anna. Now we’re going to take a look at how some of those things, and I mentioned can be applied to a use case. For my case that’s been with machine learning. So I’m going to be showing how and why I’ve been using these three technologies on my screen to forecast ad performance. So Fivetran runs ads on a number of different services, and we have information on their performance and report on among other things, the amount of clicks and impressions these ads are generating. With machine learning, we can extend these reports into forecasts that will predict future impressions in our cuts. And this will be helpful for informing or maybe altering targets in a really dynamic space like digital ads. If a forecast for the next month was significantly higher than our target, maybe targets could be reassessed and brought up in line with them [00:12:00] or spending could change the models I’m using don’t take into account the amount that we are spending, so they could maybe inform the allocation of spending enabled purchasing on well-performing ad campaigns in well-performing channels before they happen.

Things get more interesting as I start to apply these forecasts and multiple metrics and see some things like how are ad performance forecasting could affect our sales pipeline forecast and vice versa.

I know some organizations developing attributions models here in this space, and I’m looking forward to us getting into that. If you’re attending Coalesce, and in particular this talk, you may already have an effective modern data stack in place and are able to report on a number of different KPIs from all the departments in your organization.

If you’re looking for the next place to retrieve value from your modern data stack machine learning, but there are a number of different ways to make that happen, right? To bring machine learning into your [00:13:00] organization and to build and run it and true to Fivetran fashion. I’ve been focused on building one out that is not only effective, but very easy to use.

So I’m going to get into its architecture right now. So at Fivetran, the product automates data pipelines with fully managed connectors and Fivetran has ads about those connectors running on LinkedIn and running on Twitter. You can see this architecture starts with me using two of those connectors to bring data from LinkedIn ads and Twitter ads about Fivetran into our BigQuery.

From there, this may be a familiar step, I’m using dbt to transform the data coming from Fivetran into a data model that aggregates data coming from both of those sources into a single table. What you may not be familiar with is the dbt that I’m running. To do so I’m using the dbt out reporting package [00:14:00] from Fivetran and I’m pasting that in Slack right now, some more, information about it. But with this package, I’m just like this cow person. I am rounding up the data from across these ad platforms to get a clear picture of our daily spend impressions and clicks all within that single view, that single table. I’m using these first two ad sources, but you can use any combination of the seven that you see on my screen.

And with that, I’m just calling dbt run, haven’t wrote any of this SQL of this step. And I’m resulting with that very clear data model. I’ve dug into the ERD and the schema of some of these data sources. Some of them can be complex, but all of them are very different from each other. So having this package do things like find the tables where clicks are stored for all of them, find the data types that cliques are stored in and match and bring [00:15:00] all of these things together is something I’m glad that I don’t have to do. So I liked this package in particular. I also liked this package because of all the ones that Fivetran has contributed. And I mentioned, like an image and there are many, this is the only dbt package that we’ve contributed that goes across multiple different sources.

And just browsing all the others that are available at the dbt package hub, I think it may be the only one that does so when you consider both Fivetran and non Fivetran packages. So I like it a lot and I’ve been using it quite frequently especially for this use case that I’m showing right now.

But it does have its challenges using this particular package that’s going across different sources. Running a dbt package that relies on different sources has been difficult because the connectors that I’m running start at the same time, but not really. Maybe there’s a couple seconds of latency between them.

I have seen that they definitely almost always run at different links of time. So I’m often seeing us bring in more data [00:16:00] from LinkedIn. So our LinkedIn connector tends to take a little bit longer. So syncing those both up is difficult. Andhaving dbt run right after is also difficult because my dbt is in a different environment, running on a different schedule.

Bringing these altogether in concert is really difficult and also really important for machine learning, right? This pipeline that I’m building out, like Anna mentioned, I’m part of that 98% where this is a pretty brutal pipeline. And if things don’t happen at the right time in the right order, it can break down really quickly.

So I just got preview access to this integrated scheduling that we’ve built out at Fivetran. And I’m excited to bring this package and the rest of the dbt that I’m using out of a VM and into this integrated scheduling and having everything run together. So if you would like preview access to this feature, please let us know.

Now my data has been modeled by that ad reporting package, but it’s still not ready to be consumed by a machine learning [00:17:00] model. And I have to make some further transformations to bring the data to a state where it can be, and I’ll be showing exactly what I’m doing in two slides. But it’s a pretty straightforward process to get data from here to here especially when I’m using dbt. And I’m starting from this ad reporting package data model. Up to this point though, it’s just been ELT, right? I add a machine learning component here. And I’ve been experimenting with a couple of different ways to do. I found two options are basically available to you in BigQuery.

I’ve been using both to some good results. And I think the situation that I’ll explain about my experience and using BigQuery and adding machine learning to it is pretty similar for some other data warehouses I’ve seen. I’ve used this, a similar setup with Redshift and I think that it goes about the same way in the number of options in how you perform it, why you would choose the two.

So with [00:18:00] BigQuery though, they have BigQuery ML, which allows you to create train and use models all in SQL. And this really limits the complexity of getting machine learning model off the ground. So this is what I started with just calling a model that was available, provided by GCP in SQL and having that be trained and available and making predictions all out of SQL.

Really easy to get started and get a model up and running, that can just map the input that I have coming from these transform data into output, into some sort of prediction and get this pipeline flowing in data, moving all the way from source to forecast. But the limits in those complexity also impose limits of flexibility.

I can’t do everything that I want or anything that I want in machine learning, just out of the box with this BigQuery ML tool. Fortunately, I’ve been able to get that additional flexibility from TensorFlow, which is a Python library that allows [00:19:00] you to do just that. Whatever I want to do as far as how I’m on a model to look or how I want it to train, TensorFlow gives that capability.

It’s a little more complex, but much more flexible. Fortunately BigQuery allows you to upload train models. So as I’m building them out, get one trained, I can then bring it into BigQuery through BigQuery ML and have it making predictions without data leaving Fivetran’s data warehouse. I can have it stored as a BigQuery resource.

So either way, whether I’m bringing models in from TensorFlow or creating them directly in BigQuery, that’s a break from the norm for me. In the past, I would be bringing data out and bringing data to models, right? Sampling data out of a data warehouse into some object storage bucket and using a separate tool or in most cases, a completely different platform that was comprised of many [00:20:00] separate tools to develop and create these machine learning.

And then once one was developed and ready to use, I had to constantly sample data out to be able to use that machine learning model and that allowed for a lot of opportunities, for things to break down as data being sampled out. It can change from the source. It can change either in transit and kind of skew things quite a bit.

So what this current architecture I’m using does the exact opposite. I’m bringing data models, as you can see to the data. And keeping everything here in BigQuery, keeping everything in our data warehouse. Since I’m doing that all in SQL, I can run everything out of dbt. And I get all of the inherit, all of the benefits that you all may be used to, and that I like coming from dbt. One of the great ones that you can see on my screen here all the operations that dbt performs. It’s automatically storing these operations in [00:21:00] tables in my warehouse. So I have just checks in data and intermediate steps along the way.

That’s not really something that you’re seeing in a traditional or maybe a different machine learning platform where data is just running around without automatically being stored at every step, including this last step where the dbt that is running these models to create predictions is automatically storing them here in a table in my data warehouse.

As I mentioned at the beginning of the slide, though, everything up until this point that I’m circling here is ELT, which is why I was really excited to show this ad. Because the most difficult parts that I’ve had in this project, the most difficult parts that I’ve experienced in machine learning in general is getting data to here, getting data to a state where it can consistently incorrectly be consumed by a machine learning model is really where I’ve spent the most time and I’ve struggled the most.

Other [00:22:00] difficulties lie in this step, right? When a model has outputted its prediction, transforming that prediction that a model makes into data that is useful. So I bring that up because another way that to say that it’s the most difficult parts of a machine learning architecture are analytics engineers’ best skills. So I really encourage you to, if you’ve been on the fence about machine learning or maybe haven’t gone into it yet, I think now is a great time and that you have a lot of, not only a lot of really tangible skills for it, but some of the most important ones. And it would be a great time to get started if you haven’t already. I think that this architecture is a great way to do so that would really highlight some of the skills that you may have already picked up by using something like the non-dbt steps, right? The non analytics engineering, things that I’m doing here, the model development and model building, 99% of the time, I’m using something that someone else wrote, right? Like using another data machine [00:23:00] learning model that was given to me, you have to do that with BigQuery ML.

There is some flexibility, but you have to use like models that dbt has provided for you. But even with something flexible, like TensorFlow, I’m almost always starting with a similar model used for similar use case and maybe tweaking it here or there, but really getting those difficulties in. How do I get data into this model?

How do I work with it after some of those transformations in engineering, analytics engineering skills, and if you really want to get philosophical about it, a machine learning model is just a particularly big and particularly messy transformation in and of itself. But I am programming this architecture together in VS Code and it’s been working really great for me with that dbt extension that Anna mentioned because I can really start to experiment with things pretty fast and have all of my secret code in the same environment for my transformations in the same environment as my [00:24:00] Python code for machine learning model development.

And not only that, all of my SQL code that you get just in general with VS Code, but with our extension, getting that awareness of my dbt profile and air handling and auto-completion that are specific to our schema is in our data warehouse here at Fivetran has been really valuable.

[00:24:20] Experiment faster with VS Code #

Nick Acosta: And I think has even sped up this experimentation process. Just for an example here, you can see I know it might be small here, but you can see I have the number of input steps here being 21. So the data that I’m bringing into this machine learning model has been the current day’s date for every row.

I’m bringing that a data set that has, for every row, the current day’s date, and then the past 21 days of impressions up to that date, then the model is outputting a forecast of the next seven days. But if I wanted to switch that up and maybe change that 21 to 28, look at how four weeks of [00:25:00] data would model the next seven days of impressions.

All I do is switch a number here in my Python code, and then have the dbt code here matching that number of rows that my machine learning model would expect. And I am up and able to run an experiment. Whereas in the past, I would have to go to my dbt environment, check out my code, alter a couple of lines, check it back in, go to a completely separate environment, make sure that it has the same data that dbt is getting, check that code out, alter it just a little bit, push it back in. So this has really sped up that process of experimentation for something that’s too simple of a process to be so difficult. Now, they’re just one tap away from me.

So this setup is something we’ve been calling the modern data science stack, and it has a number of benefits. One would be I’m managing fewer services here, [00:26:00] right? By bringing data to by bringing models to data, instead of what I’m used to in bringing data to models I’m dealing with a single modern data set here, right?

I’m not dealing with something for analytics and something for machine learning. And this eliminates that data movement between the two systems and not only eliminating things for that can maybe go wrong in that data movement, but it’s also preserving all of the governance, all the security, the data lineage processes and all of the investment that Fivetran has built out into our data warehouse that I lose as soon as I sample data.

Scheduling machine learning as SQL so it’s something like dbt and GitHub, using those two things together. In the previous example I provided I can track and version experiments really easily. And because I’m using dbt here, it’s automatically creating an appending to tables, right?

Every model here that I’m creating and every [00:27:00] step along the way, it’s automatically storing things that could be referenced in SQL. So it’s much easier for me to not only have every models every prediction that a model makes. I don’t have to keep track of what I’m storing and how I’m storing it. dbt is just taking care of that.

For me, in other AI environments, predictions get wasted. As you’re experimenting and trying to find over the right model and trying to find the right output, your experiments can be overwritten. You can write them to a text file that you forget about, or you can just not use them. It’s all there automatically with dbt, which I really like.

And it’s all there in SQL too, right there, right next to my source table in my data warehouse. So it becomes a lot easier to work with these resulting predictions and pulling them out of a deep learning. You can learn more about the modern data science stack and a link that I’m pushing to our Slack right now.

I wanted [00:28:00] to end here with some results that I’m seeing from this project and where I would like to take it. For these graphs, the blue lines are the actual metrics that Fivetran have seen and the dotted orange lines or what my machine learning models have been forecasting. So you can see the product progress kits is quite good. Here. It’s still pretty good. Seven days out, but 21 days out the performance starts to wane quite a bit and get a little bit worse here. And with how Fivetran operates, I wouldn’t see much use in a forecast that wasn’t looking out at least a month in advance. So I need to do some more continued experimentation here on the models I’m using the metrics that I’m using them.

As I do, I’m going to be accumulating more data here, which I think is going to be the biggest aid in this success. As some of the training sets that I’m using here are under a thousand records, so bringing that data up generally in a machine learning model, the more data you use is going to [00:29:00] affect the performance of them.

I’ve especially seen that in time series forecasting. So I’m looking forward to not only having more data, but because of the nature of time, everyday, we’re accumulating more data there. I think that’s going to go a long way here. I think the moving to Fivetran integrated scheduling, getting everything on the same schedule and pipeline is going to go a long way in operationalizing this as well.

Once I have these forecasts all put together, it’s going to help inform and have Fivetran make business decisions quicker in accelerating where to spend our ad allocations, which platforms, which campaigns. Have those happened earlier? How have those happened? Maybe more proactively and get an even tighter lasso on our ad platforms than this cow person. So I’m really looking forward to that. But with that, I hope you’ve enjoyed the last 30 minutes here with Anna and I at Fivetran and have some new ways to run or [00:30:00] write dbt with VS Code or with Fivetran integrated scheduling, or even new ways to use dbt by using it with something like a machine learning.

So if you have any questions, please let us know at Coalesce Fivetran. But for now I’ll throw it back to Julia and thank you for your time. Thanks for having us Coalesce.

dbt Learn on-demand

A free intro course to transforming data with dbt