Table of Contents

dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake

Ricardo is a solutions architect specializing in machine learning.

For modern investment management institutions and startups, gaining a competitive edge in the age of digital assets and boosted retail investing is difficult.

The difficulty comes from:

  1. variety of sources including fundamental data, market data, and alternative data, real-time feeds
  2. the sheer volume of order flows in today’s markets

Now, fintechs are figuring out how to process these large amounts of data along with traditional Wall Street firms.

We believe a lot of business logic to compute metrics such as NAV, peer groups, and trading benchmarks are mostly expressible in SQL but the problem is a lack of reliability, performance, and rigor around deployment of these rules-based models to production.

Databricks, Delta Lake, & dbt provide the perfect trio for saving investment management firms money and boosting productivity by allowing analysts to deploy SQL pipelines much faster. This is especially true for the variety of datasets described above.

In this talk, we will show portfolio managers, quants, and finserv practitioners how to deploy modern investment management pipelines within minutes using dbt + Databricks.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

Elize Papineau: [00:00:00] Hello, and thank you for joining us at Coalesce. My name is Elize Papineau and I am a senior analytics engineer at dbt Labs, and I’ll be the host for this session. The title of this session is DBT for financial services. How to boost your returns on your SQL pipelines using dbt, Databricks and Delta Lake.

And we’ll be joined by Ricardo Portilla who is a lead solutions architect at Databricks. Before we get started, I want to point out that all chat conversation is taking place in the Coalesce-Databricks channel of dbt community. If you’re not part of the chat you have time to join right now, please visit community.gitdbt.com and search for Coalesce-Databricks when you enter this space. We encourage you to ask other attendees questions, make comments, react to the session as it’s going on. At any point during the talk we’re all [00:01:00] here together on Slack while we watch this session. Afterwards, the speaker we’ll have a couple of minutes for a live Q&A, and then we’ll be available in the Slack channel to answer any remaining questions.

We encourage you to ask your questions throughout the session so that we can get them queued up for the end.

And now let’s get started. Over to you, Ricardo.

Ricardo Portilla: Awesome. Thanks Elize. So I’m really excited, thank you guys for having me to talk today about dbt, Delta Lake, and Databricks in the context of an industry use case.

So today I’m actually going to focus specifically on investment management, but if you’re coming to this talk from another area of financial services, if you’re doing anything with alternative data, I still think a lot of the same concepts are applicable. So I think it will be useful, especially if you’re a dbt or Databricks fan.

Okay, so just to give an [00:02:00] idea of the agenda, of what we’re going to cover today, I am going to talk about data silos, specifically again for financial services and practitioners who run into this all the time, talk about why a Lakehouse is important for for the financial services industry architecture, and specifically, why dbt with Lakehouse?

I think these are two really powerful architectural concepts or platforms, but I think the combination is really key to simplifying a lot of a lot of industry workflow, and we’ll talk about ideal state architecture, and then lastly, I’ll close with a day in the life of what real practitioners might do at an investment management firm or anybody dealing with a blend of alternative data, unstructured data, structured data, and how to bring it all together under the same umbrella. So we’ll do that and I’ll cover a little bit of a technical demo. All of the things that I’m gonna be talking about today in terms of the dbt project, I have a public link for the project, so [00:03:00] I’ll share it after this and it should be publicly available if in case anybody wants to go and run that and then see how all this stuff is working. All right. A little bit about me. I I spent time at the University of Michigan, getting a math degree and towards the end of my time there, I actually focused on quantitative finance. It was completely outside of the realm of what I studied in in my doctorate program.

But it was fascinating because it was a blank. the financial services We just come off of a lot of trouble in the financial services industry. And so it I got the chance to learn more about practicality there. And then I spent seven years at FINRA doing everything from data intake to a lot of analytics engineering wasn’t called that back then, but a lot of SQL pipelines and applying rigor to those for market regulation, market surveillance, and then more recently at Databricks, I focused largely on the [00:04:00] FinTech space and helping financial services customers with everything from architectural guidance to use case deep dives and things like that.

That’s a little bit about my background. I’ve had tons of experience with SQL pipelines and all of the pains that have gone through that and applying to that. So we’re going to talk about how to simplify that today.

All right. So I think this quote has been repurposed for many different areas. I really love this quote and I’m presenting it here in the context of an asset manager, because I think largely if you look at history, The industry has evolved quite a bit. And I think now, more than ever with the rise of a lot of retail investing, a lot of support for retail investors and really the democratization of finance that, you know, what got data platforms, trading platforms, anybody supporting FinTechs on the investment management side.

"What got us here [00:05:00] won’t get us to the future. "And what I mean specifically is using tools like spreadsheets for for data science or just general BI using Hadoop for any kind of big data workloads or analytics engineering, or even brute force methods. So common example or primary example would be high-frequency trading firms, right?

In today’s world, there are lots of open source technology. There are lots of options for really rich analysis. dbt lowers that cost of curiosity also to do some pretty nice sophisticated SQL analytics. A lot of the high frequency trading from use cases I think are not as in the forefront, and I think the revenue has actually decreased quite a bit just in terms of total folks pursuing that market. So I think there’s a huge opportunity for open source technologies, and a big [00:06:00] opportunity for figuring out the right architecture that’s open and simple, and a lot of these older methods aren’t, quite going to get us to the future, especially if you talk about things like Hadoop that are extremely complex to code and deal with.

Personally, in prior roles, I’ve moved completely away from Hadoop and a lot of those same constructs. It’s phase really about how to make that simple. And of course I have to include a requisite Warren Buffet quote here.

I think this is a really nice quote again, because you can repurpose this across lots of areas. In the context of finance he said, "I don’t look to jump over seven foot bars; I look around for one foot bars that I can step over." And I think that the key to, this methodology is to focus on things like understandability, competitive advantage, price. This was mostly in context of investing, but I think as a data team, who’s trying to [00:07:00] build a future-proof platform that can process vast amounts of market data, vast amounts of alternative data, that ,really, this quote is about keeping things simple and removing complexity.

So that’s something that I think dbt and Databricks do actually really well. Personally, I think there’s always a big challenge in the financial services industry to get your hands on all sorts of data, join it together, get insights, and so you want to do as much as possible to lower the cost of curiosity. and I think dbt does a great job in that area, removes a lot of the, or adds a lot of data quality measures for all of the pipelines that you create, and then what I’m going to show in a little bit is that Databricks really helps in leveling up and being able to join or marry a lot of the alternative data sets that you might get that are highly unstructured, with more structured data.

[00:07:57] Can we reinvent asset management while keeping things simple? #

Ricardo Portilla: So, a match made in heaven there. But [00:08:00] I think that’s, the key from this role. So can we reinvent asset management, take it to the next level, use open source technologies, but still keep things simple? So I think the answer to that is a resounding yes. And I’ll go through a few patterns to do this that I think will resonate with, again, with practitioners who are dealing with market data, dealing with data. but at the same time, trying to figure out how are we going to apply rigor around our unstructured data sets, how do we apply rigor around SQL pipelines? A lot of times, at least in my experience, there’s been so much DIY in the financial services industry that it’s really hard to keep things simple.

And so leveraging really robust open-source frameworks, I think is the key to solving a lot of problems and collaborating as well.

[00:08:55] Why are trading systems siloed and hard to govern? #

Ricardo Portilla: All right. So why are trading systems, in particular, so hard to govern? And [00:09:00] like I said before, I think this applies to anybody who’s analyzing structured data, and then real time data, and mixing that with batch data, and consuming any kind of alternative data. This doesn’t have to be the investment management space particularly.

governance of siloed trading systems

Ricardo Portilla: It could be if you’re doing any kind of a cryptocurrency or you’re a FinTech who’s committed to managing credit risk use cases, same story. So I think what’s really difficult is that trading systems are siloed. You’ll have the top-line going through your more, very low latency, real time pipelines. A lot of times there’s a lot of custom Java ETL that’s actually going, setting up your streaming apps, piping things like real-time market event data, quotes, and then training analysts may be viewing those on operational apps or via other web apps and those tend to get prioritized.

And then everything else that’s a large scale batch data set. Think of your vendors [00:10:00] like Thomson Reuters and other vendors there, they’ll get you batch quotes. This is the bottom line. This is more of your batch pipeline. Those are completely analyzed and always, a lot of times completely separately.

So you may store those in specialized time series databases, because a lot of market data is very naturally time indexed. You need to query, as of particular times, you need to join quotes and options, data, all sorts of data together to figure out a holistic picture. So the problems with things like time series databases, a lot of times is that they are limited in what they can support from a data size perspective, and cost can rise very quickly in a lot of these databases.

So you pull out some data to a data lake, but then it gets hard to constantly keep a consistent view of your data and to have one single source of truth so that you can apply statistics and other advanced use cases around risk to your data. [00:11:00] And then the middle section is completely separate.

A lot of times, the alternative data that you get comes into a completely separate either cloud storage location, sometimes somebody will download it to their laptop, and so even if you do store it in a data lake, something central, a lot of analysts or data scientists may pull those results to their desktop, where it’s very hard to collaborate with anybody, whether that’s a data engineer, analytics engineer, or any kind of BI analyst, trading analyst.

And so it’s just, again, not a lot of upper rationalization going on, sorry. Or just consistency in terms of data that people are analyzing collaborating on. So, these tend to be major challenges and things that we hear all the time, and this is why there’s not this, or there’s a very high degree of data sprawl, tool sprawl, because you can see [00:12:00] that there’s a different process or even code base for a lot of these different pipelines.

Enter Lakehouse. I think the Databricks Lakehouse concept is really simple to grasp. It’s a combination of all the best features of a data warehouse, and then all the best features of a data lake. Personally, in my prior roles and prior companies, we have invested very heavily in data lake because they could support vast amounts of unstructured data.

So things like texts, really important for analyzing earnings reports and just understanding the profile. At the same time they support machine learning, and what Databricks has done is basically optimize this concept so that you can query data directly on cloud object storage. We have thousands of customers globally doing this, and we have extended open source technology to make this the best possible experience for all different personas, whether you’re a trading analyst, data [00:13:00] analyst, to data engineers, and machine learning practitioners.

I think the proof in adoption is here really in terms of how many monthly downloads. Our customers, basically everybody around the world, pulls all these different technologies. And so what we’ve done is basically just try to make the whole ecosystem as open as possible for providing support for open source standards and formats.

[00:13:26] Lakehouse with dbt #

Ricardo Portilla: So why Lakehouse with dbt? So I think that from a collaboration perspective, dbt opens the door for analysts and data scientists and engineers to work together very well. The integration with things like git is really important for collaboration, but apart from that. the, design of the Lakehouse is also open by nature, similar to the, foundation for dbt.

Whether you’re analyzing unstructured data, [00:14:00] like text, images, reports, or you’re doing more structured data like market data, semi-structured data that you might get from your operational apps, you could sort all those centrally in a data lake and you can see here from all these different personas. Everybody’s operating in the same copy of the data.

And so that’s what we mean by open design, Delta Lake provides assets semantics. It gives you ways to merge data together, account for late order arrivals, account for late corrections, which are extremely frequent for trade data and any kind of market data. So it allows you to reconcile those really easily.

And basically, the lake house provides you with a nice performance engine, so that if you’re a quant? and you’re querying the data, you’re going to get the same experience as an analyst, and you’re going to get the consistent view if you’re doing any sorts of statistical or machine learning analysis as well.

So that’s what we mean by combining data in AI in one space so everybody can use it. This is a snapshot [00:15:00] of the architecture with dbt highlighted here. So I think what’s such a big part of why I love dbt, especially for financial services use cases is because I’ve seen the same kind of framework built from scratch.

And it’s a huge operational burden on data teams. There’s testing, which typically has to be supported in a completely different framework from the coding framework. That’s one thing that’s built in. There’s also documentation. So documentation of testing itself, and then documentation of code and lineage is something that’s generally again, you have to either code yourself or use other tools which may be proprietary.

And then lastly, just having all of the rigor and governance around alerting, logging, when you need to do runs or schedule any jobs, all that is available in the same place, [00:16:00] and so from the Databricks perspective, you can do any kind of SQL pipeline orchestration this way. You would use Databricks as a backend.

So Databricks has a lot of features that support good performance for BI, including our product Databricks SQL and so dbt is basically a layer on top of that, where you’re executing your SQL pipelines and the background those are just executing against Databricks optimized SQL endpoints, which include Thoton and generally operate the best on Delta Lake.

So that is your single source of truth here, and then when hooking up to any of these different Databricks SQL end points, you can of course use other tools like Tableau or Power BI, whatever BI tool that you’re comfortable with, but generally here, we’re just going to be talking about dbt’s interface, dbt Cloud.

So one of the things I want to talk about here, which I’m going to show is that there is a great model, I think, which is really [00:17:00] understandable. And that’s that moving the data into cloud object storage is extremely frequent, a lot especially from from a lot of data vendors who will provide the data.

And cloud storage like S3 buckets or ATLS, or whatever the cloud specific storage mechanism is. So I think the key is that as much as you can, dump it into the data lake. That’s a highly automateable process and I’m going to show a process that automates dumping of text data with a little bit of sentiment tagging into the lake.

Market data typically arise in the lake first as well, so the idea here is to leave it where it arrives so that you can have all these breadth of use cases operating directly against it and give analytics engineers the power to write all of the business logic. A lot of times the engineering team actually has a lot of control over the business, which doesn’t make [00:18:00] sense a hundred percent of the time, but I think this is a great construct for again, lowering their cost of curiosity, figuring out exactly what surveillances need to run, what trading strategies you want to implement, looking at benchmarks. So that’s a little bit of what I’m going to show today.

So to to show the the optimized architect, Instead of having completely separate silo data pipelines or silo data in general, the idea here is that you can actually land everything into Delta Lake, whether that’s coming from the streaming source, that’s your first party data, third party data you can see here market feeds coming into Delta from streaming sources. You also have things like Auto Loader, so that anytime you land any file type in your cloud storage bucket, you can actually just land that into Delta Lake with some automation and everything beyond that point comes under the purview of dbt. For example, so if I wanted to compute a real time view of my portfolio, [00:19:00] then as long as my market, data is in my lake, as long as I have quote data in the lake, my first party transaction data, my portfolio data security master, I can do all of this or most of this in SQL and then use dbt as that layer for testing lineage, any kind of incremental modeling, which is going to be key. On the underneath all of that is all of the data bricks compute, so it’ll support high concurrency. It will support serverless compute, and under the hood, all of terms of Delta output use cases, this is just a very small subset of what you’re going to get with this kind of architecture.

Like I said, you have your more BI or SQL centric outputs here, or BI tools like Looker or Tableau, and you’ll be able to do things like market intelligence. You’ll be able to do things like trading benchmarks, whether that’s simple, moving averages, VWAP, all of those outputs will be available for you.

And even things like market [00:20:00] impact, where again, you’re lowering costs of curiosity. You can analyze the full data sets with the power of the SQL end points and the dbt APIs. So that’s one half. The second half of the whole Lakehouse architecture advantages is that, again, data scientists can come. Anybody can come really and implement a value at risk use case. They can implement a statistical arbitrage or anomaly detection. There are tons of rich open source libraries, which will allow you to do this stuff, and so you don’t really have to sacrifice this and you’re hitting those same underlying Delta Lake tables.

That’s where I wanted to head for the streaming architecture, or for the overall lake house architecture.

[00:20:43] Data-driven investing with dbt & Lakehouse #

Ricardo Portilla: Okay. So right now I’m just going to walk through a quick day in the life, and then I’m going to hop into the technical demo and or you can feel free to check out that after the fact as well.

I’m not gonna get too far into the code, but suppose I’m a trading [00:21:00] analyst or an engineer, and I want to understand if I can execute a large trade at the beginning of the day, let’s walk through a few different things that you can compute with dbt as you go. I likely want to know things like "what is the intraday position for a stock? For a portfolio? For whatever instrument I’m analyzing?" And suppose I want to have hourly refreshes for my positions, for what I look at throughout the day.

Ideally also, I want to keep things simple and I want to avoid writing complex merge statements or logic. The true value of dbt is that the API gives you a really easy way to do incremental modeling. In particular, if I wanted to compute this real time or intraday view, if my trays are landing, my first party execution data is landing somewhere, or that’s stored in the data lake and then my bid and ask for a real time third-party stock date or equities is coming into my data lake as well, those are representing the [00:22:00] green rectangles. So dbt can, through incremental modeling, basically consume all these data sets, compute all my positions, and then using that real-time view of how the prices are changing, compute my book value, or compute my portfolio value. So this is a really powerful concept, and the fact that I can do all of this incrementally is is really amazing, especially with just the simple APIs in dbt. Under the hood, a lot of this is coming from the capabilities of Delta Lake to be able to merge data do that in a performable way. we’ve gotten lots of optimizations in Databricks for that kind of workflow specifically. So that’s what’s happening under the hood. In the middle of the trading day, suppose I’m going to want to do visualizations, I’m going to want to basically look at things like alternative data, like my sentiment, and then join that with market data, and so all that’s possible with this tool called Databricks [00:23:00] SQL, which I was talking about earlier. So that’s where these visualizations are coming from, and I want to show everybody how, this works in practice. So I do have my dbt project here. A lot of the models again are doing that incremental ingestion.

So just to give an idea of how that’s working, here’s a really quick example of, let’s say, quote data that I want to incrementally add into my data lake tables. So you can see here that I am specifying incremental merges and Delta Lake is the key to doing this. That’s the foundation for the lake health.

And in all of my models here for all the market data, basically I’m defining my own surrogate or unique key so that I can keep track of, let’s say, ticker and timestamp. For different instruments, you could define your own key, but this is the heart of being able to join all these data sets together and do this incrementally. So this is one of the basic pipelines.

For something a little bit more complex, like I mentioned, [00:24:00] there is a nice way to basically just ingest not only textual data, but apply really quick sentiment analysis on those, and so this pipeline specifically takes all of that intraday book value. It puts unstructured data with a little bit of a sentiment polarity score in there, and this is actually marrying that data together. So what is my average sentiment and how do I overlay that with, for example, my market data, or summaries of market data, my minute bars, or some of my trading benchmarks? And so this model goes through all that and that final piece of being able to display this in the dashboard, all the hard work or heavy lifting is being done by dbt of course. So all those data models are created, but once I’m ready to actually share this with somebody, this is a Databricks SQL dashboard. I can share this with other users. I can schedule this for refreshing every hour, which this actually is. You can see this refreshed 20 minutes ago, probably at the beginning of this [00:25:00] talk.

And then you have other capabilities like exporting to PDF and sharing with other users. But yeah, from an internal reporting perspective, this is a great tool to use. You can see here I’ve got my real-time book value that was created from one model. I’ve got some raw textual data and the sentiment applied here, which it shouldn’t be a surprise for comments like these on the Tesla shorts. In addition to that, I have benchmarks like VWAP incorporated. The sentiment overlaid with actual normalized value and then things like minute bars. Normally, tools like this, where you’d need a proprietary trading platform to provide this for you, time series database, this is all made possible just through two tools in Delta Lake that I had mentioned earlier and all that integration.

Really wanted to just showcase how simple all of this is to put together using these two tools. [00:26:00] The actual project is stored here in this git repository. So again, this will be shared out for anybody who’s interested.

I did want to leave a little bit of time for questions. So let let me check the Slack.

Elize Papineau: Yeah, I can go ahead and read. We have one question right now from Kelvin, please forgive me if I mispronounced your name there.

They are askin:, Databricks delta life tables and dbt plus Databricks seemed to have some similarity with each other. When should we use each?

Ricardo Portilla: Yeah, that’s a great question, and we have a lot of customers that we talk to that are using both.

So we actually published a blog with Bread finance recently who are using both. DLT is being [00:27:00] used to do some streaming work that supports machine learning, whereas for dbt, there was already an investment in dbt for a lot of the customers that we talked to, and so I would say if you already are using dbt, you have hundreds of models in dbt, absolutely continue to use it.

And all you have to do is switch the backend to use Databricks SQL end points as the compute layer. I think it kinda comes down to that level of what you’re already invested in, and then you can make the decision to kind of explore DLT where it’s appropriate. For more than just SQL support, DLT is going to support Python and other things too.

Elize Papineau: I have a quick question, as someone who has zero background in finance, are there any unique challenges with financial data sets that you don’t see working with, maybe, other traditional more BI oriented datasets?

Ricardo Portilla: Yeah, that’s a really good question.

So I think the ability to join unstructured and structured data together is such a big one. In the [00:28:00] past when I’ve had to do analysis or data science and try to get my hands out on unstructured data, it’s always been in a completely separate place. So keeping it centralized in the lake, I think, is a big challenge, but I think more on the domain specific side, late arriving data and corrections is enormous in financial services. So the ability to do corrections and merges and things like- is really key here. So I think that’s why we highlight the Delta Lake pattern and some of this incremental modeling as well.

Last modified on: Apr 19, 2022

dbt Learn on-demand

A free intro course to transforming data with dbt