dbt in a data mesh world

José works tirelessly to acquire extensive amounts of knowledge, be it in computer science, design, or automation. He’s always looking for feedback and ways to improve as a person.

What if dbt was able to connect to multiple databases at once and cut costs and development time? Or even better, avoid lock-in to a data warehouse engine?

In this talk we advocate for using dbt with Presto/Trino, a great open source technology, and look at the roadblocks on the way to a world where every database in your company is just a query away.

This alone would be great, but data mesh is here to make this even better by making every team care about its data, which will lead to a truly data-driven company.

Follow along in the slides here.

Browse this talk’s Slack archives #

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript #

[00:00:00] Mark Maxwell: Hey folks, welcome to Coalesce. I hope you’re all looking forward to this talk. My name is Mark Maxwell. I work for dbt, and I’m leading the go-to-market charge in EMEA for our cloud products. This is a talk I’ve been really looking forward to because, over the past while, I’ve really noticed an increasing popularity in the data mesh approach.

So I’m super interested to hear how dbt can work with that. And I’m super interested to introduce our speaker tonight, who is Jose Cabeda. He’s a data engineer at TalkDesk. He seems to have had a really interesting experience with data mesh so far, has had a large impact on TalkDesk, and is particularly interested in the impact that data mesh can have on how startups can scale their data infrastructure. So I’ll get out of the way and leave it to Jose to run you through.

[00:00:53] Jose Cabeda: Hi everyone. Really great to be here. Thanks for the invite. Today I’m going to talk a bit about some of the [00:01:00] issues that we had trying to scale up, how dbt helped us and what the hell is data mesh.

But to start it all up: I’m Portuguese, I come from the port wine city, and I’m also a bookworm who especially loves the Harry Potter books. But last but not least, TalkDesk: I’m a data engineer at TalkDesk. And I think the most important point of all is: what is TalkDesk? TalkDesk is a company that was founded in 2011.

It started in a hackathon sponsored by Twilio, and we provide contact center as a service, or what we like to call CCaaS. This is nothing more than the following: each company has support requirements, and we provide a service in the cloud for them to be able to make calls, receive calls, and send messages, whatever they need to make their customers happy.

And another relevant point [00:02:00] is the fact that recently we got a 10 billion valuation. And why is this relevant? With the growth of the company, we have also seen increasing data growth. We have seen twice the event types that we had in 2017, and twice the data throughput, the number of events, since that same time.

[00:02:24] Data growth #

[00:02:24] Jose Cabeda: But another point that I would like to focus on is the fact that we have seen an increase in the pace of product launches, to the point where we actually had 20 products launched in 20 weeks, which can be quite hard for an analytics team that is trying to deliver product launch analyses while handling other data requests.

And we need to keep up with the pace of the company. So another point that I would like to focus on, as I was talking about the data growth, is that Talkdesk is event-driven. And this means that [00:03:00] for every action we have an event, and the event is nothing more than the action itself, an actor, and the object, the payload that is associated with it. To relate this to the contact center:

We can have an event that is call started. That is when a user decides to call the company because he has an issue, and this event is created when he starts the call. With call started, again, we have the action, call started. We have the actor, that is, the user, and we have the object, the payload, which can be anything from the phone number to the reason why he’s calling. Later on, an agent picks up the call and another event is created.

Like agent status changed, which is nothing more than the pickup of the call. The status changes from available to busy, and that information is stored in the object of the event. As we go on, many things can [00:04:00] happen here. Hopefully the client is helped, and the whole interaction finishes with a call finished.

And again, we have the action, we have the actor, and we have the object, which again can be the ID of the agent and the ID of the user. And why is this relevant? We have seen that we are increasing the number of events. This is a snapshot from January 2017, when we look at the top 10 most frequent events. In this case, what was true was that, by a long margin, contact updated was the most frequent event. We also had agent updated, which relates to changes to the name and other types of things that we try to track.

But some years later, in January 2021, [00:05:00] the top 10 are completely different. Now we have something called system initialized step and system finished step, which are the most frequent ones. These events are something that came about when we moved on to another system. And now each call that we get is made of steps in an interaction.

That is, the user goes to an agent, who tries to solve the problem. If he isn’t able to, the call moves on to another agent, and this is another step in the whole interaction. And this goes to show how, from 2017 to now, the business completely changed. And in terms of the analytics team, we need to keep up with the pace.
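The event shape described in the talk (an action, an actor, and an object or payload) and the "top 10 most frequent events" snapshot could be sketched roughly as follows. This is only an illustration: the field names, identifiers, and sample values are assumptions, not Talkdesk's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Event:
    action: str                                   # what happened, e.g. "call_started"
    actor: str                                    # who triggered it (hypothetical IDs)
    payload: Dict[str, Any] = field(default_factory=dict)  # the associated object


# Hypothetical sample stream covering one call plus the start of another.
events = [
    Event("call_started", "user-1", {"phone_number": "+351000000000", "reason": "billing"}),
    Event("agent_status_changed", "agent-7", {"from": "available", "to": "busy"}),
    Event("call_finished", "agent-7", {"user_id": "user-1", "agent_id": "agent-7"}),
    Event("call_started", "user-2", {"phone_number": "+351000000001"}),
]


def top_events(stream, n=10):
    """Return the n most frequent event types with their counts."""
    return Counter(event.action for event in stream).most_common(n)


print(top_events(events, n=1))
```

Running the same count over snapshots taken years apart is one simple way to see the kind of shift in event types that the talk describes.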

[00:05:41] Analytics architecture 1.0 #

[00:05:41] Jose Cabeda: We need to be able to interact with the data, and we need to handle this growth and all of these changes and be as lean as possible. We can’t just add more people to the problem. At the same time, the architecture that we had was, I’d say, [00:06:00] a very common one. We had a centralized Redshift, a centralized data warehouse, and all the data was being sent there for analysis. The events came from a Spark processor, and everything else came in using Fivetran, either from internal databases that were mapped to events or from external services like Zuora or Salesforce.

This was all great, except that we found out two things. First, Redshift is really great for analysis, but it’s great for ad hoc analysis; for some other use cases, like real time, we are better served by other types of databases. Another thing that I would like to point out is that, with the increase in data requests, we started seeing the backlog of the analytics team increasing and increasing, and requests that were taking some days or [00:07:00] weeks were starting to take months. We weren’t able to keep up the pace. And this meant one thing: if our internal users aren’t able to make decisions based on the data, because we can’t provide the data in a timely manner,

they will just start making decisions that are more gut-based than data-driven. And this isn’t something that we are looking for. So we were looking for how to solve this problem. At the same time, to give an idea of how we were organized, we had around 10 developers between data engineers and data analysts.

[00:07:44] Data mesh to the rescue #

[00:07:44] Jose Cabeda: And data mesh appeared at the same time that we were having these issues. Data mesh is something that was proposed in an article by Zhamak Dehghani in 2019. And I would like to focus [00:08:00] on some of the core values that are proposed in this article. These are decentralized ownership and a distributed architecture instead of a centralized one; having data and code as one single unit and not as separate entities; instead of centralized governance, a federated one; and something that really resonated with me, which is data as a product and not just an afterthought: not thinking of data as something to use after we have the actual application, the actual product, but thinking of it as a product itself.

[00:08:41] Data mesh at TalkDesk #

[00:08:41] Jose Cabeda: So at TalkDesk we have been doing many things. But in terms of the data mesh architecture, I would like to focus on three things that I think simplify a bit how we changed our mindset to this kind of architecture. And [00:09:00] that is using Trino, having data products, and also using dbt.

[00:09:07] Trino #

[00:09:07] Jose Cabeda: Firstly, what is Trino? Traditionally, with data warehouses like Redshift, Redshift does everything. We insert the data into it, and Redshift is able to run the computation for the queries but also stores the data itself. Everything is aggregated. But using Trino, we try to separate these into separate entities.

We have the computation engine and we also have the data storage, and they must be separate. In this case, Trino is only able to query the data. It’s an open source SQL engine. And what is great about this is that Trino has multiple connectors to connect to the databases that actually store the data. In this case, we can query OLTP databases like Postgres or MySQL, or data warehouses like [00:10:00] Redshift or BigQuery.

And this is actually great because now, using Trino, we can with a single query get the data from Redshift, get the data from another database like Postgres or Scylla, and merge it all, like we have at the bottom.
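To make the "single query across databases" idea concrete without a running Trino cluster, here is a small stand-in that uses SQLite's `ATTACH DATABASE` from Python's standard library. It is not Trino: the two in-memory databases merely play the role of two catalogs, and the table and column names are invented for illustration. In Trino the same join would address tables by `catalog.schema.table`, with catalog names depending on how the connectors are configured.

```python
import sqlite3

# Stand-in only: "main" plays the warehouse, "oltp" plays an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS oltp")

# Hypothetical tables living in the two separate databases.
conn.execute("CREATE TABLE main.calls (contact_id INTEGER, duration_s INTEGER)")
conn.execute("CREATE TABLE oltp.contacts (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO main.calls VALUES (?, ?)", [(1, 120), (1, 60), (2, 30)])
conn.executemany("INSERT INTO oltp.contacts VALUES (?, ?)", [(1, "Ana"), (2, "Rui")])

# One query joins data that lives in two different databases.
FEDERATED_QUERY = """
SELECT c.name, SUM(k.duration_s) AS total_duration_s
FROM main.calls AS k
JOIN oltp.contacts AS c ON c.id = k.contact_id
GROUP BY c.name
ORDER BY c.name
"""
rows = conn.execute(FEDERATED_QUERY).fetchall()
print(rows)
```

The point of the sketch is only the shape of the query: the person writing it names tables from both sources in one statement and does not need to copy either dataset first.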

Bringing it all together, the user isn’t aware that there are different databases, and we can keep the data where we need it most. So with this, instead of a centralized architecture and having to bring everything to a single data warehouse, we now have a distributed architecture, and we address the first of the core values of data mesh. But this isn’t enough. As a reminder, our problem is to try and bring more analysis: the analytics team wants to be faster and more productive. And we had another issue, [00:11:00] and that is that we can query multiple databases, but there is a division, a great divide.

I would say, that is also described in data mesh: the divide between the operational data and the analytical data. First, for the operational data, the application teams that actually create the applications need to have the data in a very normalized schema. This data needs to be prepared mainly for write operations.

In this case we are adding a new row, or we are updating a single row, such as a contact that was updated, and the data is prepared for this. But when we need to analyze the data, we need to transform it. It can be to a denormalized schema, by Kimball or any other approach that we think is best, but we mostly do read operations. And now, instead of querying a thousand rows or only a hundred rows, [00:12:00] we need to run queries that can span months or even years. This can go from millions to even billions of rows, and we need to have it in single-digit seconds. And this change from operational to analytical is one of the most time-consuming tasks.

In the case of an ELT or ETL architecture, this part, the transformation, can take weeks or even months if it’s something really complicated, like a new feature or a new product that’s being added for analysis.
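The move from a normalized, write-oriented shape to a denormalized, read-oriented one can be illustrated with a toy example. The tables, keys, and column names below are hypothetical, and a real transformation would run in SQL over far more rows; the sketch only shows the idea of copying contact attributes onto each call so that analytical reads need no join.

```python
from collections import defaultdict

# Hypothetical normalized (operational) data: one row per entity, linked by key.
contacts = {
    1: {"name": "Ana", "plan": "pro"},
    2: {"name": "Rui", "plan": "free"},
}
calls = [
    {"call_id": 10, "contact_id": 1, "duration_s": 120},
    {"call_id": 11, "contact_id": 1, "duration_s": 60},
    {"call_id": 12, "contact_id": 2, "duration_s": 30},
]


def denormalize(calls, contacts):
    """Copy contact attributes onto every call row (analytical, read-friendly shape)."""
    wide_rows = []
    for call in calls:
        contact = contacts[call["contact_id"]]
        wide_rows.append(
            {**call, "contact_name": contact["name"], "contact_plan": contact["plan"]}
        )
    return wide_rows


def total_duration_by_plan(wide_rows):
    """A typical read-only aggregate over the denormalized rows."""
    totals = defaultdict(int)
    for row in wide_rows:
        totals[row["contact_plan"]] += row["duration_s"]
    return dict(totals)


wide = denormalize(calls, contacts)
print(total_duration_by_plan(wide))
```

Updating a contact's plan touches one row in the normalized form but many rows in the wide form, which is why each shape suits a different workload.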

So one thing that we proposed, to help us drive our analysis, was to think of moving the ownership of the data to the operations teams, to the application teams that are actually the ones that create the data.

[00:12:56] Move dataset ownership to operation teams #

[00:12:56] Jose Cabeda: They are the owners of their own data. [00:13:00] And for this, what occurred was that, at the same time that we were having these issues, we had a new product that we decided we needed to bring to the clients. We decided that each customer needs to be able to access their own data and needs to be able to understand how they are faring, and so they now become data users.

And for these, what we have done was offer the interface; in our case, we have created a self-service BI tool using Looker. But it wasn’t enough just to present the data to them. We decided to bring the data to them, so they can turn it into anything that they want. And for that, to help increase the pace and help streamline some of the processes,

we also built a data platform internally, and this helped us. Now, having the data platform, we started requesting that each [00:14:00] application team create their own datasets. The data that they are generating for the new features and new products will be made available to the clients almost as soon as possible.

So they need to understand that the data that they create is a product in itself, and they are the owners of these datasets, which they should create using the data platform. And finally, as we are looking at this from the standpoint of an analytics team, we can now leverage these new datasets, which are not created by the data engineers and the data analysts but are now created by the application teams using the data platform.

For this to work, just to give a reference, we joined forces with the reporting team, and the design has been really great. It took us around a year to go from concept to global availability. [00:15:00] This is an example of a dashboard. This one was created by the reporting team and business intelligence, but it’s just to show one example of a dashboard.

But the others can query the data as they see fit and create their own dashboards. This is all great, but the data platform is at the core. It’s very central to how all of this works, from moving the ownership to the teams onward. And as we are event-driven, what happens is that the events, when they are created, are inserted into a single queue, what we call a topic, in a Kafka cluster that we have.

These topics are then processed and sent to different queues, which are then processed using an application that we have that is called broker. And this does everything from the cleaning to the denormalization.

And then we send it to another application, and this is nothing more than a materialization of all these events that were [00:16:00] cleaned and denormalized into the databases. From the inception of the data to this point, it can take only some seconds, and that keeps the data prepared for real-time use cases. There are multiple reasons for having this kind of data.

We store the data in databases that are prepared for real time, like MRDB or even ScyllaDB. And this data can now be pulled by the data users, either using some APIs that we provide or using Trino. But later on, we can’t just keep the data there, and we migrate the data to cold storage. In our case, we store it in S3 buckets, and then we can query it using Trino and the Hive connector in Trino.

Again, Trino can query the data in Scylla and MRDB, [00:17:00] also get the data for the months that we have in cold storage, and join it all together in the query. Finally, we decided that, as we have different use cases, aside from Trino the users can also get the data using different APIs. And this is all great because now, with the move of datasets to the application teams, we have more time for analysis. And looking at the data mesh core values,

[00:17:30] Data mesh at Talkdesk #

[00:17:30] Jose Cabeda: We get the decentralized ownership and the federated governance. And we also get data as a product, not just an afterthought. But this isn’t enough. We sometimes need to do further transformations on top of the data that is on the data platform, and for that we have Airflow to orchestrate everything. We still use Fivetran to get external sources, and [00:18:00] up until some time ago we were using AWS Data Pipeline, and this wasn’t actually ideal. First, the process was a bit complex, and to release new data pipelines, they needed to be added by a data engineer.

[00:18:16] Why dbt? #

[00:18:16] Jose Cabeda: And the code was repeated. We didn’t have many features, which we were trying to implement ourselves. But what we found out was that by moving to dbt, we got some things for free, like macros, which allowed us to have DRY code: don’t repeat yourself. There are some best practices that dbt implements and enforces. We also got less code to maintain, because some features that we were implementing ourselves were almost free, like creating the table.

Everything is much simpler using dbt, and we have other things, like the tests and documentation as code, that we get by using dbt. [00:19:00] This is an example. Again, we are running dbt in Airflow. Before, I couldn’t even fit all the code that we had in a single slide, but in a more visual way, what we had before was multiple steps just to create something simple.

[00:19:20] Trino + dbt #

[00:19:20] Jose Cabeda: And on moving to dbt, we are now able to run this type of thing in single steps. And this has improved our code and the visibility of everything that we do, which has been really major. But we are also using Trino, and with Trino plus dbt, we can now query data from multiple databases, not a single one.

We only need a single repo. We only need a dbt project, a single one, to rule them all. We have already migrated some of our tables and views to this. Again, I’m going to explain later why we haven’t already migrated everything. And something [00:20:00] that is really major about having Trino, and I can give an example, is that we had to do a migration from a Mongo database that we had to a Scylla database.

And while before this could take a long time, because of creating new connections and everything, what happened was that a new connector was added by the other team, the new dataset was added to the database, and we just had to do a renaming of the database. Everything was okay. Migration done.

Another part I would like to focus on is the fact that by moving to dbt plus Airflow, and we also have a bit of continuous integration, CI/CD, implemented, what before had to be done by a data engineer can now be done by everyone, by the data analysts also. And this shows in the top maintainers. We have here the top four maintainers at the time I took the picture, and actually only one of those is a data engineer, which is [00:21:00] gold, and which goes to show how autonomous we currently are.

And looking at data mesh, now we get data and code as a single unit, not as separate entities. And again, just to give an idea, we had to grow; we had to do a reorg. Now we have a better platform to prepare for the future and to be more capable of growing with the needs of the whole company.

We moved from initially having fewer than 10 developers to a larger team now. I think now we are much more capable of working with the rest of the company and attending to the new data requests that the clients, and not only the clients, need. As for lessons learned, we had some performance issues depending on the connectors.

[00:21:49] Lessons learned #

[00:21:49] Jose Cabeda: We need to be aware that we need to push down the queries to the database itself, because if Trino isn’t able to do this, it needs to bring all the data into memory and run the query [00:22:00] over all the data. But this has been improving with the great work of Starburst; they are the main developers of Trino.
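The cost difference that pushdown makes can be illustrated with a toy model. This is not how Trino is implemented: the `Source` class, its `scan` method, and the sample rows are all hypothetical, and the sketch only counts how many rows would leave the underlying database in each case.

```python
class Source:
    """Toy stand-in for a database behind a connector; counts rows it hands over."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_transferred = 0

    def scan(self, predicate=None):
        # With a predicate, filtering happens at the source (pushdown);
        # without one, every row is handed to the query engine.
        selected = [r for r in self._rows if predicate is None or predicate(r)]
        self.rows_transferred += len(selected)
        return selected


def query_without_pushdown(source, predicate):
    # The engine pulls everything into memory, then filters.
    return [row for row in source.scan() if predicate(row)]


def query_with_pushdown(source, predicate):
    # The source filters first, so only matching rows are transferred.
    return source.scan(predicate)


def is_busy(row):
    return row["status"] == "busy"


rows = [{"id": i, "status": "busy" if i % 100 == 0 else "available"} for i in range(10_000)]

no_pushdown_source = Source(rows)
pushdown_source = Source(rows)
result_a = query_without_pushdown(no_pushdown_source, is_busy)
result_b = query_with_pushdown(pushdown_source, is_busy)

print(no_pushdown_source.rows_transferred, pushdown_source.rows_transferred)
```

Both paths return the same rows; the difference is only in how much data has to cross from the database to the query engine, which is where the performance issues mentioned above come from.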

And another thing that I would like to focus on is the fact that we had to move our mindset from a service team that handles data requests to a platform team. Now we have a platform, and we need to understand that too. We have satisfaction metrics. We need to make the platform as simple as possible for the internal clients to create their own datasets.

[00:22:26] Next steps #

[00:22:26] Jose Cabeda: And now we provide a service to new users, like application teams. Regarding next steps, I would like to say kudos to Marius Grama. He is the one who created the dbt Trino adapter, which has now been adopted by Starburst, and we are getting new features added, like incremental models.

And these have been really great additions, and I really see us moving all the ones that we have to dbt [00:23:00] using this new approach. We also have new projects coming up in the coming months, like getting data lineage. And also I think that has to do data platform, which I think will improve our internal database and data platform.

Talking about dbt and Trino, I hope I have convinced you that this is a viable approach to increase the pace of analytics. And if you are also interested, all feedback and help is welcome. And again, thank you all. Not sure if we have any questions for Q&A.

[00:23:35] Q&A #

[00:23:35] Mark Maxwell: Yeah, we do, Jose. Thank you very much. And I love the double Harry Potter reference as well. I think we have time for one. There’s two questions here that kind of overlap. It’s great in theory, but how do you enforce it, or how do you drive adoption? You mentioned having a simple setup, etc., and SQL helps, but how do you drive adoption across the board here, which is required for this whole idea?

And then within that, how do you govern, say, [00:24:00] standardization and that sort of thing? What would be your answer there?

[00:24:05] Jose Cabeda: I would say that, yeah, this is a moving target. What we are trying to do is make it as simple as possible. And I would say that the case for events needs to be made first. This is the first battle that we have: to explain it to the application teams.

As they are creating the applications, they also need to create events. And this is the first obstacle we get. Another obstacle that we have is that this actually adds additional work for the teams. And the point that I wanted to focus on in the presentation is the fact of making sure that everyone understands that the data is a product. You are not just releasing a new feature, adding a status mode, adding a new order.

You are also making the data itself a product. When you make the product available, [00:25:00] with new orders being added in the contact center, or, say, a Slack message feature in it, the applications team needs to understand that this is really important. It’s not only the feature: you need to have the events, to have metrics for understanding internally how our product is going.

And you need to understand that the data needs to be available for the clients. They need to understand whether the messages are going through, how many messages are being sent at the same time in real-time use cases, and how many messages have been sent in the past month. And if the teams are not aware of this, it’s really hard to convince them to create the events.

So we try to give this sense of the importance of the data. And we also try to streamline this as much as possible and speed up the building of these datasets.

Last modified on: Apr 19, 2022
