Eat the data you have: Tracking core events in a cookieless world

Jeff Sloan is a longtime data practitioner and dbt user, and currently is a Customer Data Architect at Census.

Originally presented on 2021-12-10

Analytics engineering has made more clean, validated data available than ever before thanks to the modern data stack. But despite this, many companies still track, model, and consume their event data outside this powerful combination of tooling.

In this talk, Jeff Sloan (Data @ Census, Treatwell, RJMetrics) argues that you should identify core product and business events within your data warehouse. Businesses that do so will find themselves taking advantage of new opportunities in programmatic marketing and analytics, while safeguarding their efforts in uncertain times for front-end web tracking.

Follow along in the slides here.

Browse this talk’s Slack archives

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript

[00:00:00] Jeff Sloan: Oh, hey Amy. Oh, you caught me in the middle of one of my data experiments. One second. Some of these got a couple of cookies for good measure. Not going to need those anymore in the future. Right.

And let’s get started. Welcome to Love the Data You Have, Eat the Data You Have: Tracking Business Activity with Synthetic Events. I’m Jeff Sloan. I’m a customer data architect at Census. And if you have any questions, comments, hot takes, or things you want to talk about after the presentation today, please jump into the Coalesce Census channel in dbt Slack.

This is a sponsored talk by Census, but much to the chagrin of [00:01:00] my corporate overlords, Census is actually going to feature as only a small part in this trend that I really want to talk to you about today. We’re here to talk about your event data, specifically the event data that you’ve typically tracked via front-end JavaScript snippets and SDKs like Google Analytics, Segment, or Mixpanel.

Historically, this tracking has relied heavily on third-party cookies, but I’m here to tell you there is a better way via analytics engineering: synthetic events. This talk is an impassioned call to use synthetic events. We’re going to talk about what they are, the problem they solve, and why they’re a compelling solution.

But who am I to tell you about this in the first place? I’m Jeff Sloan. I’ve spent the past seven years in data and product. I started my career at a full-stack BI software company called RJMetrics, survived by two major thousand-pound [00:02:00] companies in the modern data stack space, one called Stitch and one called dbt Labs, which I think we all know and love here at a dbt Labs conference.

And after that, I went on to work in BI and product at two tech companies in the UK, eMoov and Treatwell. I’ve consulted with clients on their data strategies and their hiring strategies for data teams. And nowadays I’m a customer data architect at Census. I’ve spent the last five years of my career working with dbt. I’m a longtime lurker in the dbt Slack channels.

[00:02:33] What is a synthetic event?

[00:02:33] Jeff Sloan: And I’m really excited that I get to be here today and speak to you about some of the thoughts that I have around data modeling and dbt in general. So let’s get stuck in with a particular first question. What even is a synthetic event? Synthetic events, as we’re going to talk about them today, are modeled data points that look [00:03:00] like the events that you have in your warehouse, or rather that you might have tracked in your front end, but are actually sourced from your systems of record, like your application databases, CRM, or other tools.

So again, they look like the events that you might’ve tracked in your front end. But they’re not actually coming from your front end. In this case, on the slide, we can see on the left-hand side a table that looks like an orders table in a database. Maybe it’s actually even an orders table that comes after a series of transformations in your data warehouse.

And it doesn’t take too much imagination to think about pivoting or modeling that data into an event-like structure on the right-hand side. And that’s what we’re here to talk about today. Specifically, we’re going to walk through three key benefits of this particular approach. And if you walk away with nothing else, I hope you walk away with these three points.
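As a minimal sketch of that pivot, assume a hypothetical `orders` table with `created_at` and `shipped_at` timestamps (the table, its columns, and the event names are illustrative assumptions, not taken from the slide). Each timestamp on an order row becomes its own event row. The example uses Python's built-in `sqlite3` module as a stand-in for a warehouse.

```python
import sqlite3

# Hypothetical orders table; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        created_at  TEXT,
        shipped_at  TEXT,   -- NULL until the order ships
        amount      REAL
    );
    INSERT INTO orders VALUES
        (1, 10, '2021-12-01 09:00:00', '2021-12-02 14:00:00', 25.0),
        (2, 11, '2021-12-01 10:30:00', NULL, 40.0);
""")

# One row per order becomes one row per thing that happened to the order.
SYNTHETIC_EVENTS_SQL = """
    SELECT customer_id AS user_id,
           'order_created' AS event_name,
           created_at AS occurred_at,
           order_id,
           amount
    FROM orders
    UNION ALL
    SELECT customer_id, 'order_shipped', shipped_at, order_id, amount
    FROM orders
    WHERE shipped_at IS NOT NULL
    ORDER BY occurred_at, order_id
"""

events = conn.execute(SYNTHETIC_EVENTS_SQL).fetchall()
for row in events:
    print(row)
```

In a dbt project, the `SELECT` statement would plausibly live in a model of its own, built on top of whichever orders model the rest of the reporting already uses.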

Synthetic events in the warehouse enable accuracy and trusted reporting because we’re [00:04:00] tracking actions using the same data sets that the rest of your data work is based on. And you’re able to really leverage the business logic that sits with those other data sets. It also means that, in comparison to traditional front-end event tracking, you have access to historical data.

Number two, synthetic events in the data warehouse enable cross-domain product analysis. Many events are tough to access via front-end web tracking, things like a ticket getting created or a credit card swipe in a subscription service. Exposing these events enables the analysis of complex customer journeys that occur both on and off your application.

And number three, synthetic events enable data team autonomy and agility. Data in source systems is frequently accessible via ELT tools like Stitch and Fivetran, but events in the front end often require a product engineering team. With this data accessible in the warehouse, data folks can [00:05:00] model and manage these events independently.

[00:05:05] Proxies vs. reality

[00:05:05] Jeff Sloan: How did we get here in the first place? The truth is, when we do data work, we frequently measure metrics via proxies. And I’m going to explain by way of a silly cookie-themed analogy here. Who here likes cookies? I love cookies. Who here has ever baked cookies? And let’s say someone asks you: the last time that you made cookies, how many did you bake?

Would you measure the amount of cookies by the amount of cookie dough that you rolled out and put on your baking sheets, or by the number of cookies you put on the cooling rack after you pulled them out of the oven? And really what we’re asking here is which of these is a proxy and which one is a direct measurement. Of course you can actually measure what you’re after right here: [00:06:00] they’re on the cooling rack. And if I’m the one baking, then certainly I ended up eating some of the cookie dough, and some of the cookies might catch on fire, so the dough is a terrible measurement of this metric for me. This is silly, but we do this all the time in the business world. And as a result, traditional event tracking via front-end JavaScript snippets and the like is often incomplete, and it’s getting worse.

I’ll walk through an example here. Imagine a SaaS company with a front-end web application and a production Postgres database backing it, like on the left-hand side here. The front-end web application sends ephemeral events to Google Analytics.

The database stores key artifacts so that the application runs. In this world, how many of you have experienced this particular question? A product manager or a marketer comes to you and asks: the account creation numbers are different in Google Analytics than in our business [00:07:00] intelligence reporting. Which is correct?

And why are there differences in the first place? Before I explore the why, which of these do you think is the cookie dough, or the proxy in this situation and which do you think holds the truth? It’s the database. Key points here are that with ephemeral events, you’re typically measuring via proxies like button clicks or confirmation pages visited while your database is the source of truth, actually holding the number of registered users.

And not only do you have the proxy issue, but the measurement of these proxies is getting less and less complete right now. On one hand, that’s a bit of a good thing. Consumers have to now consent to being tracked on the internet through regulation like GDPR and CPRA. So they have more protections of their privacy.

But in addition to that, ad blockers are also disrupting [00:08:00] tracking, and changes to third-party cookie tracking, like Apple’s iOS 14 changes and changes coming from major browsers, are making it more difficult to keep track of the same user as they come to your site, perform some actions, revisit your site after a period of time, and then perform more actions.

[00:08:20] Synthetic events are a powerful alternative

[00:08:20] Jeff Sloan: Synthetic events are a powerful alternative to these front-end event tracking systems. With synthetic events, you still track certain product interactions via front-end activity, but you source core actions from your systems of record. In practice, app interactions like page views or clicking on dropdown menus are still tracked in the front end.

With all the changes coming to third-party cookies, this is going to require either firing some server-side events or using first-party cookies. There are many solutions for this, and I wouldn’t be surprised if there have actually been a couple of Coalesce talks or other [00:09:00] conference talks that dive directly into this topic. But for your core actions, like an order being created or an account being created or a workspace being created, you source these from your systems of record, like your database. Once you load all this data into a data warehouse, you can model it using a tool like dbt into your events data set. Now you can use this data for multiple activities: traditional BI and reporting from the data warehouse, data science, and you can use a Reverse ETL tool like Census to syndicate this event data to consumers such as product analytics tools like Mixpanel or Amplitude, marketing automation tools like Braze, or ads platforms.
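The shape of that combined events data set can be sketched as follows. This is a hedged illustration only: the `Event` fields, the sample rows, and the helper names are all hypothetical, and in practice the same union would more likely be written as a warehouse model than in application code. Front-end interactions and core actions from a system of record end up in one stream, each row labeled with where it came from.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    user_id: int
    event_name: str
    occurred_at: datetime
    source: str  # "front_end" or "system_of_record"


# Hypothetical front-end interactions (page views, clicks).
front_end_rows = [
    {"user_id": 10, "name": "page_viewed", "ts": "2021-12-01T08:55:00"},
    {"user_id": 10, "name": "dropdown_clicked", "ts": "2021-12-01T08:58:00"},
]

# Hypothetical rows from an application database's accounts table.
account_rows = [
    {"id": 10, "created_at": "2021-12-01T09:00:00"},
    {"id": 11, "created_at": "2021-12-01T10:30:00"},
]


def front_end_events(rows):
    """App interactions stay sourced from front-end tracking."""
    return [
        Event(r["user_id"], r["name"], datetime.fromisoformat(r["ts"]), "front_end")
        for r in rows
    ]


def account_created_events(rows):
    """Core actions are derived from the system of record instead."""
    return [
        Event(r["id"], "account_created",
              datetime.fromisoformat(r["created_at"]), "system_of_record")
        for r in rows
    ]


def build_events(front_end, accounts):
    """Union both sources into one chronologically ordered events data set."""
    combined = front_end_events(front_end) + account_created_events(accounts)
    return sorted(combined, key=lambda e: (e.occurred_at, e.user_id))


events = build_events(front_end_rows, account_rows)
for event in events:
    print(event.occurred_at.isoformat(), event.user_id, event.event_name, event.source)
```

Keeping a `source` column on every row is one possible design choice; it makes it easy to see later which events came from front-end tracking and which came from a system of record.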

And now, for the rest of this presentation, I’m going to walk through the reasons why this is such a compelling approach, walking through those three points that I talked about in the beginning, before ending on some considerations as you’re thinking about rolling out this approach. And again, those [00:10:00] three whys for synthetic events are that synthetic events enable accuracy and trusted reporting, cross-domain product analysis, and autonomy and agility for your data team.

[00:10:13] Benefit 1: Data quality

[00:10:13] Jeff Sloan: Walking through benefit one, synthetic events rely on your validated data sets. So when we walked through this slide before, I papered over the provenance of those synthetic events in the bottom part of the diagram, but in reality, the orders being created or the accounts being created, these could be built off the back of a long series of transformations to strip out test orders or test users, as well as compute validated attributes about those orders and users.
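To illustrate building events on top of validated data, here is a small sketch using Python's built-in `sqlite3` module. The `users.is_test` flag, the view names, and the sample rows are assumptions made for the example. The point is that the events view selects from the same validated model that reporting would read, so test orders are stripped out of both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical raw tables loaded from the application database.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT, is_test INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT);
    INSERT INTO users VALUES
        (1, 'ana@example.com', 0),
        (2, 'qa@internal.example', 1);
    INSERT INTO orders VALUES
        (100, 1, '2021-12-01 09:00:00'),
        (101, 2, '2021-12-01 09:05:00'),
        (102, 1, '2021-12-03 16:20:00');

    -- Validated orders: the model that BI reporting would read from.
    CREATE VIEW validated_orders AS
        SELECT o.order_id, o.user_id, o.created_at
        FROM orders o
        JOIN users u ON u.user_id = o.user_id
        WHERE u.is_test = 0;

    -- Synthetic events are built on the validated model, not on raw orders.
    CREATE VIEW order_created_events AS
        SELECT user_id,
               'order_created' AS event_name,
               created_at AS occurred_at,
               order_id
        FROM validated_orders;
""")

bi_order_count = conn.execute("SELECT COUNT(*) FROM validated_orders").fetchone()[0]
event_count = conn.execute("SELECT COUNT(*) FROM order_created_events").fetchone()[0]
print(bi_order_count, event_count)
```

Because both counts come from the same upstream model, they cannot drift apart the way a front-end proxy and a database count can.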

And now when you have this data available in the warehouse, something magical happens. You can actually have a one-to-one match between your BI reporting built off the back of these things and the events data set that you have in your warehouse. But not only [00:11:00] that, it also matches the events data that you might send out to downstream systems. In this case, I’ve built a Mixpanel report to showcase what can happen here.

I have sent sample event data from a data warehouse into Mixpanel and built a report plotting this event over time. And we can see that a query on the data warehouse returns the exact same number as the report in Mixpanel. And I’ve never been able to achieve this in a traditional event tracking world, despite many hours spent trying to make these numbers match up.

And if all I do with this presentation is to save one person a data discrepancy, it’ll all have been worth it. Because the data discrepancy isn’t just an isolated thing where you have to go and investigate the reasons behind different numbers, but it also reduces the trust and confidence in the reporting you [00:12:00] have, regardless of whether there’s a good reason for that difference.

And for every product manager or marketer that comes to the data team and asks why these numbers are different, there are typically a couple of others that don’t know to ask and just lose a little bit of faith. What synthetic events allow you to do is maintain that faith and keep these numbers much more in line.

[00:12:26] Benefit 2: Access to non-product business events

[00:12:26] Jeff Sloan: The second benefit: access to non-product business events. We spoke about generating synthetic events out of, say, product data that lives in your production Postgres database. But you could also walk through the same series of steps for events that live in cloud systems like your ticketing software or your CRM. And some cool analyses shake out with this method. One: what product [00:13:00] interactions have led to support tickets?

You could also look at: are our more recently acquired cohorts submitting fewer support requests? Or even things like: do trials that have a midway check-in call convert more highly than the alternative? And I’m curious here, what are some other flows or funnel-type questions that you wish you could understand at your business, when things cross from the product domain into the non-product and operational domains?

I think there are a lot of interesting things that you can unearth by bringing these data sets together. And you can certainly analyze this data in SQL once you have it in the warehouse. But what’s fantastic is sending this data out to the bespoke tools that are meant to analyze this data and make it accessible for non-data-facile users.

In this case, I’m going to answer the first question here with a report that I built in [00:14:00] Mixpanel by sending this data from the warehouse into Mixpanel. And I know that the type is really small here, so please don’t strain your eyes looking at the type on the presentation.

But the key thing here is that this is a kind of path-to-conversion type report, except instead of a conversion, we’re looking at a path to a ticket being submitted. And visually it’s very easy to see that one event that occurs just before a ticket being submitted occurs much more often than the other events that occur right before the ticket being submitted.

Now as a product manager, I might look at this and realize, oh, this is a part of the product that I might need to make easier. I might need to dive into why users might be getting stuck and reaching out to our support team. But also as a CX manager, I might look into writing some docs to make it easier for folks to [00:15:00] self-serve and use this part of the product.

Or I might train my team and work with my team to gain a better understanding of this part of the product, knowing that it requires a little bit more support.
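The same path-to-ticket question can also be approximated directly on a unified event stream. The sketch below is not the Mixpanel report from the talk. It is a simplified stand-in with made-up event names and sample data, and it counts which event each user performed immediately before submitting a ticket.

```python
from collections import Counter

# Hypothetical unified event stream: (user_id, ISO timestamp, event_name).
# Product events and support-ticket events sit side by side.
events = [
    (1, "2021-12-01T09:00:00", "project_created"),
    (1, "2021-12-01T09:05:00", "billing_page_viewed"),
    (1, "2021-12-01T09:07:00", "ticket_submitted"),
    (2, "2021-12-01T10:00:00", "billing_page_viewed"),
    (2, "2021-12-01T10:02:00", "ticket_submitted"),
    (3, "2021-12-02T11:00:00", "project_created"),
    (3, "2021-12-02T11:30:00", "ticket_submitted"),
]


def preceding_event_counts(events, target="ticket_submitted"):
    """Count the event each user performed immediately before `target`."""
    counts = Counter()
    previous = {}  # user_id -> name of that user's most recent event
    # ISO-8601 strings in a single format sort chronologically as plain text.
    for user_id, _, name in sorted(events, key=lambda e: (e[0], e[1])):
        if name == target and user_id in previous:
            counts[previous[user_id]] += 1
        previous[user_id] = name
    return counts


counts = preceding_event_counts(events)
for name, n in counts.most_common():
    print(name, n)
```

In this toy data, `billing_page_viewed` precedes two of the three tickets, which is the kind of signal a product or CX manager might then investigate.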

[00:15:13] Benefit 3: Data team autonomy

[00:15:13] Jeff Sloan: The third benefit is data team autonomy and agility. Reasonably, this comes because, with this data already living in the data warehouse, a particular request for data has to cross many fewer hands to get fulfilled. So in the old way, the traditional event tracking world, a request gets made for a particular data point.

A product manager might take that down. It sits in a backlog, and finally an engineer instruments it, and the data is now available, moving forwards. And this could take weeks. This could take days. This could take months before this data is made available. Whereas in the new way, the analytics engineering way, using [00:16:00] synthetic events in the data warehouse, data teams can provision data and meet their business users’ needs independently and autonomously.

So we’ve walked through what synthetic events are, why they’re a compelling solution to a problem now, and the reasons why those three benefits actually work in practice. And again, these three benefits are accuracy in your reporting; enabling cross-domain product analyses, where, reasonably, you could also imagine sending this data out to many consumers; and finally, autonomy and agility for your data team.

[00:16:53] Cons, limitations and notes

[00:16:53] Jeff Sloan: Now, before I end my talk, I hope I’ve encouraged you to consider using synthetic events, consider [00:17:00] them as a powerful tool in your arsenal as a data team. But there are some considerations that you might have in mind as you start rolling those out. And I want to walk through some of those as well as open areas of conversation.

For starters, synthetic events are not real time. They’re ultimately subject to the same series of processing steps upstream that the rest of your data work has as well. So think about taking that data out of source systems, loading it into your warehouse, transforming that data. Synthetic events have to walk through each of those steps as well.

And in many cases you can imagine actually getting this latency down quite low, even sub-15 minutes or sub-hour. But in reality, sometimes you really need real-time event data, and [00:18:00] there are tools out there like Materialize that are trying to crack this problem of giving a streaming data warehouse-like solution, but we’re not totally there in kind of mainstream adoption and usage of this type of tool.

Number two, managing source data schema changes is going to be a part of managing these new events in a way that it wasn’t before.

So if your production database schema changes, you’re going to have to go through and remodel your transformations so that your events continue to flow through. Now, there are kind of two silver linings here. One is that typically, if you’re sourcing your core events using synthetic events, you’re going to be basing your data off of a long series of transformations where you’re going to have to do this type of remodeling work anyway. For example, updating an orders mart. [00:19:00]

And another piece of this is that I guarantee that the backend of your website and the schema of your data change a little less frequently than the front end of your product. The other piece, tangentially related, is that modeling takes effort for each new event.

So we are taking some work that used to be done by product teams to structure the data and decide what data needs to be captured upfront. And we’re pushing some of that work down into the data team that’s managing and building these events after the data has been captured. And in some cases, this may be a challenge where the product team has the ability and the know-how to be able to work on these and capture this data quite well.

But in some cases, your data team might have the ability to serve your users a little bit more quickly than the product team. And finally, with this approach, there are now two sources for [00:20:00] events, which means two places to look for instrumentation, and two places and two teams to ask when you need to provision or source this data for business teams or for product analytics. And that comes with its own kind of organizational challenges.

In this vein, some open areas of conversation include: what are the best practices for modeling events data in the first place? And actually, I think this is quite a lively debate and a lively discussion. Three talks at Coalesce have discussed event data modeling in detail. The CEO of Narrator discussed an open source activity schema for modeling event data, which looks like it enables many of those cross-domain product analyses that we’ve talked about, but directly within the warehouse. Indicative spoke about prepping data for product analytics. And I know the Snowplow session discussed modeling data at [00:21:00] scale for events in particular. I think the jury is out on the best solutions here, or exactly the taxonomy of problems that we’re trying to solve with event data and where each solution fits.

For example, should we have one big table for all of our events, super wide or super narrow? Should we split up events into multiple views and tables, even per event or per category? I don’t necessarily have the best answer here. I’m looking forward to seeing the conversations that are ongoing. Another piece here is that now, again thinking back to our last slide, there are now multiple approaches.
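To make the one-big-table versus per-event question concrete, here is a hedged sketch of two of the layouts side by side, again using Python's built-in `sqlite3` module. The schema, column names, and view names are invented for illustration, and the sketch takes no position on which layout is better.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layout A: one wide table; columns are NULL when they do not apply.
    CREATE TABLE events (
        user_id     INTEGER,
        event_name  TEXT,
        occurred_at TEXT,
        order_id    INTEGER,  -- order events only
        amount      REAL,     -- order events only
        ticket_id   INTEGER   -- ticket events only
    );
    INSERT INTO events VALUES
        (1, 'order_created',    '2021-12-01 09:00:00', 100,  25.0, NULL),
        (1, 'ticket_submitted', '2021-12-01 09:30:00', NULL, NULL, 7),
        (2, 'order_created',    '2021-12-02 12:00:00', 101,  40.0, NULL);

    -- Layout B: one narrow view per event, exposing only relevant columns.
    CREATE VIEW order_created AS
        SELECT user_id, occurred_at, order_id, amount
        FROM events
        WHERE event_name = 'order_created';

    CREATE VIEW ticket_submitted AS
        SELECT user_id, occurred_at, ticket_id
        FROM events
        WHERE event_name = 'ticket_submitted';
""")

wide_columns = [row[1] for row in conn.execute("PRAGMA table_info(events)")]
order_cursor = conn.execute("SELECT * FROM order_created")
order_columns = [d[0] for d in order_cursor.description]
order_rows = order_cursor.fetchall()

print(wide_columns)
print(order_columns, len(order_rows))
```

The wide table grows a new, mostly NULL column for each event-specific attribute, while the per-event views stay narrow but multiply in number as new events are added. That trade-off is roughly what the debate is about.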

And that means we’re entering the territory where data teams and product teams and business teams now need to change how they think and how they act, and managing these multiple approaches might come with its own challenges as to when to source something via the product team, when to source something via the data team, and, if both can reasonably do it, who should do it, especially [00:22:00] given bandwidth considerations and just pragmatism.

And finally, I’m really curious about your thoughts here on governance. In my experience, event data has always been a special case as a data set. It’s really fast moving. There are always new events that need to be tracked. And for that reason, it seemed like, in general, it suffered from a little less governance, or was afforded a little less governance, than other typical business objects and entities.

This is a little bit of a loosely held opinion. But I do think that by taking this approach, you’re able to benefit from all the analytics engineering practices, the testing, the good control and approval flows that you’ve become used to, with event datasets. But at the same time, [00:23:00] event data just feels a little bit different, and I’m not sure if it should be afforded some sort of fast path or a little bit more flexibility than other datasets. And so I hope that I’ve given you a whistle-stop tour of synthetic events and why you should be using them, why we’ve gotten here in the first place, as well as some considerations you might need to have as you’re rolling these out for you and your company.

And I look forward to any questions that you might have, and certainly jumping into the dbt Slack after the presentation.

[00:23:36] Amy Chen: Thank you so much, Jeff, for showing us how the cookie should crumble. It looks like we still have time for questions. So I’ll read some attendees’ questions out loud for you. Please forgive me if I mispronounce any names. I’m a cookie, so very low bar here.

First question: Jonah Dibbers asked, what is your view of the benefits of synthetic events versus [00:24:00] implementing server-side tracking instead of client-side to get to those core metrics?

[00:24:04] Jeff Sloan: Sure. It’s a great question. So server-side event tracking, or the use of first-party cookies, is definitely still going to be necessary in order to track, say, app interactions like somebody visiting a page, or maybe even clicking on a dropdown menu, something along those lines. But the key importance of synthetic events is that they can go strictly through the analytics engineering workflow. They don’t necessarily need to slot into a product engineering workflow. That’s one of the key benefits here: it gives you another path. It doesn’t necessarily mean you can’t track that using the synthetic events, or rather the server-side events.

But I think for me, that’s one of the bigger ones. The other reason is actually the completeness of data that you’re able to access with synthetic events. So if that data has been given to you, [00:25:00] because it’s now your first-party data, you now have users that have agreed to your terms of service to give you that data and let you use that data.

Whereas with server-side tracked events, you may not necessarily have that using a kind of traditional front end.

[00:25:17] Amy Chen: Thank you. So we’ll close out with Mila’s question. What would you say to the concern that taking on this new technique might spiral into reworking the core transformations and workflows that companies have gotten used to making? Is migration cost a meaningful concern here?

[00:25:36] Jeff Sloan: Yes is the short answer, and I think pragmatism is key. I think that if I were to start from scratch, I would probably start sourcing more events using synthetic events, but in practice, if you have tons of workflows that are built off the back of front-end tracking systems, then maybe this is a good approach to start [00:26:00] supplementing that and give you a fast path or an alternative path to sourcing data for your teams.

[00:26:09] Amy Chen: Oh, we have some more time. So I’m going to steal another question then. Aaron asks, what are some business questions where you’ve had to analyze product and non-product interactions together?

[00:26:23] Jeff Sloan: Oh this one? Is that a question for the audience or is that a question for me?

[00:26:33] Amy Chen: Wait. I wonder if this is... I feel like that was a question to you, but maybe not. Oh, no. Life is a struggle. Sorry. Here’s another one. Joel asked: if you use completed orders as your synthetic event source, how would you, say, identify abandonment percentages?

[00:26:55] Jeff Sloan: How would you identify abandonment percentages in this case? Yes, you’re probably going to [00:27:00] need some combination of things.

One, I think, interesting question is the use of this alongside your front-end tracked events. So I think the key challenge I foresee in using this approach is you have page views, you have people saying I’ve added this to a cart, and then now you have your order creation, for example, and your order creation is way more complete than everything else, because of all the concerns that we’ve spoken about so far.

In this case, actually, probably the use of kind of filtering, based off the fact that you can now tie this kind of first-party, backend-sourced event to a particular session, allows you to then compare like for like, apples to apples, in that experience.
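One possible reading of that answer, sketched with invented data: if backend-sourced orders carry a session identifier when one is available, abandonment can be computed only over sessions that front-end tracking actually saw, so both sides of the ratio come from the same population. The field names and the `abandonment_rate` helper below are hypothetical.

```python
# Sessions in which front-end tracking recorded an add-to-cart event.
cart_sessions = {"s1", "s2", "s3", "s4"}

# Hypothetical backend-sourced orders. session_id is None when the order
# cannot be tied to a tracked session (for example, tracking was blocked).
orders = [
    {"order_id": 100, "session_id": "s1"},
    {"order_id": 101, "session_id": "s3"},
    {"order_id": 102, "session_id": None},
]


def abandonment_rate(cart_sessions, orders):
    """Share of tracked cart sessions that did not end in an order.

    Orders without a matching tracked session are excluded so that the
    numerator and denominator describe the same set of sessions.
    """
    if not cart_sessions:
        raise ValueError("no tracked cart sessions to compare against")
    converted = {o["session_id"] for o in orders if o["session_id"] in cart_sessions}
    return 1 - len(converted) / len(cart_sessions)


rate = abandonment_rate(cart_sessions, orders)
print(f"{rate:.0%}")
```

Order 102 still counts toward total order reporting. It is only left out of this one ratio, because there is no tracked cart session to compare it against.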
