
This is just the beginning

Alan is one of the 2020 UK DataIQ100 and leads the data team at Tails.com, an online, direct-to-consumer, pet-nutrition business that now feeds over 150,000 dogs across Europe. In his spare time, he has been wrangling with the challenge of shepherding the growing open source community around SQLFluff, a SQL linter for humans.

The “modern data stack” and dbt are often positioned as the current best option for organisations to aim for, and the end point of, transformation projects.

This talk will look at themes around that journey: metadata, standardisation, & latency, to paint a picture of what the next phase of the journey looks like. That this is not the end…this is just the beginning.

Follow along in the slides here.

Browse this talk’s Slack archives

The day-of-talk conversation is archived here in dbt Community Slack.

Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.

Full transcript

[00:00:00] Erin Vaughan: Welcome everyone, and thank you for joining us at Coalesce. My name is Erin Vaughan and I’m the director of customer success at dbt Labs. I am absolutely thrilled to be hosting today’s session, “This is just the beginning,” presented by Alan Cruickshank. Spicy title for what I assure you will be an exciting 30 minutes.

Alan leads data at Tails.com. Tails provides subscription tailor-made pet food in the UK. Alan’s also the author and maintainer of SQLFluff. If Alan looks familiar to those of you who have been in the dbt community for a bit, you have fantastic memories. Alan is not new to the Coalesce stage and spoke last year about SQL.

So before we jump into things, some quick reminders. All chat conversation is taking place in the #coalesce-just-the-beginning channel of dbt Slack. If you want to join, you can click the button on your screen. I highly recommend it. If you’ve enjoyed the presentations so far this [00:01:00] week, I can assure you that seeing dozens of folks in the community live reacting, meming, and asking questions throughout the presentation really puts a cherry on top of the true Coalesce experience.

To kick us off, because Tails is a company focused on pets, we want to hear about and see yours. So our chat champion has started a thread to have you introduce yourself, let us know where you’re calling in from, and share a picture of your fur baby and what their name is. If you are more of a pet rock person, let us know what you would name a hypothetical pet.

Once the presentation wraps up, Alan will stick around to answer questions in the Slack. So let’s go ahead and get started. Over to you, Alan.

[00:01:41] Alan Cruickshank: Thanks Erin. Hi everybody. I’m Alan, I look after data at Tails.com and, as Erin said, I’m looking forward to seeing pictures of pets in the Slack channel.

So this is all just a ruse to collect pictures of your animals. I’m just kidding. Anyway. So the title of today’s talk is somewhat intentionally provocative, but let me do a little bit of [00:02:00] context before we get into that. I’m Alan, I look after data at Tails.com and, as Erin said, I’m the author and one of the maintainers of SQLFluff. I think some of the SQLFluff crew are probably in the chat today. There are also a couple of Tails.com people in the chat today. Feel free to tag them. Guys, if you’re around, say hi. Otherwise, I’m not going to talk about either of those two specifically today, although they will come up a little bit later.

So before we get started, I’d like everyone who’s here today to take a couple of minutes, well, I say a couple of minutes, a couple of seconds, to think about what your vision is for data, whether that’s for your immediate data team or for data at your organization or for your dbt project. Just picture what it is.

And if you’re feeling keen, drop it into the Slack thread, we’ll come back to this.

[00:02:48] Analytics evolution

[00:02:48] Alan Cruickshank: Okay, clear in your head? So if we think about how we’ve got to where we’ve come to, and this is going to seem a little bit roundabout, but this will all make sense in a second. [00:03:00] We’ve come a long way in analytics. Even just over the course of the last year, from the things that were going on at Coalesce 2020. Even within relatively recent memory, we were on the left-hand side of this diagram.


[00:03:12] Alan Cruickshank: We were in Excel 95. We had Clippy helping us make spreadsheets and all. Now from there we’ve progressed a few steps. We’ve gone into ad hoc SQL and started to try and interact with different granularity data sets ourselves. We started using tools like Git or Airflow to version control our code and to schedule things so that we don’t have to click buttons all the time.

We’ve started to move into the world of dbt or other alternatives to dbt, looking at how to define our business logic and define things in tools so that we can start to reuse things and not write the same thing over and over again. We then move into dbt macros, into packages, and really develop the craft skill that is analytics and analytics engineering.

And if we look at some of the more recent [00:04:00] things that are happening right now, we’re getting into, I mentioned SQLFluff, getting into linting and making sure that the quality of our code is good. We’re getting into tools like Materialize, where we’re starting to see the line between batch and stream processing being blurred a little bit more.

And we’re getting into seeing tools like Select Star, where we’re trying to map lineage between different places. And I think that leads us to an interesting place now for what teams are working on. What I’ve primarily discussed here is the tooling. But I think there’s a reason that the tooling has changed so much.

[00:04:32] The insight has evolved too…

[00:04:32] Alan Cruickshank: And that’s because the things we’re trying to communicate as a community are getting more complicated. This is, for some of you, the very familiar DIKW pyramid: data, information, knowledge, wisdom. I bring it up not because of what it’s primarily useful for, but because of its relevance to what it means for the data that we’re building on.

And why lineage and metadata have got more important within that. So at the bottom of the pyramid, if you’re communicating just data, you don’t need to know where the [00:05:00] data came from. Data is data. In some ways, it doesn’t matter whether it’s good data or bad data. If that’s what you’re communicating, it’s just data.

In some ways, information requires a little bit of knowledge of lineage, but you don’t need to make sure it’s trustworthy in the same way. In some ways, fake news is information. It’s not necessarily valid information, but it’s still information. But once we start getting into knowledge and wisdom, and knowledge specifically, you need to know something about your lineage.

It’s not knowledge if you don’t have a reason to believe it to be true, and you can’t believe it to be true without having some reason to believe that the source is valid. Doubly so when we get to wisdom, which almost takes into account that lineage and quality as a necessary part of the knowledge itself. In some ways, wisdom is knowing the difference between trustworthy data and untrustworthy data, and being able to differentiate between those two and make sensible assertions, often [00:06:00] with those two things in mind.

So that’s the progression of the tooling that we’ve talked about, and the progression of the insight and the things we’re trying to communicate. If I listen to data teams that I talk to in the community, primarily I hear people talking about three things. Some people are focusing on reusability. I touched on dbt packages, and there are other places where that reusability is going on.

What do I standardize? What do I customize? And how do I do that effectively for my… We hear people talking about metadata, the data about data, and how to treat that data: in some ways the same way as, and potentially differently to, other data that’s going on. And the last thing that I think is interesting right now is latency.

[00:06:43] What do current dreams look like?

[00:06:43] Alan Cruickshank: So going beyond the daily batch process to refresh your entire dbt DAG, or to refresh your entire platform on a daily basis, can we do better than that? Can we get better latency? Granted, this is probably a biased sample set. There may be other interesting things going on [00:07:00] there. There’s probably a shout-out to governance and data quality on here too.

But for now, I’m going to assume that these are the three things that are going on. So looking at reusability first, what I think is interesting at the moment is how we’re seeing areas of our stack standardize and areas of our stack differentiate. So if I talk you through the diagram I’ve got on screen at the moment: on the left, we instrument, we generate data to do analytics on. That might be on pencil and paper in an industrial environment, that might be within an app or a website in an e-commerce environment, that might be within a call center or within some other mixture of physical and digital experience somewhere else.

We then ingest, we acquire: we get the data, we bring it to somewhere that we can use it. We then store it, normalize it, make it usable for analysis. We then combine and join that data together with data from other sources, we analyze it to extract the [00:08:00] information, and then we visualize and deploy that out to the people using the data.

[00:08:06] Reusability

[00:08:06] Alan Cruickshank: And if we look at the last five years, maybe three years, certainly the ingestion and acquisition of data has become continually standardized. The adage of “don’t ask your engineers to write ETL” has hit home successfully. Tools like Fivetran, Stitch, Meltano, PipelineWise: a whole wealth of tools in the ingestion and acquisition space are standardizing what it means to bring data into one place. Storage and normalization is becoming standardized as we grow out from that space.

The way that we store data is certainly standardized. I don’t think it would be controversial to say that, of all of the people on the call today, the number of data warehouses represented is probably less than the number I can count on one hand. And the way that we normalize data within those warehouses is slowly becoming more standardized as well.

Look at the work that Fivetran are doing with dbt [00:09:00] packages. Look at the work that other people are doing across the dbt community to bring out packages to normalize data from different sources into a standardized format. I realize I skipped instrumentation. There’s less standardization at this end, but we are slowly starting to standardize the way that we collect and instrument data, mostly in the web space.

So if you think about web event collection, that is slowly being standardized with tools like analytics.js. But there are other things happening in the instrumentation space to slowly standardize the way that we generate the data we collect. If I go to the opposite end of this diagram and the way that data is being used, certainly the way that data is being visualized is at the very least becoming more professionalized. I think it’s also becoming more standardized.

Gone are the days when a business has to invent how to make a churn curve. People know what a churn curve looks like. There is a standard way of doing it that has been well established now. Bar charts, line graphs: the [00:10:00] ability of analytical professionals to know how to make one and when to use one is becoming more standardized. Potentially a more controversial thing to say at the visualize and deploy end of the spectrum is around data science. Have a reflect on how many truly novel data science use cases you’ve seen in the last couple of years.

I would go as far as saying the core group… And of course there are interesting things happening at the tail end, but for the vast majority of businesses, we’re talking about churn prediction, we’re talking about recommendation and product recommendations, we’re talking about basket analysis, we’re talking about what to do in what circumstance. But the number of use cases is becoming consolidated, for most businesses, into a standard set of things you’re going to do. And what I think that means for us as an analytics community is that the place where we add our secret sauce, where you add your business logic, is getting squeezed into this bit in the middle: into how you combine data sources cleverly, and how you analyze data [00:11:00] sources cleverly.

And both of those two things, thankfully for the context of the conference we’re at at the moment, at least for our team, and I think for many people here, happen within dbt. And this is where a lot of the secret sauce of what your data organization is doing is happening. And the rest of the stack is becoming standard.

[00:11:21] Metadata

[00:11:21] Alan Cruickshank: This leads nicely onto metadata, and I realize this is quite a wordy slide, but let me talk you through from left to right what is going on around metadata. So all of us will know that bad data is worse than no data. I would much rather have a graph that says “this is up to date as of yesterday” than have it show today’s data wrong.

And that’s because all of us operate on trust. We have trust with the people who use our data, and we must have trust in the data that we use. And we know that it only takes a few incidents or issues of data quality to erode that goodwill, erode that trust, and drastically slow down the use of data in decision-making. [00:12:00]

So what does bad data mean in that context? At the very top end, in how people are using data, that means assured metrics: metrics that you know are correct. And in short, that could just be because they’re correct today because we know that if we work them out for yesterday, they are what they were yesterday. On a grander scale, or at least for our team at the moment, that might mean taking our key metrics and making sure that when we recalculate them for a year ago, they are exactly the number that we had a year ago, and that can give us confidence that they are what they are today.
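The recalculation check described above can be sketched in a few lines. This is a minimal illustration in Python, not the Tails.com implementation: the function name `find_metric_drift`, the dates, and the metric values are all hypothetical, and it assumes the originally reported values were stored somewhere at the time they were first published.

```python
from datetime import date


def find_metric_drift(recorded, recalculated, tolerance=0.0):
    """Return {day: (recorded, recalculated)} for every day whose values differ.

    `recorded` holds the values reported at the time; `recalculated` holds the
    values produced by re-running today's logic over the same historical days.
    """
    drift = {}
    for day, old_value in recorded.items():
        new_value = recalculated.get(day)
        # A missing recalculation counts as drift, as does any difference
        # larger than the allowed tolerance.
        if new_value is None or abs(new_value - old_value) > tolerance:
            drift[day] = (old_value, new_value)
    return drift


# Hypothetical figures for a single key metric.
recorded = {date(2020, 12, 7): 1520.0, date(2020, 12, 8): 1498.0}
recalculated = {date(2020, 12, 7): 1520.0, date(2020, 12, 8): 1503.0}

drift = find_metric_drift(recorded, recalculated)
print(drift)
```

An empty result would suggest that today’s logic still reproduces the historical numbers; any entry points at a day worth investigating before the current figure is trusted.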

Just having assured metrics at the top end only gets you so far, though. We need known and consistent latency. I say here real time never exists. Anyone who says they have real-time metrics doesn’t have real-time metrics. The question is: is your latency in seconds, milliseconds, nanoseconds, minutes, hours, or days?

And for some teams, days is fine. Sometimes weeks is fine. Some teams need seconds, but they need to know whether [00:13:00] it’s seconds or whether it’s days, and it needs to be consistent. And lastly, the inputs need to be quality. If your outputs start failing, or you’re starting to use outputs which aren’t in that core set of things you test, you need to have quality inputs. I’m not going to go into it today, but there are a whole host of tools which are becoming standardized in that space.

And lastly, around latency. I think certainly for us, and I think for several other dbt projects that I’m familiar with, the standard has been a monolithic project with monolithic latency. And what we’re starting to see more and more, not just at Tails.com, but I think, and I’m going to forget the name of it, it was

[00:13:40] Latency

[00:13:40] Alan Cruickshank: JetBlue last year, talking about how they’ve been working on granular latency for their project, and how you can have a monolithic project with different latencies within the different streams of that project. And we can do better than just marching to the slowest drum, but we have to do it intelligently.

Making all of your data faster doesn’t [00:14:00] necessarily make sense. You need to work out which bits make sense to go faster, and which bits, actually, either you don’t want to pay the computational expense to make them go faster, or the teams that are going to use them don’t need them to be any faster. Daily is absolutely fine for that.
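One way to picture granular latency inside a single project is to give each model its own freshness target and only refresh the ones that are due. The sketch below is a minimal Python illustration under stated assumptions: the model names, the targets in `TARGET_LATENCY`, and the `models_due` helper are all hypothetical and are not taken from the talk or from any particular scheduler.

```python
from datetime import datetime, timedelta

# Hypothetical models and how stale each one is allowed to get.
TARGET_LATENCY = {
    "orders_live": timedelta(minutes=15),
    "marketing_spend": timedelta(hours=6),
    "finance_monthly": timedelta(days=1),
}


def models_due(last_run, now, targets=TARGET_LATENCY):
    """Return the sorted names of models whose last run is older than their target."""
    return sorted(
        name
        for name, target in targets.items()
        # A model that has never run is treated as infinitely stale.
        if now - last_run.get(name, datetime.min) >= target
    )


now = datetime(2021, 12, 8, 12, 0)
last_run = {
    "orders_live": datetime(2021, 12, 8, 11, 40),    # 20 minutes ago
    "marketing_spend": datetime(2021, 12, 8, 9, 0),  # 3 hours ago
    "finance_monthly": datetime(2021, 12, 7, 6, 0),  # 30 hours ago
}

print(models_due(last_run, now))
```

With these figures only the fast-moving model and the overdue daily one are selected, so the six-hourly model does not pay for a refresh it does not need.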

[00:14:18] What is your vision for data?

[00:14:18] Alan Cruickshank: So we’ve just gone through, in a little bit of a whirlwind way, three major themes that I think cover a lot of people’s goals for data right now. We’ve covered reusability, customization, standards. We’ve covered metadata, and within that, lineage and data quality, and how much you can assure the quality of your input and output data.

And lastly, we talked about latency: how we can be more intelligent about latency and get from just monolithic daily dumps of data through to different speeds, and maybe even blending some streaming into that.

How close do these sit to some of the goals that you wrote [00:15:00] down at the beginning of this session? And how bold do those goals feel, given what we’ve covered so far? And this is potentially the spoilers part of it. So I think all of the things we’ve talked about now are achievable today, at all shapes and sizes of organizations. Now the speed that you can do that will depend on the size and shape of your organization. But all of this is very achievable and potentially faster than you thought.

Not only have we got standardized approaches for some of this stuff, but we have standardized tool sets that are appearing for this, and a wide range of people within the community, some of whom you may have met today, some of whom you’ve come across in the rest of the community, who can help you do this and have ideas on how to do it.

And in some ways, this is great. This is a sign that we, as a community, have done exactly what we set out to do over the last couple of years. But I think there’s a challenge here as well: if this is what we think [00:16:00] our goals are, and it’s all very achievable, what does that mean we should be aiming for? And is that bold enough for the next year, for the next five years, for the next 10 years of what we, as a community, should be trying to do?

There’s a caveat here. Everything up to this point, I think, is applicable to everyone. It doesn’t matter what organization you’re from. It doesn’t matter what stage of your journey you’re at, whether you’re just starting to get into this, whether you feel like you’re in the middle, or whether you feel like you’re very mature.

I think it applies to everyone. What follows here is how I think we’ve applied this thinking at Tails.com and how we see the next wave of progress. Results may vary. How you apply this in your organization may be different, but I’d like to sketch out how it looks for us in the hope that you can map that onto your organization in future.

[00:16:52] The data spectrum of your org.

[00:16:52] Alan Cruickshank: So if you picture the spectrum of data users in your organization, at one end you have what you would probably call [00:17:00] either your analytics team or your data team, the data people who are using data a lot. And they find data very useful. And if you think about the opposite end of the spectrum of your organization, I’m going to pick our office dogs.


[00:17:12] Alan Cruickshank: Now, I’m sure whichever organization you work for, there are people or other animals in your organization who are very not data. They’re not using data right now. And if we’re perfectly honest with ourselves, data isn’t particularly useful to them. And there is a big spectrum of people in the middle whose hypothetical usefulness of data varies.

Let’s say, for the purpose of today, linearly. It might not be linear, but there’s some relationship between how data they are and how much that data is useful. Consider, before I do the next animation, how the actual usage of data varies across the spectrum right now. Is it mostly linear? Does it follow the same [00:18:00] relationship?

Is it above this line? Is it below this line? I would posit that for some organizations, us included, the actual usage of data looks a lot like this. You have a core group of people who are using data. They’re using visualization, they’re using raw data, they’re using dbt. They’re using all of these tools a lot and they find data very useful.

But as soon as you step out of that group, the usage of data drops off quite dramatically. And the actual usefulness of data to that next layer out of people, hasn’t changed a lot. There’s a group of people who could get a lot of value out of data if they were using it more, but aren’t using it very much.

And then that stays consistent or maybe drops off a little bit more out to the edges. I’m going to bring up a term which I feel like some other people this week have started to try and reclaim, but I’m going to jump on that bandwagon: the words self-serve. Now, [00:19:00] I think a few years ago, if someone said self-serve, what they intended to do was this.

[00:19:05] Use your tools to enable autonomy

[00:19:05] Alan Cruickshank: They wanted to get everyone in the organization using all of the data all of the time. And if you look at that, that’s a big delta from where we are today. That involves upskilling a huge group of people to use data a lot more, and that’s going to be a lot of work. And it’s going to take a lot of time and it may not be very successful.

And you might not see a lot of results very early. What I think dbt enables is for us to much more accurately map the actual data usage to the hypothetical usefulness, without a load of extra effort. There are people on the fringes of your data universe, like your office pets, who don’t need to use data a lot. But for these people who are closer to your data team, I think the next wave of what we need to do is enable them.

[00:20:00] Not to look within at our own teams and build out the tooling and approaches, although that is useful. I think we need to start looking out, to start enabling the people around us in our organization, and to use dbt as a tool to do that, so that we can get all of the people around us that we work with closer to that hypothetical usefulness of data.

And I think this has a nice parallel. I feel like, as a community over the last few years, we’ve been progressively reclaiming the word analyst to be a meaningful role within an organization. If I think back five years, the word analyst might have been a slightly derogatory term for an entry-level employee in a large accounting firm.

Whereas now it means something again. While we’ve tried this a couple of times, I think it’s now time for us to go out and reclaim self-serve. The intent of what self-serve was, was good. We wanted people who may not [00:21:00] consider themselves to be data professionals, in essence, to be able to do more with data themselves.

And that relies on having the tooling. Now, what we’ve done in focusing back on the analyst is to build out that tooling, to build out the approaches, to build out the frameworks that allow us as data teams to be effective and to be successful. And we’ve got quite automated at that and we’ve managed to standardize a good chunk of it.

And I think that standardization that we’ve reclaimed over the last few years will now be the bedrock that allows this next wave of self-serve to work, and for us to empower the people around us, because really that’s what our data teams are there for: to help the organizations that they sit within to be more effective themselves.

And those organizations, for the most part, with a few exceptions obviously, don’t exist to build data tooling. They exist to do something else. For us, it’s to change the world of pet food for good. And so the time is now to look [00:22:00] outwards: go and change the world of pet food for good, go and reclaim self-serve.

[00:22:05] This is just the beginning

[00:22:05] Alan Cruickshank: And that is why I think this is just the beginning. All of the investment and development that we’ve done in tooling over the last five years is not the end. The end is not to get dbt out into your organization. The end is not your new BI platform. The end is not your data dictionary.

These are the entry tickets. This is the foundation. And now the real work begins: to go and empower the people around us. That’s it. If you want to find me on the socials, you can find me on Twitter below. You can find me on LinkedIn. If you want to get involved in the SQLFluff project, go to sqlfluff.com. If you want to get involved with Tails.com, we’re hiring: tails.com/careers at the bottom.

And as Erin said at the beginning, I will be around in the Slack to have a couple of good arguments with everyone here about self-serve. I can already see the memes rolling in. Yeah, [00:23:00] fire away.

[00:23:03] Erin Vaughan: Hello. I brought Barley to help close things out. Thank you so much for such a great presentation, Alan.

I can tell that this is just the beginning of a lot of questions folks will have about some of the very visionary remarks that you made. I know we have a few minutes before we wrap, so I’m interested: do you feel like you’ve reclaimed self-service at Tails already? Or is that really the focus for your next year?

[00:23:27] Alan Cruickshank: I think that’s going to be the focus for next year. I think we’ve reclaimed analyst. I think we have. I’d be curious as to whether any of my team who are in the chat agree with this. I feel like we’ve reclaimed the desire for self-serve. I think there was a period in a lot of organizations a few years ago where

I think data teams got burnt out around self-serve, but I think business users also got burnt out around self-serve. The tools were too complicated. It was too hard to get the insights you needed. [00:24:00] And I think we are, at least internally, starting to feel that resurgence of demand, of people going:

“I’m ready to go now. I would like more capability, I’d like more tooling. I can see what you guys are doing, and this looks interesting and exciting. How do I get involved in that? How do I make small steps towards that without needing to get the whole thing?” And that, I think,

is the critical thing about what I think is different this time around. And maybe this maps quite nicely to the hype cycle, where there’s, I’m going to forget how it works, there’s the initial hype, where everyone thinks everything is amazing. And then there’s the peak of, like, disappointment, where you realize, actually, this isn’t a silver bullet.

It’s not gonna solve all your problems in the first place. And I think we’re now at the beginning of the second part, where self-serve never went away. It was still there, but the expectations are much more reasonable now. And people are wanting to get into this much more incrementally, to take little pieces off and make progress with that.

And I think that’s going to be the foundation for [00:25:00] actually useful self-serve rather than spend all of your money on a BI tool, that’s going to solve all of your problems.

Last modified on: Apr 19, 2022
