Table of Contents
- No silver bullets: Building the analytics flywheel
- Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization
- Identity Crisis: Navigating the Modern Data Organization
- Down with 'data science'
- Refactor your hiring process: a framework
- Beyond the Box: Stop relying on your Black co-worker to help you build a diverse team
- To All The Data Managers We've Loved Before
- From Diverse "Humans of Data" to Data Dream "Teams"
- From 100 spreadsheets to 100 data analysts: the story of dbt at Slido
- New Data Role on the Block: Revenue Analytics
- Data Paradox of the Growth-Stage Startup
- Share. Empower. Repeat. Come learn about how to become a Meetup Organizer!
- Keynote: How big is this wave?
- Analytics Engineering Everywhere: Why in the Next Five Years Every Organization Will Adopt Analytics Engineering
- The Future of Analytics is Polyglot
- The modern data experience
- Don't hire a data engineer...yet
- Keynote: The Metrics System
- This is just the beginning
- The Future of Data Analytics
- Coalesce After Party with Catalog & Cocktails
- The Operational Data Warehouse: Reverse ETL, CDPs, and the future of data activation
- Build It Once & Build It Right: Prototyping for Data Teams
- Inclusive Design and dbt
- Analytics Engineering for storytellers
- When to ask for help: Modern advice for working with consultants in data and analytics
- Smaller Black Boxes: Towards Modular Data Products
- Optimizing query run time with materialization schedules
- How dbt Enables Systems Engineering in Analytics
- Operationalizing Column-Name Contracts with dbtplyr
- Building On Top of dbt: Managing External Dependencies
- Data as Engineering
- Automating Ambiguity: Managing dynamic source data using dbt macros
- Building a metadata ecosystem with dbt
- Modeling event data at scale
- Introducing the activity schema: data modeling with a single table
- dbt in a data mesh world
- Sharing the knowledge - joining dbt and "the Business" using Tāngata
- Eat the data you have: Tracking core events in a cookieless world
- Getting Meta About Metadata: Building Trustworthy Data Products Backed by dbt
- Batch to Streaming in One Easy Step
- dbt 101: Stories from real-life data practitioners + a live look at dbt
- The Modern Data Stack: How Fivetran Operationalizes Data Transformations
- Implementing and scaling dbt Core without engineers
- dbt Core v1.0 Reveal ✨
- Data Analytics in a Snowflake world
- Firebolt Deep Dive - Next generation performance with dbt
- The Endpoints are the Beginning: Using the dbt Cloud API to build a culture of data awareness
- dbt, Notebooks and the modern data experience
- You don't need another database: A conversation with Reynold Xin (Databricks) and Drew Banin (dbt Labs)
- Git for the rest of us
- How to build a mature dbt project from scratch
- Tailoring dbt's incremental_strategy to Artsy's data needs
- Observability within dbt
- The Call is Coming from Inside the Warehouse: Surviving Schema Changes with Automation
- So You Think You Can DAG: Supporting data scientists with dbt packages
- How to Prepare Data for a Product Analytics Platform
- dbt for Financial Services: How to boost returns on your SQL pipelines using dbt, Databricks, and Delta Lake
- Stay Calm and Query on: Root Cause Analysis for Your Data Pipelines
- Upskilling from an Insights Analyst to an Analytics Engineer
- Building an Open Source Data Stack
- Trials and Tribulations of Incremental Models
Building a metadata ecosystem with dbt
dbt has democratised the role of "Analytics Engineering". It has empowered a new wave of "data practitioners" across teams within the Data Mesh.
The Data Mesh is an architectural paradigm that decentralises data into product teams rather than a centralised data team. Combining dbt and the data mesh creates an explosion of dbt users, and with it many organisational challenges, such as:
- How do you maintain high dbt code quality across teams?
- Who owns what data?
- If data breaks in production, which team is responsible for fixing it?
- How do you discover data built by other teams and leverage it within your own development?
Follow along in the slides here.
Browse this talk's Slack archives
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript
[00:00:00] Kyle Coapman: Hey there, dbt community. Welcome to another session of Coalesce. My name is Kyle, and I'm the head of training over at dbt Labs. I'm super excited to be your host for this session. We're going to get meta with a session called Building a Metadata Ecosystem with dbt. Our wonderful speaker is Darren Haken, the head of engineering for platform and data at Auto Trader. He starts his day with coffee, looks forward to driving his new electric vehicle, a Volkswagen ID.3, and in a past life was a DJ in Manchester. He's a huge fan of Japan and has visited five times.
Just so you know, all chat conversations are taking place over in the Coalesce Metadata Ecosystem channel of dbt Slack. If you're not part of that chat, head over to community.getdbt.com in your browser, and then find the channel Coalesce Metadata Ecosystem. We encourage you to ask attendees and the speaker questions, make comments, react, drop memes, all sorts of things, especially about metadata, the hot topic right now. We'd love to get the chat going. If you have [00:01:00] a question, please put the red question mark emoji at the beginning of your comment so that I can make sure I bring it up later if we have time for live questions. At the end of the talk, any questions that haven't been answered live will be answered directly in Slack. And without further ado, let's get meta with Darren. Over to you.
[00:01:19] Darren Haken: Thanks for that, Kyle. Yeah, I truly mean it. If you do like my talk, I am definitely happy to accept a Shiba; I've wanted one for many years. I just need the perfect opportunity. So yeah. Hi, I'm Darren, head of engineering at Auto Trader.
I work in the platform data space. My role really is to build capabilities across our platform that help the rest of the organization. The big areas I focus on are data and analytics. How do we democratize data? How do we build tooling and capabilities around machine learning and analytics, the whole raft of things, to help the rest of our product teams, marketers, executives, whoever, to work with data?
Today's agenda is the structure I want to go through around this metadata [00:02:00] topic that I'm very passionate and very interested about at the moment, especially with the platform lens of my day to day. I'm going to start by talking about Auto Trader's journey to this decentralization.
And this is the context, the groundwork, that explains why this metadata topic is so interesting. Then I'm going to talk about the hero of the talk, the metadata itself. And I'm going to finish off with a little bit around, if you're interested in metadata, which I hope you are after this talk, how we as a community can take that forward and move it on as an industry.
[00:02:29] Auto Trader's data journey
[00:02:29] Darren Haken: So let's start with a bit about Auto Trader. For those that don't know who Auto Trader are: we're the largest automotive marketplace in the UK, established in 1977. So very old. In fact, I took a screenshot on the right here; back in 1977 we were actually called Thames Valley Trader.
[00:02:49] Darren Haken: It looks very old. I was actually surprised not to see horse carriages on there, it looks that old. It was an old kind of magazine you could buy in the Thames Valley area, which is a part of the UK, [00:03:00] to buy a car. And to the right of that is the modern take on that, which is a fully digital organization that helps people buy and sell cars online.
And in terms of scale, which is always helpful for this: we have about 58 million cross-platform visits, and we're roughly the 12th largest site in terms of traffic here in the UK. So a pretty reasonable amount of data and lots of activity. Generally, people love to look for cars or buy cars.
It's a very key thing for a lot of families. And the part that's interesting about this is that 1977 is a very long time ago. That means Auto Trader has had many stages in its data journey. These aren't accurate timescales; they're just trying to show a sort of growth or movement or transition.
At some point, when computers came around, God knows when, I imagine at first we actually worked with paperwork. But the dawn of analytics, as far as I'm concerned, from a digital perspective, was when we had people working with spreadsheets on [00:04:00] some level, trying to do analytics about the marketplace and helping us make sense of what we needed to do, some period around 15 years ago.
Then we started building out a centralized data warehouse and building reporting on top of that. It was a very classic period of analytics for that time; I'm sure if some of you have been working in analytics since that period, you'll know it well. Moving into more modern times, we started looking at data lakes, machine learning, how we do more predictive things, which has then led us into this more recent journey I've been a lot more involved in over the last few years. That's building out what we call our self-serve data platform, which to me means building capabilities that are self-served by the rest of our organization.
And then recently, one of the bigger focuses over the last 12, maybe 18 months, has been more of a move to decentralization in how we work with teams in our platform on data. We're really focusing on this data mesh [00:05:00] architecture that's been discussed quite a lot in industry, which hopefully some of you may have heard of.
[00:05:06] Self-serve data platform
[00:05:06] Darren Haken: In terms of this self-serve data platform, it's worth giving a bit of an explanation of what it is, in my opinion. One of the key areas some of my teams work on is building what we call our data platform. That means building capabilities that let product teams and marketeers, like I said earlier, work with data. And we do that by building tooling or making it easier to work with certain things.
As an example, we use a technology called Apache Spark. It's very complex to run that from zero, from "I want to run Spark." So what we do is provide tooling to make it really easy to do that. And one of the other areas that we love is what I'm going to start describing as our practitioners.
Practitioners are people who are maybe data engineers, software engineers, analytics engineers, data scientists; I want to bundle all that under practitioner. One of the technologies that we've fallen in love with across that cohort is dbt. And crucially, from a self-serve data [00:06:00] platform perspective, we've invested in trying to make that available for everybody to work with, and make it as easy as possible.
And through that we've actually seen this explosion in data practitioners. By trying to build this self-serve data platform and invest in dbt, we've seen this huge shift, an explosion of users across the organization. I'm going to back that up with a few illustrations. So, this explosion of users: we started working with dbt about 2019.
Actually, I was tinkering with it and looking at it, and we were exploring it as an alternative to working with Apache Spark, with the premise that it was easier for a wider set of practitioners to work with. And that's pretty good. We've moved into a place now where we're probably more in the region of 70.
I put this together a few months ago, but we've got about 60, 70 active practitioners. And this is across an engineering group of 200 people, with an organization size of about 1,000. And [00:07:00] the reason I can pull this through, by the way, is that at the moment we use dbt in a monolithic approach.
So we've got one central repository. Obviously I can go to GitHub like this, and I can see contributors on master. And that's pretty awesome. We've now gone from four people ish to 60, 70. That's great. That's a nice explosion; that's a vibrant community. Another lens on that: we've got a bunch of people using Slack, and we've got this review process where, if someone's doing dbt, they should get somebody else to review it.
So if I look in Slack, which I did a few months back when I was putting this together, there's a bunch of people saying, hey, I've done some dbt work, can somebody have a look at this for me? And that's awesome. A little bit more worrying, in terms of the explosion of users, is this: 79 active pull requests.
So that was another lens, I thought: if we've got active contributors, [00:08:00] and the way that people do things is pull requests, let's have a look at how many we've got. And 79 is actually a little bit more worrying, because even though it shows an explosion of users, it's probably also showing we've got a bit of a backlog building up here as well.
[00:08:13] dbt + decentralization
[00:08:13] Darren Haken: We're not actually getting through them. So I'm a little bit more alarmed at this one, but nonetheless it shows an explosion of users. To bring what I was describing as decentralization together with dbt: we've tried to organize the way we work with data around business domains. An example of that would be, we've got this marketing domain, and inside that we've got these groupings of data that represent things like campaign performance.
This might be social campaigns or advertising campaigns and looking at the performance of them; maybe we're tracking a bunch of data around SEO, so we're actually trying to build out better data modeling to look at SEO optimizations and things we can do with that. On the other side, we might have, I don't know, a sales domain, and in there we're [00:09:00] looking at sales opportunities or churning customers, that kind of thing.
And with that, we have a bunch of people working in these areas who are actively developing in dbt. Within the sales domain, we've maybe got five, ten, whatever number of people actively developing dbt data models to do things like risk and sales opportunities. And on the marketing side, we may have a similar number of people doing things like campaign performance and so on.
So: explosion of users, lots of activity with dbt. Brilliant, amazing. However, distributing our data across the organization has not been the perfect situation we expected. We have seen a set of new challenges come from that, which is what I'm going to get into now: the challenges.
[00:09:50] A new set of challenges
[00:09:50] Darren Haken: One of the challenges we're seeing with this decentralization is this problem of ownership, or maintainers, of dbt. We have all these [00:10:00] people now working across all these domains, and they're actively working, but we've experienced this problem where people work on things, then stop working on them and forget about them.
The data platform team, one of my teams, did an audit of all the dbt models we had in our monorepo, and it looked for, I don't know, people who would wave their hand and say, I have my eye on that, that's something I'm responsible for. And 20% of our models had no sort of ownership or maintainer. This isn't a dig at dbt specifically, because we actually looked at other tools.
We use Airflow, Spark and other things, and this is just a common occurrence, a problem spread across all of our technology. It's actually more this decentralization, I think, that's caused it. More alarmingly, then, what we discovered is, we use Looker to drive and access some of our data.
We build in dbt, and we found that Looker has 74 references to BigQuery tables (we use BigQuery) that were connected to these ownerless dbt models. And that's a little bit [00:11:00] concerning, because people who use Looker might not really be as connected with the dbt work. Maybe it's our CFO or CMO, or it could be somebody in our accounts and sales team who was accessing data.
And their view is: this is well-maintained and reliable data, right? Because it's available to look at. But that might not be true, certainly for some of that 20%, or those 74 references. So, ownership problems, which I've covered, but just to make sure I've not missed anything: we start asking questions, especially my platform teams who want to look at quality and stability here. Who are the maintainers of these models? Who is it? We can't find them. And then that cascades into which stakeholders depend on the data. And the bit where I think this is really important as well is: if we have a bug or some sort of production incident, or something goes wrong with that data, who is there to fix it? Who resolves it? If a model is broken, who fixes it?
So this is a problem that we started to see when we saw this explosion of users. [00:12:00] And then there's this other thing, which I struggled to give a title to, but hopefully this makes sense: it's this idea of sharing data without permission.
It's really easy in dbt, I think, especially when we're in one repo at the moment, to access and reference data: you've got the ref function, we can look things up, and I can access data. But that means it's very easy, cross-team, for team A to access data from team B.
So I see it as teams depending on other teams' data without permission, because this ref hasn't necessarily been agreed with the other team. What we're seeing here is a sort of lack of contracts. Team A can access team B's data, and team B doesn't actually know that; team A can just reference it, blissfully unaware.
And we've seen issues start to manifest with that. For example, we had sessions data, user tracking data, and we've got some sessions table. Maybe a couple of our different teams [00:13:00] have got tables they call sessions, for different parts of our business, right?
So maybe brand new cars, leasing cars, used cars; they want to look at sessions around these sorts of vehicles, but they all call them sessions. We found that we ended up with this multiple-sources-of-truth problem, where different teams were referencing different sessions tables because they thought that was the right sessions table.
And then we have to backtrack on some of that, and that creates this sort of ripple, right? It hits trust and confidence. People want to validate: am I accessing the right sessions table here? They want that validation because they don't trust the data.
They're asking that question to make sure. And part of that is that it feels expensive for us when we want to restrict access to data. We've got real openness at the moment; it's very easy to access the data. But actually sometimes we don't want that, maybe for GDPR reasons, or personal data, or some sort of intellectual property, whatever it is, we want to restrict that. And the blocking off of data areas at the moment, for us, feels expensive.
[00:14:01] Darren Haken: I've got an example of this. Don't worry, you don't need to read all of the slide. I'm not sure if any of you can actually read all of the boxes, but visualizations are helpful nonetheless. In this example, I've got this top node, and that's one of our teams, which we call the stock and search platform team.
They had a bunch of dbt; they've got into the world of dbt modeling and built out a bunch of stuff. And they've actually got data that a lot of teams care about: really high value data. So all these other teams underneath referenced that data, because they want to work with it to do what they need to do in their area of the organization.
But stock and search didn't know that all these people were referencing their data through the ref function. So when we actually had a data incident, a production incident, where there was a problem with the stock and search platform's data, it cascaded. The blast radius was quite significant, and all these different teams were affected.
So yeah, don't worry, don't read all of the boxes and everything; it just really illustrates this blast radius idea. The [00:15:00] data wasn't properly encapsulated. It was leaking out into different teams.
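A dependency audit like the one Darren describes can be sketched from the `child_map` that dbt writes into `target/manifest.json`. This is only a sketch: the model names and the team-ownership mapping below are hypothetical; only the `child_map` shape (model → models that `ref()` it) is borrowed from dbt's manifest.

```python
# Sketch: find cross-team refs and the blast radius of one team's models,
# using a child_map shaped like the one in dbt's target/manifest.json.
# Model names and the team mapping are hypothetical.
from collections import deque

# child_map: model -> models that ref() it
child_map = {
    "model.autotrader.stock_search_stock": [
        "model.autotrader.marketing_campaigns",
        "model.autotrader.sales_opportunities",
    ],
    "model.autotrader.marketing_campaigns": ["model.autotrader.marketing_seo"],
    "model.autotrader.sales_opportunities": [],
    "model.autotrader.marketing_seo": [],
}

# Hypothetical ownership metadata, e.g. pulled from each model's meta block
owners = {
    "model.autotrader.stock_search_stock": "stock-and-search",
    "model.autotrader.marketing_campaigns": "marketing",
    "model.autotrader.marketing_seo": "marketing",
    "model.autotrader.sales_opportunities": "sales",
}

def cross_team_refs(child_map, owners):
    """Return (upstream, downstream) pairs where the owning teams differ."""
    return [
        (parent, child)
        for parent, children in child_map.items()
        for child in children
        if owners.get(parent) != owners.get(child)
    ]

def blast_radius(model, child_map):
    """All models transitively downstream of `model` (breadth-first walk)."""
    seen, queue = set(), deque(child_map.get(model, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(child_map.get(node, []))
    return seen

print(cross_team_refs(child_map, owners))
print(blast_radius("model.autotrader.stock_search_stock", child_map))
```

Here both marketing and sales ref the stock and search team's model without any agreed contract, and the blast radius of that one model covers every downstream team, which is exactly the incident pattern described above.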
[00:15:05] Discoverability was becoming a challenge
[00:15:05] Darren Haken: And we've got this other problem that started to happen with this explosion of users: discoverability, finding the right data, is getting harder and harder.
We've got people moving to Slack, and almost this sort of tribalism: hey, does anyone know where I can find this data? I'm looking for X; can people help me? That's why I described it as tribalism, because it's about what the people in the community know.
And that has some upper limit on scale, because eventually no one can remember all the data, at least not in an organization of our size. It's a good and a bad thing, right? People are obviously interested in wanting to access data, but at the same time, if it was more discoverable and more intuitive, then maybe they wouldn't have to use Slack to ask these questions. So we've got these challenges.
They've come through this decentralization and explosion of users, which are still fantastic problems to have, but I feel like we need a scalable way to overcome these [00:16:00] challenges exposed by decentralization.
[00:16:04] Automation over human processes
[00:16:04] Darren Haken: An overriding principle for me, or overarching principle, is that I'd like to approach this with automation. It would be very easy to introduce governance and human processes and say, let's go harder on reviews, or let's hire more people to attack all of the active pull requests. A lot of my background is software engineering, and I've learned through that that human processes are either error prone or just cannot keep up with the scale, right?
So we need automation, ways of validating things, to keep us moving at scale. And for me, one of the most interesting ways we could probably solve these challenges is through this idea of metadata.
[00:16:47] dbt supports metadata
[00:16:47] Darren Haken: So, some really awesome news: dbt supports metadata. It's part of the tool; it supports it, it's brilliant. The idea is that in your schema YAML files, your definitions of your models, there is this [00:17:00] meta block that sits underneath the model. You can see it on the right-hand side of this slide, and inside it can contain key values. For example, I think this one is from the actual dbt docs, or it was at some point. We've got owner, and we're saying it's @alice, and there's this other one, which is model maturity.
So we can encapsulate all of this model development and metadata together. We've got metadata as code, which, I'm sure hopefully all of you know, is great, and it's all version controlled. So that's brilliant. And then, because it's all code, we can actually integrate our metadata and use it in our CI/CD pipeline and the way we deploy.
This is just a CI/CD pipeline where code gets captured in GitHub and dbt does a compilation-like build stage. In there, I've got this green circle, which is where I'm saying we can build a hook to apply what I'm describing as automated policies on the metadata; I'll get into what [00:18:00] I mean by that next. We then use Airflow at the moment.
So we schedule dbt on a regular basis, and that writes data into Google BigQuery. One of the other hooks, which I'll touch on, is that we can synchronize this metadata to the broader ecosystem.
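The meta block described above lives in a model's YAML definition, and once dbt compiles the project, the same key/values surface on each node in `target/manifest.json`, which is what a CI hook can read. A minimal sketch, with the manifest inlined as a dict so it's self-contained (the `owner: "@alice"` / `model_maturity` keys are the example from the dbt docs that the talk mentions; the model name is hypothetical, and the exact manifest schema varies by dbt version):

```python
import json  # in CI you would: manifest = json.load(open("target/manifest.json"))

# Inlined stand-in for the manifest, so the sketch runs on its own.
manifest = {
    "nodes": {
        "model.jaffle_shop.orders": {
            "resource_type": "model",
            "meta": {"owner": "@alice", "model_maturity": "in dev"},
        }
    }
}

def model_meta(manifest):
    """Map each model's unique_id to the meta block declared in its YAML."""
    return {
        uid: node.get("meta", {})
        for uid, node in manifest["nodes"].items()
        if node["resource_type"] == "model"
    }

print(model_meta(manifest))
```

A build-stage hook that calls something like `model_meta` is all the "green circle" needs: once the meta blocks are in hand, they can be checked against policies or pushed out to other tools.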
[00:18:16] Platform governance
[00:18:16] Darren Haken: So, automating policies. One of the models we follow at Auto Trader is to group metadata together using something like a namespace or a prefix. So we've got this data governance namespace, and anything related to data governance is grouped together under it. We've come up with all these different attributes, like business owner, team owner, and that kind of thing. And the great thing about this is we can create policies.
In that build stage I described earlier, this hook in CI/CD, we've written some Python modules that inspect the dbt manifest and look to see what values exist on the metadata for all the models. That lets us define [00:19:00] policies. We can come up with any policies we want, but as an example: we want to make sure that all of our models have an owner.
And if they don't have an owner, we fail the build. Maybe we create some sort of flag for the lifecycle, and we say only production models can be used by other teams; if it's a development one or it's in testing, they can't be. Or maybe we use this PII flag and say only the security team can access that data model. And we can apply that automatically through the policy.
We've done this with inspiration from linting platforms; if anyone's used one, you'll get the idea. We've got these rules, and we've built rules. On the right-hand side is a little box, the bottom right, showing a legitimate build failure that happened in CI/CD.
No metadata existed for one of the models, so it failed to build, and it can't go to production anymore, which is amazing.
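A lint-style policy check of the kind described here might be sketched as follows. This is an illustration, not Auto Trader's actual module: the `data_governance` namespace key, the rule set, and the model names are assumptions, and a real version would read the meta blocks out of the dbt manifest.

```python
# Lint-style policies over each model's meta block. Any violation fails CI.
# The data_governance namespace and the rules are illustrative.

def check_owner(uid, meta):
    """Policy: every model must declare an owner."""
    if not meta.get("data_governance", {}).get("owner"):
        return f"{uid}: no owner declared"

def check_lifecycle(uid, meta):
    """Policy: lifecycle must be one of a known set of stages."""
    lifecycle = meta.get("data_governance", {}).get("lifecycle")
    if lifecycle not in {"development", "testing", "production"}:
        return f"{uid}: lifecycle must be development, testing or production"

POLICIES = [check_owner, check_lifecycle]

def run_policies(models):
    """Apply every policy to every model; return the list of violations."""
    return [
        violation
        for uid, meta in models.items()
        for policy in POLICIES
        if (violation := policy(uid, meta))
    ]

models = {
    "model.autotrader.sessions": {
        "data_governance": {"owner": "team-marketing", "lifecycle": "production"}
    },
    "model.autotrader.orphaned": {},  # no metadata: should fail the build
}

violations = run_policies(models)
print(violations)
# In CI you would exit non-zero on violations, blocking the deploy:
# sys.exit(1 if violations else 0)
```

This mirrors the build failure on the slide: the model with no metadata produces violations and the pipeline stops before anything reaches production.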
Then on the ecosystem side: so we've covered the automated build policies, and on the right I have that other green [00:20:00] block, which was synchronizing the metadata.
I've exploded that, and what I'm really saying here is that you can take that metadata and push it out to other technologies you're using in your modern data stack, your ecosystem. For us, we're looking at a metadata solution slash data catalog called DataHub, and we've been using a data observability monitoring tool called Monte Carlo for quite some time.
A couple of examples of how we use the metadata in that way. One of the other namespace blocks we've used is data observability: anything related to data observability, we're bundling together under a namespace of data observability. The only one that's probably worth looking at here, without going into too much detail about Monte Carlo, is this notification channel.
The idea of that is, the team building the dbt models can say: this notification channel is a Slack channel, and if there's a problem with my data model, I want you to send the [00:21:00] alert in there and tell me there's a problem with my model. And in the packages we've written in Python in CI/CD, we'll hook into that and then synchronize it into Monte Carlo.
And this is an example of that. Monte Carlo is on the left: essentially we've grabbed that metadata, posted it to the Monte Carlo API, and it's registered there. Then, when there's a problem, on the right-hand side you get this Slack alert to say there's a problem with your data model in production, it's gone wrong, and you need to look at that.
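The synchronization step might look roughly like this. Everything here is an assumption for illustration: the `data_observability` namespace, the model names, and especially the payload shape — Monte Carlo's real API is different, and the actual POST is left as a comment.

```python
# Build one notification payload per model that declares a channel in its
# data_observability meta namespace. The namespace and payload shape are
# illustrative; a real integration would follow the observability tool's API.

def notification_payloads(models):
    """Collect (model, Slack channel) payloads for models that opt in."""
    payloads = []
    for uid, meta in models.items():
        channel = meta.get("data_observability", {}).get("notification_channel")
        if channel:
            payloads.append({"model": uid, "slack_channel": channel})
    return payloads

models = {
    "model.autotrader.sessions": {
        "data_observability": {"notification_channel": "#marketing-data-alerts"}
    },
    "model.autotrader.stock": {},  # no channel declared: nothing to sync
}

payloads = notification_payloads(models)
print(payloads)
# A real sync step would then POST each payload to the tool's API, so that
# alerts for that model route to the owning team's Slack channel.
```

The key design point is that the routing decision lives with the team in their dbt YAML, and the CI hook only relays it, which is what makes the alerting scale without central coordination.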
And we've driven all of that through automation, synchronizing it into the ecosystem. The other one we're evaluating at the moment is trying to tackle discovery, and that's through a kind of central metadata store. We're looking at the tool I mentioned, DataHub, which is an open source metadata store.
The great thing about these is they can bring metadata together from all sorts of services. It makes them searchable, and hopefully it reduces tribal knowledge and reduces those Slack queries. On the right, we've actually been pulling all the metadata fields into it. [00:22:00] And it's already rolling out new features.
It's worth a look; we've also started seeing that they've grabbed this meta block and they're pushing it into all the different components, like owners and tags and everything. So it's really exciting, and it makes it fully searchable. Now I'm running out of time, so I'm going to get to the end part. Hopefully by this point you've seen the problems decentralization can cause, with lots of people working with dbt at scale.
And hopefully you can also see that metadata might be one of the ways we can tackle some of this and keep that explosion of users, because we want that, but we also want all that stability you get when you've got fewer people working on a code base.
[00:22:39] We need centralized metadata
[00:22:39] Darren Haken: So, where do we go from here if you want to start using metadata? One of the things I think we need as an industry is a centralized metadata store. Lots of tools attempt this; dbt, for example, has metadata in it. But I'm imagining you're in a similar place to me, where you use lots of [00:23:00] data tools around dbt, adjacent technologies: maybe a BI tool like Looker, a data warehouse, I don't know, Fivetran to ingest data, whatever. This diagram here is actually from LakeFS, a company who build open source data storage tooling, and they do this landscape every year. You can see there's absolutely loads of technologies. I use some of these, you'll use different ones, but the point of all of this is why we need to centralize the metadata: so we can take all of these tools and bring it into a central hub.
[00:23:29] Darren Haken: In my opinion, there are a few worth looking at out there. There's DataHub, which I mentioned we're reviewing; Amundsen is another one that's quite interesting; and we've got Marquez. There are loads of them popping up now. These centralized metadata stores aren't really a solved problem yet, but it's very exciting.
And I know that dbt is also looking at it; I've seen a few things coming up as well, so that's super exciting. One of the other things I think we need to look at is standardized metadata. These ideas of ownership and [00:24:00] governance and observability that we've come up with at Auto Trader, I don't feel like they're unique, right?
These concepts must exist across organizations. And this is why I think we need to start exploring standard ways of documenting ownership, for example. That's super important, right? Because then the ecosystem of metadata can thrive, because lots of tools can say: as long as I understand what this standardized metadata for ownership is, I can use it within my software, my tool.
Without that, we're going to end up with 500 different ways of defining ownership, and it will be really expensive to integrate it all together and have an ecosystem. There are some emerging standards out there for this: we've got OpenMetadata and OpenLineage. They're worth looking at, and they're making attempts now at building these standards.
And that, I think, is basically what we need. It's just really exciting; go and take a look at these projects. Really, the reason I put this talk together, and this is the final part for me, is [00:25:00] that for me, this is a call to arms. This metadata thing is super exciting for me. The only problem is there aren't any standards out there, and there are no incumbent metadata solutions yet.
[00:25:14] A call to arms
[00:25:14] Darren Haken: But it is really important, and I think this is where it's a call to arms for me. If you think it's important, or you believe this metadata concept is a good idea, then I think the way we can take this forward is through open source, right? If we get together, especially the dbt community, there's absolutely loads of people using dbt, we can all work together to drive these standards. Maybe we contribute to DataHub, or maybe we contribute to Amundsen, or maybe we contribute to OpenLineage; whatever it is, I'm just trying to spur people on to think about how you could contribute.
So, takeaways for the talk: distributed teams, lots of people working with dbt across lots of teams in a decentralized fashion, can introduce a lot of these challenges I've seen. I'm sure they're not unique to me. [00:26:00] Metadata is a really powerful tool that we can use to automate away some of these problems.
And I also think these missing components in industry around metadata in the modern data stack, we can solve them together by contributing. So that's what I'd have you do straight after this talk, but also buy me a Shiba. If you want to know more about Auto Trader, where I work, we write lots of blogs and everything.
We try to get out there on the engineering blog, so go and take a look. And if you're in the UK, or thinking about moving to the UK, and you're interested in a role with us, then go to careers at Auto Trader, or feel free to message me. We're always looking for people who are passionate. That's all, and yeah, I'm open for any questions. So then, Kyle, if you can jump on now? Okay.
[00:26:46] Kyle Coapman: Yeah, Darren, thank you so much. I really appreciated how you mentioned pain points throughout and how you addressed them. I have one question I'll raise, from a different Kyle, Kyle Shannon. His question is: who on what teams were involved in defining the structure of the [00:27:00] metadata you use organization-wide, and how did those conversations go?
[00:27:04] Darren Haken: Okay, yeah. This is actually a really good question; I've not been asked this before when I've talked about this. So we got the teams together, actually. A lot of it's been driven through the challenges that we witnessed.
We'd find a problem, and then we'd look at what metadata we needed to tackle it. This was one of the issues I found: I think standards are important, so we started looking at ownership first, and I thought surely there's going to be a bunch of people out there who have said, here's some really good metadata you'd want to capture about ownership. And there wasn't, actually. I spoke to passionate people in the space who shared what they'd learned in their organizations, and we got started from there. But what we did is we brought working groups together, like a special interest group or a guild around metadata in the org.
Then we pulled dbt practitioners from, say, three or four teams and said, hey, we need to move this forward, and built it out that way. But it's been evolving. We started as limited as we could, and then widened, widened the metadata.
Last modified on: Apr 19, 2022