The Future of Data Analytics
Originally presented on 2021-12-08
A panel discussion about trends in the data space. Hear from venture capitalists on what’s new and interesting in the ecosystem.
Browse this talk’s Slack archives #
The day-of-talk conversation is archived here in dbt Community Slack.
Not a member of the dbt Community yet? You can join here to view the Coalesce chat archives.
Full transcript #
Julia Schottenstein: [00:00:00] Welcome, everyone, to your last session of the day at Coalesce, The Future of Data Analytics. We’re going to spend the next 45 minutes talking about what we see as new and exciting in the data ecosystem, and hear from our panel of venture capitalists about trends they’re spotting early. For those who don’t know me, I’m Julia Schottenstein, and I’m part of the product team at dbt Labs, and I’m also the cohost of the dbt Labs Analytics Engineering Podcast. So if you like the conversation, check out an episode, and I promise lots more pontificating from me and our CEO, Tristan Handy, as well as our guests who are at the frontier of what’s happening in the data space.
And you may be asking yourself, why am I hosting the panel of venture capitalists at Coalesce? Because a year ago, I was one of them, and I made a big bet to come join the dbt Labs team because of my strong conviction in what we’re building. Before we kick it off, I invite everyone to join the conversation in the Coalesce [00:01:00] Future of Data Analytics channel in dbt Slack.
I’ll be sure to answer any questions after the event. Okay. Without further ado, let’s turn it over to our panelists for a quick round of introductions. Astasia, let’s start with you.
Astasia Myers: Thanks, Julia, super excited to be here. Hi everyone. I’m Astasia, I’m an enterprise partner at Quiet Capital, a VC fund that has a billion dollars of assets under management.
I’ve been investing in data and ML startups for about seven years, and have been really humbled to partner with about 15 companies. Some companies that you may know of are Dremio, Cohesity, Preset, Airbyte, and a host of others. Excited to have the conversation today.
Julia Schottenstein: Sarah, why don’t you go next?
Sarah Catanzaro: Awesome. Yeah, thanks for having me here. So I’m a partner at Amplify Partners. We’re an early stage venture capital firm and we focus primarily on investments [00:02:00] in technical tools and platforms. I specifically focus on investing in data and ML tools and platforms, and I actually made the opposite transition from you, moving from a role in data into a role in venture.
Jennifer Li: Hi, good to meet everyone in the dbt community. I’m very excited to be here, also with the three amazing panelists who I happen to know outside of this context as well. I’m a partner on the enterprise team at Andreessen Horowitz. Similar to Sarah, the passion and interest in data really came from my earlier career as a product manager at various B2B companies. I was the one setting up internal BI tools, product analytics, and building customer dashboards at companies like AppDynamics, which is a DevOps company. Definitely had a lot of challenges, [and] also a lot of learnings from that experience. So as soon as I joined Andreessen Horowitz three years ago, I dove into the [00:03:00] data infrastructure and analytics space. It was with my partner, Martin Casado, who we probably all met and heard from in the keynote on Monday, that we built this internal landscape of data tools and data infra companies. Since 2018, 2019, I’ve made a lot of investments, like dbt, Fivetran, Preset, and Databricks, of course, from years ago, and also published a couple of blog posts about this topic. We’re doing a refresher on our emerging architecture for modern data infrastructures, so stay tuned for that, but I’m very excited to be here talking about the future of analytics.
Julia Schottenstein: Yeah, we’re really lucky to have such an awesome panel. So, I want to start off with a big question. This is a dbt crowd, so they are bought into analytics engineering best practices. They know the benefits of modular code, testing, documentation, version control, and so on.
They never have any problems with their data reliability or quality, but so what? What does an investment in these best practices unlock for [00:04:00] teams beyond more reliable dashboards? What can they do now that they couldn’t do before? Sarah, why don’t we start with you?
Sarah Catanzaro: Yeah, absolutely. As I mentioned before, I used to be a data practitioner and I feel like I spent so much of my time, like trying to convince people that data quality matters.
[00:04:21] Data quality in today’s landscape #
Sarah Catanzaro: Now what I see among data teams is this kind of wide recognition that data quality matters. We don’t have to prove that point anymore, and so we’ve been able to drive investments in the tools focused on improving data models, improving data monitoring, testing and so on and so forth.
But I think the problem is that we’ve lost track of the "why," and I was thinking about this earlier because I don’t know if it’s controversial or not, but I, at the very least, believe that high quality data is still [00:05:00] useless if it’s not operational. If your data is not being used to drive decisions or to build better products, then it doesn’t actually matter if it’s high quality.
Now I think there’s a different set of tools that are actually needed to operationalize data, and we’re starting to see the rise of categories like Reverse ETL, so that the data can exist where decisions are being made, and like experimentation, such that data and product decision-making are more intimately tied. One of the trends that I’m very excited about is causal inference, such that we can answer questions not just about what is happening but about the "why" behind how things are happening. Ultimately, I think data quality, data best practices, and the operationalization of data are also intimately tied, because as we start using data, we can gain more insights [00:06:00] into how high-quality it needs to be: where can we maybe cut corners, and where do we need to invest more in improving data quality?
Astasia Myers: Sarah, I think you make a lot of really good points, especially around the operationalization of data. Like yourself, I’ve spent a whole bunch of time thinking about how data can be used throughout the organization. Once you have these lower layers in the stack, there’s Reverse ETL, with the great teams behind Hightouch and Census.
It’s really cool to think: as the data warehouse has collected all this data and is the single source of truth, how do we move this data into SaaS applications that can actually drive business decisions for enhanced business processes? And as we know, a lot of the Reverse ETL use cases today are around go-to-market teams who previously did not have this data, or it was developers in-house that had to build these connectors themselves, which can be really painstaking.
[00:07:00] And so by moving billing information, customer intelligence, and any feedback from marketing campaigns through these systems more regularly, a sales rep can make smart decisions about outbound in real time. I’ve even heard of really cool use cases for Reverse ETL moving beyond just go-to-market teams, even to finance teams: for example, people that would be doing their financial analysis in Google Sheets and using Reverse ETL to pipe this into NetSuite, so it could be very effective for them. This full data life cycle is really crucial for the next generation to make it usable. The second angle that I’m excited about is, now that we’ve built these different layers of the stack, how can we leverage the underlying technology across SaaS applications?
Once again, it’s not just moving the data, but even at a lower level, how do we look at the metrics? And I know dbt is very involved with this, thinking about the analog of Transform or [00:08:00] LookML. By standardizing this metrics layer, you can have this single source of the definition and a framework so business users can have a more consistent understanding of these terms, and it’s in the same vein as, and an analog of, data quality, right? Everyone should have a high, consistent understanding of data terms and make sure it’s in operational systems so that we can be smarter and more active about making decisions.
Sarah Catanzaro: Yeah. Now, I think there’s still this gap though that Reverse ETL doesn’t quite address, where we’re still talking about data.
We’re still talking about data sets and making those available, making those trustworthy. We’re not talking about actions and insights and just because some data model that exists within your data warehouse is now both trustworthy and accessible in your CRM, doesn’t mean that it’s [00:09:00] important. It doesn’t mean that it’s doing anything.
And so I’m excited to see this category continue to evolve. I think a lot of the Reverse ETL activity is orienting now more towards automation, which does have visible impact. But I think, just like we had to do in self-service analytics, we’re going to have to answer this question of: okay, now a sales rep can see their data in Salesforce, how does that change their behavior? What do they do? And that will probably require a new set of tools. It’ll probably also require a different way of thinking about organizational behaviors, structures, and so on.
Jennifer Li: Yeah. I agree more with the utility of data. These days, when we say "so what?" of data quality, the other side of the question is "who cares?" It used to be only a few people on the BI team, managing the data and sending out the reports, [00:10:00] who cared about the data they owned and the decisions that were going to be derived from it, but now it’s across the organization. Everyone cares about it, which does go back to the robustness and the accuracy, and the foundation laid by the data team. All of that, I think, explains why we started talking so much about data quality and the robustness at the infrastructure layer. Thinking about five, six years ago, it was a lot of passive data decisions, where you pulled data up once in a while and made decisions from that, and as Sarah mentioned, now it’s becoming more about the triggers and alerts, the more proactive actions of data, which I always call the first step of data applications: what can you automatically derive, or have data inform, as the business operations are going on? So that’s also the reason why I believe this infrastructure layer [00:11:00] needs to become a common ground shared by the organization, but each individual team will build their own workflow, and their own applications, and their own meaning or major apps on top of it.
Julia Schottenstein: I think if you look at Airbnb, which has an incredibly sophisticated data team, they build a lot of their infrastructure in-house, but when they actually built their metrics system, Minerva, it wasn’t to solve the dashboard problem; they had Superset for that. Sorry, I always mix up Preset and Superset. But it was really because they needed to take action, because there were gonna be real-time decisions happening on these metrics, and they needed that central source of truth to feed into their recommendation engine and their decisions on how they dealt with different kinds of customers, and so that real-time action was what forced them into this metrics layer. I’d love to hear a little bit about people’s opinions on [00:12:00] the metrics layer.
Is it going to deliver us into this new era of promised land? Once we have a metrics layer and defined, centralized business definitions, will we be able to be a lot better at doing all the other things, like taking actions with our customers, or is it over-hyped in your minds?
Astasia Myers: Yeah, happy to take that.
So something that was really exciting to me about the metrics catalogs is this idea of headless BI and the removal of the tight coupling between the visualization layer and the semantic layer, such that a single API can have a standard metric definition that can be pumped into many different SaaS applications. While I don’t think it’s a panacea that will solve all data challenges, something it really does help with is giving organizations a centralized location where everything can be defined for consistency, because [00:13:00] often, similar to data quality, there would be different definitions across SaaS tools, so when an executive or anyone in a conversation to make a decision would be looking at the result of the metric, they’d be like "in one report, it says this," and then this other SaaS tool says that, and there’s this miscommunication about the actual data itself. And so by having a consistent layer, you help enable stronger decisions because you’re not dealing with internal disagreements about the data itself, but also you have ownership around the metric and you can make sure it’s up to date.
And there is someone responsible for maintaining that, and I think when you have ownership, this actually enables individuals to have confidence that the metric is correct and that they can actually rely on it as compared to before, you would have different ideas of what the appropriate definition of MRR is, which could challenge companies.
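To make the "single definition, many consumers" idea above concrete, here is a minimal sketch in Python. It is not how Transform, LookML, dbt, or any other product mentioned in this talk works; the `Metric` class, the `METRICS` registry, the `compile_metric` helper, and the `subscriptions` table with its `monthly_amount` and `plan` columns are all hypothetical names invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One centrally defined, owned metric (all fields are illustrative)."""
    name: str
    owner: str        # the team responsible for keeping the definition current
    expression: str   # SQL aggregate that defines the metric
    table: str        # hypothetical warehouse table the metric reads from


# Hypothetical central registry: the only place where MRR is defined.
METRICS = {
    "mrr": Metric(
        name="mrr",
        owner="finance-analytics",
        expression="SUM(monthly_amount)",
        table="subscriptions",
    ),
}


def compile_metric(name: str, group_by: tuple[str, ...] = ()) -> str:
    """Render the SQL that every consumer (BI tool, CRM sync, notebook) would run."""
    metric = METRICS[name]
    dims = ", ".join(group_by)
    select = f"{dims}, " if dims else ""
    sql = f"SELECT {select}{metric.expression} AS {metric.name} FROM {metric.table}"
    if dims:
        sql += f" GROUP BY {dims}"
    return sql


print(compile_metric("mrr"))
print(compile_metric("mrr", group_by=("plan",)))
```

Because every consumer asks the registry for the query instead of writing its own, two reports cannot silently disagree on what MRR means, and the `owner` field records who maintains the definition.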
Sarah Catanzaro: Now I think metrics are really [00:14:00] interesting because there are two different kinds of value propositions that I see that also impact different stakeholders, and I think Astasia addressed the value of having consistent metrics and being able to articulate your business domain model as a data model, as a set of metrics. That, I think, is incredibly valuable, but the other value is perhaps like almost more technical or valuable in that metrics potentially unlock new paradigms for application development. So I think it’s easy to forget sometimes, because the performance is getting, like, better and better, but like, Snowflake, BBB, these other data warehouses, they’re analytical databases.
Like they were not actually designed to power applications. So if you want to build your own in-house BI tool, if you want to build an experimentation platform and leverage the data that you have [00:15:00] within your data warehouse, like, that’s not really feasible right now. So I think one of the other benefits of metrics layers that is often not as widely discussed is the ability to enable fast slice and dice. It’s the ability to actually build apps using those metrics, and I think that’s something that I’m really excited about too, including some of the more recent efforts to not just have metrics tools, but to have open source metrics tools. Now you can actually separate out the data from the application. I don’t know exactly what new tools that’s going to unlock beyond experimentation, BI, etc., but I have a feeling that could have a really strong impact on the future of data.
[00:15:50] What exactly are data apps? #
Julia Schottenstein: Does anyone want to take a stab at defining a data app? I know that lots of venture capitalists talk about data apps as being really exciting, or the future of notebooks. [00:16:00] What do we actually mean by that when we say these words?
Jennifer Li: In my head, the data application may not be a full-blown application. It could just be data-powered features; it could be where data is being used to trigger certain actions. Data used to be very static. It was a dashboard or a chart we’d look at, but now these become events and nudges, actions in applications that may not be data-centric, where data is a feature, or it can be a full-blown analytics application that is people’s operating ground, and they’ll refer to it for what the next step is. And it could also be just machine learning applications that are largely powered by data as well.
Julia Schottenstein: Yeah, I love that. I don’t know if anyone else was able to catch the Materialize talk earlier in the conference, but they described a use case where Drizly is now powering, like, abandoned cart notifications with their data team, and so [00:17:00] it’s a real shift from the data team owning static data or batch data to taking real-time action.
Now with just SQL, you can say "hey, you didn’t check out when you were supposed to have checked out this cart. It’s been abandoned. We’re going to send you a notification." And traditionally that was really owned by software developers or your front end team and your backend systems, but now the data team is having a voice in that conversation.
And so for me, it’s this new set of tools that are really empowering people who know SQL and taking them to the next level of being able to build production systems, which is super exciting for me.
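As a rough illustration of the "now with just SQL" point above, here is a minimal sketch of an abandoned-cart check, run from Python against an in-memory SQLite database. It is a polling-style approximation under stated assumptions, not the real-time system described in the talk: the `carts` table, its columns, the one-hour threshold, and the `find_abandoned_carts` helper are all hypothetical.

```python
from __future__ import annotations

import sqlite3

# Hypothetical schema and sample rows, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE carts (
        cart_id         INTEGER PRIMARY KEY,
        user_id         INTEGER NOT NULL,
        last_updated_at TEXT NOT NULL,  -- 'YYYY-MM-DD HH:MM:SS'
        checked_out_at  TEXT            -- NULL until the user checks out
    );
    INSERT INTO carts VALUES
        (1, 101, '2021-12-08 09:00:00', '2021-12-08 09:05:00'),
        (2, 102, '2021-12-08 09:10:00', NULL),
        (3, 103, '2021-12-08 11:50:00', NULL);
""")

# A cart counts as abandoned if it was never checked out and has been idle
# for at least an hour (the threshold is an assumption for this sketch).
ABANDONED_CARTS_SQL = """
    SELECT cart_id, user_id
    FROM carts
    WHERE checked_out_at IS NULL
      AND last_updated_at <= datetime(:now, '-1 hour')
    ORDER BY cart_id
"""


def find_abandoned_carts(now: str) -> list[tuple[int, int]]:
    """Return (cart_id, user_id) pairs that are abandoned as of `now`."""
    return conn.execute(ABANDONED_CARTS_SQL, {"now": now}).fetchall()


for cart_id, user_id in find_abandoned_carts("2021-12-08 12:00:00"):
    print(f"notify user {user_id}: cart {cart_id} was abandoned")
```

The whole business rule lives in the SQL string, which is the part a SQL-fluent data team would own; the surrounding Python only stands in for whatever system delivers the notification.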
Jennifer Li: Yeah, totally agree. And that’s definitely one of the major trends we spent a lot of time looking at, and as I mentioned, it doesn’t have to be full blown apps.
Data will be a very critical differentiation, if not for every application, then for most applications in the next generation, just because application development has come to a mature stage where it is really about how we [00:18:00] manifest and utilize data to build differentiation and personalization.
And I would say dashboards are only one of the interfaces. There are many other interfaces where data is being consumed. We recently made an investment in a company called Matik. It is actually a dynamic PowerPoint slide generation company, but if you think about it, the secret power of that is that instead of people copy-pasting data from their dashboards or from Salesforce all the time, the data team writes SQL queries directly against their data warehouses, [and] that’s automatically put into Google Slides or PowerPoint, and that’s a very smart way of utilizing data and automating a lot of mundane workflow. And again, that’s just data as a feature, and of course there are many full-blown data applications, like Panther or Indicative, that are building the next generation of Splunk or Amplitude on top of data warehouses.
And of course the Retools of the world are really giving power to the SQL-proficient users to, again, like you said, Julia, build [00:19:00] applications for internal use cases.
Julia Schottenstein: Yeah. So just to underline that, you just described security, a massive industry, and internal tools, which are usually all custom, now being built off of your data warehouse. That’s super powerful.
Sarah Catanzaro: And I think ultimately we’re talking about three different things though. And I worry sometimes that by focusing on what I’ll describe as the first two, we somewhat underestimate or undervalue the third. When I used the words "data apps" before, I was specifically talking about SaaS applications that are built on top of the data warehouse.
So we talk about those like new tools, new platforms, which may be either internal tools or platforms that a vendor will sell, that our data warehouse needs. Then we have this set of data apps that I think you described, Julia, that are really [00:20:00] enabling the data team, the analytics engineering team, to build external products, to build products that the users [and] the customers are going to interface with.
But ultimately I think there’s this third category that deserves some discussion. Sorry, the mail is here. Apparently it doesn’t deserve any discussion. Anyway, there is a third category that deserves some discussion, and that category, I would say, is really the way in which we enable others within a company to make decisions using data.
Going back to what I was saying earlier, it’s not enough to just give somebody a spreadsheet. It’s not enough to just put a lead score in Salesforce. I think the ability to create a more interactive experience, the [00:21:00] ability to transform that data into a product that other people on your team interface with, is another form of data apps that we should be discussing too.
Astasia Myers: Totally. I love that you broke it down into three categories. This first category that you mentioned about SaaS applications no longer having their own database layer, and it’s really the cloud data warehouse [that] is the backend for these apps is super cool. And you can see this for like PLG, CRM tech, like Endgame and Pocus.
You could see it on the finance side with businesses like Puzzle. Jennifer highlighted Panther. It’s really cool. Once again, we’re seeing this divergence between the visualization layer, where the intelligence is shown to people, and the lower infrastructure stack, and it just allows teams that have made that infrastructure investment to get the most leverage out of a commitment that they’ve already [made].
The second category that you [00:22:00] highlighted, about exploratory data analysis through a data app, [is] usually tied to some type of notebook experience where a more technical user is doing the analysis, and that’s super neat, cause I think that’s the vision that BI tools like Tableau and Looker had 10 years ago: "let’s get data, let’s have insights and understand the parameters or levers behind our business." What can we change to make smart decisions? And it’s lived up to that, but with a shared workspace, a technical person, like an analytics engineer or a data analyst, can dig deeper into the data and then share that across their team with their non-technical partners in a way that is engaging and dynamic. The partners themselves can change parameters to do their own lightweight exploration, and it’s also an enduring asset that can be shown and have a history tied to it. It’s pretty darn cool. [00:23:00] It’s something that, I think if you talk to a lot of the senior leadership at BI solutions, they would say "yeah, we tried to do this. We’re trying to get there." And this is a nice complementary piece of technology for this exploratory research.
Sarah Catanzaro: Yeah. I don’t know if any of you have seen the digital journal "The Pudding," but they basically have all of these kinds of interactive visualizations as well as some really cool and exciting data journalism.
And when you start just creating new ways of interacting with data and communicating with data, when you enable people to have this like dialogue with data, it is just so cool. I guess if I have like any suggestion, it’s go check out The Pudding because I hope in the future, like that is how we’re going to be like interfacing with data, not just in our personal lives, but in a business context.
Julia Schottenstein: Totally. I think like data [00:24:00] storytelling is becoming increasingly important, and people have different perspectives when looking at the same two numbers. It’s really helpful if you can add a little bit of extra detail on your key takeaways and how you interpreted this, and try to create alignment with the teams so that you can either take action or have the same shared understanding, and I think that’s what you get in notebooks or these kinds of apps, where you have a bit more of a custom experience of how your employees interact with data.
I wanted to switch gears a little bit. It wouldn’t be a panel with venture capitalists if we didn’t talk about machine learning. [We] talk a lot at this conference about the analytics stack or the data stack, but there’s also a different stack, which is the machine learning stack.
Traditionally, the tools have been pretty different. Also, the people who spend time on these tools tend to also be a little bit different. Do we think that there’s going to be a convergence of these two stacks? Is it going to be more of a continuum? What are you seeing of the blend of worlds, of [00:25:00] analytics engineering and machine learning?
Sarah Catanzaro: As you mentioned, I think for the past several years, the analytics stack and the machine learning stack have been rather separate. And I think that’s in large part because there haven’t been great ways to do polyglot analysis, to work in both Python and SQL, and also because, frankly, both the field of analytics and the field of machine learning were relatively new, and most teams were trying to figure out how to work, how to organize their teams, [and] what people’s responsibilities should be. That said, I think one of the things that I’m excited about is starting to see more collaboration between analytics engineering and machine learning.
And I’ve seen that most clearly looking at the [00:26:00] intersection of features and metrics, or features and data models. I see increasingly, instead of implementing new feature stores, many machine learning teams are basically saying: our analytics engineering team is spending all of this time improving the quality of our data models.
Like maybe we should be using the data models as features. And so you see machine learning teams using a stack that kind of looks like Snowflake plus dbt, plus maybe a framework like Metaflow or something that is a little bit more training and deployment specific. But it’s definitely something that I am very curious to follow in the coming months. Will this be the convergence of features and metrics? There are some disparities in terms of what you need, like a metrics layer might not address issues associated with online [and] offline inconsistencies, but [00:27:00] there’s a similar requirement where it feels like there could be one universal solution.
Jennifer Li: Yeah, even though we talk a lot about the great convergence in our post, I’d still argue that the two stacks are fairly separated, analytics and production machine learning especially.
From my point of view, that’s because of the product interfaces, the teams’ skill sets, and also [the] working dynamic: one is already more in the engineering philosophy of the software development life cycle, versus data, which is still emerging into having more and more tools to be like that. I think we’re still in the early to middle phase of that.
I think that’s really driving the mechanics behind the technical stack. I do believe the infrastructure layer is crossing over more and more, and it’s already happening, like we see the overlaps between Snowflake and Databricks. The orchestration and pipeline tools are being shared more, and definitely the metrics layer will become the next one, because the separating of the layer does enable [00:28:00] teams to not rebuild the same architecture, the same infrastructure, again and again, but just build different products on top of it.
But when it comes to model training and deployment, [and] model building, it’s still fairly separated. And then the cadence and the type of products the teams are deriving out of data are still fairly different. Where I do have a lot of interest, and [what] I’m excited about, is what vertical applications are being built on top of these.
I believe because of the shortage of machine learning expertise and engineers, there’ll be more and more vertical applications, such as e-commerce, logistics, or sales and marketing, that are going to be like 80% pre-trained models with a bit more customization on top, or maybe local solutions.
I feel like that’s where the future could be like sharing the same infrastructure, but having a little bit more higher level tools on top of that.
Astasia Myers: I think you [00:29:00] make a great point, Jennifer, like this idea of analytics being the data warehouse, and then operational and ML being the data lake, and I think that’s going to be a continuum more than we believe today because it really depends on the business priorities. If you are a company where ML is nice to have, you’re going to start with your data warehouse, and then you’re gonna adopt solutions like MLflow or Continual.ai on top to do an augmentation when it is appropriate, versus having to stand up an entire ML infrastructure suite, like SageMaker [or] Databricks, cause it is core to your business value prop. And so with this convergence over time, it’s just going to be based on the business priorities and the stage at which they’ve done their initial adoption, but I don’t think it just stops at the infrastructure layer.
We were just talking about data apps and notebooks. That’s something that came out of the ML world that now we’re seeing being brought to [00:30:00] analytics engineers and data analysts. So it’s really cool that these best practices can be shared across teams, and it’s not just the tools, but it’s also the fact that the barrier to entry for adoption is getting lower than ever before, and what an analytics engineer can do now with SQL was impossible five years ago, based on the interfaces, to your points. For me, it’s somewhat of a continuum for the tech. It’s also the app layer, and the third point is that less technical people can be enabled to do more than ever before. Like with Continual.ai, you can be a data analyst and build an entire model on top of a cloud data warehouse. Like holy moly, before you needed a PhD, like that’s pretty darn cool.
Sarah Catanzaro: Yeah, I think, Astasia, you implied something that I strongly believe too, which is frankly, analytics engineering tools are just much better than [00:31:00] ML tools right now, like most ML tools and platforms are like super shitty. And frankly, the experience of using the data lake sucks. And so like at a minimum, one of my hopes is that like the trends that we see in analytics engineering, the focus on not just building tools that are powerful and scalable and highly performant, but building tools that enable analytics engineers to be more productive, that enable new people to explore the discipline of analytics engineering, [and] that we’ll see similar patterns in ML. Like when I look at the ML ecosystem, it feels like there are tools and platforms for everything ranging from distributed training, to model monitoring, to experiment tracking, and you talk to ML teams and despite having so many different point solutions, they’re still really struggling in a way that I [00:32:00] don’t think analytics engineering teams are struggling anymore. And to me, like the only answer can just be that like the tools are crap, and they’re crap because they were built to support [and] help teams implement these big models that were trained on petabytes of data. They were not built with developer ergonomics in mind. So I really hope that like analytics engineering and analytics engineering vendors can raise the bar for help. Cause man, like Spark is just not fine.
Julia Schottenstein: What’s so striking to me is the same problems that I think we’re trying to handle in the analytics engineering workflow are very similar to the pain points I’m hearing. So before, analysts were like "I’m the one responsible for the business project.
I’m the one who’s writing [00:33:00] the SQL, but I can’t deploy my models." Hence the engineering analytics, or the analytics engineering movement, where the same person could be the one who writes the models and also deploys them, and we empower that human to go do great things. The same challenge is what I constantly hear in ML: "hey, I have this model. I’ve trained it, but I haven’t yet figured out how to productionize it and deploy it." And that’s really the hard part, and what really still resonates with me is you’re still using the same underlying data. You’re going to train your models on your business data or whatever data you have and you need it in a nice clean format because that’s how you’re going to get your model weights and that’s what you’re going to use for all your predictions, so it feels to me that we’re certainly not there yet. We are getting better with these like lightweight things. Astasia mentioned like continual, I was also like super excited to see this other like ML or Python on dbt called fal, but I [00:34:00] think that there has to be a second wave where it just gets a whole lot easier, cause the problems are the same and the data is the same. Awesome. So I think we’re almost out of time. I think we’re going to have a couple of quick questions and then we’ll wrap up. But one of the things that’s just so striking to me is that there’s so much capital going into the data space. Just in the past month, we’ve seen fundraisers from Elemental, Hightouch, ThoughtSpot, Datafold, Superrain, ClickHouse, the list goes on. Why are investors so hot on data?
[00:34:37] Investors jumping on data
Sarah Catanzaro: So my very like optimistic view is they’re hot on data because data is hot right now, because we have finally hit that like important?
I guess what I’m getting at is I don’t think that all of these companies are overvalued. Certainly I feel like it’s getting more competitive [00:35:00] and there probably won’t be 10 data quality monitoring companies that end up IPOing, but will there be at least one? Likely. Again, that’s because we’ve hit a point at which many companies recognize the importance of data and are thinking more about things like experimentation, things like reporting, and metrics. [For] the companies that are raising at high prices, like maybe they haven’t de-risked their business entirely yet, but like they’re growing, like they’re not just raising massive rounds because [they’re] good at telling stories, like they’re actually quite successful.
So I think there’s a lot of things to be cynical about in looking at the venture and startup ecosystem, but I think we need to give ourselves credit too. dbt is where it is, including from a funding perspective, because [00:36:00] like, y’all are kicking ass. That’s my positive view.
Astasia Myers: Yeah. I think there’s other factors that are going on here as well. We have this movement to the public cloud that’s enabling new technologies that people want to adopt. We also saw, five to six years ago, Informatica was taken private, and they were a huge player in this space, but they were on-prem focused versus cloud.
So they weren’t able to capture this new audience of buyers. We had the rise of companies like Uber, Airbnb, Stitch Fix, and Spotify, who just came out the gate data-centric and did huge, amazing things. And their competitors were like, "oh shoot, like we need a data strategy. They’re kicking our butt without it.
What are the tools and solutions that we should be adopting to enable our business to be successful?" And then let’s not forget that the Snowflake IPO raise over the past year is the biggest enterprise software IPO that has ever happened. And people have woken up. It’s like they [00:37:00] forgot Oracle existed, and they’re like, "oh shoot, you can make a lot of money and be part of amazing journeys by being part of a data company."
And so with that in mind, they’re starting to think about, "oh, this is a $65 to $70 billion market. What are all the different adjacent offerings that we can think about in the context of a data lake or a data warehouse?" And with the increased number of personas that are touching data, you guys have talked about the rise of the analytics engineer role that did not exist a few years ago. You’re starting to get specialization of data tools, and so that leads to a whole bunch of amazing startups coming out with great tech that enable a business to do something [they] weren’t able to before with a much larger audience of users than we’ve ever seen.
Jennifer Li: Yeah, I completely agree with both what Sarah and Astasia mentioned, and also think it goes back to the point I [00:38:00] made earlier: analytics tools are important and hot, but data becomes features of every individual company and every software company, and it means a lot of overhauling, tooling, or investing in tooling, also investing in the teams that build the foundation, whether it’s for analytics or machine learning.
We see the growth of roles in data from our internal portfolio companies and from public stats on LinkedIn. It almost looks like 10 years ago, when we were talking about the rise of developers from a few million to 40-50 million people, and this is providing tools for the developers. So it’s very much underpinned by the rise of workforce in the US.
And also, it’s just very rare to see the scale [where] Snowflake and Databricks are at today, where there are hundreds of millions in revenue [and] still growing 3x year over year, and that’s just never heard of before, I’m sure. We all know the benchmarks, like companies at that [00:39:00] phase go back to the 50, 70 percent. That’s the best in class.
So we’re really seeing a huge market tailwind and a wave of growth in this category, and we’re seeing a multi-trillion dollar market unfolding in front of our eyes, both underpinned by the fundamentals, as well as the public market rewarding it.