I’ve been thinking a lot about the analytics stack recently. My friend and ex-coworker David Wallace wrote a brilliant post about how analytics technology should be modular, and I couldn’t agree more. Modular analytics tools allow companies to choose components to fit their needs and switch out or augment components as their needs—or the underlying technology—change.
I’ve written in the past about the modern, SaaS-based analytics stack. In short, the architecture looks something like this:
Consolidate data into a high-performance cloud data warehouse, transform it, and then analyze it in your BI tool of choice. The one piece that’s missing from this diagram is event collection—capturing and storing event stream data from your web and mobile applications. But otherwise, this is really what the analytics stack at modern, tech-forward businesses looks like today.
All of this technology, today, is configurable in less than a day and can be bought for less than $1,000 a month. At Fishtown Analytics, we’ve implemented this stack for over 40 startups at this point, and know its strengths and its weaknesses inside and out.
What’s somewhat surprising to me is how little has changed in the past two years, in a space that has changed so much in the past five. Since we started Fishtown Analytics in July of 2016, the fundamentals of this stack have remained the same. The products in it have each gotten better (sometimes in very meaningful ways), but there are no new entrants who have changed the game. We still feel that the products of two years ago are the best in their respective categories:
- ETL: Fivetran and Stitch
- Events: Snowplow
- Warehouses: Redshift, BigQuery, and Snowflake
- Transformation: dbt
- BI: Looker and Mode
As someone who has witnessed first-hand the speed with which the analytics tech stack can be shredded and rebuilt, this makes me uncomfortable. As a business, the quality of our work and the strength of our reputation are based on implementing the very best analytics tools.
Are we getting complacent? Sticking to what we know? About to get hit by a freight train of next-generation technology that we don’t know anything about?
Famous last words: I don’t think so. Here’s why. Monolithic analytics technologies of the past were faddish; when a new one came along you had to replace your existing one. Modular analytics technologies build on top of one another. When new technologies come along they are likely to augment existing technologies.
So: I think the existing layers of the stack are here to stay. They will get better. Maybe there will be a new data warehouse that processes its relational algebra in GPUs rather than CPUs. That will be awesome, but it won’t change the fundamental role of the warehouse layer. There will still be products that are really good at ETL, event collection, warehousing, transformation, and BI.
Instead, I believe that there will be many new layers of the stack that do not exist today. In working extensively with the current stack, we’ve had plenty of opportunities to identify gaps, areas where there’s an unmet need. Many of these gaps don’t feel that important today because we’re still in the early adopter phase of this technology and early adopters are very forgiving. But if our experiences are any kind of guide, there’s a lot of technology to be built.
Here are some notes on the gaps we’ve identified and how I think products could take shape to address them. Caveat emptor.
The Next Layers of the Analytics Stack
Automated data cleansing
Data loaded directly from production systems is really messy. And data cleansing is painstaking. It involves manual effort to identify and resolve each individual problem in the data. Often the ROI on having a human find and fix each of the numerous problems isn’t there.
I think that there will be an entire class of products that simply plugs into your data warehouse and analyzes your existing and new data for patterns it identifies as problematic:
- referential keys that don’t exist
- null values that shouldn’t be there
- invalid addresses and emails, and
- so many more things (you get the idea).
It will triage these issues for you and suggest resolutions. You can intervene or have it apply its best guess correction. At first this will likely be basic, but I’m confident that over time the application of ML and data network effects can enable the creation of compelling products in this space.
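To make the kinds of checks listed above concrete, here is a minimal Python sketch of the detection step only. Everything in it is an assumption for illustration: the sample `customers` and `orders` rows, the `find_issues` helper, and the deliberately loose email pattern. A real product would read from the warehouse and learn its rules rather than hard-code them.

```python
import re

# Hypothetical sample rows; in practice these would come from warehouse tables.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": None},
]
orders = [
    {"id": 100, "customer_id": 1},
    {"id": 101, "customer_id": 2},
    {"id": 102, "customer_id": 9},  # points at a customer that doesn't exist
]

# Deliberately loose pattern (an assumption): something@something.tld
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_issues(customers, orders):
    """Return a list of (check, table, row id) tuples, one per problem found."""
    issues = []
    known_ids = {c["id"] for c in customers}
    # Referential keys that don't exist.
    for order in orders:
        if order["customer_id"] not in known_ids:
            issues.append(("missing_reference", "orders", order["id"]))
    # Null values that shouldn't be there, and invalid emails.
    for customer in customers:
        email = customer["email"]
        if email is None:
            issues.append(("unexpected_null", "customers", customer["id"]))
        elif not EMAIL_PATTERN.match(email):
            issues.append(("invalid_email", "customers", customer["id"]))
    return issues


issues = find_issues(customers, orders)
for issue in issues:
    print(issue)
```

The interesting part of a product in this space is what comes after this step: triaging the issues and suggesting or applying corrections, which this sketch doesn't attempt.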
Column- and row-level data security
Today, security is applied at two stages: the database and the BI layer. Databases have a fairly unsophisticated security model to limit what individual users can do, typically defined at the object level. BI products often have much more security functionality but this functionality isn’t helpful in a modular world: these security rules have no effect when a user logs in directly to the warehouse in their Jupyter notebook or SQL client.
The analytics stack needs a common security layer that governs all data access. This can be provided by a product that acts as a proxy, intercepting all ODBC / JDBC requests and applying a security model on top of them. All user access would be provisioned via this layer. This would allow the creation of security rules that historically have been very challenging to implement, such as hashing user email addresses when queried by non-admin users.
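As a rough illustration of the email-hashing rule, here is a minimal Python sketch of the policy step such a proxy might apply to query results before returning them. The `MASKED_COLUMNS` policy, the role names, and the `apply_policy` function are all hypothetical. A production system would likely use a salted or keyed hash and enforce this at the protocol level rather than on Python dictionaries.

```python
import hashlib

# Hypothetical policy: columns to mask for users without the admin role.
MASKED_COLUMNS = {"email"}


def hash_value(value):
    """Return a SHA-256 hex digest of a string value; None is left as None."""
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_policy(rows, role):
    """Return rows unchanged for admins; hash masked columns for everyone else."""
    if role == "admin":
        return rows
    return [
        {
            col: hash_value(val) if col in MASKED_COLUMNS else val
            for col, val in row.items()
        }
        for row in rows
    ]


rows = [{"user_id": 1, "email": "ana@example.com"}]
admin_view = apply_policy(rows, "admin")
analyst_view = apply_policy(rows, "analyst")
print(admin_view[0]["email"])    # the raw address
print(analyst_view[0]["email"])  # a 64-character hex digest
```

Because the same input always hashes to the same digest, non-admin users could still join and count on the masked column without ever seeing the raw address.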
Automated entity resolution
It’s very challenging today to know whether two records sourced from two different customer databases actually refer to the same person. Every company implements a slightly different set of heuristics to attempt to answer that question, but everyone knows that these heuristics are just slightly better than nothing. There will be products that probabilistically link records together from within a given system or across multiple systems.
This is not a new product category by any means. Enterprises have had “customer data platforms” for a long time and this is one of their core features. The difference here is that this functionality will be provided in a modular way as a part of the warehouse-focused analytics stack, not as a part of a monolithic product.
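To give a feel for score-based record linking, here is a minimal Python sketch. It uses plain string similarity from the standard library as a stand-in for a real probabilistic model, and the field weights, the 0.85 threshold, and the sample `crm` and `billing` records are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1] after lowercasing and trimming whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(rec_a, rec_b):
    """Weighted score across fields; the weights are illustrative assumptions."""
    email_score = similarity(rec_a["email"], rec_b["email"])
    name_score = similarity(rec_a["name"], rec_b["name"])
    return 0.6 * email_score + 0.4 * name_score


def link_records(source_a, source_b, threshold=0.85):
    """Return (id_a, id_b, score) for each cross-source pair at or above threshold."""
    links = []
    for a in source_a:
        for b in source_b:
            score = match_score(a, b)
            if score >= threshold:
                links.append((a["id"], b["id"], round(score, 2)))
    return links


# Hypothetical records from two different customer databases.
crm = [{"id": "crm-1", "name": "Jon Smith", "email": "jon.smith@example.com"}]
billing = [
    {"id": "bill-7", "name": "Jonathan Smith", "email": "Jon.Smith@example.com"},
    {"id": "bill-8", "name": "Maria Lopez", "email": "maria@example.org"},
]
links = link_records(crm, billing)
print(links)
```

Comparing every pair is fine for a toy example but scales quadratically. A real product would need some way to narrow the candidate pairs before scoring them.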
Demographic enrichment
Companies now have their customer data loaded into high-performance databases, but it’s shocking just how infrequently they pair that with demographic enrichment. Out of 40 companies we’ve worked with, only a single company has ever invested the time to grab gender and age information on their customer list to use in segmentation, and with good reason: it takes a surprising amount of work to build this today.
There are solid API-based data enrichment products today (we’ve used both Clearbit and FullContact), but none of them provide native integrations with data warehouses. In the future, either these platforms will provide native integrations or someone will build that glue for them. Companies will be able to check a box and import demographic data for their customer list on an ongoing basis.
Anomaly detection and notification
The analytics data stack has become the single place that all of a company’s data lives together. Today, that data is commonly being used for macro-scale analysis, but it’s almost never being used for organization-wide monitoring and alerting. Use cases abound:
- website traffic fell 50% in the past 15 minutes,
- orders from Eastern Europe using Bitcoin have spiked, or
- customer service response time has grown from 5 minutes to 30 minutes.
Your data warehouse already has all of this information, but you just don’t have the technology that actually monitors your various data streams to give you notifications for the things you care about. This will change, and when it does your analytics data stack will begin to function as the nervous system for your entire company, not just providing executive function but acting as the sensory organs as well.
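As one possible shape for the monitoring step, here is a minimal Python sketch that flags a new measurement when it falls far outside recent history, using a simple z-score. The threshold of 3 standard deviations and the sample traffic numbers are assumptions. Real metrics have seasonality and trends that this approach ignores.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than z_threshold standard deviations from the history mean."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != baseline
    return abs(latest - baseline) / spread > z_threshold


# Hypothetical page views per 15-minute window.
traffic_history = [1000, 1040, 980, 1010, 995, 1025, 990, 1005]
normal_alert = is_anomalous(traffic_history, 1015)  # within the usual range
drop_alert = is_anomalous(traffic_history, 500)     # roughly a 50% drop
print(normal_alert, drop_alert)
```

The detection logic is the easy part. The harder product problems are deciding which of the many data streams in the warehouse to watch and routing notifications to the people who care.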
Piping data back to systems of record
Data engineers and analysts are spending so much time ingesting, cleansing, and modeling data in their warehouses, but that work almost never flows back into the systems that actually run the business. Messaging platforms, CRMs, and inventory management systems will, in the future, not just be data sources, they’ll be data destinations.
Rather than pushing events to all platforms (a la Segment), all data will be ingested via the analytics tech stack into a data warehouse. After being processed by all of the various layers, products will emerge to pipe the resulting data back to the systems of record, allowing targeted marketing, personalized sales conversations, and smarter inventory management (among others).
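One small piece of that pipe-back step can be sketched in Python: sending a destination only the rows that have changed since the last sync. The `sync` function, the in-memory `state` dictionary, and the `send` callable standing in for a destination's API are all hypothetical. No real CRM or messaging API is being modeled here.

```python
def rows_to_sync(warehouse_rows, last_synced):
    """Return rows that are new or differ from what was last sent, matched by id."""
    return [row for row in warehouse_rows if last_synced.get(row["id"]) != row]


def sync(warehouse_rows, last_synced, send):
    """Send only changed rows via `send`, record them as synced, and return the count."""
    changed = rows_to_sync(warehouse_rows, last_synced)
    for row in changed:
        send(row)
        # Store a copy so later edits to the source row are detected as changes.
        last_synced[row["id"]] = dict(row)
    return len(changed)


sent = []   # stands in for a destination system such as a CRM
state = {}  # what the destination was last told, keyed by row id
segments = [
    {"id": 1, "segment": "high_value"},
    {"id": 2, "segment": "churn_risk"},
]

first = sync(segments, state, sent.append)   # both rows are new
second = sync(segments, state, sent.append)  # nothing has changed
segments[1]["segment"] = "retained"
third = sync(segments, state, sent.append)   # only the updated row
print(first, second, third)
```

Tracking sync state like this keeps the number of calls to the destination small, which matters when the destination is a rate-limited third-party API.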
Data explainability
The more data sources get integrated and the more uses for the data in this stack, the more its users will demand explainability. Data consumers throughout a business need to understand where data came from and how it’s been manipulated in order to trust it for their use case.
Currently, analysts are solving for this via documentation. At some point there will need to be a richer, more automated mechanism for end users to explore data provenance, because complexity will simply be too high otherwise.
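A simple way to picture automated provenance is as a graph walk: given a record of which models select from which tables, trace any dashboard back to the raw sources that feed it. Here is a minimal Python sketch under that assumption. The `LINEAGE` mapping and every model and table name in it are invented for illustration.

```python
# Hypothetical lineage: each model maps to the tables or models it selects from.
LINEAGE = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw.shop_orders"],
    "stg_customers": ["raw.crm_contacts"],
}


def upstream_sources(node, lineage):
    """Return the sorted raw sources (nodes with no recorded parents) that feed `node`."""
    sources = set()
    visited = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        parents = lineage.get(current, [])
        if not parents and current != node:
            sources.add(current)
        stack.extend(parents)
    return sorted(sources)


sources = upstream_sources("revenue_dashboard", LINEAGE)
print(sources)
```

The traversal itself is trivial. The hard part for a real product is building and maintaining the lineage graph automatically across every layer of the stack.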
This is just the beginning…
Pulling back from the weeds for a second: I truly believe that this is just the beginning for the development of this tech stack. We’ve seen decades of enterprise software built in this space, but these companies rise and fall without pushing the overall field forward; their proprietary software dies with their ability to convince the next enterprise buyer to sign a seven-figure contract.
Now that this technology is being built modularly, targeting widespread adoption and with a focus on open source and open standards, I believe we will see steady forward progress towards realizing the longstanding dream of listening, processing, and reacting to company data using a single cohesive set of technologies.
I know that sounds lofty. I tend to like to keep my prognostications rather more grounded, but I really do believe this is the direction the industry is moving. If you disagree, please, call bullshit in the comments.
Thanks for reading :)
Last modified on: Apr 25, 2022