In December of 2020 at Coalesce, nearly 800 people tuned in live to hear David Murray, director of data and analytics at Snaptravel, share his team’s experience with data team org structure. Over the last four years, the data team at Snaptravel has grown from one analyst to almost a dozen, and they have tried five different data team structures over the course of nine months. This is a legitimately difficult problem!
Org structure is challenging for everyone in the industry, not just for fast-growing teams. I lead the customer success team at Fishtown Analytics. We’ve worked with data teams of all shapes and sizes—from data teams of one to massive enterprise teams comprising dozens of engineers and hundreds of analysts. We get asked questions about the ideal data team structure all the time, so I, like the rest of the dbt community, was interested in this talk.
This article recaps some of David’s key insights from the talk along with my own commentary and some thoughts from the dbt community. I highly recommend watching his talk or reading his blog post on the topic.
This topic clearly resonated with a lot of folks, and I think it’s worth considering why that is. This problem is not unique to Snaptravel in any way. It’s something we’re all thinking about right now. My take? Data team structure is difficult because data technology has changed so rapidly over the past five years, and this has had a cascading effect on what data people do.
Ten years ago, the most challenging problem a data team faced was managing compute and storage resources. James Densmore, currently the director of data infrastructure at HubSpot, described the old challenges of data management in a recent blog post:
“…getting a columnar database in the early 2010s meant investing in some serious bare-metal. For the young folks out there, that means you had to buy a physical server rack filled with what we now refer to as compute and storage. (as an aside, if you’ve never had the pleasure of managing physical hardware in a freezing cold server room, find a way to visit a data center. It makes what we do feel more “real”)”
At this time, data analysts had no choice but to request changes to the data warehouse and patiently wait for the data engineers to deliver. Modern cloud warehousing completely upended this relationship. James goes on to write:
“…this new breed of data warehouses made it possible (and economical) to store and query far higher volumes of data than ever before. From there, data engineers found they could focus their efforts on efficient ingestion (Extract-Load) of data into warehouses where data analysts could flex their SQL muscles to model the data (Transform) on their own. ELT not only saved the data warehouse, but it led to the restructuring of data teams and the emergence of a new role, the analytics engineer.”
The challenging problems of managing compute and storage resources have largely been solved. The biggest challenges today are around speed: How can we help data engineers and analysts collaborate more effectively? How can we empower analysts to move quickly without sacrificing data quality? How can we empower analysts, engineers, and business users to make sense of the data in our warehouse? These aren’t questions about technology; they’re questions about humans and how we can all work better together.
What I’ve seen working with companies of varying sizes, and what I’ve learned from folks in the dbt Community, is that the spectrum from centralized to decentralized is one of the key decisions to make about data team org structure. David used the word “embedded” to describe the more decentralized model; others refer to this model as “distributed.” Either way, we’re all talking about pretty much the same thing. Here is how David defined the two ends of this spectrum:
A quick poll in the dbt Slack channel for this talk showed that a majority of teams use a centralized model as opposed to an embedded model.
In the fully centralized data team model, all data resources – people (data analysts, analytics engineers, data engineers, data scientists, etc.) and technology (data warehouse, transform, ingest, BI tools) – are owned by one central data team. If someone from product or finance has a data-related request, they submit it to the data team for prioritization.
Picture courtesy of Medium
A few benefits of this model…
Alignment of data resources to company need: When you are a small, growing data team, as Snaptravel was, company alignment is particularly important. A small company doesn’t have the bandwidth to do all the things. It’s important to focus data resources on the highest-impact areas of the business.
Knowledge-sharing: By placing analysts and engineers in close alignment, the centralized model prioritizes knowledge-sharing. This makes it easier to build cultural data norms together like naming conventions, syntax, or even how to write and review pull requests.
Mentorship: In the centralized model, analysts get to learn from more senior analysts as well as data engineers. This is incredibly valuable for analysts new to the analytics engineering workflow.
The biggest issue with a centralized model is speed. If marketing needs support adjusting their attribution model, it’s likely going to have to wait until the end of month reporting is wrapped for the finance team.
In the decentralized model, you’ll typically see a central core group of data engineers who own the data warehouse with analysts being decentralized, or embedded, within a business function such as finance or product.
Picture courtesy of Medium
The biggest advantage to the embedded model is speed. Data resources are aligned with department needs (instead of company needs). So if a business user has a request, they don’t need to wait for that request to be prioritized against all of the other needs of the business. Faster time to insights!
Speed also comes from having greater context. In a centralized model, work tends to be assigned in a more “round robin” fashion. In a decentralized model, the marketing analyst owns all marketing requests. They understand that function’s KPIs, know the metric definitions, and are familiar with the quirks of the data. This is often a benefit to both business users (who spend less time explaining themselves) and analysts (who get to go deep in a given function).
However, what we see is that this speed is highly dependent on just how empowered analysts are. If analysts are empowered to own the analytics engineering workflow, this model can work quite well (see examples from JetBlue and HubSpot). If analysts in a decentralized model spend most of their time in the BI tool and rely on data engineers for data transformation and modeling work, then analytics velocity will slow as analysts wait in the data engineering queue.
One of the biggest downsides that we see with the decentralized model is how challenging it can be to keep analysts working closely together and improving their shared knowledge of data analytics. Let’s say your head of finance hires a finance analyst. It’s very possible (likely!) that person will continue to work in the spreadsheets that finance teams are traditionally accustomed to rather than adopt the modern data stack used by your centralized team. This is what David calls “knowledge share”. In a centralized model, the knowledge that is most easily shared is data-related, in a decentralized model it’s domain-related.
Five data team structures in nine months is a lot, but the potential efficiency gains for their team felt important enough to make these efforts worthwhile. “We actually tried out about five different structures, which, if anyone on my team is here, we apologize,” David said to conference attendees. “That’s a lot of work structures. We would not recommend that. There’s a lot of change, and there’s a lot of reasons that organizations should not do that.”
Each of the five structures that Snaptravel tried was a different mix of centralized vs. decentralized. Ultimately they landed where we see more and more companies land – a hybrid version. The question for data teams is no longer “centralized vs. decentralized?” The question is “what, exactly, should be centralized and what should be decentralized?”
Finally, after nine months of constant change, Snaptravel landed on a hybrid setup that they call a “domain-based” team structure. In this structure, a senior member of the team is designated the “domain lead” for a specific business area. They are then responsible for assigning work to other data engineers and analysts on an individual basis to support business priorities.
Picture courtesy of Medium
This filled some critical gaps for them.
While this process currently works for Snaptravel, they recognize that it will evolve as their data team and organization grow. “One of the things that I’ve heard from people is that it won’t scale,” David said. “And to be honest, I don’t know. I’ve never worked at a large data organization. What we know is that this works for 10 people, and we think it could probably work for 20 people. Beyond that, we don’t know.”
I found this talk fascinating. Over the four-plus years I’ve been on the Fishtown Analytics team, I’ve witnessed growing data teams struggle with this firsthand. It cannot be overstated how important it is to reassess team structure with some regularity. What works for a data team of 2 will not work for a team of 20. And what worked in 2015 will not work in 2025.
It’s rare to get such a thorough and deep dive into how teams think about the problem of organizational design and the problems they encountered along the way. During the talk, folks in the Slack channel shared a few other fantastic resources on this topic. I’m adding them here so we can keep learning together:
If you missed David’s talk at Coalesce, you can still watch it here, and I highly recommend doing so.
Coalesce 2021 is taking place from December 6-10! Register for Coalesce here. We hope you can join us.