When to hire a data engineer?
Originally published on 2019-01-02
The role of the data engineer in a startup data team is changing rapidly. Are you thinking about it the right way?
I find myself regularly having conversations with analytics leaders who are structuring the role of their team’s data engineers according to an outdated mental model. This mistake can significantly hinder your entire data team, and I’d like to see more companies avoid that outcome.
This post represents my beliefs about when, how, and why you should hire data engineers as a part of your team. It’s based on my experience at Fishtown Analytics working with over 100 VC-backed startups to build their data teams, and on conversations with hundreds of companies in the wider data community.
The Role of the Data Engineer is Changing.
Software is increasingly automating the boring parts of data engineering.
In 2012, if you wanted to have a sophisticated analytics practice at your VC-backed startup, you needed one or more data engineers. These engineers were responsible for extracting data from your operational systems and piping it somewhere that analysts and business users could get at it. Often they would do some transformation work to make the data easier to analyze. Without the data engineers, analysts and scientists didn’t have any data to work with, so frequently engineers were the very first members of a new data team.
Coming into 2019, you can buy technologies off-the-shelf to do most of that work. In most scenarios, you and your data analysts and scientists could build the entire pipeline without the need for anyone with hardcore data eng experience. And you wouldn’t be building some second-rate, shitty pipeline: off-the-shelf tools are actually the best-in-class way to solve these problems today.
This ability for data analysts and scientists to build self-service pipelines is new: about 2–3 years old at this point. Three specific products are driving it: Stitch, Fivetran, and dbt (full disclosure: my company, Fishtown Analytics, maintains dbt). These products were initially launched in the wake of the release of Amazon Redshift, when startup data teams discovered a tremendous latent hunger to build data warehouses. It took several years for the products to get good, though; back in 2016 we were still in early-adopter land.
At this point a pipeline built on top of Stitch / Fivetran / dbt is far more reliable than one built on top of custom-built Airflow tasks. This is an empirical statement, not a theoretical one: I’m not saying it’s not possible to build a reliable Airflow infrastructure, I’m just saying that most startups don’t. At Fishtown Analytics, we’ve worked with 100+ VC-backed data teams and have seen this play out over and over again. We’re consistently migrating people from custom-built pipelines onto off-the-shelf infrastructure and in literally every single case the impact has been tremendously positive.
Engineers Shouldn’t Write ETL.
In 2016, Jeff Magnusson wrote a foundational blog post called Engineers Shouldn’t Write ETL. It was the first post I’m aware of where someone called out this change. Here’s my favorite part:
Data processing tools and technologies have evolved massively over the last five years. Unless you need to process over many petabytes of data, or you’re ingesting hundreds of billions of events a day, most technologies have evolved to a point where they can trivially scale to your needs. Unless you need to push the boundaries of what these technologies are capable of, you probably don’t need a highly specialized team of dedicated engineers to build solutions on top of them. If you manage to hire them, they will be bored. If they are bored, they will leave you for Google, Facebook, LinkedIn, Twitter, … — places where their expertise is actually needed. If they are not bored, chances are they are pretty mediocre. Mediocre engineers really excel at building enormously over complicated, awful-to-work-with messes they call “solutions”.
I love this section so much because it not only highlights why you don’t need data engineers to solve most ETL problems today, it also states why you’re better off not asking them to solve these problems at all.
If you hire a data engineer and ask them to build pipelines, they will think their job is to build pipelines. This will mean that tools like Stitch and Fivetran and dbt will seem like threats to their existence instead of tremendous force multipliers. They’ll find reasons why off-the-shelf pipelines won’t actually suit your very custom data needs, and reasons why analysts shouldn’t actually be building their own data transformations. They’ll write code that is fragile, hard to maintain, and non-performant. And you’ll come to rely on this code because it’s underneath everything else your team does.
Avoid this situation like the plague. The pace of innovation on your data team will plummet and you’ll spend all of your time thinking about infrastructure issues that aren’t actually revenue-generating for the business.
If Not ETL, Then…What?
So, do you still need data engineers on your startup data team? You do.
Even with the availability of new tools that empower data analysts and scientists to build self-service pipelines, data engineers are still a critical part of any high-functioning data team. However, the tasks they should focus on have changed, as has the sequencing in which you hire them. I’ll discuss the “when” question in a later section; for now, let’s talk about what data engineers are responsible for on modern startup data teams.
Instead of building ingestion pipelines that are available off-the-shelf and implementing SQL-based data transformations, here’s what your data engineers should be focused on:
- managing and optimizing core data infrastructure,
- building and maintaining custom ingestion pipelines,
- supporting data team resources with design and performance optimization, and
- building non-SQL transformation pipelines.
Managing and Optimizing Core Data Infrastructure
While data engineers no longer need to manage Hadoop clusters or scale hardware for Vertica at VC-backed startups, there is still real engineering to do in this area. Making sure that your data technology is operating at its peak results in massive improvements to performance, cost, or both. That typically involves:
- building monitoring infrastructure to give visibility into the pipeline’s status,
- monitoring all jobs for impact on cluster performance,
- running maintenance routines regularly,
- tuning table schemas (e.g. partitions, compression, distribution) to minimize costs and maximize performance, and
- developing custom data infrastructure not available off-the-shelf.
These types of efforts are often overlooked at earlier stages of a data team’s maturity, but become incredibly important as that team and the dataset grow. In one project we were able to cut BigQuery costs for building a table incrementally from $500/day to $1/day by optimizing table partitions. This stuff is important.
One company that has gone far down this path is Uber. Data engineers at Uber built a tool called Queryparser that automatically monitors all queries run against their data infrastructure and gathers statistics about the resources utilized and utilization patterns. Uber data engineers can use this metadata to tune infrastructure accordingly.
Data engineers are also often responsible for building and maintaining the CI/CD pipeline that runs the data infrastructure. While many data teams had extremely poor VCS, environment management, and testing infrastructure in 2012, that’s changing, and it’s data engineers leading this charge.
Finally, data engineers at leading companies are often also involved in building tooling that doesn’t exist off-the-shelf. For instance, data engineers at Airbnb built Airflow because they didn’t have a way to effectively build and schedule DAGs. And data engineers at Netflix are responsible for building and maintaining a sophisticated infrastructure for developing and running tens of thousands of Jupyter notebooks.
You can get most of your core infrastructure off-the-shelf today, but someone still needs to monitor it and make sure it’s performing. And if you’re truly a cutting-edge data organization, you’ll likely want to push the boundaries on existing tooling. Data engineers can help with both.
Build and Maintain Ingestion Pipelines
While data engineers no longer need to hand-roll Postgres or Salesforce data transport, there are “only” about 100 integrations available off-the-shelf from the modern data integration vendors. Most of the companies we work with have off-the-shelf coverage of between 75 and 90% of the data sources they work with.
In practice, integrations are implemented in waves. Typically, the first phase includes core application database and event tracking, with the second phase including marketing systems like an ESP and advertising platforms. These first two phases are available completely off the shelf today. Once you go deeper into your more domain-specific SaaS vendors, you’ll need data engineers to build and maintain these more niche data ingestion pipelines.
For example, ecommerce companies end up dealing with a ton of different products in the ERP / logistics / shipping domain. Many of these products are very specific to particular verticals, and almost none of them are available off the shelf. Expect your data engineers to build these for the foreseeable future.
Building and maintaining reliable ingestion pipelines is hard. If you decide to expend the resources to build one out, expect it to take longer than you initially budgeted for, and expect it to require more maintenance than you’d like. Getting to V1 is easy, but getting a pipeline to consistently deliver data to your warehouse is hard. Don’t make the commitment to supporting a custom data ingestion pipeline until you’re sure the business case is there. Once you do, invest the time and build it to be robust. Consider using Stitch’s open source Singer framework — we’ve built ~20 custom integrations using it.
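To give a feel for what a Singer-style integration involves, here is a minimal sketch of the message stream a tap produces: one SCHEMA message, one RECORD message per row, and a STATE message for resuming later. It uses only the standard library rather than any Singer helper package. The `shipments` stream, its fields, and the `last_id` bookmark layout are hypothetical, chosen purely for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Hypothetical schema for an illustrative "shipments" stream.
SHIPMENT_SCHEMA: Dict = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "status": {"type": "string"},
    },
}


def singer_messages(
    stream: str,
    schema: Dict,
    key_properties: List[str],
    rows: Iterable[Dict],
) -> Iterator[Dict]:
    """Yield Singer-format messages for one stream, in emission order."""
    # Describe the stream before sending any of its records.
    yield {
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    }
    last_key = None
    for row in rows:
        yield {"type": "RECORD", "stream": stream, "record": row}
        last_key = row[key_properties[0]]
    # The STATE payload is free-form; this bookmark layout is an assumption.
    yield {"type": "STATE", "value": {"bookmarks": {stream: {"last_id": last_key}}}}


if __name__ == "__main__":
    # Stand-in rows; a real tap would page through a vendor API here.
    sample_rows = [
        {"id": 1, "status": "shipped"},
        {"id": 2, "status": "pending"},
    ]
    for message in singer_messages("shipments", SHIPMENT_SCHEMA, ["id"], sample_rows):
        # Singer messages are written to stdout, one JSON object per line.
        print(json.dumps(message))
```

The message format is the easy part. The maintenance burden described above comes from everything around it: pagination, rate limits, retries, and schema changes on the vendor side.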
Supporting Data Team Resources with Design and Performance Optimization for SQL Transformations
One of the shifts we’ve seen in data engineering in the past five years is the rise of ELT: the new flavor of ETL that transforms the data after it’s been loaded into the warehouse instead of before. The what and the why of this change are well-covered elsewhere; the reason I mention it here is that this shift has a tremendous impact on who builds these pipelines.
If you’re writing Scalding code to scan terabytes of event data in S3 and aggregating it to a session level so that it can be loaded into Vertica, you’re probably going to need a data engineer to write that job. But if your events data is already in BigQuery (loaded by Google Analytics 360), then it’s already fully addressable in a performant, scalable environment. The difference is that this environment speaks SQL. This means that data analysts can now build their own data transformation pipelines.
This trend started in earnest with Looker’s PDT feature release in 2014. It ramped up aggressively with entire data teams building DAGs of 500+ nodes and processing many-TB datasets using dbt over the past two years. At this point, the pattern is deeply entrenched in modern data teams, and it has enabled analysts to self-serve in a way they never could before.
This shift to ELT means that data engineers don’t have to build most data transformation jobs. It also means that data teams without any data engineers can still get a long way with data transformation tools built for analysts. Data engineers still have a meaningful role to play in building these transformation pipelines, however. There are two key areas where data engineers should get involved:
- When performance is critical. Sometimes business logic requires some particularly heavyweight transformation, and it’s helpful to have a data engineer involved to assess the performance implications of a particular approach to building a table. Many analysts aren’t deeply experienced with performance optimization within MPP analytic databases and this is a great opportunity for collaboration with someone more technical.
- When code gets complicated. Analysts are great at answering business questions using data, but frequently aren’t trained to think about how to write extensible code. It’s very easy to start building tables in your warehouse and have the entire project get out of hand quickly. Get a data engineer involved to think through the overall architecture of your warehouse and to do design reviews on particularly pernicious transformations, or you’ll find yourself with a spaghetti bowl to clean up.
Build Non-SQL Transformation Pipelines
While SQL can natively accomplish most data transformation needs, it can’t handle everything. One common need is to do geo enrichment by taking a lat/long and assigning a particular region. At the moment, this is not widely supported on modern MPP analytic databases (although this is starting to change!), so the best answer is often to write a Python-based pipeline that augments the data in your warehouse with region information.
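As a rough illustration of that kind of Python-based enrichment, the sketch below assigns a region to each row using a plain ray-casting point-in-polygon test. The region names, the rectangular boundaries, and the `lat` / `lon` column names are all made up for the example; a real pipeline would read rows from the warehouse, use proper boundary files, and most likely lean on a geospatial library.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)

# Hypothetical regions, each a simple polygon given as (lon, lat) vertices.
REGIONS: Dict[str, List[Point]] = {
    "west": [(-125.0, 30.0), (-100.0, 30.0), (-100.0, 50.0), (-125.0, 50.0)],
    "east": [(-100.0, 30.0), (-65.0, 30.0), (-65.0, 50.0), (-100.0, 50.0)],
}


def point_in_polygon(lon: float, lat: float, polygon: Sequence[Point]) -> bool:
    """Ray casting: count how many edges a ray heading east from the point crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def assign_region(lat: float, lon: float) -> Optional[str]:
    """Return the first region containing the point, or None if there is none."""
    for name, polygon in REGIONS.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


def enrich(rows: List[Dict]) -> List[Dict]:
    """Return copies of the rows with a 'region' column added."""
    return [{**row, "region": assign_region(row["lat"], row["lon"])} for row in rows]
```

The enriched rows would then be written back to a warehouse table so that downstream SQL models can join on the region.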
The other obvious use case for Python (or other non-SQL languages) is for algorithm training. If you have a product recommender, demand forecast model, or churn prediction algorithm that takes data from your warehouse and outputs a series of weights, you’ll want to run that as a node at the end of your SQL-based DAG.
Most companies that are running either of these types of non-SQL workloads today are using Airflow to orchestrate the entire DAG. dbt is used for the SQL-based portion of the DAG and then non-SQL nodes are added on at the end. This approach gives a best-of-both-worlds outcome where data analysts can still be primarily responsible for the SQL-based transformations while data engineers can be responsible for production-grade ML code.
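The shape of such a DAG can be sketched without any orchestrator at all. The example below is not Airflow or dbt code; it uses Python's standard-library `graphlib` (Python 3.9+) as a stand-in to show SQL models running first and a non-SQL node hanging off the end. The node names and the owner mapping are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical DAG: each node maps to the set of nodes it depends on.
DAG: Dict[str, Set[str]] = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "train_churn_model": {"fct_orders"},
}

# Which kind of workload each node is (an assumption for this sketch).
NODE_KIND: Dict[str, str] = {
    "stg_orders": "sql",
    "stg_customers": "sql",
    "fct_orders": "sql",
    "train_churn_model": "python",
}


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order in which every node follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def owner_of(node: str) -> str:
    """SQL nodes belong to analysts; non-SQL nodes belong to data engineers."""
    return "data analyst" if NODE_KIND[node] == "sql" else "data engineer"


if __name__ == "__main__":
    for node in execution_order(DAG):
        print(f"{node}: {NODE_KIND[node]} node, owned by a {owner_of(node)}")
```

In a real deployment the orchestrator plays the role of `execution_order`, and the split in `owner_of` is the division of labor described above.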
When Does My Team Need a Data Engineer?
This change in role also informs a rethinking of the sequencing of data engineer hires. The previous accepted wisdom was that you needed data engineers first, because data analysts and scientists had nothing to work with if there wasn’t a data platform in place. Today, data analysts and scientists should self-serve and build the first version of their data stack using off-the-shelf tools. Hire data engineers as you start hitting scale points:
- Scale point #1: consider hiring your first data engineer when you have 3 data analysts / scientists on your team.
- Scale point #2: consider hiring your first data engineer when you have 50 active users of your BI platform.
- Scale point #3: consider hiring your first data engineer when the biggest table in your warehouse hits 1 billion rows.
- Scale point #4: consider hiring your first data engineer when you know you’ll need to build 3 or more custom data ingestion pipelines over the next few quarters and they’re all mission-critical.
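If it helps to make the checklist concrete, the four scale points can be written down as a small function. This is only a sketch of the heuristics above, not a formula: the `TeamSnapshot` class and its field names are invented for the example, and the thresholds are the rough numbers from the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamSnapshot:
    """Hypothetical summary of where a data team stands today."""
    analysts_and_scientists: int
    active_bi_users: int
    largest_table_rows: int
    planned_critical_custom_pipelines: int


def scale_points_hit(team: TeamSnapshot) -> List[str]:
    """Return the scale points from the list above that this team has reached."""
    hits = []
    if team.analysts_and_scientists >= 3:
        hits.append("#1: 3 or more data analysts / scientists")
    if team.active_bi_users >= 50:
        hits.append("#2: 50 or more active BI users")
    if team.largest_table_rows >= 1_000_000_000:
        hits.append("#3: largest table has 1 billion or more rows")
    if team.planned_critical_custom_pipelines >= 3:
        hits.append("#4: 3 or more mission-critical custom pipelines planned")
    return hits


if __name__ == "__main__":
    team = TeamSnapshot(
        analysts_and_scientists=2,
        active_bi_users=65,
        largest_table_rows=400_000_000,
        planned_critical_custom_pipelines=1,
    )
    # Any non-empty result is a prompt to consider the hire, not a verdict.
    print(scale_points_hit(team))
```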
If you haven’t hit any of these points, your data analysts and scientists should probably be able to self-serve using off-the-shelf technology, support from outside consultants, and advice from data communities that you’re a part of (like the Locally Optimistic and dbt Slacks!).
The key thing to realize is that data engineers don’t provide direct business value; their value comes from making your data analysts and scientists more productive. Your data analysts and scientists are the ones working with stakeholders, measuring KPIs, and building reports and models, and they’re the ones helping your business make better decisions every day. Hire data engineers to act as a multiplier to the broader team: if adding a data engineer will make your four data analysts 33% more effective, that’s roughly 1.3 analysts’ worth of additional output from a single hire, which is probably a good decision.
Data engineers deliver business value by making your data analysts and scientists more productive.
As you scale your data team, I’ve generally seen that the ratio that works best is around 5 data analysts / scientists to 1 data engineer. If you are working with particularly large or unusual datasets maybe that ratio changes, but it’s a good benchmark.
Whom Should You Hire?
As the role of the data engineer changes, so too does the profile of the ideal candidate. My esteemed colleague Michael Kaminsky put it better than I ever could in an email we exchanged on this topic, so I’ll quote him here:
The way I think about this shift is a change in data engineering’s role on the team. It’s gone from a builder-of-infrastructure to a supporting-the-broader-data-team role. That’s actually a pretty huge shift, and one that some data engineers (who want to focus on building infrastructure) aren’t always excited about. I actually think this is important for startups to appreciate: they need to hire a data engineer who is excited about building tools for the analytics / DS team. If you hire a data engineer who just wants to muck around in the backend and hates working with less-technical folks, you’re going to have a bad time. I look for data engineers who are excited to partner with analysts and data scientists and have the eye to say “what you’re doing seems really inefficient, and I want to build something to make it better”.
I could not agree more with this sentiment. The best data engineers at startups today are support players that are involved in almost everything the data team does. They should be excited about that collaborative role and motivated to make the entire team successful.
If you’ve made it all the way here, thanks for reading :) This is obviously a topic that I care a lot about.
Please do let me know on dbt Slack or in the comments if you think I’m totally off. I’d love to hear about your experiences structuring the data engineer role within your data team.