Now that we had fast compute + cheap storage with cloud analytics warehouses, what data would we analyze?
We needed new tooling to reliably ingest data and integrate common data sources (yesteryear’s versions of Shopify, Snowplow, or Stripe).
In an OLTP world, the demand for these data integration tools was relatively small, but it blossomed along with access to OLAP data warehouses.
Prior to these data extractors entering the market, ingestion scripts were written by engineers (I’m not sure they were even called data engineers at this point).
Sidebar: Highly recommend reading Jeff Magnuson’s seminal blog post, Engineers Shouldn’t Write ETL, if you’re curious why this was a suboptimal setup.
TL;DR: maintaining a data ingestion pipeline is hard work.
You have to write a ton of transformation logic to keep schemas consistent, and APIs change all the time.
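To make this concrete, the schema-consistency logic buried in those hand-rolled ingestion scripts might look something like the sketch below. This is a hedged illustration, not any real tool's code: `TARGET_SCHEMA` and `normalize_record` are hypothetical names, standing in for the glue engineers wrote to tolerate upstream API drift.

```python
# Hypothetical sketch of hand-rolled ingestion-script schema handling.
# TARGET_SCHEMA and normalize_record are illustrative names, not a real API.

TARGET_SCHEMA = {"order_id": str, "amount_cents": int, "currency": str}

def normalize_record(raw: dict) -> dict:
    """Coerce an API payload to the warehouse schema, tolerating drift:
    missing fields become None, unknown fields are dropped, types coerced."""
    out = {}
    for field, typ in TARGET_SCHEMA.items():
        value = raw.get(field)
        if value is None:
            out[field] = None
        else:
            try:
                out[field] = typ(value)
            except (TypeError, ValueError):
                out[field] = None  # a real pipeline would quarantine this row
    return out

# When the upstream API adds or renames fields, records still load:
print(normalize_record({"order_id": 123, "amount_cents": "499", "coupon": "X"}))
# → {'order_id': '123', 'amount_cents': 499, 'currency': None}
```

Multiply this by every source and every API version change, and the maintenance burden becomes obvious.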
Companies like Fivetran, RJMetrics (predecessor to Stitch) and others emerged with GUI-based tools to do this hard work for us.
And since then, open source frameworks like Singer or Airbyte have entered the field. It has truly never been a better time to be a data engineer.
Companies that adopted these data loaders now have consistent access to fresh raw data in their cloud data warehouses.
The question then became: what do we do with it?