You can start to see how this step-function improvement in data warehousing might change our data pipeline workflows.
When teams used an OLTP database for analytics, storage was expensive and querying was slow.
This put a premium on transforming data before it hit the database, via the ETL workflow (extract -> transform -> load).
This approach made sense in a world of expensive storage + slow compute, but it limited your flexibility to transform data down the line.
What if your business logic changes, and you need to update data models? Re-runs in this paradigm could be painful.
The cheap storage + fast compute of cloud data warehouses, combined with accessible data extraction platforms (Fivetran, HVR, Stitch, etc.) made a more flexible data transformation workflow possible.
Rather than transforming data prior to load, limiting future flexibility, you could load raw data and transform it after the fact.
Better yet, anyone on an analytics team (not just a data engineer) could now set up extraction jobs using a GUI (at least for commonly-used data sources), rather than writing one-off ingest scripts.
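To make the ELT pattern concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for a cloud warehouse. The table and column names (`raw_orders`, `fct_orders`, `amount_cents`) are hypothetical: the point is that raw, source-conformed data lands untouched, and the business logic lives in a SQL transformation that runs after load and can be re-run whenever definitions change.

```python
import sqlite3

# sqlite3 stands in for a cloud data warehouse here; all table and
# column names are illustrative, not from any real source system.
conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw, source-conformed records as-is.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, status TEXT, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "completed", 1250), (2, "returned", 900), (3, "completed", 430)],
)

# Transform: business logic is applied inside the warehouse, after
# load. If the definition of a "valid order" changes tomorrow, you
# just drop and rebuild this table from the untouched raw data.
conn.execute("""
    CREATE TABLE fct_orders AS
    SELECT id, status, amount_cents / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'completed'
""")

rows = conn.execute(
    "SELECT id, amount_usd FROM fct_orders ORDER BY id"
).fetchall()
print(rows)  # completed orders only, amounts converted to dollars
```

Contrast this with ETL, where the `status` filter and cents-to-dollars conversion would have happened in an ingest script before load, and changing either rule would mean re-extracting from the source.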
All of this made the lives of analytics teams meaningfully easier.
You may notice a theme here.
Each new technological development creates new challenges + opportunities.
MPP databases and the ELT workflow put new pressure on the transformation layer:
Teams had raw, source-conformed data in their analytics warehouse, and could hook it up directly to a BI tool for visualization.
It also led to an explosion in the number of poorly-maintained or low-trust data pipelines in production. “Mo data mo problems,” in the words of one sage data mind.
One can trace the development of the modern data stack in this way: each tooling breakthrough opens the door to another, and also demands that we level up our skills.
What was missing, once cloud warehouses + data ingestion tools were in place, was a sensible way of transforming data across an organization so that everyone involved could make sense of it and trust it.