Table of Contents
- Start here
- Accessing raw data
- Data transformation
- Downstream use cases
- Building a data team
- Joining a data team
The case for the ELT workflow
We had to draw a line in the sand, or we’d never make the switch.
The time had come to move our data transformation pipeline to an ELT (extract -> load -> transform) workflow.
Our old ETL (extract -> transform -> load) pipeline, a set of stored procedures running on a Postgres database (PL/pgSQL to be specific), was grinding our data team’s productivity to a halt.
We were spending our days performing painstaking data archaeology, digging through sedimentary layers of transformation logic, and management was starting to take notice:
Why are updates to data models taking forever? Why is it taking so long to build out new reports?
Making changes to the stored procedure-based pipeline was time consuming and risky, and “don’t touch it and it won’t break” is never where you want to be as a data team.
We kept saying something had to change, until finally we made it happen. What follows is the story of why and how we migrated 80 stored procedures across 20 business units from an ELT data transformation workflow from ETL.
Why would an ELT workflow improve quality of life? #
Our primary bottleneck during development was a lack of separation of concerns: we were extracting raw data and transforming it within the same set of stored procedures.
We extracted raw transactional events from our ERP system, transformed them into the fact and dimension records (orders, customers, payments, etc) that we needed for reporting, and loaded them back in.
But we had created a frankenstein: as business needs evolved, we’d kept tacking on more and more INSERTs, UPDATEs and DELETEs to keep pace. It had become impossible to make sense of how a single record ended up in its final state.
Once these records were created, it was very difficult to roll them back—they were a large set of one-way incremental table builds. Anytime we wanted to make a change or fix an issue, we had to scrub records and rerun the ETL pipeline from scratch, which was time-consuming and introduced major risks to data quality.
We basically had to maintain some boilerplate code at the top and bottom of every stored procedure, that allowed us to manually declare the timestamp bookmark to re-run transformations from—this variable would need to be set for every SP that was re-run, every time. Not fun.
It looked something like:
declare v_delta_date_inserts timestamp; v_delta_date_updates timestamp; v_logger_etl_name varchar(255) := 'edw_order'; v_logger_step_name varchar(255) := 'order'; begin perform reporting.sp_etl_run_start(v_logger_etl_name, v_logger_step_name); select max(delta_date)::timestamp from edw.order where delta_flag = 1 into v_delta_date_inserts; select max(delta_date)::timestamp from edw.order where delta_flag = 2 into v_delta_date_updates;
In an ELT workflow, we could separate the data extraction and transformation steps—source data would make its way into the warehouse (“EL”), and then we could transform it to our heart’s content (the “T”).
This would give our team of analytics & data engineers much more flexibility when building data transformation logic. We’d always have the raw source data available, and could iterate on how we transformed it after the fact.
Consolidating data transformations into a single workflow
As an aside, I neglected to mention that we were triggering data transformation runs in a few different ways:
Nightly cron jobs executing stored procedures
Direct Python transformations
Python running custom SQL and/or stored procedures
Stored procedures running directly from Looker
Individuals running stored procedures locally that wrote to production tables
As you can imagine, the complexity of maintaining these disparate deployments was enormous, and there was no single place to debug our production transformation workflows.
Consolidating our transformations into a single process (dbt running on BigQuery) allowed us to cut a huge amount of maintenance overhead.
Making the case for ELT #
If you’ve worked in an ELT workflow, you know inherently why it improves your life as a data person.
But how do you make the case for migrating from ETL to ELT? My boss’s first impression of ELT wasn’t all positive:
Isn’t it just a marketing buzzword? What does flipping the L and the T around really do for us?
Let’s come back to the term data archaeology. A lot of what we do as data teams is debugging potential data quality issues, or investigating the impact of upcoming transformation code changes.
These are archaeological tasks—we comb through the layers that data travels from source to end use, and check for anything off along the way.
This work will (likely) not go away for a while—so will we be productive at it, using proper tooling, or bad at it?
Being a skilled data archaeologist requires, as an individual developer, being able to quickly and efficiently gain understanding by performing key investigative tasks:
Recreate historical transformations: Can we easily rebuild this modeled data from source data?
Trace data model dependencies: Can we inspect what lower-level data models (stored procedures in our ETL case) rolled up to a given fact or dimension model?
QA test analytics code changes: Can we ensure that changes being pushed to transformation code won’t introduce data quality issues?
Let’s dig into each of these tasks, and why in our case an ELT workflow made them significantly easier than ETL.
Recreate historical transformations
Transformation code would change over time, and since records were transformed incrementally, the meaning of records would be time-dependent.
Before February 2019, a field would be defined one way, and after a different way—it literally required archaeological layering of definitions.
With an ELT workflow, we could implement proper change management for transformation code: when we deploy an update to a field definition, the change can apply across all historical data, rather than only newly-created records.
This means that data transformations would be idempotent—if we recreate the data warehouse with the click of a button or the run of a command, and source data or transformation logic doesn’t change, we get the same results.
Some would argue that rebuilding an entire data warehouse from scratch is computationally wasteful, and in some cases that may be true!
But without the ability to quickly run a full refresh, we found we were trading off compute hours for more-precious developer hours, building + operating workarounds to allow our stored procedure workflow to do something approximating a full rebuild.
For us, it made more sense to have the option to spend compute, vs the guarantee of wasting our own time.
Trace data model dependencies
As a developer, it took a ton of investigation to uncover how a given record came to be.
We had to trace that flow point-by-point through multiple stored procedures for the record to see how it got where it ended up.
We were not able to easily recreate how the record got to its final state—there was no centralized, documented dependency graph for us to trace directly. We had to do it live, by hand.
Now, this isn’t strictly a problem with ETL vs ELT: this is a problem with using a series of layered stored procedures to perform data transformations.
When migrating to ELT, we could implement a tool like dbt as the transformation layer, which would infer our data transformation DAG and make it searchable via automatically-generated documentation.
QA test analytics code changes
We really did not want to introduce any errors, since we’d have such a hard time tracking them down—but we didn’t have any sandbox to easily test changes to the stored procedure pipeline.
My boss spent a week scaffolding a whole system that allowed us to test our changes in PL/pgSQL stored proc code: building in a test environment, confirming it in QA, and then pushing it to production.
But even then, I didn’t feel confident about pushing changes, because I didn’t have an automated way to run those tests, and it wasn’t in my wheelhouse to spin up the environment to run them locally. There were just so many unknowns, and it was very time consuming to iron out differences that popped up during testing.
The ELT workflow on its own, of course, wouldn’t completely prevent us from introducing issues in data as a result of changes—but dbt certainly would give our developers the tools (version control, easy creation of dev environments, data + schema tests) to be much more confident that we weren’t breaking things.
Why is changing data transformation workflows so hard? #
Even after we bought the case for ELT, it took a significant amount of time (months, even years) to make the switch. Why?
Data has gravity. It accumulates mass and then it becomes immovable. That’s what our situation had become: our stored procedure-based data transformation process was untouchable and unmoving.
That’s largely because, as a small data team at a growing medium-size business, we focused on new problems—there wasn’t time to circle back to old problems and repay our tech debt.
What we needed was a catalyst for change—a compelling business reason to start the move to ELT and get gravity flowing in the right direction.
Without a catalyzing project, it’d be tough to make the case to management that we should divert resources away from net-new work to refactor our old transformation pipeline.
Once we had a catalyst though, we could commit to only building new transformation logic in the ELT workflow with dbt.
And at some point, because data has gravity and our work accumulates, we’ll eventually be in a situation where most of our transformation code has migrated off of stored procedures.
In that sense, making the switch to ELT is no different from any technical infrastructure change—at some point you know you’ll have to eat your spinach, it’s just a matter of timing the change (in terms of available resources + institutional alignment) so it has the highest chance of succeeding.
But until that catalyst appeared, we’d have to be patient, cope with our stored procedure-based ETL workflow, and keep repeating “we need to do something” amongst ourselves.
How we transitioned to the ELT workflow #
When the time came to pull the trigger, we knew the migration plan to ELT would roughly look as follows:
Dependency mapping: List out the dependencies embedded in our stored procedures, to be able to implement them in dbt following modular data modeling principles.
Prioritizing migration order: Which transformation models could be migrated first and serve as quick wins? Which are more complex and a better fit for when we’ve picked up some momentum?
Build out models: Get into the weeds with dbt, and start migrating over the data transformation logic from stored procedure-based SQL to dbt SQL.
Testing and vetting: QA the new dbt-based output to the stored procedure output, and identify any differences that require archaeological investigation. We planned to use a query process similar to the audit_helper dbt package for this comparison.
Get signoff from stakeholders + migrate: To the end user, we’d want the migration to be as seamless and invisible as possible—but we knew intuitively that, given historical changes in our stored procedures, there would be gaps vs historical data that we’d need to validate with our finance & accounting teams. Once we agreed on the models, we’d also need to update reporting to reference the new tables.
Repeat with other datasets! Given the large number of transformation models we were migrating, we anticipated we’d be repeating steps 3-5 for a number of months.
Once we shipped an initial win, we knew we could then draw a real line in the sand, and state that any new development would happen in the new workflow vs the old.
Got to get that gravity flowing into ELT!
The migration timeline
At the outset, we didn’t know exactly what type of timeline to budget for, but in the end these 6 steps ended up looking like:
- Dependency mapping + prioritization: 2 weeks
- Build out models: 6 weeks
- Testing and QA: 2 weeks
- Signoff from stakeholders: 2 months
- Repeat with other datasets: 6 months
The catalyst project emerges
We knew the time would come when a new transformation project would be easier to rebuild from scratch in ELT than in ETL.
That project turned out to be an old retail classic: “buy one get one free” (BOGO).
Our job was to facilitate reporting on the upcoming rollout of BOGO (coming up in a few months), and when we dug into what it’d take to build it in stored procedures, the outlook was grim.
We could either spend a couple months refactoring a bunch of old Postgres stored procedure code, or a couple weeks building it in ELT (using dbt + BigQuery) to get the same result.
BOGO was a major initiative for the company, and anything that’d cut over a month off of the rollout would be widely welcomed. Time to move!
Drawing a dependency graph by hand
We decided to make a run for it, but the first step was to crawl…manually through the stored procedure SQL code.
I sat down with a pen and paper at a heavy, old wooden desk, sifted through every SQL file in our repo of stored procedures, and mapped the dependencies between them.
Eventually dbt would generate this DAG diagram automatically for us, but first I’d need to draw it by hand.
I wrote down anytime something was a source input for another table or for another stored procedure, and mapped out how data flows in the system.
It was a painful 2 weeks, since stored procedures are notoriously long and difficult to read (at least they were for me), with lots of boilerplate INSERT / UPDATE statements and aliases.
But at the end, I could visualize left-to-right (from source to fact or dimension) how data was currently being transformed.
The format of this dependency mapping wasn’t super important - we evolved rough sketches in a notebook, to a text file, to a spreadsheet and to DOT graph notation. The key thing was just to communicate dependencies, not to fuss over the tooling we used to map out the flows (at which spreadsheets and DOT graph notation were equally as effective).
I could clearly see where logic would take the most time to untangle, but also what nodes were relatively standalone that might make for quick wins when we started writing transformation code in dbt.
Standing up our new transformation logic #
When we actually started coding and vetting data models in dbt, we knew that results wouldn’t match up exactly with our old warehouse.
Our stored procedure ETL process was incrementally snapshotting records from our ERP system, and that snapshot logic had changed numerous times over the years, so there was just no chance that running a full refresh of the current logic would net the same results.
Most differences were small: within a tenth of a percent on data more than a few years old.
The question was: what matters to a business? If a metric is 1% different, will it make a meaningful difference? Sometimes yes, sometimes no—the tolerance changes depending on the context.
Is the metric reported to the board? Does the marketing team change their strategy/behavior? Do we hire more people or invest more/less money? We had to ask these questions to determine the error tolerance.
So when we did see differences of a percent or two on financial metrics, it was time for some deep archaeology.
Literally digging in row-by-row to transactions data, determining whether stuff like a return amount was more correct on the new version in dbt vs the old stored procedure.
Once we spent a couple weeks getting all the way down to the metal, we were excited to share with our stakeholders on the finance team that we believed the new models to be “correct.”
So we had a new version of our data transformation pipeline, that was better, faster and stronger. We assumed that everyone at the company would (of course!) be eager to use it.
Selling the change
Change management is hard.
Even though we created the tables that were the same, there are still hundreds of Looker explores and queries, and all of the company dashboards that are pointing to the old versions and the old tables, and all those need to be updated.
Crucially, we had proved that the new version would work, so we could commit to only building new transformation logic (as for BOGO) in the ELT workflow with dbt + BigQuery. We knew that eventually, enough mass would accrue to the ELT version that the change would make itself.
But until finance fully signed off on the comparisons between the new and old, we were kind of stuck living in two worlds where we have both of these things running in parallel, at the same time, to create the same data. Maintaining this parody was the bane of our existence for a while.
It was a couple months of gathering signoff: the new transformations living in production, people using them, finding pieces that need to be reworked, and us refactoring things.
And after that, over the next six months we gradually transferred over all the Looker stuff and stripped out functionality from stored procedures that we’d neglected on the first pass.
And the development cycle really never ends! After a while we found ourselves at the beginning again, refactoring the transformation models that we’d already ported over into ELT with dbt.
A note on timelines: If we were to do it over again, I’d recommend committing as an organization to a specific timeline for migration, including a known parallel development period.
Since I led the migration as basically a skunkworks project, we weren’t armed up-front with the ability to pitch a timeline to finance & other stakeholders, but I sure wish we had.
Checking for shadow IT
The data coming from this pipeline (finance / orders reporting) was very senior in terms of priority to the organization—lots and lots of people used it downstream, in ways we as a data team weren’t always fully aware of.
People ran “shadow ETL” alongside the official pipeline, and built semi-automatic reporting in Excel that they were reluctant to move off of.
We often had no idea that these existed until we turned off the stored procedures, and heard from people that custom dashboards outside of our BI tool had broken.
If we could’ve repeated the process, maybe we’d test doing a shutdown of the stored procedure pipeline for a couple days in advance of the actual shutoff, just to see what shadow workflows popped out of the woodwork.
Quality of life gains #
I think we’d made the decision to move to an ELT workflow on dbt + BigQuery sometime in 2018, and it took until 2020 to actually see it live in production.
Actual pen-to-paper development time was a tiny fraction of that period—most of it was patiently waiting for conditions to be right to make the change. My hunch is that’s a relatively consistent experience across any significant change to technical infrastructure.
But it definitely was worth it! The productivity gains from moving from a brittle stored procedure-based ETL workflow to an ELT workflow on dbt were significant: a feature that’d take us 3 weeks to build the old way now took us 1 week.
Because the code migrated was paramount to the business, once we did have a working model in the new warehouse using dbt, it quickly became the backbone to build new models off of—stuff that we’d always wanted to build, but never had the time or resources.
And our developers now had a sandbox to play in, a place to try things and make mistakes, and a way to know that when they shipped transformation code, it wouldn’t break anything in the pipeline. As a data person, that’s all you really need to get by—freedom from the fear of breaking stuff.