Data movement patterns explained (ETL, ELT, CDC & more)

Joey Gault

last updated on Mar 10, 2026

The evolution from ETL to ELT

The most fundamental shift in data movement patterns has been the transition from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). This change reflects broader architectural trends driven by cloud data warehouses and their separation of compute from storage.

ETL: the legacy approach

ETL emerged when data storage and compute were expensive and tightly coupled. In this pattern, data is extracted from source systems, transformed in a separate processing layer, and then loaded into the target warehouse. The transformation happens before loading because warehouses were too slow and constrained to handle heavyweight processing themselves.
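A minimal sketch of that flow, using hypothetical in-memory stand-ins for the source system, transform layer, and warehouse:

```python
# ETL: the transform happens in a separate processing layer *before* loading,
# so only curated rows ever reach the warehouse.

def extract(source_rows):
    """Pull raw rows from the source system (stubbed as a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Clean and reshape outside the warehouse: drop bad rows, normalize fields."""
    out = []
    for r in rows:
        if r.get("amount") is None:
            continue  # discard incomplete records before they reach storage
        out.append({"customer": r["customer"].strip().lower(),
                    "amount_usd": round(r["amount"], 2)})
    return out

def load(warehouse, table, rows):
    """Append the already-transformed rows into the target table."""
    warehouse.setdefault(table, []).extend(rows)

warehouse = {}
raw = [{"customer": " Acme ", "amount": 10.567},
       {"customer": "Globex", "amount": None}]
load(warehouse, "orders", transform(extract(raw)))
print(warehouse["orders"])  # only the cleaned Acme row lands in the warehouse
```

Note that the raw Globex row is gone for good: if a new use case later needs it, the pipeline has to be changed and re-run against the source.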

This approach minimizes storage costs by ensuring only transformed data lands in the warehouse. However, it introduces significant limitations. Transformations are scattered across different systems, leading to repeated work as teams implement ad hoc queries for each new project. The pattern is inflexible and difficult to scale, particularly as data volumes grow and use cases multiply.

ELT: the modern standard

ELT reverses this order. Raw data is extracted from source systems and loaded directly into the warehouse, where transformations happen using the warehouse's native compute power. This approach leverages the scalability and flexibility of cloud storage and computing, making it far easier to handle large volumes of data and growing numbers of use cases.

The advantages are substantial. ELT enables a more organized data architecture with transformations performed directly in the warehouse, creating a streamlined and efficient process. Tools like dbt have emerged specifically to support this pattern, bringing software engineering best practices (version control, testing, documentation, and modularity) to the transformation layer.
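The same flow under ELT, sketched with the same hypothetical stand-ins: raw data is loaded first, and the transformation runs against warehouse storage (in dbt, that step would be a SQL model rather than Python):

```python
# ELT: load raw data as-is, then transform *inside* the warehouse.
# The raw table is preserved, so new transformations can be added later
# without re-extracting from the source.

warehouse = {}

def load_raw(table, rows):
    warehouse.setdefault(table, []).extend(rows)

def transform_in_warehouse(source_table, target_table):
    """Stand-in for a warehouse-native transformation (e.g., a dbt model)."""
    cleaned = [{"customer": r["customer"].strip().lower(),
                "amount_usd": round(r["amount"], 2)}
               for r in warehouse[source_table] if r.get("amount") is not None]
    warehouse[target_table] = cleaned

load_raw("raw_orders", [{"customer": " Acme ", "amount": 10.567},
                        {"customer": "Globex", "amount": None}])
transform_in_warehouse("raw_orders", "orders")

# Raw data stays available alongside the curated model.
print(len(warehouse["raw_orders"]), len(warehouse["orders"]))  # 2 1
```

Because the raw table survives loading, adding a second curated model later is just another `transform_in_warehouse` call, with no change to extraction.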

However, ELT introduces its own challenges. As organizations scale, the number of dashboards, tables, sources, and data products increases, leading to complex warehouse environments. Without proper tooling to manage this complexity, issues like data inconsistency, lack of documentation, and version control problems can undermine trust and usability.

Batch processing: the workhorse pattern

Batch processing remains the dominant pattern for data movement in most organizations. Data is extracted and loaded on a schedule (hourly, daily, or at some other interval) and transformations run in discrete jobs.
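A scheduled batch extract is typically driven by a time window derived from the schedule. A minimal sketch (the `updated_at` field and sample rows are illustrative; a real job would push the filter into a SQL WHERE clause):

```python
from datetime import datetime, timedelta

def batch_window(last_run, interval=timedelta(days=1)):
    """Compute the [start, end) extraction window for a scheduled run.
    Using the previous run's end as the next start avoids gaps and overlaps."""
    start = last_run
    end = start + interval
    return start, end

def extract_batch(rows, start, end):
    """Pull only rows whose timestamp falls inside the half-open window."""
    return [r for r in rows if start <= r["updated_at"] < end]

rows = [{"id": 1, "updated_at": datetime(2026, 3, 9, 5)},
        {"id": 2, "updated_at": datetime(2026, 3, 10, 5)}]
start, end = batch_window(datetime(2026, 3, 9))
print([r["id"] for r in extract_batch(rows, start, end)])  # [1]
```

The half-open `[start, end)` convention matters: with closed intervals, a row stamped exactly at a window boundary would be extracted twice.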

The batch hub-and-spoke architecture uses a central database as a hub that receives scheduled extracts from various source systems. The hub processes the data and pushes curated outputs downstream to BI tools or reporting marts. This pattern remains common in environments where source systems are on-premises or where compliance requires tight control over scheduling.

Batch workflows have clear strengths. They're well-understood, relatively simple to implement, and sufficient for many analytical use cases. When you're building dashboards to analyze revenue trends or cohort behavior, getting new data once a day may be entirely adequate.

The limitations become apparent as organizations mature. Rigid batch workflows result in lengthy refresh windows, making real-time analytics challenging or impossible. As jobs grow, monolithic scripts become harder to debug or extend. Without modularity, parallel development is constrained, and even small changes can trigger full pipeline reruns.

Change data capture and streaming

Change Data Capture (CDC) represents a fundamentally different approach to data movement. Rather than periodically extracting full or incremental snapshots, CDC detects and synchronizes changes as they occur in source systems, supporting near-real-time pipelines.

Two primary CDC approaches exist. Log-based CDC reads directly from the database's transaction log, offering low-latency, low-overhead synchronization. Trigger-based CDC uses database triggers to record changes as they happen, and is the fallback when direct log access isn't available.
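Whichever capture mechanism is used, the pipeline downstream consumes an ordered stream of change events. A minimal sketch of applying such a stream to a replica table (the event shape here is an assumption, loosely modeled on what a log reader emits):

```python
def apply_change(replica, event):
    """Apply one CDC event to an in-memory replica keyed by primary key.
    Events carry an operation type plus the row image."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]       # upsert the latest row image
    elif op == "delete":
        replica.pop(key, None)            # tolerate deletes of unseen keys

replica = {}
events = [
    {"op": "insert", "key": 1, "row": {"name": "Acme", "tier": "free"}},
    {"op": "update", "key": 1, "row": {"name": "Acme", "tier": "paid"}},
    {"op": "insert", "key": 2, "row": {"name": "Globex", "tier": "free"}},
    {"op": "delete", "key": 2, "row": None},
]
for e in events:           # order matters: apply events in log order
    apply_change(replica, e)
print(replica)  # {1: {'name': 'Acme', 'tier': 'paid'}}
```

Applying events out of order would leave the replica wrong (the delete could land before the insert), which is why CDC pipelines put so much weight on ordering and delivery guarantees.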

The added complexity of CDC is justified when data freshness is essential. Use cases like personalization, fraud detection, and operational reporting benefit significantly from reduced latency. A delayed view in these scenarios can lead to poor decisions or missed opportunities.

To make CDC pipelines reliable, teams must design for exactly-once delivery and handle schema changes proactively. Transforming events into incremental models within the warehouse maintains clean, testable downstream logic. Tools like dbt support incremental materialization, processing only new or changed records after the initial model build, which reduces runtime and compute costs while maintaining data quality.
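The incremental idea reduces to: find the high-water mark already in the target, process only source rows beyond it, and merge them in by key. A simplified Python sketch of the same mechanism (dbt expresses it in SQL; the sample tables and `updated_at` cursor are illustrative):

```python
def incremental_build(source, target, key="id", cursor="updated_at"):
    """Process only source rows newer than the max cursor already present
    in the target, then merge them in by key."""
    high_water = max((r[cursor] for r in target), default=None)
    new_rows = [r for r in source
                if high_water is None or r[cursor] > high_water]
    by_key = {r[key]: r for r in target}
    for r in new_rows:
        by_key[r[key]] = r                # upsert changed or new records
    return list(by_key.values()), len(new_rows)

source = [{"id": 1, "updated_at": 1, "status": "open"},
          {"id": 1, "updated_at": 5, "status": "closed"},
          {"id": 2, "updated_at": 3, "status": "open"}]
target = [{"id": 1, "updated_at": 1, "status": "open"},
          {"id": 2, "updated_at": 3, "status": "open"}]
rebuilt, processed = incremental_build(source, target)
print(processed)  # 1 -- only the row newer than the high-water mark (3)
```

A full rebuild would have scanned all three source rows; the incremental run touches one, which is where the runtime and compute savings come from.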

The major cloud data warehouses have begun supporting constructs that enable more real-time flows. Snowflake emphasizes streams functionality, while BigQuery and Redshift focus on materialized views. Both approaches move in the right direction, though neither fully solves the real-time challenge today.

Data virtualization and federation

Data virtualization allows teams to query data across systems without physically moving it. This pattern is useful for quick proofs of concept or when working with data that cannot be moved due to compliance or privacy requirements.

However, virtualization poses considerable challenges. Live queries can introduce latency and performance risk. Joins across systems are often fragile and slow. Permissions don't always flow downstream, complicating governance.

For most use cases, if data needs to be queried regularly, joined with other sources, or analyzed deeply, it's better to load it into a centralized platform like a data warehouse or lakehouse. Virtualization works best as an exception rather than a core pattern.
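The federated-query idea can be sketched as a join executed in the virtualization layer over live reads from two systems (both "systems" here are hypothetical in-memory stand-ins):

```python
def query_system_a():
    """Live read from system A (say, a CRM); nothing is copied to storage."""
    return [{"customer_id": 1, "name": "Acme"},
            {"customer_id": 2, "name": "Globex"}]

def query_system_b():
    """Live read from system B (say, a billing database)."""
    return [{"customer_id": 1, "mrr": 500}]

def federated_join():
    """Join across systems at query time. Every call re-reads both sources,
    which is where the latency and fragility concerns come from."""
    billing = {r["customer_id"]: r["mrr"] for r in query_system_b()}
    return [{"name": r["name"], "mrr": billing.get(r["customer_id"])}
            for r in query_system_a()]

print(federated_join())
# [{'name': 'Acme', 'mrr': 500}, {'name': 'Globex', 'mrr': None}]
```

Run twice, the query hits both source systems twice; load the data into a warehouse instead and repeated analysis reads local storage.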

Reverse ETL: completing the feedback loop

An emerging pattern that's gaining significant traction is reverse ETL (moving data from the warehouse back into operational systems). This represents a fundamental shift in how organizations think about data movement.

Traditionally, data flows one way: from operational systems into the warehouse where it's analyzed. If that analysis is going to drive action, a human must pick it up and act on it manually. Reverse ETL changes this by feeding transformed data directly back into operational tools.

The use cases are compelling. Customer support staff can see key user behavior data directly in their help desk product. Sales professionals can access product usage data inside their CRM. Marketing teams can trigger automated messaging flows based on product clickstream data, all without maintaining separate event tracking implementations.
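A reverse ETL sync can be sketched as reading a curated warehouse model and upserting each row into an operational tool's API. The CRM client below is a hypothetical stub; real syncs also track state to avoid re-pushing unchanged rows:

```python
def sync_to_crm(warehouse_rows, crm_upsert, state):
    """Push curated warehouse rows into an operational system, skipping rows
    whose payload hasn't changed since the last sync (tracked in `state`)."""
    synced = 0
    for row in warehouse_rows:
        payload = {"email": row["email"], "usage_tier": row["usage_tier"]}
        if state.get(row["email"]) == payload:
            continue                      # unchanged: avoid redundant API calls
        crm_upsert(payload)               # stand-in for a real CRM API client
        state[row["email"]] = payload
        synced += 1
    return synced

calls = []
state = {}
rows = [{"email": "a@acme.com", "usage_tier": "power"},
        {"email": "b@globex.com", "usage_tier": "light"}]
print(sync_to_crm(rows, calls.append, state))  # 2 -- both rows pushed
print(sync_to_crm(rows, calls.append, state))  # 0 -- nothing changed
```

The change-detection step matters in practice: operational APIs are rate-limited, so reverse ETL tools diff against the last synced state rather than replaying the whole table each run.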

This pattern makes the data warehouse not just an analytical endpoint but a central nervous system that powers operational business systems. For teams writing transformation code in dbt today, this means their work will increasingly power production systems, not just internal analytics. This makes the job both more challenging and more impactful.

The modern data lake pattern

A newer pattern gaining adoption is loading data directly into open table formats like Iceberg or Delta in object storage, rather than into proprietary warehouse storage. This approach combines the organizational structure of a data warehouse with the flexibility and cost efficiency of a data lake.

In this pattern, data is loaded into formats like Iceberg in S3, creating an organized, warehouse-like structure in the data lake. Different query engines can then operate on top of this data (Databricks, Snowflake, Athena, or others) without duplicating storage.

This pattern relies on open catalogs to manage metadata and enable multiple engines to query the same data. It offers significant advantages: customers avoid paying for storage multiple times, can use different compute engines for different workloads, and maintain flexibility in their technology choices.
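The shared-storage idea can be sketched with a toy catalog: metadata maps table names to files in object storage, and any number of engines resolve tables through the same catalog rather than holding private copies (all names and the snapshot field below are illustrative):

```python
# A toy open catalog: one copy of the data, many engines reading it.
catalog = {"orders": {"path": "s3://lake/orders/", "format": "iceberg",
                      "snapshot": 42}}

def resolve(table):
    """Any engine resolves a table through the shared catalog, so all
    engines see the same files and the same snapshot of the data."""
    meta = catalog[table]
    return meta["path"], meta["snapshot"]

class Engine:
    """Stand-in for a query engine that reads from shared object storage
    instead of its own proprietary copy."""
    def __init__(self, name):
        self.name = name
    def scan(self, table):
        path, snapshot = resolve(table)
        return f"{self.name} reading {path} @ snapshot {snapshot}"

print(Engine("engine_a").scan("orders"))
print(Engine("engine_b").scan("orders"))
# Both engines resolve the same path and snapshot: storage is paid for once.
```

In a real deployment the catalog also mediates schema evolution and snapshot isolation, which is why interoperable catalogs are central to making this pattern work across vendors.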

The pattern is still maturing. Early adopters are using it successfully, but it's not yet turnkey. Making this widely adopted will require continued ecosystem development to simplify implementation and improve interoperability.

Choosing the right pattern

No single pattern fits all use cases. The right choice depends on latency requirements, data volumes, organizational maturity, and specific business needs.

Batch processing remains appropriate for most analytical workloads where daily or hourly updates suffice. CDC and streaming make sense when real-time data drives operational decisions. Reverse ETL becomes valuable when data needs to flow back into operational systems to automate business processes. The modern data lake pattern suits organizations with diverse compute needs and large data volumes.

The key is understanding that these patterns aren't mutually exclusive. Mature data organizations often use multiple patterns simultaneously, choosing the right approach for each data source and use case.

Building for the future

As the data ecosystem continues to evolve, several trends are clear. Latency will continue to decrease as streaming capabilities mature. Data will increasingly flow bidirectionally, powering both analytics and operational systems. Open formats and standards will provide more flexibility in choosing compute engines.

For data engineering leaders, this means building with modularity and flexibility in mind. Invest in tools that support multiple patterns. Establish clear ownership and data contracts. Implement testing and observability from the start. Treat analytics code like application code, with version control and CI/CD.

The patterns for data movement have matured significantly, but they continue to evolve. Understanding the major patterns, their trade-offs, and their appropriate use cases positions data teams to build infrastructure that scales with their organization's needs.

