What are the most common data pipeline architecture patterns?

Last updated on Mar 18, 2026
The evolution from ETL to ELT
The transition from ETL (Extract-Transform-Load) to ELT (Extract-Load-Transform) represents more than a simple reordering of operations. This shift reflects a fundamental change in how organizations leverage compute resources and structure their data workflows.
In traditional ETL architectures, transformation happens on dedicated middleware servers before data reaches the warehouse. This approach made sense when storage was expensive and compute resources were limited. Teams would extract data from source systems, apply transformations on separate ETL servers, and only then load the processed results into the warehouse.
ELT inverts this model. Raw data lands directly in the warehouse, where transformations occur using the platform's native compute capabilities. This pattern aligns naturally with cloud data warehouses like Snowflake, BigQuery, and Redshift, which offer elastic compute that scales with workload demands.
The ELT approach delivers several advantages for modern teams. Transformations become more transparent and easier to iterate on, since all logic executes in SQL within the warehouse. Multiple teams can work with the same raw data, applying different transformation logic for their specific use cases. Version control and testing become straightforward when transformation code lives in repositories rather than proprietary ETL tools.
dbt has emerged as the standard transformation layer in ELT architectures, bringing software engineering practices to analytics workflows. Rather than building transformations in GUI-based tools, teams write modular SQL that can be tested, documented, and deployed through standard CI/CD pipelines.
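As a sketch, a dbt model is simply a version-controlled SQL `select` statement. The source and column names below are hypothetical, but the shape is representative: raw data is renamed and cast into a clean interface that downstream models build on.

```sql
-- models/staging/stg_orders.sql
-- Minimal dbt staging model (illustrative names): rename and
-- type-cast raw columns so downstream models work from a
-- consistent, documented interface.

with source as (
    select * from {{ source('raw', 'orders') }}
)

select
    id as order_id,
    customer_id,
    cast(created_at as timestamp) as ordered_at,
    amount_cents / 100.0 as amount
from source
```

Because this is plain SQL in a repository, it can be reviewed in pull requests, covered by schema tests, and deployed through the same CI/CD pipelines as application code.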
Batch hub-and-spoke architecture
The batch hub-and-spoke pattern remains prevalent in organizations with on-premises systems or strict compliance requirements. In this architecture, a central database serves as the hub, receiving scheduled extracts from various source systems that act as spokes. The hub processes incoming data and distributes curated outputs to downstream systems like BI tools or reporting marts.
This pattern offers predictable scheduling and tight control over data movement. Organizations can enforce specific processing windows, ensuring that transformations complete before business users need access to refreshed data. The centralized hub provides a clear point of governance and monitoring.
However, batch hub-and-spoke architectures face inherent limitations. Rigid batch windows create latency between when data changes in source systems and when those changes become available for analysis. As data volumes grow, these batch windows can extend uncomfortably, sometimes pushing into business hours. The monolithic nature of batch processing makes it difficult to isolate and debug specific transformation failures.
Scaling batch architectures often requires careful orchestration to manage dependencies between jobs. When one transformation fails, it can cascade through the entire pipeline, delaying all downstream processes. Teams must carefully balance the desire for comprehensive batch processing against the need for timely data availability.
Cloud warehouse as the central hub
Modern cloud data platforms have enabled a more flexible architecture where the warehouse itself serves as the central hub for all data operations. In this pattern, raw data flows directly into platforms like Snowflake, BigQuery, or Databricks, where all subsequent transformations occur using native compute.
This architecture aligns naturally with ELT workflows. Teams use ingestion tools like Airbyte or Fivetran to land raw data in the warehouse, then apply transformations using dbt. The warehouse's elastic compute scales to handle varying workloads, from lightweight development queries to production transformation jobs processing billions of rows.
The warehouse-as-hub pattern supports diverse analytical use cases. Data scientists can access raw data for exploratory analysis. Analytics engineers build tested, documented models for business intelligence. Machine learning teams can train models on the same data foundation that powers dashboards. This convergence on a single platform reduces data movement and eliminates the synchronization challenges that plague multi-system architectures.
Cost management becomes critical in warehouse-centric architectures. Without careful attention to compute usage, costs can escalate quickly. Teams need clear conventions for incremental processing, appropriate materialization strategies, and monitoring of query patterns. dbt helps manage these concerns through features like incremental models that process only new or changed data, reducing both compute time and warehouse costs.
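A hedged sketch of an incremental model follows; the model and column names are hypothetical. On scheduled runs, only rows newer than what the existing table already contains are processed, which is the main lever for controlling transformation costs.

```sql
-- models/marts/fct_orders.sql
-- Hypothetical incremental model: after the first full build,
-- each run processes only new or changed rows, reducing
-- warehouse compute.
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    ordered_at,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- `this` refers to the table built by previous runs
  where ordered_at > (select max(ordered_at) from {{ this }})
{% endif %}
```

The `unique_key` tells dbt to merge updated rows rather than blindly append, so late-arriving corrections to an order replace the earlier version.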
Access control requires explicit configuration in warehouse-centric architectures. Unlike traditional enterprise systems with built-in role-based security, cloud warehouses require teams to thoughtfully design permission structures. However, this explicit approach also provides flexibility to implement fine-grained access controls that align with organizational needs.
The semantic layer pattern
As self-service analytics has expanded and teams work across multiple tools, the semantic layer has emerged as an essential architectural component. Rather than defining metrics repeatedly in different dashboards or applications, teams codify business logic once in a centralized semantic layer that serves all downstream consumers.
The dbt Semantic Layer exemplifies this pattern. Teams define metrics like "customer churn" or "monthly recurring revenue" as declarative assets within dbt. These definitions include the underlying SQL logic, dimensional attributes, and calculation rules. Once defined, these metrics become available through APIs that BI tools, notebooks, and applications can query.
This architectural pattern solves a persistent challenge in analytics organizations: metric inconsistency. When different teams calculate the same metric using slightly different logic, stakeholders lose confidence in data-driven insights. The semantic layer ensures that everyone works from the same definitions, eliminating discrepancies between reports.
Implementing a semantic layer requires upfront investment and cross-functional alignment. Teams must agree on canonical metric definitions before building models. Governance processes need to manage changes to these definitions over time. Performance optimization becomes important as the semantic layer serves queries from multiple consumers simultaneously.
When executed well, the semantic layer accelerates decision-making and improves data governance. Analysts spend less time reconciling conflicting numbers and more time generating insights. New team members can quickly understand available metrics without reverse-engineering dashboard logic. The semantic layer becomes the single source of truth that powers consistent analytics across the organization.
Streaming and real-time patterns
For use cases requiring low-latency data availability, streaming architectures process data continuously rather than in discrete batches. Change Data Capture (CDC) enables near-real-time pipelines by detecting and synchronizing changes as they occur in source systems.
Log-based CDC reads directly from database transaction logs, capturing inserts, updates, and deletes with minimal overhead on source systems. This approach provides low-latency replication suitable for operational reporting and real-time personalization. When direct log access isn't available, trigger-based CDC uses database triggers to record each change to an audit table as it occurs, at the cost of added write overhead on the source.
Streaming data typically lands in message queues or streaming platforms before being processed and loaded into the warehouse. Because most streaming platforms guarantee at-least-once delivery, teams must design for idempotent processing and deduplication to prevent duplicate records. Schema evolution requires careful handling so that changes in source systems don't break downstream processing.
dbt supports streaming architectures through incremental models that efficiently process new events. Rather than reprocessing entire datasets, incremental models append or update only changed records, maintaining performance as data volumes grow. This pattern works well for event streams, CDC feeds, and other continuously arriving data.
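A sketch of that pattern for a continuously arriving event stream, with hypothetical names; the `qualify` clause assumes a warehouse that supports it, such as Snowflake or BigQuery:

```sql
-- models/staging/stg_events.sql
-- Hypothetical incremental model over an event stream:
-- append only events newer than the last run, and drop
-- duplicates introduced by at-least-once delivery.
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    user_id,
    event_type,
    event_timestamp
from {{ source('streaming', 'raw_events') }}

{% if is_incremental() %}
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
-- keep one row per event_id within the batch
qualify row_number() over (
    partition by event_id
    order by event_timestamp desc
) = 1
```

The `unique_key` handles duplicates across runs via merge, while the `qualify` filter removes duplicates within a single batch, so redelivered events never double-count.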
The complexity of streaming architectures is justified when data freshness directly impacts business outcomes. Fraud detection systems need immediate access to transaction data. Personalization engines require current user behavior to make relevant recommendations. Operational dashboards must reflect the latest system state. For these use cases, the investment in streaming infrastructure delivers clear value.
Hybrid and federated patterns
Some organizations employ hybrid architectures that combine multiple patterns. Data virtualization and federation allow queries across systems without physically moving data, which can be useful for proof-of-concept work or when compliance requirements prevent data movement.
However, virtualized approaches introduce performance challenges. Live queries across systems create latency and place load on source databases. Joins between federated sources are often slow and fragile. Permission models don't always translate cleanly across system boundaries, complicating governance.
For data that requires regular querying, joining with other sources, or deep analysis, loading into a centralized warehouse typically provides better performance and reliability. Data virtualization works best as a tactical solution for specific edge cases rather than as a core architectural pattern.
Choosing the right architecture
No single architecture pattern fits every organization. The right choice depends on latency requirements, data volumes, governance needs, and team capabilities. Many organizations employ multiple patterns simultaneously, using batch processing for historical analysis, streaming for real-time use cases, and a semantic layer to ensure consistency across tools.
The most successful architectures share common characteristics: they're modular and testable, with clear ownership and documentation. They incorporate automated quality checks and monitoring. They balance performance with cost efficiency. And they're designed to evolve as business needs change.
dbt provides the transformation backbone for these diverse architectural patterns. Whether teams are building batch pipelines, processing streaming data, or serving metrics through a semantic layer, dbt brings consistency to transformation logic. Version control, automated testing, and comprehensive documentation ensure that pipelines remain maintainable as they scale.
For data engineering leaders evaluating architecture patterns, the key is selecting approaches that align with organizational objectives while maintaining flexibility for future evolution. The modern data stack provides powerful building blocks; the challenge lies in assembling them into architectures that deliver reliable, timely insights at scale.