Data engineering tools: How do they fit together?

Data infrastructure today centers on a modular modern data stack rather than a single Extract, Transform, and Load (ETL) suite. Businesses assemble dedicated tools for each stage of the data lifecycle: ingestion, transformation, and storage.
This decomposition makes it simple to build, update, and scale each layer independently without rebuilding the entire pipeline. The outcome is not just a pile of tools. It’s a living system for creating dependable data products that’s greater than the sum of its parts.
But how do these components interact to convert raw, fragmented inputs into clean, production-grade outputs? In this article, we’ll deconstruct how these elements combine to create the cohesive, automated pipeline of the modern data engineering toolchain. We’ll also discuss the benefits and challenges teams encounter in building resilient systems.
Data ingestion: Getting raw data into the warehouse
A robust data ingestion layer decouples data extraction from transformation. Teams can add or modify data sources with little to no downstream effect. Collecting all raw data in a common storage tier like a data warehouse allows easy reprocessing or manipulation of data without re-extracting it.
Data ingestion platforms like Fivetran connect to a variety of sources, including Software as a Service (SaaS) applications and relational and NoSQL databases. They can operate in batch and streaming modes, including micro-batch, to support near-real-time requirements.
The majority of modern pipelines follow an ELT (Extract, Load, and Transform) pattern: the tool extracts data from sources and loads it in raw form, and transformation engines then bring the data into a usable format. This is the opposite of traditional ETL, which performs all complex transformations before centralizing the data.
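To make the ELT pattern concrete, here is a minimal Python sketch of the extract-and-load step: records are pulled from a source and written to a raw table exactly as received, leaving all cleanup to the transformation layer. The source URL, table name, and SQLite target are illustrative stand-ins, not any particular tool's API.

```python
# Minimal ELT "extract and load" sketch: land raw records untouched,
# deferring all cleanup to the transformation layer (illustrative only).
import json
import sqlite3
from datetime import datetime, timezone

import requests

SOURCE_URL = "https://api.example.com/v1/orders"  # hypothetical source API
RAW_TABLE = "raw_orders"                          # hypothetical raw table


def extract(url: str) -> list[dict]:
    """Pull records from the source without reshaping them."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def load_raw(records: list[dict], db_path: str = "warehouse.db") -> None:
    """Append each record as-is: raw JSON plus a load timestamp."""
    con = sqlite3.connect(db_path)
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {RAW_TABLE} (loaded_at TEXT, payload TEXT)"
    )
    loaded_at = datetime.now(timezone.utc).isoformat()
    con.executemany(
        f"INSERT INTO {RAW_TABLE} (loaded_at, payload) VALUES (?, ?)",
        [(loaded_at, json.dumps(record)) for record in records],
    )
    con.commit()
    con.close()


if __name__ == "__main__":
    load_raw(extract(SOURCE_URL))
```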
Transformation: Modeling data with software engineering principles
A modern data transformation process treats transformations as code, with peer reviews, testing, and automated deployments. dbt is the standard tool for this, enabling modular SQL models to transform data no matter where it lives in your enterprise.
The raw tables landed by the ingestion layer are cleaned and transformed into business-ready fact and dimension tables, all expressed as SQL or Python code. With Git-based version control and automated deployment workflows, dbt supports an Analytics Development Lifecycle (ADLC) that mirrors the software development lifecycle (SDLC): changes are developed, tested, and released iteratively.
Teams can run their data pipelines on schedules or trigger them through orchestrators to keep data up-to-date. Tests can be run on incoming data to ensure continuous data quality and prevent drift.
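As a sketch of what "transformation as code" looks like in practice, here is a minimal dbt Python model (dbt supports Python models alongside SQL). It assumes a hypothetical upstream staging model named stg_orders with customer_id and amount columns, and a Snowflake target, where dbt.ref returns a Snowpark DataFrame that can be converted to pandas.

```python
# models/marts/customer_revenue.py
# Minimal dbt Python model sketch: aggregate a hypothetical stg_orders
# staging model into one revenue row per customer.
def model(dbt, session):
    # Materialize the result as a table in the warehouse.
    dbt.config(materialized="table")

    # dbt.ref() resolves the upstream model and records the dependency in the
    # DAG; on Snowflake it returns a Snowpark DataFrame.
    orders = dbt.ref("stg_orders").to_pandas()

    # Snowflake returns column names in uppercase.
    customer_revenue = orders.groupby("CUSTOMER_ID", as_index=False).agg(
        TOTAL_REVENUE=("AMOUNT", "sum"),
        ORDER_COUNT=("AMOUNT", "count"),
    )

    # The returned DataFrame is written back as the model's table.
    return customer_revenue
```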
Storage and compute: Cloud data warehouses
In the modern data stack, cloud warehouses have eliminated the need for separate processing engines for analytics. Under the ELT paradigm, raw data is loaded first, then transformed directly within the warehouse for specific business use cases.
This positions the cloud warehouse as the central, governed layer connecting upstream ingestion with downstream analytics and data products. The stack is built on modern warehouses such as Snowflake, Google BigQuery, and Amazon Redshift, which decouple storage and compute into separately scalable services.
Cloud data warehouses use a fully managed massively parallel processing (MPP) architecture to scale SQL workloads across compute clusters, efficiently handling petabyte-scale data and complex joins. Because storage and query processing scale independently, capacity can be tuned per workload and billed at a granular level.
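Because compute is decoupled from storage, scaling up for a heavy job is a metadata operation rather than a data migration. Here is a hedged sketch using Snowflake's Python connector; the warehouse name, stored procedure, and environment variables are assumptions, not a prescribed setup.

```python
# Sketch: resize a Snowflake virtual warehouse around a heavy job.
# Storage is untouched; only the compute cluster changes size.
import os

import snowflake.connector

con = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="TRANSFORM_WH",  # hypothetical warehouse name
)
cur = con.cursor()
try:
    # Scale compute up just for the expensive backfill...
    cur.execute("ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'LARGE'")
    cur.execute("CALL RUN_BACKFILL()")  # hypothetical stored procedure
    # ...then scale back down so billing stays granular.
    cur.execute("ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'XSMALL'")
finally:
    cur.close()
    con.close()
```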
Observability: Monitoring pipeline health and data quality
Observability tools monitor data health and detect anomalies, maintaining quality across an increasingly complex pipeline. Data observability applies DevOps-style monitoring to data pipelines, giving you end-to-end visibility into data quality, freshness, lineage, and performance.
A good observability system reports standard metrics such as run statuses, record counts, schema versions, and latency to a centralized backend with SLA thresholds and statistical alerts. Once those signals are in place, teams can use an observability platform to automate anomaly detection, root-cause analysis, and incident workflows.
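The signals themselves are simple to emit. Here is a tool-agnostic Python sketch that records run status, row count, latency, and freshness for a pipeline step and flags an SLA breach; the local JSON log is a stand-in for a real observability backend, and the threshold is illustrative.

```python
# Tool-agnostic sketch: emit pipeline health signals with an SLA check.
import json
import time
from datetime import datetime, timezone

FRESHNESS_SLA_SECONDS = 3600  # illustrative SLA threshold


def report_run(step: str, row_count: int, started_at: float,
               last_loaded_at: float) -> dict:
    """Build a metrics event and flag freshness SLA breaches."""
    now = time.time()
    event = {
        "step": step,
        "status": "success",
        "row_count": row_count,
        "latency_seconds": round(now - started_at, 2),
        "freshness_lag_seconds": round(now - last_loaded_at, 2),
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }
    event["sla_breached"] = event["freshness_lag_seconds"] > FRESHNESS_SLA_SECONDS
    # Stand-in backend: append to a local log; a real system would ship
    # this event to a centralized observability store.
    with open("pipeline_metrics.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")
    return event


if __name__ == "__main__":
    start = time.time()
    # ... run the pipeline step here ...
    print(report_run("load_raw_orders", row_count=1200,
                     started_at=start, last_loaded_at=start - 5400))
```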
Analytics and BI: Delivering insights to users
Data engineering’s goal is to provide insights to the business. At the pipeline’s end, BI and analytics tools allow users to explore and visualize the transformed data.
Tools such as Looker, Tableau, and Mode connect to the cloud warehouse and let analysts build reports and dashboards on top of the cleaned, transformed datasets. Because transformation logic is centralized in dbt and the warehouse, BI tools in a modern stack share a single source of truth.
In addition to direct querying, you can use the dbt Semantic Layer to serve metrics to BI tools. The dbt Semantic Layer centralizes the definition and naming of critical cross-team business metrics so that they are accessible and consistent for everyone.
For example, dbt's metrics layer and exported metadata can feed Looker and generate LookML definitions automatically, keeping KPIs and measures defined in dbt in sync with dashboards and reports.
Orchestration: Scheduling and dependency management
Orchestration tools provide reliable, repeatable processes for ingestion, transformation, testing, and BI refreshes. While dbt provides basic orchestration out of the box, tools like Airflow and Prefect can expand dbt’s capabilities by building custom integrations with other third-party systems.
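For example, a minimal Airflow DAG can chain an ingestion trigger, the dbt build, and a BI refresh with retries. The task commands, paths, and schedule below are placeholders, not a prescribed setup.

```python
# Minimal Airflow DAG sketch: ingestion -> dbt build/test -> BI refresh.
# Task names and shell commands are placeholders for your own stack.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="daily_analytics",
    start_date=datetime(2025, 1, 1),
    schedule="0 6 * * *",  # Airflow 2.4+ schedule argument: daily at 06:00
    catchup=False,
    default_args=default_args,
) as dag:
    trigger_ingestion = BashOperator(
        task_id="trigger_ingestion",
        bash_command="python trigger_fivetran_sync.py",  # placeholder script
    )
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )
    refresh_dashboards = BashOperator(
        task_id="refresh_dashboards",
        bash_command="python refresh_tableau_extracts.py",  # placeholder script
    )

    trigger_ingestion >> dbt_build >> refresh_dashboards
```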
Example end-to-end workflow
Let’s take a closer look at a production-grade data pipeline in action.
Ingestion
Fivetran uses CDC and batch connectors to replicate Salesforce objects and Google Analytics logs into a raw schema in Snowflake. It auto-maps fields, adds new columns automatically, and coerces data types, landing the data in Snowflake's micro-partitioned storage. A webhook then triggers custom downstream processing in Airflow.
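The handoff from ingestion to orchestration can be as simple as an HTTP call: when the sync finishes, a webhook handler triggers a DAG run through Airflow's stable REST API. This is a hedged sketch that assumes basic auth is enabled on the Airflow webserver and a downstream DAG named daily_analytics exists.

```python
# Sketch: a "sync finished" webhook triggering an Airflow DAG run via the
# Airflow 2 stable REST API. URL, DAG id, and credentials are assumptions.
import os

import requests

AIRFLOW_URL = os.environ.get("AIRFLOW_URL", "http://localhost:8080")
DAG_ID = "daily_analytics"  # hypothetical downstream DAG


def handle_sync_finished(payload: dict) -> None:
    """Kick off downstream processing once new raw data has landed."""
    response = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        json={"conf": {"connector": payload.get("connector_id", "unknown")}},
        auth=(os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"]),
        timeout=30,
    )
    response.raise_for_status()


if __name__ == "__main__":
    handle_sync_finished({"connector_id": "salesforce"})
```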
Transformation
dbt transforms the data using SQL, creating a dependency-aware directed acyclic graph (DAG) that enforces the order of execution. Staging models filter, cast, parse, and deduplicate the raw data, then materialize as views or ephemeral models.
Mart models rebuild or incrementally update aggregated tables in place using incremental or full-table materializations, with partitioning and clustering configured in dbt.
The warehouse persists these tables in a compressed, columnar format on managed object storage, using micro-partitioning and clustering for efficient pruning and cost savings.
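To make the mart layer concrete, here is a hedged sketch of an incremental dbt Python model for a Snowflake target. The upstream stg_events model and its event_id, event_date, and updated_at columns are hypothetical, and the config keys shown assume Snowflake; dbt.is_incremental and dbt.this come from dbt's Python model interface.

```python
# models/marts/fct_events.py
# Sketch of an incremental dbt Python model: on incremental runs, only rows
# newer than the current table's high-water mark are processed.
def model(dbt, session):
    dbt.config(
        materialized="incremental",
        unique_key="event_id",
        cluster_by=["event_date"],  # illustrative clustering setup (Snowflake)
    )

    events = dbt.ref("stg_events")  # Snowpark DataFrame on Snowflake

    if dbt.is_incremental:
        # Fetch the max updated_at already present in this model's table.
        high_water_mark = session.sql(
            f"select max(updated_at) from {dbt.this}"
        ).collect()[0][0]
        if high_water_mark is not None:
            events = events.filter(events["UPDATED_AT"] > high_water_mark)

    return events
```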
Testing
dbt runs schema and data tests (uniqueness, not-null, relationships) on freshly constructed models. Monte Carlo consumes metadata events from Fivetran, dbt, and the data warehouse to track SLA-based freshness and row-count anomalies, applying statistical detection to metric distributions and alerting on unexpected schema changes. When it detects an anomaly, Monte Carlo automatically analyzes the impact by examining downstream assets.
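The statistical side of that detection is easy to illustrate. Here is a generic sketch (not Monte Carlo's actual implementation) that flags a day's row count as anomalous when it falls more than three standard deviations from the recent mean.

```python
# Generic row-count anomaly check sketch: flag a value more than three
# standard deviations away from the recent historical mean.
from statistics import mean, stdev


def is_anomalous(history: list[int], todays_count: int,
                 z_threshold: float = 3.0) -> bool:
    """Return True when today's row count deviates sharply from history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return todays_count != mu
    return abs(todays_count - mu) / sigma > z_threshold


if __name__ == "__main__":
    recent_counts = [10_120, 9_980, 10_310, 10_050, 9_870, 10_200]
    print(is_anomalous(recent_counts, todays_count=4_300))  # True -> alert
```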
BI refresh
A dbt post-hook or Airflow task calls the Tableau Server REST API to refresh Hyper extracts. Incremental extract logic updates only the partitions that changed, optimizing compute and I/O.
The orchestrator polls the queryTask endpoint, retries on failure, and logs execution metrics in the DAG UI. When the refresh completes, the metadata catalog is updated through the API and dashboards point to the most recent tables.
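For the refresh call itself, here is a hedged sketch using the tableauserverclient library; the server URL, site, token variables, and datasource name are assumptions.

```python
# Sketch: trigger a Tableau extract refresh and wait for the job to finish.
# Server URL, site, token, and datasource name are illustrative assumptions.
import os

import tableauserverclient as TSC

auth = TSC.PersonalAccessTokenAuth(
    token_name=os.environ["TABLEAU_TOKEN_NAME"],
    personal_access_token=os.environ["TABLEAU_TOKEN_VALUE"],
    site_id="analytics",  # hypothetical site
)
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    # Find the published datasource backing the dashboards.
    all_datasources, _ = server.datasources.get()
    target = next(d for d in all_datasources if d.name == "customer_revenue")

    # Kick off the extract refresh; Tableau returns an asynchronous job.
    job = server.datasources.refresh(target)

    # Block until the job completes (raises if the refresh fails).
    server.jobs.wait_for_job(job)
```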
Iteration
An analyst requests a new KPI through the issue tracker or chat. The analytics engineer forks the dbt repo, updates the SQL in the relevant model, and creates new tests.
Datafold's pre-merge data diff runs in the CI pipeline and compares dev and prod model outputs to expose drift. CI runs dbt run and dbt test, and blocks merges on any test or diff failure.
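The CI gate itself can be a short script. Here is a sketch using dbt's programmatic invocation (dbtRunner, available in dbt Core 1.5+) that builds and tests only the modified models and exits non-zero so the merge is blocked on failure; the prod-artifacts/ path for state comparison is an assumption.

```python
# CI gate sketch: build and test changed models programmatically with dbt
# Core's dbtRunner (1.5+); exit non-zero to block the merge on failure.
import sys

from dbt.cli.main import dbtRunner

runner = dbtRunner()

# Build only modified models and everything downstream of them, assuming a
# saved production manifest in prod-artifacts/ for state comparison.
for args in (
    ["run", "--select", "state:modified+", "--state", "prod-artifacts/"],
    ["test", "--select", "state:modified+", "--state", "prod-artifacts/"],
):
    result = runner.invoke(args)
    if not result.success:
        sys.exit(1)
```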
Benefits of data engineering tools
The success of a modern data platform strongly relies on the selection of tools at every step of the data pipeline. Choosing the right tools brings a number of benefits, including:
- Agility and speed: Ingestion from APIs, files, or event streams can run in parallel with modeling workflows, reducing bottlenecks. The cloud-native infrastructure of a modern data pipeline scales compute, storage, and pipeline concurrency on demand with limited manual work.
- Scalability: Growing data volumes are handled by tuning compute settings or schedules rather than redesigning pipelines. Containerized, stateless components can scale horizontally and run parallel processes in high-throughput environments.
- Resilience: Modern data engineering tools integrate observability and orchestration to provide pipeline resilience. Observability monitors performance and identifies anomalies in real time, while orchestration automates recovery with SLA sensors, retries, and conditional execution.
Trade-offs and challenges
Modern data engineering tools are more flexible and powerful, but they also add architectural and operational complexity. A multi-tool stack needs to be well designed, well integrated, and aligned across the team, or it becomes brittle and fragmented.
- Tool sprawl: Supporting separate platforms for ingestion, transformation, orchestration, and observability multiplies the number of integrations to maintain. Minimize sprawl through platform consolidation, integration standards, and automated provisioning that prevents environment drift.
- Debugging: The cause of a failure can sit anywhere: upstream data issues, pipeline problems, model logic, or BI configuration. Without centralized observability and alerting, resolution times rise and incident response becomes reactive.
- Cost and complexity: Licensing, cloud consumption, and the overhead of running multiple services add up quickly. Teams can keep costs and complexity in check with transparent cost tracking, strong governance, and a streamlined architecture.
- Learning curve: Modern stacks require fluency with orchestration frameworks, transformation layers, Git workflows, and cloud warehouses. Bring new engineers up to speed with structured onboarding, project-based work, clear documentation, and hands-on exercises across the stack's tools and workflows.
Conclusion
Modern data stacks utilize a layered architecture to transform raw data into trusted insights, enabling parallel processing, scaling, and rapid response to data requirements. Transformation is the backbone of this stack, turning fragmented inputs into clean, validated, and well-documented models.
dbt’s modular design and support for testing, automated deployment, and monitoring simplify data workflows and strengthen their resilience. To learn more, schedule a demo of dbt today.