ELT best practices for Databricks workflows

Joey Gault

last updated on Oct 15, 2025

When you architect ELT workflows on Databricks, your choice of compute infrastructure directly impacts both performance and cost efficiency. Databricks offers several compute options—SQL warehouses, All-Purpose Compute, and Jobs Compute—but for ELT workloads, SQL warehouses are the optimal choice.

SQL warehouses are specifically optimized for SQL workloads and provide built-in features like query history for auditing and optimization. They scale both vertically to handle larger datasets and horizontally to support concurrent operations. Among the available warehouse types—Serverless, Pro, and Classic—serverless warehouses offer the most compelling advantages for ELT workflows.

Serverless warehouses dramatically reduce spin-up times and scale quickly when workloads demand additional resources. This eliminates the need to keep compute sitting idle: a serverless warehouse starts rapidly when work begins and shuts down when the work completes. Serverless warehouses also use Databricks' Photon engine automatically, providing optimal performance for both transformation and serving workloads.

Sizing your SQL warehouses requires balancing data volume, query complexity, and latency requirements. A practical approach is to start with a Medium warehouse and adjust based on observed performance. Larger warehouses often prove more cost-effective for complex workloads because they complete tasks faster: a pipeline that takes an hour on a Small warehouse might finish in thirty minutes on a Medium one.

Consider provisioning separate warehouses for different workload types. ELT pipelines have different compute patterns than ad-hoc analysis, so dedicated "pipeline" warehouses sized for data volumes and SLAs work more efficiently than shared resources. Factor in dbt's thread count when sizing—higher parallelism requires more compute capacity to maintain performance.

Configure auto-stop settings aggressively for serverless warehouses. Since they spin up in seconds, setting auto-stop to five minutes (or as low as one minute via API) won't impact user experience while minimizing costs.

Architectural patterns: staging, intermediate, and marts

Successful ELT implementations on Databricks follow a layered approach that mirrors the medallion architecture: bronze (staging), silver (intermediate), and gold (marts). This structure provides clear data lineage, enables incremental development, and supports different consumption patterns across your organization.

The bronze layer handles raw data ingestion and initial storage. For datasets in cloud storage, leverage Databricks' COPY INTO functionality rather than staging external tables. COPY INTO operates incrementally and ensures data is written in Delta format, providing the performance, reliability, and governance advantages that Delta tables offer. You can implement COPY INTO as a pre-hook before building downstream models or invoke it using dbt's run-operation command.
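For example, one way to wire up the run-operation path is a small macro that wraps the COPY INTO statement. This is a minimal sketch; the macro name, schema, table, and storage path below are placeholders rather than values from this article:

```sql
-- macros/load_orders_raw.sql: hypothetical macro that loads newly arrived files
-- into a pre-created bronze Delta table via COPY INTO (names and path are examples)
{% macro load_orders_raw() %}
  {% set copy_sql %}
    COPY INTO bronze.orders_raw
    FROM 's3://my-landing-bucket/orders/'
    FILEFORMAT = JSON
    COPY_OPTIONS ('mergeSchema' = 'true');
  {% endset %}
  {% do run_query(copy_sql) %}
{% endmacro %}
```

Invoke it with dbt run-operation load_orders_raw ahead of your downstream builds, or attach the same statement as a pre-hook on the first silver model that depends on the raw table.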

While staging external tables remains an option for teams migrating from other cloud warehouses, this approach prevents you from leveraging Delta's benefits and requires additional maintenance like running repair operations for new partition metadata.

The silver layer focuses on cleaned, modeled data optimized for performance and cost. Many organizations implement incremental processing at this stage, processing only new or updated records rather than recreating entire tables. dbt's incremental model materialization facilitates this approach by creating temporary views with data snapshots and merging them into target tables. Configure the temporal range of your snapshots using conditional logic in your is_incremental() blocks—the most common pattern merges data with timestamps later than the current maximum in the target table.
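A minimal sketch of that pattern, assuming a hypothetical updated_at timestamp column and order_id merge key, looks like this:

```sql
-- models/silver/stg_orders.sql: incremental model that merges only new rows
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'order_id'
  )
}}

select
  order_id,
  customer_id,
  order_status,
  updated_at
from {{ source('bronze', 'orders_raw') }}

{% if is_incremental() %}
  -- only process rows newer than the latest timestamp already in the target table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```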

For teams with tight SLAs, several advanced optimization techniques can improve merge performance significantly. Enable auto compaction to keep file sizes in the optimal 32 MB to 256 MB range. Optimized writes, which SQL warehouses enable by default, already manage file sizes at write time, but opting into auto compaction adds a compaction pass after writes that further consolidates small files.
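To opt a table in explicitly, the standard Delta table properties can be set with a one-off statement like the sketch below (the table name is an example):

```sql
-- Enable optimized writes and auto compaction on an existing Delta table
ALTER TABLE silver.stg_orders SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);
```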

Leverage data skipping through Z-ordering for high-cardinality columns frequently used in joins or filters. The syntax is straightforward: OPTIMIZE table_name ZORDER BY (col1, col2, col3). Limit Z-ordering to three columns maximum and run it either as a post-hook after model builds or as a scheduled job on a regular cadence.
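As a post-hook, that might look like the config block below at the top of a model; the columns are placeholders for whatever your queries actually join and filter on:

```sql
-- Rewrite the table's file layout after each build so data skipping stays effective
{{
  config(
    post_hook = "OPTIMIZE {{ this }} ZORDER BY (customer_id, order_date)"
  )
}}
```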

Maintain current statistics with the ANALYZE TABLE command to ensure optimal join plans. Run this for columns frequently used in joins, either as a post-hook or scheduled job: ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS col1, col2, col3.
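For the scheduled-job flavor, a small macro invoked with dbt run-operation works well; the table and column names here are assumptions for illustration:

```sql
-- macros/analyze_silver.sql: refresh column statistics on a fixed cadence
{% macro analyze_silver() %}
  {% do run_query("ANALYZE TABLE silver.stg_orders COMPUTE STATISTICS FOR COLUMNS order_id, customer_id, order_date") %}
  {% do run_query("ANALYZE TABLE silver.stg_customers COMPUTE STATISTICS FOR COLUMNS customer_id") %}
{% endmacro %}
```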

Consider running VACUUM to remove unused files from Delta tables. When records are deleted or rewritten, the affected files are only marked as removed in the transaction log; the underlying files remain in storage. VACUUM physically removes these files to reduce storage costs and improve merge performance, but be cautious with retention periods: once a file is vacuumed, you can no longer restore or time travel to table versions that depend on it, so the default seven-day retention is a sensible floor.
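A conservative example, keeping the default seven-day window (the table name is a placeholder):

```sql
-- Remove files no longer referenced by the Delta transaction log, retaining
-- 168 hours (seven days) of history so recent time travel and restores still work
VACUUM silver.stg_orders RETAIN 168 HOURS;
```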

The gold layer delivers business-ready marts that stakeholders access through BI tools. Apply the same optimization techniques as the silver layer, with particular attention to Z-ordering since these tables typically have stricter SLA requirements. Gold tables are ideal candidates for defining metrics using dbt's semantic layer capabilities, ensuring consistency across key business KPIs.

Performance optimization and troubleshooting

Effective performance management requires both proactive optimization and reactive troubleshooting capabilities. Databricks provides several tools to identify and resolve performance bottlenecks in ELT workflows.

The SQL warehouse query profile serves as your primary troubleshooting tool. It provides detailed information about query execution, including time spent in tasks, rows processed, and memory consumption. The profile offers both tree and graph views to identify slow operations and understand data transformation flows.

Common performance issues revealed by query profiles include inefficient file pruning, full table scans, and exploding joins. Address file pruning issues by reordering columns in your bronze-to-silver transformations: Databricks collects file-level statistics on the first 32 columns by default, so keep numerical keys and high-cardinality query predicates within those first 32 positions and push strings and complex data types toward the end.
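Where reordering columns isn't practical, Delta also exposes a table property that controls how many leading columns receive file-level statistics. That knob isn't covered above, so treat this as a hedged sketch: the table name and value are examples, and widening the window adds some write overhead.

```sql
-- Collect data-skipping statistics on the first 40 columns instead of the default 32
ALTER TABLE silver.wide_events SET TBLPROPERTIES (
  'delta.dataSkippingNumIndexedCols' = '40'
);
```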

Full table scans indicate queries scanning entire tables rather than leveraging file-level statistics. File compaction and Z-ordering techniques described earlier help alleviate this problem by improving data layout and enabling better file skipping.

Exploding joins produce result sets much larger than input tables, often creating Cartesian products. Prevent these by making join conditions more specific and preprocessing data through aggregation, filtering, or sampling before join operations.
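For instance, aggregating the many-rows-per-key side before the join keeps the result set bounded; the tables and columns below are invented for illustration:

```sql
-- Pre-aggregate payments so each order joins to at most one row
with payments_agg as (
  select
    order_id,
    sum(amount) as total_paid,
    count(*)    as payment_count
  from silver.payments
  group by order_id
)

select
  o.order_id,
  o.customer_id,
  p.total_paid,
  p.payment_count
from silver.orders as o
left join payments_agg as p
  on o.order_id = p.order_id
```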

For incremental models, rely on the merge strategy as the recommended approach for most use cases. Databricks has significantly improved merge performance with low-shuffle merge and Photon optimizations. Optimize merge operations by reading only relevant partitions through filters and incremental_predicates, updating only the rows and columns that actually change, and defining a single merge key, materialized in the table, for efficient lookups.
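A sketch of an incremental config using incremental_predicates, assuming the target table is organized by an order_date column and that seven days is a safe late-arrival window (both assumptions); DBT_INTERNAL_DEST is the alias dbt gives the target table in the merge it generates:

```sql
-- Restrict the merge scan to recent target data instead of the whole table
{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'merge',
    unique_key = 'order_id',
    incremental_predicates = [
      "DBT_INTERNAL_DEST.order_date >= date_sub(current_date(), 7)"
    ]
  )
}}
```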

Monitoring and observability

Understanding your ELT performance requires comprehensive monitoring of both dbt operations and Databricks resource utilization. dbt generates metadata on timing, configuration, and freshness with each job run, accessible through the Discovery API—a GraphQL service that supports queries on this metadata.

Teams can analyze this data like any other business intelligence source by piping it into their data warehouse. The Model Timing tab in dbt provides visual identification of models requiring optimization or refactoring.

The Admin API enables pulling dbt artifacts from runs, allowing you to stage manifest.json files and model the data using the dbt artifacts package. This package helps identify inefficiencies and optimization opportunities across your dbt models.

Combine dbt monitoring with Databricks' native observability features to gain complete visibility into your ELT workflows. Monitor warehouse utilization, query performance, and cost trends to identify optimization opportunities and ensure efficient resource usage.

Implementation considerations

Successfully implementing ELT on Databricks requires careful attention to both technical and organizational factors. Start with clear data contracts that define schema expectations, freshness requirements, and ownership responsibilities between data producers and consumers. These contracts reduce friction and prevent silent failures as your pipelines evolve.

Implement version control and CI/CD practices from the beginning. Treat your analytics code like application code, with peer reviews, automated testing, and safe deployment practices. dbt's integration with Git-based workflows makes this straightforward while providing the collaboration benefits that modern data teams require.

Consider incremental adoption when migrating from existing ETL processes. Begin with less critical datasets to validate your approach and build team expertise before tackling mission-critical workflows. This approach reduces risk while allowing your team to develop best practices specific to your organization's needs.

Plan for schema evolution and data governance from the start. ELT's flexibility in handling schema changes is a significant advantage, but it requires thoughtful approaches to managing those changes across your pipeline. Establish conventions for handling new columns, data type changes, and structural modifications to source systems.

Conclusion

ELT on Databricks represents more than a technical architecture choice—it's a foundation for building scalable, maintainable, and collaborative data workflows. By leveraging serverless SQL warehouses, implementing layered data architectures, and following performance optimization best practices, data engineering leaders can deliver faster insights while reducing operational overhead.

The combination of Databricks' cloud-native capabilities with dbt's transformation framework provides a powerful foundation for modern analytics engineering. Teams that embrace these patterns find themselves spending less time on infrastructure management and more time delivering business value through trusted, well-governed data products.

Success with ELT requires commitment to software engineering best practices, proactive performance monitoring, and clear organizational processes around data ownership and quality. When implemented thoughtfully, these approaches enable data teams to scale efficiently while maintaining the reliability and trust that business stakeholders demand.
