Performance optimization strategies for dbt data models across data warehouses from Coalesce 2023
"My objective here is to share some tips with you…regardless of which engine you're using, to make your projects more scalable and ultimately tune the performance."
Juan Manuel Perafan, Analytics Engineer at Xebia, explains performance tuning in the context of dbt, sharing tips for optimizing performance, enhancing scalability, and reducing costs in dbt projects.
Use dbt Config Blocks for performance tuning
Juan emphasizes the importance of the dbt config block for performance tuning: partitioning, clustering, and indexing can all be defined there to significantly improve performance and scalability.
Juan advises, "The config blocks are your friends. Just add partitions, clustering, and indexing there. Experiment a lot with materializations. They work differently from warehouse to warehouse but being mindful about them can also save you a lot of time."
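As a minimal sketch of what that can look like, here is a config block in a BigQuery-backed project; the model and column names are illustrative, and the exact config keys vary by adapter (Snowflake, Redshift, and Postgres each expose different options).

```sql
-- models/marts/fct_orders.sql (illustrative name and columns)
-- partition_by and cluster_by shown here follow the BigQuery adapter's syntax;
-- other warehouses use different keys for the same idea.
{{
    config(
        materialized='table',
        partition_by={
            "field": "order_date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=["customer_id"]
    )
}}

select
    order_id,
    customer_id,
    order_date,
    order_total
from {{ ref('stg_orders') }}
```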
He also suggests turning Common Table Expressions (CTEs) into intermediate models to make complex projects more manageable and efficient. "Turn CTEs into intermediate models. Even if they're not exactly the same…if they have similar joins or similar tables, it's already worth it to isolate them," he recommends.
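As a rough illustration of that refactor, a join that several marts repeat as a CTE can move into its own intermediate model so downstream models simply ref() it instead of re-running it (all model names here are hypothetical):

```sql
-- models/intermediate/int_orders_enriched.sql (illustrative)
-- Before: this orders-to-customers join lived as a CTE inside multiple mart models.
-- After: it is built once here, and downstream models select from
-- ref('int_orders_enriched') instead of repeating the join.
{{ config(materialized='table') }}

with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
)

select
    orders.order_id,
    orders.order_date,
    orders.order_total,
    customers.customer_id,
    customers.customer_region
from orders
join customers
    on orders.customer_id = customers.customer_id
```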
Importance of sampling and test optimization
Juan highlights sampling data in development as a key strategy in performance tuning. By focusing only on necessary data, performance improves and the risk of overloading the system is reduced. He also emphasizes the need for careful optimization of dbt tests to prevent excessive query execution.
"Sample your data in development…Only do it for the rows and the columns that you care about." He adds, "Always add WHERE clauses to your test and just iterate over what you need to iterate, instead of every single row."
Juan mentions that dbt tests can sometimes be the most expensive queries, so it's crucial to add WHERE clauses and only iterate over the necessary data.
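As one hedged example of this, a singular test in the tests/ directory can carry its own WHERE clause so it only scans the slice that actually needs checking; generic tests support a similar `where` config in YAML. The test name, model, and window below are illustrative:

```sql
-- tests/assert_no_negative_order_totals_recent.sql (illustrative)
-- The test fails if any rows are returned; the WHERE clause keeps it from
-- scanning the full order history on every run.
select
    order_id,
    order_total
from {{ ref('fct_orders') }}
where order_total < 0
  and order_date >= dateadd(day, -7, current_date)
```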
The role of materializations in performance tuning
Materializations are key elements in performance tuning. Juan suggests experimenting with different materializations to find the most efficient approach for specific projects. He also discusses the benefits of specific dbt features, such as ephemeral models and materialized views, in improving performance.
"Materializations can make your projects more scalable and ultimately tune the performance," he says. "Materialized views often have a lot of the benefits of views, as in they're fresh, but they're also materialized so they're efficient. If this is an option, especially for consumption, check it out. Give it a try."
Juan emphasizes the value of ephemeral models for isolating difficult transformations and complex queries. However, he cautions that they can make debugging more difficult.
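An ephemeral model is just a regular model with `materialized='ephemeral'`: dbt inlines it as a CTE into the compiled SQL of whatever refs it, so nothing is created in the warehouse for that file, which is also why debugging it can be harder. A small sketch with illustrative names:

```sql
-- models/intermediate/int_deduped_payments.sql (illustrative)
-- Compiled away: dbt injects this as a CTE into any downstream model that
-- refs it, so no table or view is built for this file.
{{ config(materialized='ephemeral') }}

select distinct
    payment_id,
    order_id,
    amount
from {{ ref('stg_payments') }}
```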
Juan’s key insights on boosting data model performance
- Performance tuning is crucial for developer experience, project scalability, and cost reduction
- Most literature on performance tuning is outdated or very engine-specific, with little information on dbt
- Using config blocks in dbt can help optimize performance
- Materializations can be a game changer if used correctly
- Turning Common Table Expressions (CTEs) into intermediate models can improve performance
- Sampling data is a vital practice, particularly in development environments
- Performance considerations should not be overlooked when conducting dbt tests