Optimizing data pipeline costs: 29 tactics

Cost Optimization Playbook

Optimizing data pipeline costs: 29 tactics

26 min

When it comes to optimizing costs in the data workflow, there is no one-size-fits-all solution. The reality is that working with data can be expensive. But it doesn’t have to be inefficient nor should it require large overhauls of process and technology. In fact, there are also smaller, tactical improvements data teams can make that add up to massive savings over time.

In this playbook, dbt Labs—along with our partners Data Clymer, phData, BDO Digital, and Hakkoda—share tips for reducing costs in your data work. Some ideas are specific to dbt, but many can be adopted across multiple platforms.

The tactics in this post are organized in three main categories:

How to optimize data platforms for efficiency and cost
Ways to optimize data development for cost
How to prioritize high-value work

Some of these tactics are specific to dbt, but many can be adopted across multiple platforms. Some of these strategies are technical and others are cultural.

But ultimately, all 29 of these are "no brainers".

How to optimize data platforms for efficiency and cost

1. Measure the cost of your data workflows

To really optimize the costs of your data workflows, you need to start by accurately measuring them. (You can’t gauge improvements if you don’t have a baseline.) Leverage Snowflake’s resource monitors and system views to actively monitor and report on credit consumption. BI tools such as Sigma Computing provide out-of-the-box templates, enabling you to quickly understand how much you are spending on tools like Fivetran and Snowflake with minimal time investment from your team. dbt offers run time monitoring to provide insight into which models are taking the longest, giving you further insight into which part of your DAG can be optimized.

These are just a few of the ways you can get a sense of existing costs, but be sure to check all the tools in your data stack to see which insights they can offer.

Data Clymer

2. Conduct a model timing analysis

Once you have a sense of existing costs, it’s time to dig into your model runs. Are any runs holding up other work? Is there a model that’s running way longer than expected? Excessive runtime can translate into large compute expenses in your cloud data platform—and thus opportunities for cost savings. To get a handle on model runs, you’ll need to conduct a model timing analysis.

If you’re using dbt, you can simply use the Model Timing tab to inspect logs and identify areas for tuning. In fact, the dbt Labs team used this feature to shave 90 minutes off our longest running model—which translated to $1800/month saved in Snowflake credits. The new dbt Explorer also provides metadata on dbt runs, such as the models with the longest execution time. You can use this information to assess trends in execution times and select areas for improvement.

Regardless of how you conduct your model timing analysis, it’s a best practice to regularly inspect your model runs. By understanding the holistic data environment, targeted optimizations can be made, thereby enhancing performance and reducing costs.

BDO Digital

3. Examine deployments

Similar to model runs, you should examine how your deployments are running and make sure that you take advantage of the advanced selection/exclusion methods built into dbt (state selection, source freshness selection, tag-based selection, etc). At the same time, examine for any models that can be consolidated into a singular data model to reduce redundancy and duplication. These two methods should always be the first levers you pull on to optimize performance and reduce compute and dbt runtime within your cloud data platform, as these tend to be the most straightforward way to optimize your pipeline.

phData

4. Socialize metrics

Next, integrate all these insights into your data governance processes. It’s important that data team members actually take the time to review trends and act upon them. Include cost reporting in your weekly or monthly standup meetings to socialize these metrics and ensure your team understands spending on the data platform and how they can help reduce these costs.

Cost optimization isn’t a one-person job—it takes the full team. By incorporating these metrics into regular reporting, you ensure that each team member is aware of the levers they can pull to help reduce costs.

Data Clymer

5. Use sampling for non-production environments

Have you heard of the four V’s of data? It’s velocity, veracity, volume, and variety. You can look at cost improvements through the lens of the four V’s.

For example, take a look at long-running development and CI/CD processes. These could be significant signs that you’re working with a lot of data that takes a while to develop and test, which can be resolved by working off of a sample of your data set, which could be implemented in a macro like:

Example of a dbt Jinja macro that conditionally applies a sample clause for non-production environments, using a customizable sample type and size.

This sample macro can be applied to all data sets, allowing for development and CI/CD pipelines to run with a fraction of the data, helping reduce compute and storage costs. This same sort of solution can also be used to help optimize tests that run on extremely large datasets, using the where config and overloading the get_where_subquery macro to look at certain running partitions of your data using something like:

Example of a dbt YAML config and custom Jinja macro that dynamically filters data using a rolling date window placeholder (__window__) in a model-level uniqueness test.

By reducing processing time and storage volumes in non-production environments, these levers provide ways to start optimizing datasets based on the needs of the four V’s for our data.

phData

7. Defer to production with dbt

Defer to Production is a handy feature in dbt that reduces model builds in dev environments, reduces build times for extremely active developers, and removes the need to import lots of data into the dev environment.

You can use defer to production in a few ways:

With the dbt CLI, it’s automatically enabled for all commands.
In the dbt IDE, you can simply toggle a button at the bottom right of the editor.

Defer to production ensures that when developers edit models, they can build, run, and test them in a development environment, without needing to run and build all the upstream models that came before them. Instead, dbt will automatically fetch the data for the non-edited models from the production environment, saving storage and compute costs that would come from rebuilding models that don’t need to be rebuilt.

dbt Labs

8. Manage keys in cloud data platforms

Managing cluster keys in Snowflake, partition keys in BigQuery, and sort/dist keys in Redshift is another way to effectively manage data platform compute costs. These keys play a pivotal role in determining how data is distributed and organized within warehouse tables, directly impacting query performance and resource utilization.

To start, strategically choose and fine-tune these keys to minimize the amount of data that needs to be processed during queries, reducing query execution times and, in turn, the associated costs. dbt Labs offers a native configuration to manage these keys directly in your dbt project. For more information, see the platform-specific configs for BigQuery, Snowflake, and Redshift. Keep in mind that managing these keys does usually incur additional costs. So be sure to do careful analysis on query usage patterns to ensure the performance gains outweigh the cost associated with managing the keys.

Data Clymer

9. Test only the relevant changes

It goes without saying that you shouldn’t re-run a full pipeline or dataset to test a small change. But, it happens. And when it happens, it’s expensive.

To avoid running more data than you need to (and incurring the related data platform costs) make sure to incorporate these practices into your team’s routine:

Enable and disable parts of the pipeline to test discrete units.
Use Slim CI in dbt to run and test only modified models.
Implement a write-audit-publish style deployment where a pipeline's particular results can be verified in a non-production environment and, once approved, swapped into production.
Use smaller windows for data testing on extremely large datasets with long-running tests.

phData & dbt Labs

10. Auto-cancel stale CI builds

Similarly, if you’re using dbt, you can also use the dbt scheduler to manage the auto-cancellation of stale CI builds. With this feature, dbt can detect and cancel any in-flight, now-stale checks in favor of executing tests on the newest code committed to a PR. Auto-cancellation improves team productivity while reducing data platform spending on wasteful CI runs.

dbt Labs

11. Centralize the code database

It’s not uncommon for data teams to learn that legacy code is still getting computed by the cloud data platform many years after it was written. Sometimes it turns out that no one on the current team has context for what this code accomplishes.

In these scenarios, it’s essential to centralize the code database to get a stronger grasp on exactly what code is getting executed and why. One team that undertook such an effort was the data team at Total Wine—they dug into their code database to make sure they understood every script and metric that was getting computed every night.

“We saw that certain old pieces of code were getting re-computed or re-recalculated, multiple kinds across several tables,” said Pratik Vij, senior manager of data engineering at Total Wine. “This core duplication was not only expensive but there was no added benefit of recomputing these values because the underlying data wasn't changing.”

These findings prompted Total Wine to centralize their code database. Once they did, Pratik noted, “We were able to save around 10 to 15 percent” on data platform costs.

Total Wine

Ways to optimize data development for cost

12. Establish quality foundations by providing transparency to data consumers

There are many tactics that help save time, resources, effort—and thus money—in the data development process. Ultimately, these tactics boil down to the same key idea: The team must write great code and build strong pipelines, which guarantees long-term stability.

To this end, teams must establish quality foundations early. You can incorporate quality metrics and data consumer feedback mechanisms that bolster the robustness of your data pipelines. For example, In dbt, you can leverage native support for in-lineage tests, continuous integration, and documentation.

Put another way: In some situations, data engineering can feel like a black box to the rest of the organization. Business stakeholders submit requests and hope the data model is built correctly. But too often, data products are met with the response of, “Oh that’s wrong, you have to retest things and go back.”

By leveraging tools and features that provide clarity and transparency into the data development process, data teams can get feedback sooner and ensure that necessary corrections are made early on in the process.

BDO Digital

13. Add comment blocks for context

Every engineer knows the pain of debugging old code without comments. Logic that may have made sense to a prior team can look like “alphabet soup” years later. Improving that code and figuring out how it impacts current work can take days of manual work—and that’s expensive time that could be better spent on other projects.

Adding comment blocks for context should be a regular part of your data development process. Make sure to consistently include comment blocks that give context about the "what," "why," and "how" of data models. It’ll save you and your team headaches—and critical resources—in the future.

Hakkoda

14. Reconfigure SQL logic to be more efficient

“We ended up reducing our monthly data platform spend by a little over 50%” just by reconfiguring SQL query logic to be more efficient, Mike Moyer, data engineer at Paxos said. His team noted that simple improvements like changing a sum to be incremental rather than fully recalculated had a huge impact when extended to tens of thousands of rows every time the model is run.

The Paxos team’s advice: Start optimization early and look for big wins first. The faster you get started, the sooner these savings kick into gear. Even if an optimization is seemingly obvious or unsophisticated, taking the time to improve these basic efficiencies can produce significant savings.

Paxos

15. Examine materialization types

If you’re using dbt for transformation, you’ll be aware that materializations are strategies for persisting dbt models in a cloud data platform. By default, dbt models are materialized as “views” but can also be built as tables, incremental models, ephemeral models, and materialized views. Custom materializations are available as well.

To save on data development costs, be sure to examine the type of materializations you’ve created. Remember, stacked views can take a long time to query, so with large volumes of data, it is often best to build data sets incrementally.

phData

16. Embrace a modular intermediate layer

Modular data modeling

The transition to a modular intermediate layer can mark a turning point in cost efficiency, and it’s easy to facilitate with dbt. For example, a BDO Digital client moved their upstream application stack from a monolithic architecture to a microservices architecture, but they were able to plan ahead by implementing dbt with a modular design even while using it with a monolithic architecture. This meant that when the team was ready to fully move their application stack to a microservices architecture, dbt was ready to adapt. Due to dbt's resilience for upstream changes, the team was able to avoid expensive data platform refactoring.

BDO Digital

17. Prioritize meaningful data quality tests

It’s essential to leverage tests for data quality. But be sure your team knows what they’re testing and why. In your data transformation tool (like dbt), emphasize impactful data quality tests over meeting arbitrary coverage targets. By eliminating superfluous tests, execution costs are reduced. Meanwhile, effective tests build data consumer trust and facilitate more confident, quicker development cycles.

BDO Digital

18. Set custom rules for SQL formatting

Code style isn’t about coding with style. It’s about enhancing your team’s development workflow with legible, reliable syntax. Tools like linters and formats can automate much of this work by analyzing code for errors, bugs, and style and formatting issues. In dbt, you can also customize linting rules directly in the IDE.

By setting custom SQL formatting rules, you ensure that your team writes code in a clean and consistent manner—with less time spent on simple syntax mistakes.

dbt Labs

19. Use the dbt Clone command

The dbt clone command allows you to inexpensively copy schema structures using zero-copy cloning. It facilitates safe modifications, avoids data drift, and supports dynamic sources. You can use dbt clone for creating sandboxes, making data accessible in BI tools, and ensuring smooth blue-green deployments. Ultimately, it's a valuable tool for cost-effective, flexible, and efficient data workflows.

dbt Labs

20. Rerun errored jobs from point of failure

When a job fails, it's typically due to a specific issue or error within a portion of the data transformation process. By rerunning the job from the point of failure, you can skip re-executing the successfully completed steps, thus conserving resources and avoiding unnecessary reprocessing of data.

Of course, you can use dbt to easily identify any failed steps and examine the error message to implement a fix. Rerunning the job from the point of error, instead of from scratch, also saves the team time and leads to a faster resolution of issues.

dbt Labs

How to prioritize high-value work

21. Run concurrent CI checks

With dbt, CI runs can execute concurrently. This means you can reduce overall compute time with dbt’s ability to intelligently understand dependencies and run select jobs in parallel. This also saves the team time, as teammates never have to wait to get a CI check review.

dbt Labs

22. Use dashboard tiles to verify data quality

The work of a data team doesn’t exist in a silo. It’s often leveraged by downstream consumers in tools like BI platforms to make sense of data and make key decisions.

Between the time when the data is modeled and when it’s queried in another platform, though, questions can arise about the freshness and quality of that data. And if decisions are made based on outdated data, you can bet there will be more work down the road to correct those mistakes.

To avoid such inefficiencies, make sure your data consumers have “sanity checks” they can reference in their downstream platforms. With dbt, for instance, you can embed dashboard status tiles in downstream platforms that show whether data in that dashboard has passed quality and freshness criteria. If any data freshness check fails, the dashboard status tile will alert a consumer that the data may be stale.

dbt Labs

23. Standardize on methods that scale

Trendy tools come and go and data development methods are constantly advancing. As the industry changes and more roles get involved in data modeling or analysis, it’s important to make sure your team is centered around practices that scale.

SQL and Python, for instance, are the two most popular coding languages for data analysis. They aren’t going away. But some data development tools use GUI (graphical user interface) methods for data modeling, which require users to learn an analysis method specific to that tool. Over time, these niche tools and methods become outdated, which can lead to vendor lock-in on archaic tools or the proliferation of tribal knowledge (only a few people on the team know how to work the tool) which slows down the rest of the organization.

Set your team up to move fast now and in the future by selecting tools built for scale, adaptability, and flexibility. Additionally, train users (data producers and consumers alike) on standard and widely used languages like SQL and Python. If you use a data platform that the whole team can standardize on, you’ll also save countless hours of confusion about which data logic or metric is correct. With a unified platform, everyone has insight into which data models are accurate, and questions about the data are easy to debug.

dbt Labs

24. Strategically allocate workflows

Focus on defining 2-3 key platform objectives (like cost, quality, consumer trust, security, development velocity, or self-service capabilities) and align design choices accordingly. For example, prioritizing optimizations for a cloud data platform and streamlined pipeline design led a client of BDO Digital to see a dramatic reduction in execution times.

This approach ensures effective trade-offs are consistently made, as keeping these objectives top of mind helps the team focus on what matters, rather than trying to tackle everything at once. It’s an especially helpful strategy for large or legacy organizations that may have more data complexity to deal with.

BDO Digital

25. Reference and re-use code

There’s no sense in rewriting code from scratch when you could just reference or re-use existing code. For data transformation, dbt Labs provides a hub of community packages that developers can use. This way, teams spend more time focusing on their unique business logic, and less time implementing code that someone else has already spent time perfecting.

In addition, dbt Catalog makes it easy to reference existing business logic in dbt. This means that if someone on your team has already built and validated data assets, the rest of the team can easily discover and reference those assets, instead of having to rebuild them.

dbt Labs

26. Automate documentation

Documentation is a necessary part of data development—but the manual effort required to create and maintain documentation can be burdensome.

Luckily, there are ways to automate code documentation. For example, dbt will automatically generate documentation for data models, tests, and transformations based on comments and descriptions in your SQL code. The documentation is stored centrally within dbt, making it easily accessible to all team members. This central repository also ensures that the documentation is version-controlled and aligned with the latest code changes, fostering accuracy and reliability.

This automation not only saves time for data teams but also promotes better collaboration and understanding of the data transformations within the organization—which ensures teams can move faster in the future.

dbt Labs

27. Use AI to take the load off manual work

The capabilities for AI in data work are already strong and growing by the day. The opportunities to quickly diagnose errors, write better code, developer software faster, and analyze patterns seem endless. Of course, it’s necessary to include human checks for accuracy and make sure AI isn’t exposed to company secrets. In addition, training and deploying AI models can incur significant computational costs, especially when working with large datasets or complex models.

At the same time, there are ways to leverage AI to improve data team productivity at scale. For example, the data team at Sharp Healthcare used AI to accelerate a transition process that would have been onerous and time-consuming to do by hand. They needed to convert 15,000 SQL server views into dbt models, but each conversion came with some CTEs and Snowflake syntax that needed to get taken care of. So, they looked into how AI could help.

“We took about three weeks massaging prompts and everything else and standing up a pipeline where we can go through all those views, and then generate the corresponding dbt.sql file,” said Clay Townsend, director of analytics at Sharp Healthcare. “It was almost like having a data engineer intern there as part of the pipeline. It's been great.”

Sharp Healthcare

28. Schedule runs and tests

It takes time for data to run. But that time doesn’t have to overlap with typical work hours. Schedule model runs and tests so that data transformations can be processed automatically and at specified intervals.

Automation saves time by reducing the need for manual intervention in running data transformations. It also allows teams to schedule jobs to run during non-business hours, which optimizes resource usage and ensures that data is ready for analysis when it’s needed at the start of the next work day.

dbt Labs

29. Up-skill data consumers (with guardrails in place)

The most expensive resource a data team has? Engineering time.

In an ideal organization, engineers spend their time focused on deeper problems that only they can solve. But in reality, engineers at many organizations often get swamped in manual tasks, infrastructure maintenance, responding to endless ticket requests, and debugging issues that could have been avoided in the first place.

Many of the strategies included in this list will help protect engineering time by establishing quality data foundations early on. However, it’s important to think about the self-service aspect of data analysis too.

Data doesn’t just sit with the engineering team—it’s used by teams throughout the organization to make key business decisions. Training and enabling data consumers to analyze—and in some cases even model—that data themselves will ensure other teams can get their answers sooner without requiring engineering support. With more people fully ramped on how to use data tools, the organization can operate more efficiently.

Of course, it goes without saying that strong guardrails are necessary for any initiative like this. But with proper governance measures in place, more people can understand and leverage data—saving time for every team and allowing engineers to focus on what matters most.

dbt Labs

VS Code Extension

The free dbt VS Code extension is the best way to develop locally in dbt.

Install free extension

Ready to continue your journey with dbt?

Click below and discuss your use case with a product expert.

Book a demo

Accelerate speed to insight

Democratize data responsibly

Build trust in data across business