4 Experts on How to Optimize Costs in Your Data Work & Pipelines
Data workflows can be expensive. They don’t have to be inefficient. In our cost optimization webinar, we talked with data leaders from Paxos, Total Wine, Sharp Healthcare, and dbt Labs about how they found ways to reduce costs in the data workflow and get more out of their existing investments.
We’ve recapped some of the insights below, but be sure to watch the webinar recording to see the full conversation.
Experience and advice
“We ended up reducing our monthly Snowflake spend by a little over 50%” just by reconfiguring SQL query logic to be more efficient, shared Mike Moyer, data engineer at Paxos. His team noted that simple improvements, like changing a sum to be computed incrementally rather than fully recalculated, had a huge impact when applied across tens of thousands of rows every time the model runs.
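The kind of change the Paxos team describes maps naturally to dbt's incremental materialization, which aggregates only new rows on each run instead of re-scanning the full history. A minimal sketch (the table and column names here are illustrative, not Paxos' actual models):

```sql
-- models/daily_totals.sql
-- Incremental materialization: on each run after the first, only rows
-- newer than what is already built get scanned and summed.
{{ config(materialized='incremental', unique_key='order_date') }}

select
    order_date,
    sum(amount) as daily_total
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- Only runs on incremental builds; `{{ this }}` is the existing table
where order_date >= (select max(order_date) from {{ this }})
{% endif %}

group by order_date
```

The first run builds the table in full; subsequent runs touch only recent partitions, which is where the compute savings come from.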
The Paxos team’s advice: Start optimization early and look for big wins first. The faster you get started, the sooner these savings kick into gear. Even if an optimization is seemingly obvious or unsophisticated, taking the time to improve these basic efficiencies can produce significant savings.
When the Total Wine team transitioned to dbt Cloud, they rebuilt their data pipelines from the ground up.
During this rebuild, Total Wine rewrote code that recalculated the same models many times over. They also centralized their entire code base. This consolidation improved the efficiency of the code itself and the efficiency with which teams could work with it.
“We were able to save around 10 to 15 percent” just through centralization, said Pratik Vij, senior manager of data engineering at Total Wine.
The Total Wine team’s advice: When taking on an optimization task, divide the task into components: e.g., data ingestions, storage, transformation, and visualization. Then, set up dashboards for each component and optimize your workflows based on the biggest impact you can find. Finally, engage with various stakeholders to find the best solutions for the optimization at hand.
At Sharp Healthcare, the optimization effort wasn’t about saving money; their expenditures stayed flat. Instead, it was about getting more done at the same cost. With dbt Cloud, the team unlocked new possibilities for delivering value. The IDE and the new CI/CD process were especially impactful.
Sharp also found that dbt Cloud unlocked new agility and possibilities. “I can bring in business analysts now - they can get into dbt Cloud, create models, and deploy them to Snowflake,” said Clay Townsend, principal data architect at Sharp HealthCare. “And I’m aware of it. I can see the code. We have automated tests firing. You get all these passive benefits that we didn’t have before.”
The Sharp Healthcare team’s advice: Optimization isn’t just about making numbers go down. Evaluate the value your tools deliver, and look for ways to push that value further so that even if numbers stay flat, each dollar delivers as much value as it can. To that end, it’s key to maintain benchmarks of past builds so that you can track improvements.
The team here at dbt Labs focused on improving the efficiency of our data platform and computing expenditures, as well as spending time reworking our architecture.
“We intentionally spend time surfacing this work in an easy-to-consume way,” said Elize Papineau, senior data engineer at dbt Labs. “The recommendations helped codify for our team that, if you’re seeing performance hits, you know you can tune specific things in Snowflake. We’ve given our teams thresholds to use to understand when and how they can turn these knobs effectively.”
The team’s advice: There are many knobs you can turn on cost optimization. As you’re optimizing, take time to develop different ways of quantifying your progress. Quantification helps in demonstrating value. It also helps build intuition about which kind of optimization gives the best ROI. Then, you can use that intuition to focus your optimization on the highest-value problems.
The future of cost optimization, self-service, and AI
Things are changing rapidly in the data space. We asked our guests their thoughts on the current push for optimization, self-service data, and AI tools. Here’s where the consensus landed:
The current moment in cost optimization
Economic conditions are driving companies to focus on consolidation and internal efficiencies. Services like dbt Cloud that offer multiple benefits in one package are helping companies bring their siloed stacks into a centralized state. The growth of this cloud service market allows companies to purchase computing capability directly without installing hardware. With lower barriers, capital can be invested with more agility, responding to bottlenecks as soon as they appear.
At the same time, AI tools are lowering the barrier to entry for data development. Elize Papineau at dbt Labs noted, “Tools like ChatGPT are boosting people’s confidence, giving folks the push to try something new,” and enabling them to get their hands on the data stack. When other teams can be plugged into the data pipelines, central data teams spend more time on value-added tasks than on firefighting.
Engineering time is the most valuable resource data teams have. That makes self-service analysis a powerful part of cost optimization. Building self-service tools allows data teams to focus on priority engineering tasks rather than filling queries from business teams.
However, to deliver full value, self-service needs to be done right. Mike Moyer shared Paxos’ self-service hierarchy of priorities: “(1) security and avoiding misuse of data, (2) making it easy to get value out of self-served data, and (3) cost optimization of 1 & 2.”
Thinking in terms of this hierarchy, effective self-service means:
- Implementing robust governance to make sure that privacy and security are maintained, even as more users are given access to data;
- Making your self-service system easy to use, and training your organization to use the system; and
- Taking advantage of the engineering time saved to deliver more and more efficiencies.
AI in cost optimization
AI coding tools like GitHub Copilot are already being incorporated into development workflows, speeding up development cycles so teams can deliver value faster. Business teams can also use these tools to plug into self-service systems. They can even become a part of the development process themselves.
For example, the Sharp Healthcare team used Azure OpenAI endpoints as part of a platform transition, taking SQL Server views and converting them into dbt models. According to Clay Townsend of the Sharp team, “That’s quite an effort to do that by hand…when you convert them into dbt…you’ve got some CTEs, some Snowflake syntax you want to take care of.”
Suffice it to say, doing it by hand would be a tedious, non-standardized process. After refining prompts for a few weeks, however, the team built a pipeline that handled more than 15,000 records.
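The details of Sharp's pipeline aren't spelled out in the webinar, but the general shape of such a conversion pipeline can be sketched. Everything below is a hypothetical illustration: the prompt wording, function names, and the stubbed `call_llm` are assumptions, not Sharp Healthcare's actual code.

```python
def build_prompt(view_name: str, view_sql: str) -> str:
    """Assemble a conversion prompt for one SQL Server view definition."""
    return (
        "Convert this SQL Server view into a dbt model.\n"
        "Use CTEs, Snowflake syntax, and {{ ref() }} for source tables.\n"
        f"-- view: {view_name}\n{view_sql}"
    )

def call_llm(prompt: str) -> str:
    """Placeholder for the model endpoint (e.g. an Azure OpenAI deployment)."""
    raise NotImplementedError("wire up your LLM client here")

def convert_views(views: dict[str, str]) -> dict[str, str]:
    """Map each view name to a candidate dbt model for human review."""
    return {name: call_llm(build_prompt(name, sql)) for name, sql in views.items()}
```

The key design point is that the LLM output is treated as a first draft: each generated model still goes through code review and dbt's automated tests before deployment.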
Optimize costs with dbt Cloud
Here at dbt Labs, we’re building dbt Cloud to deliver the best experience for your data-developing team. That means we’re finding thoughtful ways to help your team manage costs that come from the data workflow. Here are just a few ways you can use dbt Cloud to optimize your spending:
Defer to Production
Defer to production allows developers to bring production assets into their development pipeline. This means that, for example, a new developer joining a project can start working with the models and products that make up that project without rebuilding them independently, saving on storage and compute costs.
Deferment also makes sure the assets you are developing around are completely up-to-date. This ensures that what you’re building matches what you deliver without frequent, expensive rebuilds.
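From the command line, deferral is driven by dbt's `--defer` and `--state` flags: dbt resolves `ref()`s to models you haven't built locally against the production manifest instead. A sketch (the `prod-artifacts/` path is a placeholder for wherever your production job's artifacts live):

```shell
# Build only the model you're working on; unchanged upstream refs
# resolve to the already-built production tables instead of rebuilding.
dbt run --select my_new_model --defer --state prod-artifacts/
```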
dbt Cloud Slim CI
With Slim CI in dbt Cloud, you can retest only the parts of your model that you changed rather than retesting the entire pipeline.
Slim CI makes sure that valuable computing time isn’t spent re-confirming the stability of unchanged models. Instead, it builds and tests only new or modified models and their downstream dependencies, maintaining the pipeline’s integrity while saving on computation costs.
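Under the hood, this relies on dbt's state-based selection. A typical CI invocation looks something like the following (again, `prod-artifacts/` is a placeholder path for the production run's artifacts):

```shell
# Build only models changed since the last production run (state:modified),
# plus everything downstream of them (+), deferring unchanged refs to prod.
dbt build --select state:modified+ --defer --state prod-artifacts/
```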
Rerun from point of failure
Rerun from point of failure is an error recovery protocol that allows developers to restart a pipeline computation from the model it failed on, keeping any prior, completed computations saved. Saving these computations allows debugging to proceed efficiently, only rerunning the code that must be fixed and avoiding expensive recomputations.
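In dbt Cloud this is a one-click rerun option on the job; the equivalent in dbt Core (1.6 and later) is the `dbt retry` command:

```shell
# Re-run only the nodes that failed or were skipped in the previous
# invocation, reusing the results of models that already succeeded.
dbt retry
```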
To learn more about cost optimization mindsets and tools, watch the full webinar for free.
Want to learn more about how you can optimize costs in your data teams with dbt Cloud? Book a demo today.
Last modified on: Feb 9, 2024