Implementing common data transformation techniques with dbt

Data-driven use cases require high volumes of high-quality data. Unfortunately, high-quality data isn't something we get for free. Data transformation is critical for cleaning, verifying, and changing data into a format that stakeholders can readily use to drive key business decisions.

There are a few common data transformation techniques you'll use in almost every data transformation pipeline. Implementing these in a given language, such as SQL or Python, is easy enough. Managing data transformation at scale—designing, committing, reviewing, testing, documenting, monitoring, and reusing your transformation code across all the data stores in your enterprise—is another matter.

A data control plane provides data teams with a standardized and cost-efficient way to build, test, deploy, and discover analytics code, regardless of where data is stored. In this article, we'll look at why you should use dbt Cloud as a data control plane for managing common data transformations.

Types of data transformation

Transforming data usually means applying one of a fairly fixed set of operations to make data more usable. Sometimes, this means addressing errors or inconsistencies in your data. Other times, it means reshaping data into a format that better fits your use cases.

Data transformation pipelines save time and money by bringing consistency to data. Without data transformation pipelines, everyone—analysts, data engineers, and business users—would be slicing and dicing data their own way, wasting time and introducing data inconsistencies.

Most of the time, you'll be applying one of the following transformations to your data (a brief SQL sketch of a few of these follows the list):

  • Cleaning. Removing errors and inconsistencies from your data—missing fields, inaccurate entries, duplicated data, etc.
  • Aggregation. Rolling up critical values for faster access—for example, sales data for a given customer or time period.
  • Generalization. Rolling low-level data up into higher-level concepts in a hierarchy, such as generalizing an exact address to a city or region.
  • Discretization. Transforming continuous data, such as ages, into a set of ranges (e.g., ages 18-29) to make it easier to drive initiatives such as targeted marketing.
  • Normalization. Enforcing standards for the format of certain fields and rationalizing data types and identifiers. Example: converting currency data into a single standard currency, such as USD.
  • Validation. Ensuring that data is in the correct format. One example is verifying that phone numbers have the correct number of digits, a valid country code, and so on.
  • Enrichment. Also called attribute construction, enrichment adds additional data to enable enhanced decision-making — e.g., adding weather data to scheduled shipment information to warn customers about potential delays.
  • Integration. Bringing in data from multiple sources to create a single, consistent data set that doesn’t require complex joins or high-latency connections across different databases.
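
As a concrete illustration, here's a minimal sketch of what a few of these techniques can look like inside a single dbt SQL model. The source table and column names (a raw.customers source with customer_id, email, and age columns) are hypothetical, and the exact functions available vary by warehouse:

    -- models/customers_cleaned.sql (hypothetical model)
    with cleaned as (

        -- Cleaning: drop rows missing a key and collapse exact duplicates
        select distinct
            customer_id,
            trim(lower(email)) as email,  -- Normalization: standardize casing and whitespace
            age
        from {{ source('raw', 'customers') }}
        where customer_id is not null

    )

    select
        customer_id,
        email,

        -- Discretization: bucket continuous ages into ranges for targeted marketing
        case
            when age between 18 and 29 then '18-29'
            when age between 30 and 44 then '30-44'
            when age >= 45 then '45+'
            else 'unknown'
        end as age_band

    from cleaned

A statement like this lives in its own model file in a dbt project, where it can be versioned, tested, and documented like any other code.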

dbt Cloud: A data control plane for data transformations

Data engineers can perform data transformations and create data pipelines in a number of ways, including ad hoc SQL transformations or one-off Python scripts. Using dbt Cloud as your data control plane brings a number of benefits:

  • Turns analytics into code. Instead of transforming data with one-off scripts, dbt Cloud lets you express every change to data as SQL or Python code and place it under source control, so it can be tracked, reviewed, deployed, discovered, reused, and rolled back as needed.
  • Provides a consistent, vendor-agnostic framework. dbt Cloud provides a single approach to creating data transformations for all major data warehouses and data stores. Because it supports both SQL and Python, it makes it easier for anyone with basic knowledge of either language to contribute to creating and maintaining data pipelines. This democratizes data and prevents your data engineering team from becoming overwhelmed with requests.
  • Supports testing data changes. A major problem with an ad hoc approach to data transformation is that incorrect code can inject errors that lead to broken reports or poor decision-making. With dbt Cloud, you can write tests alongside your data transformations that run automatically at various points to ensure the code is correct before it touches production data (see the sketch after this list).
  • Offers built-in documentation support. dbt Cloud also supports documentation that you can generate and deploy automatically with every release. It also automatically generates data lineage maps, so that prospective users of data and data engineers investigating data quality issues can see how data flows from source to destination across your data estate. This makes data sets easier to use and increases your stakeholders’ confidence in the quality of data, which in turn increases adoption.
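
As one sketch of what testing alongside transformations looks like, a singular dbt test is simply a SQL query that returns the rows violating an assumption; if the query returns any rows, the test fails. The orders model and its columns below are hypothetical:

    -- tests/assert_no_negative_order_amounts.sql (hypothetical singular test)
    -- dbt runs this query as part of dbt test; any rows returned cause the test to fail
    select
        order_id,
        amount
    from {{ ref('orders') }}
    where amount < 0

dbt also ships generic tests such as not_null and unique that you attach to columns in a model's configuration, so many routine checks require no custom SQL at all.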

Avoiding common pitfalls in data transformation with dbt Cloud

Writing data transformation code is only half the battle. Your data engineers also need a way to safely deploy changes into production without potentially wreaking havoc on existing users. Meanwhile, your analysts, business users, and application engineers need a way to discover the great work done by your data engineers so they can use it in their BI tools and data-driven apps.

Besides bringing greater rigor to data transformations overall, dbt Cloud provides features that address the most common issues involved in creating and shipping high-quality datasets:

  • Corrupted values from data transformation bugs
  • Haphazard deployment processes
  • Inability to roll back
  • Lack of data discoverability

Let’s look at each of these issues in detail and how dbt Cloud addresses them.

Corrupted values from data transformation bugs

Bugs in your code can create havoc if they make it all the way to production. That's why dbt Cloud provides multiple mechanisms for verifying that changes are defect-free before putting them in front of customers.

Since all changes in dbt Cloud are represented as code, data engineers who have changes to ship start by checking them into a Git-based version control system. A team using Git maintains its production code in a single branch of a Git repo. Engineers who want to make changes work in a separate branch or fork of this repo. This keeps changes in progress isolated from tested and approved production code.

Putting all code under source control also reduces defects by encouraging reuse. dbt Cloud supports creating reusable projects that contain common data transformations, eliminating the need for every team to “reinvent the wheel.” Teams can use code that’s already well-tested and proven instead of re-implementing shared logic from scratch.
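
For example, a shared dbt package might expose a macro like the hypothetical one below, so every team applies the same phone-number cleanup rather than re-writing it in each model (the macro and model names are made up, and regexp_replace behavior varies by warehouse):

    -- macros/normalize_phone.sql (hypothetical shared macro)
    {% macro normalize_phone(column_name) %}
        -- keep digits only so phone numbers compare consistently across sources
        regexp_replace({{ column_name }}, '[^0-9]', '')
    {% endmacro %}

    -- used in a model:
    -- select {{ normalize_phone('phone') }} as phone from {{ ref('stg_customers') }}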

When an engineer is ready to promote changes to production, they file a pull request (PR) to merge their changes into the main repo. This triggers two additional quality processes:

  • Another engineer reviews and approves the developer's code before it can be merged.
  • The PR triggers an automated run of any tests associated with the new code, running them against pre-production data (a sketch of one such test follows below).
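
These automated runs also pick up any generic tests defined for the changed models. A generic test is written once as a parameterized SQL block and then attached to columns in model configuration; the one below is a hypothetical sketch that flags phone numbers with an implausible number of digits:

    -- tests/generic/valid_phone_length.sql (hypothetical generic test)
    {% test valid_phone_length(model, column_name) %}

        -- fail on any value shorter than 7 or longer than 15 digits
        select {{ column_name }}
        from {{ model }}
        where length({{ column_name }}) not between 7 and 15

    {% endtest %}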

Thanks to dbt’s recent acquisition of SDF, engineers and analysts have another easily accessible tool for catching defects. Using SDF, analytics code developers can emulate today’s most popular data stores on their local development machines, allowing them to run tests before they even check any code into source control. By testing their transformations locally, engineers reduce the churn of fixing errors discovered later, during PR review and automated testing.

Haphazard deployment processes

For years, software engineering has had a rigorous process for deploying changes in the form of DevOps and the Software Development Life Cycle (SDLC). Rather than push large, risky changes live in a huge batch, DevOps encourages using a combination of process improvements and automation to deploy small, well-tested changes through a repeatable, multi-stage pipeline.

By contrast, for years, the approach for data deployment has largely been “someone runs a script in production.” This process has little repeatability and few safeguards in case something goes wrong (and something inevitably goes wrong). It also requires giving engineers overprivileged access to potentially sensitive customer data, which is a data security and compliance nightmare.

DataOps and the Analytics Development Life Cycle (ADLC) solve this by bringing the same rigor found in software engineering deployments to deploying analytics code. With dbt Cloud, you can implement a DataOps approach by creating Continuous Integration/Continuous Deployment (CI/CD) pipelines that test your changes in pre-production environments before pushing them live.

CI/CD pipelines reduce the time it takes to deploy changes by automating most of the steps that data engineers in the past would have had to perform manually. That enables more people to deploy analytics code to production more frequently and with higher quality.

Inability to roll back

Even with rigorous testing and reviews, some defects might still slip through the cracks. There's often no way to test code efficiently against every possible permutation it might encounter in production data.

This is where having source code control and an automated CI/CD pipeline comes in handy. Since all code is versioned, it's easy to roll back and re-deploy an older version of a data transformation pipeline. That way, even if an error does slip through, engineers can revert to a previously known good state while they analyze the new code to discover and fix the root cause.

Lack of data discoverability

It's one thing to publish a great new data model. However, that work can go to waste if no one knows how or where to find it.

One of the problems with an ad hoc approach to data pipelines is that, even if the data engineering team produces a clean new data set, stakeholders might never be aware it exists. Even if they can find it, they may not know how to use it or be convinced they can trust it.

Using dbt Cloud as your data control plane means data stakeholders can use dbt Explorer to find models and use them in any data-driven report or application. They can also see all associated documentation and data lineage, meaning they can verify and use the new dataset without help from engineers. This enables a new level of self-service analytics not possible with ad hoc pipelines.

Get started with better data transformation today

You can implement common data transformations using a number of languages. A data control plane powered by dbt brings a new level of consistency, quality, and reuse to your analytics code not possible with an ad hoc approach.

With dbt Cloud as your control plane for data, data teams have a standardized and cost-efficient way to build, test, and deploy analytics code. Meanwhile, data consumers can tap into actionable data with self-serve interfaces that are governed and scalable.

To try it for yourself, create a free account today.
