How can you use AI for data transformation?
Aug 28, 2024
Generative AI (GenAI) is saving employees hundreds of hours per month by automating many mundane, and even some complex, tasks. It's poised to do the same for data engineering workflows.
Data transformations are the lifeblood of an organization, turning raw data into high-quality information that drives business decision-making. Creating high-quality data transformation code, though, requires hours spent not just writing code but also testing, documenting, optimizing, and deploying it.
Used as part of a larger analytics workflow process, GenAI can enable your teams to produce higher-quality data transformations in less time than ever. We’ll look at how AI fits into the larger data analytics lifecycle—and how dbt Copilot enables you to integrate context-aware AI seamlessly into your daily analytics workflows.
What makes data transformation challenging?
Raw data is never in a format ready to use out of the box. The data business users need is always spread across multiple systems and riddled with issues: malformed fields, missing values, redundant entries, and inconsistencies.
Data transformation takes data extracted from multiple sources and loaded into a central destination, and turns it into a format suitable for querying by data consumers. This ELT (Extract/Load/Transform) process identifies and corrects issues in raw data (unclear column names, incorrect data types, broken table relationships, overly granular records), and it lets different teams transform the same raw data in multiple ways to support their specific use cases. The full process, the data pipeline, runs on a fixed schedule or on demand, continuously extracting and transforming new data as it becomes available.
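In dbt, each transformation step is expressed as a SQL model. As a rough sketch (the source, model, and column names here are hypothetical), a staging model that cleans up a raw orders table might look like this:

```sql
-- models/staging/stg_orders.sql (hypothetical source and column names)
-- Rename cryptic columns, fix data types, and drop duplicate rows
with source as (
    select * from {{ source('shop', 'raw_orders') }}
)

select distinct
    id                           as order_id,
    cust_id                      as customer_id,
    cast(amt as numeric(12, 2))  as order_total,
    cast(dt as date)             as ordered_at
from source
where id is not null
```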
Every company that depends on data needs solid, high-quality data transformation processes. However, getting there can be a challenge:
- Data transformations take time for engineers to write and debug
- Ensuring they're accurate requires extensive test suites, which also take time to write and debug
- Even when data transformations work from a business standpoint (i.e., they provide correct output), they may still be slow and inefficient, requiring fine-tuning
- Writing great data transformation code requires proficiency in SQL and possibly Python, limiting the number of people who can produce transformation code to the most technically savvy
- Good data transformation code also needs to be documented so that data consumers know what data means, where it comes from, and how it’s calculated. Data sets without this information are apt to go unused as they often fail to win data consumers’ trust
How AI can help—and how to introduce it
The good news is that AI can eliminate a lot of the grunt work involved in engineering a data transformation pipeline. Large Language Models (LLMs) such as GPT and Claude, deep-learning models pre-trained on vast amounts of data, have proven adept at generating base code that experienced engineers can refine, test, and ship in less time than it would take to write everything from scratch.
LLM-driven AI copilots have already proven successful at boosting developer productivity. When Accenture integrated GitHub Copilot into their daily workflows, for example, it led to a 90% increase in developer satisfaction, with 67% of participants saying they use it five days a week.
A context-aware copilot can bring the same benefit to data engineering. That’s why we’ve developed our own solution, dbt Copilot, that integrates into every step of your analytics workflows.
However, AI copilots aren't a silver bullet. They work best as part of a mature analytics workflow process. We call this process the Analytics Development Lifecycle (ADLC): a process for producing and collaborating on data at scale, shipping new analytics code with both high quality and high velocity.
The ADLC, which is modeled after the Software Development Lifecycle (SDLC), uses several processes and checkpoints to ensure the quality of all data transformation code shipped to production:
- On the business side, it ensures requirements are well understood and align with business objectives and any relevant KPIs. It also scopes all analytics code changes to the smallest possible unit, with a focus on shipping impactful, well-tested changes with high velocity.
- On the technical side, it ensures all data changes are represented as code, checked into source control and traceable, reviewed before release, thoroughly tested before deployment, and monitored continuously in production.
Introducing AI copilots outside of this overarching process won't improve the quality of your data transformation code. In fact, it could make things worse. GenAI is far from infallible; any generated code must go through a full review and testing cycle before being released to production.
dbt Copilot is integrated into dbt Cloud, which inherently supports the ADLC with features such as version control, testing, automated data transformation deployment pipelines, automatically generated documentation and data lineage, and data discovery and management. It creates guardrails for AI use, boosting productivity without sacrificing the quality, security, and performance of your analytics code.
This close coupling of Copilot with your code also enables better code generation. Copilot analyzes the other models, relationships, metadata, and lineage in your project to tailor its output to your teams. The result is a refined, governed dataset for generating high-quality data for analytics and AI systems.
AI’s role in the data transformation process
With these guardrails in place, everyone on your team (data engineers, analysts, and business decision-makers) can generate data transformations and query the resulting data faster using natural language. An AI data transformation tool like dbt Copilot can:
- Generate and optimize data transformation code
- Generate data tests
- Generate documentation
- Generate semantic models for metrics
Let’s look at each area in detail.
Generate and optimize data transformation code
Rather than writing SQL code from scratch, data producers can create inline SQL from a natural language description. This can shorten the time required to write code while reducing common inaccuracies. It also ensures that generated code follows your organization's naming conventions and best practices.
Users of all experience levels can leverage Copilot to improve their data transformation code. Users who want to contribute to data transformation pipelines but whose SQL skills might be a little rusty can use AI to give them a jump start. Meanwhile, experienced engineers can rely on AI to optimize existing SQL code or create complex transformations quickly. This enables a wider range of users to contribute to data pipelines, which promotes data democratization.
dbt Copilot can generate SQL code directly inside the dbt Cloud visual IDE. You can leverage it to write advanced transformations, apply bulk edits to a project, and produce complex regex patterns.
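For illustration only (this isn't Copilot's literal output, and the model and column names are assumptions carried over from the earlier sketch), a prompt like "monthly revenue per customer from the orders model" might produce SQL along these lines:

```sql
-- Hypothetical result of the prompt:
-- "monthly revenue per customer from the orders model"
select
    customer_id,
    date_trunc('month', ordered_at) as order_month,
    sum(order_total)                as monthly_revenue
from {{ ref('stg_orders') }}
group by 1, 2
```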

Additionally, with dbt Copilot, you can enforce a custom style guide to guarantee SQL best practices and code quality. That can reduce code review time and ensure consistency across projects.
Generate data tests
We’ve long believed in the necessity of analytics code testing. That’s why dbt supports writing tests alongside your data transformation code. You can run these tests not only during development but at several points in your analytics workflow—e.g., when checking in a pull request and before shipping changes to production. This ensures that analytics code is safe and accurate before it finds its way to data consumers.
An AI solution like dbt Copilot can automatically generate a suite of tests for your data transformation code based on your dbt models and schema relationships. It adds these tests directly to your dbt project, ready to run during each build.
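In a dbt project, generated tests typically land in a model's YAML file. As a hedged sketch (the model, columns, and related table continue the hypothetical example above), they might look like this:

```yaml
# models/staging/stg_orders.yml -- illustrative tests only
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique      # primary key should have no duplicates
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:   # every order must reference a known customer
              to: ref('stg_customers')
              field: customer_id
```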
Generate documentation
Data transformation code isn't of much use if data consumers don’t know how to use the resulting data. Documentation gives data producers and data consumers shared context for data, enabling stakeholders to find and understand it more easily. It also builds trust in data by showing where data derives from and explaining how critical fields are calculated.
Creating documentation—particularly for legacy projects containing hundreds of large models—requires a concerted effort. An AI solution like dbt Copilot cuts this effort significantly, using SQL logic, past queries, and metadata to generate docs automatically, providing explanations for otherwise cryptic names. Team members can then build on this automatically generated documentation over time as they work with and become more familiar with each model.
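In dbt, that documentation lives as description fields in the same YAML files that hold your tests. A minimal sketch, with illustrative descriptions for the hypothetical model above:

```yaml
# models/staging/stg_orders.yml -- illustrative generated descriptions
version: 2
models:
  - name: stg_orders
    description: One row per order, cleaned and deduplicated from the raw orders feed.
    columns:
      - name: order_total
        description: Order amount in USD, cast to numeric(12, 2).
      - name: ordered_at
        description: Calendar date the order was placed.
```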
Generate semantic models for metrics
While models provide business users with data, there's still a risk that BI tool users derive key metrics from that data in different ways. For example, two teams might calculate revenue using markedly different formulas with different assumptions.
A semantic layer represents key metrics centrally using domain-specific language. By moving metrics out of the BI layer and into the data modeling layer, semantic layers eliminate inconsistencies between teams and ensure everyone in the business speaks the same language.
Using the dbt Semantic Layer, part of dbt Cloud, you can model metrics using syntax similar to what you use to create data transformations. dbt Copilot can not only generate this scaffolding for you; it can even recommend key metrics based on your data models.
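As a rough sketch of that scaffolding (the names and measures are illustrative, not a recommendation Copilot would necessarily make), a semantic model and a revenue metric built on the hypothetical orders model might look like this:

```yaml
# models/marts/orders_semantic.yml -- illustrative semantic model and metric
semantic_models:
  - name: orders
    model: ref('stg_orders')
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum

metrics:
  - name: revenue
    label: Revenue
    description: Sum of order totals, so every team computes revenue the same way.
    type: simple
    type_params:
      measure: order_total
```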
Accelerate your data transformations with AI today
Used wisely, AI-enhanced data transformation workflows can reduce the time spent writing, documenting, and even reviewing code. That frees up your teams to take on bigger, more ambitious data projects, accelerating the delivery of the high-quality data your business needs for both analytics and AI applications.
With dbt Cloud as your data control plane for all of your analytics workflows, your organization can develop, ship, and manage high-quality data workflows faster than ever. Ask us for a demo to see first-hand how dbt Cloud and dbt Copilot work seamlessly together to redefine how you approach data engineering.