Building an AI-ready data platform that supports generative AI

September 24, 2025
Every company wants to get started with implementing generative AI (GenAI) use cases. You probably have multiple ideas and initiatives ready to roll, ranging from customer support to marketing to easier employee onboarding.
The challenge, as always, is the data.
AI use cases require massive amounts of data. That’s true whether you’re building your own Large Language Model (LLM), fine-tuning an existing one, or adding context to calls against a commercial LLM.
Unfortunately, many companies are finding that the legacy Extract, Transform, and Load (ETL) systems they've limped along with for years can't scale to the speed and volume that modern AI apps demand. Meanwhile, companies that have gotten by with homegrown data pipelines are realizing they aren't getting the data quality, consistency, and governance they need to make their AI projects successful.
In this article, we’ll look at the new challenges that AI presents for data. We’ll talk about why existing systems tend to miss the mark, and how to evolve your current data architecture into an AI-ready data platform.
How GenAI use cases differ from analytics use cases
First, some good news. At the most basic level, data for analytics use cases and data for AI use cases share a lot in common. Both require access to high-quality data that’s secure and well-governed.
In other words, preparing data for AI doesn’t mean starting from scratch. Everything you know about data quality applies to AI and analytics data equally.
There’s a key difference, however, in how the data’s used:
- Analytics applications are deterministic. Their results are numeric calculations derived directly from the underlying data using well-defined formulas, so every output can be traced back to the inputs that produced it.
- AI applications, such as LLMs, are probabilistic. They're built on neural networks with billions or even trillions of parameters, trained on vast amounts of data. Questions are interpreted and answered through a transformer architecture, and the outputs aren't directly traceable to the inputs.
If you ask an analytics application a question, you’ll always get the same answer. If you ask an AI application a question, you may get a different answer each time.
This has multiple implications for how we handle data. It means that GenAI apps need:
- Rich context to return accurate, up-to-date results. That requires bringing in high-quality data from multiple sources across the enterprise.
- Access to a wide range of data in various formats. This includes not just the structured data used by analytics but semi-structured and unstructured data (i.e., data that was difficult to mine for value prior to the advent of GenAI).
- Rigorous testing (of both inputs and outputs) prior to production use. While this is true for analytics, the importance of data testing increases with AI, as the connection between inputs and outputs is more opaque.
- Additional security and governance precautions to avoid security issues unique to LLMs.
The challenges involved in enabling AI use cases
The problem is getting the right data at the right level of quality, subject to the right governance. Existing ETL and home-grown solutions tend to underperform here for four key reasons:
- A lot of data remains siloed
- Legacy ETL systems can’t scale to meet demand
- Data quality is inconsistent
- Semantic definitions vary from team to team
A lot of data remains siloed
Many ETL and home-grown data pipelines rely on data centralization to make production datasets discoverable across the company. This almost always requires that a centralized data team take on new data pipeline projects as they have capacity.
This common chokepoint means a lot of key data remains locked away in data silos. Much of this data is exactly the semi-structured and unstructured data from which GenAI applications could most benefit.
Siloed data is nearly impossible to find. It’s also hard to work with, as it’s often of low quality, out of date, or out of compliance with corporate governance standards.
Data silos are a significant drag on businesses. A McKinsey study found that companies collectively are losing around $3.1T every year due to the lost opportunities locked away in siloed data.
Breaking down these data silos isn’t just a technical problem. It’s a business culture problem. Unlocking this data requires that it be discovered, transformed, documented, and made available for use and collaboration.
This is a team effort. It requires a set of data tools that are accessible not just to technically savvy data producers, but to data stakeholders with varying levels of technical acumen.
Legacy ETL systems can’t scale to meet demand
Many companies are still dependent on legacy ETL systems. These systems come to us from the pre-cloud world and were built primarily to work in resource-constrained environments.
Many legacy ETL tools are trying to adapt to modern times. For example, a number of tools now support Extract, Load, Transform (ELT) pipelines as well. Whereas ETL pipelines would transform the data before loading it into a data warehouse, ELT pipelines load the raw data and transform it later. This enables better data traceability, prevents data loss, and allows different teams to transform the same base data for different use cases.
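To make the distinction concrete, here's a minimal ELT-style sketch (table and column names are hypothetical): the raw records land in the warehouse untouched, and the transformation runs afterward in SQL.

```sql
-- Raw events are loaded as-is into a landing schema (here, raw.app_events),
-- so the original records stay available for traceability and reprocessing.
-- The transformation then runs inside the warehouse, after the load:
create or replace view analytics.daily_signups as
select
    cast(event_timestamp as date) as signup_date,
    count(distinct user_id)       as signups
from raw.app_events
where event_name = 'signup'
group by 1;
```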
However, even when ETL systems have been retrofitted to support ELT workflows, they weren't originally designed to handle the volume, speed, and variety of data that companies manage today. That leaves them struggling to keep up with the massive demand for data created by today's AI solutions.
Data quality is inconsistent
Your data doesn’t live in one place. It’s scattered across dozens or hundreds of file systems, object stores, relational databases, data warehouses, and data lakes/lakehouses.
This architectural heterogeneity means there often isn’t “one true way” to create data transformation pipelines in a company:
- Some teams will use legacy ETL systems or dedicated data orchestration solutions like Apache Airflow or Prefect.
- Others will program transformations directly into their Snowflake data warehouses.
- Still others may have an ad hoc collection of Python scripts running as scheduled cron jobs scattered across their cloud infrastructure.
This variability means there isn’t a single place to manage or monitor data quality. As a result, data quality differs wildly from org to org, and even team to team. This makes it hard to determine when data is reliable and “AI-ready.”
Semantic definitions vary from team to team
Another downside of this scattershot approach to data transformation is that teams may have totally different definitions of common business concepts. For example, one team might use a different formula to calculate adjusted revenue compared to another.
This lack of semantic consistency is a roadblock to data sharing and collaboration. The Finance and Sales teams can’t coordinate their work easily if they’re both using different definitions of basic concepts.
AI systems require consistency. Faced with two different definitions of revenue, the system will be at a loss as to which one to use. In the worst case, it may return different answers to different users.
To overcome this issue, AI systems need a semantic layer: a single, centralized framework that defines key metrics and business logic. Without one, conflicting definitions will undermine your AI initiatives.
dbt as your AI-ready data control plane
The solution to these challenges isn’t to move everything into a centralized data store. That’s often an impossible and ill-advised effort.
A better solution is to have a single data control plane for all your AI data. A data control plane is an architectural layer that sits over all of your end-to-end data activities. It enables integration, access, governance, and protection so you can manage the behavior of people and processes in a distributed and dynamic data environment.
dbt is a data control plane that provides a standardized and cost-efficient way to build, test, deploy, discover, and monitor data for both analytics and AI. It’s flexible, vendor-independent, and collaborative, making it a perfect solution for your GenAI data needs.
Leverage AI to transform data in a uniform manner
Using dbt, you can connect to a wide ecosystem of data warehouses and other tools across your data stack. Teams represent data transformations as dbt data models, a combination of YAML and SQL or Python. Engineers and analysts alike can use dbt Copilot, our AI-powered assistant, to accelerate analytics and AI workflows.
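For illustration, a dbt model is just a SQL file that selects from upstream models via ref(); here's a minimal sketch with hypothetical model and column names.

```sql
-- models/marts/customer_revenue.sql (hypothetical model)
-- dbt resolves ref() to the right objects in your warehouse and uses it to
-- build the dependency graph between models automatically.
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
)

select
    customers.customer_id,
    customers.customer_name,
    count(orders.order_id) as order_count,
    sum(orders.amount)     as total_revenue
from customers
left join orders
    on orders.customer_id = customers.customer_id
group by 1, 2
```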
Teams can check all their data transformation code into source control. That makes it easy for data producers to track changes, discover existing code, and collaborate on data workflows.
Test and monitor for data quality
Data testing is often one-off and ad hoc. dbt supports creating data tests alongside your data models so that your teams can verify the quality of their data transformation code before deploying it to production. Data health signals enable data teams and governance specialists to track data quality across all teams using dbt.
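As a sketch of what that looks like, generic tests are declared in the YAML file that sits alongside a model (the model and column names below are hypothetical):

```yaml
# models/marts/customer_revenue.yml (hypothetical)
version: 2

models:
  - name: customer_revenue
    columns:
      - name: customer_id
        tests:
          - unique      # every customer appears exactly once
          - not_null    # no rows without a customer
      - name: total_revenue
        tests:
          - not_null
```

Running dbt test executes these checks, so failures can block a deployment in CI before bad data reaches production or an AI app.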
Document data for users (and AI)
Most data is lightly documented, if it's documented at all. With dbt, you can create rich documentation for every single data model, so that data consumers know where data comes from and what it means. You can also supply these docs to AI, e.g., via the dbt MCP (Model Context Protocol) Server, giving GenAI invaluable context for understanding your data.
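Descriptions live in the same YAML files as the models and tests, so documentation is versioned right alongside the code. Extending the hypothetical schema file from above:

```yaml
# models/marts/customer_revenue.yml (hypothetical)
version: 2

models:
  - name: customer_revenue
    description: >
      One row per customer, with lifetime order count and total revenue.
      Intended as the source of truth for customer-level revenue reporting.
    columns:
      - name: customer_id
        description: Unique identifier for the customer, sourced from the CRM.
      - name: total_revenue
        description: Sum of completed order amounts, in USD.
```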
Define global metrics for AI-powered apps
dbt also supports AI data quality through the dbt Semantic Layer, which you can leverage to represent key business metrics in one central location. You can supply these metrics directly to your LLMs and other AI solutions, eliminating confusion between conflicting metrics definitions.
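As a rough sketch (hypothetical names, and the exact syntax may vary by dbt version), a metric like adjusted revenue can be defined once and reused everywhere:

```yaml
# Semantic Layer definition (hypothetical names)
semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: order_date
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: order_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: adjusted_revenue
        description: Order amount net of refunds and discounts.
        agg: sum
        expr: amount - refund_amount - discount_amount

metrics:
  - name: adjusted_revenue
    label: Adjusted revenue
    type: simple
    type_params:
      measure: adjusted_revenue
```

Finance, Sales, and any LLM querying through the Semantic Layer now resolve "adjusted revenue" to the same calculation.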
Implement governance to protect sensitive data
Using both dbt models and the dbt Semantic Layer, you can ensure proper governance of all dbt-managed data:
- Govern workflows by standardizing on a single platform and a single source of truth for metrics
- Manage access to data models via role-based access control (RBAC) and model-level access configurations (see the sketch after this list)
- Leverage auto-generated docs, version control, and visual data lineage to reduce errors and make auditing easy
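For example, ownership groups, model-level access, and warehouse grants can all be declared in YAML; the names below are hypothetical and shown as a sketch rather than a prescribed setup.

```yaml
# Governance configuration sketch (hypothetical names)
groups:
  - name: finance
    owner:
      name: Finance Data Team
      email: finance-data@example.com

models:
  - name: customer_revenue
    group: finance
    access: private        # only models in the finance group can ref() this model
    config:
      grants:
        select: ["reporting_role"]   # warehouse-level read access applied on build
```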
Create your AI-ready data platform today
Building an AI-ready data platform isn't an overnight task. The good news is that it doesn't require tearing down everything you've built and starting from scratch, either.
dbt connects seamlessly to the existing elements of your data stack. That enables onboarding the critical datasets you need for AI workloads today while leaving your existing data infrastructure in place. Over time, you can move more data pipelines over to leverage the improvements that dbt brings in data quality, scalability, observability, and data discovery.
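Onboarding an existing dataset can be as lightweight as declaring it as a dbt source, which makes it referenceable, testable, and documented without moving it. A minimal sketch with hypothetical names:

```yaml
# models/staging/crm/_crm_sources.yml (hypothetical)
version: 2

sources:
  - name: crm
    database: raw          # the existing landing database stays where it is
    schema: salesforce
    tables:
      - name: accounts
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 24, period: hour}
```

Downstream models can then select from it with {{ source('crm', 'accounts') }}, and dbt source freshness can flag stale data before it feeds an AI workload.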
To learn more about how to leverage dbt as the data control plane for your AI journey, book a demo with a dbt expert today.