dbt Cloud on Microsoft Fabric: The recipe for high-quality AI data

Sep 20, 2024

Product

Earlier this year, our partners at Tableau and Salesforce found that 86% of data leaders in a survey agreed that AI outputs are only as good as their inputs. Unfortunately, building a petabyte-scale data architecture with high-quality data is a challenge that many companies are still struggling to crack.

One way to overcome this challenge is through the one-two punch of Microsoft Fabric and dbt Cloud. In this article, I’ll show how you can use both systems together to deliver consistently high-quality data with minimal architectural overhead.

Microsoft Fabric: Data architecture as a service

First, let’s look at the current landscape for ML, AI, and data. This graphic from FirstMark Capital shows, at a glance, the challenges that data engineers are up against.

[Figure: MAD Landscape. Source: The 2024 MAD Landscape]

It’s super complex. It’s overwhelming. This array of different technologies leaves engineers and other data users at a loss. Which of these hundreds of technologies should I use? Which ones integrate well with one another?

Microsoft Azure faces some of the same challenges. It offers a family of individual Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS) features to store, secure, and analyze data at petabyte scale.

Many of these are great services. However, they’re piecemeal. You have to stitch them together to create an enterprise-grade solution.

Enter Microsoft Fabric. Fabric combines new and existing capabilities into a single, unified SaaS platform that serves as a one-stop shop for data analytics.

Fabric’s components all work in concert to enable full, end-to-end data management:

  • Data Factory provides the capability to move and transform data, while Data Science enables your ML flows and modeling.
  • Real-Time Intelligence provides access to low-latency data for IoT and real-time visualization use cases.
  • Power BI enables visualizing and reporting on data, no matter where it’s stored.
  • Data Activator adds observability to your data stack. You can configure real-time alerts (emails, Teams messages, and so on) for events such as a lead or a buyer requiring follow-up.

Across all of these features, OneLake provides data storage for any data object type. Meanwhile, Microsoft Purview for data cataloging solves longstanding challenges that data engineers and analytics engineers face in finding and using data.

Additionally, Artificial Intelligence (AI), made available through Microsoft Copilot, cuts across all of these services, differentiating Fabric from competitors. For example, data engineers can leverage Copilot for everything from creating complex SQL joins to choosing data model types or even writing Spark code. Meanwhile, business analytics or Power BI developers can leverage AI to develop new insights or visualizations on data.

dbt Cloud: A data control plane for AI

Microsoft Fabric is a great architecture for delivering the next generation of AI-powered solutions. However, when talking with dbt customers, it becomes clear that many companies aren’t yet ready for AI at scale.

The problem isn’t the architecture but the data. Most people’s greatest challenge is determining how much they can trust their data. When they look at a dashboard, their first thoughts are: what am I looking at? And where did it come from?

Unfortunately, with the advent of AI, many companies have simply layered a Large Language Model (LLM) or other AI solution on top of that data. If they haven’t verified the underlying data, the answers they get from the LLM may be inaccurate, or outright hallucinations. This leap from faulty data straight to AI solutions undermines trust in AI projects, just as delivering a faulty analytics dashboard would.

High-quality data requires documenting what your data does and where it comes from. There are only two places where this documentation happens. One is in a data catalog or other centralized data repository. The other is during the transformation phase—which is where dbt lives.

dbt Cloud helps companies transform their data via declarative statements about how that data ought to be organized. It provides a data control plane that not only documents these transformations but makes them available through orchestration, observability, a data catalog, and a semantic layer.

With a data control plane, you do more than point AI at your data. You point it at the transformation logic so that it now knows how the data ought to be structured and how you think about that data.

These layers are all powered by the work that data teams perform in transforming and shaping the data. The transformations, in essence, represent metadata. That metadata powers both the analytics and the AI apps that organizations aim to build.
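To make the idea of transformations as metadata concrete, here is a minimal Python sketch. It assumes a handful of hypothetical dbt-style models (the model names, columns, and SQL are invented for illustration) that reference their inputs with dbt's ref() function. Reading those references back out of the SQL is enough to recover each model's dependencies and a valid build order. This is only an illustration of the concept, not how dbt itself is implemented.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dbt-style models: each one is a SELECT statement that
# declares its inputs with ref(). Names and columns are illustrative only.
MODELS = {
    "stg_restaurants": "select id, name, province, status from raw_restaurants",
    "stg_reviews": "select restaurant_id, rating from raw_reviews",
    "open_restaurants": (
        "select * from {{ ref('stg_restaurants') }} where status = 'open'"
    ),
    "province_ratings": (
        "select r.province, avg(v.rating) as avg_rating "
        "from {{ ref('open_restaurants') }} r "
        "join {{ ref('stg_reviews') }} v on v.restaurant_id = r.id "
        "group by r.province"
    ),
}

# Matches {{ ref('model_name') }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def upstream_of(sql: str) -> set[str]:
    """Return the names of the models that this SQL reads from via ref()."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Order the models so that every model comes after its inputs."""
    graph = {name: upstream_of(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in build_order(MODELS):
        print(name, "<-", sorted(upstream_of(MODELS[name])))
```

Because the dependencies come straight from the transformation code, the same information can feed documentation, lineage views, or an AI application without anyone maintaining it by hand.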

A cheesy example

Using dbt Cloud, you can bring the industry standard for data transformations to Microsoft Fabric. dbt Cloud makes it easier for people who work with data to build data transformations. It brings more people across the company into the data fold by lowering the technical bar.

Additionally, like Fabric, dbt Cloud works with a variety of data sources and destinations across the data industry. Customers use dbt Cloud to help them connect the dots between Fabric and the other touchpoints of their data ecosystem. That breaks down silos, making previously isolated data part of a single, cohesive data estate.

Let’s see how this works in practice with a cheesy (literally) example. Boss Big Slice, the boss of pizza chain Big Pizza, wants to know which province in Canada consumes the most pizza. So they call in their data experts, Mr. Ketchup and Miss Pepper.

Mr. Ketchup is the company’s seasoned data analyst who powers all of the company’s Power BI dashboards. The problem is, he’s overworked. He’s the company’s one-stop shop for analytics and he’s just drowning in CSV files. He’s burnt on the bottom and ready to deliver himself to a vacation.

In comes Miss Pepper, the organization’s new data engineer. She was hired to help the company build an enterprise-grade data infrastructure. Together with Mr. Ketchup, they’re ready to help take Big Pizza’s data operations to the next level.

However, before they can get started, Boss Big Slice discovers during a meeting that their dashboards contain data from restaurants that are closed. In other words, their data quality is in shambles. They need to fix it—and fast.

Miss Pepper steps in. She verifies the problem, but since there’s no documentation for this data beyond the column names, it takes her a while to work out a fix (filtering out the restaurants she knows to be closed). She knows there’s gotta be a better way.

How dbt Cloud works with Microsoft Fabric

Miss Pepper’s grand plan is a new data architecture powered by Microsoft Fabric and dbt Cloud.

In this new architecture, the team moves data from sources such as AWS, Snowflake, and Azure SQL via an ingestion process. The transformed data is stored centrally in OneLake.

At the heart of this architecture, working alongside Microsoft Fabric, is dbt Cloud. dbt Cloud supplies modeling and data transformation, testing, version control, and a CI/CD release management system for data code changes. Together, these help keep data quality consistently high.

Finally, data access is provided through tools such as Power BI, which analysts can use to power reports, and Data Activator, which sends notifications on data events in real time. This helps ensure that every slice of data is fresh and actionable.

Miss Pepper creates the data pipeline in dbt Cloud that powers this transformation. This pipeline orchestrates the movement and transformation of data from its source to the reports. She finds the data she needs from multiple sources to capture information on the restaurant’s current operational status.

She sets this transformation to run on a schedule so the data updates periodically. After the transformation is completed, she runs a Fabric notebook that provides additional insights by performing sentiment analysis on the reviews.

Because she’s using dbt Cloud, Miss Pepper can easily create a test to verify that she’s only ingesting data from currently open restaurants. She tests this by deleting the restaurant status filter from her code and running the pipeline in a staging environment. Sure enough, the data pipeline test fails and alerts her to the issue ASAP.
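A test like this can be sketched in a few lines. dbt data tests are commonly written as queries that select the rows violating a rule, and the test passes when the query returns nothing. The Python sketch below mimics that pattern with the standard-library sqlite3 module standing in for the warehouse. The table names, columns, and sample rows are hypothetical, not taken from Miss Pepper's actual project.

```python
import sqlite3

# Hypothetical sample data; restaurant names and statuses are invented.
ROWS = [
    (1, "Slice House", "open"),
    (2, "Crust and Co", "closed"),
    (3, "Dough Bros", "open"),
]

# The transformation with and without the restaurant status filter.
MODEL_WITH_FILTER = (
    "select id, name, status from raw_restaurants where status = 'open'"
)
MODEL_WITHOUT_FILTER = "select id, name, status from raw_restaurants"

# dbt-style data test: select the rows that break the rule.
# Zero rows returned means the test passes.
TEST_ONLY_OPEN = "select id from restaurants where status <> 'open'"


def run_test(model_sql: str) -> list[int]:
    """Build the model in a scratch database and return the failing ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table raw_restaurants (id integer, name text, status text)"
    )
    conn.executemany("insert into raw_restaurants values (?, ?, ?)", ROWS)
    conn.execute(f"create table restaurants as {model_sql}")
    failing = [row[0] for row in conn.execute(TEST_ONLY_OPEN)]
    conn.close()
    return failing


if __name__ == "__main__":
    print("with filter:", run_test(MODEL_WITH_FILTER))
    print("without filter:", run_test(MODEL_WITHOUT_FILTER))
```

With the filter in place the test query returns no rows. Removing the filter lets the closed restaurant through, so the query returns its id and the test fails, which mirrors the staging experiment described above.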

She can then look at the failure in the dbt Cloud dashboard and perform root cause analysis to find where the failure occurred. Using dbt Cloud, she can see a full data lineage map. This shows her data, its sources, and all of the tables it impacts downstream.

Clicking through this table will give her a description of the table, the tests that failed, and the table’s relationships. Using this, Miss Pepper can find the specific column that caused the failure.
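Conceptually, a lineage map is a directed graph, and finding everything a failing table impacts is a graph traversal. The following Python sketch assumes a small hypothetical lineage (the model names are invented) and walks it breadth-first to list the downstream tables. It illustrates the idea only and is not a description of dbt Cloud's internals.

```python
from collections import deque

# Hypothetical lineage: each model maps to the models it reads from.
UPSTREAM = {
    "stg_restaurants": [],
    "stg_reviews": [],
    "open_restaurants": ["stg_restaurants"],
    "province_ratings": ["open_restaurants", "stg_reviews"],
    "review_sentiment": ["stg_reviews"],
}


def downstream_of(node: str, upstream: dict[str, list[str]]) -> list[str]:
    """List every model that depends on `node`, directly or indirectly."""
    # Invert the "reads from" edges to get "is read by" edges.
    children: dict[str, list[str]] = {name: [] for name in upstream}
    for name, parents in upstream.items():
        for parent in parents:
            children.setdefault(parent, []).append(name)

    # Breadth-first walk from the failing node, nearest dependents first.
    seen = {node}
    queue = deque([node])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    print(downstream_of("stg_restaurants", UPSTREAM))
```

Running the same walk from a different starting table shows which reports would be affected if that table's tests failed.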

Miss Pepper fixes the code, which triggers dbt Cloud to perform an automatic rebuild and run all of her tests. Since this code change caused the test to pass, she can now push this code change live and run the production Fabric pipeline.

After the change, the pipeline runs successfully, creating all the required tables in Fabric Synapse Data Warehouse. Miss Pepper was even able to do some advanced sentiment analysis on reviews, classifying them as positive, mixed, neutral, and negative—all thanks to the built-in AI services offered by Fabric.

Since Fabric is SaaS, Miss Pepper didn’t have to provision any infrastructure to build this new architecture. Creating a data pipeline in Fabric also gives her access to all of its data connectors. That means she could improve the report over time by including additional data sources.

Boss Big Slice sees the final report and is very pleased. They can finally answer the question: which province in Canada has the best pizza? According to this 2018 pizza data set, the answer is Quebec, which had an impressive 81% average pizza rating score in 2018.

But that's not all. This map was updated to show only the pizza restaurants with a perfect rating score. Miss Pepper was also able to use sentiment analysis to score the reviews of the restaurants. Out of the pizza restaurants with perfect ratings, Stripes Pizza and Kitchener had the highest number of positive reviews.

Conclusion

dbt Cloud brings an enhanced developer and user experience, powering a DataOps approach to data management. Its SQL-based framework makes it accessible to a wide range of users.

dbt Cloud’s low-code approach to testing and documentation streamlines the development process. Additionally, dbt automatically calculates dependencies and lineage, reducing developer workload and accelerating development. It also allows business users to access documentation, making data more transparent and accessible.

Microsoft Fabric’s SaaS platform means that companies can manage data without managing infrastructure. Its large number of data connectors and options like database mirroring and shortcuts help manage and maintain data gravity. Fabric supports AI and analytics across the platform, providing Copilot tools to speed up daily data operations, along with options for advanced machine learning.

Used together, dbt Cloud and Microsoft Fabric provide a robust, end-to-end strategy for delivering high-quality data to power the next generation of AI applications.

Last modified on: Oct 16, 2024
