Data product examples: What data products look like in practice

Mar 25, 2024

Data products are an important method for organizing data development. By shipping new data-driven projects as data products, data teams can make it easier to find, reuse, and version data sets across a company, while also eliminating redundancy and waste.

However, even with a detailed definition in hand, it can be hard to picture what a data product looks like in practice. In this article, we’ll demystify the subject by walking through some detailed examples of what data products look like and how they work.

Attributes of data products

A data product is a data container or a unit of data within a given business domain that directly solves a customer or business problem.

As outlined by Zhamak Dehghani, the originator of the data product concept, a domain data product has a number of attributes:

Discoverable. It should be easy to find - e.g., via registration in a data catalog.

Addressable. It has a unique, labeled location from which anyone with the correct authorization can retrieve it.

Trustworthy & truthful. It implements metadata, data lineage, and other mechanisms to show data consumers who owns the data, where it came from, and when it was last updated.

Self-describing. It uses code, metadata, and documentation to describe what the data product is, how it’s structured, its business use, and what each field means.

Interoperable. It can work via some mechanism (SQL queries, API calls, etc.) with other data products.

Secure & governed. It relies on federated computational governance to enable teams to manage their own data while also ensuring security, compliance, and high data quality.

As a deliverable, a data product can be as simple as a spreadsheet or as complex as a machine learning model. However, a data product is more than the deliverable itself. Key components of a data product include:

  • A contract that specifies what fields the data product contains and any associated constraints
  • Versioning applied to the contract, so that new releases or fixes don’t break existing data consumers
  • Any other assets needed to create and maintain the data product, such as APIs, storage, tests, and an orchestration pipeline
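To make the idea of a contract concrete, here is a minimal sketch in Python of a contract that lists the fields a data product exposes and checks a record against them. The `Column` and `Contract` classes, the `users` contract, and its field names are all hypothetical and exist only for illustration; this is not how dbt implements contracts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One field in the contract (hypothetical helper for this sketch)."""
    name: str
    dtype: type
    nullable: bool = True


@dataclass(frozen=True)
class Contract:
    """A named, versioned list of the fields a data product promises."""
    name: str
    version: int
    columns: tuple[Column, ...]

    def violations(self, row: dict) -> list[str]:
        """Return a list of human-readable contract violations for one record."""
        problems = []
        expected = {c.name for c in self.columns}
        # Fields the contract does not know about
        for extra in sorted(set(row) - expected):
            problems.append(f"unexpected field: {extra}")
        # Missing fields, nulls in required fields, and wrong types
        for col in self.columns:
            if col.name not in row:
                problems.append(f"missing field: {col.name}")
                continue
            value = row[col.name]
            if value is None:
                if not col.nullable:
                    problems.append(f"null in non-nullable field: {col.name}")
            elif not isinstance(value, col.dtype):
                problems.append(f"wrong type for {col.name}: {type(value).__name__}")
        return problems


# Assumed example contract; the field names are illustrative only.
users_v1 = Contract(
    name="users",
    version=1,
    columns=(
        Column("user_id", int, nullable=False),
        Column("email", str, nullable=False),
        Column("signup_date", str),
    ),
)

good = {"user_id": 1, "email": "a@example.com", "signup_date": "2024-01-05"}
bad = {"user_id": "1", "email": None}

print(users_v1.violations(good))
print(users_v1.violations(bad))
```

Carrying the version number alongside the field list is what lets a producer publish a changed contract as a new version rather than silently altering the one consumers already depend on.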

Examples of data products

That leaves the question of what a data product actually looks like in practice, how data producers package and deploy it, and how data consumers access it. To make these concepts more concrete, let’s walk through three examples of a data product:

  • Cleaned user information table
  • Time-filtered table lookup view
  • Joined supertable

A cleaned user information table

Explanation

A user information table is a core component that multiple applications need to leverage. Without a single, centralized source for this information, every application or report will end up scraping data from the underlying raw tables in different ways.

A cleaned user information table consolidates and standardizes user data from multiple sources, providing a single, reliable source of user information. This data product addresses common issues such as data inconsistency, duplication, and incomplete records by implementing robust data cleaning and transformation processes.

To accomplish this, a single team - the one that owns the underlying domain data - takes responsibility for creating a data pipeline that automatically merges and cleans the data. It can do this, for example, by writing a dbt model that defines the structure of the destination tables. It can then run the model regularly as a Continuous Integration (CI) job, propagating changes in the underlying data to the cleaned model.
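The merge-and-clean step can be sketched with Python's built-in `sqlite3` module. This is a rough illustration under stated assumptions, not the pipeline described above: the two raw tables (`raw_crm_users`, `raw_app_users`), their columns, and the cleaning rules (normalize emails, drop incomplete records, keep the most recent record per email) are all hypothetical. It also assumes SQLite 3.25 or later, which is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw sources with inconsistent, duplicated, and incomplete records
conn.executescript("""
CREATE TABLE raw_crm_users (email TEXT, full_name TEXT, updated_at TEXT);
CREATE TABLE raw_app_users (email TEXT, full_name TEXT, updated_at TEXT);

INSERT INTO raw_crm_users VALUES
  ('Ada@Example.com ', 'Ada Lovelace', '2024-01-10'),
  ('grace@example.com', NULL, '2024-01-12');
INSERT INTO raw_app_users VALUES
  ('ada@example.com', 'Ada L.', '2024-02-01'),
  ('grace@example.com', 'Grace Hopper', '2024-01-05'),
  (NULL, 'No Email', '2024-01-01');
""")

# Build the cleaned table: normalize emails, drop incomplete rows,
# and keep only the most recently updated record for each email.
conn.executescript("""
CREATE TABLE clean_users AS
WITH merged AS (
  SELECT lower(trim(email)) AS email, full_name, updated_at FROM raw_crm_users
  UNION ALL
  SELECT lower(trim(email)) AS email, full_name, updated_at FROM raw_app_users
),
ranked AS (
  SELECT email, full_name, updated_at,
         ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
  FROM merged
  WHERE email IS NOT NULL AND full_name IS NOT NULL
)
SELECT email, full_name, updated_at FROM ranked WHERE rn = 1;
""")

rows = conn.execute(
    "SELECT email, full_name, updated_at FROM clean_users ORDER BY email"
).fetchall()
for row in rows:
    print(row)
```

Every consumer that reads `clean_users` gets the same deduplicated records, instead of each application applying its own cleaning rules to the raw tables.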

The team can also define a contract for the model to enforce data constraints. Once the model has run, the team publishes the contract and model to a data catalog, such as dbt Explorer. Here, other teams can find the data and incorporate it into their own apps and reports.

If the team needs to release breaking changes - e.g., renaming or removing a field - it can do so by versioning the model. This will enable data consumers of the current version to continue using it for a period of time until they can upgrade their code to handle the new data format.
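The idea behind versioning can be illustrated, loosely, by exposing two versions of the same data side by side. The sketch below uses plain SQLite views and is not dbt's own versioning mechanism; the relation names `users_v1` and `users_v2` and the renamed column `display_name` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clean_users (user_id INTEGER PRIMARY KEY, full_name TEXT, email TEXT);
INSERT INTO clean_users VALUES (1, 'Ada Lovelace', 'ada@example.com');

-- Version 1 keeps the original column name for existing consumers
CREATE VIEW users_v1 AS
SELECT user_id, full_name, email FROM clean_users;

-- Version 2 introduces the breaking change (a renamed column)
CREATE VIEW users_v2 AS
SELECT user_id, full_name AS display_name, email FROM clean_users;
""")


def column_names(relation: str) -> list[str]:
    """Return the column names a relation exposes (trusted relation names only)."""
    cursor = conn.execute(f"SELECT * FROM {relation} LIMIT 0")
    return [description[0] for description in cursor.description]


v1_columns = column_names("users_v1")
v2_columns = column_names("users_v2")
print(v1_columns)
print(v2_columns)
```

Queries written against `users_v1` keep working unchanged while consumers migrate to `users_v2` on their own schedule; the producer can retire the old version once the migration window closes.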

Business value

  • Provides a single source of truth for critical business data that’s discoverable in a centralized catalog
  • Avoids wasting computing resources and data engineering time by providing a single, accurate definition of the data set across teams
  • Prevents sudden downstream work stoppages when releasing a breaking change

A time-filtered table lookup view

Explanation

Some data sets can grow incredibly large, which makes them harder to query efficiently. If a cluster of use cases only needs data for a limited time window (e.g., the last three months), it makes sense to provide performant access to this critical subset.

A time-filtered table lookup view provides efficient access to a subset of large datasets based on time-based criteria. This data product is designed to optimize performance for applications that typically need only the most recent data, reducing computational overhead associated with queries against large tables.

A time-filtered table lookup view may be as simple as a materialized view that queries data for the last three months. Many relational databases (like PostgreSQL) support materialized views, which precompute the result of a complex query and store it for faster access.

Like all models, materialized views in dbt can have associated contracts and versions, providing protection for consumers against breaking changes.
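As a rough sketch of the time-filtered idea, the example below keeps a small precomputed table of the last three months of data and rebuilds it on demand. SQLite has no materialized views, so a regular table refreshed by a function stands in for one here. The `orders` table, its columns, and the `refresh_recent_orders` helper are hypothetical, and the "as of" date is passed in explicitly so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical large source table (ISO-8601 date strings sort chronologically)
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, ordered_at TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, '2023-06-15', 40.0),
  (2, '2024-01-20', 25.0),
  (3, '2024-03-01', 60.0);

-- Precomputed subset that consumers query instead of the full table
CREATE TABLE recent_orders (order_id INTEGER PRIMARY KEY, ordered_at TEXT, amount REAL);
""")


def refresh_recent_orders(conn: sqlite3.Connection, as_of: str) -> None:
    """Rebuild recent_orders with rows from the three months up to as_of."""
    with conn:  # run the delete and insert in a single transaction
        conn.execute("DELETE FROM recent_orders")
        conn.execute(
            "INSERT INTO recent_orders "
            "SELECT order_id, ordered_at, amount FROM orders "
            "WHERE ordered_at > date(?, '-3 months') AND ordered_at <= ?",
            (as_of, as_of),
        )


refresh_recent_orders(conn, "2024-03-25")
recent_ids = [
    row[0] for row in conn.execute("SELECT order_id FROM recent_orders ORDER BY order_id")
]
print(recent_ids)
```

In a database with native materialized views, the refresh function would typically be replaced by the database's own refresh command, run on whatever schedule the use cases require.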

Business value

  • Improves query performance by focusing on the most relevant data and reducing the need for full-table queries
  • Reduces compute resources and costs associated with querying large datasets
  • Can be reused and integrated into multiple applications

A joined supertable

Explanation

Multiple teams may be running queries that rely on a large number of JOIN clauses to combine data from several tables. For example, a report on customers who have subscribed to insurance plans may pull data from Customer, CustomerDetails, Plan, PlanOption, and PlanOrder tables. Running this query once can be slow and costly - and running it repeatedly, for multiple teams, can make it even more expensive.

A joined supertable runs this operation once for multiple teams and makes the results available in a new table containing the most relevant data from each joined table. This reduces resource consumption by providing a single, precomputed location for this data. This data can then be accessed via a simple (and cheap) SQL SELECT query.
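The sketch below illustrates the pattern with Python's built-in `sqlite3` module. The table names come from the insurance example above, but every column name, the sample rows, and the `customer_plan_orders` supertable name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source tables; all column names are illustrative
CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE CustomerDetails (customer_id INTEGER, region TEXT);
CREATE TABLE "Plan" (plan_id INTEGER PRIMARY KEY, plan_name TEXT);
CREATE TABLE PlanOption (option_id INTEGER PRIMARY KEY, plan_id INTEGER, option_name TEXT);
CREATE TABLE PlanOrder (
  order_id INTEGER PRIMARY KEY, customer_id INTEGER, option_id INTEGER, ordered_at TEXT
);

INSERT INTO Customer VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO CustomerDetails VALUES (1, 'EMEA'), (2, 'AMER');
INSERT INTO "Plan" VALUES (10, 'Home');
INSERT INTO PlanOption VALUES (100, 10, 'Flood cover');
INSERT INTO PlanOrder VALUES (1000, 1, 100, '2024-02-01');

-- Run the expensive five-table join once and store the result
CREATE TABLE customer_plan_orders AS
SELECT o.order_id, c.customer_id, c.name AS customer_name, d.region,
       p.plan_name, po.option_name, o.ordered_at
FROM PlanOrder AS o
JOIN Customer AS c ON c.customer_id = o.customer_id
JOIN CustomerDetails AS d ON d.customer_id = c.customer_id
JOIN PlanOption AS po ON po.option_id = o.option_id
JOIN "Plan" AS p ON p.plan_id = po.plan_id;
""")

# Consumers now need only a simple SELECT, with no joins
rows = conn.execute(
    "SELECT customer_name, region, plan_name, option_name FROM customer_plan_orders"
).fetchall()
print(rows)
```

The join logic lives in one place, so a change to how the tables relate is made once by the owning team rather than in every downstream query.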

Business value

  • Eliminates redundant join operations across multiple queries, optimizing resource usage
  • Simplifies data access by providing a consolidated view of related data, making it more accessible to teams and team members who aren’t SQL experts
  • Opens the door for exploratory analysis, encouraging users to find relationships in combined data they might not have seen previously

Planning the design of a data product

Successfully designing a data product like one of the above examples isn’t just a matter of writing a new data transformation. A data product is a reusable data deliverable that covers a number of use cases and takes into account the needs of multiple users. As such, it requires some forethought and planning to be successful.

Perform data discovery and define use cases

The first step is identifying which data you’ll need and which use cases it fits. Your data products should generally contain all the data required within a data domain. Furthermore, that collection of data should be unique across all data products.

To identify these data products, look for data that’s consumed by a large number of teams across a diverse set of use cases. For example, in the user table example, teams as diverse as Product Development, Customer Support, Finance, Compliance, Sales, and Marketing will all use this core data in some form or another.

Look in particular for data that currently has a lot of duplication and overlap. This is a key signal that the data in question could benefit from rationalization into a data product.

Set goals and gather requirements

Define your goals and any KPIs for your data product. These may include onboarding x teams to the data product within y months, eliminating a set amount of redundant computing spend, improving data quality metrics for the underlying data, maintaining a specified average query performance, or similar benchmarks.

This is also the point to reach out across impacted teams and gather their requirements. What data do they need to see in the data product for them to use it? What would they like to see in the near future? Failing to gather requirements beforehand could limit the data product’s utility and prevent stakeholders from onboarding.

Design the data product

With the groundwork laid, it’s time to build the architecture of the underlying data product. This includes addressing each of the attributes listed earlier - discoverability, addressability, etc.

Other key elements of the data product’s design should include:

  • A focus on ensuring data quality. This can be accomplished through a combination of data cleaning, validation, and testing.
  • Metadata and documentation. Exposing key metadata makes the data product easier for data consumers to discover, increasing its reach and utility. Solid documentation - such as that enabled by dbt models - makes your data products easier to understand and use.
  • Performance. Using optimized queries and making efficient use of resources will help ensure data consumers continue to leverage your data product well into the future.
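The data quality point can be made concrete with two simple checks, one for missing values and one for duplicates. The sketch below is a minimal illustration, assuming a hypothetical `clean_users` table; the helper names `count_nulls` and `count_duplicates` are invented for this example and are not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table containing one null email and one duplicated user_id
CREATE TABLE clean_users (user_id INTEGER, email TEXT);
INSERT INTO clean_users VALUES
  (1, 'ada@example.com'),
  (2, NULL),
  (2, 'grace@example.com');
""")


def count_nulls(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Count rows where the column is NULL (trusted identifiers only)."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchone()[0]


def count_duplicates(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Count distinct values that appear more than once in the column."""
    query = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1)"
    )
    return conn.execute(query).fetchone()[0]


null_emails = count_nulls(conn, "clean_users", "email")
duplicate_ids = count_duplicates(conn, "clean_users", "user_id")
print(null_emails, duplicate_ids)
```

Checks like these can run after each pipeline run, with a non-zero count treated as a failure that blocks publication of the data product.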

Conclusion

A data product can deliver high-quality data that reduces data redundancy, simplifies access to critical data sets, and prevents breaking changes from interrupting the flow of business. As demonstrated above, you can package a variety of deliverables as data products, making it easier for other teams to discover and use core data sets without reinventing the wheel.

Last modified on: Oct 15, 2024
