Understanding data grain in dbt

Guide to data grain

on Jun 03, 2025

Data grain is the combination of columns at which records in a table are unique. Ideally, this is captured in a single column, a unique primary key, but even then, there is descriptive grain behind that unique id. Let’s look at some examples to better understand this concept.

user_id	address
1	123 Jaffle Ln
2	456 Waffle St
3	789 Raffle Rd

In the above table, each user_id is unique. This table is at the user grain.

user_id	address
1	123 Jaffle Ln
1	420 Jaffle Ln
2	456 Waffle St
3	789 Raffle Rd

In the above table, user_id is no longer unique. The combination of user_id and address is unique, so this table is at the user address grain. We generally describe the grain conceptually based on the names of the columns that make it unique. A more realistic combination you might see in the wild would be a table that captures the state of all users at midnight every day. The combination of the captured updated_date and user_id would be unique, meaning our table is at the user per day grain.
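One way to confirm the grain of a table is to check which combination of columns has no repeated values. The following is a minimal Python sketch using the second example table; the helper name is_unique_at_grain is hypothetical and is not part of dbt.

```python
from collections import Counter


def is_unique_at_grain(rows, grain_columns):
    """Return True if no two rows share the same values for grain_columns."""
    keys = [tuple(row[col] for col in grain_columns) for row in rows]
    return all(count == 1 for count in Counter(keys).values())


# The user address table from the example above.
user_addresses = [
    {"user_id": 1, "address": "123 Jaffle Ln"},
    {"user_id": 1, "address": "420 Jaffle Ln"},
    {"user_id": 2, "address": "456 Waffle St"},
    {"user_id": 3, "address": "789 Raffle Rd"},
]

# user_id alone repeats, so the table is not at the user grain.
print(is_unique_at_grain(user_addresses, ["user_id"]))             # False
# user_id plus address is unique: the user address grain.
print(is_unique_at_grain(user_addresses, ["user_id", "address"]))  # True
```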

Creating surrogate keys

In both examples listed in the previous paragraph, we’d want to create a surrogate key of some sort from the combination of columns that comprise the grain. This gives our table a primary key, which is crucial for testing and optimization, and always a best practice. Typically, we’ll name this primary key based on the verbal description of the grain. For the latter example, we’d have user_per_day_id. This will be more robust and efficient for testing than repeatedly relying on the combination of those two columns.
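The idea can be sketched in Python as hashing the grain columns into a single deterministic value. This is an illustration only: the function name surrogate_key, the separator, the null placeholder, and the sample dates are all assumptions. dbt projects commonly use the dbt_utils.generate_surrogate_key macro for this, and the code below does not claim to reproduce that macro's exact behavior.

```python
import hashlib


def surrogate_key(row, grain_columns, separator="-", null_placeholder="<null>"):
    """Hash the grain columns of a row into one deterministic key (hypothetical helper)."""
    parts = [
        null_placeholder if row[col] is None else str(row[col])
        for col in grain_columns
    ]
    return hashlib.md5(separator.join(parts).encode("utf-8")).hexdigest()


# Made-up rows at the user per day grain.
daily_users = [
    {"user_id": 1, "updated_date": "2025-06-01"},
    {"user_id": 1, "updated_date": "2025-06-02"},
    {"user_id": 2, "updated_date": "2025-06-01"},
]

# Name the key after the verbal description of the grain.
for row in daily_users:
    row["user_per_day_id"] = surrogate_key(row, ["user_id", "updated_date"])
```

Because the key is derived only from the grain columns, rebuilding the table produces the same user_per_day_id for the same user and date.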

The importance of data grain in data modeling

Thinking deeply about grain is a crucial part of data modeling. As we design models, we need to consider the entities we’re describing, and what dimensions (time, attributes, etc.) might fan those entities out so they’re no longer unique, as well as how we want to deal with those. Do we need to apply transformations to deduplicate and collapse the grain? Or do we intentionally want to expand the grain out, like in our user per day example?
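Both directions can be illustrated with a short Python sketch. The snapshot dates are made up, and keeping the most recent row per user is only one possible deduplication rule, assumed here for illustration.

```python
from itertools import product

users = [1, 2, 3]
snapshot_dates = ["2025-06-01", "2025-06-02"]  # hypothetical dates

# Expand the grain: one row per user per day.
user_per_day = [
    {"user_id": user_id, "updated_date": day}
    for user_id, day in product(users, snapshot_dates)
]

# Collapse the grain: back to one row per user, keeping the latest snapshot.
# ISO-formatted date strings compare correctly as plain strings.
latest_by_user = {}
for row in user_per_day:
    current = latest_by_user.get(row["user_id"])
    if current is None or row["updated_date"] > current["updated_date"]:
        latest_by_user[row["user_id"]] = row
user_grain = list(latest_by_user.values())

print(len(user_per_day))  # 6 rows at the user per day grain
print(len(user_grain))    # 3 rows at the user grain
```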

There’s no right answer here; we can do either, depending on what meets our needs. The key is to make sure we have a clear sense of our grain for every model we create, that we’ve captured it in a primary key, and that we’re applying tests to ensure that our primary key column is unique and not null.
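In a dbt project, these checks are normally expressed with the built-in unique and not_null tests on the primary key column. The Python below is only a sketch of what those two tests assert; the function name check_primary_key and the sample rows are hypothetical.

```python
def check_primary_key(rows, key_column):
    """Return a list of failure messages; an empty list means the key passes."""
    failures = []
    values = [row.get(key_column) for row in rows]

    # not null: every row must carry a key value.
    if any(value is None for value in values):
        failures.append(f"{key_column} contains nulls")

    # unique: no key value may appear more than once.
    non_null = [value for value in values if value is not None]
    if len(non_null) != len(set(non_null)):
        failures.append(f"{key_column} contains duplicates")

    return failures


good_rows = [{"user_per_day_id": "a1"}, {"user_per_day_id": "b2"}]
bad_rows = [
    {"user_per_day_id": "a1"},
    {"user_per_day_id": "a1"},
    {"user_per_day_id": None},
]

print(check_primary_key(good_rows, "user_per_day_id"))  # []
print(check_primary_key(bad_rows, "user_per_day_id"))
```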

Choosing the right type of data grain for your use case

Selecting an appropriate grain is critical to effective data modeling. Here are some considerations to keep in mind:

Analysis requirements: The questions you need to answer determine your grain selection. Deeper questions often require finer grain structure.
Data volume: Larger data volumes challenge storage and processing capabilities. A coarser (more aggregated) grain reduces storage needs but sacrifices analytical detail.
Query performance: Finer grain means more records to process during queries. Performance tuning might require pre-aggregation to a coarser grain.
Update frequency: More frequent data updates favor transactional grain. Less frequent updates work well with snapshot approaches.
Process visibility: Tracking multi-stage processes through workflows calls for accumulating snapshots.
Reporting cadence: Regular reporting schedules influence grain selection. Daily reporting needs different grain than quarterly analysis.
Historical analysis: Long-term trend analysis has unique grain requirements. Consider how historical comparisons will work with your chosen grain.
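The trade-off between detail and volume can be seen in a small pre-aggregation sketch. The login_count column and the sample rows are made up for illustration; the code rolls a user per day table up to a user per month grain.

```python
from collections import defaultdict

# Hypothetical rows at the user per day grain.
daily_logins = [
    {"user_id": 1, "updated_date": "2025-06-01", "login_count": 2},
    {"user_id": 1, "updated_date": "2025-06-02", "login_count": 1},
    {"user_id": 1, "updated_date": "2025-07-01", "login_count": 4},
    {"user_id": 2, "updated_date": "2025-06-15", "login_count": 3},
]

# Sum login counts per (user_id, month); "YYYY-MM" is the first 7 characters.
monthly_totals = defaultdict(int)
for row in daily_logins:
    month = row["updated_date"][:7]
    monthly_totals[(row["user_id"], month)] += row["login_count"]

# Rebuild rows at the coarser user per month grain.
user_per_month = [
    {"user_id": user_id, "login_month": month, "login_count": total}
    for (user_id, month), total in sorted(monthly_totals.items())
]

print(len(daily_logins), len(user_per_month))  # 4 3
```

The monthly table has fewer rows to store and scan, but it can no longer answer questions about individual days.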

Ready to bring clarity and consistency to your data models?

Start building trusted, well-documented data pipelines with dbt. Try dbt for free and join the data teams building with confidence at every grain.

FAQs for data grain

What is data grain?

Data grain is the combination of columns at which records in a table are unique. Ideally this uniqueness is captured in a single column called a primary key, but even then, there is descriptive grain behind that unique ID.

For example, in a table where each user_id is unique, the table is at the "user grain"; but if user_id appears multiple times with different addresses, the table is at the "user address grain".

How does grain influence data modeling?

Grain is critically important because it:

Determines what questions can be answered with your data
Impacts storage requirements and query performance
Affects the complexity of your data model
Influences data integration capabilities

What are examples of different grains?

Here's a basic example: a table with unique user_ids (user grain) compared to a table where users can have multiple addresses (user address grain).

A more realistic example would be a table that captures the state of all users at midnight every day, where the combination of updated_date and user_id would be unique (user per day grain). In both cases, creating a surrogate key that combines these fields establishes a proper primary key.

What best practices should I follow with data grain?

Make sure you have a clear sense of the grain for every model you create and capture it in a primary key column. Apply tests to ensure that your primary key column is unique and not null.

When dealing with multiple columns that define uniqueness, create a surrogate key named based on the verbal description of the grain (like user_per_day_id). There's no single right approach — you can choose to collapse or expand grain as needed — but clarity and consistency are essential.

Why is creating a surrogate key important?

Creating a surrogate key from the combination of columns that comprise the grain gives your table a primary key, which is crucial for testing and optimization. Testing a single key column is more robust and efficient than repeatedly testing combinations of columns. For example, in a table at the "user per day" grain, you might create a surrogate key called "user_per_day_id" that combines user_id and updated_date to ensure uniqueness.

Published on: Jul 01, 2024

