Understanding data modeling

Joey Gault

on Dec 18, 2025

Data modeling is the process data practitioners use to transform raw data into meaningful business entities within a data warehouse. While it may seem abstract, data modeling forms the foundation of reliable analytics work. Through structured SQL transformations, data teams convert messy source data into organized, trustworthy datasets that power business intelligence, operational workflows, and strategic decision-making.

What data modeling is

At its core, data modeling involves designing how data is organized, stored, and connected across an entire system. The process shapes raw data into the story of a business, establishing repeatable patterns for how schemas and tables are structured, models are named, and relationships are constructed.

Raw data rarely arrives in a format suitable for analysis. Source systems optimize for operational efficiency, storing information in ways that serve transactional needs rather than analytical ones. Data modeling bridges this gap, preparing data for consumption by business users who need to derive insights without wrestling with technical complexity.

The distinction between a data model and individual transformation files matters. A data model represents the complete blueprint: the architecture defining how all pieces fit together. Individual transformation files, such as those created in dbt, are the building blocks that implement this broader design. Think of the data model as a recipe, with transformation files as the ingredients that combine to create the final product.

Why data modeling matters

Well-designed data models determine whether business users trust and adopt data products. When end users lack confidence in data quality or find datasets difficult to navigate, they retreat to familiar tools like spreadsheets, burying business logic in one-off analyses that create silos and inconsistency.

Data modeling addresses this challenge by creating intuitive, navigable structures. A solid modeling approach produces data warehouses where relationships are clear, naming conventions are consistent, and business logic lives in centralized, version-controlled locations rather than scattered across individual reports.

The impact extends beyond user experience. Proper data modeling enables self-service analytics, allowing business users to answer their own questions and explore new ones without constant data team intervention. This scalability frees data practitioners to focus on higher-value work rather than repeatedly answering similar ad-hoc requests.

Performance considerations also drive modeling decisions. Analytics workloads consume large datasets, making query speed essential. Data models optimized for read performance (through appropriate denormalization, pre-aggregation, or materialization strategies) deliver faster results and lower compute costs.

Key components of data modeling

Modern data modeling typically organizes work into distinct layers, each serving a specific purpose in the transformation pipeline.

Staging models form the foundation, cleaning and standardizing raw source data. These models perform light transformations: casting field types, renaming columns for readability, and filtering out deleted records. Staging creates consistency for downstream work, providing a defensive layer that absorbs changes in source systems. When source data structures shift, a fix in the staging layer propagates corrections throughout the pipeline without edits to every downstream model.
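
For illustration, a staging model in dbt might look like the following sketch; the source, column names, and deleted-record flag are hypothetical:

```sql
-- stg_orders.sql: a minimal staging sketch (source and column names are hypothetical)
with source as (

    select * from {{ source('shop', 'orders') }}

),

renamed as (

    select
        id as order_id,                                       -- rename for readability
        user_id as customer_id,
        cast(order_total as numeric(18, 2)) as order_amount,  -- cast field types
        cast(created_at as timestamp) as ordered_at
    from source
    where is_deleted = false                                  -- filter out deleted records

)

select * from renamed
```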

Intermediate models handle complex transformations that don't fit neatly into staging or final output layers. When business logic becomes too nested or complex to maintain readability, intermediate models break transformations into manageable pieces. These optional layers improve code comprehension and debugging speed.
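
For example, an intermediate model might isolate one awkward piece of logic, such as pivoting payments per order, so the downstream mart stays readable. A sketch with hypothetical model and column names:

```sql
-- int_orders__payments_pivoted.sql: hypothetical intermediate model that
-- breaks a nested transformation out into its own manageable piece
with payments as (

    select * from {{ ref('stg_payments') }}

),

pivoted as (

    select
        order_id,
        sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount,
        sum(case when payment_method = 'gift_card' then amount else 0 end) as gift_card_amount,
        sum(amount) as total_amount
    from payments
    group by order_id

)

select * from pivoted
```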

Mart models apply business logic to create core data assets for analysis. This layer typically produces fact and dimension tables: the building blocks of dimensional modeling. Facts represent measurable events and actions (sales transactions, website visits, payments), while dimensions provide descriptive context (customers, products, time periods). Together, these models form the business entities that stakeholders interact with in BI tools.
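
A simple fact model might combine the staging and intermediate work above into one row per order, carrying foreign keys to its dimensions. Again, a sketch with hypothetical names:

```sql
-- fct_orders.sql: hypothetical mart model at one row per order
with orders as (

    select * from {{ ref('stg_orders') }}

),

order_payments as (

    select * from {{ ref('int_orders__payments_pivoted') }}

)

select
    orders.order_id,                                          -- the grain: one row per order
    orders.customer_id,                                       -- foreign key to dim_customers
    orders.ordered_at,
    coalesce(order_payments.total_amount, 0) as total_amount  -- the measurable fact
from orders
left join order_payments
    on orders.order_id = order_payments.order_id
```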

The separation of concerns across layers creates modularity. Rather than building monolithic transformations from raw data each time, practitioners reference foundational work completed by others. This approach reduces duplication, improves maintainability, and makes dependencies explicit through clear lineage.

Common data modeling approaches

Several methodologies guide how data teams structure their models, each offering different tradeoffs.

Relational models store data in tabular formats connected via foreign keys. While foundational to most approaches, pure relational models lack the structure needed to scale effectively across large organizations. They serve as building blocks for more sophisticated techniques.
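
At its simplest, a relational model is just tables linked by keys, as in this illustrative DDL:

```sql
-- Two relational tables connected via a foreign key (illustrative names)
create table customers (
    customer_id integer primary key,
    customer_name varchar(200)
);

create table orders (
    order_id integer primary key,
    customer_id integer references customers (customer_id),   -- foreign key
    ordered_at timestamp
);
```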

Dimensional modeling categorizes entities into facts and dimensions, optimizing for analytical workloads. Originally developed to address storage costs and query performance in pre-cloud environments, dimensional modeling remains relevant because it aligns data structures with how businesses think and operate. Business users can understand dimensional models without deep technical knowledge, enabling self-service analytics. Star schemas (where fact tables connect directly to dimension tables) provide the simplest implementation, while snowflake schemas normalize dimensions further at the cost of added complexity.
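
A star schema keeps analytical queries simple because the fact table joins directly to each dimension it needs. A sketch with hypothetical table and column names:

```sql
-- Revenue by customer segment and month, straight off a star schema
select
    dim_customers.customer_segment,
    dim_dates.calendar_month,
    sum(fct_orders.total_amount) as revenue
from fct_orders
left join dim_customers
    on fct_orders.customer_id = dim_customers.customer_id
left join dim_dates
    on cast(fct_orders.ordered_at as date) = dim_dates.date_day
group by 1, 2
```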

Data vault modeling abstracts entities into hubs (unique business keys), links (relationships between entities), and satellites (descriptive attributes). This approach excels at tracking data changes through an insert-only pattern, making it valuable for high-governance environments requiring comprehensive audit trails. However, the prescriptive structure can feel repetitive and requires technical sophistication to manage effectively.
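
In rough terms, a data vault design for customers and orders might separate the pieces like this; an illustrative sketch, not a complete standard implementation:

```sql
-- Hub: the unique business key
create table hub_customer (
    customer_hk varchar(32) primary key,   -- hash key derived from the business key
    customer_id varchar(50),               -- the business key itself
    load_ts timestamp,
    record_source varchar(100)
);

-- Link: the relationship between two entities
create table link_customer_order (
    customer_order_hk varchar(32) primary key,
    customer_hk varchar(32),
    order_hk varchar(32),
    load_ts timestamp,
    record_source varchar(100)
);

-- Satellite: descriptive attributes, insert-only so every change is preserved
create table sat_customer_details (
    customer_hk varchar(32),
    customer_name varchar(200),
    customer_email varchar(200),
    load_ts timestamp,
    record_source varchar(100),
    primary key (customer_hk, load_ts)
);
```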

Entity-relationship modeling focuses on business processes, defining how entities, their attributes, and their relationships connect. While useful for understanding database architecture at a high level, ER modeling often serves as a stepping stone toward more complete approaches like dimensional modeling.

The choice between normalized and denormalized structures depends on use case. Normalized models optimize for write operations, making them suitable for transactional systems. Denormalized models optimize for read performance, which analytics workloads require. Modern cloud data warehouses (where storage is cheap but compute drives costs) have shifted the calculus. Keeping tables highly normalized can become expensive when every analysis requires multiple joins that consume compute resources.

Practical challenges

Data modeling involves navigating ambiguity. Determining whether an entity should be treated as a fact or dimension depends on analytical needs rather than rigid rules. An appointments table at a medical clinic could reasonably be modeled either way depending on whether the focus is patient-level analysis or appointment-level detail.

The decision to create wide, denormalized tables versus maintaining separate fact and dimension tables similarly depends on context. When end users write SQL comfortably or use BI tools that handle joins well, separate tables provide flexibility. When users lack SQL skills or tools struggle with joins, pre-joined wide tables may be necessary despite increased redundancy.
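
A pre-joined wide table is essentially a fact table with its dimension attributes flattened onto every row, as in this hypothetical sketch:

```sql
-- A wide, denormalized mart: convenient for users who can't write joins,
-- at the cost of repeating dimension values on every row (hypothetical names)
select
    fct_orders.order_id,
    fct_orders.ordered_at,
    fct_orders.total_amount,
    dim_customers.customer_name,        -- repeated on every order for that customer
    dim_customers.customer_segment,
    dim_customers.customer_region
from {{ ref('fct_orders') }} as fct_orders
left join {{ ref('dim_customers') }} as dim_customers
    on fct_orders.customer_id = dim_customers.customer_id
```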

These decisions require understanding both the data and stakeholder needs: two of the most difficult aspects of data work. Collaboration within data teams helps navigate these choices, as does establishing clear conventions that guide consistent decision-making.

Maintaining readability as models grow presents another challenge. Individual transformation files should remain concise enough for team members to quickly understand their purpose and logic. Modular SQL blocks (implemented through macros in tools like dbt) help keep files readable by abstracting repetitive patterns into reusable components.
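
For example, a small dbt macro can wrap a conversion that would otherwise be copy-pasted across many models; the macro name and logic here are hypothetical:

```sql
-- macros/cents_to_dollars.sql: repeated SQL abstracted into a reusable macro
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

A model can then call `{{ cents_to_dollars('amount_cents') }} as amount_usd` instead of repeating the arithmetic in every file.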

Best practices

Successful data modeling starts with business needs rather than technical preferences. Every model should solve for a specific business process, enabling users to answer questions and explore data independently. This orientation ensures models deliver real value rather than technical elegance disconnected from actual use.

Defining grain (the level of detail a single row in a fact table represents) requires careful consideration. Building at the wrong grain creates technical debt that's expensive to fix. Starting at the lowest level of granularity that might be needed makes sense, since aggregating up is easier than disaggregating.
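
For instance, a fact table built at order-line grain can always be rolled up to daily or monthly totals later, while a fact table built at daily grain can never recover order-level detail. A sketch with a hypothetical model name:

```sql
-- Aggregating up from the lowest grain is straightforward
select
    cast(ordered_at as date) as order_date,
    sum(line_amount) as daily_revenue
from {{ ref('fct_order_lines') }}      -- one row per order line
group by 1
```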

Conformed dimensions (master data elements like customer and product dimensions) should be built once and shared across all subject areas. This ensures consistency and makes metrics tie out across different business processes. Rebuilding the same dimension multiple times creates maintenance burden and risks introducing discrepancies.
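
In practice, this means every mart joins to the same dimension model rather than rebuilding customer attributes locally, so a metric like revenue by segment ties out across subject areas. A sketch with hypothetical names:

```sql
-- A finance mart and a marketing mart would both join to the same dim_customers
select
    dim_customers.customer_segment,
    sum(fct_subscriptions.monthly_revenue) as revenue_by_segment
from {{ ref('fct_subscriptions') }} as fct_subscriptions
left join {{ ref('dim_customers') }} as dim_customers   -- the one conformed dimension
    on fct_subscriptions.customer_id = dim_customers.customer_id
group by 1
```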

Naming conventions bring order to data warehouses. Without consistent patterns, teams may rebuild models that already exist or join data in duplicative, low-performance ways. Clear prefixes (stg_ for staging, fct_ for facts, dim_ for dimensions) make project structure immediately comprehensible to anyone navigating the codebase.

Documentation and testing enforce quality standards. Models should include descriptions of their purpose, grain, and key relationships. Tests validate uniqueness, referential integrity, and business rules, preventing model drift and catching issues before they reach production.
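
As one example, a singular test in dbt is simply a SQL file that returns rows when something is wrong. This sketch would fail if order_id were not unique (model name hypothetical):

```sql
-- tests/assert_order_id_is_unique.sql: returns duplicate order_ids, if any exist
select
    order_id,
    count(*) as n_rows
from {{ ref('fct_orders') }}
group by order_id
having count(*) > 1
```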

Governance processes manage evolution. As business needs change, models must adapt without breaking downstream dependencies. Establishing clear processes for handling changes (including how breaking changes are communicated and implemented) prevents outages and maintains trust.

Building for the long term

The accessibility of modern transformation tools can create a false sense of simplicity. Teams often start by delivering quick wins through one-off models, only to discover their approach doesn't scale as complexity grows. The cost of rebuilding a fundamentally flawed architecture is substantial.

Investing in thoughtful data modeling upfront avoids this pain. While it requires more initial effort than building ad-hoc solutions, proper architecture pays dividends as the data warehouse grows. Models become easier to maintain, debug, and extend. New team members can contribute effectively because clear structure provides guardrails.

The goal is creating a data warehouse that serves as a reliable foundation for analytics: one where business users find data intuitive to work with, where data teams can efficiently build and maintain transformations, and where the architecture can evolve alongside changing business needs. Achieving this requires treating data modeling as a first-class concern rather than an afterthought, establishing conventions early, and maintaining discipline as the system grows.

Frequently asked questions

What is data modeling?

Data modeling is the process data practitioners use to transform raw data into meaningful business entities within a data warehouse. It involves designing how data is organized, stored, and connected across an entire system through structured SQL transformations. The process shapes raw data into the story of a business, establishing repeatable patterns for how schemas and tables are structured, models are named, and relationships are constructed.

What are the types of data models?

There are several common data modeling approaches, each with different tradeoffs. Relational models store data in tabular formats connected via foreign keys and serve as building blocks for more sophisticated techniques. Dimensional modeling categorizes entities into facts and dimensions, optimizing for analytical workloads and business user understanding. Data vault modeling abstracts entities into hubs, links, and satellites, excelling at tracking data changes in high-governance environments. Entity-relationship modeling focuses on business processes and how entities, attributes, and relationships connect.

What is the data modeling process?

The data modeling process typically organizes work into distinct layers. Staging models form the foundation by cleaning and standardizing raw source data through light transformations like casting field types and renaming columns. Intermediate models handle complex transformations that don't fit into other layers, breaking complicated logic into manageable pieces. Mart models apply business logic to create core data assets for analysis, typically producing fact and dimension tables that represent measurable events and descriptive context respectively.

