4 data modeling techniques for modern data warehouses
I remember the first time I showed a business user “under the hood” of the dbt project for our organization:
Them: “Wow! There are so many models and there’s so much work here; I would have never known that because of how intuitive the data is to use in Looker.”
Me: “😅😅…I know”
These data models formed the heart of analytics at our organization—powered every single dashboard in our BI tool— and the majority of business users at that time thought we were conducting magic behind the scenes to get the data in the proper state. The validation of the work and energy that went into those data models from my colleague who depended on that data was gratifying.
This post is about data modeling and the techniques (relational, ER, dimensional, and data vault) that help data teams succeed in modeling their data. It’s also recognition that this is often murky work—more art than science at times—and data practitioners navigate these spaces of ambiguity with as much rigor as their tooling, resources, and data allow.
Despite how it may appear, data modeling is not some fantastical process made up of *waves hands airily* magic 🪄. Formally, it’s the process data practitioners use to make raw data into meaningful business entities in their data warehouse. There’s a whole lot of SQL (and pointedly not magic) involved during the transformation process, and data teams rely on tried and true techniques to create predictable and scalable data models.
Data modeling shapes raw data into the story of a business, as well as establishes a repeatable process that can help create consistency in a data warehouse: how schemas and tables are structured, models are named, and relationships are constructed. At the end of the day, a solid data modeling process will produce a data warehouse that is navigable and intuitive, with data models that represent the needs of the business.
The different data modeling techniques
Data modeling…the perfect balance between art and science, chaos and structure, and pain and joy 😂. At the end of the day, there are many ways to model your data and get it in a place that’s best suited for your business needs. Your downstream use cases, data warehouse, raw source data, and the skills of your data team will all help determine which data modeling approach is most appropriate for your business.
This page is going to cover the four most common types of data modeling techniques we see used by modern analytics teams (relational, dimensional, entity-relationship, and data vault models), what they are at a high level, and how to unpack which one is most appropriate for your organization.
A relational data model is a broad way to represent data and their relationships to other data. In a simple relational model, data is stored in a tabular format (think: rows and columns), and connects to other tables via foreign keys. At the end of the day, almost all of the data modeling techniques we discuss here and see in the real world are a type of relational model—relational models are the building block of most modeling techniques.
Relational models themselves don’t necessarily give data teams the best way to scale their data models because of their focus on relationships and not business logic or use cases; instead, it offers a foundation to build more robust relational modeling processes, such as dimensional modeling, to establish data models that scale in an organized and repeatable manner.
Good fit for:
- Low volume data and a small number of data sources
- Simple use cases for the data
- Low-medium concern around data warehousing costs (because relational models often need joining to be made meaningful)
Dimensional data model
Dimensional data modeling is a type of relational model that puts entities into two buckets of facts and dimensions (aka as the bread and butter of analytics work 😂). Dimensional modeling is one of the most predominant types of data modeling used in modern data stacks, as it offers a unique combination of both flexibility and constraint.
A fact is a collection of information that refers to an action, event, or result of a business process. Facts typically describe the verbs of your business (ex. account creations, payments, and email events). A dimension, on the other hand, describes who or what took that action (think: users, accounts, customers, and invoices—the nouns of your business). Together, fact and dimension models form the tables that describe many modern businesses and offer an intuitive way to design the entities in your data warehouse.
Before the rise and wide adoption of cloud data warehouses—where storage is cheap and compute costs became the larger concern—dimensional models were often designed into star or snowflake schemas to concretely define fact and dimension tables. This clean breakup of models was aimed to reduce the cost of large, wide tables with many columns.
Good fit for:
- Medium-to-large number of data sources
- Centralized data teams
- End-use case for data is primarily around business intelligence and providing insights
- Teams that want to create an easily navigable and predictable data warehouse design
Read more here on how we like to structure our dbt projects and data models!
Entity-relationship data model
Another type of relational model, entity-relationship (ER) data models have entities at the heart of their modeling. ER modeling is a high-level data modeling technique that is based on how entities, their relationships, and attributes connect together:
- Entities: Typically represent core business objects, events, or functions; some examples of an entity could be customers, accounts, and products.
- Attributes: The columns that describe an entity.
- Relationships: The actions or relationships that join entities together. For example,
productsentities can be joined by a
The focus of ER modeling is often around the processes of a business; while relational models can connect any which way, ER models are meant to form an architecture that is relatively concise and based on actual business procedures (events) and entities.
However, ER modeling can still be a bit vague around long-term database structure; it’s more of a way to show how tables can connect together and is useful to understand database architecture at a higher level. To make ER models more meaningful and scalable, data teams will often take the foundational learnings from an ER model to form more complex data model architectures like dimensional modeling.
Good fit for:
- Low complexity data that connects neatly together
- Simple, business-focused downstream use cases for the data
- Central data teams that have deep knowledge of the facets of their data
Many data sources you ingest into your data warehouse via an ETL tool will have ERDs (entity relationship diagrams) that your team can review to better understand how the raw data connects together. Slightly different from an ER model itself, ERDs are often used to represent ER models and their cardinality (ex. one-to-one, one-to-many) in a graphical format. These ERDs will often look a little like the relational model shown earlier, demonstrating how tables connect together. Using these diagrams with a data modeling technique of your choice, such as dimensional modeling, helps data teams efficiently wade through raw data and create business entities of meaning.
Data vault model
Data vault is a type of data model that abstracts entities, their relationship to each other, and their attributes into separate tables called hubs, links, and satellites:
- Hubs: Contain the unique key of the entity they represent, as well as a hashed representation (as a surrogate key). Some example hubs for a typical DTC shop would potentially be
- Links: Contains the foreign keys to other hubs. In our example shop above, a
h_customerhub would likely have a link to a
h_paymenthub by having a link containing the customer id and a payment id.
- Satellites: Contain the attributes, or descriptors, of an entity. For a
h_customerhub, a satellite could contain information like customer shipping address, email, NPS, etc.
Data vault architecture was invented to easily track data changes by having an insert-only mindset; whenever a row changes in a classical data model, a new row is either added or the existing row is modified. In a data vault world, data updates are represented only by new rows.
Like other data models that encourage the use of staging and mart layers, data vault architectures also support the use of raw, business, and information vaults. The consistency in the structure of a data vault encourages data teams to build a layer of organization for their data; this “safety” layer allows some teams to build other types of data models on top of these foundational models.
A considerable amount of data vault modeling can feel very repetitive or prescriptive given the consistent structure of hubs, links, and satellites. With dbt, you can use the dbtvault package to speed up the development time of fundamental data vault models to focus on writing the SQL that really matters to your business.
Good fit for:
- Enterprise teams where the ability to audit data is a primary concern
- Teams that need flexibility and who want to make large structural changes to their data without causing delays in reporting
- More technical data teams that can manage and govern the network-like growth of data vault models
The best data modeling technique is a consistent technique
At the end of the day, there’s no clear “right” or “wrong” data modeling technique for your data and business. However, there are ones that are probably more appropriate and aligned with the skillsets of your team. Before you commit to a data modeling technique, ask yourself (and your team members):
- What are my primary end use cases for the data?
- Who needs to be able to query my final data or navigate through my data warehouse?
- How many data sources am I dealing with? How complex is my data?
- Which data modeling technique is best suited for the raw data formats and data warehouse I’m working with?
- What would governance and standardization look like with this modeling technique?
Like most analytics work, there’s a fine line between rigidity and flexibility with data modeling; you’ll likely never have a full warehouse of perfect dimensional data models that meet Kimball or Inmon’s standards because you adjust these techniques to your data and the needs of the business.
And while data modeling is definitely more art than science at times, there’s still a vital need for governance in this practice. Using a data transformation tool that supports any data modeling technique, peer review, integrated testing and documentation, and version control is the first step to governing and creating a consistent process for your data modeling.
Ready to learn more about how dbt can support your data modeling efforts? Take a look at some of the resources below to see how modern data teams are transforming (😉) the way they tackle data modeling with dbt:
- Data transformation: The foundation of analytics work
- Data modeling techniques for more modularity
- Data Transformation for Snowflake Data Cloud
Last modified on: Nov 29, 2023