dbt Labs

Analytics Playbook: Create a log of historical data for richer reporting

on Jun 03, 2024

At early stage tech companies, the application database is often the richest source of data available for analysis. Getting this data into a data warehouse like Redshift enables your analysts to examine nearly every part of the business. Unfortunately, while this data is rich in meaning, it is also usually chaotic. Records are regularly deleted and updated as dealing with business edge cases takes priority over having the cleanest possible database.

A common example of this is order status. When an item is shipped out of your warehouse, or a customer service representative cancels an order, or a developer runs a mass update to reconcile a bug, records get updated in-place. While this makes it easy to update orders in day-to-day operations, it also robs you of the crucial ability to look at historical data.

This is a really big problem! Without this data, there are entire areas of your business that you simply cannot explore. Luckily, there are some ways to deal with it.

First, you can ask your developer team to create historical tables, restrict access, refine practices, and so on. This ends up being very difficult to do effectively. There are always more edge cases that require editing the database. You lose the benefit of the "just update the database" quick fix. And, most significantly, it requires building a massive table in the production database for analytics purposes only.

Fortunately for us, there is an alternative: we can create a historical record right in the warehouse. In the rest of this post, we'll explore a simple example and look at how to recreate this historical data.

The rest of this post uses Postgres/Redshift-compatible SQL.

The Data

Let's look at some data to clarify the problem. We have an orders table, which contains information about orders that have been placed, and what stage of processing they are in:

If we look at the first order (id 1), we can see that it was placed on July 8 and completed on July 17 (when the order was last updated). We can infer that this order was shipped at some point between July 8 and July 17, but we don't know exactly when!

An idea

What we need is a table that looks like this:

Here we can see the historical statuses, and the associated dates. Order 1 was placed on July 8, shipped on July 10, and marked completed on July 17. valid_to is null for that last record, meaning that this is the current version of this record in the production data. This gives us a complete view of how these orders have progressed. So, how do we get this in the warehouse?

A Solution

It's totally possible to create this intermediate table. We need a query that identifies changed records. Then, for each record that has changed, we need to insert a new row into the log table, and update the previous "current" row to make it not the current row any longer.

I've come up with a monster query to generate & update the desired table. Brace yourself, the SQL to accomplish this is somewhat nasty. (Don't fret too much over how this SQL works, there's a better way to be found below.)

If you run this whole thing periodically, it will keep updating the orders_log table over time with more and more historical data. But, it's painful to run and manage this query. Each time the schema changes in the source data, we'll have to update it. Each additional log table is equally burdensome to set up and maintain. Fortunately, there's a simpler way...

Enter dbt archive

I'm one of the developers of an open source data modeling and warehouse management tool called dbt that does all this out of the box.

All we do is create a dbt project, and specify some YAML configuration:

Now, you can simply run:$ dbt archive

to generate the log table. You can also see it in action below.

Not only does this accomplish everything we laid out here, it also handles schema changes, and can support additional tables easily.

You can find the example from this post (complete with dbt project) on Github.

Interested in dbt? Join us on Slack, or check out the documentation.

BUT!

Of course, historical data will only be available from the date you begin running this process. So, if you have frequently changing tables in your warehouse, you should set up this process sooner rather than later!

Published on: Jul 18, 2017

2025 dbt Launch Showcase

Catch our Showcase launch replay to hear from our executives and product leaders about the latest features landing in dbt.

Watch the launch event replay

Set your organization up for success. Read the business case guide to accelerate time to value with dbt.

Read now

Latest posts

Learn13 min

How dbt and Tableau bring better governance to analytics

Kathryn Chubb

on Jul 11, 2025

Product7 min

How Roche unified global data, enabled AI at scale, and improved operational efficiency

Hrishi Kulkarni,Ernesto Ongaro

on Jul 10, 2025

Product6 min

Virgin Media O2 rebuilds its data stack and speeds up the customer experience

Hrishi Kulkarni,Ernesto Ongaro

on Jul 10, 2025

The dbt Community

Join the largest community shaping data

The dbt Community is your gateway to best practices, innovation, and direct collaboration with thousands of data leaders and AI practitioners worldwide. Ask questions, share insights, and build better with the experts.

Join the Community Explore the community

100,000+active members

50k+teams using dbt weekly

50+Community meetups