The surrogate key macro

This post is a part of the SQL love letters—a series on the SQL functions the dbt Labs data team members use and love.

We all know one of the most fundamental rules in data is that every table should have a primary key. Primary keys are critical for many reasons:

  • They ensure that you don’t have duplicate rows in your table
  • They help establish relationships to other tables
  • They allow you to quickly identify the grain of the table (ex: the customers table with a PK of customer_id has one row per customer)
  • You can test them in dbt, to ensure that your data is complete and unique

If you are lucky, your data will come to you with a unique primary key already in place. Maybe it is an id generated by your internal product systems, or maybe you are pulling data in from a third party source that generates the id for you.

Sometimes, however, you find yourself in a situation where you don’t have a natural primary key.

For example, maybe you have joined your users to the set of features that they use in your product, such that your data should be expected to be unique on the user_id + product_id basis.

The question is - in this situation, how are you supposed to set a primary key? Meet the surrogate key.

What’s a surrogate key?

A surrogate key is a primary key that, instead of existing in your underlying dataset, is derived in the analytics layer itself.

Learning when to use surrogate keys and bring them into your project is a critical skill for any analytics professional.

Knowing when to use a surrogate key is actually quite easy: you should have a surrogate key on any table that doesn’t already have a unique primary key.

Knowing how to create a surrogate key can prove to be much more challenging. The reason for this is that it isn’t always easy to know the best way to implement surrogate keys. Should you just be blocking off time on your calendar every day to individually name each of your rows?

Turns out this is a relatively well-solved problem. To create a surrogate key, you traditionally follow these two steps.

  1. Concatenate together all of the fields required to make a unique row (for example, user_id and product_id)
  2. Apply a function to create a cryptographic hash (usually using the md5 function) on top of these to generate a unique id per combination of unique values

While the process of creating a surrogate key is relatively well understood, you will be shocked (SHOCKED I SAY) to hear that SQL syntax can have subtle differences across dialects and databases.

Surrogate keys in BigQuery, Databricks, Redshift and Snowflake

BigQuery, Redshift and Snowflake’s concat functions returns null if any of the referenced columns for that row returns a null, so to create a proper surrogate key you’d need to wrap each column in a coalesce before hashing with an md5 function:

md5 ( concat ( coalesce(column1, ''), coalesce(column2, '') ) )

Databricks’ concat function docs don’t specifically reference returning null for the concat if one column is null, but I believe that’s what’s meant by The result type matches the argument types.

You could also separate your columns with pipes (||) rather than using the concat function, but I generally stay away from pipes (one comma > two pipes).

Surrogate keys in Postgres

Postgres’ concat function ignores nulls, which saves you from having to wrap each column in a coalesce function to default nulls to blank.

If you used || instead of concat, one null column would cause the entire statement would return a null, breaking your concatenation. .

So in plain old PostgreSQL, you’d use:

md5 ( concat (column1, column2) )

The null value problem in surrogate keys

The primary annoyance when creating surrogate keys comes when you try and concatenate a row that has a null value for one or more columns. If any value is null, then often the entire concatenated string is returned as null - no good!

with 
example_ids as (
  123 as user_id,
  123 as product_id
  union all
  select
  123 as user_id,
  null as product_id
  union all
  select
  null as user_id,
  123 as product_id
)
select
  *,
  concat(user_id, product_id) as _surrogate_key
from example_ids

output:

USER_ID PRODUCT_ID _SURROGATE_KEY
123 123 123123
123 null null
null 123 null

You can get around this by wrapping each of your columns in a coalesce function to default nulls to blank (‘’), which is pretty tedious. You can also run into problems if the fields are different datatypes (string vs numeric), so sometimes you need to cast as well.

...
select
  *,
  concat(
    coalesce(cast(user_id as string), ''),
    coalesce(cast(product_id as string), '')
    ) as _surrogate_key
from example_ids

output:

USER_ID PRODUCT_ID _SURROGATE_KEY
123 123 123123
123 null 123
null 123 123

At first glance, this looks like it works, but in reality, there should be three unique IDs and instead, there are two: 123123 & 123, which could be a problem in the off chance there is any potential overlap in the sequencing used by the two IDs.

To remedy this, you need to add a separator between fields you wish to concatenate.

...
select
  *,
  concat(
    coalesce(cast(user_id as string), ''),
    '|',
    coalesce(cast(product_id as string), '')
    ) as _surrogate_key
from example_ids

output:

USER_ID PRODUCT_ID _SURROGATE_KEY
123 123 123|123
123 null 123|
null 123 |123

Let’s take a look at how generating surrogate keys specifically looks in practice across data warehouses, and how you can use one simple dbt macro (dbt_utils.surrogate_key) to abstract away the null value problem.

A surrogate_key macro to the rescue

Thanks to a handy function called surrogate_key in the dbt_utils package, you can fire yourself from the business of wrapping your columns in coalesce every time you want to generate a surrogate key.

Forming your surrogate keys with this macro has the benefit of elegant + DRY null handling.

Rather than wrapping your columns in a coalesce function when concatenating them, the macro loops through your columns and coalesces on your behalf, so that you can avoid repeating yourself.

When you call ``, behind the scenes dbt compiles SQL on your behalf, looping through each field and generating the correct number of coalesce statements with type casting:

  coalesce(cast(" ~ field ~ " as " ~ dbt_utils.type_string() ~ "), '')

and with conditional logic, adding separator between fields:


  {%- if not loop.last %}
    {%- set _ = fields.append("'-'") -%}
  {%- endif -%}

What does this mean in practice?

Well, you simply don’t have to think about your surrogate keys all that much. On any data warehouse, nulls or no nulls, it just works. Because honestly, who wants to spend more time than they need to thinking about surrogate keys?

Last modified on: Jul 27, 2022

Join us for Coalesce 2022

The annual conference dedicated to advancing analytics engineering