Write better SQL: In defense of group by 1

last updated on Oct 15, 2024
I love group by 1. And every so often you'll see me use  group by 1, 2. My love for this little snippet of code developed in the summer of 2016. I was a fledgling data analyst, and I was writing BI queries on Redshift.
Here were my two options:
select  
  date_trunc('month', order_created_at) as order_month,
  
  count(order_id) as n_orders,
  sum(revenue) as total_revenue
from core.fct_orders
group by date_trunc('month', order_created_at)not_group_by_1.sql
select
  date_trunc('month', order_created_at) as order_month,
  
  count(order_id) as n_orders,
  sum(revenue) as total_revenue
    
from core.fct_ordersnngroup by 1group_by_1.sql
Already, we can start to see the problem of not using group by 1: I have to repeat my code.
What if I wanted to convert the timezone? Here are my two options again:
select
  date_trunc(
    'month',
    convert_timezone('Australia/Sydney', created_at)
  ) as order_month,
  
  count(order_id) as n_orders,
  sum(revenue) as total_revenue
from core.fct_orders
group by date_trunc('month', convert_timezone('Australia/Sydney', order_created_at))
not_group_by_1_even_more_complicated.sql
select
  date_trunc(
    'month',
    convert_timezone('Australia/Sydney', created_at)
  ) as order_month,
  count(order_id) as n_orders,
  sum(revenue) as total_revenue
from core.fct_orders
group by 1
still_just_group_by_1.sql
When you're intent on writing out your  group by columns, as the complexity of that first column increases, it gets harder to maintain and read the group by statement.
In my opinion, the second query is cleaner and easier to read. If I want to know what grain my report is at, it's easy for me to parse the 1, jump up a couple of lines of SQL and realize it's at the grain of  order_month.
Plus, as someone that writes a lot of SQL, I very much appreciate the fact that 1  is exactly one character.
In late 2018, Redshift introduced lateral column aliasing, meaning I could instead group by order_month , surely I'd switch to that!
-- post August 2018 option, thanks to lateral column aliasing
select
  date_trunc(
    'month',
    convert_timezone('Australia/Sydney', created_at)
  ) as order_month,
  
  count(order_id) as n_orders,
  sum(revenue) as total_revenue
  
from core.fct_orders
group by order_monthgroup_by_order_month.sql
I admit, that's pretty darn readable! And readability trumps DRY code. But yet, I still continue to group by 1 . In fact I have become even more ardent in my love of it. So much so that I may or may not own this keytag:
 are actually the easiest way to win Christmas.](/_next/image?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fwl0ndo6t%2Fmain%2F5a2ff159d544db4e6377ec37e72daf93a9c1fa06-772x340.png%3Ffit%3Dmax%26auto%3Dformat&w=1920&q=75)
So why do I still  group by 1 ? Well, there's a few reasons:
- Not every warehouse that I regularly use supports lateral column aliasing (I'm looking at you BigQuery and Postgres 👀).
- If you happen to alias a column with the name of an existing column, your SQL engine will assume you are referring to the existing column in your  group byclause (it's a rare case, but happens from time to time when you write enough SQL).
- 1is a single character (and I'm lazy, OK?!).
- I believe  group by 1helps you to write better SQL.
Let's dive into that last one a little more.
In my opinion, every reason against  group by 1  literally doesn't matter if you write good SQL. In fact, using  group by 1  can even help you recognize when your SQL has room for improvement. So let's go through some of the most common objections to  group by 1 . But first, let's change up our example to something that highlights this point a little better.
These days, as an analytics engineer, it's much less common that I'm writing a query to aggregate to a reporting period --- I let my BI tool do that aggregation for me. Instead, I am often writing queries to build dbt models, where I  group by  a foreign key to calculate some facts about an entity. Here's an example:
select
  order_id,
  
  count(payment_id) as n_payments,
  sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount
  sum(amount) as total_amount
from staging.payments
group by 1
order_payments.sql
Let's use this as the basis of our examples for the rest of the article as we wade through the most common objections to  group by 1 .
"But then people write group by 3, 4, 7 , which is awful." --- a luddite
Agreed! That is awful. But in my opinion, the columns you are grouping by should always be listed first in your query.
In this case,  group by 3, 4, 7  would help you recognize that you haven't followed this convention. If you had instead grouped by a column name, you might not realize that the order of your columns can be improved.
"But how will I know the grain of my model?  1  doesn't tell me that!" --- probably a flat-earther
I always use a line break to visually break up the columns I'm grouping by, and the aggregates (see the examples above). So my brain uses that to understand the grain.
Further, I also tend to put aggregating queries in a CTE, and use the CTE names to also call out the grain of the model, like so:
with payments as (
  select * from staging.payments
),
  
-- this model is at the grain of one per ordern-- you can tell, because it starts with `order_`
order_payments as (
  select
    order_id,
    
    count(payment_id) as n_payments,
    sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount
    sum(amount) as total_amount
    
  from payments
  
  group by 1
)
select * from order_paymentsorder_payments.sql
Side note: This "all logic in a CTE" pattern is one we use a lot --- it's especially helpful in queries with lots of CTEs, as it helps us chain CTEs together, and is quicker to debug. If you're not yet using it, give it a go and you'll see what we mean.
"But  group by 1  ends up in  group by 1, 2, 3, 4, 5, 6, 7 , which is hard to read!" --- actually my mortal enemy
I mean, is it really any easier to read  group by order_id, customer_id, order_date, status, ...  ?
But why is this happening in the first place? Why is a query grouping by so many columns? In my experience, there's two reasons why people  group by  more than one or two columns:
- The query is both joining and aggregating
- The query is using a  group byto deduplicate
And neither of them are actually very good patterns!
1. The query is both joining and aggregating
Consider this SQL, where some fields on  fct_orders  are being calculated from  payments  :
with orders as (
  select * from staging.orders
),
payments as (
  select * from staging.paymentsn),
    joined as (
      select
        orders.order_id,
        orders.customer_id,
        orders.order_date,
        orders.status,
        count(payments.payment_id) as n_payments,
        sum(case when payments.payment_method = 'credit_card' then payments.amount else 0 end) as credit_card_amount
        sum(payments.amount) as total_amount
        
      from orders
        
      left join payments using (order_id)
        
      group by 1, 2, 3, 4
    )
select * from joined
fct_orders.sql
I mean it looks tidy enough! But, it's hard to extend. What happens if you want to re-order the columns and put your dates at the end? Suddenly, you're breaking the convention of having the columns you want to group by listed first. What happens when you have more columns, or want to join another table to this? It gets really hard to manage the complexity.
Here, there's actually a better pattern --- group first in a CTE, then join it to your table.
with orders as (
  select * from staging.orders
),
  
  payments as (
  select * from staging.payments
),
  
order_payments as (
  select
  order_id,
  count(payment_id) as n_payments,
  sum(case when payment_method = 'credit_card' then amount else 0 end) as credit_card_amount
  sum(amount) as total_amount
  from payments
  group by 1
),
joined as (
  select
    orders.order_id,
    orders.customer_id,
    orders.order_date,
    orders.status,
    order_payments.n_payments,
    order_payments.credit_card_amount,
    order_payments.total_amount
  from orders 
  left join order_payments using (order_id)
)
select * from joinedfct_orders.sql
It's a little longer, but much more maintainable, and in my opinion, easier to understand what's going on.
(Bonus: this pattern is also far more performant!)
(Bonus bonus: if you're using dbt, consider breaking this CTE out into a separate model so you can test that  order_payments  only has one record per order!)
2. The query is using a  group by  to deduplicate
Consider this  group by  that's being used to deduplicate source data to be one record per  order_id .
with orders as (
  select * from staging.orders
),
  
deduplicated as (
  select
    order_id,
    customer_id,
    status,
    max(order_date) as order_date
  from orders
  group by 1, 2, 3
)
select * from deduplicated
stg_orders.sql
OK, sure that's not so bad. But what if your table has a dozen columns? That's a lot of columns to group by. And what if the  status  changed values? You'd end up with two records for the same  order_id , so you won't have effectively deduplicated!
Generally, group by  to deduplicate is a pretty limited pattern. Here are two other patterns we prefer --- heads up: they both use window functions.
Option 1: Create a row_number  and use a  where 
This  row_number  pattern is really useful for wide tables as it extends for extra columns really well!
with orders as (
  select * from staging.orders
),
orders_with_row_number as (
  select
    *,
    row_number() over (
        partition by order_id
      order by order_date desc
    ) as row_number
    
  from orders
),
deduplicated as (
  select * from orders_with_row_number
  where row_number = 1
)
select * from deduplicated  stg_orders.sql
Option 2: Use  distinct  with window functions
This pattern is a little tricky, but it can come in really handy for narrow tables, like creating a two column table to map one user identifier to another (the example below is still for the orders table though). It's one I only learned recently!
with orders as (
  select * from staging.orders
),
deduplicated as (
  select distinct
    order_id,
    last_value(customer_id) over (
      partition by order_id
      order by order_date
      rows between unbounded preceding and unbounded following
    ) as customer_id,
    max(order_date) over (
      partition by order_id
    ) as order_date
  from orders
)
select * from deduplicatedstg_orders.sql
We deviated a little from talking about  group by  here, but it's important to note that in my opinion, the concern about grouping by too many columns is more a symptom of a poorly structured query than a problem with the  group by 1  syntax. As a rough guess, I'd say that 95% of my  group by statements are group by 1, and most of the rest are  group by 1, 2 . It's very rare that I group by more than that.
Final thoughts
OK, I've intentionally written this to ruffle a few feathers, and I'm fully aware that some SQL dialects don't support  group by 1 (e.g. mssql).
Opinions can change --- for example, I used to be very pro-UPPER-CASE-FOR-SQL-KEYWORDS, and even, pro-leading commas.
In writing all of this out, I think I've even softened my opinion on group by column_name a little (heck, maybe it is better? When the from and group by are next to each other, maybe you can parse it a little more quickly. After all, brain time is the most expensive resource of all! Though, it is fewer characters...)
At the end of the day, be opinionated in your SQL style, put it in a git repo and stick with it consistently. Here's ours, and here's another one I love --- yes the latter recommends group by column_name and I still love it.
In all honesty, I can't wait for the day that someone builds a really good SQL linter based on a definable set of conventions, so that no matter which conventions you adopt, it's the job of a computer (and not me) to rant about breaking them.
{% endrant %}
Live virtual event:
Experience the dbt Fusion engine with Tristan Handy and Elias DeFaria on October 28th.
VS Code Extension
The free dbt VS Code extension is the best way to develop locally in dbt.



