04 May 2016
Have you ever had a GROUP BY query and wished you could simply have the totals as well to validate against a control? Well guess what, SQL does provide a way to include totals and even sub-totals to your aggregated results. The GROUPING SETS clause allows you to define 1 or more levels to aggregate on and it will insert those summaries into your returned results.
Below is a simple example.
select
territory
, sum(sales) sales
from fact_sales
group by GROUPING SETS (
(territory)
,()
)
and the results look like this
TERRITORY |
SALES |
North |
1,000,000 |
South |
3,000,000 |
East |
500,000 |
West |
1,500,000 |
(null) |
6,000,000 |
That last line with (null) for territory is our grand total calculated by the GROUPING SETS. Let’s breakdown the syntax because it can be a little confusing at first glance.
First of all, the GROUPING SETS clause is part of the GROUP BY, so you still start off with GROUP BY
immediately followed by the GROUPING SETS
.
GROUPING SETS accepts a variable number of sets delineated by commas with each set defined within a pair of parentheses. The collection of sets together are also contained in a pair of parentheses.
group by -- start GROUP BY clause
GROUPING SETS -- start GROUPING SETS clause
( -- opening parenthesis for GROUPING SETS
(territory) -- first GROUPING SET
,() -- second GROUPING SET
) -- closing parenthesis for GROUPING SETS
The grouping set ()
denotes that it should total across all the results i.e. a grand total.
If you had only specified the (territory)
set then it would have been equivalent to a standard GROUP BY territory.
Things get a little more interesting when you have additional attributes to group on. Here is another example.
select
territory
, product_family
, sum(sales) sales
from fact_sales
group by GROUPING SETS (
(territory, product_family)
,(territory)
,()
)
The result of this query will include the sales totals by territory and product_family, the grand total, and also the sub-totals for each territory.
TERRITORY |
PRODUCT_FAMILY |
SALES |
North |
Produce |
500,000 |
North |
Drinks |
500,000 |
South |
Produce |
1,000,000 |
South |
Meat |
1,600,000 |
South |
Candy |
400,000 |
East |
Produce |
200,000 |
East |
Candy |
300,000 |
West |
Meat |
1,000,000 |
West |
Drinks |
500,000 |
North |
(null) |
1,000,000 |
South |
(null) |
3,000,000 |
East |
(null) |
500,000 |
West |
(null) |
1,500,000 |
(null) |
(null) |
6,000,000 |
You can use any mix of grouping sets that you need as long as each non-aggregated column is included in at least one grouping set. For example, omitting the (territory, product_family)
grouping set from the above query would have resulted in an error because product_family is in the SELECT portion but not in the GROUP BY (which is what happens even when not using GROUPING SETS).
If you do not want the totals to show as NULL, then you can use COALESCE to give a description.
select
coalesce(territory, 'Total') territory
, sum(sales) sales
from fact_sales
group by GROUPING SETS (
(territory)
,()
)
GROUPING SETS is useful. Unfortunately not every database supports the syntax. Most major databases do (Oracle, MS SQL Server, PostgresQL, etc.) Most notably, Amazon’s Redshift does not support GROUPING SETS. However, I imagine that will change as the product matures.
The ROLLUP and CUBE clauses provide similar functionality and may be supported by your database of choice.
21 Apr 2016
You have some data that tracks a sequence of events for a group of entities.
For example, a sequence of transactions for a set of bank accounts
or a list of orders for a group of customers.
Your manager is interested to see all the details on the latest order from all your customers.
Your orders data likely has a lot of info, but most importantly it will have
some reference to the customer, some unique way to identify individual orders,
and some way to determine the chronological order of the orders. For simplicity,
assume the order id is an incremental sequence (1, 2, 3, etc).
ORDER_ID |
CUSTOMER |
DATE |
AMOUNT |
DESCRIPTION |
1 |
BOB |
2015-05-02 |
105 |
Learn SQL |
2 |
CHRIS |
2015-05-02 |
55 |
Video Editing |
3 |
JILL |
2015-05-04 |
33 |
Cooking |
4 |
BOB |
2015-06-04 |
250 |
Learn Java |
5 |
BOB |
2015-06-04 |
89 |
Learn CSS |
6 |
JILL |
2015-06-22 |
50 |
Start a home business |
7 |
KIM |
2015-07-04 |
25 |
Learn Psychology |
8 |
CHRIS |
2015-07-06 |
45 |
Travel Italy |
9 |
JILL |
2015-07-12 |
65 |
Build mobile apps |
10 |
KIM |
2015-07-23 |
70 |
Best Surfing Locations |
I typically see analysts new to SQL initially attempt to answer this question with a query like
select
max(order_id)
, customer
, date
, amount
, description
from orders
group by
customer
, date
, amount
, description
Fortunately, even new SQL users are quick to realize that this is not giving them
the answer they want. The MAX aggregate function simply shows the greatest value within the set of values
being grouped on.
So let’s change this to at least tell us the latest order id for each customer.
select
customer
, max(order_id)
from orders
group by
customer
Our small sample set would then result in
CUSTOMER |
MAX(ORDER_ID) |
BOB |
5 |
CHRIS |
8 |
JILL |
9 |
KIM |
10 |
A good first step, but our manager wants to know the details on these
latest orders too.
One solution that is a natural next step from the query we just wrote is to use
this query as way to limit another query back against
the orders table. There is a couple ways to do this. The most common form is
similar to this.
select
order_id
, customer
, date
, amount
, description
from orders
inner join (
select
customer
, max(order_id) max_order_id
from orders
group by
customer
) m
on orders.order_id = m.max_order_id
This works, but it is not my preferred approach.
My preferred solution does not use the aggregate MAX function but instead
utilizes the analytic RANK function.
select
order_id
, customer
, date
, amount
, description
, rank() over (partition by customer order by order_id desc) rank_order
from orders
The result will look like
ORDER_ID |
CUSTOMER |
DATE |
AMOUNT |
DESCRIPTION |
RANK_ORDER |
1 |
BOB |
2015-05-02 |
105 |
Learn SQL |
3 |
2 |
CHRIS |
2015-05-02 |
55 |
Video Editing |
2 |
3 |
JILL |
2015-05-04 |
33 |
Cooking |
3 |
4 |
BOB |
2015-06-04 |
250 |
Learn Java |
2 |
5 |
BOB |
2015-06-04 |
89 |
Learn CSS |
1 |
6 |
JILL |
2015-06-22 |
50 |
Start a home business |
2 |
7 |
KIM |
2015-07-04 |
25 |
Learn Psychology |
2 |
8 |
CHRIS |
2015-07-06 |
45 |
Travel Italy |
1 |
9 |
JILL |
2015-07-12 |
65 |
Build mobile apps |
1 |
10 |
KIM |
2015-07-23 |
70 |
Best Surfing Locations |
1 |
Sweet! Instead of using that ugly subquery I simply added a single line to a
basic SELECT against the orders table. We are not at the desired result yet,
but let’s break down what is happening here.
The over
clause identifies
this form of RANK as an analytic function. Next, the partition by customer
piece instructs
RANK to rank order each subset of rows that have an identical customer value. If we had
omitted the partition by
entirely, then it would treat the entire table as one subset. Finally, the order by
is required for RANK and instructs
in what way the rows should be ranked. In our case, order by order_id desc
,
the order_id value in descending order.
The result is now each of the rows we are interested in has a RANK_ORDER = 1.
So we just need to modify the query slightly to keep those and remove the others.
Unfortunately, you cannot simply add a WHERE clause on a RANK() function.
Analytic functions are not permitted in a WHERE clause. So we have to wrap our
query as a subquery.
select * from (
select
order_id
, customer
, date
, amount
, description
, rank() over (partition by customer order by order_id desc) rank_order
from orders
) o
where
o.rank_order = 1
So now we have our desired result. Apart from the little extra cruft needed to
filter the rows, this last query is more readable. It is also quick to change
if you need to answer a variety of similar questions.
Customers’ first orders
rank() over (partition by customer order by order_id) rank_order
Customers’ largest orders
rank() over (partition by customer order by amount desc) rank_order
Largest order by day
rank() over (partition by date order by amount desc) rank_order