Query the latest records in a data set
21 Apr 2016You have some data that tracks a sequence of events for a group of entities. For example, a sequence of transactions for a set of bank accounts or a list of orders for a group of customers.
Your manager is interested to see all the details on the latest order from all your customers.
Your orders data likely has a lot of info, but most importantly it will have some reference to the customer, some unique way to identify individual orders, and some way to determine the chronological order of the orders. For simplicity, assume the order id is an incremental sequence (1, 2, 3, etc).
ORDER_ID | CUSTOMER | DATE | AMOUNT | DESCRIPTION |
---|---|---|---|---|
1 | BOB | 2015-05-02 | 105 | Learn SQL |
2 | CHRIS | 2015-05-02 | 55 | Video Editing |
3 | JILL | 2015-05-04 | 33 | Cooking |
4 | BOB | 2015-06-04 | 250 | Learn Java |
5 | BOB | 2015-06-04 | 89 | Learn CSS |
6 | JILL | 2015-06-22 | 50 | Start a home business |
7 | KIM | 2015-07-04 | 25 | Learn Psychology |
8 | CHRIS | 2015-07-06 | 45 | Travel Italy |
9 | JILL | 2015-07-12 | 65 | Build mobile apps |
10 | KIM | 2015-07-23 | 70 | Best Surfing Locations |
I typically see analysts new to SQL initially attempt to answer this question with a query like
select
max(order_id)
, customer
, date
, amount
, description
from orders
group by
customer
, date
, amount
, description
Fortunately, even new SQL users are quick to realize that this is not giving them the answer they want. The MAX aggregate function simply shows the greatest value within the set of values being grouped on.
So let’s change this to at least tell us the latest order id for each customer.
select
customer
, max(order_id)
from orders
group by
customer
Our small sample set would then result in
CUSTOMER | MAX(ORDER_ID) |
---|---|
BOB | 5 |
CHRIS | 8 |
JILL | 9 |
KIM | 10 |
A good first step, but our manager wants to know the details on these latest orders too.
One solution that is a natural next step from the query we just wrote is to use this query as way to limit another query back against the orders table. There is a couple ways to do this. The most common form is similar to this.
select
order_id
, customer
, date
, amount
, description
from orders
inner join (
select
customer
, max(order_id) max_order_id
from orders
group by
customer
) m
on orders.order_id = m.max_order_id
This works, but it is not my preferred approach.
My preferred solution does not use the aggregate MAX function but instead utilizes the analytic RANK function.
select
order_id
, customer
, date
, amount
, description
, rank() over (partition by customer order by order_id desc) rank_order
from orders
The result will look like
ORDER_ID | CUSTOMER | DATE | AMOUNT | DESCRIPTION | RANK_ORDER |
---|---|---|---|---|---|
1 | BOB | 2015-05-02 | 105 | Learn SQL | 3 |
2 | CHRIS | 2015-05-02 | 55 | Video Editing | 2 |
3 | JILL | 2015-05-04 | 33 | Cooking | 3 |
4 | BOB | 2015-06-04 | 250 | Learn Java | 2 |
5 | BOB | 2015-06-04 | 89 | Learn CSS | 1 |
6 | JILL | 2015-06-22 | 50 | Start a home business | 2 |
7 | KIM | 2015-07-04 | 25 | Learn Psychology | 2 |
8 | CHRIS | 2015-07-06 | 45 | Travel Italy | 1 |
9 | JILL | 2015-07-12 | 65 | Build mobile apps | 1 |
10 | KIM | 2015-07-23 | 70 | Best Surfing Locations | 1 |
Sweet! Instead of using that ugly subquery I simply added a single line to a basic SELECT against the orders table. We are not at the desired result yet, but let’s break down what is happening here.
The over
clause identifies
this form of RANK as an analytic function. Next, the partition by customer
piece instructs
RANK to rank order each subset of rows that have an identical customer value. If we had
omitted the partition by
entirely, then it would treat the entire table as one subset. Finally, the order by
is required for RANK and instructs
in what way the rows should be ranked. In our case, order by order_id desc
,
the order_id value in descending order.
The result is now each of the rows we are interested in has a RANK_ORDER = 1. So we just need to modify the query slightly to keep those and remove the others.
Unfortunately, you cannot simply add a WHERE clause on a RANK() function. Analytic functions are not permitted in a WHERE clause. So we have to wrap our query as a subquery.
select * from (
select
order_id
, customer
, date
, amount
, description
, rank() over (partition by customer order by order_id desc) rank_order
from orders
) o
where
o.rank_order = 1
So now we have our desired result. Apart from the little extra cruft needed to filter the rows, this last query is more readable. It is also quick to change if you need to answer a variety of similar questions.
Customers’ first orders
rank() over (partition by customer order by order_id) rank_order
Customers’ largest orders
rank() over (partition by customer order by amount desc) rank_order
Largest order by day
rank() over (partition by date order by amount desc) rank_order