Daily Archives: September 12, 2014

The Joy of QUALIFY

SQL is a set-based language and so when you write a SQL statement what you are really doing is taking a set of data and moving it through a series of transformations to produce new sets of data which you may then subsequently move through even more transformations. Eventually you will arrive at the end result, which is a set of data that (hopefully) answers a question that you or others were asking. Your ability to write and use SQL 1 is highly related to your ability to visualize sets of data in your head, and also to your ability to look at a current set of data (Point A), have an idea of the end-state set of data which represents the response to your question (Point Z), and understand what intervening steps are necessary to get the data from Point A to Point Z. Most useful questions of any complexity involving a database will require more than one transformation or step to get the answer you want. In some cases you may have ten steps, twenty steps, or more. It may be difficult enough to visualize or troubleshoot the transformation process that staging tables are necessary along the way to examine data in intervening steps.

I find the WITH clause very useful for factoring SQL transformations into multiple, subsequent steps, and allowing me to troubleshoot and visualize each step along the way. You can use multiple nested subqueries and more or less obtain equivalent functionality, but I think they are much harder to read and troubleshoot. This is especially true if you or others have to go back and read your code months later.

However, I digress. Today I would like to specifically discuss the QUALIFY statement, which to the best of my knowledge is only available in Teradata, and in my humble opinion is very useful. To be able to understand the QUALIFY clause you first have to understand Analytic functions, Aggregate functions, and the difference between the two. Analytic functions are a large topic unto themselves and I could write many other blog posts about them.

If you do not understand Analytic functions in SQL, stop reading now and go look at them first :)! This will only make sense if you understand Analytic functions fairly well. By the way, if you are using an Aggregate function on a table to compute some values and then self-joining back to the same original table you ran the Aggregate function on — you probably want an Analytic function. The Analytic function will be much easier to read and understand, and more importantly will save a lot of database resources. In my experience using an Analytic function for this results in one table scan instead of two table scans and then the join for the Aggregate self-join method.

Having gotten that out of the way, I often find myself in the situation that I want to write a query for a set of data out of a table, use an Analytic function to compute a value for each row, and then filter the set only for the rows which match a particular analytic function result. For example, say I have a set of data that represents a set of bank account transactions for a group of users for a day. Here is the original data set:

User Transaction_Amount Transaction_Date Transaction_Time
jsmith $10.50 9/10/2014 1:12 PM
jsmith $4.25 9/10/2014 8:30 AM
jsmith $8.42 9/10/2014 4:00 AM
jsmith $50.35 9/10/2014 8:00 PM
mjane $5.90 9/10/2014 5:30 AM
mjane $12.45 9/10/2014 5:00 PM
mjane $80.43 9/10/2014 10:00 AM

Within a given day, I want to order the transactions by transaction time, and then only get the last transaction by transaction time for each user. So, the first part is writing an analytic function to add a calculated column to the above dataset to do this ordering:

SELECT
ACCOUNT.*,
RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) AS RANK_TRANSACTION_TIME
FROM ACCOUNT

That should yield this dataset:

User Transaction_Amount Transaction_Date Transaction_Time RANK_TRANSACTION_TIME
jsmith $10.50 9/10/2014 1:12 PM 2
jsmith $4.25 9/10/2014 8:30 AM 3
jsmith $8.42 9/10/2014 4:00 AM 4
jsmith $50.35 9/10/2014 8:00 PM 1
mjane $5.90 9/10/2014 5:30 AM 3
mjane $12.45 9/10/2014 5:00 PM 1
mjane $80.43 9/10/2014 10:00 AM 2

Now, I want only the rows with a RANK_TRANSACTION_TIME of 1 (there was a reason I used an ORDER BY DESC statement in the Analytic function!) Unfortunately, in Oracle and MS SQL Server you can’t reference a column created/calculated by an analytic function in the same SQL statement scope. In other words I would like to do this:

SELECT
ACCOUNT.*,
RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) AS RANK_TRANSACTION_TIME
FROM ACCOUNT
WHERE RANK_TRANSACTION_TIME = 1

But if I try to do that, here is the error message that Oracle will give me:


ORA-00904: "RANK_TRANSACTION_TIME": invalid identifier
00904. 00000 - "%s: invalid identifier"

What about this?

SELECT
ACCOUNT.*,
RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) AS RANK_TRANSACTION_TIME
FROM ACCOUNT
WHERE RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) = 1

Nope:


ORA-30483: window functions are not allowed here
30483. 00000 - "window functions are not allowed here"
*Cause: Window functions are allowed only in the SELECT list of a query.
And, window function cannot be an argument to another window or group
function.

Why is this? Well, it has to do with the order of evaluation in SQL Statements. Here is how a standard relational database goes through evaluating a single SQL statement — there is an order to how each part of the SQL statement is computed/evaluated:

filtering example 1

Database servers normally evaluate a WHERE clause and only after that do they go back and compute an Analytic function. You can use this feature purposefully to limit the number of rows that you compute an Analytic function over, but in the case that we are discussing today that does not help. So what do you normally end up having to do? Something like this:

WITH SUBQUERY AS
(SELECT
ACCOUNT.*,
RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) AS RANK_TRANSACTION_TIME
FROM ACCOUNT)
SELECT * FROM SUBQUERY WHERE RANK_TRANSACTION_TIME = 1

Or the alternate subquery format (like I said before, I prefer the WITH clause although in this small example it isn’t a big deal):

SELECT * FROM
(SELECT
ACCOUNT.*,
RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) AS RANK_TRANSACTION_TIME
FROM ACCOUNT)
WHERE RANK_TRANSACTION_TIME = 1

This is not that bad in our current situation, but in more complex queries you often end up needing to compute analytic functions multiple times and then select the rows from the result sets. Wouldn’t it be nice if you had a way to filter only for rows with particular results from your analytic functions without having to use WITH or a subquery? Well, in Teradata you can! It is called the QUALIFY clause, and it looks like this:

SELECT
ACCOUNT.*,
RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) AS RANK_TRANSACTION_TIME
FROM ACCOUNT
QUALIFY RANK() OVER (PARTITION BY USER ORDER BY TRANSACTION_TIME DESC) = 1

No subquery, no WITH clause! It might just be syntactic sugar, but I find it useful in simplifying my queries and logic. I believe only Teradata currently offers this feature. So here is what the whole flow-of-logic appears to me to be for the various steps that comparable vendors go through in their DB engines 2 :

 

filtering example 2

Voila, the desired result set:

User Transaction_Amount Transaction_Date Transaction_Time RANK_TRANSACTION_TIME
jsmith $50.35 9/10/2014 8:00 PM 1
mjane $12.45 9/10/2014 5:00 PM 1