Wednesday, December 14, 2011

If you can't JOIN 'em, ...

The question's simple, why should the answer be complicated?


Question: How do I select the first row in each partition of a table?
Here's the data:
CREATE TABLE t ( 
   partition_id     INTEGER NOT NULL,
   entry_id         INTEGER NOT NULL,
   data             VARCHAR ( 10 ) NOT NULL,
   PRIMARY KEY ( partition_id, entry_id ) );

INSERT t VALUES ( 10, 1, 'aaa' );
INSERT t VALUES ( 10, 2, 'bbb' );
INSERT t VALUES ( 10, 3, 'ccc' );

INSERT t VALUES ( 20, 1, 'ppp' );
INSERT t VALUES ( 20, 2, 'qqq' );
INSERT t VALUES ( 20, 3, 'rrr' );

INSERT t VALUES ( 30, 1, 'xxx' );
INSERT t VALUES ( 30, 2, 'yyy' );
INSERT t VALUES ( 30, 3, 'zzz' );
Here's what the result should look like:
SELECT [the first row in each partition of t];

partition_id,entry_id,data
10,1,'aaa'
20,1,'ppp'
30,1,'xxx'
If you didn't already know, could you tell what the question was by reading the answers in these earlier articles?

First, there was this Old School solution using MIN and GROUP BY:
SELECT t.* 
  FROM t INNER JOIN ( SELECT partition_id     AS partition_id,
                             MIN ( entry_id ) AS min_entry_id
                        FROM t
                       GROUP BY partition_id
                    ) AS first_row
          ON t.partition_id = first_row.partition_id
         AND t.entry_id     = first_row.min_entry_id
 ORDER BY t.partition_id;
Second, there was an OLAP solution using RANK and PARTITION BY:
SELECT partition_id,
       entry_id,
       data 
  FROM ( select partition_id, 
                entry_id, 
                data,
                rank ( ) OVER partition_id_window AS entry_rank
           FROM t
         WINDOW partition_id_window AS (
                   PARTITION BY partition_id
                   ORDER BY entry_id
                ) ) DT
 WHERE entry_rank = 1
 ORDER BY partition_id;
Both get points for being obscure, neither look much like an answer to the question.

Let's Try Again


Question: How do I select the first row in each partition of a table?
Let's start by breaking the question in two parts, and solving each part separately:

Question 1: How do I select all the unique values for partition_id?

Answer: Use DISTINCT when selecting all the values of partition_id.

There's no question what this code is doing, it's selecting all the unique values of partition_id just like the question asked:
SELECT DISTINCT
       t.partition_id 
  FROM t
 ORDER BY t.partition_id;

partition_id
10
20
30
Question 2: How do I select the first row in a single partition?

Answer: Use a WHERE clause to pick the partition, then FIRST and ORDER BY to select the first row.

Once again, it's clear from reading the code what the question was:
SELECT FIRST * 
  FROM t
 WHERE t.partition_id = 10
 ORDER BY t.entry_id;

partition_id,entry_id,data
10,1,'aaa'

What now?


Now comes the hard part: Combining those two result sets to get the answer to the original question.

Let's try this:
SELECT first_t.*
  FROM ( SELECT DISTINCT
                t.partition_id 
           FROM t
          ORDER BY t.partition_id ) AS unique_partition
  INNER JOIN
       ( SELECT FIRST * 
           FROM t
          WHERE t.partition_id = unique_partition.partition_id
          ORDER BY t.entry_id ) AS first_t 
 ORDER BY first_t.partition_id;
You knew that wouldn't work, didn't you? INNER JOIN needs an ON clause:
Could not execute statement.
There is no way to join 'unique_partition' to 'first_t'
SQLCODE=-146, ODBC 3 State="42000"
Line 1, column 1
Let's try the old-fashioned "comma join operator", it doesn't need an ON clause:
SELECT first_t.*
  FROM ( SELECT DISTINCT
                t.partition_id 
           FROM t
          ORDER BY t.partition_id ) AS unique_partition,
       ( SELECT FIRST * 
           FROM t
          WHERE t.partition_id = unique_partition.partition_id
          ORDER BY t.entry_id ) AS first_t 
 ORDER BY first_t.partition_id;
Different problem this time; you can't refer to a column in the left-hand side of a join in the right-hand side:
Could not execute statement.
Illegal reference to correlation name 'unique_partition'
SQLCODE=-824, ODBC 3 State="42S02"
Line 1, column 1
One last try; let's move the WHERE clause to an ON clause so INNER JOIN will work:
SELECT first_t.*
  FROM ( SELECT DISTINCT
                t.partition_id 
           FROM t
          ORDER BY t.partition_id ) AS unique_partition
  INNER JOIN
       ( SELECT FIRST * 
           FROM t
          ORDER BY t.entry_id ) AS first_t 
  ON first_t.partition_id = unique_partition.partition_id 
 ORDER BY first_t.partition_id;
Yeah, it "works" all right, but it only returns one row instead of three; the predicate really needs to be inside the inner SELECT, not the outer one:
partition_id,entry_id,data
10,1,'aaa'

You've seen that before, haven't you?


Let's be honest, you've seen those error messages and wrong results before, haven't you? Like when you were trying to solve a problem that should have been easy (like this one) but you had to fiddle around with complex subqueries or temporary tables or even (gasp!) cursor fetch loops, and you ended up with a solution that looked like it was designed for a whole different problem.

It turns out there is a join operator that makes both of those error messages go away:
  • it doesn't need an ON clause, and

  • it does allow references from the right-hand side back to the left-hand side.

It's called CROSS APPLY...


...and it looks like this:
SELECT first_t.*
  FROM ( SELECT DISTINCT
                t.partition_id 
           FROM t
          ORDER BY t.partition_id ) AS unique_partition
  CROSS APPLY
       ( SELECT FIRST * 
           FROM t
          WHERE t.partition_id = unique_partition.partition_id
          ORDER BY t.entry_id ) AS first_t 
 ORDER BY first_t.partition_id;

partition_id,entry_id,data
10,1,'aaa'
20,2,'bbb'
30,3,'ccc'
CROSS APPLY is like the comma-join and CROSS JOIN operators in that it doesn't need (and cannot have) an ON clause. It is different from other join operators in that predicates on the right-hand side can refer to columns on the left-hand side.

So, like comma-join and CROSS JOIN, the CROSS APPLY joins every single row on the left-hand side with every single row on the right-hand side...

but...


...exactly what constitutes "every single row on the right-hand side" depends on each row on the left-hand side.

In this case, for each row on the left-hand side, the set consisting of "every single row on the right-hand side" is boiled down to one single row, the first row with matching partition_id.

So, CROSS APPLY is like INNER JOIN with special powers.



References:


3 comments:

Anonymous said...

Great insight, as usual:)

However, I feel the need to ask "Do you really think this CROSS APPLYcation is less obscure than the two queries cited in the first paragraphs?"

Regards
Volker Feels Gamma Now

Breck Carter said...

@Volker: Yes, IMO the CROSS APPLY is as clear as SQL can be. Transliterate the left-hand side into English and you get "select each distinct partition_id" and the right-hand side is "select the first row by entry_id for each partition_id" which is exactly what the question asked. Try to transliterate the GROUP BY and OLAP solutions directly into English, see what you get: the very definition of "obscure".

Anonymous said...

@Breck: After some thought (and your new blog article) I agree with you - the query is much less obscure, and so is its structure.

It's primarily the fact that CROSS APPLY itself is not that well-known that makes it more difficult IMHO - it just requires some more learning on my part.

Thanks for helping me to do so:)

Best regards
Volker Now Thinks Learning OLAP Is Lots Harder