Wednesday, September 7, 2011

Unrequited OLAP

OLAP is so attractive from afar, it solves so many Big Important Problems, how can anyone not love it?

Well, it's easy not to love OLAP: just try something simple, something that should be easy given all the expressive power built in to the WINDOW clause and all the related window functions like, say, FIRST_VALUE.


Question: How do I select the first row in each partition of a table?

Here's an example:
CREATE TABLE t ( 
   partition_id     INTEGER NOT NULL,
   entry_id         INTEGER NOT NULL,
   data             VARCHAR ( 10 ) NOT NULL,
   PRIMARY KEY ( partition_id, entry_id ) );

INSERT t VALUES ( 10, 1, 'aaa' );
INSERT t VALUES ( 10, 2, 'bbb' );
INSERT t VALUES ( 10, 3, 'ccc' );

INSERT t VALUES ( 20, 1, 'ppp' );
INSERT t VALUES ( 20, 2, 'qqq' );
INSERT t VALUES ( 20, 3, 'rrr' );

INSERT t VALUES ( 30, 1, 'xxx' );
INSERT t VALUES ( 30, 2, 'yyy' );
INSERT t VALUES ( 30, 3, 'zzz' );

SELECT [the first row in each partition of t];

partition_id,entry_id,data
10,1,'aaa'
20,1,'ppp'
30,1,'xxx'

Answer: OLAP to the rescue! It's got the PARTITION keyword, and ORDER BY, and RANGE, and FIRST_VALUE, everything you need!

Dilbert.com

Hubba hubba, here we go! ...
SELECT partition_id,
       FIRST_VALUE ( entry_id ) OVER partition_id_window AS first_entry_id,
       FIRST_VALUE ( data     ) OVER partition_id_window AS first_data
  FROM t
WINDOW partition_id_window AS (
         PARTITION BY partition_id
         ORDER BY entry_id
         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING )
 ORDER BY partition_id;

partition_id,first_entry_id,first_data
10,1,'aaa'
10,1,'aaa'
10,1,'aaa'
20,1,'ppp'
20,1,'ppp'
20,1,'ppp'
30,1,'xxx'
30,1,'xxx'
30,1,'xxx'

Well, that sucks!

That's not even close!

You can't code RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED PRECEDING, that throws a syntax error. In fact, the RANGE clause seems to be singularly useless, it has so few options.

It's almost like the WINDOW clause was designed by a standards committee.

Old School

Remember GROUP BY? Yeah, we used to think THAT was complicated before the WINDOW clause came along.
SELECT t.* 
  FROM t INNER JOIN ( SELECT partition_id     AS partition_id,
                             MIN ( entry_id ) AS min_entry_id
                        FROM t
                       GROUP BY partition_id
                    ) AS first_row
          ON t.partition_id = first_row.partition_id
         AND t.entry_id     = first_row.min_entry_id
 ORDER BY t.partition_id;

partition_id,entry_id,data
10,1,'aaa'
20,1,'ppp'
30,1,'xxx'

Tell me it ain't so!

Please, someone tell me the WINDOW clause isn't so hard to get.

That may be wishful thinking, however...

"Window operations don’t add expressive power to the SQL language because other constructions, which may require multiple statements, can provide the same functionality as window functions."
Glenn Paulley in The Window operator and window aggregate functions, March 31st, 2009

3 comments:

Anonymous said...

I would agree that WINDOW clauses are not that easy (and I'm using them too rarely to make noticable improvements)...

However, in the current case I guess a simple SELECT DISTINCT would do. Though I agree that I would have expected FIRST_VALUE() to do that on its own, based on experience with MAX() and other aggregate functions.

And now we're waiting for Glenn's explanation...

Volker

Glenn Paulley said...

It's helpful to think of WINDOW as augmenting the result set that is generated by the FROM/WHERE clause - WINDOW doesn't *just* return the aggregate values for groups (if PARTITION is used), it adds to the projection of the result set.

Consequently, if what you want is *the* row that is the "lowest" of a partition, you have to eliminate all of the other rows from result set.

FIRST_VALUE does indeed return the first value in the partition. However, you need to restrict the result set to eliminate those rows from t where the row is not the first value for that partition - and that's a bit harder to do because it's possible to have duplicate values for either entry_id or data within each partition.

A better way of getting what you want is to use the RANK function, which will return 1 for the lowest value in each partition (and will label ties similarly). The query is

SELECT partition_id,
entry_id,
data
FROM ( select partition_id, entry_id, data,
rank ( ) OVER partition_id_window AS entry_rank
FROM t
WINDOW partition_id_window AS (
PARTITION BY partition_id
ORDER BY entry_id
) ) DT
WHERE entry_rank = 1
ORDER BY partition_id;

Anonymous said...

Breck, having Glenn's thorough explanation:

Now I'd like to know whether you find the WINDOW clause easier to use (and(or understand) than before...

Volker