Monday, September 12, 2011

OLAP Counseling

In his great novel Brave New World Aldous Huxley divided The World State into five heavily conditioned castes: the fully developed Alphas, and the arrested development Beta, Gamma, Delta and Epsilon castes. I know I'm not an Epsilon because I know what an Epsilon is and they don't.

But I am just as surely not an Alpha, and that means I'm not an Alpha Plus, the higher caste-within-a-caste, the leaders and intellectual elites.

Let's say I'm a Beta. OK, Gamma, but let's pretend for a moment, shall we? (Hey! Who shouted "Delta"? :)


"Alpha children wear grey. They work much harder than we do, because they're so frightfully clever. I'm really awfully glad I'm a Beta, because I don't work so hard. And then we are much better than the Gammas and Deltas. Gammas are stupid. They all wear green, and Delta children wear khaki. Oh no, I don't want to play with Delta children. And Epsilons are still worse. They're too stupid to be able ..." - Wikiquote


Here's what I'm afraid of:

Only Alphas understand OLAP


The WINDOW clause and the associated window functions were invented by Alpha Pluses for use by the Alpha caste, that's what I'm afraid of. (Either that, or I'm a Delta... that's always possible :)

Of course, The World State is fictional, this is the real world, and in the real world it is possible for the upper castes to teach the rest of us about new stuff. Here's Glenn Paulley responding to my earlier complaint about the WINDOW clause not answering the question "How do I select the first row in each partition of a table?":

It's helpful to think of WINDOW as augmenting the result set that is generated by the FROM/WHERE clause - WINDOW doesn't *just* return the aggregate values for groups (if PARTITION is used), it adds to the projection of the result set.

Consequently, if what you want is *the* row that is the "lowest" of a partition, you have to eliminate all of the other rows from result set.

FIRST_VALUE does indeed return the first value in the partition. However, you need to restrict the result set to eliminate those rows from t where the row is not the first value for that partition - and that's a bit harder to do because it's possible to have duplicate values for either entry_id or data within each partition.

A better way of getting what you want is to use the RANK function, which will return 1 for the lowest value in each partition (and will label ties similarly). The query is
SELECT partition_id,
       entry_id,
       data 
  FROM ( select partition_id, 
                entry_id, 
                data,
                rank ( ) OVER partition_id_window AS entry_rank
           FROM t
         WINDOW partition_id_window AS (
                   PARTITION BY partition_id
                   ORDER BY entry_id
                ) ) DT
 WHERE entry_rank = 1
 ORDER BY partition_id;


Glenn's efforts to educate are most appreciated, especially since he knows it's an uphill task:

"Unfortunately, WINDOW functions remain beyond the experience of the majority of application developers, who typically have a working knowledge of the considerably simpler SQL:1992 standard."


So...

Let's Compare and Contrast


Here's the original problem: "How do I select the first row in each partition of a table?"
CREATE TABLE t ( 
   partition_id     INTEGER NOT NULL,
   entry_id         INTEGER NOT NULL,
   data             VARCHAR ( 10 ) NOT NULL,
   PRIMARY KEY ( partition_id, entry_id ) );

INSERT t VALUES ( 10, 1, 'aaa' );
INSERT t VALUES ( 10, 2, 'bbb' );
INSERT t VALUES ( 10, 3, 'ccc' );

INSERT t VALUES ( 20, 1, 'ppp' );
INSERT t VALUES ( 20, 2, 'qqq' );
INSERT t VALUES ( 20, 3, 'rrr' );

INSERT t VALUES ( 30, 1, 'xxx' );
INSERT t VALUES ( 30, 2, 'yyy' );
INSERT t VALUES ( 30, 3, 'zzz' );

SELECT [the first row in each partition of t];

partition_id,entry_id,data
10,1,'aaa'
20,1,'ppp'
30,1,'xxx'

Here's the OLAP solution using WINDOW and RANK():
SELECT partition_id,
       entry_id,
       data 
  FROM ( select partition_id, 
                entry_id, 
                data,
                rank ( ) OVER partition_id_window AS entry_rank
           FROM t
         WINDOW partition_id_window AS (
                   PARTITION BY partition_id
                   ORDER BY entry_id
                ) ) DT
 WHERE entry_rank = 1
 ORDER BY partition_id;

partition_id,entry_id,data
10,1,'aaa'
20,1,'ppp'
30,1,'xxx'

Here's the Old School solution using GROUP BY and MIN():
SELECT t.* 
  FROM t INNER JOIN ( SELECT partition_id     AS partition_id,
                             MIN ( entry_id ) AS min_entry_id
                        FROM t
                       GROUP BY partition_id
                    ) AS first_row
          ON t.partition_id = first_row.partition_id
         AND t.entry_id     = first_row.min_entry_id
 ORDER BY t.partition_id;

partition_id,entry_id,data
10,1,'aaa'
20,1,'ppp'
30,1,'xxx'

Here's the comparison:
OLAP Solution                   Old School Solution
=============                   ===================

Built upon a WINDOW clause.     Built upon a GROUP BY clause.

Uses RANK() to find the first   Uses MIN() to find the
row in each PARTITION.          first row in each GROUP.

A derived table is required     A derived table is required
because RANK() can't appear     because t.data can't appear
in a WHERE clause in the        in a SELECT based on
same SELECT.                    GROUP BY partition_id.

Requires a WHERE clause to      No WHERE clause is required.
make use of the RANK() result.   
The part about "a derived table is required" makes the two solutions pretty much the same when it comes to understandability: all other things being equal, any query that requires a derived table is going to be harder to write and understand than a query that doesn't need one.

Which query is best is not the point here, the point is that the WINDOW clause can be made to work using the RANK() function...

...which only Alphas know about?


When Betas (OK, Deltas) use OLAP they use the Help, and they learn what they can do by looking at topics like this:
all of which list MIN(), some of which list FIRST_VALUE(), and none of which list RANK().

Memo to the Doc Team: Please accept my apologies for dragging you into this! Stuff happens, and in normal circumstances a simple suggestion in DocCommentXchange would suffice, but the opportunity to make a smart-aleck comment point here was impossible to forgo!


HERE'S the scary part...

...and it's not about the Help or which method works or is easier to write and understand, it's this:
Is it important to performance to use OLAP instead of Old School solutions?
Here's a hint...

"...window functions have the potential to lead to better query execution plans by eliminating joins and nested queries, yet retaining the simplicity of a single SQL statement."
Glenn Paulley in The Window operator and window aggregate functions, March 31st, 2009

...maybe not in this case, but just because the Old School solution worked it doesn't mean we can give up on learning about OLAP and retreat to the familiar.

Apparently, not if we care about the improvements coming to us courtesy of the Alphas!

Dilbert.com


1 comment:

Anonymous said...

Thanks for answering my question (see my 2nd comment) on your article:)

It reminds me of a similar case when my attempts to "promote" the EXCEPT operator (cf. http://sqlanywhere-forum.sybase.com/questions/7153) seem to fail because of a general "Don't use it, the next guy won't understand it!" advice.

And now these Alphas Pluses are even about to introduce SQL/2011...

Great article, Breck!

Volker