Friday, December 16, 2011

Another tool for the box



Here's another simple question:

Question: How do I select the top 10 rows in each partition of a table?"
Will CROSS APPLY work as well as it did before?

Here's the data. The table looks pretty much the same, except that the entry_id is autoincrementing instead of repeating and there's a lot more data; three partitions with 100 rows in each partition:
CREATE TABLE t ( 
   partition_id     INTEGER NOT NULL,
   entry_id         INTEGER NOT NULL DEFAULT AUTOINCREMENT,
   data             VARCHAR ( 10 ) NOT NULL,
   PRIMARY KEY ( entry_id ) );

BEGIN
   DECLARE @loop_counter INTEGER;
   SET @loop_counter = 1;
   WHILE @loop_counter <= 100 LOOP
      INSERT t ( partition_id, data ) VALUES ( 10, 'aaa' );
      INSERT t ( partition_id, data ) VALUES ( 20, 'bbb' );
      INSERT t ( partition_id, data ) VALUES ( 30, 'ccc' );
      SET @loop_counter = @loop_counter + 1;
   END LOOP;
   COMMIT;
END;

SELECT * FROM t ORDER BY entry_id;

partition_id,entry_id,data
10,1,'aaa'
20,2,'bbb'
30,3,'ccc'
10,4,'aaa'
20,5,'bbb'
30,6,'ccc'
10,7,'aaa'
20,8,'bbb'
30,9,'ccc'
...
partition_id,entry_id,data
10,298,'aaa'
20,299,'bbb'
30,300,'ccc'
Before tackling the "top 10" question, let's confirm that the CROSS APPLY answer to the previous question "How do I select the first row in each partition of a table?" works just as well on this table:
SELECT first_t.*
  FROM ( SELECT DISTINCT
                t.partition_id 
           FROM t
          ORDER BY t.partition_id ) AS unique_partition
  CROSS APPLY
       ( SELECT FIRST * 
           FROM t
          WHERE t.partition_id = unique_partition.partition_id
          ORDER BY t.entry_id ) AS first_t 
 ORDER BY first_t.partition_id;
Yes, it works:
partition_id,entry_id,data 10,1,'aaa' 20,2,'bbb' 30,3,'ccc'
Plus, it is clear from the SQL what its purpose is:
  • When the "( SELECT DISTINCT ... )" query is translated into English you get "select each distinct partition_id,"

  • when the "( SELECT FIRST ... )" query is translated into English you get "select the first row by entry_id for each row in the other result set," and then

  • when the combined CROSS APPLY query is translated by combining the other two translations, you get "select the first row by entry_id for each distinct partition_id"
which is pretty much a restatement of the original question: "How do I select the first row in each partition?"

It's not perfect; in a perfect world, every single query problem would be solved by a single SQL SELECT statement, not a combination of two queries. But, this is real life... and the original question did refer to two queries: "first row" and "each partition", and the CROSS APPLY contains exactly those two queries...

Not like the Old School solution which does funky things with MIN and GROUP BY, neither of which are even indirectly mentioned in the original question,

SELECT t.* 
  FROM t INNER JOIN ( SELECT partition_id     AS partition_id,
                             MIN ( entry_id ) AS min_entry_id
                        FROM t
                       GROUP BY partition_id
                    ) AS first_row
          ON t.partition_id = first_row.partition_id
         AND t.entry_id     = first_row.min_entry_id
 ORDER BY t.partition_id;
or the OLAP solution which introduced the OVER, WINDOW, PARTITION BY and nested ORDER BY clauses plus the RANK function and a WHERE clause to deal with the RANK result.
SELECT partition_id,
       entry_id,
       data 
  FROM ( select partition_id, 
                entry_id, 
                data,
                rank ( ) OVER partition_id_window AS entry_rank
           FROM t
         WINDOW partition_id_window AS (
                   PARTITION BY partition_id
                   ORDER BY entry_id
                ) ) DT
 WHERE entry_rank = 1
 ORDER BY partition_id;
Go ahead, translate either of those queries into English, see what you get!

Back on topic...

Question: How do I select the top 10 rows in each partition of a table?"
It turns out the CROSS APPLY solution scales just as well as the OLAP solution. With the OLAP solution, the "WHERE entry_rank = 1" predicate was simply changed to "entry_rank <= 10", and with the CROSS APPLY the "FIRST" simply needs to be changed to "TOP 10".

Once again, it works, and once again, the SQL matches the question:

SELECT first_t.*
  FROM ( SELECT DISTINCT
                t.partition_id 
           FROM t
          ORDER BY t.partition_id ) AS unique_partition
  CROSS APPLY
       ( SELECT TOP 10 * 
           FROM t
          WHERE t.partition_id = unique_partition.partition_id
          ORDER BY t.entry_id ) AS first_t 
 ORDER BY first_t.partition_id;

partition_id,entry_id,data
10,1,'aaa'
10,4,'aaa'
10,7,'aaa'
10,10,'aaa'
10,13,'aaa'
10,16,'aaa'
10,19,'aaa'
10,22,'aaa'
10,25,'aaa'
10,28,'aaa'
20,2,'bbb'
20,5,'bbb'
20,8,'bbb'
20,11,'bbb'
20,14,'bbb'
20,17,'bbb'
20,20,'bbb'
20,23,'bbb'
20,26,'bbb'
20,29,'bbb'
30,3,'ccc'
30,6,'ccc'
30,9,'ccc'
30,12,'ccc'
30,15,'ccc'
30,18,'ccc'
30,21,'ccc'
30,24,'ccc'
30,27,'ccc'
30,30,'ccc'
Sometimes "scalability" means "Will it handle millions of rows?", other times the definition is different: "Will it handle more complex problems as well as it handled simple ones?"

For the second definition, at least so far, the answer for CROSS APPLY is "yes".


No comments: