Question: How do I determine the primary key that is 100 rows newer than the current row, for a subset of rows?
The following solution seems to work AND it's a fairly snappy performer in SQL Anywhere 12.0.1.3298; the question is, is there a better solution?
IF VAREXISTS ( '@partition_id' ) = 0 THEN CREATE VARIABLE @partition_id UNSIGNED INTEGER; END IF; IF VAREXISTS ( '@current_primary_key' ) = 0 THEN CREATE VARIABLE @current_primary_key UNSIGNED INTEGER; END IF; SET @partition_id = 2; SET @current_primary_key = 290505; SELECT newer_t.primary_key AS destination_primary_key FROM ( SELECT TOP 100 ROW_NUMBER() OVER ( ORDER BY T.primary_key ASC ) AS scrolling_row_number, t.primary_key FROM t WHERE t.partition_id = @partition_id AND t.primary_key > @current_primary_key ORDER BY t.primary_key ASC ) AS newer_t WHERE newer_t.scrolling_row_number = 100 ORDER BY newer_t.primary_key DESC;I'm sure all you front-row students out there already have a suggestion or two, but the rest of us need some more information...
Here's the table:
CREATE TABLE DBA.t ( -- 487,445 rows, 54.5M total = 37.4M table + 0 ext + 17M index, 118 bytes per row partition_id /* X */ UNSIGNED INT NOT NULL, primary_key /* PK */ UNSIGNED BIGINT NOT NULL DEFAULT autoincrement, other_data_1 VARCHAR ( 1 ) NOT NULL DEFAULT 'N', other_data_2 VARCHAR ( 1 ) NOT NULL DEFAULT 'Y', other_data_3 VARCHAR ( 32767 ) NOT NULL DEFAULT '', other_data_4 BIGINT NOT NULL, other_data_5 BIGINT NOT NULL, other_data_6 /* X */ TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_7 TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_8 TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_9 TIMESTAMP NOT NULL DEFAULT '1900-01-01', other_data_10 BIGINT NOT NULL DEFAULT 0, CONSTRAINT ASA415 PRIMARY KEY ( -- 904k primary_key ) ); CREATE CLUSTERED INDEX ix_other_data_6 ON DBA.t ( -- 12.7M other_data_6 ); CREATE INDEX ix_partition_id ON DBA.t ( -- 3.5M partition_id );Here's the question again:
Question: How do I determine the primary key that is 100 rows newer than the current row, for a subset of rows?
Here are some definitions:
- Current row: where the autoincrementing primary_key = x
- Desired result: y = the destination primary key, or NULL if there are fewer than 100 newer rows
- Newer: where the primary_key is > x
- Subset: where the partition_id = z
- The test database has 487,445 rows in this table; in the real world millions of rows are common.
- The test database has only 4 distinct values in partition_id, with two values alternating throughout almost all of the rows: 1, 2, 1, 2, ... for primary keys n, n + 1, n + 2, n + 3, ...
- The partition_id column doesn't work in the sense of horizontal partitioning or sharding. In some real-world examples of this table all the rows have the same partition_id, in other cases there are many values of partition_id but only a few account for most of the rows, and in this case (as noted above) the rows are evenly divided between two partition_id values.
- The primary_key may be a perfect candidate for a CLUSTERED index, but one of the other columns is used in expensive range queries (not discussed here) so it got the CLUSTERED index. Sadly, there is apparently a world-wide shortage of the "CLUSTERED" keyword so it has been decreed that only one index per table can be CLUSTERED...
- ... but perhaps that doesn't matter if I read Anil Goel's comment on this post correctly: Multiple Clustered Indexes
SELECT newer_t.primary_key AS destination_primary_key FROM ( SELECT TOP 100 ROW_NUMBER() OVER ( ORDER BY T.primary_key ASC ) AS scrolling_row_number, t.primary_key FROM t WHERE t.partition_id = @partition_id AND t.primary_key > @current_primary_key ORDER BY t.primary_key ASC ) AS newer_t WHERE newer_t.scrolling_row_number = 100 ORDER BY newer_t.primary_key DESC;
- The ROW_NUMBER() OVER clause on line 3 assigns a row number 1, 2, 3, ... for the rows in the inner SELECT.
- It is important (I think, at least in this case) for the OVER ORDER BY on line 3 to be the same as the SELECT ORDER BY on line 8. If they're different (in this case) funky things happen with regards to returning the right answer (that part, I know for sure).
- The predicate t.partition_id = @partition_id on line 6 limits the candidate rows for the inner SELECT to exactly one partition.
- The predicate t.primary_key > @current_primary_key on line 7 limits the candidate rows to "newer" rows.
- The SELECT TOP 100 clause on line 2 limits the inner result set to the first 100 "newer" rows.
- The WHERE clause on line 10 throws away the first 99 "newer" rows and takes row number 100.
- The outer ORDER BY on line 11 is an "oops"... clearly the outer SELECT don't need no steenking ORDER BY. The query engine doesn't eliminate the clause, but at the same time (judging by the plans) it doesn't spend any time executing it either.
...or am I completely out to lunch? That's why I'm asking this question!
Three graphical plans are available, for
- a run with a cold cache,
- followed by an immediate repeat of exactly the same query, and
- then a test of the "next step forward": the destination_primary_key returned by the previous query was fed into the @current_primary_key for the step forward.
Estimated Actual RunTime Runtime ======= ======= Cold cache 10.418 1.7735 Repeat 1.7516 0.55487 Step forward 1.4718 0.39121You can look at the plans in dbisql; here's where to get them:
cold_cache_plan.saplan
repeat_plan.saplan
step_forward_plan.saplan
Here's a quick look at table t for the first two plans (cold cache and immediate repeat); you can see that when the cache warmed up the table scan was replaced by an index:
So, besides getting rid of the oops-es, what else can I do or try?
1 comment:
Note, this is untested and therefore just a suggestion to simplify the query. I can't claim if it does work at all - and if it does, whether it does perform better.
That being said, isn't TOP with the START AT clause helpful here (as to the docs the more appropriate "FIRST START AT" is not supported):
Such as
SELECT TOP 1 START AT 100 t.primary_key AS destination_primary_key
FROM t
WHERE t.partition_id = @partition_id
AND t.primary_key > @current_primary_key
ORDER BY t.primary_key ASC
Just my 0.5 cents
Volker
Post a Comment