Question: How do I check two columns to make sure they have the same ordering?
Answer: Why, that's EASY!
You just take the...
um...
you do a...
just a minnit...
Could you repeat the question, please?
Question: How do I make sure a table with two NOT NULL columns a and b yields the same row ordering no matter which column is used in the ORDER BY?
Whaddaya mean "the same row ordering"? What about duplicates?
OK, the "same row ordering for distinct column values".
The question can be restated as
How do I check that SELECT DISTINCT a, b FROM t ORDER BY a, b returns its rows in the same order as SELECT DISTINCT a, b FROM t ORDER BY b, a?
Here's some data:
CREATE TABLE t (
   pkey INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   a    INTEGER NOT NULL,
   b    INTEGER NOT NULL );

INSERT t ( a, b ) VALUES ( 3, 22 );
INSERT t ( a, b ) VALUES ( 4, 45 );
INSERT t ( a, b ) VALUES ( 1, 10 );
INSERT t ( a, b ) VALUES ( 3, 22 );
INSERT t ( a, b ) VALUES ( 2, 22 );
INSERT t ( a, b ) VALUES ( 4, 44 );
INSERT t ( a, b ) VALUES ( 5, 33 );
INSERT t ( a, b ) VALUES ( 3, 22 );
COMMIT;

SELECT * FROM t ORDER BY a, b;

pkey  a   b
   3  1  10
   5  2  22
   1  3  22
   4  3  22
   8  3  22
   6  4  44
   2  4  45
   7  5  33
The order of the pkey column isn't important; what's important is the order of a as compared with the order of b. Here's what that looks like:
SELECT DISTINCT a, b FROM t ORDER BY a, b;     SELECT DISTINCT a, b FROM t ORDER BY b, a;

a   b                                          a   b
1  10                                          1  10
2  22                                          2  22
3  22                                          3  22
4  44                                          5  33
4  45                                          4  44
5  33                                          4  45
Right away, you can see "5, 33" is a problem because it appears at row 6 in one result set and row 4 in the other.
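If you'd rather let a program do the eyeballing, here is a minimal sketch in Python. It assumes SQLite (through the standard sqlite3 module) as a stand-in for SQL Anywhere, so the CREATE TABLE and INSERT syntax is adjusted slightly; the names ROWS, by_a_b, by_b_a and mismatches are made up for this illustration.

```python
import sqlite3

# Sample rows from the post. SQLite stands in for SQL Anywhere (an assumption).
ROWS = [(3, 22), (4, 45), (1, 10), (3, 22), (2, 22), (4, 44), (5, 33), (3, 22)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t ("
    " pkey INTEGER PRIMARY KEY AUTOINCREMENT,"
    " a INTEGER NOT NULL,"
    " b INTEGER NOT NULL )"
)
conn.executemany("INSERT INTO t ( a, b ) VALUES ( ?, ? )", ROWS)

# The same distinct pairs, sorted two different ways.
by_a_b = conn.execute("SELECT DISTINCT a, b FROM t ORDER BY a, b").fetchall()
by_b_a = conn.execute("SELECT DISTINCT a, b FROM t ORDER BY b, a").fetchall()

# Walk the two result sets side by side and keep the positions that disagree.
mismatches = [
    (row_number, left, right)
    for row_number, (left, right) in enumerate(zip(by_a_b, by_b_a), start=1)
    if left != right
]

for row_number, left, right in mismatches:
    print(row_number, left, right)
```

With the sample data, rows 4, 5 and 6 disagree, and ( 5, 33 ) is one of the pairs involved.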
This suggests a brute-force solution:
Compare the ROW_NUMBER() values for the two result sets.
I'm sure YOU can do better than this...
WITH vdistinct AS
   ( SELECT DISTINCT
            a,
            b
       FROM t ),
vjoin AS
   ( SELECT *
       FROM ( SELECT a AS a1,
                     b AS b1,
                     ROW_NUMBER() OVER ( ORDER BY a, b ) AS row1
                FROM vdistinct ) AS t1
            INNER JOIN
            ( SELECT a AS a2,
                     b AS b2,
                     ROW_NUMBER() OVER ( ORDER BY b, a ) AS row2
                FROM vdistinct ) AS t2
            ON  t1.a1 = t2.a2
            AND t1.b1 = t2.b2 )
SELECT *
  FROM vjoin
 WHERE row1 <> row2
 ORDER BY row1, row2;
The WITH clause on lines 1 to 18 defines two local views, also known as "common table expressions", called vdistinct and vjoin.
The first local view on lines 1 to 5 takes care of the fact that the DISTINCT keyword applies to the entire SELECT list, and if we want all the distinct pairs of a and b, there can't be anything else (say, ROW_NUMBER()) in the SELECT DISTINCT list.
The second local view on lines 6 to 18 joins the first view to itself on matching values of a and b, with the difference being the two ROW_NUMBER() OVER ( ORDER BY ... ) items.
The final SELECT on lines 19 to 22 looks for ( a, b ) pairs whose two ROW_NUMBER() values differ... if the result set is empty, then columns a and b are OK...
...but if anything shows up, there's a problem:
a1  b1  row1  a2  b2  row2
 4  44     4   4  44     5
 4  45     5   4  45     6
 5  33     6   5  33     4
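To try the ROW_NUMBER() comparison without a SQL Anywhere server handy, here is a hedged sketch that runs the same WITH SELECT from Python. It assumes SQLite 3.25 or later (the first release with window functions) via the standard sqlite3 module; CHECK_SQL and problems are names invented for this illustration.

```python
import sqlite3

# The WITH SELECT from the post, unchanged apart from layout.
CHECK_SQL = """
WITH vdistinct AS
   ( SELECT DISTINCT a, b FROM t ),
vjoin AS
   ( SELECT *
       FROM ( SELECT a AS a1, b AS b1,
                     ROW_NUMBER() OVER ( ORDER BY a, b ) AS row1
                FROM vdistinct ) AS t1
            INNER JOIN
            ( SELECT a AS a2, b AS b2,
                     ROW_NUMBER() OVER ( ORDER BY b, a ) AS row2
                FROM vdistinct ) AS t2
            ON  t1.a1 = t2.a2
            AND t1.b1 = t2.b2 )
SELECT *
  FROM vjoin
 WHERE row1 <> row2
 ORDER BY row1, row2
"""

# SQLite stands in for SQL Anywhere here (an assumption); window functions
# need SQLite 3.25 or later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t ( a INTEGER NOT NULL, b INTEGER NOT NULL )")
conn.executemany(
    "INSERT INTO t ( a, b ) VALUES ( ?, ? )",
    [(3, 22), (4, 45), (1, 10), (3, 22), (2, 22), (4, 44), (5, 33), (3, 22)],
)

# An empty list means both ORDER BY clauses produce the same row order.
problems = conn.execute(CHECK_SQL).fetchall()
for row in problems:
    print(row)
```

An empty problems list is the "columns a and b are OK" case; the sample data produces the three rows shown above.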
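There are three rows in the result set but only one problem. Why?
Because the query only answers "Is there a problem here?" Moving the single out-of-place pair ( 5, 33 ) from row 6 up to row 4 pushes ( 4, 44 ) and ( 4, 45 ) down one row each, so all three pairs get different ROW_NUMBER() values.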
Is that good enough?
In 1972 finding all the errors in one run was critically important because you had to wait hours, or overnight, for each test.
But it's not 1972 any more, and finding one error at a time is perfectly OK.
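If you do want a hint about which pair is the troublemaker, one alternative (not the method used in this post) is to list the conflicting pairs directly: two distinct pairs conflict when one has the smaller a but the larger b. The sketch below again assumes SQLite through Python's sqlite3 module, and CONFLICT_SQL, conflicts and involved are hypothetical names. Counting how often each pair turns up is only a rough heuristic for spotting the culprit.

```python
import sqlite3
from collections import Counter

# Pairs x and y conflict when a says "x first" but b says "y first".
CONFLICT_SQL = """
WITH vdistinct AS
   ( SELECT DISTINCT a, b FROM t )
SELECT x.a AS a1, x.b AS b1, y.a AS a2, y.b AS b2
  FROM vdistinct AS x
       INNER JOIN vdistinct AS y
       ON  x.a < y.a
       AND x.b > y.b
 ORDER BY x.a, x.b, y.a, y.b
"""

# SQLite stands in for SQL Anywhere here (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t ( a INTEGER NOT NULL, b INTEGER NOT NULL )")
conn.executemany(
    "INSERT INTO t ( a, b ) VALUES ( ?, ? )",
    [(3, 22), (4, 45), (1, 10), (3, 22), (2, 22), (4, 44), (5, 33), (3, 22)],
)

conflicts = conn.execute(CONFLICT_SQL).fetchall()

# Heuristic: the pair that appears in the most conflicts is the likeliest culprit.
involved = Counter()
for a1, b1, a2, b2 in conflicts:
    involved[(a1, b1)] += 1
    involved[(a2, b2)] += 1

for pair, count in involved.most_common():
    print(pair, count)
```

For the sample data there are two conflicts, ( 4, 44 ) versus ( 5, 33 ) and ( 4, 45 ) versus ( 5, 33 ), so ( 5, 33 ) tops the count. When no conflicts exist, any two distinct pairs either tie on one column or agree on both, so an empty result here should coincide with the ROW_NUMBER() check returning no rows.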
Consider this real-world example: Only lines 3, 4 and 5 had to be changed to use the WITH SELECT on a different table.
WITH vdistinct AS
   ( SELECT DISTINCT
            article_number AS a,
            order_by_date  AS b
       FROM article ),
vjoin AS
   ( SELECT *
       FROM ( SELECT a AS a1,
                     b AS b1,
                     ROW_NUMBER() OVER ( ORDER BY a, b ) AS row1
                FROM vdistinct ) AS t1
            INNER JOIN
            ( SELECT a AS a2,
                     b AS b2,
                     ROW_NUMBER() OVER ( ORDER BY b, a ) AS row2
                FROM vdistinct ) AS t2
            ON  t1.a1 = t2.a2
            AND t1.b1 = t2.b2 )
SELECT *
  FROM vjoin
 WHERE row1 <> row2
 ORDER BY row1, row2;
Over 100 rows were returned, not because of data-entry errors but because of a logic error in an automatic load process:
a1,b1,row1,a2,b2,row2
161.011000,'2005-10-01',243,161.011000,'2005-10-01',262
161.020000,'2005-08-21',244,161.020000,'2005-08-21',243
161.030000,'2005-08-21',245,161.030000,'2005-08-21',244
161.040000,'2005-08-21',246,161.040000,'2005-08-21',245
161.050000,'2005-08-21',247,161.050000,'2005-08-21',246
161.060000,'2005-08-21',248,161.060000,'2005-08-21',247
161.070000,'2005-08-21',249,161.070000,'2005-08-21',248
161.080000,'2005-08-21',250,161.080000,'2005-08-21',249
161.090000,'2005-08-21',251,161.090000,'2005-08-21',250
161.100000,'2005-08-21',252,161.100000,'2005-08-21',251
161.110000,'2005-08-21',253,161.110000,'2005-08-21',252
161.120000,'2005-08-21',254,161.120000,'2005-08-21',253
161.130000,'2005-08-21',255,161.130000,'2005-08-21',254
161.140000,'2005-08-21',256,161.140000,'2005-08-21',255
161.150000,'2005-08-21',257,161.150000,'2005-08-21',256
161.160000,'2005-08-21',258,161.160000,'2005-08-21',257
161.170000,'2005-08-21',259,161.170000,'2005-08-21',258
After the logic error was fixed, only six rows showed up:
a1,b1,row1,a2,b2,row2
303.000000,'2011-02-01',759,303.000000,'2011-02-01',764
304.000000,'2011-01-04',760,304.000000,'2011-01-04',759
304.010000,'2011-01-11',761,304.010000,'2011-01-11',760
304.020000,'2011-01-17',762,304.020000,'2011-01-17',761
304.030000,'2011-01-24',763,304.030000,'2011-01-24',762
304.040000,'2011-01-31',764,304.040000,'2011-01-31',763
Now that WAS a data-entry error... one error involving two pairs (303, 2011-02-01) and (304, 2011-01-04). It caused a cascading effect in the subsequent automatic load process, hence the six rows in the WITH SELECT result set.
In other words, when that single error was fixed, the WITH SELECT returned no rows - no problems!