Monday, November 15, 2010

Crosstab, Rotate, Pivot, Normalize

The "Crosstab, Rotate, Pivot" article from earlier this year should have been named "Crosstab, Rotate, Pivot, Denormalize" because it showed how to take a perfectly nice table like this

and muck it up to look like this
without hard-coding any of the data values in the SQL.

What about going the other way?

What if you have a fat squat denormalized table with a bazillion columns
and you want to turn it into a tall skinny table with a bazillion rows?
And then (for bonus points) what if you wanted to keep rotating to get a different fat squat version like this?
First things first; here's the first fat squat table:

--------------------------------------------------------------------------------
-- Step 1: Initialize data.

CREATE TABLE t1 (
c1 VARCHAR ( 10 ) NOT NULL,
Ford_count INTEGER NOT NULL,
Hyundai_count INTEGER NOT NULL,
Honda_count INTEGER NOT NULL,
Chevrolet_count INTEGER NOT NULL,
PRIMARY KEY ( c1 ) );

INSERT t1 VALUES ( 'AZ', 5000, 5000, 1000, 3000 );
INSERT t1 VALUES ( 'CA', 1000, 2000, 9000, 7000 );
INSERT t1 VALUES ( 'FL', 9000, 7000, 2000, 1000 );
INSERT t1 VALUES ( 'MA', 2000, 6000, 5000, 3000 );
INSERT t1 VALUES ( 'NY', 4000, 5000, 1000, 6000 );
COMMIT;
Here's the tall skinny table:

CREATE TABLE t1_normalized (
c1 VARCHAR ( 10 ) NOT NULL,
c2 VARCHAR ( 128 ) NOT NULL,
c3 INTEGER NOT NULL,
PRIMARY KEY ( c1, c2 ) );
Here's the code that fills the tall skinny table (Step 2) and then rotates it again to create the second fat squat table (Step 3):

BEGIN

DECLARE @sql LONG VARCHAR;

--------------------------------------------------------------------------------
-- Step 2: Normalize table.

SELECT STRING (
'INSERT t1_normalized \x0d\x0a',
LIST (
STRING (
'SELECT c1, \x0d\x0a',
' ''', SYSCOLUMN.column_name, ''',\x0d\x0a',
' ', SYSCOLUMN.column_name, '\x0d\x0a',
' FROM ', SYSTABLE.table_name ),
'\x0d\x0aUNION ALL \x0d\x0a'
ORDER BY SYSCOLUMN.column_id ) )
INTO @sql
FROM SYSCOLUMN
INNER JOIN SYSTABLE
ON SYSTABLE.table_id = SYSCOLUMN.table_id
WHERE SYSTABLE.table_name = 't1'
AND SYSCOLUMN.column_name LIKE '%?_count' ESCAPE '?';

SELECT @sql; -- for display

EXECUTE IMMEDIATE @sql;

--------------------------------------------------------------------------------
-- Step 3: Pivot c1 values into columns.

SELECT STRING (
'SELECT c2',
LIST (
STRING (
',\x0d\x0a SUM ( ( IF t1_normalized.c1 = ''',
t1_distinct.c1,
''' THEN 1 ELSE 0 ENDIF ) * t1_normalized.c3 ) AS "',
t1_distinct.c1,
'"' ),
''
ORDER BY t1_distinct.c1 ),
'\x0d\x0a INTO #t1_pivot',
'\x0d\x0a FROM t1_normalized',
'\x0d\x0a GROUP BY t1_normalized.c2' )
INTO @sql
FROM ( SELECT DISTINCT c1
FROM t1_normalized ) AS t1_distinct;

SELECT @sql; -- for display

EXECUTE IMMEDIATE @sql;

SELECT * FROM t1 ORDER BY c1; -- original data, for checking
SELECT * FROM t1_normalized ORDER BY c1, c2; -- normalized data, for checking
SELECT * FROM #t1_pivot ORDER BY c2; -- pivot table

END;
Step 2 is the interesting part: it builds a SQL statement dynamically, at runtime, and then uses the magic EXECUTE IMMEDIATE feature to execute that statement. EXECUTE IMMEDIATE isn't a new feature, it's been around forever, and although the code in this article has only been tested on SQL Anywhere 11.0.1 and 12.0.0 I'm pretty sure it works on older versions like 9 and 10, maybe even earlier.

The code on lines 19 through 23 looks in the SYSTABLE and SYSCOLUMN system tables for all the "_count" columns in table "t1". (Yes, this code depends on the being able to identify the columns of interest by how they are named: see the LIKE predicate on line 23.)
08.SELECT STRING (
09. 'INSERT t1_normalized \x0d\x0a',
10. LIST (
11. STRING (
12. 'SELECT c1, \x0d\x0a',
13. ' ''', SYSCOLUMN.column_name, ''',\x0d\x0a',
14. ' ', SYSCOLUMN.column_name, '\x0d\x0a',
15. ' FROM ', SYSTABLE.table_name ),
16. '\x0d\x0aUNION ALL \x0d\x0a'
17. ORDER BY SYSCOLUMN.column_id ) )
18. INTO @sql
19. FROM SYSCOLUMN
20. INNER JOIN SYSTABLE
21. ON SYSTABLE.table_id = SYSCOLUMN.table_id
22. WHERE SYSTABLE.table_name = 't1'
23. AND SYSCOLUMN.column_name LIKE '%?_count' ESCAPE '?';
24.
25.SELECT @sql; -- for display
26.
27.EXECUTE IMMEDIATE @sql;
The code on lines 8 through 18 builds a SQL statement in a string, and puts that string in the local LONG VARCHAR variable @sql.

The value placed in @sql consists of an 'INSERT...' (line 9) followed by a LIST (line 10) of 'SELECT...' statements (lines 12 through 15) separated by the 'UNION ALL' operator (line 16). The STRING() and LIST() functions are magic too, just like EXECUTE IMMEDIATE... all well worth learning.

The LIST - ORDER BY on line 17 doesn't affect the final outcome, but it is nice to have the generated SELECTs appear in the same order as the columns appear in the original fat squat table.

Confused yet?

Maybe it will help to see what gets put in @sql; an INSERT statement which creates one row in t1_normalized for every column in t1:

INSERT t1_normalized
SELECT c1,
'Ford_count',
Ford_count
FROM t1
UNION ALL
SELECT c1,
'Hyundai_count',
Hyundai_count
FROM t1
UNION ALL
SELECT c1,
'Honda_count',
Honda_count
FROM t1
UNION ALL
SELECT c1,
'Chevrolet_count',
Chevrolet_count
FROM t1

Here are the tables again; first, the fat squat input "FROM t1":
...and the tall skinny output "INSERT t1_normalized":
Oh, by the way... the code in Step 3 uses pretty much the same technique shown in Crosstab, Rotate, Pivot to produce the rotated-again fat squat table:

Does it work in the real world?

If you have the impression I sit around all day coding "SELECT FROM t1", let me set the record straight: I own and maintain some of the nastiest real-world code that's ever been created. And it's in that real world where the real problems arise, like the need to normalize and denormalize data to make it possible for the inevitable Queries From Hell to run in finite time. Necessity is the mother of Invention (although I do have to admit, it does sound pompous to use the word "invention" in a sentence :)

OK, so here's some real code, warts and all, the horribly denormalized table that supports the "Peaks since" line in the Foxhound database monitor:



Truly fat, truly squat:

CREATE TABLE rroad_peaks (

-- *****************************************************************
-- ***** THIS TABLE IS A CANDIDATE FOR POST-SETUP DATA UPGRADE *****
-- *****************************************************************

sampling_id UNSIGNED INTEGER NOT NULL,
peaks_calculated_to_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peaks_calculated_after TIMESTAMP NOT NULL DEFAULT CURRENT TIMESTAMP,
peaks_calculated_after_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
earliest_preserved_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_sample_elapsed_msec BIGINT NOT NULL DEFAULT 0,
peak_canarian_query_elapsed_msec BIGINT NOT NULL DEFAULT 0,
peak_ActiveReq BIGINT NOT NULL DEFAULT 0,
peak_CheckpointUrgency BIGINT NOT NULL DEFAULT 0,
peak_interval_Chkpt BIGINT NOT NULL DEFAULT 0,
peak_ConnCount BIGINT NOT NULL DEFAULT 0,
peak_CurrIO BIGINT NOT NULL DEFAULT 0,
peak_DBFileFragments BIGINT NOT NULL DEFAULT 0,
peak_LockCount BIGINT NOT NULL DEFAULT 0,
peak_MultiProgrammingLevel BIGINT NOT NULL DEFAULT 0,
peak_RecoveryUrgency BIGINT NOT NULL DEFAULT 0,
peak_UnschReq BIGINT NOT NULL DEFAULT 0,
peak_executing_connection_count BIGINT NOT NULL DEFAULT 0,
peak_idle_connection_count BIGINT NOT NULL DEFAULT 0,
peak_waiting_connection_count BIGINT NOT NULL DEFAULT 0,
peak_total_blocked_connection_count BIGINT NOT NULL DEFAULT 0,
peak_total_temporary_file_bytes BIGINT NOT NULL DEFAULT 0,
peak_rate_Bytes DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_Disk DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_BytesReceived DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_BytesSent DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_CachePanics DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_CacheSatisfaction DECIMAL ( 30, 6 ) NOT NULL DEFAULT 1.0, -- a ratio, with the smallest value defined as the "peak"
peak_rate_DiskRead DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_DiskWrite DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_FullCompare DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_IndAdd DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_IndLookup DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_IndSatisfaction DECIMAL ( 30, 6 ) NOT NULL DEFAULT 1.0, -- a ratio, with the smallest value defined as the "peak"
peak_rate_LogWrite DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_cpu_percentage_string VARCHAR ( 10 ) NOT NULL DEFAULT '',
peak_cpu_percentage_number DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_QueryLowMemoryStrategy DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_Req DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_Commit DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_rate_Rlbk DECIMAL ( 30, 6 ) NOT NULL DEFAULT 0.0,
peak_sample_elapsed_msec_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_canarian_query_elapsed_msec_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_ActiveReq_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_CheckpointUrgency_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_interval_Chkpt_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_ConnCount_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_CurrIO_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_DBFileFragments_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_LockCount_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_MultiProgrammingLevel_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_RecoveryUrgency_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_UnschReq_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_executing_connection_count_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_idle_connection_count_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_waiting_connection_count_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_total_blocked_connection_count_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_total_temporary_file_bytes_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_Bytes_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_Disk_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_BytesReceived_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_BytesSent_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_CachePanics_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_CacheSatisfaction_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_DiskRead_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_DiskWrite_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_FullCompare_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_IndAdd_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_IndLookup_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_IndSatisfaction_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_LogWrite_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_cpu_percentage_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_QueryLowMemoryStrategy_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_Req_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_Commit_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
peak_rate_Rlbk_sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY ( sampling_id ) );

Here's the code to normalize the "_sample_set_number" columns into a tall, skinny table:

BEGIN

DECLARE @sql LONG VARCHAR;

DECLARE LOCAL TEMPORARY TABLE normalized_peaks (
sampling_id UNSIGNED INTEGER NOT NULL,
column_name VARCHAR ( 128 ) NOT NULL,
sample_set_number UNSIGNED BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY ( sampling_id, column_name ) );

SELECT STRING (
'INSERT normalized_peaks \x0d\x0a',
LIST (
STRING (
'SELECT sampling_id, \x0d\x0a',
' ''', SYSCOLUMN.column_name, ''',\x0d\x0a',
' ', SYSCOLUMN.column_name, '\x0d\x0a',
' FROM ', SYSTABLE.table_name ),
'\x0d\x0aUNION ALL \x0d\x0a'
ORDER BY SYSCOLUMN.column_id ) )
INTO @sql
FROM SYSCOLUMN
INNER JOIN SYSTABLE
ON SYSTABLE.table_id = SYSCOLUMN.table_id
WHERE SYSTABLE.table_name = 'rroad_peaks'
AND SYSCOLUMN.column_name LIKE '%?_sample_set_number' ESCAPE '?';

SELECT @sql; -- for display

EXECUTE IMMEDIATE @sql;

SELECT * FROM rroad_peaks ORDER BY sampling_id;
SELECT * FROM normalized_peaks ORDER BY sampling_id, column_name;

END;

Here's what the normalized table looks like:
And just to drive the point home, here's the answer to the question "Why go to all that trouble with EXECUTE IMMEDIATE blah blah, why not just hand-code the solution?"

The answer is, the rroad_peaks table is subject to change (especially new columns) and the normalization logic is just one more piece of code that does NOT have to change along with it; i.e., this is the generated INSERT, this is the code that doesn't have to be maintained:

INSERT normalized_peaks
SELECT sampling_id,
'peaks_calculated_to_sample_set_number',
peaks_calculated_to_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peaks_calculated_after_sample_set_number',
peaks_calculated_after_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'earliest_preserved_sample_set_number',
earliest_preserved_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_sample_elapsed_msec_sample_set_number',
peak_sample_elapsed_msec_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_canarian_query_elapsed_msec_sample_set_number',
peak_canarian_query_elapsed_msec_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_ActiveReq_sample_set_number',
peak_ActiveReq_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_CheckpointUrgency_sample_set_number',
peak_CheckpointUrgency_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_interval_Chkpt_sample_set_number',
peak_interval_Chkpt_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_ConnCount_sample_set_number',
peak_ConnCount_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_CurrIO_sample_set_number',
peak_CurrIO_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_DBFileFragments_sample_set_number',
peak_DBFileFragments_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_LockCount_sample_set_number',
peak_LockCount_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_MultiProgrammingLevel_sample_set_number',
peak_MultiProgrammingLevel_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_RecoveryUrgency_sample_set_number',
peak_RecoveryUrgency_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_UnschReq_sample_set_number',
peak_UnschReq_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_executing_connection_count_sample_set_number',
peak_executing_connection_count_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_idle_connection_count_sample_set_number',
peak_idle_connection_count_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_waiting_connection_count_sample_set_number',
peak_waiting_connection_count_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_total_blocked_connection_count_sample_set_number',
peak_total_blocked_connection_count_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_total_temporary_file_bytes_sample_set_number',
peak_total_temporary_file_bytes_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_Bytes_sample_set_number',
peak_rate_Bytes_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_Disk_sample_set_number',
peak_rate_Disk_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_BytesReceived_sample_set_number',
peak_rate_BytesReceived_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_BytesSent_sample_set_number',
peak_rate_BytesSent_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_CachePanics_sample_set_number',
peak_rate_CachePanics_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_CacheSatisfaction_sample_set_number',
peak_CacheSatisfaction_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_DiskRead_sample_set_number',
peak_rate_DiskRead_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_DiskWrite_sample_set_number',
peak_rate_DiskWrite_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_FullCompare_sample_set_number',
peak_rate_FullCompare_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_IndAdd_sample_set_number',
peak_rate_IndAdd_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_IndLookup_sample_set_number',
peak_rate_IndLookup_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_IndSatisfaction_sample_set_number',
peak_IndSatisfaction_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_LogWrite_sample_set_number',
peak_rate_LogWrite_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_cpu_percentage_sample_set_number',
peak_cpu_percentage_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_QueryLowMemoryStrategy_sample_set_number',
peak_rate_QueryLowMemoryStrategy_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_Req_sample_set_number',
peak_rate_Req_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_Commit_sample_set_number',
peak_rate_Commit_sample_set_number
FROM rroad_peaks
UNION ALL
SELECT sampling_id,
'peak_rate_Rlbk_sample_set_number',
peak_rate_Rlbk_sample_set_number
FROM rroad_peaks

No comments: