Monday, October 6, 2008

Divide and Conquer by Working Backwards

In previous postings here and here I have talked about solving large SQL query problems by splitting them into smaller problems that are easier to solve: divide and conquer.

Yeah, right... easier said than done... what if it isn't obvious how to split up the big problem into manageable pieces?

One way is to propose "a perfect world" in which the large problem has a simple solution, and work backwards from there. For a database query problem, that means starting with "a perfect schema" designed with one purpose in mind, to make it easy to solve the large problem, and then work backwards step-by-step to devise less-perfect schemas.

At each step of the way the less-perfect schema should meet two criteria: it should be relatively easy to transform the less-perfect schema to the previous more-perfect schema, and each less-perfect schema should be closer to reality... where reality is the ugly real-world schema that made the query hard to write in the first place. The problem is solved when you have a series of queries that transforms the real-world schema into the solution result set.

In other words, "divide and conquer" describes the end result whereas "working backwards" provides a possible path to that result. Here's an example:

Using a bank account database containing separate tables for opening balance, deposits, withdrawals and checks, display each entry together with a closing balance after the entry is applied. The entries are to be ordered by date, then within date by opening balance first followed by deposits, withdrawals and checks.
date        description                    amount  closing_balance
2008-02-01 FY009 Opening Balance 2329.54 2329.54
2008-02-19 DEP - insurance - adjustment 174.29 2503.83
2008-02-21 DEP - tax - refund 482.95 2986.78
2008-02-21 WD - cash - ATM -100.00 2886.78
2008-02-21 CHK 301 - Township - dogtags -40.00 2846.78
2008-02-21 CHK 300 - hydro - invoice -283.53 2563.25
2008-08-15 WD - cash - ATM -100.00 2463.25

Start With The Perfect Schema

Here's what "the perfect schema" might look like, one single table containing the opening balance and all the deposits, withdrawals and checks as individual rows:
CREATE TABLE solution ( 
transaction_number INTEGER NOT NULL,
"date" DATE NOT NULL,
description VARCHAR ( 200 ) NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
closing_balance NUMERIC ( 9, 2 ) NOT NULL,
PRIMARY KEY ( transaction_number );
The transaction_order column specifies the order in which the closing_balance values are calculated as well as the sort order for the final query:
SELECT solution."date",
solution.description,
solution.amount,
solution.closing_balance
FROM solution
ORDER BY solution.transaction_number;

1st Step Back: Calculate Closing Balance

It should come as no surprise that the real-world schema looks nothing like the perfect solution table. In fact, there are four tables, not one, and there are no transaction_number or closing_balance columns:
CREATE TABLE opening_balance ( 
opening_balance_date DATE NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
description VARCHAR ( 200 ) NOT NULL,
PRIMARY KEY ( opening_balance_date ) );

opening_balance_date amount description
-------------------- ------- ---------------------
2008-02-01 2329.54 FY009 Opening Balance

CREATE TABLE deposit (
deposit_number INTEGER NOT NULL DEFAULT AUTOINCREMENT,
deposit_date DATE NOT NULL,
paid_by VARCHAR ( 50 ) NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
description VARCHAR ( 200 ) NOT NULL,
PRIMARY KEY ( deposit_number ) );

deposit_number deposit_date paid_by amount description
-------------- ------------ --------- ------ -----------
1 2008-02-19 insurance 174.29 adjustment
2 2008-02-21 tax 482.95 refund

CREATE TABLE withdrawal (
withdrawal_number INTEGER NOT NULL DEFAULT AUTOINCREMENT,
withdrawal_date DATE NOT NULL,
paid_to VARCHAR ( 50 ) NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
description VARCHAR ( 200 ) NOT NULL,
PRIMARY KEY ( withdrawal_number ) );

withdrawal_number withdrawal_date paid_to amount description
----------------- --------------- ------- ------ -----------
1 2008-02-21 cash 100.00 ATM
2 2008-08-15 cash 100.00 ATM

CREATE TABLE "check" (
check_number INTEGER NOT NULL,
issue_date DATE NOT NULL,
pay_to VARCHAR ( 50 ) NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
check_for VARCHAR ( 35 ) NOT NULL,
PRIMARY KEY ( check_number ) );

check_number issue_date pay_to amount check_for
------------ ---------- -------- ------ ---------
300 2008-02-21 hydro 283.53 invoice
301 2008-02-21 Township 40.00 dogtags
But, we're working backwards... let's pretend there is a single table that holds all the entries, and it does have a transaction_number column but not closing_balance. Here's how it can be used to calculate closing_balance:
CREATE TABLE transaction ( 
transaction_number INTEGER NOT NULL,
"date" DATE NOT NULL,
description VARCHAR ( 200 ) NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
PRIMARY KEY ( transaction_number );

CREATE VIEW solution AS
SELECT transaction.transaction_number,
transaction."date",
transaction.description,
transaction.amount,
SUM ( transaction.amount )
OVER balance_window AS closing_balance
FROM transaction
WINDOW balance_window AS (
ORDER BY transaction.transaction_number
RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW );
The transaction table is the "less-perfect schema", one step closer to reality (it doesn't have the balance column). The solution table has been replaced by a view which performs the transformation to the perfect schema as follows:
  • A OLAP WINDOW clause is used to define a "balance_window" for each row in the result set.

  • The balance_window is cumulative: it includes the current row as well as all the preceding rows: BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

  • The transaction_number column controls the ordering of rows in the balance_window.

  • The aggregate expression SUM ( transaction.amount ) OVER balance_window calculates the closing_balance value for each row.
Tip for English Speakers: The phrase "UNBOUNDED PRECEDING" may be translated from Pointy-Haired-ANSI-Speak to English as follows: "first row". In other words, the BETWEEN clause really means BETWEEN FIRST ROW AND CURRENT ROW. Similarly, "UNBOUNDED FOLLOWING" really means "last row".

2nd Step Back: Calculate Transaction Number

The next step backwards is to propose an even-less-perfect table raw_transaction that doesn't have either closing_balance or transaction_number, but does include a transaction_type column that differentiates between opening_balance, deposit, withdrawal and check as sort_order = '1', '2', '3' and '4':
CREATE TABLE raw_transaction ( 
"date" DATE NOT NULL,
sort_order VARCHAR ( 1 ) NOT NULL,
description VARCHAR ( 200 ) NOT NULL,
amount NUMERIC ( 9, 2 ) NOT NULL,
PRIMARY KEY ( "date",
sort_order,
description,
amount );
The raw_transaction table may be transformed into the transaction view (formerly the transaction table) by using the OLAP ROW_NUMBER() aggregate function to calculate the transaction_number column:
CREATE VIEW transaction (
transaction_number,
"date",
description,
amount )
AS
SELECT ROW_NUMBER() OVER (
ORDER BY
raw_transaction."date",
raw_transaction.sort_order )
AS transaction_number,
raw_transaction."date",
raw_transaction.description,
raw_transaction.amount
FROM raw_transaction;
The OVER ( ORDER BY ... ) clause tells ROW_NUMBER() what row order to use when doing the numbering. Astute observers will note that date and sort_order aren't fully deterministic because more than one row can have the same values, but that doesn't matter because the original specs weren't fully deterministic either. In other words, if there are two withdrawals on the same day it doesn't matter which order they appear in, only that they appear together, after all the deposits and before all the checks on that day.

3rd Step Back: Gather The Raw Transactions

The final step backwards replaces the raw_transaction table with a view that uses a UNION of four selects that pound the individual tables into one lowest-common-denominator representation of the data.

Here is that view, plus the other three queries that together form the fully set oriented "divide and conquer" solution:
Correction:The first two columns in the CREATE VIEW column list appeared in the wrong order when this article was first posted. The error might have been caught if the actual output had been included (it is now, down at the bottom) instead of just the original "specs".
CREATE VIEW raw_transaction (
"date",
sort_order,
description,
amount )
AS
SELECT opening_balance.opening_balance_date,
'1',
opening_balance.description,
opening_balance.amount
FROM opening_balance
UNION ALL
SELECT deposit.deposit_date,
'2',
STRING (
'DEP - ',
deposit.paid_by,
' - ',
deposit.description ),
deposit.amount
FROM deposit
UNION ALL
SELECT withdrawal.withdrawal_date,
'3',
STRING (
'WD - ',
withdrawal.paid_to,
' - ',
withdrawal.description ),
-withdrawal.amount
FROM withdrawal
UNION ALL
SELECT "check".issue_date,
'4',
STRING (
'CHK ',
"check".check_number,
' - ',
"check".pay_to,
' - ',
"check".check_for),
-"check".amount
FROM "check";

CREATE VIEW transaction (
transaction_number,
"date",
description,
amount )
AS
SELECT ROW_NUMBER() OVER (
ORDER BY
raw_transaction."date",
raw_transaction.sort_order )
AS transaction_number,
raw_transaction."date",
raw_transaction.description,
raw_transaction.amount
FROM raw_transaction;

CREATE VIEW solution AS
SELECT transaction.transaction_number,
transaction."date",
transaction.description,
transaction.amount,
SUM ( transaction.amount )
OVER balance_window AS closing_balance
FROM transaction
WINDOW balance_window AS (
ORDER BY transaction.transaction_number
RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW );

SELECT solution."date",
solution.description,
solution.amount,
solution.closing_balance
FROM solution
ORDER BY solution.transaction_number;
The final output shows the two checks in a different order from the original example, but that's OK because the specs didn't say how they should be sorted:
date        description                    amount  closing_balance
2008-02-01 FY009 Opening Balance 2329.54 2329.54
2008-02-19 DEP - insurance - adjustment 174.29 2503.83
2008-02-21 DEP - tax - refund 482.95 2986.78
2008-02-21 WD - cash - ATM -100.00 2886.78
2008-02-21 CHK 300 - hydro - invoice -283.53 2603.25
2008-02-21 CHK 301 - Township - dogtags -40.00 2563.25
2008-08-15 WD - cash - ATM -100.00 2463.25

2 comments:

Anonymous said...

Hi Breck,

Great post! You did a great job explaining a complex topic.

From your solution though, this struck me as odd:

CREATE VIEW raw_transaction (
sort_order,
"date",
...

The following select statements all seem to suggest that the view should be

CREATE VIEW raw_transaction (
"date",
sort_order,
...

Right?

Regards,

-james.

Breck Carter said...

Your correction has been noted! Thanks... making a mistake is on thing, but not showing the actual output that corresponds to the sample code (or vice versa) is worse.