SQL Anywhere: Divide and Conquer by Working Backwards

In previous postings here and here I have talked about solving large SQL query problems by splitting them into smaller problems that are easier to solve: divide and conquer.

Yeah, right... easier said than done... what if it isn't obvious how to split up the big problem into manageable pieces?

One way is to propose "a perfect world" in which the large problem has a simple solution, and work backwards from there. For a database query problem, that means starting with "a perfect schema" designed with one purpose in mind, to make it easy to solve the large problem, and then work backwards step-by-step to devise less-perfect schemas.

At each step of the way the less-perfect schema should meet two criteria: it should be relatively easy to transform the less-perfect schema to the previous more-perfect schema, and each less-perfect schema should be closer to reality... where reality is the ugly real-world schema that made the query hard to write in the first place. The problem is solved when you have a series of queries that transforms the real-world schema into the solution result set.

In other words, "divide and conquer" describes the end result whereas "working backwards" provides a possible path to that result. Here's an example:

Using a bank account database containing separate tables for opening balance, deposits, withdrawals and checks, display each entry together with a closing balance after the entry is applied. The entries are to be ordered by date, then within date by opening balance first followed by deposits, withdrawals and checks.

date        description                    amount  closing_balance
2008-02-01  FY009 Opening Balance         2329.54          2329.54
2008-02-19  DEP - insurance - adjustment   174.29          2503.83
2008-02-21  DEP - tax - refund             482.95          2986.78
2008-02-21  WD - cash - ATM               -100.00          2886.78
2008-02-21  CHK 301 - Township - dogtags   -40.00          2846.78
2008-02-21  CHK 300 - hydro - invoice     -283.53          2563.25
2008-08-15  WD - cash - ATM               -100.00          2463.25

Start With The Perfect Schema

Here's what "the perfect schema" might look like, one single table containing the opening balance and all the deposits, withdrawals and checks as individual rows:

CREATE TABLE solution ( 
   transaction_number INTEGER NOT NULL,
   "date"             DATE NOT NULL,
   description        VARCHAR ( 200 ) NOT NULL,
   amount             NUMERIC ( 9, 2 ) NOT NULL,
   closing_balance    NUMERIC ( 9, 2 ) NOT NULL,
   PRIMARY KEY ( transaction_number );

The transaction_order column specifies the order in which the closing_balance values are calculated as well as the sort order for the final query:

SELECT solution."date",
       solution.description,
       solution.amount,
       solution.closing_balance
  FROM solution
 ORDER BY solution.transaction_number;

1st Step Back: Calculate Closing Balance

It should come as no surprise that the real-world schema looks nothing like the perfect solution table. In fact, there are four tables, not one, and there are no transaction_number or closing_balance columns:

CREATE TABLE opening_balance ( 
   opening_balance_date   DATE NOT NULL,
   amount                 NUMERIC ( 9, 2 ) NOT NULL,
   description            VARCHAR ( 200 ) NOT NULL,
   PRIMARY KEY ( opening_balance_date ) );

opening_balance_date   amount  description
--------------------  -------  ---------------------
2008-02-01            2329.54  FY009 Opening Balance 

CREATE TABLE deposit ( 
   deposit_number  INTEGER NOT NULL DEFAULT AUTOINCREMENT,
   deposit_date    DATE NOT NULL,
   paid_by         VARCHAR ( 50 ) NOT NULL,
   amount          NUMERIC ( 9, 2 ) NOT NULL,
   description     VARCHAR ( 200 ) NOT NULL,
   PRIMARY KEY ( deposit_number ) );

deposit_number  deposit_date  paid_by    amount  description
--------------  ------------  ---------  ------  -----------
1               2008-02-19    insurance  174.29  adjustment 
2               2008-02-21    tax        482.95  refund 

CREATE TABLE withdrawal ( 
   withdrawal_number  INTEGER NOT NULL DEFAULT AUTOINCREMENT,
   withdrawal_date    DATE NOT NULL,
   paid_to            VARCHAR ( 50 ) NOT NULL,
   amount             NUMERIC ( 9, 2 ) NOT NULL,
   description        VARCHAR ( 200 ) NOT NULL,
   PRIMARY KEY ( withdrawal_number ) );

withdrawal_number  withdrawal_date  paid_to  amount  description
-----------------  ---------------  -------  ------  -----------
1                  2008-02-21       cash     100.00  ATM 
2                  2008-08-15       cash     100.00  ATM 

CREATE TABLE "check" ( 
   check_number  INTEGER NOT NULL,
   issue_date    DATE NOT NULL,
   pay_to        VARCHAR ( 50 ) NOT NULL,
   amount        NUMERIC ( 9, 2 ) NOT NULL,
   check_for     VARCHAR ( 35 ) NOT NULL,
   PRIMARY KEY ( check_number ) );

check_number  issue_date  pay_to    amount  check_for
------------  ----------  --------  ------  ---------
300           2008-02-21  hydro     283.53  invoice 
301           2008-02-21  Township   40.00  dogtags

But, we're working backwards... let's pretend there is a single table that holds all the entries, and it does have a transaction_number column but not closing_balance. Here's how it can be used to calculate closing_balance:

CREATE TABLE transaction ( 
   transaction_number INTEGER NOT NULL,
   "date"             DATE NOT NULL,
   description        VARCHAR ( 200 ) NOT NULL,
   amount             NUMERIC ( 9, 2 ) NOT NULL,
   PRIMARY KEY ( transaction_number );

CREATE VIEW solution AS
SELECT transaction.transaction_number,
       transaction."date",
       transaction.description,
       transaction.amount,
       SUM ( transaction.amount )
             OVER balance_window AS closing_balance
  FROM transaction
WINDOW balance_window AS (
           ORDER BY transaction.transaction_number
           RANGE BETWEEN UNBOUNDED PRECEDING
                 AND     CURRENT ROW );

The transaction table is the "less-perfect schema", one step closer to reality (it doesn't have the balance column). The solution table has been replaced by a view which performs the transformation to the perfect schema as follows:

A OLAP WINDOW clause is used to define a "balance_window" for each row in the result set.
The balance_window is cumulative: it includes the current row as well as all the preceding rows: BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
The transaction_number column controls the ordering of rows in the balance_window.
The aggregate expression SUM ( transaction.amount ) OVER balance_window calculates the closing_balance value for each row.

Tip for English Speakers: The phrase "UNBOUNDED PRECEDING" may be translated from Pointy-Haired-ANSI-Speak to English as follows: "first row". In other words, the BETWEEN clause really means BETWEEN FIRST ROW AND CURRENT ROW. Similarly, "UNBOUNDED FOLLOWING" really means "last row".

2nd Step Back: Calculate Transaction Number

The next step backwards is to propose an even-less-perfect table raw_transaction that doesn't have either closing_balance or transaction_number, but does include a transaction_type column that differentiates between opening_balance, deposit, withdrawal and check as sort_order = '1', '2', '3' and '4':

CREATE TABLE raw_transaction ( 
   "date"       DATE NOT NULL,
   sort_order   VARCHAR ( 1 ) NOT NULL,
   description  VARCHAR ( 200 ) NOT NULL,
   amount       NUMERIC ( 9, 2 ) NOT NULL,
   PRIMARY KEY ( "date",
                 sort_order,
                 description,
                 amount );

The raw_transaction table may be transformed into the transaction view (formerly the transaction table) by using the OLAP ROW_NUMBER() aggregate function to calculate the transaction_number column:

CREATE VIEW transaction (
   transaction_number,
   "date",
   description,
   amount )   
AS
SELECT ROW_NUMBER() OVER ( 
          ORDER BY 
             raw_transaction."date",
             raw_transaction.sort_order ) 
          AS transaction_number,  
       raw_transaction."date",
       raw_transaction.description,
       raw_transaction.amount
  FROM raw_transaction;

The OVER ( ORDER BY ... ) clause tells ROW_NUMBER() what row order to use when doing the numbering. Astute observers will note that date and sort_order aren't fully deterministic because more than one row can have the same values, but that doesn't matter because the original specs weren't fully deterministic either. In other words, if there are two withdrawals on the same day it doesn't matter which order they appear in, only that they appear together, after all the deposits and before all the checks on that day.

3rd Step Back: Gather The Raw Transactions

The final step backwards replaces the raw_transaction table with a view that uses a UNION of four selects that pound the individual tables into one lowest-common-denominator representation of the data.

Here is that view, plus the other three queries that together form the fully set oriented "divide and conquer" solution:

Correction:The first two columns in the CREATE VIEW column list appeared in the wrong order when this article was first posted. The error might have been caught if the actual output had been included (it is now, down at the bottom) instead of just the original "specs".

CREATE VIEW raw_transaction (
   "date",
   sort_order,
   description,
   amount )   
AS
SELECT opening_balance.opening_balance_date,
       '1',
       opening_balance.description,
       opening_balance.amount
  FROM opening_balance
UNION ALL
SELECT deposit.deposit_date,
       '2',
       STRING (
          'DEP - ',
          deposit.paid_by,
          ' - ',
          deposit.description ),
       deposit.amount
  FROM deposit
UNION ALL
SELECT withdrawal.withdrawal_date,
       '3',
       STRING (
          'WD - ',
          withdrawal.paid_to,
          ' - ',
          withdrawal.description ),
       -withdrawal.amount
  FROM withdrawal
 UNION ALL
SELECT "check".issue_date,
       '4',
       STRING (
          'CHK ',
          "check".check_number,
          ' - ',
          "check".pay_to,
          ' - ',
          "check".check_for),
       -"check".amount
  FROM "check";

CREATE VIEW transaction (
   transaction_number,
   "date",
   description,
   amount )   
AS
SELECT ROW_NUMBER() OVER ( 
          ORDER BY 
             raw_transaction."date",
             raw_transaction.sort_order ) 
          AS transaction_number,  
       raw_transaction."date",
       raw_transaction.description,
       raw_transaction.amount
  FROM raw_transaction;  

CREATE VIEW solution AS
SELECT transaction.transaction_number,
       transaction."date",
       transaction.description,
       transaction.amount,
       SUM ( transaction.amount )
             OVER balance_window AS closing_balance
  FROM transaction
WINDOW balance_window AS (
           ORDER BY transaction.transaction_number
           RANGE BETWEEN UNBOUNDED PRECEDING
                 AND     CURRENT ROW );  

SELECT solution."date",
       solution.description,
       solution.amount,
       solution.closing_balance
  FROM solution
 ORDER BY solution.transaction_number;

The final output shows the two checks in a different order from the original example, but that's OK because the specs didn't say how they should be sorted:

date        description                    amount  closing_balance
2008-02-01  FY009 Opening Balance         2329.54          2329.54
2008-02-19  DEP - insurance - adjustment   174.29          2503.83
2008-02-21  DEP - tax - refund             482.95          2986.78
2008-02-21  WD - cash - ATM               -100.00          2886.78
2008-02-21  CHK 300 - hydro - invoice     -283.53          2603.25
2008-02-21  CHK 301 - Township - dogtags   -40.00          2563.25
2008-08-15  WD - cash - ATM               -100.00          2463.25

2 comments:

Anonymous said...: Hi Breck,

Great post! You did a great job explaining a complex topic.

From your solution though, this struck me as odd:

CREATE VIEW raw_transaction (
sort_order,
"date",
...

The following select statements all seem to suggest that the view should be

CREATE VIEW raw_transaction (
"date",
sort_order,
...

Right?

Regards,

-james.; October 9, 2008 at 10:06 PM
Breck Carter said...: Your correction has been noted! Thanks... making a mistake is on thing, but not showing the actual output that corresponds to the sample code (or vice versa) is worse.; October 10, 2008 at 9:55 AM

SQL Anywhere

Pages

Monday, October 6, 2008

Divide and Conquer by Working Backwards

Start With The Perfect Schema

1st Step Back: Calculate Closing Balance

2nd Step Back: Calculate Transaction Number

3rd Step Back: Gather The Raw Transactions

2 comments:

Categories