Version 3 of the Foxhound Database Monitor is now available...

Friday, February 29, 2008

Great Moments In History: Housing Bubble

It's the last day of the first month in the history of this blog...

So far there have been 3 kinds of postings here:

Well, this particular Great Moments entry isn't quite so easy to categorize: Yes, it is part happy-fun-time, at least if you're shopping for a house in south-west Florida. But it's also part serious technical stuff because it talks about the new FIRST_VALUE() and LAST_VALUE() aggregate functions in SQL Anywhere version 10.0.1.

Here are the specs for a real-world query:
Has the housing bubble finished bursting in south-west Florida?
Let's put that in more concrete terms:
Show me the price changes for houses currently on the market in south-west Florida.
Now let's have a look at the data that's actually available for this particular example: Every few days information on houses available for sale in a particular area has been gathered from emails and stored in a table. About 4,300 rows now exist, going back to 2005, covering about 1,300 separate houses. Some houses have only one row, some have several rows showing how the asking price has changed over the years.

Here is a query showing the history for one particular house:
SELECT email_date, street_address, zip_code, price
FROM mls_entry
WHERE street_address = '4207 Pelican Blvd'
ORDER BY email_date;
The asking price has collapsed about 58% over the past 2 1/2 years... clearly the residential real estate bubble is bursting:



Here's what the base table looks like:
CREATE TABLE mls_entry ( -- 4,345 rows, 440k total
pkey INTEGER NOT NULL DEFAULT autoincrement,
email_date DATE NOT NULL,
entry_line_number BIGINT NOT NULL DEFAULT 0,
price DECIMAL ( 11, 2 ) NOT NULL DEFAULT 0,
mls_number BIGINT NOT NULL DEFAULT 0,
street_number LONG VARCHAR NOT NULL DEFAULT '',
street_address LONG VARCHAR NOT NULL DEFAULT '',
zip_code VARCHAR ( 5 ) NOT NULL DEFAULT '',
street_name LONG VARCHAR NOT NULL
COMPUTE ( SUBSTR(street_address,LENGTH(street_number)+1) ),
CONSTRAINT ASA74 PRIMARY KEY ( -- 56k
pkey ) );
So let's go back and re-state the specs in terms of the mls_entry table:
Show the overall price changes for houses with two or more email entries, where the latest entry was received this year, and sort it by percentage price change.
What we want, for each street address, is a single row showing the price change between the earliest and latest entries. Not using the minimum and maximum prices, that would be too easy. We want to use the prices that correspond to the minimum and maximum email_date values. For example, for the house at 4207 Pelican Blvd it means reducing 16 mls_entry rows down to one row in the result.

Until version 10.0.1 there was no easy way to compute that using SQL Anywhere. Here is a query that works in earlier versions; it uses a join between two different derived tables, each with a correlated subselect, to gather data from the first and last rows for each address:
WITH price_change AS
( SELECT DISTINCT
first_entry.street_name,
first_entry.street_number,
first_entry.zip_code,
STRING
( first_entry.street_address,
' ',
first_entry.zip_code ) AS address,
first_entry.email_date AS first_date,
last_entry.email_date AS last_date,
first_entry.price AS first_price,
last_entry.price AS last_price
FROM ( SELECT *
FROM mls_entry AS F
WHERE F.email_date
= ( SELECT MIN ( M.email_date )
FROM mls_entry AS M
WHERE M.street_address
= F.street_address
AND M.zip_code
= F.zip_code ) )
AS first_entry
INNER JOIN
( SELECT *
FROM mls_entry AS L
WHERE L.email_date
= ( SELECT MAX ( M.email_date )
FROM mls_entry AS M
WHERE M.street_address
= L.street_address
AND M.zip_code
= L.zip_code ) )
AS last_entry
ON first_entry.street_address
= last_entry.street_address
AND first_entry.zip_code
= last_entry.zip_code )
SELECT address AS street_and_zip,
DATEDIFF ( DAY, first_date, last_date ) AS days,
( last_price - first_price ) AS change,
CAST ( change / first_price
* 100.0 AS INTEGER ) AS "%",
CAST ( last_price AS INTEGER ) AS price,
last_date AS "date"
FROM price_change
WHERE last_date >= '2008-01-01'
AND last_date <> first_date
ORDER BY "%",
zip_code,
street_name,
CAST ( street_number AS BIGINT );
The WITH price_change AS ( SELECT ... clause has been added to extend the "divide and conquer" process discussed back in The Evolution of a SELECT. You can think of the WITH clause as a local CREATE VIEW price_change, and here it simply separates the complex logic of gathering first-and-last rows from the final SELECT to do the calculations... like I said, divide and conquer to makes coding easier.

Here's the final result; you can see that a 58% price drop is not unique. In fact there are over 10 houses for which the asking price is half what it once was:



Of the 209 houses in the full result set, 200 show a decrease in asking price and only 2 show an increase, so the answer to our original question "has the bubble finished bursting?" is "probably not yet".

And now back to the new features: SQL Anywhere 9 introduced the OLAP WINDOW clause to let you create complex queries with less effort, and version 10.0.1 introduced two new aggregate functions to use with it: FIRST_VALUE() and LAST_VALUE(). Together these features completely eliminate the messy derived tables and correlated subselects shown earlier; here's how:

The following clause defines a window named date_window:
     WINDOW date_window AS ( 
PARTITION BY street_address, zip_code
ORDER BY email_date
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING ) )
Inside the WINDOW clause, the PARTITION BY breaks up the rows into separate groups or partitions, one per house address (street_address plus zip_code).

The ORDER BY then sorts the rows inside each partition, and the RANGE ... UNBOUNDED clause is OLAP-speak for "make the window the same size as the partition when applying aggregate functions to the window".

In other words, the WINDOW clause is like a GROUP BY on steroids: It doesn't just group (partition) rows on specified columns, but it preserves all the rows and all the columns so you can treat each partition like a miniature result set.

Once you've defined an ordered window, you can calculate the first and last prices for each house with these expressions in the SELECT list:
   FIRST_VALUE ( price ) OVER date_window
LAST_VALUE ( price ) OVER date_window
Here's the whole query; it returns exactly the same results as shown above, and it's quite a bit simpler:
WITH price_change AS
( SELECT DISTINCT
street_name,
street_number,
zip_code,
STRING
( mls_entry.street_address,
' ',
mls_entry.zip_code ) AS address,
FIRST_VALUE ( email_date )
OVER date_window AS first_date,
LAST_VALUE ( email_date )
OVER date_window AS last_date,
FIRST_VALUE ( price )
OVER date_window AS first_price,
LAST_VALUE ( price )
OVER date_window AS last_price
FROM mls_entry
WINDOW date_window AS (
PARTITION BY street_address, zip_code
ORDER BY email_date
RANGE BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING ) )
SELECT address AS street_and_zip,
DATEDIFF ( DAY, first_date, last_date ) AS days,
( last_price - first_price ) AS change,
CAST ( change / first_price
* 100.0 AS INTEGER ) AS "%",
CAST ( last_price AS INTEGER ) AS price,
last_date AS "date"
FROM price_change
WHERE last_date >= '2008-01-01'
AND last_date <> first_date
ORDER BY "%",
zip_code,
street_name,
CAST ( street_number AS BIGINT );
And guess what else? As cool as they are, FIRST_VALUE() and LAST_VALUE() are only two minor features among all the other OLAP features already in SQL Anywhere.


Add to Technorati Favorites

No comments: