Monday, February 25, 2008

The Evolution of a SELECT

Q: I've been using the following query from your book. I know what it does (shows the MobiLink synchronization status of all the remote databases), but can you explain how it works?

SELECT ml_user.name,
ml_subscription.subscription_id,
ml_subscription.progress,
ml_subscription.publication_name,
ml_subscription.last_upload_time,
ml_subscription.last_download_time
FROM ml_user
JOIN ( SELECT a.user_id,
a.subscription_id,
a.progress,
a.publication_name,
a.last_upload_time,
a.last_download_time
FROM ml_subscription AS a
WHERE a.progress = ( SELECT MAX ( b.progress )
FROM ml_subscription AS b
WHERE b.user_id = a.user_id
AND b.publication_name = a.publication_name ) )
AS ml_subscription
ON ml_user.user_id = ml_subscription.user_id;
A: Funny you should ask about that query... in my dog-eared personal copy of the book I have written "too complex" next to it.

Here is an explanation of how it works, presented as a three step "evolution":
  • Simple but not quite correct
  • Correct but really too complex (the book version, above)
  • Simple and correct
Step 1: Simple But Not Correct

This following simple SELECT can be run against the MobiLink system tables in the consolidated database to show one row for each combination of remote database (ml_user.user_id) and publication (ml_subscription.publication_name):
SELECT ml_user.name,
ml_subscription.subscription_id,
ml_subscription.progress,
ml_subscription.publication_name,
ml_subscription.last_upload_time,
ml_subscription.last_download_time
FROM ml_user
JOIN ml_subscription
ON ml_user.user_id = ml_subscription.user_id;
That works just fine, at least initially. Here is the result for a single remote database that subscribes to a single publication, after the first synchronization round-trip has been run:



Here's the problem: When you DROP and re-CREATE the MobiLink subscription on the remote database, and re-synchronize, the MobiLink server inserts a new row in ml_subscription on the consolidated database without deleting the old row. Now the simple SELECT gives two rows when only one has any meaning:



Step 2: Correct But Complex

So, you want a SELECT that returns the second row but not the first. This is done by replacing "JOIN ml_subscription" with "JOIN ( big-fancy-nested-SELECTs ) AS ml_subscription"...
  JOIN ( SELECT a.user_id,
a.subscription_id,
a.progress,
a.publication_name,
a.last_upload_time,
a.last_download_time
FROM ml_subscription AS a
WHERE a.progress = ( SELECT MAX ( b.progress )
FROM ml_subscription AS b
WHERE b.user_id = a.user_id
AND b.publication_name = a.publication_name ) )
AS ml_subscription
What that JOIN ( SELECT... does is create a "derived table" with columns that look just like the base table ml_subscription, but only contains rows containing the MAX ( progress ) values for each publicaton_name.

In particular, the correlated subselect "WHERE a.progress = ( SELECT MAX ( b.progress )..." eliminates the progress = 413161 row, but keeps the progress = 423060.

Here's what the whole thing looks like (the book version, again):
SELECT ml_user.name,
ml_subscription.subscription_id,
ml_subscription.progress,
ml_subscription.publication_name,
ml_subscription.last_upload_time,
ml_subscription.last_download_time
FROM ml_user
JOIN ( SELECT a.user_id,
a.subscription_id,
a.progress,
a.publication_name,
a.last_upload_time,
a.last_download_time
FROM ml_subscription AS a
WHERE a.progress = ( SELECT MAX ( b.progress )
FROM ml_subscription AS b
WHERE b.user_id = a.user_id
AND b.publication_name = a.publication_name ) )
AS ml_subscription
ON ml_user.user_id = ml_subscription.user_id;
When that derived table participates in the outer "FROM ml_user JOIN ( SELECT... ON ml_user.user_id = ml_subscription.user_id", it only gets the progress = 423060 data:



Derived tables are wonderful things. They let you develop queries by a "divide and conquer" process, similar to the creation of views where you push complexity down into separate selects and then use the result sets as if they were tables. Unlike a view, however, you don't have to CREATE anything separate with a derived table, just code the separate SELECT right inside the FROM clause.

Another benefit to the divide and conquer process is that it sometimes lets you see further simplifications...

Step 3: Both Simple And Correct

The complex SELECT from Step 2 includes a column that is of absolutely no use: ml_subscription.subscription_id. MobiLink uses that column as an artificial primary key to identify rows that are otherwise duplicates: Different rows that have the same combination user_id and publication_name. For the purposes of administering MobiLink, only one of those rows is of interest (the latest one, with the maximum value of "progress").

So, if you don't display subscription_id, you can get rid of the whole complex derived table, and instead do a MAX ... GROUP BY to find the latest rows, thusly:
SELECT ml_user.name,
ml_subscription.publication_name,
MAX ( ml_subscription.progress ) AS progress,
MAX ( ml_subscription.last_upload_time ) AS last_upload_time,
MAX ( ml_subscription.last_download_time ) AS last_download_time
FROM ml_user
JOIN ml_subscription
ON ml_user.user_id = ml_subscription.user_id
GROUP BY ml_user.name,
ml_subscription.publication_name;


Add to Technorati Favorites

1 comment:

Anonymous said...

Ahhhhh yes; what was once confounding is now crystal clear. From your Step 2 explanation, I learned about derived tables which I have not used before. And I now understand the business rules behind the complex query. I like the 3rd query because my simple mind doesn't have to work so hard to understand it. And best of all, the simple solution produces the right answer!

You have a gift for teaching. Thanks for taking the time to do so.