Q: I've been using the following query from your book. I know what it does (shows the MobiLink synchronization status of all the remote databases), but can you explain how it works?
SELECT ml_user.name,
ml_subscription.subscription_id,
ml_subscription.progress,
ml_subscription.publication_name,
ml_subscription.last_upload_time,
ml_subscription.last_download_time
FROM ml_user
JOIN ( SELECT a.user_id,
a.subscription_id,
a.progress,
a.publication_name,
a.last_upload_time,
a.last_download_time
FROM ml_subscription AS a
WHERE a.progress = ( SELECT MAX ( b.progress )
FROM ml_subscription AS b
WHERE b.user_id = a.user_id
AND b.publication_name = a.publication_name ) )
AS ml_subscription
ON ml_user.user_id = ml_subscription.user_id;
A: Funny you should ask about that query... in my dog-eared personal copy of the book I have written "too complex" next to it.Here is an explanation of how it works, presented as a three step "evolution":
- Simple but not quite correct
- Correct but really too complex (the book version, above)
- Simple and correct
This following simple SELECT can be run against the MobiLink system tables in the consolidated database to show one row for each combination of remote database (ml_user.user_id) and publication (ml_subscription.publication_name):
SELECT ml_user.name,
ml_subscription.subscription_id,
ml_subscription.progress,
ml_subscription.publication_name,
ml_subscription.last_upload_time,
ml_subscription.last_download_time
FROM ml_user
JOIN ml_subscription
ON ml_user.user_id = ml_subscription.user_id;
That works just fine, at least initially. Here is the result for a single remote database that subscribes to a single publication, after the first synchronization round-trip has been run:Here's the problem: When you DROP and re-CREATE the MobiLink subscription on the remote database, and re-synchronize, the MobiLink server inserts a new row in ml_subscription on the consolidated database without deleting the old row. Now the simple SELECT gives two rows when only one has any meaning:
Step 2: Correct But Complex
So, you want a SELECT that returns the second row but not the first. This is done by replacing "JOIN ml_subscription" with "JOIN ( big-fancy-nested-SELECTs ) AS ml_subscription"...
JOIN ( SELECT a.user_id,
a.subscription_id,
a.progress,
a.publication_name,
a.last_upload_time,
a.last_download_time
FROM ml_subscription AS a
WHERE a.progress = ( SELECT MAX ( b.progress )
FROM ml_subscription AS b
WHERE b.user_id = a.user_id
AND b.publication_name = a.publication_name ) )
AS ml_subscription
What that JOIN ( SELECT... does is create a "derived table" with columns that look just like the base table ml_subscription, but only contains rows containing the MAX ( progress ) values for each publicaton_name.In particular, the correlated subselect "WHERE a.progress = ( SELECT MAX ( b.progress )..." eliminates the progress = 413161 row, but keeps the progress = 423060.
Here's what the whole thing looks like (the book version, again):
SELECT ml_user.name,
ml_subscription.subscription_id,
ml_subscription.progress,
ml_subscription.publication_name,
ml_subscription.last_upload_time,
ml_subscription.last_download_time
FROM ml_user
JOIN ( SELECT a.user_id,
a.subscription_id,
a.progress,
a.publication_name,
a.last_upload_time,
a.last_download_time
FROM ml_subscription AS a
WHERE a.progress = ( SELECT MAX ( b.progress )
FROM ml_subscription AS b
WHERE b.user_id = a.user_id
AND b.publication_name = a.publication_name ) )
AS ml_subscription
ON ml_user.user_id = ml_subscription.user_id;
When that derived table participates in the outer "FROM ml_user JOIN ( SELECT... ON ml_user.user_id = ml_subscription.user_id", it only gets the progress = 423060 data:Derived tables are wonderful things. They let you develop queries by a "divide and conquer" process, similar to the creation of views where you push complexity down into separate selects and then use the result sets as if they were tables. Unlike a view, however, you don't have to CREATE anything separate with a derived table, just code the separate SELECT right inside the FROM clause.
Another benefit to the divide and conquer process is that it sometimes lets you see further simplifications...
Step 3: Both Simple And Correct
The complex SELECT from Step 2 includes a column that is of absolutely no use: ml_subscription.subscription_id. MobiLink uses that column as an artificial primary key to identify rows that are otherwise duplicates: Different rows that have the same combination user_id and publication_name. For the purposes of administering MobiLink, only one of those rows is of interest (the latest one, with the maximum value of "progress").
So, if you don't display subscription_id, you can get rid of the whole complex derived table, and instead do a MAX ... GROUP BY to find the latest rows, thusly:
SELECT ml_user.name,
ml_subscription.publication_name,
MAX ( ml_subscription.progress ) AS progress,
MAX ( ml_subscription.last_upload_time ) AS last_upload_time,
MAX ( ml_subscription.last_download_time ) AS last_download_time
FROM ml_user
JOIN ml_subscription
ON ml_user.user_id = ml_subscription.user_id
GROUP BY ml_user.name,
ml_subscription.publication_name;
1 comment:
Ahhhhh yes; what was once confounding is now crystal clear. From your Step 2 explanation, I learned about derived tables which I have not used before. And I now understand the business rules behind the complex query. I like the 3rd query because my simple mind doesn't have to work so hard to understand it. And best of all, the simple solution produces the right answer!
You have a gift for teaching. Thanks for taking the time to do so.
Post a Comment