This is the second article about the foundations for client-server communications inside SQL Anywhere, where the server is a website and the client is a stored procedure inside a SQL Anywhere database.
The previous article, Client-Server Via Web Services Inside SQL Anywhere, showed how to build a variety of web client and server functions and procedures, ending with a SQL Anywhere stored procedure that used an HTTP GET request to retrieve the HTML body of an Amazon product web page plus all the associated HTTP status and header fields.
Resources...
- The web-style CREATE PROCEDURE statement as described in the SQL Anywhere 16 Help.
- Descriptions of the various HTTP request methods like GET and HEAD, in Wikipedia.
- Descriptions of the associated HTTP headers, also in Wikipedia.
- Kurt Lichtner's excellent 2005 Techwave presentation on the new HTTP client introduced in SQL Anywhere 9.0.2.... yes, this stuff's been around for a while.
This article delves a bit lower, showing how to build a SQL Anywhere stored procedure that uses HTTP HEAD requests to check the validity of external URLs stored in a database. The database in question is the one used to construct the SQL Anywhere Technical Documents web page, and the motivation is link rot: many of the 1,156 technical document links no longer work, and it's time to find out just how bad the situation is without checking ... each ... link ... by ... hand (hint: it's dire!)
Why not just run a link checker on the web page? Why DIY?
The obvious answer is, that's what this blog is all about! (doing stuff with SQL Anywhere databases)

But... it's one thing to Do It Yourself, it's quite another to DIY without ever seeing an example of the "It" you're trying to "Do".
I can't remember ever having run a "link checker" program, so here goes...
- Step 1, google it: how do I check links on a website page
- First hit: Link Checker - The W3C Markup Validation Service
- Go there, feed it this URL: http://sqlanywhere.blogspot.ca/2012/03/tales-from-doc-face.html
Figure 1: Home page for the W3C Link Checker
- Wait ... for ... a ... long ... time ... Done. Document processed in 651.56 seconds
Figure 2: Processing report displayed by the W3C Link Checker for the Technical Documents page
The situation is dire!
Figure 3 shows that the W3C Link Checker reports 345 "server side problem" links... yup, it's a problem.

Figure 3: Summary of results displayed by the W3C Link Checker for the Technical Documents page
Isn't that all you need to know? Why DIY?
There's no question that the W3C Link Checker does a good job of parsing and checking a web page, and then telling you about all the problems.

What it doesn't do is this:
- The W3C Link Checker doesn't limit itself to the important content, in this case the Technical Documents links. That's understandable: it's parsing a web page, so it doesn't know what's important and what isn't.
- The W3C Link Checker doesn't report the total number of links on the page, or how many were "good" links. You can determine the total by counting the number of "Checking link ..." messages in the output (1,298), then doing some arithmetic to calculate the number of "good" links, but that's a pain in the patootie.
Checking link http://iablog.sybase.com/efarrar/2011/10/is-fuji-really-a-cloud/
HEAD http://iablog.sybase.com/efarrar/2011/10/is-fuji-really-a-cloud/ fetched in 1.20 seconds
Checking link http://www.sybase.com/detail?id=47925
HEAD http://www.sybase.com/detail?id=47925 fetched in 1.30 seconds
... and so on, 1,296 more times
- The W3C Link Checker doesn't provide any mechanism for ad hoc reporting. Sure, you could scan the output, muck around with regular expressions... but I won't :)
- The W3C Link Checker doesn't let you customize the process to, say, automate problem investigation. Yes, yes, it's open source... see cartoon above.
...so, this DIY project will make use of the SQL Anywhere WAITFOR statement to pause between requests.
SQL Anywhere Link Checker: The Code
Here's the basic building block for the SQL Anywhere Link Checker:

CREATE PROCEDURE http_head_attributes (
   IN url LONG VARCHAR )
RESULT (
   attribute LONG VARCHAR,
   value     LONG VARCHAR )
URL '!url'
TYPE 'HTTP:HEAD';
When a single URL is passed to this web-style procedure, SQL Anywhere tries to send an HTTP HEAD request to the web server, and if that request is successful it gets back a variable number of HTTP headers. If the attempt is unsuccessful (and '404 Not Found' counts as unsuccessful), SQL Anywhere raises an exception which can be handled in a BEGIN CATCH block.
A successful call to http_head_attributes will return a two-column attribute and value result set containing the header names and values.
Here are examples of successful and unsuccessful calls:
Tip: The "Body" value is empty, and it should be, but that is not always the case! For example, SELECT * FROM http_head_attributes ( 'http://www.sap.com' ) returns the full Body value, all 28,672 bytes of it. That behavior is rare (other pages on sap.com don't do it) but it's something to keep in mind when issuing an HTTP HEAD request.SELECT * FROM http_head_attributes ( 'http://www.google.com' ); attribute value Status HTTP/1.0 200 OK Body Expires -1 Server gws X-XSS-Protection 1; mode=block Content-Type text/html; charset=ISO-8859-1 P3P CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info." Cache-Control private, max-age=0 X-Frame-Options SAMEORIGIN Date Tue, 11 Jun 2013 23:06:04 GMT Set-Cookie NID=67=Dg4bUl ... etcetera ...UKCUD; expires=Wed, 11-Dec-2013 23:06:04 GMT; path=/; domain=.google.com; HttpOnly SELECT * FROM http_head_attributes ( 'http://www.google.com/unknown' ); HTTP request failed. Status code '404 Not Found' SQLCODE=-983, ODBC 3 State="HY000"
Figure 4 shows where the URLs to be checked come from. One row exists in the article table for each entry displayed on the Technical Documents web page, one row is inserted in the link_check table each time an article is checked by the SQL Anywhere Link Checker, and one row is inserted in the http_header table for each row returned by a call to http_head_attributes:
Figure 4: ER Diagram for the Technical Documents database
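The article doesn't show the actual table definitions, so here is a rough sketch of what Figure 4 implies. The data types, NOT NULL constraints and key declarations are assumptions, and the real article table certainly has more columns than the ones referenced by the code in this article:

-- Assumed definitions, reconstructed from the columns used below;
-- not the actual Technical Documents schema.
CREATE TABLE article (
   artificial_id   INTEGER       NOT NULL PRIMARY KEY,
   title           LONG VARCHAR  NOT NULL,
   url             LONG VARCHAR  NOT NULL,
   order_by_date   DATE          NOT NULL );

CREATE TABLE link_check (
   artificial_id    INTEGER    NOT NULL REFERENCES article ( artificial_id ),
   link_checked_at  TIMESTAMP  NOT NULL,
   PRIMARY KEY ( artificial_id, link_checked_at ) );

CREATE TABLE http_header (
   artificial_id    INTEGER       NOT NULL,
   link_checked_at  TIMESTAMP     NOT NULL,
   attribute        LONG VARCHAR  NOT NULL,
   value            LONG VARCHAR,
   FOREIGN KEY ( artificial_id, link_checked_at )
      REFERENCES link_check ( artificial_id, link_checked_at ) );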
Here's the code for the main link checker procedure:
 1  CREATE PROCEDURE catalog_link_checker()
 2  BEGIN
 3
 4  DECLARE @url_count       INTEGER;
 5  DECLARE @url_counter     INTEGER;
 6  DECLARE @link_checked_at TIMESTAMP;
 7  DECLARE @status          LONG VARCHAR;
 8
 9  SELECT COUNT(*)
10    INTO @url_count
11    FROM article
12   WHERE url LIKE 'http%';
13
14  SET @url_counter = 0;
15
16  TRUNCATE TABLE http_header;
17  TRUNCATE TABLE link_check;
18
19  FOR f_fetch AS c_fetch INSENSITIVE CURSOR FOR
20  SELECT artificial_id AS @artificial_id,
21         url           AS @url
22    FROM article
23   WHERE url LIKE 'http%'
24   ORDER BY order_by_date DESC,
25         title DESC
26  FOR READ ONLY
27  DO
28
29     WAITFOR DELAY '00:00:01';
30
31     SET @url_counter = @url_counter + 1;
32     SET @link_checked_at = CURRENT TIMESTAMP;
33
34     INSERT link_check VALUES ( @artificial_id, @link_checked_at );
35
36     MESSAGE STRING ( 'URL ', @url_counter, ' of ', @url_count, ': ', @url ) TO CONSOLE;
37
38     BEGIN TRY
39
40        INSERT http_header
41        SELECT @artificial_id,
42               @link_checked_at,
43               attribute,
44               value
45          FROM http_head_attributes ( @url );
46
47        SET @status = '(unknown)';
48
49        SELECT value
50          INTO @status
51          FROM http_header
52         WHERE artificial_id   = @artificial_id
53           AND link_checked_at = @link_checked_at
54           AND attribute       = 'Status';
55
56        MESSAGE STRING ( 'Status: ', @status ) TO CONSOLE;
57
58     END TRY
59     BEGIN CATCH
60
61        MESSAGE STRING ( 'ERROR_MESSAGE() = ', ERROR_MESSAGE() ) TO CONSOLE;
62        MESSAGE STRING ( 'ERROR_SQLCODE() = ', ERROR_SQLCODE() ) TO CONSOLE;
63        MESSAGE STRING ( 'ERROR_SQLSTATE() = ', ERROR_SQLSTATE() ) TO CONSOLE;
64
65        INSERT http_header VALUES ( @artificial_id, @link_checked_at, 'ERROR_MESSAGE()',  ERROR_MESSAGE() );
66        INSERT http_header VALUES ( @artificial_id, @link_checked_at, 'ERROR_SQLCODE()',  ERROR_SQLCODE() );
67        INSERT http_header VALUES ( @artificial_id, @link_checked_at, 'ERROR_SQLSTATE()', ERROR_SQLSTATE() );
68
69     END CATCH;
70
71     COMMIT;
72
73  END FOR;
74
75  END;
- The FOR loop starting on line 19 steps through each row of the article table in the same order the entries are displayed on the web page, and selects just the primary key and URL values.
- The WAITFOR DELAY statement on line 29 is a crude implementation of the "Sleeping 1 second between requests to each server" suggestion taken from Figure 2. It's crude because this code sleeps between all requests, not just consecutive requests to the same server, and that means this code will take a bit longer to run than the W3C Link Checker; a rough per-server alternative is sketched after this list.
- The INSERT on line 34 records the fact that this URL was checked on such-and-such a date and time. This level of record-keeping isn't important right now (hence the TRUNCATE TABLE statements on lines 16 and 17) but it might be in the future if someone wanted to know when a particular document was available and when it wasn't, similar to the "Retrieved" dates that Wikipedia shows in the References sections.
- The INSERT on lines 40 to 45 copies all the headers from the call to http_head_attributes into the http_header table. If the call fails, control passes to the BEGIN CATCH block on line 59, and three surrogate rows are inserted for the three basic SQL Anywhere exception diagnostic values: ERROR_MESSAGE(), ERROR_SQLCODE() and ERROR_SQLSTATE().
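As promised above, here's a rough sketch (not part of the original procedure) of how the loop body could sleep only between consecutive requests to the same host. The @host and @previous_host variables are hypothetical additions that would need their own DECLARE statements, and the host name is extracted with simple string functions rather than a real URL parser:

   -- Hypothetical per-server delay: only sleep when this URL points at
   -- the same host as the previous one.
   SET @host = CASE
                  WHEN LOCATE ( @url, '/', 9 ) > 0   -- first '/' after 'http://' or 'https://'
                  THEN LEFT ( @url, LOCATE ( @url, '/', 9 ) - 1 )
                  ELSE @url
               END;

   IF @host = @previous_host THEN
      WAITFOR DELAY '00:00:01';
   END IF;

   SET @previous_host = @host;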
Figure 5 shows the link checker procedure being run in ISQL, and Figure 6 shows a few snippets from the dbsrv16 -o filespec.txt diagnostic log containing some URLs that worked (Status: 200 OK) and some that didn't (SQLCODE -981).
Figure 5: Running the Link Checker in ISQL
Figure 6: Link Checker MESSAGE TO CONSOLE Output
I. 06/14 04:39:15. URL 1 of 1156: http://download.sybase.com/ianywhere/documents/UsingCrystalWithSQLAnywhere_FINAL.pdf
I. 06/14 04:39:15. Status: HTTP/1.1 200 OK
I. 06/14 04:39:16. URL 2 of 1156: http://sqlanywhere.blogspot.ca/2012/10/example-recursive-union-inverted-order.html
I. 06/14 04:39:16. Status: HTTP/1.0 200 OK
I. 06/14 04:39:17. URL 3 of 1156: http://sqlanywhere.blogspot.ca/2012/10/example-recursive-union-tree-traversal.html
I. 06/14 04:39:17. Status: HTTP/1.0 200 OK
...
I. 06/14 04:40:10. URL 38 of 1156: http://iablog.sybase.com/paulley/2012/03/limitations-of-proxy-tables/
I. 06/14 04:40:11. ERROR_MESSAGE() = Unable to connect to the remote host specified by 'http://iablog.sybase.com/paulley/2012/03/limitations-of-proxy-tables/'
I. 06/14 04:40:11. ERROR_SQLCODE() = -981
I. 06/14 04:40:11. ERROR_SQLSTATE() = WW050
I. 06/14 04:40:12. URL 39 of 1156: http://iablog.sybase.com/paulley/2012/03/nhibernate-3-3-0cr1-released/
I. 06/14 04:40:13. ERROR_MESSAGE() = Unable to connect to the remote host specified by 'http://iablog.sybase.com/paulley/2012/03/nhibernate-3-3-0cr1-released/'
I. 06/14 04:40:13. ERROR_SQLCODE() = -981
I. 06/14 04:40:13. ERROR_SQLSTATE() = WW050
...
I. 06/14 05:13:53. URL 1154 of 1156: http://www.sybase.com/detail?id=47851
I. 06/14 05:13:55. Status: HTTP/1.1 200 OK
I. 06/14 05:13:56. URL 1155 of 1156: http://www.sybase.com/detail?id=20313
I. 06/14 05:13:56. Status: HTTP/1.1 200 OK
I. 06/14 05:13:57. URL 1156 of 1156: http://www.sybase.com/detail?id=10915
I. 06/14 05:13:57. Status: HTTP/1.1 200 OK
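For the record, the Figure 6 snippets come from the server's console log; the dbsrv16 -o option copies everything written to the console, including the MESSAGE ... TO CONSOLE output, to a text file. A typical command line might look like this (the database file name is made up for illustration):

dbsrv16 -o console_log.txt technical_documents.db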
Here's an ad hoc query that summarizes the "dire situation" by classifying all the articles by the status of their URLs:
SELECT '1. Articles With Good Links:' AS title,
       COUNT(*)                       AS count
  FROM article
         INNER JOIN link_check
         ON link_check.artificial_id = article.artificial_id
         INNER JOIN ( SELECT *
                        FROM http_header
                       WHERE TRIM ( attribute ) = 'Status'
                         AND TRIM ( value ) LIKE 'HTTP/% 200 OK' ) AS http_header
         ON http_header.artificial_id   = link_check.artificial_id
        AND http_header.link_checked_at = link_check.link_checked_at
UNION ALL
SELECT '2. Articles Without Links:',
       COUNT(*)
  FROM article
 WHERE url NOT LIKE 'http%'
UNION ALL
SELECT '3. Rotted iAnywhere Blog Links:',
       COUNT(*)
  FROM article
         INNER JOIN link_check
         ON link_check.artificial_id = article.artificial_id
         INNER JOIN ( SELECT *
                        FROM http_header
                       WHERE TRIM ( attribute ) = 'ERROR_MESSAGE()'
                         AND TRIM ( value ) LIKE 'Unable to connect to the remote host specified by ''http://iablog.sybase.com%' ) AS http_header
         ON http_header.artificial_id   = link_check.artificial_id
        AND http_header.link_checked_at = link_check.link_checked_at
UNION ALL
SELECT '4. Invalid Response Links:',
       COUNT(*)
  FROM article
         INNER JOIN link_check
         ON link_check.artificial_id = article.artificial_id
         INNER JOIN ( SELECT *
                        FROM http_header
                       WHERE TRIM ( attribute ) = 'ERROR_MESSAGE()'
                         AND TRIM ( value ) = 'Invalid response from the HTTP server' ) AS http_header
         ON http_header.artificial_id   = link_check.artificial_id
        AND http_header.link_checked_at = link_check.link_checked_at
UNION ALL
SELECT '5. Other:',
       COUNT(*)
  FROM article
         INNER JOIN link_check
         ON link_check.artificial_id = article.artificial_id
 WHERE NOT EXISTS ( SELECT *
                      FROM http_header
                     WHERE http_header.artificial_id = article.artificial_id
                       AND ( (     TRIM ( attribute ) = 'Status'
                               AND TRIM ( value ) LIKE 'HTTP/% 200 OK' )
                          OR (     TRIM ( attribute ) = 'ERROR_MESSAGE()'
                               AND TRIM ( value ) LIKE 'Unable to connect%http://iablog.sybase.com%' )
                          OR (     TRIM ( attribute ) = 'ERROR_MESSAGE()'
                               AND TRIM ( value ) = 'Invalid response from the HTTP server' ) ) )
UNION ALL
SELECT '6. Total Articles:',
       COUNT(*)
  FROM article
 ORDER BY 1;

title                            count
-------------------------------  -----------
1. Articles With Good Links:             791
2. Articles Without Links:                18
3. Rotted iAnywhere Blog Links:          345
4. Invalid Response Links:                18
5. Other:                                  2
6. Total Articles:                      1174

Execution time: 0.095 seconds
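Because all the headers and error diagnostics end up in ordinary tables, drill-down reporting is just more SQL. For example, here's a sketch of a query (not from the original article) that lists the individual articles whose link checks failed, together with the error message that was recorded:

SELECT article.artificial_id,
       article.title,
       article.url,
       http_header.value AS error_message
  FROM article
         INNER JOIN link_check
         ON link_check.artificial_id = article.artificial_id
         INNER JOIN http_header
         ON http_header.artificial_id   = link_check.artificial_id
        AND http_header.link_checked_at = link_check.link_checked_at
 WHERE TRIM ( http_header.attribute ) = 'ERROR_MESSAGE()'
 ORDER BY article.url;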
What's Going To Happen To The Technical Documents Page?
As the summary shows, only (!) 29% of the links have rotted away (the iAnywhere blog posts on sybase.com). Some of them have been moved to sap.com, but it's not clear how many, nor is it clear how to map the old URLs to the new ones... it's probably a long ... slow ... manual ... process to make the repairs, which is why it hasn't been started yet.

And it won't be, at least not until the fate of the remaining documents on sybase.com is determined. There are literally hundreds of whitepapers, Techwave presentations, Insider magazine articles and so on. Chances are, a few will move, many will be abandoned, and all links to sybase.com will rot away.
As the rot rate rises, enthusiasm sinks, and at some point (40 percent? 50? 60?), the link to the Technical Documents page itself will rot away.
Like I said, the situation is dire, but only for those who care :)
1 comment:
"Rotted iAnywhere Blog Links" are a problem for several SA forum pages, as well, which relate to them.
I hope SAP will help to re-locate those. Yes, that's just hope.
At least Volker does care.