Monday, September 29, 2008

In Praise Of Knobs

Glenn Paulley is one of the politest people I've ever met... I've never known him to explode in anger, and this May 6, 2008 post is probably the closest we'll ever see to a Glenn Paulley Rant:

Additional knobs are NOT the answer!
Here's what Glenn said about manual settings and parameters affecting database performance, also known as knobs:
"Knobs" are expedient, and they can be used as "quick fixes" to workaround performance problems in shipping code. But avoiding the introduction of additional knobs is a worthwhile goal because "knobs" are not in the best interests of our customers.

In my view, adding additional performance knobs of the types described above is simply counterproductive: the answer in the longer term is self-managing database systems.
In summary, Glenn is clearly saying self-management is good and knobs are bad. It might be unfair, but it's easy to infer that Glenn is also claiming knobs are the antithesis of self-management: if your software is self-managing, you won't have any knobs... or at least if you increase your software's level of self-management, you'll have fewer knobs.

One month later, on June 4, 2008 Glenn returned to the topic in this post: What self-management is (and isn’t). This post doesn't mention knobs at all, but sticks to the point "self-management is good".

And he's absolutely right on that point: self-management is good. I also agree that increasing the level of self-management in SQL Anywhere is a worthwhile goal. And I'll go further: progress has been excellent, so far, in making SQL Anywhere self-managing.

But... here's where my rant starts... knobs aren't bad... at least, they're not necessarily bad. In fact, every new self-management feature or improvement added to SQL Anywhere requires at least one new knob: a switch or setting that turns the feature off and returns manual control. This has been true historically, and it's going to continue to be true; otherwise there's going to be big trouble ahead.

Here's an illustration from the aviation world: The B-2 bomber is an excellent example of a self-managing system... if it weren't for the automatic computer controls the thing wouldn't even fly; no human being could manage it.

Back on February 23, 2008, however, the pilots of this B-2 probably wished they had a few more knobs and a bit less "automatic computer control"... it was the computer system that sent it into a stall.

Knobs are not a panacea, but sometimes they are critical, especially for functions that are self-managed. In fact, each new self-management feature will probably require several new knobs, not just one to turn the feature off, but other new knobs to control the behavior of the new feature.

Here's an example from SQL Anywhere, where the story of database cache self-management can be told in the form of...
A Brief History of Knobs

Back in Versions 5.5 and 6 there was no cache self-management, and SQL Anywhere had only one knob affecting the cache:
   -c - Set the cache size
Version 7 added Dynamic Cache Sizing. It also added 3 new knobs for controlling the cache size and one new knob to display it:
   -ca 0    - Disable dynamic cache sizing
   -ch size - Set the upper limit
   -cl size - Set the lower limit
   -cs      - Display cache usage in console window
Version 8 added Address Windowing Extensions and a new knob to turn it on and off:
   -cw - Enable use of AWE on Windows.
Version 9 added Cache Warming, plus another three new knobs:
   -cc {+|-} - Collect information for cache warming
   -cr {+|-} - Warm the cache with database pages
   -cv {+|-} - Controls messages about cache warming
Version 10 added one more knob, bringing the total number of knobs to ten:
   -cm size - Specifies the size of the AWE cache 
For many years, one of those ten knobs was critical to the performance of cache self-management: For versions 7, 8 and 9, if you wanted your database cache to automatically grow past 256M, you had to use the -ch knob to specify a larger value... otherwise, it didn't matter if you had 2G of RAM installed: your cache would stop growing at 256M.
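
If you want to see whether a running server has bumped into that kind of ceiling, you can ask the server itself. Here's a minimal sketch, assuming the CurrentCacheSize, MinCacheSize and MaxCacheSize server properties (which report their values in kilobytes) are available in your version:
SELECT PROPERTY ( 'CurrentCacheSize' ) AS current_cache_in_K,
       PROPERTY ( 'MinCacheSize' )     AS min_cache_in_K,
       PROPERTY ( 'MaxCacheSize' )     AS max_cache_in_K;
-- If current_cache_in_K is stuck at max_cache_in_K, the upper limit
-- (explicit -ch or the default) is what's stopping the cache from growing.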

So here's my point to Glenn: Don't go knocking those knobs! You'll be adding lots more as you increase the level of self-management, whether you like it or not. And they'll be valuable and helpful.

Sunday, September 28, 2008

IvanAnywhere On Space TV

Scroll to the one-minute point in this video to see a short clip on IvanAnywhere:

Friday, September 26, 2008

V11 EBF Coming Soon

The first Express Bug Fix (EBF) for SQL Anywhere Version 11 is coming soon... not yet, but soon, according to Jason Hinsperger writing in the SQL Anywhere "General" newsgroup.

That might be good news for some people, but personally I'm not waiting for any particular fixes... in my opinion Version 11 has been the cleanest General Availability (GA) release in the history of SQL Anywhere. My theory about the reason? All the extra testing getting ready for the TPC-C benchmark.

I wish I could give you a direct link for downloading the latest SQL Anywhere EBF, but the best I can do is tell you to start here, click on SQL Anywhere, and start scrolling... but not yet... soon.

Thursday, September 25, 2008

"There can be only one!"

Here's what Google says about its Blog Search service:

"... when there are entire blogs that seem to be a good match for your query, these will appear in a short list just above the main search results."
Here's what it displays when you search on "SQL Anywhere":



OK, OK, enough ego-surfing... the real news is that there are over 256,000 hits on "SQL Anywhere". Maybe folks are discovering that SQL Anywhere scored 5th in the Top Ten TPC-C by Performance list.

To do your own blog searches, go to blogsearch.google.com.

Tuesday, September 9, 2008

URL Forwarding Via Reverse Proxy

For security reasons, many internet sites only allow access via certain ports, such as port 80 for ordinary HTTP and 443 for secure HTTPS.

For the official list of port numbers see www.iana.org/assignments/port-numbers.
That causes a problem when you want to have two or more web servers on the same IP address. For example, suppose you want all HTTP traffic addressed to your domain name xyz.com to go to your main web server, EXCEPT for traffic addressed to xyz.com/special which should be sent to a different web server.

That was exactly my problem; here are the details:
  • All traffic addressed to risingroad.com, regardless of port number, is sent to the IP address 64.7.134.118.

  • A Linksys router listens to all the traffic on 64.7.134.118.

  • The router forwards all traffic on port 80 to the local IP address 192.168.1.51. In this context, the router is acting like a firewall, and that firewall has port 80 "opened up" for access to a computer running inside the firewall. Some other port numbers are open as well (email, etc) but there are no other open ports available for HTTP traffic.

  • A Microsoft IIS web server is running on a Windows 2003 server at local IP address 192.168.1.51. Its purpose is to process regular HTTP traffic sent to risingroad.com.

  • A SQL Anywhere web server is running on a Windows Vista 64-bit computer at local IP address 192.168.1.52. Its purpose is to process HTTP traffic sent to the specific address risingroad.com/enwiki_search.
The following simple approach comes to mind: Open up a different port, say port 81, and have the router send port 80 traffic to 192.168.1.51 and port 81 traffic to 192.168.1.52. Other companies might not be willing to do that, but I am; why won't that work?

It won't work because other companies don't just block incoming HTTP traffic on any port other than 80; they also block outbound traffic. That means anyone using a browser inside those other companies can't get out to sites that are running HTTP servers on port 81... or even 8080, which is listed as an "HTTP Alternate".

The solution is to use a "reverse proxy" program to forward or redirect incoming HTTP traffic to the appropriate destination. I chose the IQ Reverse Proxy from Fastream Technologies.
If you find the term "reverse proxy" confusing, you're not alone. The word "proxy" might apply since the program is "acting for" HTTP servers at other locations. However, the word "reverse" just serves to obfuscate, especially since the term "port forwarding" applies to traffic going in the same direction as reverse proxy traffic. It may be simpler to think of a "reverse proxy" as a "URL forwarder" or "IP redirector".
Here's how the URL forwarding works:

1. The Linksys router forwards port 80 to 192.168.1.51; ports 8080 and 81 are not open to the outside world:



2. The Fastream IQ reverse proxy program runs on 192.168.1.51 and listens to port 80:



3. The Microsoft IIS HTTP server also runs on 192.168.1.51, but it now listens to port 8080:



4. The SQL Anywhere HTTP server running on 192.168.1.52 is configured to listen to port 81:



5. The two target HTTP servers (IIS and SQL Anywhere) are defined to the Fastream IQ reverse proxy program in terms of their IP addresses and port numbers:
192.168.1.51   8080 - IIS
192.168.1.52   81   - SQL Anywhere


6. Special URL forwarding rules are defined to the Fastream IQ reverse proxy program as follows: one particular URL goes to SQL Anywhere and everything else goes to IIS:
*.//risingroad.com/enwiki_search - goes to 192.168.1.52:81
*.//Default                      - goes to 192.168.1.51:8080


Tip: It may be possible to have Fastream handle URLs with GET-style parameters properly, but I couldn't figure it out; I had to use strict POST method processing for all web service parameters. That means no visible "http://x.y.com?a=...&b=..." parameters on the URLs, just the hidden POST parameters.

Wednesday, September 3, 2008

Loading Wikipedia

This article talks about loading 264 million lines of Wikipedia XML into a SQL Anywhere table, parsing and gathering this text into 7.3 million blobs in another table, and then building a full text index on two columns in that table: the title of the Wikipedia entry and the blob containing the article.

The final result can be seen using a simple Google-style interface based on a SQL Anywhere 11 web service.

Note: This demonstration is no longer available on the internet.


Don't expect Google-like speeds, however, or even the same results... SQL Anywhere's index-based full text search facility is not exactly the same thing as the Google search engine.

The purpose of this exercise is two-fold: To demonstrate the new full text search facility on a non-trivial data set, and to continue exploring ways to use SQL Anywhere to solve programming problems inside the database server. After all, the name of this blog is "SQL Anywhere" so you're not going to see articles about Java or C or Perl or PHP any time soon (unless, of course, it's about writing SQL Anywhere stored procedures in Perl or one of the other languages supported in Version 11 :)

Step 1: Download The Wikipedia File.

Go to this page and download this 3.9G compressed file: enwiki-latest-pages-articles.xml.bz2.

Time to download: Hours and hours, especially when the new Firefox 3 download manager unexpectedly paused the download when I wasn't looking... grrrr!

Step 2: Unzip The Wikipedia File.



Time to unzip: 20m

Here's what the text looks like:



The two interesting elements are the <title> nested within each <page>, plus the <text> element nested within the page <revision> element.

Step 3. Create The Database.

Here are the commands used to initialize the database and start the 64-bit version of dbsrv11.exe on a Core Quad processor with 4G of RAM and 4 hard drives running 64-bit Windows Vista (the lines have been wrapped to fit on screen):
"%SQLANY11%\bin64\dbinit.exe" -p 8192 -s 
-t e:\data\enwiki\enwiki.log
c:\data\enwiki\enwiki.db

"%SQLANY11%\bin64\dbspawn.exe"
-f "%SQLANY11%\bin64\dbsrv11.exe"
@$dbsrv11_enwiki_http_options.txt
To improve I/O performance all the busy files were placed on different physical drives in a non-RAID configuration:
Database:        c:\data\enwiki\enwiki.db
Transaction log: e:\data\enwiki\enwiki.log
Temporary file:  f:\data\enwiki\sqla*.tmp
Input text:      g:\data\enwiki\enwiki-latest-pages-articles.xml
The dbinit -t option is used to specify the transaction log location, and the dbsrv11 -dt option is used (see below) to specify where to put the sqla*.tmp temporary files.

Here are the dbsrv11 options contained in the configuration file $dbsrv11_enwiki_http_options.txt (the -xs line has been wrapped to fit on screen):
-c 3G 
-ch 3G
-dt f:\data\enwiki
-gn 10
-o dbsrv11_log_enwiki.txt
-os 10M
-x tcpip
-xs "http(log=dbsrv11_log_inbound_http_enwiki.txt;
lf=@T - @I - @U - @R - @L - @W - @E;
lsize=1M;dbn=enwiki;port=81;
maxsize=1M;to=600;kto=600)"
-ze
-zl
-zp
-zt
c:\data\enwiki\enwiki.db
The -c and -ch options were used to fix the database cache size at 3G.

The -gn 10 option was used temporarily to optimize server performance for a single busy connection: the dbisql session used to load the Wikipedia data. An even smaller number might have helped more, but there's a rule of thumb that says "Don't set gn lower than 10 if you know what's good for you!"

The -o and -os options specify where to write the database console log file, and when to rename and restart it. Except perhaps for embedded databases and handheld devices, every SQL Anywhere server in the world should specify these two parameters to help diagnose runtime problems.

The -x option speeds server startup by specifying that only tcpip is necessary.

The -xs option tells dbsrv11 to start the builtin HTTP server and how to configure it. The log parameter specifies where to write the HTTP log records, lf specifies what to include in the log, and lsize specifies when to rename and restart it. The dbn parameter must match the name of the database that contains the web services to be used, and the port parameter specifies which port to listen to for HTTP requests. The maxsize parameter specifies the largest HTTP request that will be accepted, and the to and kto parameters are used to increase the HTTP timeout and keep-alive durations.

The -ze, -zl, -zp and -zt options all serve to increase the level of diagnostic tracing performed by the server.

To further increase I/O performance by eliminating fragmentation in the database file, Diskeeper 2008 was run after the following command was used to increase the database file size:
ALTER DBSPACE ADD 50G;
Time to ALTER DBSPACE ADD 50G: 11m 17s.
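
Here's a minimal way to confirm the pre-allocation from SQL, assuming the FileSize and PageSize database properties (FileSize is reported in pages) behave as expected:
SELECT CAST ( DB_PROPERTY ( 'FileSize' ) AS BIGINT )
       * CAST ( DB_PROPERTY ( 'PageSize' ) AS BIGINT )
       / 1024 / 1024 / 1024 AS file_size_in_G;
-- After the ALTER DBSPACE the database file should be roughly 50G larger
-- than it was before, even though most of those pages are still empty.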

Tip: If you have Diskeeper 2008 "Automatic Defragmentation" enabled, you might want to disable it while the Wikipedia loading processes are running. Diskeeper isn't going to cause any errors, but it will happily pound away on the same drives you're trying to use... even if it's working on a different drive, it still might be using the same controller or other resources. You can set a schedule inside Diskeeper to enable and disable automatic defragmentation:



Step 4: Load The Text.

The enwiki_text table stores each line of text as a LONG VARCHAR column, and the line_number column is automatically incremented to preserve the input order. This automatic ordering is achieved with three features: the DEFAULT AUTOINCREMENT clause in the CREATE TABLE statement, plus two clauses in the LOAD TABLE statement. First, the column list "( line_text )" is coded to specifically exclude the line_number column so SQL Anywhere won't expect any values to be present in the input file. Second, the DEFAULTS ON clause instructs LOAD TABLE to actually calculate the DEFAULT AUTOINCREMENT value for each new line_number; by default, LOAD TABLE doesn't apply defaults.
CREATE TABLE enwiki_text (
   line_number BIGINT NOT NULL DEFAULT AUTOINCREMENT
      PRIMARY KEY CLUSTERED,
   line_text   LONG VARCHAR NOT NULL DEFAULT '' );

LOAD TABLE enwiki_text ( line_text )
   FROM 'c:/data/enwiki/enwiki-latest-pages-articles.xml'
   DEFAULTS ON
   DELIMITED BY ''
   ESCAPES OFF
   HEXADECIMAL OFF
   QUOTES OFF
   STRIP OFF;
The other clauses of the LOAD TABLE tell it to accept the input text as-is: don't look for any inter-field delimiters, don't expect any special escape characters or hexadecimal notation or quotes, and don't strip any blanks. This is a standard "what you see is what you get" LOAD TABLE command, very useful for loading raw text without any edits.

Time for LOAD TABLE enwiki_text: 1h 48m.
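
Here's a quick sanity check to run after the load; the 264-million-line figure comes from the introduction, and because line_number is filled by DEFAULT AUTOINCREMENT starting at 1, the two columns below should agree:
SELECT COUNT(*)            AS line_count,
       MAX ( line_number ) AS max_line_number
  FROM enwiki_text;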

Tip: If you don't have four separate disk drives for the four busy files mentioned in Step 3 above, you can put the input text file on the same drive as the database file. However, try to put the temporary file on a separate drive from both the input text file and the database file. The LOAD TABLE process will perform heavy I/O on the temporary file and input text file simultaneously during its first phase of processing, and during its second phase it will work on the temporary file and database file simultaneously. Here's a snapshot of the Windows Vista Resource Monitor during the second phase of LOAD TABLE processing:



Step 5: Find The Pages.

The Wikipedia text consists of a <page> ... </page> element for each one of the 7.3 million pages, preceded and followed by a few other tags that we're not interested in.

Parsing this XML text is not a complex job for SQL Anywhere's OPENXML procedure, but it's a BIG job... too big to do in one statement.

So, we're going to divide and conquer: First, find the first and last lines of each <page> element, and later, pass each <page> element through the OPENXML parser. The enwiki_text_xref table is used to store the first and last line numbers:
CREATE TABLE enwiki_text_xref (
   page_number      BIGINT NOT NULL DEFAULT AUTOINCREMENT
      PRIMARY KEY CLUSTERED,
   from_line_number BIGINT NOT NULL,
   to_line_number   BIGINT NULL,
   CONSTRAINT fk_from
      NOT NULL FOREIGN KEY ( from_line_number )
      REFERENCES enwiki_text ( line_number ),
   CONSTRAINT fk_to
      FOREIGN KEY ( to_line_number )
      REFERENCES enwiki_text ( line_number ) );
Here's how the from_line_number column is filled in for each page:
INSERT enwiki_text_xref ( from_line_number )
SELECT enwiki_text.line_number
  FROM enwiki_text
 WHERE enwiki_text.line_text LIKE '%<page>%'
 ORDER BY enwiki_text.line_number;
Yes, it's a full-table scan using LIKE '%<page>%', but it's just done once, and it's fast:



Time for INSERT enwiki_text_xref: 38m 28.9s.

The second column is a bit trickier: Each row's to_line_number is equal to the next row's from_line_number, minus one. How do you code that in SQL, without an old-fashioned FETCH loop that looks at every single row?

The answer is to use the LAST_VALUE() function introduced in SQL Anywhere 10.0.1:
CREATE VIEW to_line_numbering AS
SELECT page_number,
       ( LAST_VALUE ( from_line_number )
            OVER line_number_window ) - 1
          AS to_line_number
  FROM enwiki_text_xref
       WINDOW line_number_window AS (
          ORDER BY page_number
          RANGE BETWEEN CURRENT ROW
                    AND 1 FOLLOWING );

UPDATE enwiki_text_xref
       INNER JOIN to_line_numbering
               ON to_line_numbering.page_number
                = enwiki_text_xref.page_number
   SET enwiki_text_xref.to_line_number
     = to_line_numbering.to_line_number;
The magic is in the CREATE VIEW, not the UPDATE; the UPDATE simply trusts the view to properly calculate to_line_number.

The SELECT in the CREATE VIEW works as follows: The WINDOW clause names and defines a moving set of rows (a window) for each row in the candidate result set. For each row in enwiki_text_xref, the window contains the CURRENT ROW and the next row (1 FOLLOWING). The ORDER BY clause determines how the rows are ordered for this purpose, and the RANGE clause specifies how the first and last rows in the window are identified.

The LAST_VALUE() aggregate function asks for the last value of the from_line_number column, taken from the WINDOW named in the OVER clause. In English, this is the from_line_number column from the next row... the WINDOW contains this row and the next one, and therefore the "LAST" value from the WINDOW comes from the next row.

The result is reduced by 1 to give the to_line_number of the current row, and that's what the view returns to the UPDATE.
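
If the "next row" trick still seems like magic, here's a tiny throwaway example that mimics the view; the demo table and its three rows are made up, but the WINDOW clause is the same:
CREATE LOCAL TEMPORARY TABLE demo_xref (
   page_number      BIGINT NOT NULL PRIMARY KEY,
   from_line_number BIGINT NOT NULL );

INSERT demo_xref VALUES ( 1, 10 );
INSERT demo_xref VALUES ( 2, 25 );
INSERT demo_xref VALUES ( 3, 40 );

SELECT page_number,
       from_line_number,
       ( LAST_VALUE ( from_line_number )
            OVER line_number_window ) - 1 AS to_line_number
  FROM demo_xref
       WINDOW line_number_window AS (
          ORDER BY page_number
          RANGE BETWEEN CURRENT ROW
                    AND 1 FOLLOWING );
-- Expected result: ( 1, 10, 24 ), ( 2, 25, 39 ), ( 3, 40, 39 )...
-- note the bogus value in the last row; more about that in a moment.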

Time for UPDATE #1 enwiki_text_xref: 15m 35.4s.

You can read about OLAP windows in the Help, and in Glenn Paulley's white paper. You can also see another example of LAST_VALUE() here.

Astute readers will notice that the view doesn't return a proper value of to_line_number for the last row in enwiki_text_xref... it's the last row, so there is no "next row" to use to calculate the LAST_VALUE(). Here's the code to finish things up:
BEGIN
DECLARE @last_page_line_number BIGINT;

SELECT TOP 1 enwiki_text.line_number
  INTO @last_page_line_number
  FROM enwiki_text
 WHERE enwiki_text.line_text LIKE '%</page>%'
 ORDER BY enwiki_text.line_number DESC;

UPDATE TOP 1
       enwiki_text_xref
   SET enwiki_text_xref.to_line_number
     = @last_page_line_number
 ORDER BY page_number DESC;

END;
Time for UPDATE #2 enwiki_text_xref: 0s (yes, zero, even with the LIKE '%</page>%' predicate).

Not only is the additional processing time for filling the enwiki_text_xref table small, but so is the table itself: 390M for enwiki_text_xref versus 22G for enwiki_text.
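
If you want to check table sizes from inside the database rather than by looking at the files, one option (assuming the sa_table_page_usage system procedure is available in your build) is:
CALL sa_table_page_usage();
-- Look for the enwiki_text and enwiki_text_xref rows in the result set
-- to compare how many table and index pages each one occupies.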

Step 6: Parse The Pages.

Here's what the final Wikipedia table looks like, with the important columns being page_title and page_text:
CREATE TABLE enwiki_entry (
   page_number      BIGINT NOT NULL
      PRIMARY KEY CLUSTERED,
   from_line_number BIGINT NOT NULL,
   to_line_number   BIGINT NOT NULL,
   page_title       VARCHAR ( 1000 ) NOT NULL,
   page_text        LONG VARCHAR NOT NULL,
   CONSTRAINT fk_page
      NOT NULL FOREIGN KEY ( page_number )
      REFERENCES enwiki_text_xref ( page_number ),
   CONSTRAINT fk_from
      NOT NULL FOREIGN KEY ( from_line_number )
      REFERENCES enwiki_text ( line_number ),
   CONSTRAINT fk_to
      NOT NULL FOREIGN KEY ( to_line_number )
      REFERENCES enwiki_text ( line_number ) );
Once again, here's what the input text looks like:



Data for the enwiki_entry.page_title column comes from the <title> element nested within each <page>, and the page_text column is filled from the <text> element nested within each page's <revision> element.

The code that fills enwiki_entry (shown after this list) works as follows:
  • A cursor FOR loop is used to process one page at a time, rather than all 7.3 million pages at once.

  • The cursor SELECT uses the enwiki_text_xref table to find the @from and @to rows in enwiki_text for each <page> element.

  • The SELECT LIST() statement gathers the XML text for each page into the LONG VARCHAR variable @page_xml.

  • The INSERT enwiki_entry statement calls OPENXML() to fill the page_title and page_text columns; for a detailed discussion of parsing XML with OPENXML() see OPENXML() Rocks!

  • A COMMIT is issued for every 10,000 pages processed.
BEGIN
DECLARE @page_counter BIGINT;
DECLARE @page_xml LONG VARCHAR;

SET @page_counter = 0;

FOR f_fetch AS c_fetch NO SCROLL CURSOR FOR
SELECT enwiki_text_xref.page_number      AS @page,
       enwiki_text_xref.from_line_number AS @from,
       enwiki_text_xref.to_line_number   AS @to
  FROM enwiki_text_xref
 ORDER BY enwiki_text_xref.page_number
   FOR READ ONLY
DO
   SELECT LIST (
             enwiki_text.line_text,
             '\x0d\x0a'
             ORDER BY enwiki_text.line_number ) AS @page_xml
     INTO @page_xml
     FROM enwiki_text
    WHERE enwiki_text.line_number
          BETWEEN @from
              AND @to;

   SET @page_counter = @page_counter + 1;

   INSERT enwiki_entry (
      page_number,
      from_line_number,
      to_line_number,
      page_title,
      page_text )
   SELECT @page,
          @from,
          @to,
          TRIM ( COALESCE ( page_title, '???' ) ),
          TRIM ( COALESCE ( page_text, '???' ) )
     FROM OPENXML (
             @page_xml,
             '/page' )
          WITH (
             page_title VARCHAR ( 1000 ) 'title',
             page_text LONG VARCHAR 'revision/text' );

   IF MOD ( @page_counter, 10000 ) = 0 THEN
      COMMIT;
   END IF;
END FOR;
COMMIT;
END;
Time to INSERT enwiki_entry: 3h 55m 6s

Tip: If you don't have four separate disk drives for the four busy files mentioned in Step 3 earlier, at least try to put the transaction log on a separate drive from the database file. Here's a snapshot of the Windows Vista Resource Monitor showing how both files are heavily used during the INSERT enwiki_entry process:



Step 7: Index The Pages.

For big, fairly static data sets like snapshot copies of Wikipedia, it's probably most convenient to take full control over the creation of full text indexes. One way to do this is to initially create an empty index using the MANUAL REFRESH clause of CREATE TEXT INDEX, and then later build the index via REFRESH TEXT INDEX.

Here's the code for a full text index on enwiki_entry:
CREATE TEXT INDEX tx_page_text
   ON enwiki_entry ( page_title, page_text )
   MANUAL REFRESH;

REFRESH TEXT INDEX tx_page_text
   ON enwiki_entry
   WITH EXCLUSIVE MODE
   FORCE BUILD;
Time to REFRESH TEXT INDEX tx_page_text: 2h 14m 47s

Tip: When designing a full text index, don't forget to include critical columns, such as "title" or "description" columns that might accompany large document columns in the same table. You can't ALTER a text index to include columns you've forgotten, but you can exclude columns you don't need when doing later searches.
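
For example, here's a sketch of a later search that looks at only the page_title column, even though the text index covers both page_title and page_text (the query string 'Gretzky' is just for illustration):
SELECT TOP 10
       ROUND ( score, 1 )      AS score,
       enwiki_entry.page_title AS title
  FROM enwiki_entry
       CONTAINS ( enwiki_entry.page_title,
                  'Gretzky' )
 ORDER BY score DESC;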

Step 8: Test The Index.

Here's a sample query that uses the full text index to find Wikipedia pages containing both 'Wayne' and 'Gretzky'. The result set is sorted in decreasing order by the full text score, and the second group of 100 rows is displayed:
SELECT TOP 100
       START AT 101
       ROW_NUMBER()
          OVER ( ORDER BY score DESC ) AS entry_number,
       ROUND ( score, 1 )              AS score,
       enwiki_entry.page_title         AS title,
       LEFT ( enwiki_entry.page_text, 500 ) AS excerpt
  FROM enwiki_entry
       CONTAINS ( enwiki_entry.page_title,
                  enwiki_entry.page_text,
                  'Wayne Gretzky' )
 WHERE enwiki_entry.page_text NOT LIKE '#REDIRECT%'
 ORDER BY entry_number;

The WHERE clause eliminates all content-free Wikipedia "redirection" pages. The new ROW_NUMBER() window function is used instead of the old NUMBER(*) function so that clauses like TOP 100 START AT 101 will show entry_number values 101, 102, 103, ... instead of 1, 2, 3, ...
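
To see the difference side by side, here's a sketch that adds NUMBER(*) to the same query (assuming the two functions are allowed to coexist in one select list): NUMBER(*) numbers the rows actually returned, while ROW_NUMBER() is computed over the whole ordered result before TOP ... START AT is applied:
SELECT TOP 3 START AT 101
       NUMBER(*) AS returned_row,                      -- 1, 2, 3
       ROW_NUMBER()
          OVER ( ORDER BY score DESC ) AS entry_number, -- 101, 102, 103
       enwiki_entry.page_title         AS title
  FROM enwiki_entry
       CONTAINS ( enwiki_entry.page_title,
                  enwiki_entry.page_text,
                  'Wayne Gretzky' )
 WHERE enwiki_entry.page_text NOT LIKE '#REDIRECT%'
 ORDER BY entry_number;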