Wednesday, September 3, 2008

Loading Wikipedia

This article talks about loading 264 million lines of Wikipedia XML into a SQL Anywhere table, parsing and gathering this text into 7.3 million blobs in another table, and then building a full text index on two columns in that table: the title of the Wikipedia entry and the blob containing the article.

The final result can be seen using a simple Google-style interface based on a SQL Anywhere 11 web service.

Note: This demonstration is no longer available on the internet.


Don't expect Google-like speeds, however, or even the same results... SQL Anywhere's index-based full text search facility is not exactly the same thing as the Google search engine.

The purpose of this exercise is two-fold: to demonstrate the new full text search facility on a non-trivial data set, and to continue exploring ways to use SQL Anywhere to solve programming problems inside the database server. After all, the name of this blog is "SQL Anywhere", so you're not going to see articles about Java or C or Perl or PHP any time soon (unless, of course, it's about writing SQL Anywhere stored procedures in Perl or one of the other languages supported in Version 11 :)

Step 1: Download The Wikipedia File.

Go to this page and download this 3.9G compressed file: enwiki-latest-pages-articles.xml.bz2.

Time to download: Hours and hours, especially when the new Firefox 3 download manager unexpectedly paused the download when I wasn't looking... grrrr!

Step 2: Unzip The Wikipedia File.



Time to unzip: 20m

Here's what the text looks like:



The two interesting elements are the <title> nested within each <page>, plus the <text> element nested within the page <revision> element.

Step 3: Create The Database.

Here are the commands used to initialize the database and start the 64-bit version of dbsrv11.exe on a Core Quad processor with 4G of RAM and 4 hard drives running 64-bit Windows Vista (the lines have been wrapped to fit on screen):
"%SQLANY11%\bin64\dbinit.exe" -p 8192 -s 
-t e:\data\enwiki\enwiki.log
c:\data\enwiki\enwiki.db

"%SQLANY11%\bin64\dbspawn.exe"
-f "%SQLANY11%\bin64\dbsrv11.exe"
@$dbsrv11_enwiki_http_options.txt
To improve I/O performance all the busy files were placed on different physical drives in a non-RAID configuration:
Database:        c:\data\enwiki\enwiki.db
Transaction log: e:\data\enwiki\enwiki.log
Temporary file:  f:\data\enwiki\sqla*.tmp
Input text:      g:\data\enwiki\enwiki-latest-pages-articles.xml
The dbinit -t option is used to specify the transaction log location, and the dbsrv11 -dt option is used (see below) to specify where to put the sqla*.tmp temporary files.

Here are the dbsrv11 options contained in the configuration file $dbsrv11_enwiki_http_options.txt (the -xs line has been wrapped to fit on screen):
-c 3G 
-ch 3G
-dt f:\data\enwiki
-gn 10
-o dbsrv11_log_enwiki.txt
-os 10M
-x tcpip
-xs "http(log=dbsrv11_log_inbound_http_enwiki.txt;
lf=@T - @I - @U - @R - @L - @W - @E;
lsize=1M;dbn=enwiki;port=81;
maxsize=1M;to=600;kto=600)"
-ze
-zl
-zp
-zt
c:\data\enwiki\enwiki.db
The -c and -ch options were used to fix the database cache size at 3G.

The -gn 10 option was used temporarily to optimize server performance for a single busy connection: the dbisql session used to load the Wikipedia data. An even smaller number might have helped more, but there's a rule of thumb that says "Don't set gn lower than 10 if you know what's good for you!"

The -o and -os options specify where to write the database console log file, and when to rename and restart it. Except perhaps for embedded databases and handheld devices, every SQL Anywhere server in the world should specify these two parameters to help diagnose runtime problems.

The -x option speeds server startup by specifying that only tcpip is necessary.

The -xs option tells dbsrv11 to start the built-in HTTP server and how to configure it. The log parameter specifies where to write the HTTP log records, lf specifies what to include in the log, and lsize specifies when to rename and restart it. The dbn parameter must match the name of the database that contains the web services to be used, and the port parameter specifies which port to listen on for HTTP requests. The maxsize parameter specifies the largest HTTP request that will be accepted, and the to and kto parameters are used to increase the HTTP timeout and keep-alive durations.

The -ze, -zl, -zp and -zt options all serve to increase the level of diagnostic tracing performed by the server.

To further increase I/O performance by eliminating fragmentation in the database file, Diskeeper 2008 was run after the following command was used to increase the database file size:
ALTER DBSPACE ADD 50G;
Time to ALTER DBSPACE ADD 50G: 11m 17s.

Tip: If you have Diskeeper 2008 "Automatic Defragmentation" enabled, you might want to disable it while the Wikipedia loading processes are running. Diskeeper isn't going to cause any errors, but it will happily pound away on the same drives you're trying to use... even if it's working on a different drive, it still might be using the same controller or other resources. You can set a schedule inside Diskeeper to enable and disable automatic defragmentation:



Step 4: Load The Text.

The enwiki_text table stores each line of text as a LONG VARCHAR column, and the line_number column is automatically incremented to preserve the input order. This automatic ordering is achieved by using three features: the DEFAULT AUTOINCREMENT clause in the CREATE TABLE statement, plus two clauses in the LOAD TABLE statement. First, the column list "( line_text )" is coded to specifically exclude the line_number column, so SQL Anywhere won't expect any line_number values to be present in the input file. Second, the DEFAULTS ON clause instructs LOAD TABLE to actually calculate the DEFAULT AUTOINCREMENT value for each new line_number; by default, LOAD TABLE doesn't apply defaults.
CREATE TABLE enwiki_text (
line_number BIGINT NOT NULL DEFAULT AUTOINCREMENT
PRIMARY KEY CLUSTERED,
line_text LONG VARCHAR NOT NULL DEFAULT '' );

LOAD TABLE enwiki_text ( line_text )
FROM 'c:/data/enwiki/enwiki-latest-pages-articles.xml'
DEFAULTS ON
DELIMITED BY ''
ESCAPES OFF
HEXADECIMAL OFF
QUOTES OFF
STRIP OFF;
The other clauses of the LOAD TABLE tell it to accept the input text as-is: don't look for any inter-field delimiters, don't expect any special escape characters or hexadecimal notation or quotes, and don't strip any blanks. This is a standard "what you see is what you get" LOAD TABLE command, very useful for loading raw text without any edits.

Time for LOAD TABLE enwiki_text: 1h 48m.
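
If you want a quick sanity check after the load (not something from the original run, just a sketch), the row count and the highest line_number should both land near the 264 million input lines mentioned at the top:
-- Both values should be close to 264 million for this snapshot.
SELECT COUNT ( * ) AS loaded_lines,
       MAX ( line_number ) AS last_line_number
  FROM enwiki_text;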

Tip: If you don't have four separate disk drives for the four busy files mentioned in Step 3 above, you can put the input text file on the same drive as the database file. However, try to put the temporary file on a separate drive from both the input text file and the database file. The LOAD TABLE process will perform heavy I/O on the temporary file and input text file simultaneously during its first phase of processing, and during its second phase it will work on the temporary file and database file simultaneously. Here's a snapshot of the Windows Vista Resource Monitor during the second phase of LOAD TABLE processing:



Step 5: Find The Pages.

The Wikipedia text consists of a <page> ... </page> element for each one of the 7.3 million pages, preceded and followed by a few other tags that we're not interested in.

Parsing this XML text is not a complex job for SQL Anywhere's OPENXML procedure, but it's a BIG job... too big to do in one statement.

So, we're going to divide and conquer: First, find the first and last lines of each <page> element, and later, pass each <page> element through the OPENXML parser. The enwiki_text_xref table is used to store the first and last line numbers:
CREATE TABLE enwiki_text_xref ( 
page_number BIGINT NOT NULL DEFAULT AUTOINCREMENT
PRIMARY KEY CLUSTERED,
from_line_number BIGINT NOT NULL,
to_line_number BIGINT NULL,
CONSTRAINT fk_from
NOT NULL FOREIGN KEY ( from_line_number )
REFERENCES enwiki_text ( line_number ),
CONSTRAINT fk_to
FOREIGN KEY ( to_line_number )
REFERENCES enwiki_text ( line_number ) );
Here's how the from_line_number column is filled in for each page:
INSERT enwiki_text_xref ( from_line_number )
SELECT enwiki_text.line_number
FROM enwiki_text
WHERE enwiki_text.line_text LIKE '%<page>%'
ORDER BY enwiki_text.line_number;
Yes, it's a full-table scan using LIKE '%<page>%', but it's just done once, and it's fast:



Time for INSERT enwiki_text_xref: 38m 28.9s.

The second column is a bit trickier: Each row's to_line_number is equal to the next row's from_line_number, minus one. How do you code that in SQL, without an old-fashioned FETCH loop that looks at every single row?

The answer is to use the LAST_VALUE() function introduced in SQL Anywhere 10.0.1:
CREATE VIEW to_line_numbering AS
SELECT page_number,
( LAST_VALUE ( from_line_number )
OVER line_number_window ) - 1
AS to_line_number
FROM enwiki_text_xref
WINDOW line_number_window AS (
ORDER BY page_number
RANGE BETWEEN CURRENT ROW
AND 1 FOLLOWING );

UPDATE enwiki_text_xref
INNER JOIN to_line_numbering
ON to_line_numbering.page_number
= enwiki_text_xref.page_number
SET enwiki_text_xref.to_line_number
= to_line_numbering.to_line_number;
The magic is in the CREATE VIEW, not the UPDATE; the UPDATE simply trusts the view to properly calculate to_line_number.

The SELECT in the CREATE VIEW works as follows: The WINDOW clause names and defines a moving set of rows (a window) for each row in the candidate result set. For each row in enwiki_text_xref, the window contains the CURRENT ROW and the next row (1 FOLLOWING). The ORDER BY clause determines how the rows are ordered for this purpose, and the RANGE clause specifies how the first and last rows in the window are identified.

The LAST_VALUE() aggregate function asks for the last value of the from_line_number column, taken from the WINDOW named in the OVER clause. In English, this is the from_line_number value from the next row... that's what the WINDOW contains, this row and the next one, and therefore the "LAST" value in the WINDOW comes from the next row.

The result is reduced by 1 to give the to_line_number of the current row, and that's what the view returns to the UPDATE.
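
To make the arithmetic concrete, here's a tiny throwaway test of the same window technique (the table and numbers are invented purely for illustration):
CREATE LOCAL TEMPORARY TABLE demo_xref (
   page_number      BIGINT NOT NULL PRIMARY KEY,
   from_line_number BIGINT NOT NULL )
   ON COMMIT PRESERVE ROWS;

INSERT demo_xref VALUES ( 1, 20 );
INSERT demo_xref VALUES ( 2, 57 );
INSERT demo_xref VALUES ( 3, 102 );

SELECT page_number,
       ( LAST_VALUE ( from_line_number )
         OVER line_number_window ) - 1 AS to_line_number
  FROM demo_xref
WINDOW line_number_window AS (
       ORDER BY page_number
       RANGE BETWEEN CURRENT ROW
                 AND 1 FOLLOWING )
 ORDER BY page_number;
-- Returns ( 1, 56 ), ( 2, 101 ) and ( 3, 101 ): the windows for rows 1
-- and 2 reach into the next row, but the last row has no next row, a
-- special case dealt with below.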

Time for UPDATE #1 enwiki_text_xref: 15m 35.4s.

You can read about OLAP windows in the Help, and in Glenn Paulley's white paper. You can also see another example of LAST_VALUE() here.

Astute readers will notice that the view doesn't return a proper value of to_line_number for the last row in enwiki_text_xref... it's the last row, so there is no "next row" to use to calculate the LAST_VALUE(). Here's the code to finish things up:
BEGIN
DECLARE @last_page_line_number BIGINT;

SELECT TOP 1 enwiki_text.line_number
INTO @last_page_line_number
FROM enwiki_text
WHERE enwiki_text.line_text LIKE '%</page>%'
ORDER BY enwiki_text.line_number DESC;

UPDATE TOP 1
enwiki_text_xref
SET enwiki_text_xref.to_line_number
= @last_page_line_number
ORDER BY page_number DESC;

END;
Time for UPDATE #2 enwiki_text_xref: 0s (yes, zero, even with the LIKE '%</page>%' predicate).

Not only is the additional processing time for filling the enwiki_text_xref table small, but so is the table itself: 390M for enwiki_text_xref versus 22G for enwiki_text.
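
One way to check sizes like these from inside the server (just a sketch, not something done as part of the original run) is the sa_table_page_usage() system procedure; be warned that it examines the whole database, so it takes a while on a file this large:
-- Report page usage for every table; look for enwiki_text and
-- enwiki_text_xref in the result set.
SELECT *
  FROM sa_table_page_usage()
 ORDER BY 1;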

Step 6: Parse The Pages.

Here's what the final Wikipedia table looks like, with the important columns being page_title and page_text:
CREATE TABLE enwiki_entry ( 
page_number BIGINT NOT NULL
PRIMARY KEY CLUSTERED,
from_line_number BIGINT NOT NULL,
to_line_number BIGINT NOT NULL,
page_title VARCHAR ( 1000 ) NOT NULL,
page_text LONG VARCHAR NOT NULL,
CONSTRAINT fk_page
NOT NULL FOREIGN KEY ( page_number )
REFERENCES enwiki_text_xref ( page_number ),
CONSTRAINT fk_from
NOT NULL FOREIGN KEY ( from_line_number )
REFERENCES enwiki_text ( line_number ),
CONSTRAINT fk_to
NOT NULL FOREIGN KEY ( to_line_number )
REFERENCES enwiki_text ( line_number ) );
Once again, here's what the input text looks like:



Data for the enwiki_entry.page_title column comes from the <title> element nested within each <page>, and the page_text column is filled from the <text> element nested within each page's <revision> element.

Here's how the code that fills enwiki_entry works:
  • A cursor FOR loop is used to process one page at a time, rather than all 7.3 million pages at once.

  • The cursor SELECT uses the enwiki_text_xref table to find the @from and @to rows in enwiki_text for each <page> element.

  • The SELECT LIST() statement gathers the XML text for each page into the LONG VARCHAR variable @page_xml.

  • The INSERT enwiki_entry statement calls OPENXML() to fill the page_title and page_text columns; for a detailed discussion of parsing XML with OPENXML() see OPENXML() Rocks!

  • A COMMIT is issued for every 10,000 pages processed.
BEGIN
DECLARE @page_counter BIGINT;
DECLARE @page_xml LONG VARCHAR;

SET @page_counter = 0;

FOR f_fetch AS c_fetch NO SCROLL CURSOR FOR
SELECT enwiki_text_xref.page_number AS @page,
enwiki_text_xref.from_line_number AS @from,
enwiki_text_xref.to_line_number AS @to
FROM enwiki_text_xref
ORDER BY enwiki_text_xref.page_number
FOR READ ONLY
DO
SELECT LIST (
enwiki_text.line_text,
'\x0d\x0a'
ORDER BY enwiki_text.line_number ) AS @page_xml
INTO @page_xml
FROM enwiki_text
WHERE enwiki_text.line_number
BETWEEN @from
AND @to;

SET @page_counter = @page_counter + 1;

INSERT enwiki_entry (
page_number,
from_line_number,
to_line_number,
page_title,
page_text )
SELECT @page,
@from,
@to,
TRIM ( COALESCE ( page_title, '???' ) ),
TRIM ( COALESCE ( page_text, '???' ) )
FROM OPENXML (
@page_xml,
'/page' )
WITH (
page_title VARCHAR ( 1000 ) 'title',
page_text LONG VARCHAR 'revision/text' );

IF MOD ( @page_counter, 10000 ) = 0 THEN
COMMIT;
END IF;
END FOR;
COMMIT;
END;
Time to INSERT enwiki_entry: 3h 55m 6s
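
As a quick quality check at this point (a sketch, not part of the original run): the INSERT substitutes '???' whenever OPENXML() doesn't find a <title> or <text> element, so counting those placeholders shows how many pages failed to parse cleanly:
-- Count pages where OPENXML() didn't find the expected elements.
SELECT COUNT ( * ) AS pages_with_missing_elements
  FROM enwiki_entry
 WHERE page_title = '???'
    OR page_text = '???';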

Tip: If you don't have four separate disk drives for the four busy files mentioned in Step 3 earlier, at least try to put the transaction log on a separate drive from the database file. Here's a snapshot of the Windows Vista Resource Monitor showing how both files are heavily used during the INSERT enwiki_entry process:



Step 7: Index The Pages.

For big, fairly static data sets like snapshot copies of Wikipedia, it's probably most convenient to take full control over the creation of full text indexes. One way to do this is to initially create an empty index using the MANUAL REFRESH clause of CREATE TEXT INDEX, and then later build the index via REFRESH TEXT INDEX.

Here's the code for a full text index on enwiki_entry:
CREATE TEXT INDEX tx_page_text 
ON enwiki_entry ( page_title, page_text )
MANUAL REFRESH;

REFRESH TEXT INDEX tx_page_text
ON enwiki_entry
WITH EXCLUSIVE MODE
FORCE BUILD;
Time to REFRESH TEXT INDEX tx_page_text: 2h 14m 47s

Tip: When designing a full text index, don't forget to include critical columns, such as "title" or "description" columns that might accompany large document columns in the same table. You can't ALTER a text index to include columns you've forgotten, but you can exclude columns you don't need when doing later searches.
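
For example, even though tx_page_text covers both columns, a later search can name just the one it cares about; here's a sketch of a title-only search using the same table and index (not a query from the original article):
-- Search only the page_title column of the two-column text index.
SELECT TOP 10
       ROUND ( score, 1 ) AS title_score,
       enwiki_entry.page_title AS title
  FROM enwiki_entry
       CONTAINS ( enwiki_entry.page_title, 'Gretzky' )
 ORDER BY score DESC;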

Step 8: Test The Index.

Here's a sample query that uses the full text index to find Wikipedia pages containing both 'Wayne' and 'Gretzky'. The result set is sorted in decreasing order by the full text score, and the second group of 100 rows is displayed:
SELECT TOP 100
START AT 101
ROW_NUMBER()
OVER ( ORDER BY score DESC ) AS entry_number,
ROUND ( score, 1 ) AS score,
enwiki_entry.page_title AS title,
LEFT ( enwiki_entry.page_text, 500 ) AS excerpt
FROM enwiki_entry
CONTAINS ( enwiki_entry.page_title,
enwiki_entry.page_text,
'Wayne Gretzky' )
WHERE enwiki_entry.page_text NOT LIKE '#REDIRECT%'
ORDER BY entry_number;

The WHERE clause eliminates all content-free Wikipedia "redirection" pages. The new ROW_NUMBER() window function is used instead of the old NUMBER(*) function so that clauses like TOP 100 START AT 101 will show entry_number values 101, 102, 103, ... instead of 1, 2, 3, ...

3 comments:

Anonymous said...

Breck,

Thank you for sharing the details of this project. Very educational!

Do you also plan to show the implementation of the web service behind http://www.risingroad.com/enwiki_search? It would make the end-to-end example perfectly complete.

Separately, while the use of LAST_VALUE and WINDOW to populate enwiki_text_xref.to_line_number is a good demonstration for these features, I wonder if it's really necessary. If, as you say, each row's to_line_number is equal to the next row's from_line_number, minus one, could something like the following work?

UPDATE enwiki_text_xref this_page
INNER JOIN enwiki_text_xref next_page
ON this_page.page_number
= next_page.page_number - 1
SET this_page.to_line_number
= next_page.from_line_number - 1;

or:

UPDATE enwiki_text_xref this_page
SET this_page.to_line_number
= ( SELECT next_page.from_line_number - 1
FROM enwiki_text_xref next_page
WHERE this_page.page_number
= next_page.page_number - 1 );

Is this a question of performance? Does the window function in this case work much faster than either the self-join or the sub-query?

Thanks again for posting this material.

Breck Carter said...

Yuliy: Your UPDATE works fine, and it is five times faster than my version using LAST_VALUE... even when I call sa_flush_cache() first. Plus, your version is simpler and easier to understand. In other words, it's a better solution no matter how you look at it!

I was going to call this article "part 1 of 2" but I don't like promising anything that hasn't been written yet. Yes, I would like to write about the web service, but first I'd like to have another look at the code. In the meantime, you can actually *use* it by following the link near the top of the article.

Again, thanks for the suggestion!

Breck

Breck Carter said...

First, a correction to the previous comment: No, you can't actually follow the link any more, that physical server is no longer operational.

Second, the Steps were numbered incorrectly, now they are fixed.