Monday, December 31, 2018

WITH FILE NAME LOGGING runs faster, but...

The LOAD TABLE statement has been a feature of SQL Anywhere since the early days, and the Version 5.5 Help was enthusiastic about it: "LOAD TABLE is more efficient than the ISQL statement INPUT and can be called from any client application."

The official enthusiasm had waned a bit by Version 6: "LOAD TABLE is intended solely for fast loading of large amounts of data. It is not intended for routine use in applications."

The Dark Side of LOAD TABLE was introduced in Version 7: "Inserts are not recorded in the log file. Thus, the inserted rows may not be recovered in the event of a crash. In addition, the LOAD TABLE statement should never be used in databases involved in SQL Remote replication or databases used as MobiLink clients because these technologies replicate changes through analysis of the log file."

Versions 8, 9 and 10 were positively shrill on the subject: "Inserts are not recorded in the log file, raising the risk that data will be lost in the event of a crash and making this statement unusable with SQL Remote or with MobiLink remote databases."

Version 11, however, dropped the scary rhetoric and introduced three new LOAD TABLE clauses: WITH FILE NAME LOGGING, WITH ROW LOGGING and WITH CONTENT LOGGING.
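
For example, here's a minimal sketch of the three clauses side by side; the table name and file specification are hypothetical, and omitting the clause altogether gives you WITH FILE NAME LOGGING:
LOAD TABLE t FROM 'C:\\SOMEWHERE\\t.txt' WITH FILE NAME LOGGING; -- only the input file spec is written to the log
LOAD TABLE t FROM 'C:\\SOMEWHERE\\t.txt' WITH ROW LOGGING;       -- equivalent INSERTs are written to the log, row by row
LOAD TABLE t FROM 'C:\\SOMEWHERE\\t.txt' WITH CONTENT LOGGING;   -- the loaded data is written to the log in chunks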

WITH FILE NAME LOGGING is the default in Versions 11 through 17, and it is the name given to the way LOAD TABLE worked in all the earlier versions:

  • It records the file specification of the input file in the transaction log file, not the individual row INSERT statements.

  • It is also the fast method because it doesn't have to write millions of records to the log, and

  • it makes the log file much smaller.
The bad news is that
  • SQL Remote stops working if you use LOAD TABLE WITH FILE NAME LOGGING on any table involved in replication,

  • MobiLink stops working if you do the same thing to any remote table involved in synchronization, and

  • any database restart recovery that needs to apply transaction log data must have access to the original input file in its original location.
If you don't use SQL Remote or MobiLink then you don't care about the first two problems, and if you do use them you already know what the problems with LOAD TABLE are.

The third problem (LOAD TABLE versus restart recovery) is less well understood.

For example, here's what an ordinary LOAD TABLE FROM 'file' statement
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
looks like in the SQL Anywhere 17 transaction log:
--CHECKPOINT-0000-05896335817-2019-01-04 08:32:43.559
--CONNECT-1004-05896335806-dba-2019-01-04 08:32
--ROLLBACK-1004-05896335851
ROLLBACK WORK
go
--CHECKPOINT-0000-05896335867-2019-01-04 08:38:26.067
--CONNECT-1007-05896335856-dba-2019-01-04 08:38
--BEGIN TRANSACTION-1007-05896335901
BEGIN TRANSACTION
go
--BEGIN LOAD TABLE-1007-05896335904: load into table "dba"."ttt" using file 'C:\\TEMP\\ttt.TXT'
--END LOAD TABLE-1007-05896335966
--ROLLBACK-1007-05896335969
ROLLBACK WORK
go
--CHECKPOINT-0000-05896335972-2019-01-04 08:39:05.645
--BEGIN TRANSACTION-1007-05896336006
BEGIN TRANSACTION
go
--BEGIN LOAD TABLE-1007-05896336009: load into table "dba"."ttt" using file 'C:\\DATA\\TEST\\ttt.TXT'
--SQL-1007-05896336077
begin
  set temporary option "conversion_error" = 'On';
  set temporary option "string_rtruncation" = 'On';
  set temporary option "date_order" = 'YMD';
  set temporary option "nearest_century" = '50';
  set temporary option "default_timestamp_increment" = '1';
  set temporary option "time_zone_adjustment" = '-300';
  load into table "dba"."ttt" using file 'C:\\DATA\\TEST\\ttt.TXT' encoding 'windows-1252';
  set temporary option "conversion_error" = ;
  set temporary option "string_rtruncation" = ;
  set temporary option "date_order" = ;
  set temporary option "nearest_century" = ;
  set temporary option "default_timestamp_increment" = ;
  set temporary option "time_zone_adjustment" = ;
end
go
--END LOAD TABLE-1007-05896336793
--COMMIT-1007-05896336796-2019-01-04 08:41:03.742
COMMIT WORK
go
--START CHECKPOINT-0000-05896336807-2019-01-04 08:41:05.581
--FINISH CHECKPOINT-0000-05896336832-2019-01-04 08:41:26.851
--ROLLBACK-1007-05896336848
ROLLBACK WORK
go
--CHECKPOINT-0000-05896336853-2019-01-04 08:45:55.716
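An excerpt like this is what you get by translating the transaction log with the dbtran utility; for example, with a hypothetical output file name:
dbtran 190104AB.LOG translated.sql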
Let's say you do these four things (steps 1 through 3 are sketched in SQL after the list):
  1. You make a full backup of your database,

  2. you run the LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' as shown above,

  3. you back up the log file, and then

  4. you delete the data file C:\DATA\TEST\ttt.TXT.
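In SQL, steps 1 through 3 might look something like this sketch; the backup directory names are hypothetical, and the dbbackup utility would work just as well:
-- Step 1: full backup (database file plus transaction log) to a hypothetical folder
BACKUP DATABASE DIRECTORY 'C:\\BACKUP\\FULL';

-- Step 2: the LOAD TABLE, logged by file name only (the default)
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';

-- Step 3: back up just the transaction log to a second hypothetical folder
BACKUP DATABASE DIRECTORY 'C:\\BACKUP\\LOGS' TRANSACTION LOG ONLY;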
Then, disaster strikes: You lose your current database.

Here's how you might try to recover your database:
  1. You copy your backup database file to replace the current database, and then

  2. you run dbsrv17 -ad to apply the backup log file to the restored database file.
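With hypothetical paths, that command line might look something like this, where -ad points at the folder holding the backed-up log:
dbsrv17 "C:\RESTORE\ddd17.db" -ad "C:\BACKUP\LOGS"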
Sadly, the recovery fails because the server Cannot access file 'C:\DATA\TEST\ttt.TXT':
I. 01/04 09:18:32. Starting database "ddd17" (C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, 
   but\ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:32. Performance warning: Database file "C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, 
   but\ddd17.db" consists of 213 disk fragments
I. 01/04 09:18:32. Database recovery in progress
I. 01/04 09:18:32.     Last checkpoint at Fri Jan 04 2019 08:29
I. 01/04 09:18:32.     Checkpoint log...
I. 01/04 09:18:33.     Transaction log: C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\190104AB.LOG...
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33.     Transaction log: C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\ddd17.log...
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34.     Parallel recovery enabled
I. 01/04 09:18:34. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
E. 01/04 09:18:34. *** ERROR *** Assertion failed: 100948 (17.0.9.4882)
Failed to redo a database operation (id=4, page_no=0x1, offset=0x128) - Error: Cannot access file 'C:\DATA\TEST\ttt.TXT' 
   -- The system cannot find the file specified.
Here are three alternatives to avoid that problem:
  1. Take a full backup right after each LOAD TABLE WITH FILE NAME LOGGING statement,

  2. preserve all the input files until they are no longer needed for recovery... which is after you eventually do take a full backup, or

  3. use LOAD TABLE WITH ROW LOGGING instead (or WITH CONTENT LOGGING).
Alternative 3's probably the easiest if you have a lot of LOAD TABLE files and they're not very big; just run with ROW LOGGING and throw the files away after they're loaded.

If you regularly load millions of rows, however, you might need the speed of Alternatives 1 and 2. Here's how FILE NAME LOGGING and ROW LOGGING compare:
              LOAD TABLE              LOAD TABLE 
              WITH FILE NAME LOGGING  WITH ROW LOGGING
              ======================  ================ 
Elapsed time  118 seconds             214 seconds      - much slower
Log growth    0 pages                 481,295 pages    - huge growth
Here are the test scripts:

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
BEGIN
DECLARE @start_at TIMESTAMP;
DECLARE @seconds  BIGINT;
DECLARE @pages    BIGINT;
DECLARE @growth   BIGINT;

-- Test A: LOAD TABLE

SET @start_at = CURRENT TIMESTAMP;
SET @pages    = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' );
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test A starting; @@VERSION = ', @@VERSION ) TO CONSOLE;
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
SET @seconds = DATEDIFF ( SECOND, @start_at, CURRENT TIMESTAMP );
SET @growth  = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) - @pages;
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test A seconds = ', @seconds, ', log pages growth = ', @growth ) TO CONSOLE;
END;

I. 01/04 08:38:26. 2019-01-04 08:38:26.035 Test A starting; @@VERSION = 17.0.9.4882
I. 01/04 08:38:26. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:38
I. 01/04 08:38:26. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:38
I. 01/04 08:39:05. 2019-01-04 08:39:05.598 Test A starting; @@VERSION = 17.0.9.4882
I. 01/04 08:39:05. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:39
I. 01/04 08:39:05. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:39
I. 01/04 08:41:03. 2019-01-04 08:41:03.742 Test A seconds = 118, log pages growth = 0
I. 01/04 08:41:05. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:41
I. 01/04 08:41:05. Cache size adjusted to 2270460K
I. 01/04 08:41:10. Cache size adjusted to 2256544K
I. 01/04 08:41:26. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:41

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;
BEGIN
DECLARE @start_at TIMESTAMP;
DECLARE @seconds  BIGINT;
DECLARE @pages    BIGINT;
DECLARE @growth   BIGINT;

-- Test B: LOAD TABLE WITH ROW LOGGING

SET @start_at = CURRENT TIMESTAMP;
SET @pages    = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' );
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test B starting; @@VERSION = ', @@VERSION ) TO CONSOLE;
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;
SET @seconds = DATEDIFF ( SECOND, @start_at, CURRENT TIMESTAMP );
SET @growth  = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) - @pages;
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test B seconds = ', @seconds, ', log pages growth = ', @growth ) TO CONSOLE;
END;

I. 01/04 09:41:29. 2019-01-04 09:41:29.472 Test B starting; @@VERSION = 17.0.9.4882
I. 01/04 09:41:29. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:41
I. 01/04 09:41:29. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:41
I. 01/04 09:45:03. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:45
I. 01/04 09:45:03. 2019-01-04 09:45:03.724 Test B seconds = 214, log pages growth = 481295
I. 01/04 09:45:04. Cache size adjusted to 2267588K
I. 01/04 09:45:19. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:45
I. 01/04 09:46:04. Cache size adjusted to 2255732K



9 comments:

Justin said...

That's interesting (and useful). Common sense would suggest that it **should** be faster, I wonder why it's not? However it's reassuring to know that the **much** safer option of having the data in the transaction log is not going to slow things down.

Breck Carter said...

@Justin: My uneducated guess is that over the years the process that writes log data has been optimized, perhaps by using parallelism so the main LOAD task is not affected at all. Since HA is a heavy user of log data, the logging process might have had a LOT of attention :)

John Smirnios said...

Can you post the two statements that you are comparing for these tests? For most LOAD TABLE statements, 'WITH FILE NAME LOGGING' is the default so adding that clause doesn't normally change anything. If you are doing a LOAD TABLE from a named pipe, file name logging shouldn't even be allowed.

Breck Carter said...

@John: Yes, you are correct (and the Help is wrong): WITH FILE NAME LOGGING has always been the default, since before the LOGGING clause existed.

That means the test proved nothing... but the updated test (see blog for code) shows the premise is the same: FILE NAME LOGGING runs no faster than ROW LOGGING.

As for named pipes LOAD TABLE really did run without error with WITH FILE NAME LOGGING... just a lot slower. Alas, I didn't keep track of log growth.

John Smirnios said...

Curious. It seems unlikely that writing almost half a million pages would cost nothing. All the other work for inserting rows into the database remains the same regardless of the logging option. Maybe you are CPU bound and IO is very fast. Are you on an SSD? Single spindle? RAID?

I can't imagine what is going on with the named pipes example...

It might also be a good idea to do a checkpoint after each truncate. Then you'd know you are not competing with background IO to clean the cache and the pages that were freed from the table would be available for reuse.

Breck Carter said...

@John: The plot thickens ( see Update 2 above :)

John Smirnios said...

Your first test is growing the cache during the load which means it will be forced to wait for IO (since we only resize periodically). Try running your server with, say, -c 3g.

Growing the cache also suggests that you are starting with a small database file and the first statement is taking the hit for extending the database. I suggest you pre-grow the dbspace.

Alternatively, just do one load first (of either variety) and throw away the results.

Anonymous said...

FWIW, Breck, is the current version different from the original blog article? I remember having read about Update 1 and 2, and now it looks way more coherent...:)

Volker

Breck Carter said...

@Volker: The current blog post is a complete rewrite because the original conclusion (observation, measurement) was 100% incorrect :)