Monday, December 31, 2018

WITH FILE NAME LOGGING runs faster, but...

The LOAD TABLE statement has been a feature of SQL Anywhere since the early days, and the Version 5.5 Help was enthusiastic about it: "LOAD TABLE is more efficient than the ISQL statement INPUT and can be called from any client application."

The official enthusiasm had waned a bit by Version 6: "LOAD TABLE is intended solely for fast loading of large amounts of data. It is not intended for routine use in applications."

The Dark Side of LOAD TABLE was introduced in Version 7: "Inserts are not recorded in the log file. Thus, the inserted rows may not be recovered in the event of a crash. In addition, the LOAD TABLE statement should never be used in databases involved in SQL Remote replication or databases used as MobiLink clients because these technologies replicate changes through analysis of the log file."

Versions 8, 9 and 10 were positively shrill on the subject: "Inserts are not recorded in the log file, raising the risk that data will be lost in the event of a crash and making this statement unusable with SQL Remote or with MobiLink remote databases."

Version 11, however, dropped the scary rhetoric and introduced three new LOAD TABLE clauses: WITH FILE NAME LOGGING, WITH ROW LOGGING and WITH CONTENT LOGGING.
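
Here's the shape of all three, as a sketch only (the table and file names are hypothetical, and the one-line descriptions are paraphrased):

LOAD TABLE t FROM 'C:\\data\\t.txt' WITH FILE NAME LOGGING; -- the default: only the file specification is logged
LOAD TABLE t FROM 'C:\\data\\t.txt' WITH ROW LOGGING;       -- every inserted row is written to the log
LOAD TABLE t FROM 'C:\\data\\t.txt' WITH CONTENT LOGGING;   -- the raw input data is written to the log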

WITH FILE NAME LOGGING is the default in Versions 11 through 17, and it is the name given to the way LOAD TABLE worked in all the earlier versions:

  • It records the file specification of the input file in the transaction log file, not the individual row INSERT statements.

  • It is also the fast method because it doesn't have to write millions of records to the log, and

  • it makes the log file much smaller.

The bad news is that
  • SQL Remote stops working if you use LOAD TABLE WITH FILE NAME LOGGING on any table involved in replication,

  • MobiLink stops working if you do the same thing to any remote table involved in synchronization, and

  • any database restart recovery that needs to apply transaction log data must have access to the original input file in its original location.

If you don't use SQL Remote or MobiLink then you don't care about the first two problems, and if you do use them you already know what the problems with LOAD TABLE are.

The third problem (LOAD TABLE versus restart recovery) is less well understood.
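
By the way, log excerpts like the ones below come from running the dbtran utility against the transaction log; a minimal sketch, using this demo's file names (the -a option includes rolled-back transactions, which is why ROLLBACK WORK shows up in the output):

dbtran -a ddd17.log ddd17_log.sql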

For example, here's what an ordinary LOAD TABLE FROM 'file' statement
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
looks like in the SQL Anywhere 17 transaction log:
--CHECKPOINT-0000-05896335817-2019-01-04 08:32:43.559
--CONNECT-1004-05896335806-dba-2019-01-04 08:32
--ROLLBACK-1004-05896335851
ROLLBACK WORK
go
--CHECKPOINT-0000-05896335867-2019-01-04 08:38:26.067
--CONNECT-1007-05896335856-dba-2019-01-04 08:38
--BEGIN TRANSACTION-1007-05896335901
BEGIN TRANSACTION
go
--BEGIN LOAD TABLE-1007-05896335904: load into table "dba"."ttt" using file 'C:\\TEMP\\ttt.TXT'
--END LOAD TABLE-1007-05896335966
--ROLLBACK-1007-05896335969
ROLLBACK WORK
go
--CHECKPOINT-0000-05896335972-2019-01-04 08:39:05.645
--BEGIN TRANSACTION-1007-05896336006
BEGIN TRANSACTION
go
--BEGIN LOAD TABLE-1007-05896336009: load into table "dba"."ttt" using file 'C:\\DATA\\TEST\\ttt.TXT'
--SQL-1007-05896336077
begin
  set temporary option "conversion_error" = 'On';
  set temporary option "string_rtruncation" = 'On';
  set temporary option "date_order" = 'YMD';
  set temporary option "nearest_century" = '50';
  set temporary option "default_timestamp_increment" = '1';
  set temporary option "time_zone_adjustment" = '-300';
  load into table "dba"."ttt" using file 'C:\\DATA\\TEST\\ttt.TXT' encoding 'windows-1252';
  set temporary option "conversion_error" = ;
  set temporary option "string_rtruncation" = ;
  set temporary option "date_order" = ;
  set temporary option "nearest_century" = ;
  set temporary option "default_timestamp_increment" = ;
  set temporary option "time_zone_adjustment" = ;
end
go
--END LOAD TABLE-1007-05896336793
--COMMIT-1007-05896336796-2019-01-04 08:41:03.742
COMMIT WORK
go
--START CHECKPOINT-0000-05896336807-2019-01-04 08:41:05.581
--FINISH CHECKPOINT-0000-05896336832-2019-01-04 08:41:26.851
--ROLLBACK-1007-05896336848
ROLLBACK WORK
go
--CHECKPOINT-0000-05896336853-2019-01-04 08:45:55.716
Note that the log records the input file specification, not the rows themselves, so any recovery that replays this log has to re-read the original file. Let's say you do these four things (steps 1 and 3 are sketched as commands after the list):
  1. You make a full backup of your database,

  2. you run the LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' as shown above,

  3. you back up the log file, and then

  4. you delete the data file C:\DATA\TEST\ttt.TXT.
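In command-line terms, steps 1 and 3 might look something like this; a sketch only, with a hypothetical connection string and backup folder:

REM Step 1: full backup of the database file and the transaction log
dbbackup -c "ENG=ddd17;UID=dba;PWD=sql" C:\BACKUP

REM Step 3: back up the transaction log only
dbbackup -t -c "ENG=ddd17;UID=dba;PWD=sql" C:\BACKUP
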
Then, disaster strikes: You lose your current database.

Here's how you might try to recover your database (sketched as commands after the list):
  1. You copy your backup database file to replace the current database, and then

  2. you run dbsrv17 -ad to apply the backup log file to the restored database file.
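As a sketch, with hypothetical folder names (-ad points at the folder holding the backed-up log):

REM Step 1: restore the backup copy of the database file
copy C:\BACKUP\ddd17.db C:\RUN\ddd17.db

REM Step 2: apply the backed-up log, then shut down
dbsrv17 C:\RUN\ddd17.db -ad C:\BACKUP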
Sadly, that doesn't work, because the server cannot access file 'C:\DATA\TEST\ttt.TXT':
I. 01/04 09:18:32. Starting database "ddd17" (C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, 
   but\ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:32. Performance warning: Database file "C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, 
   but\ddd17.db" consists of 213 disk fragments
I. 01/04 09:18:32. Database recovery in progress
I. 01/04 09:18:32.     Last checkpoint at Fri Jan 04 2019 08:29
I. 01/04 09:18:32.     Checkpoint log...
I. 01/04 09:18:33.     Transaction log: C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\190104AB.LOG...
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33.     Transaction log: C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\ddd17.log...
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34.     Parallel recovery enabled
I. 01/04 09:18:34. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
E. 01/04 09:18:34. *** ERROR *** Assertion failed: 100948 (17.0.9.4882)
Failed to redo a database operation (id=4, page_no=0x1, offset=0x128) - Error: Cannot access file 'C:\DATA\TEST\ttt.TXT' 
   -- The system cannot find the file specified.
Here are three alternatives to avoid that problem:
  1. Take a full backup right after each LOAD TABLE WITH FILE NAME LOGGING statement,

  2. preserve all the input files until they are no longer needed for recovery... which is after you eventually do take a full backup, or

  3. use LOAD TABLE WITH ROW LOGGING instead (or WITH CONTENT LOGGING).
Alternative 3's probably the easiest if you have a lot of LOAD TABLE files and they're not very big; just run with ROW LOGGING and throw the files away after they're loaded.
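
If you do stay with the fast default, Alternative 1 is easy to script; here's a minimal sketch using the server-side BACKUP statement (the backup folder is hypothetical):

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';  -- fast default: WITH FILE NAME LOGGING
BACKUP DATABASE DIRECTORY 'C:\\BACKUP';         -- after this, recovery no longer needs ttt.TXT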

If you regularly load millions of rows, however, you might need the speed of Alternatives 1 and 2. Here's how FILE NAME LOGGING and ROW LOGGING compare:
              LOAD TABLE              LOAD TABLE 
              WITH FILE NAME LOGGING  WITH ROW LOGGING
              ======================  ================ 
Elapsed time  118 seconds             214 seconds      - much slower
Log growth    0 pages                 481,295 pages    - huge growth
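To put 481,295 pages in perspective, multiply by the page size; here's a sketch, assuming the log growth is counted in database pages (with the default 4K page size that works out to roughly 1.8 GB of extra log):

SELECT 481295 * CAST ( DB_PROPERTY ( 'PageSize' ) AS BIGINT ) AS log_growth_in_bytes;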
Here are the test scripts:

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
BEGIN
DECLARE @start_at TIMESTAMP;
DECLARE @seconds  BIGINT;
DECLARE @pages    BIGINT;
DECLARE @growth   BIGINT;

-- Test A: LOAD TABLE

SET @start_at = CURRENT TIMESTAMP;
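-- DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) returns the transaction log size in pages.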
SET @pages    = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' );
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test A starting; @@VERSION = ', @@VERSION ) TO CONSOLE;
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
SET @seconds = DATEDIFF ( SECOND, @start_at, CURRENT TIMESTAMP );
SET @growth  = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) - @pages;
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test A seconds = ', @seconds, ', log pages growth = ', @growth ) TO CONSOLE;
END;

I. 01/04 08:38:26. 2019-01-04 08:38:26.035 Test A starting; @@VERSION = 17.0.9.4882
I. 01/04 08:38:26. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:38
I. 01/04 08:38:26. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:38
I. 01/04 08:39:05. 2019-01-04 08:39:05.598 Test A starting; @@VERSION = 17.0.9.4882
I. 01/04 08:39:05. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:39
I. 01/04 08:39:05. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:39
I. 01/04 08:41:03. 2019-01-04 08:41:03.742 Test A seconds = 118, log pages growth = 0
I. 01/04 08:41:05. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:41
I. 01/04 08:41:05. Cache size adjusted to 2270460K
I. 01/04 08:41:10. Cache size adjusted to 2256544K
I. 01/04 08:41:26. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:41

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;
BEGIN
DECLARE @start_at TIMESTAMP;
DECLARE @seconds  BIGINT;
DECLARE @pages    BIGINT;
DECLARE @growth   BIGINT;

-- Test B: LOAD TABLE WITH ROW LOGGING

SET @start_at = CURRENT TIMESTAMP;
SET @pages    = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' );
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test B starting; @@VERSION = ', @@VERSION ) TO CONSOLE;
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;
SET @seconds = DATEDIFF ( SECOND, @start_at, CURRENT TIMESTAMP );
SET @growth  = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) - @pages;
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test B seconds = ', @seconds, ', log pages growth = ', @growth ) TO CONSOLE;
END;

I. 01/04 09:41:29. 2019-01-04 09:41:29.472 Test B starting; @@VERSION = 17.0.9.4882
I. 01/04 09:41:29. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:41
I. 01/04 09:41:29. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:41
I. 01/04 09:45:03. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:45
I. 01/04 09:45:03. 2019-01-04 09:45:03.724 Test B seconds = 214, log pages growth = 481295
I. 01/04 09:45:04. Cache size adjusted to 2267588K
I. 01/04 09:45:19. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:45
I. 01/04 09:46:04. Cache size adjusted to 2255732K