The LOAD TABLE statement has been a feature of SQL Anywhere since the early days, and the Version 5.5 Help was enthusiastic about it: "LOAD TABLE is more efficient than the ISQL statement INPUT and can be called from any client application."
The official enthusiasm had waned a bit by Version 6: "LOAD TABLE is intended solely for fast loading of large amounts of data. It is not intended for routine use in applications."
The Dark Side of LOAD TABLE was introduced in Version 7: "Inserts are not recorded in the log file. Thus, the inserted rows may not be recovered in the event of a crash. In addition, the LOAD TABLE statement should never be used in databases involved in SQL Remote replication or databases used as MobiLink clients because these technologies replicate changes through analysis of the log file."
Versions 8, 9 and 10 were positively shrill on the subject: "Inserts are not recorded in the log file, raising the risk that data will be lost in the event of a crash and making this statement unusable with SQL Remote or with MobiLink remote databases."
Version 11, however, dropped the scary rhetoric and introduced three new LOAD TABLE clauses: WITH FILE NAME LOGGING, WITH ROW LOGGING and WITH CONTENT LOGGING.
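Syntactically, the logging clause just goes on the end of the LOAD TABLE statement. Here's a minimal sketch of the three variations, reusing the ttt table and input file from the tests later in this article; the comments describe, roughly speaking, what each clause writes to the transaction log:

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH FILE NAME LOGGING; -- the default: only the input file specification is logged
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;       -- each loaded row is logged as an INSERT
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH CONTENT LOGGING;   -- the loaded row content is logged in chunks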
WITH FILE NAME LOGGING is the default in Versions 11 through 17, and it is the name given to the way LOAD TABLE worked in all the earlier versions:
- It records the file specification of the input file in the transaction log file, not the individual row INSERT statements.
- It is also the fast method because it doesn't have to write millions of records to the log, and
- it makes the log file much smaller.
- On the other hand, SQL Remote stops working if you use LOAD TABLE WITH FILE NAME LOGGING on any table involved in replication,
- MobiLink stops working if you do the same thing to any remote table involved in synchronization, and
- any database restart recovery that needs to apply transaction log data must have access to the original input file in its original location.
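An easy way to see exactly what a LOAD TABLE wrote to the transaction log is to translate the log with the dbtran utility. Here's a minimal sketch, assuming you point it at a backed-up log file like the 190104AB.LOG that shows up later in this article and want the translated SQL in a file of your choosing:

dbtran 190104AB.LOG 190104AB.sql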
The third problem (LOAD TABLE versus restart recovery) is less well understood.
For example, here's what an ordinary LOAD TABLE FROM 'file' statement looks like in the SQL Anywhere 17 transaction log:
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
--CHECKPOINT-0000-05896335817-2019-01-04 08:32:43.559
--CONNECT-1004-05896335806-dba-2019-01-04 08:32
--ROLLBACK-1004-05896335851
ROLLBACK WORK
go
--CHECKPOINT-0000-05896335867-2019-01-04 08:38:26.067
--CONNECT-1007-05896335856-dba-2019-01-04 08:38
--BEGIN TRANSACTION-1007-05896335901
BEGIN TRANSACTION
go
--BEGIN LOAD TABLE-1007-05896335904: load into table "dba"."ttt" using file 'C:\\TEMP\\ttt.TXT'
--END LOAD TABLE-1007-05896335966
--ROLLBACK-1007-05896335969
ROLLBACK WORK
go
--CHECKPOINT-0000-05896335972-2019-01-04 08:39:05.645
--BEGIN TRANSACTION-1007-05896336006
BEGIN TRANSACTION
go
--BEGIN LOAD TABLE-1007-05896336009: load into table "dba"."ttt" using file 'C:\\DATA\\TEST\\ttt.TXT'
--SQL-1007-05896336077
begin
set temporary option "conversion_error" = 'On';
set temporary option "string_rtruncation" = 'On';
set temporary option "date_order" = 'YMD';
set temporary option "nearest_century" = '50';
set temporary option "default_timestamp_increment" = '1';
set temporary option "time_zone_adjustment" = '-300';
load into table "dba"."ttt" using file 'C:\\DATA\\TEST\\ttt.TXT' encoding 'windows-1252';
set temporary option "conversion_error" = ;
set temporary option "string_rtruncation" = ;
set temporary option "date_order" = ;
set temporary option "nearest_century" = ;
set temporary option "default_timestamp_increment" = ;
set temporary option "time_zone_adjustment" = ;
end
go
--END LOAD TABLE-1007-05896336793
--COMMIT-1007-05896336796-2019-01-04 08:41:03.742
COMMIT WORK
go
--START CHECKPOINT-0000-05896336807-2019-01-04 08:41:05.581
--FINISH CHECKPOINT-0000-05896336832-2019-01-04 08:41:26.851
--ROLLBACK-1007-05896336848
ROLLBACK WORK
go
--CHECKPOINT-0000-05896336853-2019-01-04 08:45:55.716

Let's say you do these four things (sketched in SQL after the list):
- You make a full backup of your database,
- you run the LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' as shown above,
- you back up the log file, and then
- you delete the data file C:\DATA\TEST\ttt.TXT.
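Here's a rough SQL sketch of those four steps; the C:\BACKUP folder name is made up, and the BACKUP DATABASE statements stand in for whatever backup method you actually use:

BACKUP DATABASE DIRECTORY 'C:\\BACKUP';        -- 1. full backup of the database and transaction log files

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT'; -- 2. the load; only the file specification goes into the log

BACKUP DATABASE DIRECTORY 'C:\\BACKUP'
   TRANSACTION LOG ONLY;                       -- 3. back up the log file

-- 4. delete C:\DATA\TEST\ttt.TXT (from Windows Explorer or the command line, not from SQL)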
Here's how you might try to recover your database (see the sketch after this list):
- You copy your backup database file to replace the current database, and then
- you run dbsrv17 -ad to apply the backup log file to the restored database file.
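For example, a minimal sketch of that recovery attempt from a Windows command prompt, assuming made-up C:\BACKUP and C:\RESTORE folders and the ddd17.db database from the tests below:

copy C:\BACKUP\ddd17.db C:\RESTORE\ddd17.db
dbsrv17 "C:\RESTORE\ddd17.db" -ad "C:\BACKUP"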
Instead of recovering, the server fails an assertion because the original input file is gone:

I. 01/04 09:18:32. Starting database "ddd17" (C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:32. Performance warning: Database file "C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\ddd17.db" consists of 213 disk fragments
I. 01/04 09:18:32. Database recovery in progress
I. 01/04 09:18:32. Last checkpoint at Fri Jan 04 2019 08:29
I. 01/04 09:18:32. Checkpoint log...
I. 01/04 09:18:33. Transaction log: C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\190104AB.LOG...
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:33. Transaction log: C:\$ web\blog\20181231 WITH FILE NAME LOGGING runs faster, but\ddd17.log...
I. 01/04 09:18:33. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Parallel recovery enabled
I. 01/04 09:18:34. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
I. 01/04 09:18:34. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:18
E. 01/04 09:18:34. *** ERROR *** Assertion failed: 100948 (17.0.9.4882) Failed to redo a database operation (id=4, page_no=0x1, offset=0x128) - Error: Cannot access file 'C:\DATA\TEST\ttt.TXT' -- The system cannot find the file specified.

Here are three alternatives to avoid that problem:
1. Take a full backup right after each LOAD TABLE WITH FILE NAME LOGGING statement,
2. preserve all the input files, in their original locations, until they are no longer needed for recovery... which is after your next full backup, or
3. use LOAD TABLE WITH ROW LOGGING instead (or WITH CONTENT LOGGING).
If you regularly load millions of rows, however, you might need the speed of Alternatives 1 and 2. Here's how FILE NAME LOGGING and ROW LOGGING compare:
                 LOAD TABLE                   LOAD TABLE
                 WITH FILE NAME LOGGING       WITH ROW LOGGING
                 ======================       ================
Elapsed time     118 seconds                  214 seconds   - much slower
Log growth       0 pages                      481,295 pages - huge growth

Here are the test scripts:
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';
BEGIN
DECLARE @start_at TIMESTAMP;
DECLARE @seconds  BIGINT;
DECLARE @pages    BIGINT;
DECLARE @growth   BIGINT;

-- Test A: LOAD TABLE

SET @start_at = CURRENT TIMESTAMP;
SET @pages = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' );
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test A starting; @@VERSION = ', @@VERSION ) TO CONSOLE;

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT';

SET @seconds = DATEDIFF ( SECOND, @start_at, CURRENT TIMESTAMP );
SET @growth = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) - @pages;
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test A seconds = ', @seconds, ', log pages growth = ', @growth ) TO CONSOLE;
END;

I. 01/04 08:38:26. 2019-01-04 08:38:26.035 Test A starting; @@VERSION = 17.0.9.4882
I. 01/04 08:38:26. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:38
I. 01/04 08:38:26. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:38
I. 01/04 08:39:05. 2019-01-04 08:39:05.598 Test A starting; @@VERSION = 17.0.9.4882
I. 01/04 08:39:05. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:39
I. 01/04 08:39:05. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:39
I. 01/04 08:41:03. 2019-01-04 08:41:03.742 Test A seconds = 118, log pages growth = 0
I. 01/04 08:41:05. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:41
I. 01/04 08:41:05. Cache size adjusted to 2270460K
I. 01/04 08:41:10. Cache size adjusted to 2256544K
I. 01/04 08:41:26. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 08:41
LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;
BEGIN
DECLARE @start_at TIMESTAMP;
DECLARE @seconds  BIGINT;
DECLARE @pages    BIGINT;
DECLARE @growth   BIGINT;

-- Test B: LOAD TABLE WITH ROW LOGGING

SET @start_at = CURRENT TIMESTAMP;
SET @pages = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' );
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test B starting; @@VERSION = ', @@VERSION ) TO CONSOLE;

LOAD TABLE ttt FROM 'C:\\DATA\\TEST\\ttt.TXT' WITH ROW LOGGING;

SET @seconds = DATEDIFF ( SECOND, @start_at, CURRENT TIMESTAMP );
SET @growth = DB_EXTENDED_PROPERTY ( 'FileSize', 'translog' ) - @pages;
MESSAGE STRING ( CURRENT TIMESTAMP, ' Test B seconds = ', @seconds, ', log pages growth = ', @growth ) TO CONSOLE;
END;

I. 01/04 09:41:29. 2019-01-04 09:41:29.472 Test B starting; @@VERSION = 17.0.9.4882
I. 01/04 09:41:29. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:41
I. 01/04 09:41:29. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:41
I. 01/04 09:45:03. Starting checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:45
I. 01/04 09:45:03. 2019-01-04 09:45:03.724 Test B seconds = 214, log pages growth = 481295
I. 01/04 09:45:04. Cache size adjusted to 2267588K
I. 01/04 09:45:19. Finished checkpoint of "ddd17" (ddd17.db) at Fri Jan 04 2019 09:45
I. 01/04 09:46:04. Cache size adjusted to 2255732K
9 comments:
That's interesting (and useful). Common sense would suggest that it **should** be faster; I wonder why it's not. However, it's reassuring to know that the **much** safer option of having the data in the transaction log is not going to slow things down.
@Justin: My uneducated guess is that over the years the process that writes log data has been optimized, perhaps by using parallelism so the main LOAD task is not affected at all. Since HA is a heavy user of log data, the logging process might have had a LOT of attention :)
Can you post the two statements that you are comparing for these tests? For most LOAD TABLE statements, 'WITH FILE NAME LOGGING' is the default so adding that clause doesn't normally change anything. If you are doing a LOAD TABLE from a named pipe, file name logging shouldn't even be allowed.
@John: Yes, you are correct (and the Help is wrong): WITH FILE NAME LOGGING has always been the default, since before the LOGGING clause existed.
That means the test proved nothing... but the updated test (see blog for code) shows the premise is the same: FILE NAME LOGGING runs no faster than ROW LOGGING.
As for named pipes, LOAD TABLE really did run without error with WITH FILE NAME LOGGING... just a lot slower. Alas, I didn't keep track of log growth.
Curious. It seems unlikely that writing almost half a million pages would cost nothing. All the other work for inserting rows into the database remains the same regardless of the logging option. Maybe you are CPU bound and IO is very fast. Are you on an SSD? Single spindle? RAID?
I can't imagine what is going on with the named pipes example...
It might also be a good idea to do a checkpoint after each truncate. Then you'd know you are not competing with background IO to clean the cache and the pages that were freed from the table would be available for reuse.
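For example, a minimal sketch of that housekeeping between test runs, using the ttt table from the tests:

TRUNCATE TABLE ttt; -- empty the table before the next timed LOAD TABLE
CHECKPOINT;         -- flush dirty pages now, so the next test isn't competing with background I/O to clean the cache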
@John: The plot thickens (see Update 2 above :)
Your first test is growing the cache during the load which means it will be forced to wait for IO (since we only resize periodically). Try running your server with, say, -c 3g.
Growing the cache also suggests that you are starting with a small database file and the first statement is taking the hit for extending the database. I suggest you pre-grow the dbspace.
Alternatively, just do one load first (of either variety) and throw away the results.
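For example, a minimal sketch of those two suggestions, assuming the ddd17.db database from the tests (the 3g and 2 GB figures are just illustrative): start the server with a large initial cache, e.g.

dbsrv17 -c 3g ddd17.db

and pre-grow the main dbspace before the first timed load:

ALTER DBSPACE system ADD 2 GB; -- pre-grow the database file so the first load isn't paying to extend it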
FWIW, Breck, is the current version different from the original blog article? I remember having read about Update 1 and 2, and now it looks way more coherent...:)
Volker
@Volker: The current blog post is a complete rewrite because the original conclusion (observation, measurement) was 100% incorrect :)