Question: How do I save space by eliminating duplicate values from a string column while still being able to SELECT those values whenever necessary?
Answer: One method is to store a single copy of each different string value in a separate table, and replace the string column in the original table with a pointer to the new table.
Sounds simple, right?
Well, it's not too bad, given some of the language features built in to SQL Anywhere. Here's an example of the original table before any changes are made, plus a simple query:
CREATE TABLE t (
   pkey   BIGINT NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   blob   LONG VARCHAR NULL );

SELECT pkey, blob
  FROM t;

The blob column can vary wildly in length, from NULL, to the empty string '', to a few characters, to megabytes in size.
Here's the new table, designed to hold t.blob values that are moved into tblob.blob:
CREATE TABLE tblob (
   blob_key            INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   hash_value          VARCHAR ( 64 ) NOT NULL,
   collision_counter   INTEGER NOT NULL,
   reference_count     BIGINT NOT NULL,
   blob                LONG VARCHAR NOT NULL,
   CONSTRAINT ukey UNIQUE ( hash_value, collision_counter ) );
- The blob_key column is an artificial primary key 1, 2, 3, ... for the new table; this is where the "pointer to the new table" comes from.
- The hash_value column is filled by applying SQL Anywhere's HASH() function to the long string in t.blob to create a much shorter, (almost?) unique stand-in for t.blob. In other words, if two HASH() values are different then the two original strings are definitely different, and if two HASH() values are the same then the two original strings are (almost certainly) the same.
- The collision_counter column counts the number of times 0, 1, 2, ... that the same HASH() value was returned for different t.blob values. This value has remained zero during all the testing done for this article, but... who knows?
- The reference_count column counts the number of times that separate copies of the same t.blob value were replaced by one row in tblob. This column is nice to have for testing, and it may even be needed in your application, but it does increase the overhead.
- The blob column is where the single copy goes, and it's NOT NULL because by the time anything is stored in tblob it's known to be longer than the threshold (e.g., 100 bytes).
- The UNIQUE constraint is a performance optimization to help with the process that inserts new rows in tblob.
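As a quick sanity check on the hashing idea, HASH() can be run directly; the string literals below are made up for illustration:

```sql
-- Identical strings always produce identical hash values...
SELECT HASH ( 'Hello, world!', 'SHA256' ) AS hash_1,
       HASH ( 'Hello, world!', 'SHA256' ) AS hash_2;

-- ...and different strings (almost certainly) produce different ones;
-- here the two literals differ by a single character.
SELECT HASH ( 'Hello, world!', 'SHA256' ) AS hash_1,
       HASH ( 'Hello, world?', 'SHA256' ) AS hash_2;
```

The first query returns two identical 64-character strings; the second returns two completely different ones.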
Here's the new version of t, where the blob_key column points to tblob, plus a query that puts the original result set back together:

CREATE TABLE t (
   pkey       BIGINT NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   blob_key   INTEGER NULL REFERENCES tblob ( blob_key ),
   blob       LONG VARCHAR NULL );

SELECT t.pkey,
       COALESCE ( t.blob, tblob.blob ) AS blob
  FROM t
       LEFT OUTER JOIN tblob
       ON t.blob_key = tblob.blob_key;
- The LEFT OUTER JOIN matches each row in t with the corresponding row in tblob, if one exists. If no such row in tblob exists, that means the original blob value was too short to be moved, or it was NULL in the first place.
- The COALESCE function call returns t.blob if it isn't NULL; otherwise it returns tblob.blob if there was a matching row in tblob; if neither is true, it returns NULL.
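If you'd rather not repeat the LEFT OUTER JOIN in every query, a view can hide it; the view name v_t is just an example, not part of the original design:

```sql
CREATE VIEW v_t AS
SELECT t.pkey,
       COALESCE ( t.blob, tblob.blob ) AS blob
  FROM t
       LEFT OUTER JOIN tblob
       ON t.blob_key = tblob.blob_key;

-- Applications can then select from v_t exactly as they
-- used to select from the original single-table t.
SELECT pkey, blob
  FROM v_t;
```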
Here's a sample of code that shows how to INSERT rows into t and tblob, using as test data the blob column from another table called x:
BEGIN
DECLARE @match_found VARCHAR ( 1 );
DECLARE @hash_value VARCHAR ( 64 );
DECLARE @max_existing_collision_counter INTEGER;
DECLARE @blob_key INTEGER;

outer_fetch_loop:
FOR f_outer AS c_outer NO SCROLL CURSOR FOR
SELECT x.blob AS @blob
  FROM x
   FOR READ ONLY
DO

   IF LENGTH ( COALESCE ( @blob, '' ) ) > 100 THEN

      SET @match_found = 'N';

      SET @hash_value = HASH ( @blob, 'SHA256' );
      SET @max_existing_collision_counter = -1;

      inner_fetch_loop:
      FOR f_inner AS c_inner NO SCROLL CURSOR FOR
      SELECT tblob.blob_key          AS @existing_blob_key,
             tblob.collision_counter AS @existing_collision_counter,
             tblob.blob              AS @existing_blob
        FROM tblob
       WHERE tblob.hash_value = @hash_value
       ORDER BY tblob.collision_counter
         FOR READ ONLY
      DO
         IF @blob = @existing_blob THEN
            SET @match_found = 'Y';
            SET @blob_key = @existing_blob_key;
            UPDATE tblob
               SET tblob.reference_count = tblob.reference_count + 1
             WHERE tblob.blob_key = @existing_blob_key;
            LEAVE inner_fetch_loop;
         ELSE
            SET @max_existing_collision_counter = @existing_collision_counter;
         END IF;

      END FOR inner_fetch_loop;
      IF @match_found = 'N' THEN
         INSERT tblob (
            hash_value,
            collision_counter,
            reference_count,
            blob )
         VALUES (
            @hash_value,
            @max_existing_collision_counter + 1,
            1,
            @blob );
         SET @blob_key = @@IDENTITY;
      END IF;

      INSERT t ( blob_key, blob ) VALUES ( @blob_key, NULL );

   ELSE

      INSERT t ( blob_key, blob ) VALUES ( NULL, @blob );

   END IF;
   COMMIT;

END FOR outer_fetch_loop;
COMMIT;
END;
- The IF LENGTH > 100 on line 14 checks whether the benefits of moving t.blob into tblob are worth the overhead. If not, the INSERT down on line 61 leaves the blob in t and sets the pointer to NULL.
- The HASH() function call on line 18 uses the SHA256 algorithm, the best that SQL Anywhere has to offer. HASH() returns VARCHAR ( 64 ) values like b9afc180bd8fa250a006229ba9c8b8eddba0b96dfcb8320740af84a4c485f5ea... in other words "trust me, it's unique".
- The FOR statement starting on line 21 looks through all the rows in tblob with matching HASH() values. In the real world, there will (probably) be zero or one rows processed by this FOR loop, and if one row is returned the IF @blob = @existing_blob on line 31 will be TRUE.
- The IF THEN on lines 31 through 37 handles the case where not only does the HASH() value match but so does the blob value, and it saves the tblob.blob_key value in @blob_key for later use, before using the LEAVE statement on line 37 to skip out of the inner FOR loop and resume processing at line 43.
- The ELSE starting on line 38 handles a collision, where the HASH() value matches but not the blob value, by bumping up the value of @max_existing_collision_counter (which was initialized to -1 back on line 19).
- The IF THEN on lines 43 through 55 handles the case where no existing row in tblob matched both the HASH() value and the blob value, so a new row in tblob must be inserted.
- The INSERT statement on line 57 fills in the pointer (t.blob_key) while setting t.blob to NULL. This is different from the INSERT on line 61 which sets the pointer to NULL and puts the original blob value into t.blob.
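Those two INSERT statements guarantee that at most one of t.blob_key and t.blob is ever non-NULL; if you want the database to enforce that rule as well, a CHECK constraint can do it. This is a sketch, and the constraint name is invented:

```sql
-- Either the pointer or the inline blob must be NULL; both are NULL
-- when the original blob value was NULL in the first place.
ALTER TABLE t
   ADD CONSTRAINT one_blob_location CHECK (
      blob_key IS NULL OR blob IS NULL );
```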
Here's a query that shows how much space was saved by moving duplicate blob values into tblob, together with its result set:

SELECT tblob.reference_count,
       LENGTH ( tblob.blob ) AS length_blob,
       length_blob * ( reference_count - 1 ) AS bytes_saved
  FROM tblob
 ORDER BY bytes_saved DESC,
       length_blob DESC;

reference_count  length_blob  bytes_saved
            101       16,080    1,608,000
          5,801          135      783,000
             87        7,447      640,442
          1,036          575      595,125
            478          962      458,874
          2,270          135      306,315
             53        4,440      230,880
            166          654      107,910
            136          591       79,785
            415          135       55,890
             78          654       50,358
            362          136       49,096
             75          608       44,992
            313          135       42,120
             71          366       25,620
             37          571       20,556
            100          162       16,038
             29          437       12,236
             23          535       11,770
             68          135        9,045
              3        4,343        8,686
             13          644        7,728
             17          366        5,856
              9          577        4,616
             15          314        4,396
              6          805        4,025
              9          474        3,792
              6          566        2,830
              5          605        2,420
              7          357        2,142
              7          355        2,130
              4          581        1,743
             10          174        1,566
              4          451        1,353
              3          584        1,168
              9          119          952
              6          183          915
              3          345          690
              3          328          656
              3          323          646
              2          639          639
              3          317          634
              4          125          375
              2          358          358
              2          303          303
              2          188          188
              2          187          187
              2          136          136
              2          107          107
              1          798            0
              1          340            0
              1          179            0
              1          141            0
              1          110            0

Some parting thoughts...
- 100 bytes might not be a good threshold. There's a lot of overhead here, and it might not be worth it for strings that short.
- The whole idea might not be worth it... it depends on your data, and you'll never know until you try it, but if you do go to the trouble of implementing it, don't fall in love with it until you've done some objective performance tests.
- Unless you really need tblob.reference_count, the UPDATE on line 34 is pure waste.
- The tblob.collision_counter column might also be a waste since it is (almost) always zero. An alternative approach might be to just skip moving the blob into tblob if a collision occurs; i.e., just leave it in t.blob, and don't bother implementing tblob.collision_counter at all.
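Under that alternative approach, the schema gets simpler; this is a sketch of what tblob might look like if collisions are handled by simply leaving the colliding value in t.blob:

```sql
-- No collision_counter: each hash value appears at most once, and a
-- blob whose hash collides with an existing, different blob is left
-- in t.blob instead of being moved here.
CREATE TABLE tblob (
   blob_key     INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   hash_value   VARCHAR ( 64 ) NOT NULL,
   blob         LONG VARCHAR NOT NULL,
   CONSTRAINT ukey UNIQUE ( hash_value ) );
```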
4 comments:
Interesting approach!
FWIW, your statement on table t's new definition "In other words, only one of t.blob_key and t.blob can be non-NULL. Both columns can be NULL, however, which is what happens when the original blob value was NULL." translates to a helpful (and not too complex) CHECK constraint, doesn't it?
Regards
Volker
FWIW, the database server itself does something similar - under certain circumstances - called "BLOB sharing", cf. this doc page:
http://dcx.sybase.com/index.html#1201/en/dbadmin/design-s-3180714.html
Regards
Volker
@Volker: OK, you've convinced me. It's time for me to read the docs from cover to cover, which I will do when 16 comes out. The last time I did it was for (gasp!) Version 6, so there are sure to be some surprises... apparently it's too easy to miss stuff when reading the "What's New" sections.
@Breck: Well, I surely haven't read them cover to cover since v5.5, either, but now and then I remember stuff I've read before - let's assume the "indexing" in my brainware is still working :)
The doc topic came to my mind when I thought about your approach - and how one could "hide" this blob sharing from SQL users: whereas one could easily use a trigger to automate the storage of shared blobs, the selecting would be harder to "hide" - you would have to use a view to mask the additional tblob table and the corresponding left join, whereas direct table access would reveal the implementation details...
When further thinking about that, I thought a server-side solution would be much smarter - and here we are:)
However, note that the documented behaviour (which I have read about but not tested!) applies to several columns in the same table, not to several rows, and it requires the second BLOB value to be assigned directly. So I would guess the following would not lead to BLOB sharing, as there is no direct assignment:
insert t ( blob ) values ( xp_read_file ( 'MyBlob.txt' ) );
insert t ( blob ) values ( xp_read_file ( 'MyBlob.txt' ) );
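By contrast, a direct column-to-column assignment - the case the docs describe - might look like this; the table and column names here are invented for illustration:

```sql
-- Two blob columns in the same table.
CREATE TABLE t2 (
   pkey     BIGINT NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   blob_1   LONG VARCHAR NULL,
   blob_2   LONG VARCHAR NULL );

-- A direct assignment between the two columns, which (per the doc
-- page above) is the kind of statement that may qualify for the
-- server's built-in BLOB sharing.
UPDATE t2 SET blob_2 = blob_1;
```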
Volker
Post a Comment