Question: How do I save space by eliminating duplicate values from a string column while still being able to SELECT those values whenever necessary?
Answer: One method is to store a single copy of each different string value in a separate table, and replace the string column in the original table with a pointer to the new table.
Sounds simple, right?
Well, it's not
too bad, given some of the language features built in to SQL Anywhere. Here's an example of the original table before any changes are made, plus a simple query:
CREATE TABLE t (
pkey BIGINT NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
blob LONG VARCHAR NULL );
SELECT pkey, blob
FROM t;
The blob column can vary wildly in length, from NULL, to the empty string '', to a few characters, to megabytes in size.
Here's the new table, designed to hold t.blob values that are moved into tblob.blob:
CREATE TABLE tblob (
blob_key INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
hash_value VARCHAR ( 64 ) NOT NULL,
collision_counter INTEGER NOT NULL,
reference_count BIGINT NOT NULL,
blob LONG VARCHAR NOT NULL,
CONSTRAINT ukey UNIQUE ( hash_value, collision_counter ) );
- The blob_key column is an artificial primary key 1, 2, 3, ... for the new table; this is where the "pointer to the new table" comes from.
- The hash_value column is filled by applying SQL Anywhere's HASH() function to the long string in t.blob to create a much shorter but (almost?) perfect alternate version of t.blob. In other words, if two HASH() values are the same, then the two original strings are (probably) the same, and vice versa: if two HASH() values are different then the original strings are (probably) different.
- The collision_counter column counts the number of times 0, 1, 2, ... that the same HASH() value was returned for different t.blob values. This value has remained zero during all the testing done for this article, but... who knows?
- The reference_count column counts the number of times that separate copies of the same t.blob value were replaced by one row in tblob. This column is nice to have for testing, and it may even be needed in your application, but it does increase the overhead.
- The blob column is where the single copy goes, and it's NOT NULL because by the time anything is stored in tblob it's known to be longer than the threshold (e.g., 100 bytes).
- The UNIQUE constraint is a performance optimization to help with the process that inserts new rows in tblob.
Here's what the original table and query look like now, with the t.blob value moved to tblob.blob for some rows and not others:
CREATE TABLE t (
pkey BIGINT NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
blob_key INTEGER NULL REFERENCES tblob ( blob_key ),
blob LONG VARCHAR NULL );
SELECT t.pkey, COALESCE ( t.blob, tblob.blob ) AS blob
FROM t LEFT OUTER JOIN tblob
ON t.blob_key = tblob.blob_key;
- The LEFT OUTER JOIN matches each row in t with the corresponding row in tblob, if one exists. If no such row in tblob exists, that means the original blob value was too short to be moved, or it was NULL in the first place.
- The COALESCE function call returns t.blob if it isn't NULL,
- otherwise it returns tblob.blob if there was a matching row in tblob, or
- if neither is true, it returns NULL.
In other words, only one of t.blob_key and t.blob can be non-NULL. Both columns can be NULL, however, which is what happens when the original blob value was NULL.
Here's a sample of code that shows how to INSERT rows into t and tblob, using as test data the blob column from another table called x:
BEGIN
DECLARE @match_found VARCHAR ( 1 );
DECLARE @hash_value VARCHAR ( 64 );
DECLARE @max_existing_collision_counter INTEGER;
DECLARE @blob_key INTEGER;
outer_fetch_loop: FOR f_outer AS c_outer NO SCROLL CURSOR FOR
SELECT x.blob AS @blob
FROM x
FOR READ ONLY
DO
IF LENGTH ( COALESCE ( @blob, '' ) ) > 100 THEN
SET @match_found = 'N';
SET @hash_value = HASH ( @blob, 'SHA256' );
SET @max_existing_collision_counter = -1;
inner_fetch_loop: FOR f_inner AS c_inner NO SCROLL CURSOR FOR
SELECT tblob.blob_key AS @existing_blob_key,
tblob.collision_counter AS @existing_collision_counter,
tblob.blob AS @existing_blob
FROM tblob
WHERE tblob.hash_value = @hash_value
ORDER BY tblob.collision_counter
FOR READ ONLY
DO
IF @blob = @existing_blob THEN
SET @match_found = 'Y';
SET @blob_key = @existing_blob_key;
UPDATE tblob
SET tblob.reference_count = tblob.reference_count + 1
WHERE tblob.blob_key = @existing_blob_key;
LEAVE inner_fetch_loop;
ELSE
SET @max_existing_collision_counter = @existing_collision_counter;
END IF;
END FOR inner_fetch_loop;
IF @match_found = 'N' THEN
INSERT tblob (
hash_value,
collision_counter,
reference_count,
blob )
VALUES (
@hash_value,
@max_existing_collision_counter + 1,
1,
@blob );
SET @blob_key = @@IDENTITY;
END IF;
INSERT t ( blob_key, blob ) VALUES ( @blob_key, NULL );
ELSE
INSERT t ( blob_key, blob ) VALUES ( NULL, @blob );
END IF;
COMMIT;
END FOR outer_fetch_loop;
COMMIT;
END;
- The IF LENGTH > 100 on line 14 checks to see if the benefits of moving t.blob into tblob is worth the overhead. If not, the INSERT down on line 61 leaves the blob in t and sets the pointer to NULL.
- The HASH() function call on line 18 uses the SHA256 algorithm, the best that SQL Anywhere has to offer. HASH() returns VARCHAR ( 64 ) values like b9afc180bd8fa250a006229ba9c8b8eddba0b96dfcb8320740af84a4c485f5ea... in other words "trust me, it's unique".
- The FOR statement starting on line 21 looks through all the rows in tblob with matching HASH() values. In the real world, there will (probably) be zero or one rows processed by this FOR loop, and if one row is returned the IF @blob = @existing_blob on line 31 will be TRUE.
- The IF THEN on lines 31 through 37 handles the case where not only does the HASH() value match but so does the blob value, and it saves the tblob.blob_key value in @blob_key for later use, before using the LEAVE statement on line 37 to skip out of the inner FOR loop and resume processing at line 43.
- The ELSE starting on line 38 handles a collision, where the HASH() value matches but not the blob value, by bumping up the value of @max_existing_collision_counter (which was initialized to -1 back on line 19).
- The IF THEN on lines 43 through 55 handles the case where the HASH() value didn't match, so a new row in tblob must be inserted.
- The INSERT statement on line 57 fills in the pointer (t.blob_key) while setting t.blob to NULL. This is different from the INSERT on line 61 which sets the pointer to NULL and puts the original blob value into t.blob.
Here are some numbers...
SELECT tblob.reference_count,
LENGTH ( tblob.blob ) AS length_blob,
length_blob * ( reference_count - 1 ) AS bytes_saved
FROM tblob
ORDER BY bytes_saved DESC,
length_blob DESC;
reference_count length_blob bytes_saved
101 16,080 1,608,000
5,801 135 783,000
87 7,447 640,442
1,036 575 595,125
478 962 458,874
2,270 135 306,315
53 4,440 230,880
166 654 107,910
136 591 79,785
415 135 55,890
78 654 50,358
362 136 49,096
75 608 44,992
313 135 42,120
71 366 25,620
37 571 20,556
100 162 16,038
29 437 12,236
23 535 11,770
68 135 9,045
3 4,343 8,686
13 644 7,728
17 366 5,856
9 577 4,616
15 314 4,396
6 805 4,025
9 474 3,792
6 566 2,830
5 605 2,420
7 357 2,142
7 355 2,130
4 581 1,743
10 174 1,566
4 451 1,353
3 584 1,168
9 119 952
6 183 915
3 345 690
3 328 656
3 323 646
2 639 639
3 317 634
4 125 375
2 358 358
2 303 303
2 188 188
2 187 187
2 136 136
2 107 107
1 798 0
1 340 0
1 179 0
1 141 0
1 110 0
Some parting thoughts...
- 100 bytes might not be a good threshold. There's a lot of overhead here, and it might not be worth it for strings that short.
- The whole idea might not be worth it... it depends on your data, and you'll never know until you try it, but if you do go to the trouble of implementing it, don't fall in love with it until you've done some objective performance tests.
- Unless you really need tblob.reference_count, the UPDATE on line 34 is pure waste.
- The tblob.collision_counter column might also be a waste since it is (almost) always zero. An alternative approach might be to just skip moving the blob into tblob if a collision occurs; i.e., just leave it in t.blob, and don't bother implementing tblob.collision_counter at all.