Friday, August 10, 2012

Example: String De-Duplication

Question: How do I save space by eliminating duplicate values from a string column while still being able to SELECT those values whenever necessary?

Answer: One method is to store a single copy of each different string value in a separate table, and replace the string column in the original table with a pointer to the new table.

Sounds simple, right?

Well, it's not too bad, given some of the language features built in to SQL Anywhere. Here's an example of the original table before any changes are made, plus a simple query:
   blob                LONG VARCHAR NULL );

SELECT pkey, blob
  FROM t;
The blob column can vary wildly in length, from NULL, to the empty string '', to a few characters, to megabytes in size.

Here's the new table, designed to hold t.blob values that are moved into tblob.blob:
   hash_value          VARCHAR ( 64 ) NOT NULL,
   collision_counter   INTEGER NOT NULL,
   reference_count     BIGINT NOT NULL,
   blob                LONG VARCHAR NOT NULL,
   CONSTRAINT ukey UNIQUE ( hash_value, collision_counter ) );
  • The blob_key column is an artificial primary key 1, 2, 3, ... for the new table; this is where the "pointer to the new table" comes from.

  • The hash_value column is filled by applying SQL Anywhere's HASH() function to the long string in t.blob to create a much shorter but (almost?) perfect alternate version of t.blob. In other words, if two HASH() values are the same, then the two original strings are (probably) the same, and vice versa: if two HASH() values are different then the original strings are (probably) different.

  • The collision_counter column counts the number of times 0, 1, 2, ... that the same HASH() value was returned for different t.blob values. This value has remained zero during all the testing done for this article, but... who knows?

  • The reference_count column counts the number of times that separate copies of the same t.blob value were replaced by one row in tblob. This column is nice to have for testing, and it may even be needed in your application, but it does increase the overhead.

  • The blob column is where the single copy goes, and it's NOT NULL because by the time anything is stored in tblob it's known to be longer than the threshold (e.g., 100 bytes).

  • The UNIQUE constraint is a performance optimization to help with the process that inserts new rows in tblob.
Here's what the original table and query look like now, with the t.blob value moved to tblob.blob for some rows and not others:
   blob_key            INTEGER NULL REFERENCES tblob ( blob_key ), 
   blob                LONG VARCHAR NULL );

SELECT t.pkey, COALESCE ( t.blob, tblob.blob ) AS blob
          ON t.blob_key = tblob.blob_key;
  • The LEFT OUTER JOIN matches each row in t with the corresponding row in tblob, if one exists. If no such row in tblob exists, that means the original blob value was too short to be moved, or it was NULL in the first place.

  • The COALESCE function call returns t.blob if it isn't NULL,

  • otherwise it returns tblob.blob if there was a matching row in tblob, or

  • if neither is true, it returns NULL.
In other words, only one of t.blob_key and t.blob can be non-NULL. Both columns can be NULL, however, which is what happens when the original blob value was NULL.

Here's a sample of code that shows how to INSERT rows into t and tblob, using as test data the blob column from another table called x:

DECLARE @match_found                       VARCHAR ( 1 );
DECLARE @hash_value                        VARCHAR ( 64 );
DECLARE @max_existing_collision_counter    INTEGER;
DECLARE @blob_key                          INTEGER;

outer_fetch_loop: FOR f_outer AS c_outer NO SCROLL CURSOR FOR
SELECT x.blob AS @blob
  FROM x

   IF LENGTH ( COALESCE ( @blob, '' ) ) > 100 THEN

      SET @match_found = 'N';

      SET @hash_value = HASH ( @blob, 'SHA256' );
      SET @max_existing_collision_counter = -1; 

      inner_fetch_loop: FOR f_inner AS c_inner NO SCROLL CURSOR FOR
      SELECT tblob.blob_key          AS @existing_blob_key,
             tblob.collision_counter AS @existing_collision_counter,
             tblob.blob              AS @existing_blob
        FROM tblob
       WHERE tblob.hash_value = @hash_value
       ORDER BY tblob.collision_counter

         IF @blob = @existing_blob THEN
            SET @match_found = 'Y';
            SET @blob_key = @existing_blob_key;
            UPDATE tblob
               SET tblob.reference_count = tblob.reference_count + 1 
             WHERE tblob.blob_key = @existing_blob_key;
            LEAVE inner_fetch_loop;
            SET @max_existing_collision_counter = @existing_collision_counter;
         END IF;
      END FOR inner_fetch_loop;

      IF @match_found = 'N' THEN
         INSERT tblob (
            blob )
         VALUES ( 
            @max_existing_collision_counter + 1, 
            @blob );
         SET @blob_key = @@IDENTITY;
      END IF;

      INSERT t ( blob_key, blob ) VALUES ( @blob_key, NULL );


      INSERT t ( blob_key, blob ) VALUES ( NULL, @blob );

   END IF;


END FOR outer_fetch_loop;


  • The IF LENGTH > 100 on line 14 checks to see if the benefits of moving t.blob into tblob is worth the overhead. If not, the INSERT down on line 61 leaves the blob in t and sets the pointer to NULL.

  • The HASH() function call on line 18 uses the SHA256 algorithm, the best that SQL Anywhere has to offer. HASH() returns VARCHAR ( 64 ) values like b9afc180bd8fa250a006229ba9c8b8eddba0b96dfcb8320740af84a4c485f5ea... in other words "trust me, it's unique".

  • The FOR statement starting on line 21 looks through all the rows in tblob with matching HASH() values. In the real world, there will (probably) be zero or one rows processed by this FOR loop, and if one row is returned the IF @blob = @existing_blob on line 31 will be TRUE.

  • The IF THEN on lines 31 through 37 handles the case where not only does the HASH() value match but so does the blob value, and it saves the tblob.blob_key value in @blob_key for later use, before using the LEAVE statement on line 37 to skip out of the inner FOR loop and resume processing at line 43.

  • The ELSE starting on line 38 handles a collision, where the HASH() value matches but not the blob value, by bumping up the value of @max_existing_collision_counter (which was initialized to -1 back on line 19).

  • The IF THEN on lines 43 through 55 handles the case where the HASH() value didn't match, so a new row in tblob must be inserted.

  • The INSERT statement on line 57 fills in the pointer (t.blob_key) while setting t.blob to NULL. This is different from the INSERT on line 61 which sets the pointer to NULL and puts the original blob value into t.blob.
Here are some numbers...
SELECT tblob.reference_count,
       LENGTH ( tblob.blob ) AS length_blob,
       length_blob * ( reference_count - 1 ) AS bytes_saved
  FROM tblob
 ORDER BY bytes_saved DESC,
       length_blob DESC;

reference_count  length_blob  bytes_saved
            101       16,080    1,608,000
          5,801          135      783,000
             87        7,447      640,442
          1,036          575      595,125
            478          962      458,874
          2,270          135      306,315
             53        4,440      230,880
            166          654      107,910
            136          591       79,785
            415          135       55,890
             78          654       50,358
            362          136       49,096
             75          608       44,992
            313          135       42,120
             71          366       25,620
             37          571       20,556
            100          162       16,038
             29          437       12,236
             23          535       11,770
             68          135        9,045
              3        4,343        8,686
             13          644        7,728
             17          366        5,856
              9          577        4,616
             15          314        4,396
              6          805        4,025
              9          474        3,792
              6          566        2,830
              5          605        2,420
              7          357        2,142
              7          355        2,130
              4          581        1,743
             10          174        1,566
              4          451        1,353
              3          584        1,168
              9          119          952
              6          183          915
              3          345          690
              3          328          656
              3          323          646
              2          639          639
              3          317          634
              4          125          375
              2          358          358
              2          303          303
              2          188          188
              2          187          187
              2          136          136
              2          107          107
              1          798            0
              1          340            0
              1          179            0
              1          141            0
              1          110            0
Some parting thoughts...
  • 100 bytes might not be a good threshold. There's a lot of overhead here, and it might not be worth it for strings that short.

  • The whole idea might not be worth it... it depends on your data, and you'll never know until you try it, but if you do go to the trouble of implementing it, don't fall in love with it until you've done some objective performance tests.

  • Unless you really need tblob.reference_count, the UPDATE on line 34 is pure waste.

  • The tblob.collision_counter column might also be a waste since it is (almost) always zero. An alternative approach might be to just skip moving the blob into tblob if a collision occurs; i.e., just leave it in t.blob, and don't bother implementing tblob.collision_counter at all.


Anonymous said...

Interesting approach!

FWIW, your statement on table t's new definition "In other words, only one of t.blob_key and t.blob can be non-NULL. Both columns can be NULL, however, which is what happens when the original blob value was NULL." translates to a helpful (and not to complex) CHECK constraint, doesn't it?


Anonymous said...

FWIW, the database server itself does somthing similar - under certain circumstances - as "BLOB sharing", cf. this doc page:"


Breck Carter said...

@Volker: OK, you've convinced me. It's time for me to read the docs from cover to cover, which I will do when 16 comes out. The last time I did it was for (gasp!) Version 6, so there are sure to be some surprises... apparently it's too easy to miss stuff when reading the "What's New" sections.

Anonymous said...

@Breck: Well, I surely don't have read from cover to cover after v5.5, either, but now and then I remember stuff I've read before - let's assume the "indexing" in my brainware seems working:)

The doc topic came to my mind when I thought about your approach - and how one could "hide" this blob sharing from SQL users: Whereas one could easily use a trigger to automate the storage of shared blobs, the selecting would be harder to "hide" - you would have to use a view to mask the additional tblob table and the according left join whereas a direct table access would reveal the implementation details...

When further thinking about that, I thought a server-side solution would be much smarter - and here we are:)

However, note, that the documented behaviour (I have read but not tested!) tells about several columns from the same table, not about several rows, and it requires the second BLOB value to be assigned directly. So I would guess the following would not lead to BLOB sharing, as there is no direct assignment:

insert t(blob) xp_read('MyBlob.txt);
insert t(blob) xp_read('MyBlob.txt);