Tuesday, March 3, 2020

Inside Foxhound: String De-Duplication

String de-duplication is the replacement of multiple duplicate string values with a single copy to save space.

Foxhound 5 uses custom code to implement de-duplication of two columns in the database, and this sometimes results in the database file being 50% smaller than in Foxhound 4.

This article shows

  • how existing Foxhound code was changed to implement string de-duplication, and

  • how concurrency problems were handled to prevent performance problems.
CREATE TABLE Changes
SELECT Changes
INSERT Changes
DELETE Changes
Final Thoughts: Is This For You?


CREATE TABLE Changes

The strings of interest are the Last Statement and Last Plan Text fields on the Foxhound Monitor and Sample History pages:
In Foxhound 4 they're stored in this table:
CREATE TABLE DBA.rroad_group_2_property_pivot (  
   sampling_id                                     UNSIGNED INT NOT NULL,
   sample_set_number               /* PK      X */ UNSIGNED BIGINT NOT NULL,
   connection_number               /* PK        */ BIGINT NOT NULL,
   ...
   LastStatement                                   LONG VARCHAR NULL,
   LastPlanText                                    LONG VARCHAR NULL DEFAULT '',
   ...
   PRIMARY KEY (
      sample_set_number,
      connection_number ) );
In Foxhound 5 they've been replaced by two unsigned integers:
CREATE TABLE dba.rroad_group_2_property_pivot (  
   sampling_id                                         UNSIGNED INT NOT NULL,
   sample_set_number                   /* PK      X */ UNSIGNED BIGINT NOT NULL,
   connection_number                   /* PK        */ BIGINT NOT NULL,
   ...
   LastStatement_id                                    UNSIGNED INT NOT NULL DEFAULT 0,
   LastPlanText_id                                     UNSIGNED INT NOT NULL DEFAULT 0,
   ...
   PRIMARY KEY (
      sample_set_number,
      connection_number ) );
The LastStatement_id and LastPlanText_id columns are effectively foreign key columns pointing to a new table (rroad_long_varchar) where the actual LONG VARCHAR values are stored:
CREATE TABLE dba.rroad_long_varchar ( 
   long_varchar_id   /* PK        */ UNSIGNED INT NOT NULL DEFAULT autoincrement,
   reference_count                   BIGINT NOT NULL,
   long_varchar      /*       U   */ LONG VARCHAR NOT NULL,
   PRIMARY KEY ( long_varchar_id ) );

CREATE UNIQUE INDEX ux_long_varchar ON dba.rroad_long_varchar ( long_varchar );
The UNIQUE INDEX has two purposes: first, it guarantees there are no duplicates, and second, it supports the SELECT WHERE rroad_long_varchar.long_varchar = @new_long_varchar clause used in a later section.

SELECT Changes

In Foxhound 5 the adhoc view sample_connection has been modified to return the same result set as in Foxhound 4. In other words, even though the underlying rroad_group_2_property_pivot table no longer contains LastStatement and LastPlanText, those columns do appear in the view.

Here's how:

In Foxhound 4 the view is a simple "SELECT everything from the base table"
create VIEW sample_connection AS 
SELECT * 
  FROM rroad_group_2_property_pivot;
whereas Foxhound 5 uses joins to turn LastStatement_id and LastPlanText_id into the original LastStatement and LastPlanText:
create VIEW sample_connection AS 
SELECT rroad_group_2_property_pivot.sampling_id,
       rroad_group_2_property_pivot.sample_set_number, -- PRIMARY KEY
       rroad_group_2_property_pivot.connection_number, -- PRIMARY KEY
       ...
       rroad_LastStatement.long_varchar   AS LastStatement,
       rroad_LastPlanText.long_varchar    AS LastPlanText,
       ...
  FROM rroad_group_2_property_pivot
       INNER JOIN ( SELECT * 
                      FROM rroad_long_varchar 
                  ) AS rroad_LastStatement
               ON rroad_LastStatement.long_varchar_id = rroad_group_2_property_pivot.LastStatement_id
       INNER JOIN ( SELECT * 
                      FROM rroad_long_varchar 
                  ) AS rroad_LastPlanText
               ON rroad_LastPlanText.long_varchar_id = rroad_group_2_property_pivot.LastPlanText_id
Foxhound itself uses the same joins to display LastStatement and LastPlanText on the Monitor and History pages.

INSERT Changes

When the Foxhound sampling process inserts a connection sample, it calls a function to convert the SQL Anywhere LastStatement and LastPlanText properties into unsigned integers:
INSERT rroad_group_2_property_pivot (
       sampling_id, 
       sample_set_number,
       connection_number,
       ...
       LastStatement_id,
       LastPlanText_id,
       ...
SELECT @sampling_id,
       @sample_set_number,
       connection_number,
       ...
       rroad_get_long_varchar_id ( LastStatement ), 
       rroad_get_long_varchar_id ( LastPlanText ),
       ...
  FROM ...
Here's that function:
CREATE FUNCTION rroad_get_long_varchar_id ( 
   IN @long_varchar        LONG VARCHAR,
   IN @is_recursive_call   VARCHAR ( 1 ) DEFAULT 'N' )
   RETURNS UNSIGNED INTEGER
   NOT DETERMINISTIC
BEGIN

DECLARE @new_long_varchar              LONG VARCHAR;
DECLARE @existing_long_varchar_id      UNSIGNED INTEGER;
DECLARE @existing_long_varchar         LONG VARCHAR;
DECLARE @sqlstate                      VARCHAR ( 5 );
DECLARE @errormsg                      VARCHAR ( 32767 );
DECLARE @sqlcode                       INTEGER;
DECLARE @diagnostic_location           VARCHAR ( 20 );
DECLARE @v_existing_long_varchar_id    VARCHAR ( 128 );

SET @new_long_varchar = TRIM ( COALESCE ( @long_varchar, '' ) );

IF @new_long_varchar = '' THEN
   RETURN 0;
ELSE
The following section checks if the LONG VARCHAR value has already been stored in the rroad_long_varchar table.
   SET @existing_long_varchar_id = NULL;

   SELECT rroad_long_varchar.long_varchar_id 
     INTO @existing_long_varchar_id
     FROM rroad_long_varchar
    WHERE rroad_long_varchar.long_varchar = @new_long_varchar;

   IF @existing_long_varchar_id IS NOT NULL THEN
At this point @existing_long_varchar_id is ready to be returned to the caller because it points to the LONG VARCHAR value that already exists in rroad_long_varchar...

...but first, some bookkeeping is necessary.

The Foxhound purge process must know when it's safe to delete old LONG VARCHAR values, and it does this by checking that rroad_long_varchar.reference_count is zero.

The following code increments the reference_count by 1 each time a new reference is encountered, or rather, it attempts to increment the reference count:
      BEGIN -- handle exception

         SET TEMPORARY OPTION BLOCKING = 'OFF';

         UPDATE rroad_long_varchar
            SET reference_count = reference_count + 1
          WHERE long_varchar_id = @existing_long_varchar_id;

         SET TEMPORARY OPTION BLOCKING = 'ON';
It is possible that multiple almost-simultaneous calls to this function can reach this point with same LONG VARCHAR value; i.e., Foxhound can sample up to 100 target databases in parallel, and it's possible for the same LastStatement and LastPlanText values could be received from multiple targets.

When that happens, the above UPDATE can fail with SQLCODE -210 Row locked, -306 Deadlock detected, or -307 All threads are blocked.

Rather than have these collisions cause waits or rollbacks, BLOCKING = 'OFF' is used to force all collisions to raise exceptions (even row locks), and those exceptions are all handled by the next section of code which defers the incrementing of reference_count until later.
      EXCEPTION WHEN OTHERS THEN

         SELECT SQLCODE INTO @sqlcode;

         SET TEMPORARY OPTION BLOCKING = 'ON';

         IF @sqlcode IN ( -210, -306, -307 ) THEN 

            -- Defer the UPDATE until later.

            INSERT rroad_long_varchar_deferred_increment ( long_varchar_id ) 
            VALUES ( @existing_long_varchar_id );
In the grand scheme of things, incrementing reference_count is very low priority; it is much more important for Foxhound sampling to proceed without delay.

Here's the table used by the above INSERT; each row represents one deferred reference_count increment of +1 for one long_varchar_id:
CREATE TABLE rroad_long_varchar_deferred_increment ( 
   increment_id               UNSIGNED INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   long_varchar_id            UNSIGNED INTEGER NOT NULL );
These deferrals are indefinite; the rows inserted in this table aren't needed until the Foxhound purge process is ready to delete unused rows in rroad_long_varchar.
         ELSE

            -- Let the outer block handle the exception.

            RESIGNAL;

         END IF;

      END; -- handle exception

      RETURN @existing_long_varchar_id;
The above RETURN finishes handling a matching row in the rroad_long_varchar table.

The next section handles a new LONG VARCHAR value by inserting a new row in rroad_long_varchar:
   ELSE -- @existing_long_varchar_id IS NULL

      BEGIN -- handle exception

         INSERT rroad_long_varchar (
            long_varchar_id, 
            reference_count,
            long_varchar )
         VALUES ( 
            DEFAULT,
            1,
            @new_long_varchar );
It is also possible that multiple almost-simultaneous calls to this function can reach this point with same LONG VARCHAR value.

When that happens, the first execution of the above INSERT statement will work and later executions will (correctly) fail with SQLCODE -196 Index 'ux_long_varchar' for table 'rroad_long_varchar' would not be unique.

The resulting exceptions are handled in the next section, by using recursive calls to this function to handle duplicate LONG VARCHAR values.

The recursive calls will ( should :) work because they won't ( shouldn't :) reach this section of code.
      EXCEPTION WHEN OTHERS THEN

         SELECT SQLCODE, SQLSTATE, ERRORMSG() 
           INTO @sqlcode, @sqlstate, @errormsg;

         IF @sqlcode = -196 THEN -- Index 'ux_long_varchar' for table 'rroad_long_varchar' would not be unique

            IF @is_recursive_call = 'N' THEN

               -- Try a recursive call, and this time @existing_long_varchar_id IS NOT NULL.

               RETURN rroad_get_long_varchar_id ( @new_long_varchar, 'Y' );

            ELSE

               -- This is already a recursive call, and it is still failing.

               CALL rroad_exception ( STRING ( 
                  @diagnostic_location, '(454eh1) Unexpected error in recursive call to rroad_get_long_varchar_id: ', 
                  ' SQLCODE = ', @sqlcode,  
                  ', SQLSTATE = ', @sqlstate,  
                  ', ERRORMSG() = ', @errormsg ) );

               RESIGNAL;

            END IF;

         ELSE

            -- Let the outer block handle the exception.

            RESIGNAL;

         END IF;

      END; -- handle exception

      RETURN @@IDENTITY;

   END IF;

END IF;

END;

DELETE Changes

In Foxhound 4, deleting old values of LastStatement and LastPlanText was easy: just DELETE the rroad_group_2_property_pivot rows holding those values.

In Foxhound 5, LastStatement and LastPlanText are stored in rroad_long_varchar, and those rows can't be deleted until rroad_long_varchar.reference_count sinks to zero.

That happens when enough rows in rroad_group_2_property_pivot are deleted to fire the following trigger enough times to force the reference_count down to zero:
CREATE TRIGGER trd_rroad_group_2_property_pivot
   BEFORE DELETE ON rroad_group_2_property_pivot
   REFERENCING OLD AS old_rroad_group_2_property_pivot
   FOR EACH ROW
BEGIN

   -- Decrement rroad_long_varchar.reference_count as necessary.

   IF old_rroad_group_2_property_pivot.LastStatement_id > 0 THEN
      UPDATE rroad_long_varchar
         SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count - 1
       WHERE rroad_long_varchar.long_varchar_id = old_rroad_group_2_property_pivot.LastStatement_id;
   END IF;

   IF old_rroad_group_2_property_pivot.LastPlanText_id> 0 THEN
      UPDATE rroad_long_varchar
         SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count - 1 
       WHERE rroad_long_varchar.long_varchar_id = old_rroad_group_2_property_pivot. LastPlanText_id;
   END IF;

END;
Note that it is possible for reference_count to sink below zero when rows are held in the rroad_long_varchar_deferred_increment table.

The following code shows what happens in the Foxhound purge process after old rows in rroad_group_2_property_pivot have been deleted, and it is time to consider deleting old rows in rroad_long_varchar.

The first section applies the deferred increments to the reference_count so the second section won't delete any rows in rroad_long_varchar that are still needed.

Note that rows in the rroad_long_varchar_deferred_increment table are deleted as soon as they are used.
FOR f_fetch_deferred_increment AS c_fetch_deferred_increment INSENSITIVE CURSOR FOR
SELECT rroad_long_varchar_deferred_increment.increment_id    AS @increment_id,
       rroad_long_varchar_deferred_increment.long_varchar_id AS @long_varchar_id
  FROM rroad_long_varchar_deferred_increment
FOR READ ONLY
DO

   -- Apply and delete the deferred increment as an atomic operation.

   BEGIN ATOMIC 

      UPDATE rroad_long_varchar
         SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count + 1
       WHERE rroad_long_varchar.long_varchar_id = @long_varchar_id;

      DELETE rroad_long_varchar_deferred_increment
       WHERE rroad_long_varchar_deferred_increment.increment_id = @increment_id;

   END; 

   COMMIT;
           
END FOR;
The second section takes care of deleting rroad_long_varchar rows that are no longer referenced.

Note that up to 100 Foxhound sampling process could be inserting data while the purge is running, and any one of them could insert new rows in rroad_long_varchar_deferred_increment; that's the reason for the AND NOT EXISTS predicate in the following FOR loop:
FOR f_long_varchar AS c_long_varchar NO SCROLL CURSOR FOR
SELECT rroad_long_varchar.long_varchar_id   AS @long_varchar_id
  FROM rroad_long_varchar
 WHERE rroad_long_varchar.reference_count <= 0
   AND rroad_long_varchar.long_varchar_id > 0 -- don't delete empty row
   AND NOT EXISTS ( SELECT *                  -- don't delete rows that have deferred increments
                      FROM rroad_long_varchar_deferred_increment
                     WHERE rroad_long_varchar_deferred_increment.long_varchar_id 
                         = rroad_long_varchar.long_varchar_id )
 ORDER BY rroad_long_varchar.long_varchar_id
FOR UPDATE
DO

   DELETE rroad_long_varchar
    WHERE CURRENT OF c_long_varchar;

END FOR;

Final Thoughts: Is This For You?

Before implementing string de-duplication in your application, run a test to remove duplicates to see if the savings are worthwhile.

String de-duplication is worthwhile in Foxhound because a very large number of identical LONG VARCHAR strings are stored in the database. In other applications, where there are fewer duplicates and/or shorter strings, the advantage might not be so great, and alternatives like COMPRESS() might be better.

Your application may already be taking advantage of the "Blob Sharing" feature built in to SQL Anywhere even though it "only occurs when you set values of one column to be equal to those of another column".
In other words, blob sharing only works when your application explicitly creates identical copies (UPDATE t column1 = column2) and it has no effect when your application creates new values separately with no reference to previous identical values.
In Foxhound the code which searches for an existing string value uses a brute-force string comparison "WHERE rroad_long_varchar.long_varchar = @new_long_varchar" supported by an index
CREATE UNIQUE INDEX ux_long_varchar ON dba.rroad_long_varchar ( long_varchar );
but that might not perform well when dealing with giant LONG BINARY values. Although it wasn't the case with Foxhound, you may find an alternative method (e.g., comparing HASH() values) will perform better.

The special exception handling might not be necessary for applications where collisions are unlikely; i.e., in other applications it might be OK to block and wait.

String de-duplication didn't add much runtime overhead to Foxhound, but performing de-duplication on a large existing database can take a very long time; e.g., it can add hours to the process of upgrading a Foxhound 4 database to Foxhound 5.