String de-duplication is the replacement of multiple duplicate string values with a single copy to save space.
Foxhound 5 uses custom code to implement de-duplication of two columns in the database, and this sometimes results in the database file being 50% smaller than in Foxhound 4.
This article shows
- how existing Foxhound code was changed to implement string de-duplication, and
- how concurrency problems were handled to prevent performance problems.
CREATE TABLE Changes
SELECT Changes
INSERT Changes
DELETE Changes
Final Thoughts: Is This For You?

CREATE TABLE Changes
The strings of interest are the Last Statement and Last Plan Text fields on the Foxhound Monitor and Sample History pages. In Foxhound 4 they're stored in this table:

CREATE TABLE DBA.rroad_group_2_property_pivot (
   sampling_id                     UNSIGNED INT NOT NULL,
   sample_set_number  /* PK X */   UNSIGNED BIGINT NOT NULL,
   connection_number  /* PK */     BIGINT NOT NULL,
   ...
   LastStatement                   LONG VARCHAR NULL,
   LastPlanText                    LONG VARCHAR NULL DEFAULT '',
   ...
   PRIMARY KEY ( sample_set_number, connection_number ) );

In Foxhound 5 they've been replaced by two unsigned integers:

CREATE TABLE dba.rroad_group_2_property_pivot (
   sampling_id                     UNSIGNED INT NOT NULL,
   sample_set_number  /* PK X */   UNSIGNED BIGINT NOT NULL,
   connection_number  /* PK */     BIGINT NOT NULL,
   ...
   LastStatement_id                UNSIGNED INT NOT NULL DEFAULT 0,
   LastPlanText_id                 UNSIGNED INT NOT NULL DEFAULT 0,
   ...
   PRIMARY KEY ( sample_set_number, connection_number ) );

The LastStatement_id and LastPlanText_id columns are effectively foreign key columns pointing to a new table (rroad_long_varchar) where the actual LONG VARCHAR values are stored:

CREATE TABLE dba.rroad_long_varchar (
   long_varchar_id  /* PK */   UNSIGNED INT NOT NULL DEFAULT autoincrement,
   reference_count             BIGINT NOT NULL,
   long_varchar     /* U */    LONG VARCHAR NOT NULL,
   PRIMARY KEY ( long_varchar_id ) );

CREATE UNIQUE INDEX ux_long_varchar
   ON dba.rroad_long_varchar ( long_varchar );

The UNIQUE INDEX has two purposes: first, it guarantees there are no duplicates, and second, it supports the SELECT WHERE rroad_long_varchar.long_varchar = @new_long_varchar clause used in a later section.
SELECT Changes
In Foxhound 5 the adhoc view sample_connection has been modified to return the same result set as in Foxhound 4. In other words, even though the underlying rroad_group_2_property_pivot table no longer contains LastStatement and LastPlanText, those columns do appear in the view. Here's how:
In Foxhound 4 the view is a simple "SELECT everything from the base table":

create VIEW sample_connection AS
SELECT *
  FROM rroad_group_2_property_pivot;

whereas Foxhound 5 uses joins to turn LastStatement_id and LastPlanText_id into the original LastStatement and LastPlanText:

create VIEW sample_connection AS
SELECT rroad_group_2_property_pivot.sampling_id,
       rroad_group_2_property_pivot.sample_set_number,   -- PRIMARY KEY
       rroad_group_2_property_pivot.connection_number,   -- PRIMARY KEY
       ...
       rroad_LastStatement.long_varchar AS LastStatement,
       rroad_LastPlanText.long_varchar  AS LastPlanText,
       ...
  FROM rroad_group_2_property_pivot
       INNER JOIN ( SELECT * FROM rroad_long_varchar ) AS rroad_LastStatement
               ON rroad_LastStatement.long_varchar_id = rroad_group_2_property_pivot.LastStatement_id
       INNER JOIN ( SELECT * FROM rroad_long_varchar ) AS rroad_LastPlanText
               ON rroad_LastPlanText.long_varchar_id = rroad_group_2_property_pivot.LastPlanText_id

Foxhound itself uses the same joins to display LastStatement and LastPlanText on the Monitor and History pages.
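An ad hoc query against the view can also use the de-duplicated columns as if nothing had changed; here's a sketch (the WHERE values are hypothetical):

-- Sketch only: sampling_id 1 and the LIKE pattern are hypothetical values.
SELECT sample_set_number,
       connection_number,
       LastStatement,
       LastPlanText
  FROM sample_connection
 WHERE sampling_id = 1
   AND LastStatement LIKE '%CHECKPOINT%'
 ORDER BY sample_set_number DESC;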
INSERT Changes
When the Foxhound sampling process inserts a connection sample, it calls a function to convert the SQL Anywhere LastStatement and LastPlanText properties into unsigned integers:

INSERT rroad_group_2_property_pivot (
   sampling_id,
   sample_set_number,
   connection_number,
   ...
   LastStatement_id,
   LastPlanText_id,
   ...
SELECT @sampling_id,
       @sample_set_number,
       connection_number,
       ...
       rroad_get_long_varchar_id ( LastStatement ),
       rroad_get_long_varchar_id ( LastPlanText ),
       ...
  FROM ...

Here's that function:
CREATE FUNCTION rroad_get_long_varchar_id (
   IN @long_varchar        LONG VARCHAR,
   IN @is_recursive_call   VARCHAR ( 1 ) DEFAULT 'N' )
   RETURNS UNSIGNED INTEGER
   NOT DETERMINISTIC
BEGIN
   DECLARE @new_long_varchar             LONG VARCHAR;
   DECLARE @existing_long_varchar_id     UNSIGNED INTEGER;
   DECLARE @existing_long_varchar        LONG VARCHAR;
   DECLARE @sqlstate                     VARCHAR ( 5 );
   DECLARE @errormsg                     VARCHAR ( 32767 );
   DECLARE @sqlcode                      INTEGER;
   DECLARE @diagnostic_location          VARCHAR ( 20 );
   DECLARE @v_existing_long_varchar_id   VARCHAR ( 128 );

   SET @new_long_varchar = TRIM ( COALESCE ( @long_varchar, '' ) );

   IF @new_long_varchar = '' THEN
      RETURN 0;
   ELSE
The following section checks if the LONG VARCHAR value has already been stored in the rroad_long_varchar table:

      SET @existing_long_varchar_id = NULL;

      SELECT rroad_long_varchar.long_varchar_id
        INTO @existing_long_varchar_id
        FROM rroad_long_varchar
       WHERE rroad_long_varchar.long_varchar = @new_long_varchar;

      IF @existing_long_varchar_id IS NOT NULL THEN

At this point @existing_long_varchar_id is ready to be returned to the caller because it points to the LONG VARCHAR value that already exists in rroad_long_varchar...
...but first, some bookkeeping is necessary.
The Foxhound purge process must know when it's safe to delete old LONG VARCHAR values, and it does this by checking that rroad_long_varchar.reference_count is zero.
The following code increments the reference_count by 1 each time a new reference is encountered, or rather, it attempts to increment the reference count:
It is possible for multiple almost-simultaneous calls to this function to reach this point with the same LONG VARCHAR value; i.e., Foxhound can sample up to 100 target databases in parallel, and the same LastStatement and LastPlanText values could be received from multiple targets.
         BEGIN -- handle exception
            SET TEMPORARY OPTION BLOCKING = 'OFF';
            UPDATE rroad_long_varchar
               SET reference_count = reference_count + 1
             WHERE long_varchar_id = @existing_long_varchar_id;
            SET TEMPORARY OPTION BLOCKING = 'ON';
When that happens, the above UPDATE can fail with SQLCODE -210 Row locked, -306 Deadlock detected, or -307 All threads are blocked.
Rather than have these collisions cause waits or rollbacks, BLOCKING = 'OFF' is used to force all collisions to raise exceptions (even row locks), and those exceptions are all handled by the next section of code which defers the incrementing of reference_count until later.
In the grand scheme of things, incrementing reference_count is very low priority; it is much more important for Foxhound sampling to proceed without delay.
         EXCEPTION WHEN OTHERS THEN
            SELECT SQLCODE INTO @sqlcode;
            SET TEMPORARY OPTION BLOCKING = 'ON';
            IF @sqlcode IN ( -210, -306, -307 ) THEN
               -- Defer the UPDATE until later.
               INSERT rroad_long_varchar_deferred_increment ( long_varchar_id )
                  VALUES ( @existing_long_varchar_id );
Here's the table used by the above INSERT; each row represents one deferred reference_count increment of +1 for one long_varchar_id. These deferrals are indefinite; the rows inserted in this table aren't needed until the Foxhound purge process is ready to delete unused rows in rroad_long_varchar.

CREATE TABLE rroad_long_varchar_deferred_increment (
   increment_id      UNSIGNED INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   long_varchar_id   UNSIGNED INTEGER NOT NULL );
            ELSE
               -- Let the outer block handle the exception.
               RESIGNAL;
            END IF;
         END; -- handle exception

         RETURN @existing_long_varchar_id;

The above RETURN finishes handling a matching row in the rroad_long_varchar table.
The next section handles a new LONG VARCHAR value by inserting a new row in rroad_long_varchar:
It is also possible that multiple almost-simultaneous calls to this function can reach this point with the same LONG VARCHAR value.
      ELSE -- @existing_long_varchar_id IS NULL

         BEGIN -- handle exception
            INSERT rroad_long_varchar (
               long_varchar_id,
               reference_count,
               long_varchar )
               VALUES (
               DEFAULT,
               1,
               @new_long_varchar );
When that happens, the first execution of the above INSERT statement will work and later executions will (correctly) fail with SQLCODE -196 Index 'ux_long_varchar' for table 'rroad_long_varchar' would not be unique.
The resulting exceptions are handled in the next section, which uses a recursive call to this function to deal with the duplicate LONG VARCHAR value.
The recursive calls will ( should :) work because they won't ( shouldn't :) reach this section of code.
         EXCEPTION WHEN OTHERS THEN
            SELECT SQLCODE, SQLSTATE, ERRORMSG()
              INTO @sqlcode, @sqlstate, @errormsg;
            IF @sqlcode = -196 THEN -- Index 'ux_long_varchar' for table 'rroad_long_varchar' would not be unique
               IF @is_recursive_call = 'N' THEN
                  -- Try a recursive call, and this time @existing_long_varchar_id IS NOT NULL.
                  RETURN rroad_get_long_varchar_id ( @new_long_varchar, 'Y' );
               ELSE
                  -- This is already a recursive call, and it is still failing.
                  CALL rroad_exception ( STRING (
                     @diagnostic_location,
                     '(454eh1) Unexpected error in recursive call to rroad_get_long_varchar_id: ',
                     ' SQLCODE = ', @sqlcode,
                     ', SQLSTATE = ', @sqlstate,
                     ', ERRORMSG() = ', @errormsg ) );
                  RESIGNAL;
               END IF;
            ELSE
               -- Let the outer block handle the exception.
               RESIGNAL;
            END IF;
         END; -- handle exception

         RETURN @@IDENTITY;

      END IF;
   END IF;
END;
DELETE Changes
In Foxhound 4, deleting old values of LastStatement and LastPlanText was easy: just DELETE the rroad_group_2_property_pivot rows holding those values.

In Foxhound 5, LastStatement and LastPlanText are stored in rroad_long_varchar, and those rows can't be deleted until rroad_long_varchar.reference_count sinks to zero.
That happens when enough rows in rroad_group_2_property_pivot are deleted to fire the following trigger enough times to force the reference_count down to zero:
Note that it is possible for reference_count to sink below zero when rows are held in the rroad_long_varchar_deferred_increment table.
CREATE TRIGGER trd_rroad_group_2_property_pivot
   BEFORE DELETE ON rroad_group_2_property_pivot
   REFERENCING OLD AS old_rroad_group_2_property_pivot
   FOR EACH ROW
BEGIN
   -- Decrement rroad_long_varchar.reference_count as necessary.

   IF old_rroad_group_2_property_pivot.LastStatement_id > 0 THEN
      UPDATE rroad_long_varchar
         SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count - 1
       WHERE rroad_long_varchar.long_varchar_id = old_rroad_group_2_property_pivot.LastStatement_id;
   END IF;

   IF old_rroad_group_2_property_pivot.LastPlanText_id > 0 THEN
      UPDATE rroad_long_varchar
         SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count - 1
       WHERE rroad_long_varchar.long_varchar_id = old_rroad_group_2_property_pivot.LastPlanText_id;
   END IF;

END;
The following code shows what happens in the Foxhound purge process after old rows in rroad_group_2_property_pivot have been deleted, and it is time to consider deleting old rows in rroad_long_varchar.
The first section applies the deferred increments to the reference_count so the second section won't delete any rows in rroad_long_varchar that are still needed.
Note that rows in the rroad_long_varchar_deferred_increment table are deleted as soon as they are used.
FOR f_fetch_deferred_increment AS c_fetch_deferred_increment INSENSITIVE CURSOR FOR
SELECT rroad_long_varchar_deferred_increment.increment_id    AS @increment_id,
       rroad_long_varchar_deferred_increment.long_varchar_id AS @long_varchar_id
  FROM rroad_long_varchar_deferred_increment
   FOR READ ONLY
DO
   -- Apply and delete the deferred increment as an atomic operation.
   BEGIN ATOMIC
      UPDATE rroad_long_varchar
         SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count + 1
       WHERE rroad_long_varchar.long_varchar_id = @long_varchar_id;
      DELETE rroad_long_varchar_deferred_increment
       WHERE rroad_long_varchar_deferred_increment.increment_id = @increment_id;
   END;
   COMMIT;
END FOR;

The second section takes care of deleting rroad_long_varchar rows that are no longer referenced.
Note that up to 100 Foxhound sampling processes could be inserting data while the purge is running, and any one of them could insert new rows in rroad_long_varchar_deferred_increment; that's the reason for the AND NOT EXISTS predicate in the following FOR loop:
FOR f_long_varchar AS c_long_varchar NO SCROLL CURSOR FOR
SELECT rroad_long_varchar.long_varchar_id AS @long_varchar_id
  FROM rroad_long_varchar
 WHERE rroad_long_varchar.reference_count <= 0
   AND rroad_long_varchar.long_varchar_id > 0   -- don't delete empty row
   AND NOT EXISTS ( SELECT *                    -- don't delete rows that have deferred increments
                      FROM rroad_long_varchar_deferred_increment
                     WHERE rroad_long_varchar_deferred_increment.long_varchar_id = rroad_long_varchar.long_varchar_id )
 ORDER BY rroad_long_varchar.long_varchar_id
   FOR UPDATE
DO
   DELETE rroad_long_varchar
    WHERE CURRENT OF c_long_varchar;
END FOR;
Final Thoughts: Is This For You?
Before implementing string de-duplication in your application, run a test to remove duplicates to see if the savings are worthwhile.

String de-duplication is worthwhile in Foxhound because a very large number of identical LONG VARCHAR strings are stored in the database. In other applications, where there are fewer duplicates and/or shorter strings, the advantage might not be so great, and alternatives like COMPRESS() might be better.
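One rough way to run that test is to compare the number of stored values with the number of distinct values, and total their lengths. Here's a sketch; the table sample_detail and column last_statement are hypothetical placeholders for your own names, and HASH() is used so the DISTINCT operates on short strings rather than full LONG VARCHAR values:

-- Sketch only: substitute your own table and column names.
SELECT COUNT ( * )                                           AS total_values,
       COUNT ( DISTINCT HASH ( last_statement, 'SHA256' ) )  AS distinct_values,
       SUM ( LENGTH ( last_statement ) )                     AS total_characters
  FROM sample_detail
 WHERE COALESCE ( last_statement, '' ) <> '';

If total_values is much larger than distinct_values, de-duplication is probably worth a closer look.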
Your application may already be taking advantage of the "Blob Sharing" feature built in to SQL Anywhere even though it "only occurs when you set values of one column to be equal to those of another column".
In other words, blob sharing only works when your application explicitly creates identical copies (UPDATE t SET column1 = column2) and it has no effect when your application creates new values separately with no reference to previous identical values.

In Foxhound the code which searches for an existing string value uses a brute-force string comparison "WHERE rroad_long_varchar.long_varchar = @new_long_varchar" supported by an index:

CREATE UNIQUE INDEX ux_long_varchar
   ON dba.rroad_long_varchar ( long_varchar );

but that might not perform well when dealing with giant LONG BINARY values. Although it wasn't the case with Foxhound, you may find an alternative method (e.g., comparing HASH() values) will perform better.
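Here's one way a hash-first lookup might be laid out. This is only a sketch, not Foxhound code: the table, index and variable names are hypothetical, the lookup is assumed to run inside a procedure where @new_long_varchar and @existing_long_varchar_id have been declared, and the same idea applies to LONG BINARY values.

-- Hypothetical layout: store a short hash beside each large value and
-- index the hash instead of the full value.
CREATE TABLE dba.hypothetical_long_varchar (
   long_varchar_id     UNSIGNED INT     NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
   reference_count     BIGINT           NOT NULL,
   long_varchar_hash   VARCHAR ( 64 )   NOT NULL,   -- HASH ( long_varchar, 'SHA256' )
   long_varchar        LONG VARCHAR     NOT NULL );

CREATE INDEX ix_hypothetical_long_varchar_hash
   ON dba.hypothetical_long_varchar ( long_varchar_hash );

-- Compare hashes first, then confirm with a full comparison in case of a
-- (very unlikely) hash collision.
SELECT long_varchar_id
  INTO @existing_long_varchar_id
  FROM dba.hypothetical_long_varchar
 WHERE long_varchar_hash = HASH ( @new_long_varchar, 'SHA256' )
   AND long_varchar      = @new_long_varchar;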
The special exception handling might not be necessary for applications where collisions are unlikely; i.e., in other applications it might be OK to block and wait.
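In that case the bookkeeping shrinks to an ordinary UPDATE with default blocking behavior; here's a sketch, reusing Foxhound's table and variable names purely for illustration:

-- Simpler variant: no BLOCKING = 'OFF', no exception handler, no deferred
-- increments; the UPDATE just waits if another connection has the row locked.
UPDATE rroad_long_varchar
   SET reference_count = reference_count + 1
 WHERE long_varchar_id = @existing_long_varchar_id;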
String de-duplication didn't add much runtime overhead to Foxhound, but performing de-duplication on a large existing database can take a very long time; e.g., it can add hours to the process of upgrading a Foxhound 4 database to Foxhound 5.