String de-duplication is the replacement of multiple duplicate string values with a single copy to save space.
Foxhound 5 uses custom code to implement de-duplication of two columns in the database, and this sometimes results in the database file being 50% smaller than in Foxhound 4.
This article shows
- how existing Foxhound code was changed to implement string de-duplication, and
- how concurrency problems were handled to prevent performance problems.
CREATE TABLE ChangesSELECT ChangesINSERT ChangesDELETE ChangesFinal Thoughts: Is This For You?
CREATE TABLE Changes
The strings of interest are the
Last Statement and Last Plan Text fields on the Foxhound Monitor and Sample History pages:
In Foxhound 4 they're stored in this table:
CREATE TABLE DBA.rroad_group_2_property_pivot (
sampling_id UNSIGNED INT NOT NULL,
sample_set_number /* PK X */ UNSIGNED BIGINT NOT NULL,
connection_number /* PK */ BIGINT NOT NULL,
...
LastStatement LONG VARCHAR NULL,
LastPlanText LONG VARCHAR NULL DEFAULT '',
...
PRIMARY KEY (
sample_set_number,
connection_number ) );
|
In Foxhound 5 they've been replaced by two unsigned integers:
CREATE TABLE dba.rroad_group_2_property_pivot (
sampling_id UNSIGNED INT NOT NULL,
sample_set_number /* PK X */ UNSIGNED BIGINT NOT NULL,
connection_number /* PK */ BIGINT NOT NULL,
...
LastStatement_id UNSIGNED INT NOT NULL DEFAULT 0,
LastPlanText_id UNSIGNED INT NOT NULL DEFAULT 0,
...
PRIMARY KEY (
sample_set_number,
connection_number ) );
|
The LastStatement_id and LastPlanText_id columns are effectively foreign key columns pointing to a new table (rroad_long_varchar) where the actual LONG VARCHAR values are stored:
CREATE TABLE dba.rroad_long_varchar (
long_varchar_id /* PK */ UNSIGNED INT NOT NULL DEFAULT autoincrement,
reference_count BIGINT NOT NULL,
long_varchar /* U */ LONG VARCHAR NOT NULL,
PRIMARY KEY ( long_varchar_id ) );
CREATE UNIQUE INDEX ux_long_varchar ON dba.rroad_long_varchar ( long_varchar );
|
The UNIQUE INDEX has two purposes: first, it guarantees there are no duplicates, and second, it supports the SELECT
WHERE rroad_long_varchar.long_varchar = @new_long_varchar clause used in a later section.
SELECT Changes
In Foxhound 5 the adhoc view
sample_connection has been modified to return the same result set as in Foxhound 4. In other words, even though the underlying rroad_group_2_property_pivot table no longer contains LastStatement and LastPlanText, those columns
do appear in the view.
Here's how:
In Foxhound 4 the view is a simple "SELECT everything from the base table"
create VIEW sample_connection AS
SELECT *
FROM rroad_group_2_property_pivot;
|
whereas Foxhound 5 uses joins to turn LastStatement_id and LastPlanText_id into the original LastStatement and LastPlanText:
create VIEW sample_connection AS
SELECT rroad_group_2_property_pivot.sampling_id,
rroad_group_2_property_pivot.sample_set_number, -- PRIMARY KEY
rroad_group_2_property_pivot.connection_number, -- PRIMARY KEY
...
rroad_LastStatement.long_varchar AS LastStatement,
rroad_LastPlanText.long_varchar AS LastPlanText,
...
FROM rroad_group_2_property_pivot
INNER JOIN ( SELECT *
FROM rroad_long_varchar
) AS rroad_LastStatement
ON rroad_LastStatement.long_varchar_id = rroad_group_2_property_pivot.LastStatement_id
INNER JOIN ( SELECT *
FROM rroad_long_varchar
) AS rroad_LastPlanText
ON rroad_LastPlanText.long_varchar_id = rroad_group_2_property_pivot.LastPlanText_id
|
Foxhound itself uses the same joins to display LastStatement and LastPlanText on the Monitor and History pages.
INSERT Changes
When the Foxhound sampling process inserts a connection sample, it calls a function to convert the SQL Anywhere LastStatement and LastPlanText properties into unsigned integers:
INSERT rroad_group_2_property_pivot (
sampling_id,
sample_set_number,
connection_number,
...
LastStatement_id,
LastPlanText_id,
...
SELECT @sampling_id,
@sample_set_number,
connection_number,
...
rroad_get_long_varchar_id ( LastStatement ),
rroad_get_long_varchar_id ( LastPlanText ),
...
FROM ...
|
Here's that function:
CREATE FUNCTION rroad_get_long_varchar_id (
IN @long_varchar LONG VARCHAR,
IN @is_recursive_call VARCHAR ( 1 ) DEFAULT 'N' )
RETURNS UNSIGNED INTEGER
NOT DETERMINISTIC
BEGIN
DECLARE @new_long_varchar LONG VARCHAR;
DECLARE @existing_long_varchar_id UNSIGNED INTEGER;
DECLARE @existing_long_varchar LONG VARCHAR;
DECLARE @sqlstate VARCHAR ( 5 );
DECLARE @errormsg VARCHAR ( 32767 );
DECLARE @sqlcode INTEGER;
DECLARE @diagnostic_location VARCHAR ( 20 );
DECLARE @v_existing_long_varchar_id VARCHAR ( 128 );
SET @new_long_varchar = TRIM ( COALESCE ( @long_varchar, '' ) );
IF @new_long_varchar = '' THEN
RETURN 0;
ELSE
|
The following section checks if the LONG VARCHAR value has already been stored in the rroad_long_varchar table.
SET @existing_long_varchar_id = NULL;
SELECT rroad_long_varchar.long_varchar_id
INTO @existing_long_varchar_id
FROM rroad_long_varchar
WHERE rroad_long_varchar.long_varchar = @new_long_varchar;
IF @existing_long_varchar_id IS NOT NULL THEN
|
At this point @existing_long_varchar_id is ready to be returned to the caller because it points to the LONG VARCHAR value that already exists in rroad_long_varchar...
...but first, some bookkeeping is necessary.
The Foxhound purge process must know when it's safe to delete old LONG VARCHAR values, and it does this by checking that rroad_long_varchar.reference_count is zero.
The following code increments the reference_count by 1 each time a new reference is encountered, or rather, it
attempts to increment the reference count:
BEGIN -- handle exception
SET TEMPORARY OPTION BLOCKING = 'OFF';
UPDATE rroad_long_varchar
SET reference_count = reference_count + 1
WHERE long_varchar_id = @existing_long_varchar_id;
SET TEMPORARY OPTION BLOCKING = 'ON';
|
It is possible that multiple almost-simultaneous calls to this function can reach this point with same LONG VARCHAR value; i.e., Foxhound can sample up to 100 target databases in parallel, and it's possible for the same LastStatement and LastPlanText values could be received from multiple targets.
When that happens,
the above UPDATE can fail with SQLCODE -210
Row locked, -306
Deadlock detected, or -307
All threads are blocked.
Rather than have these collisions cause waits or rollbacks, BLOCKING = 'OFF' is used to force all collisions to raise exceptions (even row locks), and those exceptions are all handled by the next section of code which defers the incrementing of reference_count until later.
EXCEPTION WHEN OTHERS THEN
SELECT SQLCODE INTO @sqlcode;
SET TEMPORARY OPTION BLOCKING = 'ON';
IF @sqlcode IN ( -210, -306, -307 ) THEN
-- Defer the UPDATE until later.
INSERT rroad_long_varchar_deferred_increment ( long_varchar_id )
VALUES ( @existing_long_varchar_id );
|
In the grand scheme of things, incrementing reference_count is very low priority; it is much more important for Foxhound sampling to proceed without delay.
Here's the table used by the above INSERT; each row represents one deferred reference_count increment of +1 for one long_varchar_id:
CREATE TABLE rroad_long_varchar_deferred_increment (
increment_id UNSIGNED INTEGER NOT NULL DEFAULT AUTOINCREMENT PRIMARY KEY,
long_varchar_id UNSIGNED INTEGER NOT NULL );
These deferrals are indefinite; the rows inserted in this table aren't needed until the Foxhound purge process is ready to delete unused rows in rroad_long_varchar.
ELSE
-- Let the outer block handle the exception.
RESIGNAL;
END IF;
END; -- handle exception
RETURN @existing_long_varchar_id;
|
The above RETURN finishes handling a matching row in the rroad_long_varchar table.
The next section handles a new LONG VARCHAR value by inserting a new row in rroad_long_varchar:
ELSE -- @existing_long_varchar_id IS NULL
BEGIN -- handle exception
INSERT rroad_long_varchar (
long_varchar_id,
reference_count,
long_varchar )
VALUES (
DEFAULT,
1,
@new_long_varchar );
|
It is also possible that multiple almost-simultaneous calls to this function can reach this point with same LONG VARCHAR value.
When that happens, the first execution of the above INSERT statement will work and
later executions will (correctly) fail with SQLCODE -196
Index 'ux_long_varchar' for table 'rroad_long_varchar' would not be unique.
The resulting exceptions are handled in the next section, by using recursive calls to this function to handle duplicate LONG VARCHAR values.
The recursive calls will ( should :) work because they won't ( shouldn't :) reach this section of code.
EXCEPTION WHEN OTHERS THEN
SELECT SQLCODE, SQLSTATE, ERRORMSG()
INTO @sqlcode, @sqlstate, @errormsg;
IF @sqlcode = -196 THEN -- Index 'ux_long_varchar' for table 'rroad_long_varchar' would not be unique
IF @is_recursive_call = 'N' THEN
-- Try a recursive call, and this time @existing_long_varchar_id IS NOT NULL.
RETURN rroad_get_long_varchar_id ( @new_long_varchar, 'Y' );
ELSE
-- This is already a recursive call, and it is still failing.
CALL rroad_exception ( STRING (
@diagnostic_location, '(454eh1) Unexpected error in recursive call to rroad_get_long_varchar_id: ',
' SQLCODE = ', @sqlcode,
', SQLSTATE = ', @sqlstate,
', ERRORMSG() = ', @errormsg ) );
RESIGNAL;
END IF;
ELSE
-- Let the outer block handle the exception.
RESIGNAL;
END IF;
END; -- handle exception
RETURN @@IDENTITY;
END IF;
END IF;
END;
|
DELETE Changes
In Foxhound 4, deleting old values of LastStatement and LastPlanText was easy: just DELETE the rroad_group_2_property_pivot rows holding those values.
In Foxhound 5, LastStatement and LastPlanText are stored in rroad_long_varchar, and those rows can't be deleted until rroad_long_varchar.reference_count sinks to zero.
That happens when enough rows in rroad_group_2_property_pivot are deleted to fire the following trigger enough times to force the reference_count down to zero:
CREATE TRIGGER trd_rroad_group_2_property_pivot
BEFORE DELETE ON rroad_group_2_property_pivot
REFERENCING OLD AS old_rroad_group_2_property_pivot
FOR EACH ROW
BEGIN
-- Decrement rroad_long_varchar.reference_count as necessary.
IF old_rroad_group_2_property_pivot.LastStatement_id > 0 THEN
UPDATE rroad_long_varchar
SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count - 1
WHERE rroad_long_varchar.long_varchar_id = old_rroad_group_2_property_pivot.LastStatement_id;
END IF;
IF old_rroad_group_2_property_pivot.LastPlanText_id> 0 THEN
UPDATE rroad_long_varchar
SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count - 1
WHERE rroad_long_varchar.long_varchar_id = old_rroad_group_2_property_pivot. LastPlanText_id;
END IF;
END;
|
Note that it is possible for reference_count to sink below zero when rows are held in the rroad_long_varchar_deferred_increment table.
The following code shows what happens in the Foxhound purge process after old rows in rroad_group_2_property_pivot have been deleted, and it is time to consider deleting old rows in rroad_long_varchar.
The first section applies the deferred increments to the reference_count so the second section won't delete any rows in rroad_long_varchar that are still needed.
Note that rows in the rroad_long_varchar_deferred_increment table are deleted as soon as they are used.
FOR f_fetch_deferred_increment AS c_fetch_deferred_increment INSENSITIVE CURSOR FOR
SELECT rroad_long_varchar_deferred_increment.increment_id AS @increment_id,
rroad_long_varchar_deferred_increment.long_varchar_id AS @long_varchar_id
FROM rroad_long_varchar_deferred_increment
FOR READ ONLY
DO
-- Apply and delete the deferred increment as an atomic operation.
BEGIN ATOMIC
UPDATE rroad_long_varchar
SET rroad_long_varchar.reference_count = rroad_long_varchar.reference_count + 1
WHERE rroad_long_varchar.long_varchar_id = @long_varchar_id;
DELETE rroad_long_varchar_deferred_increment
WHERE rroad_long_varchar_deferred_increment.increment_id = @increment_id;
END;
COMMIT;
END FOR;
|
The second section takes care of deleting rroad_long_varchar rows that are no longer referenced.
Note that up to 100 Foxhound sampling process could be inserting data while the purge is running, and any one of them could insert new rows in rroad_long_varchar_deferred_increment; that's the reason for the AND NOT EXISTS predicate in the following FOR loop:
FOR f_long_varchar AS c_long_varchar NO SCROLL CURSOR FOR
SELECT rroad_long_varchar.long_varchar_id AS @long_varchar_id
FROM rroad_long_varchar
WHERE rroad_long_varchar.reference_count <= 0
AND rroad_long_varchar.long_varchar_id > 0 -- don't delete empty row
AND NOT EXISTS ( SELECT * -- don't delete rows that have deferred increments
FROM rroad_long_varchar_deferred_increment
WHERE rroad_long_varchar_deferred_increment.long_varchar_id
= rroad_long_varchar.long_varchar_id )
ORDER BY rroad_long_varchar.long_varchar_id
FOR UPDATE
DO
DELETE rroad_long_varchar
WHERE CURRENT OF c_long_varchar;
END FOR;
|
Final Thoughts: Is This For You?
Before implementing string de-duplication in your application, run a test to remove duplicates to see if the savings are worthwhile.
String de-duplication is worthwhile in Foxhound because a very large number of identical LONG VARCHAR strings are stored in the database. In other applications, where there are fewer duplicates and/or shorter strings, the advantage might not be so great, and alternatives like COMPRESS() might be better.
Your application may already be taking advantage of the
"Blob Sharing" feature built in to SQL Anywhere even though it
"only occurs when you set values of one column to be equal to those of another column".
In other words, blob sharing only works when your application explicitly creates identical copies (UPDATE t column1 = column2) and it has no effect when your application creates new values separately with no reference to previous identical values.
In Foxhound the code which searches for an existing string value uses a brute-force string comparison "WHERE rroad_long_varchar.long_varchar = @new_long_varchar" supported by an index
CREATE UNIQUE INDEX ux_long_varchar ON dba.rroad_long_varchar ( long_varchar );
but that might not perform well when dealing with giant LONG BINARY values. Although it wasn't the case with Foxhound, you may find an alternative method (e.g., comparing HASH() values) will perform better.
The special exception handling might not be necessary for applications where collisions are unlikely; i.e., in other applications it might be OK to block and wait.
String de-duplication didn't add much runtime overhead to Foxhound, but performing de-duplication on a large existing database can take a very long time; e.g., it can add hours to the process of upgrading a Foxhound 4 database to Foxhound 5.