Monday, December 2, 2013

Evaluating A Candidate For CLUSTERED

Question: How can I tell if a column is a good candidate for a CLUSTERED index?

Glib Answer: Create a clustered index on that column and see if query performance is improved.

Longer Answer: If the rows are inserted in more-or-less the same order as values in the column, and the column appears in predicates that might benefit from a clustered index on that column (such as a BETWEEN range query), then see the Glib Answer above.

Seriously... :)

In SQL Anywhere the CLUSTERED attribute is more of a performance hint than a constraint. In other words, if a clustered index already exists then SQL Anywhere will try to insert new rows in the same physical order as values in the index, but there's no guarantee. And there's REALLY no guarantee if a non-clustered index already exists and you change it with ALTER INDEX:
ALTER INDEX x ON t CLUSTERED;
In fact, that ALTER INDEX will run instantly; SQL Anywhere will not move anything around, and it's up to you to tell the truth about insert ordering if you want good performance.

The flip side is that it's really easy to run ALTER INDEX x ON t CLUSTERED statements, even in production, and see if performance improves; you can always change it back with this:
ALTER INDEX x ON t NONCLUSTERED;
In other words, the Glib Answer ain't so glib after all.

Answer the question!

Sometimes a column is "somewhat clustered", and you'd like to know "how clustered". If the rows are perfectly ordered according to the values in some other column, you can use that fact to check the candidate column; e.g., a PRIMARY KEY DEFAULT AUTOINCREMENT column will perfectly define the row order if the DEFAULT is always used when inserting.

Here's some code showing how to compare the ordering of a candidate column with a DEFAULT AUTOINCREMENT column:
-- DBA.t (table_id 738) in ddd16 - Dec 1 2013 2:31:56PM - Print - Foxhound © 2012 RisingRoad
CREATE TABLE DBA.t ( -- 725,161 rows, 15.5M total = 13.3M table + 0 ext + 2.3M index, 23 bytes per row
   pkey               /* PK        */ INTEGER NOT NULL DEFAULT autoincrement,
   candidate_column                   TIMESTAMP NULL,
   CONSTRAINT ASA78 PRIMARY KEY ( -- 2.3M
      pkey )
 );

WITH ordered_t AS ( SELECT ROW_NUMBER() OVER ( ORDER BY t.pkey ) AS row_number,
                           t.candidate_column
                      FROM t ) 
SELECT 'Disordered row count: ' AS "Row count", 
       COUNT(*)
  FROM ( SELECT * FROM ordered_t ) AS ordered_t1
       INNER JOIN
       ( SELECT * FROM ordered_t ) AS ordered_t2
       ON ordered_t2.row_number = ordered_t1.row_number + 1
 WHERE ordered_t2.candidate_column < ordered_t1.candidate_column 
UNION ALL
SELECT 'Total row count:',
       COUNT(*)
  FROM t;

Disordered row count:  98,790
Total row count:      725,161
  • The WITH clause on lines 9 through 11 creates a local view called ordered_t. That view deals with the fact there may be gaps between adjacent values of t.pkey. It does that by calling ROW_NUMBER() to replace t.pkey with the sequence 1, 2, 3 with no gaps.

  • The FROM clause on lines 14 through 17 joins two copies of the ( SELECT * FROM ordered_t ) view, matching each row in one copy with the next (adjacent) row in the other copy.

  • The WHERE clause on line 18 looks for candidate_column values that are out of order in those adjacent rows.

  • The SELECT on lines 12 through 18 displays how many rows are out of order, and the UNION ALL SELECT on lines 19 through 22 shows the total row count.
The result shows that a quite a few rows are out of order (1 out of 7) so maybe a CLUSTERED index won't help.

Let's try one

Here's a range query that was tested with no index on candidate_column, with a non-clustered index, and with a clustered index. Each test was run twice, with and without capturing a graphical plan, and the database was restarted before each test to eliminate effects of the cache:
SELECT COUNT(*)
  FROM t
 WHERE candidate_column BETWEEN CURRENT DATE - 1000
                            AND CURRENT DATE - 900;
Here are the results:
                                Graphical Plan - Time       ISQL
                                to fetch the first row  Execution Time 
                                ----------------------  --------------

No index on t.candidate_column          0.85942             0.852

CREATE INDEX xcandidate_column          0.50568             0.439
   ON t ( candidate_column )

ALTER INDEX xcandidate_column           0.09796             0.08
   ON t CLUSTERED;
The second column shows the FirstRowRunTime in seconds reported in the graphical plans, and the third column shows what ISQL reported when run without capturing plans. The two columns are consistent, showing that
  • the non-clustered index shaved almost 50% off the no-index execution time, and

  • declaring the index as CLUSTERED reduced the time by 90%.
So... in this case, for the distribution of data in this table, for this particular query... CLUSTERED wins!

Here's a snippet from the graphical plan for the no-index test, showing that an intra-query Parallel Table Scan was used:

The plan for the non-clustered index test shows no more table scan, now it's using a Parallel Index Only Retrieval Scan; remember, it's just counting rows so it doesn't actually have to retrieve anything from the table itself:

The clustered index plan looks exactly the same as the non-clustered index plan, except it says "Clustered index yes"... and it's faster.


No comments: