Wednesday, July 27, 2011

Beware CURRENT TIMESTAMP

Did you know that your computer can travel in time? Into the future and backwards?



Here are three reasons SELECT CURRENT TIMESTAMP can suddenly return a value earlier than it did a moment ago:

  • the autumn "fall back" daylight savings time adjustment has just been made,

  • a leap second has just been added, or

  • the system clock has drifted forward and then been reset disciplined.
Daylight savings time you know about, and if you're one of the lucky ones you don't have to (or never did have to) deal with it.

Leap seconds haven't happened since 2008 so we're probably due for another one soon, but really... who cares? OK, astronomers care... anyone else?

My favorite is the last one...
When the W32Time Manager receives time samples, it uses special algorithms in NTP to determine which of the time samples is the most appropriate for use. The time service also uses another set of algorithms to determine which of the configured time sources is the most accurate. When the time service has determined which time sample is best, based on the above criteria, it adjusts the local clock rate to allow it to converge toward the correct time. If the time difference between the local clock and the selected accurate time sample (also called the time skew) is too large to correct by adjusting the local clock rate, the time service sets the local clock to the correct time. This adjustment of clock rate or direct clock time change is known as clock discipline. - How the Windows Time Service Works

Dilbert.com



Is This a Bohrbug? or a Heisenbug?


Here's a Murphy's Law scenario...
  • You see an error message like "Cannot convert '00:00:67' to a timestamp".

  • You fix the code.

  • Later on you see the same error message coming from a different location in the code.

  • At this point you ask "Should I search all the code for similar errors?"

  • If "Yes" then the effort will be wasted because only that second location needs fixing.

  • If "No" then there is a third location that needs fixing, and it will eventually show up in testing... or production (this is Murphy's Law, after all).

  • When you fix the third location, and ask the question again, the result is the same... there will always be at least one more error until you go looking for it ahead of time.

Show Me The Code!


Here's the pseudocode for a loop that executes every five seconds...

WHILE ... LOOP;

SET t = CURRENT TIMESTAMP;

perform a task that takes a varying amount of time;

SET delay = 5 seconds - DATEDIFF ( between t and CURRENT TIMESTAMP in seconds );

WAITFOR DELAY STRING ( '00:00:', delay );

END LOOP;

In the real world, the code's a bit harder; here are some excerpts from the actual SQL inside Foxhound...

DECLARE @from_timestamp TIMESTAMP;
DECLARE @waitfor_msec BIGINT;
DECLARE @waitfor_sec INTEGER;

...

WHILE ... LOOP

...

SET @from_timestamp = CURRENT TIMESTAMP;

CALL rroad_monitor_main();

...

SET @waitfor_msec = 5000 - rroad_f_datediff_msec ( @from_timestamp, CURRENT TIMESTAMP );

IF @waitfor_msec < 1000 THEN
SET @waitfor_sec = 1; -- minimum additional wait, even if this loop pass has taken a long time
ELSE
SET @waitfor_sec = CAST ( ROUND ( CAST ( @waitfor_msec AS DECIMAL ( 11, 2 ) ) / 1000.0, 0 ) AS INTEGER );
END IF;

...

WAITFOR DELAY STRING ( '00:00:', @waitfor_sec );

...

END LOOP;

The function call rroad_f_datediff_msec ( ... ) on line 17 is some legacy code from the days when DATEDIFF ( MILLISECOND, ... ) returned INTEGER instead of BIGINT like it does now in Version 12.

The IF statement on lines 19 through 23 accounts for the fact that the earlier CALL might have used up a lot of time, resulting in a very small or even negative @waitfor_msec value. If the value is less 1000 milliseconds (one second), @waitfor_sec is set to 1, otherwise @waitfor_sec is calculated using the ridiculously careful conversions on line 22.

The WAITFOR DELAY on line 27 uses a value most likely in the range '00:00:1' through '00:00:4'.

So... why this error message?



SQLCODE = -157, SQLSTATE = 53018, ERRORMSG() = Cannot convert '00:00:67' to a timestamp

The answer is that CURRENT TIMESTAMP on line 17 returned a value that was about a minute earlier than the value on line 11, and that happened because Windows had disciplined the system clock between the two references to CURRENT TIMESTAMP.

And Windows did that because the system clock had slowly, over days, drifted forward in time... it was just Foxhound's bad luck that the Windows clock discipline had been put into effect at a point between the two CURRENT TIMESTAMP references.

Here are some possible numbers: The CALL statement took 1 second and Windows set the system clock back by 63 seconds, so rroad_f_datediff_msec() returned -62000, so...


SET @waitfor_msec = 5000 - ( -62000 ) = 67000

...which resulted in "Cannot convert '00:00:67' to a timestamp".

Here's the fix: change IF to a CASE that deals with stupid-big values...

CASE

WHEN @waitfor_msec < 1000 THEN
SET @waitfor_sec = 1; -- minimum additional wait, even if this loop pass has taken a long time

WHEN @waitfor_msec > 5000 THEN
SET @waitfor_sec = 1; -- deal with the fact CURRENT TIMESTAMP can move backwards

ELSE
SET @waitfor_sec = CAST ( ROUND ( CAST ( @waitfor_msec AS DECIMAL ( 11, 2 ) ) / 1000.0, 0 ) AS INTEGER );

END CASE;

Yes, this was the second time in over a year that the same bug was found and fixed in a different code location.

Yes, this time the answer to the Murphy's Law question was "Yes, let's spend the time and look for any other code that might have the same problem."

And yes, of course the result was "Nope, you wasted your time, Foxhound only has two delay loops like this."


No comments: