Monday, September 24, 2012

Fix Twice, Ship Once

"Measure twice, cut once" is often crudely interpreted to mean "think before you act", but it's more subtle than that: cutting material is a two step process where the first step (measuring for the cut) can be repeated without harm, and even when you make a mistake measuring there's no immediate harm... BUT, if you base the second step (cutting) on a measurement that's too short, the harm is immediate and permanent... it cannot be undone.

Fixing a late-stage bug in a computer program is similar: Nobody will notice how careful you are, but they certainly will notice when you get it wrong. That's what makes late-stage bugs so much fun: They're bigger and badder than bugs found earlier, more folks notice them (system testers, quality assurance, end users, project managment), they're harder to find and fix, and if your fix doesn't stick then... well, maybe the harm isn't permanent but the embarrasment will feel like it.

There are a number of ways you can "measure twice" when fixing a bug. The first way is to look around for a second instance of the same kind of problem. Here's an example from an UPDATE SET that resulted in the bogus display value Transaction Running Time = 41,170d 15h 13m 40s:

      = IF TRIM ( COALESCE ( rroad_group_2_property_pivot.TransactionStartTime, '' ) ) = ''
           THEN 0
                   rroad_f_dateadd_msec ( @datediff_msec_between_target_and_local,  
                                          rroad_group_2_property_pivot.TransactionStartTime ), 
                   @sample_recorded_at )
The TransactionStartTime column is a VARCHAR sring that is sometimes empty, and when it is, the transaction_running_time column is supposed to be set to zero.

The trouble with that code is that
  • TransactionStartTime is declared as NOT NULL DEFAULT '1900-01-01',

  • so it's never NULL but sometimes '1900-01-01'

  • and when that happens it sails right past the IF TRIM COALESCE test for "empty"

  • and the DATEDIFF returns a value like 3,557,142,820,286

  • which gets displayed as the truly monstrous 41,170d 15h 13m 40s.
The solution was to check for all the likely representations of "empty":
      = IF TRIM ( COALESCE ( rroad_group_2_property_pivot.TransactionStartTime, '' ) ) 
           IN ( '', '1900-01-01', '0000-01-00 00:00:00.000000' )
Yes, the COALESCE could have been eliminated because the column's NOT NULL, but it doesn't hurt so it got left in. And yes, it's a dumb mistake, there are better ways to write the code, etcetera.

But that's not the point. The point is, when fixing a bug, to look around to see if exactly the same mistake was made anywhere else... and it turned out that yes, the LastReqTime column was also declared NOT NULL DEFAULT '1900-01-01', and it was also mishandled with the same IF TRIM COALESCE = '' logic, and it needed exactly the same fix to avoid bogus display values.

The second way to "measure twice" when fixing a bug is to look for a second independent method to fix the bug or even just prevent the symptom. An example of this was some "reset" logic that got carried away setting columns back to zero and reset three columns that should never have been touched.

The solution was to remove the incorrect code from the two locations it was found... but what if that didn't entirely fix the problem? What else could be done?

Well, the three columns in question should never be updated with smaller values; all three should should remain unchanged or be assigned larger values. A second, independent fix for this symptom was to write a trigger to enforce this constraint:
CREATE TRIGGER tru_rroad_alerts_criteria 
   BEFORE UPDATE ON rroad_alerts_criteria
   REFERENCING OLD AS old_rroad_alerts_criteria NEW AS new_rroad_alerts_criteria

   -- Prevent these columns from decreasing in value.

   IF new_rroad_alerts_criteria.previous_sample_set_number_processed < old_rroad_alerts_criteria.previous_sample_set_number_processed THEN
      SET new_rroad_alerts_criteria.previous_sample_set_number_processed = old_rroad_alerts_criteria.previous_sample_set_number_processed;
   END IF;

   IF new_rroad_alerts_criteria.previous_gathered_sample_finished_at < old_rroad_alerts_criteria.previous_gathered_sample_finished_at THEN
      SET new_rroad_alerts_criteria.previous_gathered_sample_finished_at = old_rroad_alerts_criteria.previous_gathered_sample_finished_at;
   END IF;

   IF new_rroad_alerts_criteria.current_sample_finished_at < old_rroad_alerts_criteria.current_sample_finished_at THEN
      SET new_rroad_alerts_criteria.current_sample_finished_at = old_rroad_alerts_criteria.current_sample_finished_at;
   END IF;


The third way to "measure twice" when fixing a bug is to admit that where there's one bug there's probably two, or more, and now's a good time to take a quick look around for other bugs in the same module or section of code. This is different from the "second instance of the same bug" described earlier; here you're looking for unrelated bugs in a section of code that's, well, known to be buggy, and...

  • you've got that module open on the workbench,

  • and you're already poking around in the code,

  • so why not look around some more?
Management may not like the third approach, especially if the code's been frozen: "We're not looking for trouble, we're trying to catch the market!" They might have a point if you start fixing stuff that doesn't need fixing, like "ugly" code, or bugs that are unlikely to surface any time soon.

But, there's no harm in looking, and reporting what you find, and you might just find something big and ugly that slipped through all the walkthroughs.

No comments: