Friday, February 13, 2009

Revisiting REGEXP_SUBSTR

In a past life (way back before Christmas) I said some harsh things about SQL Anywhere's support for regular expressions:

It's going to be a very long time before I learn what "positive lookbehind zero-width assertion" means, or code something like '(?<=new\\s)york' in an otherwise readable SELECT. If you already know about regular expressions, you might be happy, but maybe not if you grew up worrying about maintainability.
Top 10 Cool New Features in SQL Anywhere 11 November 3, 2008
I'm not going to say anything nice about regular expressions (they are abominable!)
JPEG Width and Height December 23, 2008

Stand back! I do regular expressions!




I can't say I know regular expressions yet, but I've been doing them lately.

And getting caught by a certain (ahem) behavior change introduced between two successive builds of SQL Anywhere 11.

Now, those familiar with SQL Anywhere's release cycle will know that behavior changes between major point releases are very rare, say after 11.0.0 but before 11.0.1. One reason for the rarity that the documentation (Help) is only shipped with major point releases, and it's the documentation that contains the "Behavior Changes" sections.

So, imagine my surprise when SQL code published in this blog, not once but twice, stopped working properly after I upgraded from the Generally Availability (GA) version 11.0.0.1264 to a later Express Bug Fix (EBF) version 11.0.0.1549.

Suddenly, the wonderful REGEXP_SUBSTR function went from being case-insensitive, just like every other normal string operation in SQL Anywhere, to being case sensitive.

Now, it's not as if I didn't know something was coming. I did see the early warnings, like this one that shipped with the GA software:
As a result of feedback received during the 11.0.0 beta period, the behavior of REGEXP and REGEXP_SUBSTR will be changed in an early 11.0.0 EBF. The exact extent of the change is still being considered, but the intent is to make the regular expression behavior closer to that which is available in other languages (such as Perl, Java, .NET and other products) and to allow for the possibility of additional flexibility in determining how regular expressions are matched.

Also, SIMILAR TO, REGEXP, and REGEXP_SUBSTR will be changed in an early 11.0.0 EBF so that [[:lower:]] and [[:upper:]] are case sensitive regardless of database case sensitivity.
SQL Anywhere 11.0.0 Release Notes

So here's what I thought: If I ever (snork!) use things with names like [[:lower:]] and [[:upper:]], I'll be sure to be careful!

In the meantime, I relied on the Help: "Regular expression matching is case insensitive on case-insensitive databases."

Little did I know that what I should have paid attention to is this comment on the same page: "SQL Anywhere syntax for regular expressions matches closely the Perl 5 syntax".

Here's how the build 1549 read-me described the change:
   ================(Build #1431  - Engineering Case #537171)================

The search condition REGEXP and system function REGEXP_SUBSTR() were using
the database's collation to determine if a literal, or character class range,
in the pattern matched the string. For example, if the database was case
insensitive and accent insensitive, matches were case insensitive and accent
insensitive as well. Ranges were evaluated using the collation sort order.
This resulted in different behavior than other tools such as Perl, Java,
.NET, etc. This has been fixed so that REGEXP and REGEXP_SUBSTR() only match
a literal in a pattern if it is the exact same character. Ranges in character
classes (for example '[A-F]') only match characters which have character
set encoding greater than or equal to the encoding of the first character
in the range (A in '[A-F]') and less than or equal to the encoding of the
second character in the range (F in '[A-F]').

Note, this change does not affect the SIMILAR TO search expression, which
continues to use the collation to determine character equivalence and evaluate
character class ranges.
That's why the Help is so important: It is written by native language speakers rather than professional Locale Lawyers.

Let's translate, into English:
   ================(Build #1431  - Engineering Case #537171)================

REGEXP and REGEXP_SUBSTR() are now case sensitive; i.e., 'A' is not equal to 'a'.
That's probably a good thing. Folks who know regular expressions (not me, not yet) say so.

But I still got caught. I have to fix two blog posts, PLUS I got royally embarrassed when I copied my own blog code into a client application and watched it... fail!

Stand back! I'm fixing code!

For the record, here's the bad code:
SET @padded_marker = REGEXP_SUBSTR (
@hex_string,
'(FF)+([0-9A-F][0-9A-E]|[0-9A-E][0-9A-F])',
@current_pos );
and here's the fix... see all the leetle a's and e's and f's? That's what case sensitivity does for you:
SET @padded_marker = REGEXP_SUBSTR (
@hex_string,
'((FF)|(ff))+([0-9A-Fa-f][0-9A-Ea-e]|[0-9A-Ea-e][0-9A-Fa-f])',
@current_pos );
Just gimme a minute, I'm gonna put "fix old blog posts" on my to-do list...
-- Fix JPEG Width and Height

-- Fix Loading Folders
...I'll get around to it soon, trust me!

No comments: