Friday, December 21, 2007

The Unicode Soft Hyphen vs. the Latin 1 Soft Hyphen (Cage Death Match)

As I delve further into the move to Unicode support in Tiburón (the next RAD Studio release), poring over the Unicode standards documents and what Windows thinks about Unicode, I've stumbled across an interesting controversy.  A quick Google search turned up this interesting summary of the controversy.  I went looking for information about this single code point, U+00AD, because I noticed a clear discrepancy between how the Microsoft .NET Framework's System.Char.GetUnicodeCategory() function categorizes this character and what the actual Unicode.org database says.  Microsoft and Latin 1 (ISO-8859-1) insist that this is a visible character, while the Unicode.org standard says it is not.  I'm guessing that MS based their code on pre-4.0 versions of the Unicode standard.
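The Unicode.org side of that discrepancy is easy to check from any runtime that bundles the Unicode Character Database. Here is a minimal sketch in Python (chosen only for illustration; Tiburón itself is Delphi). It assumes the interpreter's `unicodedata` module tracks a Unicode version of 4.0 or later, which is where the soft hyphen is classified as "Cf". Python also ships a frozen Unicode 3.2.0 database, so the pre-4.0 classification can be printed next to it for comparison.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

# General category according to the Unicode database bundled with this Python build.
current_category = unicodedata.category(SOFT_HYPHEN)

# Python keeps a frozen Unicode 3.2.0 database around, which predates Unicode 4.0.
legacy_category = unicodedata.ucd_3_2_0.category(SOFT_HYPHEN)

print(unicodedata.name(SOFT_HYPHEN))
print("current database:", unicodedata.unidata_version, current_category)
print("3.2.0 database:  ", legacy_category)
```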

Here is a rebuttal to the above article:

"That text is unfortunately too easy to misread (and overinterpret!)."

Now folks are parsing words!  Uh... excuse me??!  Standards text should be clearer than that.  Heck, even Unicode.org had "misread" or "overinterpreted" it for the first three major revisions of the standard.  Here's the relevant text from the ISO-8859-1 standard:

5.3.3 SOFT HYPHEN (SHY)
A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN, for use when a line break has been established within a word. [emphasis mine]

OK... when I stand on my head and the phase of the moon is just right (that's called sarcasm, folks), I can see a vague reference to it not being a rendered character except in specific circumstances...  Had they really meant for this character not to be considered graphically rendered, maybe they should have adopted text similar to that for the non-breaking space:

5.3.2 NO-BREAK SPACE (NBSP)
A graphic character the visual representation of which consists of the absence of a graphic symbol, for use when a line break is to be prevented in the text as presented. [emphasis mine]

I'm going to be blunt here.  This is stupid.  Latin 1, ISO-8859-1, EBCDIC, etc. all predate the Unicode standard (which was established in 1991).  The first 256 code points in the Unicode standard purport to be equivalent to ISO-8859-1.  In this case, I'm inclined to just special-case that one character and change its Unicode classification from "Cf" (Other, Format) to "Pd" (Punctuation, Dash).  This would make it consistent with how MS Office regards this character and how it is classified in the .NET Framework.  Open Notepad or Word and type Alt+0173 on the numpad.  You'll see a hyphen show up.  Save the document in Notepad as Unicode, then dump it as hex and you'll get something like this:

000000: FF FE AD 00 0D 00 0A 00  00 00 00 00 00 00 00 00 ................

There's the little-endian Byte Order Mark (BOM) as the first two bytes, then the next two bytes are the U+00AD Soft Hyphen character.  Maybe I'm imagining things...  But I'm sure I see that character on the display.
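To double-check that reading of the dump, the bytes can be decoded directly. This is a small Python sketch that takes only the first eight bytes shown above, on the assumption that the trailing zeros are just the dump tool padding the row out to 16 bytes.

```python
# First eight bytes from the hex dump: BOM, soft hyphen, CR, LF.
raw = bytes.fromhex("FF FE AD 00 0D 00 0A 00")

# The "utf-16" codec reads the BOM to pick the byte order and then drops it.
text = raw.decode("utf-16")

code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)  # ['U+00AD', 'U+000D', 'U+000A']
```

So the file holds exactly one soft hyphen followed by a CR/LF line ending, which matches what Notepad was asked to save.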

Who would have thought that one code point in the whole of the Unicode standard, which covers many thousands of code points, would be so polarizing?  Maybe this is all hyperbole on my part, but the first article linked above really goes out of its way to try to smack down Unicode.org once and for all.  The rebuttal, for its part, has an air of indignant arrogance and rudely brushes off (and rewords) the various assertions, which from my point of view seemed reasonable.  I've not found a well-reasoned explanation as to why Unicode.org decided to make the change.  Not even something like "Oops!  We screwed the pooch on that one!  Sorry folks!"  Maybe they could have taken "common usage" (even though it may be incorrect usage) into account and simply introduced a new code point to do what they wanted.  (It's not like they're going to run out of code points anytime soon :)

I guess that's the beauty of standards: there are so many to choose from :).  Oh, and there are all those folks who are going to immediately jump on this and start bashing MS as being hostile toward standards or simply ignoring them.  Yeah, that's right folks, there's a whole committee at MS called "How can we mess with the standards bodies?".  I'm not agreeing with MS simply because they're MS; I just looked at the history and the facts (as I've been able to discover them) and drew my own conclusion.

I've been exposed to the ISO standards process.  I know the lengths to which they're supposed to go to look out for the interests of the relevant industry as a whole.  I can imagine the heated discussions in the "Soft Hyphen" sub-committee :).  "Punctuation!"  "Format Control!"  "Punctuation!"  "Format Control!"  "I know you are but what am I!"  "Neener Neener!"

All I want is my unit tests to pass!
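For what it's worth, the special case floated earlier is tiny. Here is a minimal sketch of the idea in Python rather than Delphi; the helper name `category_with_shy_as_dash` is hypothetical and is not part of any RAD Studio, .NET, or Python API. It defers to the bundled Unicode database for everything except U+00AD, which it reports as "Pd".

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def category_with_shy_as_dash(ch: str) -> str:
    """Hypothetical helper: Unicode general category, with U+00AD forced to 'Pd'."""
    if ch == SOFT_HYPHEN:
        # Special case: treat the soft hyphen as dash punctuation, not a format character.
        return "Pd"
    # Every other character keeps whatever the bundled Unicode database says.
    return unicodedata.category(ch)


print(category_with_shy_as_dash(SOFT_HYPHEN))
print(category_with_shy_as_dash("-"))
print(category_with_shy_as_dash("A"))
```

The trade-off is that any code relying on this helper deliberately disagrees with the current Unicode database for that one code point.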

With that, have a great Holiday Season.  Make sure you spend some quality time with family and friends.  Always designate a driver or call a cab.