Friday, December 21, 2007

The Unicode Soft Hyphen vs. the Latin 1 Soft Hyphen (Cage Death Match)

As I delve further into the move to Unicode support in Tiburón (the next RAD Studio release), poring over the Unicode standards documents and what Windows thinks about Unicode, I've stumbled across an interesting controversy.  A quick Google search turned up this interesting summary of the controversy.  I went looking for information about this single code point, U+00AD, because I noticed a discrepancy between how the Microsoft .NET Framework's System.Char.GetUnicodeCategory() function categorizes this character and what the actual Unicode.org database says.  Microsoft and Latin 1 (ISO-8859-1) insist that this is a visible character, while the Unicode.org standard says it is not.  I'm guessing that MS based their code on pre-4.0 versions of the Unicode standard.
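For anyone who wants to see the Unicode.org side of that discrepancy for themselves, here is a minimal sketch. It assumes Python 3 and its standard unicodedata module (neither is part of the original investigation, which was done against .NET), and the helper name describe is made up for illustration. It simply asks the bundled Unicode database how it classifies U+00AD and, for comparison, the ordinary hyphen-minus.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"   # the code point in question
HYPHEN_MINUS = "-"       # U+002D, the ordinary keyboard hyphen

def describe(ch):
    """Return the code point, general category, and name of one character."""
    return "U+%04X %s %s" % (ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# Print what the Unicode database bundled with this Python says.
print(describe(SOFT_HYPHEN))
print(describe(HYPHEN_MINUS))
```

With any current Python, the soft hyphen should come back as "Cf" and the hyphen-minus as "Pd", which is the post-4.0 classification discussed here.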

Here is a rebuttal to the above article

"That text is unfortunately too easy to misread (and overinterpret!)."

Now folks are parsing words!  Uh... excuse me??!  Standards text should be more clear than that. Heck, even the Unicode.org had "misread" or "overinterpreted" it for the first three major revisions of the standard.  Here's the relevant text from the ISO-8859-1 standard:

5.3.3 SOFT HYPHEN (SHY)
A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN, for use when a line break has been established within a word. [emphasis mine]

Ok.. when I stand on my head and when the phase of the moon is just right (that's called sarcasm, folks), I can see a vague reference to it not being a rendered character except in specific circumstances...  Had they really meant this character to not be considered graphically rendered, maybe they should have adopted text similar to the non-breaking space:

5.3.2 NO-BREAK SPACE (NBSP)
A graphic character the visual representation of which consists of the absence of a graphic symbol, for use when a line break is to be prevented in the text as presented. [emphasis mine]

I'm going to be blunt here.  This is stupid.  Latin 1, ISO-8859-1, EBCDIC, etc. all pre-date the establishment of the Unicode standard (which was established in 1991).  The first 256 code points in the Unicode standard purport to be equivalent to ISO-8859-1.  In this case, I'm inclined to just special-case that one character and change its Unicode classification from "Cf" (Other, Format) to "Pd" (Punctuation, Dash).  This would make it consistent with how MS Office regards this character and how it is classified in the .NET Framework.  Open Notepad or Word and type Alt+0173 on the numpad.  You'll see a hyphen show up.  Save the document in Notepad as Unicode, then dump it as hex and you'll get something like this:

000000: FF FE AD 00 0D 00 0A 00  00 00 00 00 00 00 00 00 ................

There's the little-endian Byte Order Mark (BOM) as the first two bytes, then the next two bytes are the U+00AD Soft Hyphen character.  Maybe I'm imagining things...  But I'm sure I see that character on the display.
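The meaningful bytes of that dump can be reproduced without Notepad. This is only a sketch, assuming Python 3; the helper hex_dump_line is a made-up name, and the trailing zero padding in the dump above is not reproduced. It encodes a soft hyphen followed by CR LF as little-endian UTF-16 with a BOM and prints the bytes as hex.

```python
import codecs

# One soft hyphen followed by a Windows line ending, as typed in Notepad.
text = "\u00ad\r\n"

# Notepad's "Unicode" save format is little-endian UTF-16 with a BOM in front.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def hex_dump_line(raw):
    """Format bytes as space-separated, upper-case hex pairs."""
    return " ".join("%02X" % b for b in raw)

print(hex_dump_line(data))   # FF FE AD 00 0D 00 0A 00
```

Decoding the same bytes with the "utf-16" codec consumes the BOM and gives back the original three characters, so nothing about the file format itself says whether U+00AD should be drawn.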

Who would have thought that one code point in the whole of the Unicode standard, which covers many thousands of code points, would be so polarizing?  Maybe this is all hyperbole on my part, but the first article linked above really goes out of its way to try and smack down the Unicode.org once and for all.  Also, the rebuttal has an air of indignant arrogance and throughout the text just rudely brushes off (and rewords) the various assertions, which from my point of view seemed reasonable.  I've not found a well-reasoned explanation as to why the Unicode.org decided to make the change.  Even something like "Oops!  We screwed the pooch on that one!  Sorry folks!" would do.  Maybe they could have taken "common usage" (even though it may be incorrect usage) into account and simply introduced a new code point to do what they wanted.  It's not like they're going to run out of code points anytime soon :)

I guess that's the beauty of standards: there are so many to choose from :).  Oh, and there are all those folks that are going to immediately jump on this and start bashing MS as being hostile toward standards or simply ignoring them.  Yeah, that's right, folks, there's a whole committee at MS called "How can we mess with the standards bodies?".  I'm not agreeing with MS simply because they're MS; I just looked at the history and the facts (as I've been able to discover them) and drew my own conclusion.

I've been exposed to the ISO standards process.  I know the lengths to which they're supposed to go to look out for the interests of the relevant industry as a whole.  I can imagine the heated discussions in the "Soft Hyphen" sub-committee :).  "Punctuation!"  "Format Control!" "Punctuation!" "Format Control!" "I know you are but what am I!" "Neener Neener!"

All I want is my unit tests to pass!
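For what it's worth, the special case floated earlier could be as small as a one-entry override table sitting in front of the standard lookup. This is a sketch only, written in Python 3 rather than in anything that ships with Tiburón; CATEGORY_OVERRIDES and legacy_category are hypothetical names, and the override reflects the pre-4.0 "Pd" classification as an assumption about what the tests expect.

```python
import unicodedata

# Hypothetical override table: treat U+00AD the way the pre-4.0 data did.
CATEGORY_OVERRIDES = {"\u00ad": "Pd"}

def legacy_category(ch):
    """Return the general category, applying local overrides first."""
    return CATEGORY_OVERRIDES.get(ch, unicodedata.category(ch))

print(legacy_category("\u00ad"))  # overridden value
print(legacy_category("A"))       # falls through to the Unicode database
```

Every other character still goes through the real database, so the deviation from the standard stays confined to the one code point.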

With that, have a great Holiday Season.  Make sure you spend some quality time with family and friends.  Always designate a driver or call a cab.

3 comments:

  1. Thanks for the Unicode information.

    Will Tiburón provide "must have" routines for Unicode processing, like the ones below?

    1. Load From File, Save To File // support BOM and UTF-8

    2. Smart and fast Pos and Replace that support the parameters below:

    Start, Direction, CaseSensitive

    Right now I write the code myself or copy it from the Internet.

    But I do need them in the Delphi RTL.
    I think they are "must have" routines.
    I would like Delphi to provide more string processing routines because they are heavily used. I use Delphi 2006.


    Thanks,

    Bear

  2. I think you misread the ISO-8859-1 sentence:

    "A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing HYPHEN, for use when a line break has been established within a word."

    I read it this way:

    "A graphic character that is imaged [description of how to show it], for use [description of when the user should use it]"

    I can see absolutely no reference at all to the idea that an ISO-8859-1 SOFT HYPHEN should not be rendered in some cases. On the contrary, it clearly states how it should always be rendered.

    The HTML standard is about how to render things, and may change how to render specific symbols.

    With regard to the original standards, a line-ending hyphen has typographically historically been smaller than an en-dash, and typographically doesn't make much sense in other places than at the line ending.

    An editor can be designed in several ways. An editor can try to represent the bytes in the file that is being edited, in a way that lets the user control exactly what bytes are stored in the file. But the editor can also focus on creating nice layouts. If you're creating source code, you would prefer the first behavior, and if you're creating a sales brochure, you would prefer the second behavior.

    As you can see, the application defines whether to show the hyphen or not - the ISO/Unicode standards don't.

    It is a mistake to assume, that the character set should define the exact behavior of a soft hyphen in an editor.

  3. I agree with #4. The text states that it is only visible when breaking a word. It's not the best way to put it in words, however. And I'm pretty sure it's not the way it's been used. But what else could a *soft* hyphen mean?
