Friday, January 11, 2008

Unicode Character Categorization.

Another very common item that will trip up many folks is how to easily determine which category a particular Unicode code-point belongs to.  In the simpler time of ASCII (American Standard Code for Information Interchange), and even the slightly more enlightened period of ISO-8859-x, most folks would just assume that if a code-point fell between $41 and $5A, it was an upper-case letter (never mind all those ISO-8859-x characters with diacritics up above $80).  For compilers (or even interpreted and scripting languages), it really was that simple, since most compilers didn't allow identifier characters outside the normal ASCII range ($00-$7F).
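
In code, that old assumption boils down to a simple range test.  Here's a sketch of the classic check (IsUpperAscii is just an illustrative name):

function IsUpperAscii(Ch: Char): Boolean;
begin
  // Only meaningful for the ASCII range: 'A'..'Z' is $41..$5A
  Result := (Ch >= 'A') and (Ch <= 'Z');
end;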

This just isn't the case anymore.  With literally many thousands of code-points defined across the whole of the Unicode specification, simple range tests (and even the more sophisticated set "in" tests) just do not cut it.  That said, you will still be able to do things like this:

function NextWord(const S: string; var I: Integer): string;
var
  Start: Integer;
begin
  // Skip any leading whitespace or control characters
  while (I <= Length(S)) and (S[I] in [#1..#31, ' ']) do Inc(I);
  Start := I;
  // Now gather up the word
  while (I <= Length(S)) and (S[I] in ['A'..'Z', 'a'..'z']) do Inc(I);
  Result := Copy(S, Start, I - Start);
end;

This code works fine for much of the western world... as long as there are no characters such as ä or û.  Sure, this code could be updated to include those characters, but the definitions of the code-points above $80 differ between ISO-8859-1 (Latin1) and, say, ISO-8859-5 (Cyrillic).  For example, the code-point $E4 is ä in Latin1 but Ф in Cyrillic.  This doesn't even take into account the far eastern languages, where the number of code-points far exceeds the single-byte range ($00-$FF).  Those encodings have to provide escape maps: a "lead-byte" tells the processing code to interpret the next byte using a different table, and there can be many different lead-bytes that denote different tables.  That's the bad news.
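
To make the lead-byte mechanism concrete, here is a minimal sketch of how an MBCS scan typically looks in Delphi today, using the LeadBytes set that SysUtils initializes from the active code page (the function itself is hypothetical):

uses SysUtils;

function CountMbcsChars(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // A lead-byte means the following byte belongs to the same character
    if S[I] in LeadBytes then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;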

Now for the good news.  First of all, with Tiburón, the above code will compile just fine (you'll get the warning "WideChar reduced to byte char in set expressions" on the "in" expressions) and will actually function as originally designed.  The compiler will still generate code to ensure that the WideChar at S[I] is within the byte range of the set (Object Pascal sets can only have 256 elements and can be a maximum of 32 bytes in size), so you won't get any false positives if the string contains, say, Ł (U+0141).  Also, if you are OK with how the above works, still don't need this code to accept the full gamut of Unicode code-points, and simply want to suppress the warning, just add the {$WARN WIDECHAR_REDUCED OFF} directive to the source, or use the -W-WIDECHAR_REDUCED command-line switch.  The compiler told you about a potential problem, and you now have an idea of how to fix it.
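
If you'd rather not silence the warning for an entire unit, the directive can be scoped to just the offending expressions.  A sketch, wrapping the loop from the example above:

{$WARN WIDECHAR_REDUCED OFF}
  while (I <= Length(S)) and (S[I] in [#1..#31, ' ']) do Inc(I);
{$WARN WIDECHAR_REDUCED ON}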

But what if you really wanted to make this code a little more Unicode-friendly?  With the many thousands of Unicode code-points, it is simply impractical to provide a huge set holding more than 256 elements.  That is where we'll be introducing a full gamut of character categorization functions.  To get an idea of what I'm talking about, we've modeled these functions after a similar set of functions in the Microsoft .NET framework.  Take a look at the System.Char class; it has many static functions such as IsLetter, IsNumber, IsPunctuation, etc.  How could the above code be written to be more Unicode-friendly?

function NextWord(const S: string; var I: Integer): string;
var
  Start: Integer;
begin
  // Skip any leading whitespace or control characters
  while (I <= Length(S)) and (IsWhitespace(S, I) or IsControl(S, I)) do Inc(I);
  Start := I;
  // Now gather up the word
  while (I <= Length(S)) and IsLetter(S, I) do
  begin
    // A surrogate pair occupies two WideChar elements, so skip the extra one
    if IsSurrogate(S, I) then
      Inc(I);
    Inc(I);
  end;
  Result := Copy(S, Start, I - Start);
end;
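
As a quick illustration, here is a hypothetical driver loop for NextWord (the sample string is mine; note that it advances I itself when NextWord returns an empty string, e.g. on punctuation, so the scan cannot stall):

var
  S, W: string;
  I: Integer;
begin
  S := 'Grüße aus München';
  I := 1;
  while I <= Length(S) do
  begin
    W := NextWord(S, I);
    if W <> '' then
      Writeln(W)
    else
      Inc(I); // neither whitespace nor a letter; step over it
  end;
end.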

The implementation of these functions is fully table-driven and is derived directly from data files provided by Unicode.org.  The data is processed, compressed, and accessed using very fast table look-ups.  This processing of the tables is done during our build process, which generates a resource that is then linked in.  It is based on the latest version (5.0) of the Unicode specification, so if that ever changes, all we'll need to do is grab the latest data files and process them.
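
For a rough idea of what "compressed table look-up" means here, a common shape is a two-stage index: the high byte of a (BMP) code-point selects a block, and the low byte selects the category within that block, so identical 256-entry blocks need only be stored once.  A purely illustrative sketch (the actual layout and category names may differ):

type
  TUnicodeCategory = (ucOther, ucControl, ucWhitespace, ucLetter);
  TCategoryBlock = array[0..255] of TUnicodeCategory;

var
  Stage1: array[0..255] of Byte;    // high byte -> block index
  Stage2: array of TCategoryBlock;  // block index -> 256 categories

function GetCategory(C: WideChar): TUnicodeCategory;
begin
  // Two small look-ups instead of one 65,536-entry table
  Result := Stage2[Stage1[Ord(C) shr 8]][Ord(C) and $FF];
end;

procedure InitDemoTables;
var
  I: Integer;
begin
  // Two blocks suffice for this demo: block 0 describes U+0000..U+00FF,
  // block 1 is a shared "everything else" filler.  The real tables would
  // be generated from the Unicode.org data files, not hand-initialized.
  SetLength(Stage2, 2);
  for I := 0 to 255 do
  begin
    Stage1[I] := 1;
    Stage2[0][I] := ucOther;
    Stage2[1][I] := ucOther;
  end;
  Stage1[0] := 0;
  for I := $01 to $1F do Stage2[0][I] := ucControl;
  Stage2[0][$20] := ucWhitespace;
  for I := Ord('A') to Ord('Z') do Stage2[0][I] := ucLetter;
  for I := Ord('a') to Ord('z') do Stage2[0][I] := ucLetter;
end;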

11 comments:

  1. Note: Unit cUnicodeChar from http://fundementals.sourceforge.net/

  2. What about "case SomeWideChar of"?
    Will that work, or will it be reduced to AnsiChar?

  3. "Any way to make it more dynamically updatable for already compiled applications?"

    Efficiency and flexibility are inversely related: why are you compiling applications if you want them to be dynamic? I realize that you might want some parts to be fast and some to be flexible, but in general, one has to live with the tool one picks: Delphi applications are highly optimized for speed (Object Pascal is optimized for explicit readability and type-checking).

    If your applications "...haven't seen the light of day for 2 years...", why do you need to change them at all? And even if you do, Allen has said that the old code will do what the old code was meant to do.

    If your old applications need to change, then why is it unreasonable for you to have to change them?

    Cheers

  4. Dušan,

    That will continue to work as it does today. A WideChar is an ordinal type and case xxx of works with any ordinal type.

    Allen

  5. btw, please release all the new OTA specifications as soon as possible, since the new IDE is fully Unicode and cannot use current plug-ins written against the previous OTA.

  6. Sounds like a great idea. Will you be easing the transition by providing a D2007 update that adds AnsiChar versions of all these functions? It'd be a lot easier to make our code work with the new functions in D2007 first, and *then* recompile with D2008/UnicodeString.

  7. Maybe it'll be even better:

    Somewhere in the RTL:

    type
      TCharType = (ctWhitespace, ctControl, ctLetter, ctSurrogate);

    Then in function NextWord:

    while (I

  8. Sorry, forbidden character.
    ...Then in function NextWord:

    while (I <= Length(S)) and (CharType(S[I]) in [ctWhitespace, ctControl]) do ...

    And for cases:

    case CharType(S[I]) of
      ctWhitespace, ctControl: ...
      ctLetter: ...
      etc.

  9. Kryvich,

    Yes, there will be an enumeration similar to what you describe, and a function that returns the category of the given character. The categories map directly to those described by Unicode.org (a sketch of what such an enumeration might look like follows the comments).

    Allen.

  10. Note that there are a lot of Unicode helper functions in the Delphi JEDI (JCL) library, taken from the former Delphi-Gem Unicode Library; see "JclWideStrings.pas" and the example .\jcl\examples\windows\widestring\WideStringExample.dpr

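To follow up on the enumeration discussed in comments 7 through 9: the general categories it would map onto are defined by Unicode.org (Lu, Ll, Nd, and so on), so such an enumeration would presumably look something like this (illustrative names only; the shipping identifiers may differ):

type
  TUnicodeCategory = (
    ucUppercaseLetter,       // Lu
    ucLowercaseLetter,       // Ll
    ucTitlecaseLetter,       // Lt
    ucModifierLetter,        // Lm
    ucOtherLetter,           // Lo
    ucNonSpacingMark,        // Mn
    ucSpacingCombiningMark,  // Mc
    ucEnclosingMark,         // Me
    ucDecimalDigitNumber,    // Nd
    ucLetterNumber,          // Nl
    ucOtherNumber,           // No
    ucConnectorPunctuation,  // Pc
    ucDashPunctuation,       // Pd
    ucOpenPunctuation,       // Ps
    ucClosePunctuation,      // Pe
    ucInitialPunctuation,    // Pi
    ucFinalPunctuation,      // Pf
    ucOtherPunctuation,      // Po
    ucMathSymbol,            // Sm
    ucCurrencySymbol,        // Sc
    ucModifierSymbol,        // Sk
    ucOtherSymbol,           // So
    ucSpaceSeparator,        // Zs
    ucLineSeparator,         // Zl
    ucParagraphSeparator,    // Zp
    ucControl,               // Cc
    ucFormat,                // Cf
    ucSurrogate,             // Cs
    ucPrivateUse,            // Co
    ucUnassigned             // Cn
  );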

Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.