Friday, January 11, 2008

Unicode Character Categorization.

Another very common item that will hit many folks is how to easily determine what category a particular Unicode code-point belongs to.  In the simpler time of ASCII (American Standard Code for Information Interchange), and even the slightly more enlightened period of ISO-8859-x, most folks would just assume that if a code-point fell between $41 and $5A, it was an upper-case letter (never mind all those ISO-8859-x characters with diacritics up above $80).  For compilers (and even interpreted and scripting languages) it really was that simple, since most compilers didn't allow identifier characters outside the normal ASCII range of $00-$7F.
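
The classic ASCII-era check described above looks something like this (a minimal sketch; IsAsciiUpper is just an illustrative name, not an RTL function):

function IsAsciiUpper(C: AnsiChar): Boolean;
begin
  // $41..$5A is just 'A'..'Z'; everything above $7F is ignored entirely
  Result := (Ord(C) >= $41) and (Ord(C) <= $5A);
end;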

This just isn't the case anymore.  With many thousands of different code-points defined across the whole of the Unicode specification, simple range tests (and even the more sophisticated set "in" tests) just do not cut it.  You will still be able to do things like this:

function NextWord(const S: string; var I: Integer): string;
var
  Start: Integer;
begin
  // Skip any leading whitespace or control characters
  while (I <= Length(S)) and (S[I] in [#1..#31, ' ']) do Inc(I);
  Start := I;
  // Now gather up the word
  while (I <= Length(S)) and (S[I] in ['A'..'Z', 'a'..'z']) do Inc(I);
  Result := Copy(S, Start, I - Start);
end;

This code works fine for much of the western world... as long as there are no ä, û or similar characters.  Sure, this code could be updated to include those characters, but the meaning of the code points above $80 differs between ISO-8859-1 (Latin-1) and, say, ISO-8859-5 (Cyrillic).  For example, the same code point $E4 is ä in Latin-1 but the Cyrillic Ф in ISO-8859-5.  This doesn't even take into account the far-eastern languages, where the number of code points far exceeds the single-byte range ($00-$FF).  In those cases the encodings have to provide escape maps: a "lead byte" tells the processing code to interpret the next byte using a different table, and there can be many different lead bytes that denote different tables.  That's the bad news.
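
To make that ambiguity concrete, here is a minimal sketch (not from the original code) that uses the Win32 MultiByteToWideChar API to convert the very same byte, $E4, under the two code pages.  The Windows code-page identifiers 28591 (ISO-8859-1) and 28595 (ISO-8859-5) are the only details assumed beyond the text above:

program CodePageDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

var
  Src: AnsiChar;
  Dst: WideChar;
begin
  Src := AnsiChar($E4);

  // Interpreted as ISO-8859-1 (Latin-1): $E4 -> U+00E4 (ä)
  MultiByteToWideChar(28591, 0, @Src, 1, @Dst, 1);
  WriteLn(Format('Latin-1:  U+%.4x', [Ord(Dst)]));

  // Interpreted as ISO-8859-5 (Cyrillic): $E4 -> a Cyrillic letter
  MultiByteToWideChar(28595, 0, @Src, 1, @Dst, 1);
  WriteLn(Format('Cyrillic: U+%.4x', [Ord(Dst)]));
end.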

Now for the good news.  First of all, with Tiburón the above code will compile just fine (you'll get a "WideChar reduced to byte char in set expressions" warning on the "in" expressions) and will actually function as originally designed.  The compiler still generates code to ensure that the WideChar at S[I] is within the byte range of the set (Object Pascal sets can have at most 256 elements and can be a maximum of 32 bytes in size), so you won't get any false positives if the string contains, say, Ł (U+0141).  Also, if you are OK with how the above works, still don't need this code to accept the full gamut of Unicode code-points, and just want to suppress the warning, add the {$WARN WIDECHAR_REDUCED OFF} directive to the source, or use the -W-WIDECHAR_REDUCED command-line switch.  The compiler told you about a potential problem, and you now have an idea of how to fix it.
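
For example, if you decide to keep the byte-oriented behavior, the directive can be scoped to just the code that needs it (a minimal sketch; IsAsciiWordChar is a hypothetical helper, not something in the RTL):

{$WARN WIDECHAR_REDUCED OFF}
function IsAsciiWordChar(C: Char): Boolean;
begin
  // Even with Char being WideChar, the compiler still range-checks C
  // before the set test, so Ł (U+0141) correctly yields False here.
  Result := C in ['A'..'Z', 'a'..'z'];
end;
{$WARN WIDECHAR_REDUCED ON}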

But what if you really wanted to make this code a little more Unicode-friendly?  With the many thousands of Unicode code-points, it is simply impractical to provide a huge set holding more than 256 elements.  That is where we'll be introducing a full gamut of character categorization functions.  To get an idea of what I'm talking about, we've modeled these functions after a similar set of functions in the Microsoft .NET framework; take a look at the System.Char class, which has many static functions such as IsLetter, IsNumber, IsPunctuation, etc.  How could the above code be written to be more Unicode-friendly?

function NextWord(const S: string; var I: Integer): string;
var
  Start: Integer;
begin
  // Skip any leading whitespace or control characters
  while (I <= Length(S)) and (IsWhitespace(S, I) or IsControl(S, I)) do Inc(I);
  Start := I;
  // Now gather up the word
  while (I <= Length(S)) and IsLetter(S, I) do
  begin
    if IsSurrogate(S, I) then Inc(I); // a surrogate pair occupies two Chars; skip the extra element

    Inc(I);
  end;
  Result := Copy(S, Start, I - Start);
end;
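
A quick usage sketch, assuming the categorization functions behave as described above (the sample string and calling pattern are mine, not from the original post):

procedure ShowWords;
var
  S, W: string;
  I: Integer;
begin
  S := 'Grüße from Tiburón';
  I := 1;
  repeat
    W := NextWord(S, I);
    if W <> '' then
      WriteLn(W);   // Grüße, from, Tiburón
  until I > Length(S);
end;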

The implementation of these functions is fully table-driven and is derived directly from the data files provided by Unicode.org.  The data is processed, compressed, and then accessed using very fast table look-ups.  This processing of the tables is done during our build process, which generates a resource that is then linked in.  It is based on the latest version (5.0) of the Unicode specification, so if that ever changes, all we'll need to do is get the latest data files and process them.
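
For the curious, the general shape of such a table-driven lookup is a staged index rather than one enormous array: the high bits of the code point select a block, the low bits index within it, and identical blocks can be shared to keep the data small.  A rough sketch of the idea (the type and table names here are invented for illustration and are not the actual Tiburón implementation):

type
  TUnicodeCategory = (ucControl, ucWhitespace, ucLetter, ucNumber,
    ucPunctuation, ucSymbol, ucSurrogate, ucOther);

const
  BlockCount = 64; // hypothetical number of distinct 256-entry blocks

var
  // Filled from the generated resource at load time
  BlockIndex: array[0..255] of Byte;
  Blocks: array[0..BlockCount - 1, 0..255] of TUnicodeCategory;

function GetCategory(C: WideChar): TUnicodeCategory;
begin
  // High byte picks the block, low byte picks the entry within it
  Result := Blocks[BlockIndex[Ord(C) shr 8], Ord(C) and $FF];
end;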