Wednesday, July 16, 2008

Tiburón - String Theory

No, not that String Theory, or even this one. What this is about is an interesting extension to AnsiString. During the field-test cycle of Tiburón and our own internal porting of the IDE code (which was accomplished in about 1.5 months, by 2-3 folks, across > 2 million LOC), it became clear that there was a need for easily encoding UTF16 character data as UTF8. The astute among you probably already know about the RTL-defined UTF8String, which is really just an alias for AnsiString. The problem is that it is UTF8 in name only. Unless you explicitly ensured that only UTF8 data was placed into the payload, it could just as easily hold normal Ansi character data. We needed to make the use of UTF8String easier. As we looked at how AnsiString worked, it was clear that AnsiString always had this "affinity" for carrying its data payload encoded as whatever the RTL had determined, at runtime, was the active code page. So we wondered: "What if we could create an AnsiString type where the programmer determined, at compile time, the code page affinity for that AnsiString?"

It turns out that, to Windows, UTF8 encoding is just another code page. Down in the RTL, the conversions to/from UnicodeString or WideString use the Windows API functions WideCharToMultiByte() and MultiByteToWideChar(). One of the parameters is the code page identifier to or from which the data is converted. If you pass CP_UTF8 (65001) to those functions, they'll convert between UTF16 and UTF8. This is a lossless conversion. In Tiburón, we're introducing an enhancement to declaring your own "typed" AnsiString. You have always been able to create a unique type based on any intrinsic type by declaring the new type with the "typed type" syntax:

MyString = type AnsiString;

This would create a new type that is assignment compatible* with AnsiString, but with a unique type name and a unique RTTI structure. We've used this in the VCL to distinguish normal "strings" from special strings such as TFileName or TCaption. By creating these unique string types, it was possible to create property editors that would be associated with a specific type of property, regardless of which component it was used on or what the property name was. This is how the Caption property on many components automatically updates the live design-time component as you type in the Caption value.

The thing is, with the above declaration, MyString will continue to have an affinity for whatever the current runtime code page is. So, we introduced the following syntax, for AnsiString only:

MyString = type AnsiString(<1..65534>);

You can now control the code page affinity of any "typed type" AnsiString at compile time. The "parameter" to AnsiString must be a Word constant expression. The values 0 and 65535 have special meanings: 0 yields a normal AnsiString, and 65535 ($FFFF) means "no affinity." The $FFFF variant is worth noting here, as it is already declared for you as RawByteString. When assigning between AnsiStrings or passing them as parameters, if the code page affinities of the source and destination strings differ, an automatic conversion is done. In order to minimize potential data loss during the conversion, all conversions go "through" UTF16. However, a string with an affinity of $FFFF tells both the compiler and the RTL that none of these conversions should be done and to just move the payload over. In practice, there should be only a few instances where you need RawByteString, but it is there for your use.

So we now have the following declarations in the System unit:

UTF8String = type AnsiString(CP_UTF8);
RawByteString = type AnsiString($FFFF);
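
As an illustration (these type names are hypothetical, not declarations that ship in the RTL), the same syntax lets you pin a string type to any Windows code page identifier:

Latin1String = type AnsiString(28591);  // ISO 8859-1
ShiftJISString = type AnsiString(932);  // Japanese Shift-JIS

Assigning a ShiftJISString to a Latin1String would then automatically transcode the payload, going through UTF16 as described above.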

Like I stated previously, on any assignment (or when passing as a parameter to a procedure or function) where the code page affinities of the source and destination differ, the payload will automatically be converted. Say you have a function that must only take UTF8 data. You can declare it like this:

procedure WriteUTF8Data(const Data: UTF8String);
// write UTF8 data to stream, file, socket, etc...

Now, no matter what type of string you pass to this procedure, you know that the payload will have been coerced into UTF8. Pass a normal AnsiString, and the data arrives as the UTF8 version of that string, converted from whatever the active code page was or whatever that AnsiString's affinity was set to. Pass a UnicodeString or WideString to the function, and it too will be converted to UTF8. Pretty cool, no?
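
For example, a call-site sketch (WriteUTF8Data as declared above; the variable names are just for illustration):

var
  A: AnsiString;     // affinity: the active code page
  U: UnicodeString;  // UTF16
begin
  A := 'Déjà vu';
  U := 'Déjà vu';
  WriteUTF8Data(A);  // payload converted: active code page -> UTF16 -> UTF8
  WriteUTF8Data(U);  // payload converted: UTF16 -> UTF8
end;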

With the title to this post... All those physicists out there are going to hate me and Google.. hehe ;-)


*"Assignment compatible" means that you can assign or pass as a parameter from one "typed type" to another "typed type" or the intrinsic type on which it was based. They are not "var/out parameter" compatible. This means you can pass them by value to a function but not by reference.


  1. So, if a string actually contains bytes instead of text characters one should use RawByteString rather than AnsiString, is that correct?

  2. This looks pretty nice, indeed. Well done.

  3. Giel,
    That would be one case, yes. However, you still need to be careful about passing it around. If any conversions happen along the code path, you run the risk of corruption. As long as you're in control of the whole call chain, it should be perfectly safe.

  4. Sounds great - but what's the C++ Builder syntax like for it?

  5. It's an interesting theory. But I hardly think that ANSI strings in different code pages will be used in an application _simultaneously_. I assume that only AnsiString(0), AnsiString(65001) and AnsiString(65535) will be used widely across an application.

    I have a question. The code:

    AnsiStr1, AnsiStr2: AnsiString;
    AnsiStr1 := AnsiStr2;

    Whether the compiler will insert the check for equality of code pages in this case?

  6. Roddy, the C++ syntax has not been worked out yet, as far as I know.

  7. Kryvich,

    > Whether the compiler will insert the check for equality of
    > code pages in this case?

    That has always, and always will, end up calling _LStrAsg() in the RTL, which is where this check is performed. So, yes, in that case there is a check to ensure the source and destination code pages are the same, and if not, a conversion is done.


  8. Kryvich,

    My understanding is that this would be done at compile time, so since the two types are already in the same code page, the system would not add any new code to your program. The beauty of this change is that if you need it, it's available; if you don't, it's carry on as normal.

    I do commend Allen and the rest of the gang for a great solution here. This "theory" has such a good natural feel to it.

  9. This sounds really cool. But like Steven, I figured this was something figured out by the compiler, not something handled at run time. After all, doesn't the compiler already know exactly which code page is associated with any given string variable?

    I expected the compiler would turn

    s1 := s2;

    into any of a few different RTL calls based on its knowledge of the code pages associated with those variables. If the code pages are the same, then it's simply a call to the same old _LStrAsg:

    _LStrAsg(s1, s2);

    If they're different, then it would be a call to some other function, like this:

    _LStrAsgCP(s1, s2, CodePageOf(s1), CodePageOf(s2));

    where CodePageOf would be a compile-time function just like SizeOf.

    So, since that's apparently not how it works, can you talk about some of the reasons behind the implementation?

    I can think of a reason: when the code page is unknown. Suppose you read some data into a RawByteString, and later you learn what the code page should be. If the code page is strictly a compile-time attribute, then you'd need to have 65536 string variables available so you could assign the RawByteString into the variable with the right code page. Ugh. So you'd need to have some way of setting the code page after the fact. That sounds easy enough. But I have one question: What's the type of the variable that you use to hold the string, now that you know its code page? Certainly not any of the types you've mentioned so far, since those all have code pages associated with them in advance.

    This new feature carries a change in the internal representation of a string, doesn't it? It's no longer just the reference count, length, character data, and null byte, is it? I guess the System.StrRec type is updated with the new structure.

  10. Seems like good theory, but please make some compiler directive to disable implicit ANSI string conversions for compatibility sake.

  11. And one question, to make the theory more clear. If I declare

    RawBytes: RawByteString;

    procedure WriteUTF8Data(const Data: UTF8String);

    and pass RawBytes to WriteUTF8Data, the Data will not be converted to UTF8?

  12. Allen, thank you for the answer!

    So, it's not a compile-time check but a runtime check for code pages. Every time a string assignment takes place, the program will check the code pages of the strings.

  13. And finally I would like to say - the theory must be verified on LHC asap :)

  14. "So, yes, in that case there is a check to ensure the source and destination code pages are the same and if not, a conversion is done."

    Out of curiosity - what is converted, and to what? For example, Shift-JIS also contains Russian characters, but not the other way around - so if you compared a Shift-JIS string to a KOI8R string, that could generate different results depending on the implementation of the conversion.

    (I'm guessing both are converted to UTF-*, but you never know...)

  15. This means that any string which we'll post in a DB using descendants of TStringField through .AsString will be converted to Unicode?

    I'd rather expect to have TDataSet.Encoding: TEncoding, which in the case of 'nil' will point to the app code page, and TField.Encoding: TEncoding, which in the case of 'nil' will point to TDataSet.Encoding. This of course for string/memo/blob fields... Also, an action in TDataSet's popup menu to 'Set the [x] encoding for the fields...' would be welcomed.

    Another question: what about a 'Tiburon - Char theory' and 'Tiburon - PChar/PAnsiString theory'?

  16. Michael, Allen said that all conversions go through UTF-16. So if you're assigning a Shift JIS string to a Russian string, it would go like this hypothetical code for pre-Tiburón versions:

    ws: WideString;
    ja: AnsiString {(932)};
    ru: AnsiString {(1251)};

    ja := GetJapaneseString();
    ws := AnsiToWide(ja, 932);
    ru := WideToAnsi(ws, 1251);

    In particular, what _doesn't_ happen is a direct transcoding from one Ansi code page to another:

    ru := AnsiToAnsi(ja, 932, 1251); // No.

    Two reasons for avoiding that: First, the OS doesn't provide anything for it; it only supports MultiByteToWideChar and WideCharToMultiByte. Second, it would be a nightmare to write converters for every possible combination of code pages. Much easier to just provide conversions between each code page and UTF-16.

  17. Rob, now that I read your answer, I realize I posted the wrong question :)

    What I was *actually* wondering is what happens if you *compare* the two.

    ja: AnsiString(932);
    ru: AnsiString(1251);

    ja := 'あいうえおАБВГД';
    ru := '?????АБВГД';
    if ru = ja then

    Unless both ja and ru are converted in memory to UTF-16 for the purpose of comparison (they *should* be), certain situations might get you incorrect results depending on the exact conversion logic - if, for instance, the second operand is converted to the same codepage as the first operand, then the above code would consider the strings as being identical, since the unknown characters would end up as question marks (however, it would NOT be considered the same the other way around).

  18. Hi Allen!

    I agree with the basic principle, because it makes code page conversion a simple type conversion, which means all the existing rules for type conversion, overloaded procedure selection can apply. However, I do not agree with some details.

    Conversion from Unicode to codepages is lossy, and it's not clear what should happen with unrepresentable characters. You may want to throw exceptions, replace unrepresentable characters with wildcard characters, or even do transliterations.

    Note you can assign integers to reals, because this is a straightforward, mostly lossless conversion, but not reals to integers, because that is lossy and it is not clear what should happen (round or trunc). Explicit conversion functions exist for this. Besides making the programmer specify the desired behaviour, these procedures also serve as a warning that something gets lost. You don't want to lose a spacecraft because the programmer didn't know some procedure, only called in special circumstances, accepts an integer, and the real got automatically truncated so the wrong rockets are fired.

    Automatic conversion from Unicode to a code page will cause the same kind of failures. Further, you will be forced to either make the conversion configurable or add conversion functions (for people wanting exceptions rather than wildcards). Explicit conversion from Unicode to a code page will be both safer and more understandable for people in the end.

    The second thing I don't agree with is the use of integers to describe a code page. With integer constants, AnsiString(CP_UTF8) will appear as AnsiString(65001) in the debugger, error messages, and basically every type display except the original source code, making it harder to understand what code page type a string is while debugging. Also, +, -, *, and div make no sense for code page numbers. A code page identifier should be an enumeration type.

  19. Daniël,

    Yes, conversion from Unicode to code pages can be lossy. The compiler will issue a warning about any implicit conversions that are potentially lossy. It will also give a warning, with less dire consequences, when going the other way around (code page -> Unicode = lossless). The intent is that these conversions would be "worked out" of the application and only exist at the boundaries for legacy I/O purposes.
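
    For instance, something like the following should draw the lossy-conversion warning (a sketch; the exact warning text is not settled here):

    var
      U: UnicodeString;
      A: AnsiString;
    begin
      U := 'πφγ';
      A := U;  // warning: implicit conversion with potential data loss
    end;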

    Internally, the RTL uses the Windows API and will use the code page value directly, which is simply defined as an integer value. If we used an enum, it would not be extensible and would burn in a fixed set of valid code pages. Customers would have to get an update (a breaking update at that) in order to use any new code pages or encodings. Also, using an enum would require a lookup table to translate a given enum element to a code page value before each call to Windows.


  20. Michael Madsen:

    If you do this:
    ja := 'あいうえおАБВГД';
    actually you will get
    ja := 'あいうえお?????';
    because ANSI 932 codepage doesn't contain Russian letters 'АБВГД'.

    Rob Kennedy:

    It is senseless to assign a Shift JIS string to a Russian string, because the ANSI 1251 code page doesn't contain Japanese letters. You will get only '??????????'.

  21. Kryvich:
    CP932 (and Shift-JIS) can express Cyrillic (and Greek, for that matter) letters just fine - it's often used in so-called "SJIS art".

    Microsoft's reference page for cp932, lead byte 0x84:

    As for whether or not it's senseless - in most cases, yes, but that doesn't mean someone out there won't need it for some special case. How it will then be handled is good to know.

  22. Michael Madsen -
    CP932 can express Cyrillic - I didn't know it, thanks for the tip.

    Anyway, it's a one-way conversion, because the ANSI 1251 Cyrillic code page doesn't contain Japanese and Greek characters.

  23. Hi,

    What about ShortString types (like "string[70]") - how will these get converted?


    sUnicode : String;
    sShort : String[70];

    sShort:='abc'; // which code page context is being used here?

  24. Any word on how the conversion between a ShortString & UnicodeString is handled?

  25. HS,

    For ShortString conversions, they will remain tied to the active codepage. If you want to encode them differently, you'll need to manually handle that task.
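
    In other words, roughly this (a sketch using the existing UTF8Encode routine):

    var
      Short: string[70];
      U8: UTF8String;
    begin
      Short := 'abc';                   // stored in the active code page
      U8 := UTF8Encode(string(Short));  // manual conversion to UTF8
    end;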


  26. Allen,

    Makes sense - thanks!

  27. This may be a dumb question, but what is the default definition of just plain "String"?

  28. Two questions:

    1. I take it the internal memory layout of string variables will have to change? Currently there's a length and a refcount stored at negative indexes from the pointer -- is there now a code page as well?

    2. Will TReader/TWriter have methods for reading and writing RawByteStrings without converting them to UTF-8?

  29. [...] there are places that point to links, like for example a comment from this external blog. We broke any/all of those people. And with the non-human readable links that the new site was [...]

  30. [...] there are places that point to links, like for example a comment from this external blog. We broke any/all of those people. And with the non-human readable links that the new site was [...]

