Wednesday, July 16, 2008

Tiburón - String Theory

No, not that String Theory, or even this one. What this is about is an interesting extension to AnsiString. During the field test cycle of Tiburón and our own internal porting of the IDE code (which was accomplished in about 1.5 months, by 2-3 folks, with > 2 million LOC), it became clear that there was a need for easily encoding UTF16 character data as UTF8. The astute among you probably already know about the RTL-defined UTF8String, which is really just an alias for AnsiString. The problem is that it is UTF8 in name only. Unless you explicitly ensured that only UTF8 data was placed into the payload, it could just as easily hold normal Ansi character data.

We needed to make UTF8String easier to use. As we looked at how AnsiString worked, it was clear that AnsiString always had this "affinity" to carry its data payload encoded as whatever the RTL had determined, at runtime, was the active code page. So we wondered, "what if we could create an AnsiString type where the programmer determined the code page affinity at compile time?"

It turns out that, to Windows, UTF8 encoding is just another code page. Down in the RTL, the conversions to/from UnicodeString or WideString use the Windows API functions WideCharToMultiByte() and MultiByteToWideChar(). One of the parameters is the code page identifier to or from which the data is converted. If you pass CP_UTF8 (65001) to those functions, they'll convert between UTF16 and UTF8. This is a lossless conversion. In Tiburón, we're introducing an enhancement to declaring your own "typed" AnsiString. You have always been able to create a unique type based on any intrinsic type by declaring the new type with the "typed type" syntax:

MyString = type AnsiString;

This would create a new type that is assignment compatible* with AnsiString, but with a unique type name and a unique RTTI structure. We've used this in VCL to distinguish normal "strings" from special strings such as TFileName or TCaption. By creating these unique string types, it was possible to create property editors that would be associated with a specific type of property, regardless of which component it was used on or what the property name was. This is how only the Caption property on many components will automatically update the live design-time component as you type in the Caption value.
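To make the "unique RTTI, still assignment compatible" point concrete, here is a minimal console sketch. The program name and variable names are made up for illustration; TypeInfo and PTypeInfo are the standard RTL facilities from the System and TypInfo units.

```delphi
program TypedTypeDemo;
{$APPTYPE CONSOLE}

uses
  TypInfo;

type
  MyString = type AnsiString; // a "typed type" with its own RTTI

var
  A: AnsiString;
  M: MyString;
begin
  A := 'hello';
  M := A; // assignment compatible in one direction...
  A := M; // ...and in the other

  // Each type has its own RTTI record, so the two pointers differ.
  // That difference is what lets a property editor be registered
  // against one specific string type.
  if TypeInfo(MyString) <> TypeInfo(AnsiString) then
    Writeln('MyString carries its own RTTI: ',
      PTypeInfo(TypeInfo(MyString))^.Name);
end.
```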

The thing is, with the above declaration, MyString will continue to have an affinity for whatever the current runtime code page is. So, we introduced the following syntax for AnsiString only:

MyString = type AnsiString(<1..65534>);

You can now control the code page affinity of any "typed type" AnsiString at compile time. The "parameter" to AnsiString must be a word constant expression. The values 0 and 65535 have special meanings. 0 is a normal AnsiString, and 65535 ($FFFF) means "no affinity." Note that an AnsiString with an affinity of $FFFF is already declared for you as RawByteString. When assigning between AnsiStrings or passing them as parameters, if the code page affinities of the source and destination strings differ, an automatic conversion is done. In order to minimize potential data loss during the conversion, all conversions go "through" UTF16. However, a string with an affinity of $FFFF tells both the compiler and the RTL that none of these conversions should be done and that the payload should just be moved over. In practice, there will be only a few instances where you need RawByteString, but it is there for your use.
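Here is a rough sketch of what that automatic conversion looks like in practice. The type names WesternString and CyrillicString and the helper PayloadHex are hypothetical, chosen only for this example; StringCodePage is the RTL function that reports the code page stored with a string. The euro sign is a handy test character because it is byte $80 in Windows-1252 and byte $88 in Windows-1251.

```delphi
program CodePageAffinityDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical names: 1252 = Windows Western European, 1251 = Windows Cyrillic.
  WesternString  = type AnsiString(1252);
  CyrillicString = type AnsiString(1251);

// Renders the payload byte by byte. Taking a RawByteString means the
// argument arrives as-is, whatever its code page affinity.
function PayloadHex(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2);
end;

var
  U: UnicodeString;
  W: WesternString;
  C: CyrillicString;
  R: RawByteString;
begin
  U := #$20AC;           // euro sign as UTF16
  W := WesternString(U); // UTF16 -> code page 1252
  C := W;                // affinities differ: 1252 -> UTF16 -> 1251
  R := C;                // no affinity: payload is moved over untouched

  Writeln('1252 payload: ', PayloadHex(W));
  Writeln('1251 payload: ', PayloadHex(C));
  Writeln('raw payload:  ', PayloadHex(R),
    ' (code page ', StringCodePage(R), ')');
end.
```

The two typed strings should hold different bytes for the same character, while the RawByteString should keep both the bytes and the code page of whatever was assigned to it.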

So we now have the following declarations in the System unit:

UTF8String = type AnsiString(CP_UTF8);
RawByteString = type AnsiString($FFFF);

As I stated previously, on any assignment (or when passing a parameter to a procedure or function) where the code page affinities of the source and destination differ, the payload will automatically be converted. Say you have a procedure that must only take UTF8 data. You can declare it like this:

procedure WriteUTF8Data(const Data: UTF8String);
// write UTF8 data to stream, file, socket, etc...

Now, no matter what type of string you pass to this procedure, you know that the payload will have been coerced into UTF8. Pass a normal AnsiString, and the data arrives as the UTF8 version of that AnsiString, converted from whatever the active code page was or whatever that AnsiString's affinity was set to. Pass a UnicodeString or WideString to the procedure and it too will be converted to UTF8. Pretty cool, no?
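As a quick illustration, the sketch below fills in a body for WriteUTF8Data that simply dumps the bytes it receives as hex. That body, and the helper Utf8Hex, are stand-ins invented for this example; a real implementation would write to a stream, file, or socket instead.

```delphi
program Utf8ParamDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: formats the UTF8 payload as space-separated hex bytes.
function Utf8Hex(const Data: UTF8String): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(Data) do
  begin
    if I > 1 then
      Result := Result + ' ';
    Result := Result + IntToHex(Ord(Data[I]), 2);
  end;
end;

// Stand-in body: prints the bytes instead of writing them anywhere.
procedure WriteUTF8Data(const Data: UTF8String);
begin
  Writeln(Utf8Hex(Data));
end;

var
  U: UnicodeString;
  W: WideString;
  A: AnsiString;
begin
  U := #$00E9;        // e with acute accent, as UTF16
  W := U;
  A := AnsiString(U); // lossy if the active code page has no such character

  // The compiler may warn about the implicit string casts below;
  // the conversions themselves are inserted automatically.
  WriteUTF8Data(U);   // UTF16 -> UTF8, i.e. the two bytes C3 A9
  WriteUTF8Data(W);   // WideString is converted the same way
  WriteUTF8Data(A);   // active code page -> UTF16 -> UTF8
end.
```

What the AnsiString call prints depends on the machine's active code page, which is exactly the uncertainty that a fixed code page affinity removes.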

With the title to this post... All those physicists out there are going to hate me and Google.. hehe ;-)


*"Assignment compatible" means that you can assign or pass as a parameter from one "typed type" to another "typed type" or the intrinsic type on which it was based. They are not "var/out parameter" compatible. This means you can pass them by value to a function but not by reference.
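A small sketch of that distinction, with made-up procedure names: the by-value call compiles, while the commented-out by-reference call is the kind of thing the compiler rejects (the message should be E2033, "Types of actual and formal var parameters must be identical").

```delphi
program VarParamDemo;
{$APPTYPE CONSOLE}

type
  MyString = type AnsiString;

// Hypothetical routines, for illustration only.
function ByValue(S: AnsiString): Integer;
begin
  Result := Length(S);
end;

procedure ByRef(var S: AnsiString);
begin
  S := S + '!';
end;

var
  M: MyString;
  A: AnsiString;
begin
  M := 'abc';
  Writeln(ByValue(M)); // fine: passed by value, so assignment compatibility is enough

  // ByRef(M);         // rejected: a var parameter needs the identical type

  // Workaround: go through a variable of the exact parameter type.
  A := M;
  ByRef(A);
  M := A;
  Writeln(M);
end.
```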