Wednesday, July 16, 2008

Tiburón - String Theory

No, not that String Theory, or even this one. What this is about is an interesting extension to AnsiString. During the field test cycle of Tiburón and our own internal porting of the IDE code (more than 2 million LOC, ported in about 1.5 months by 2-3 folks), it became clear that there was a need for easily encoding UTF16 character data as UTF8. The astute among you probably already know about the RTL-defined UTF8String, which is really just an alias for AnsiString. The problem is that it is UTF8 in name only. Unless you explicitly ensured that only UTF8 data was placed into the payload, it could just as easily hold normal Ansi character data. We needed to make the use of UTF8String easier. As we looked at how AnsiString worked, it was clear that AnsiString always had this "affinity" to carry its data payload encoded as whatever the RTL had determined was the active code page at runtime. So we wondered, "what if we could create an AnsiString type where the programmer determined, at compile time, the code page affinity for AnsiString?"

It turns out that, to Windows, UTF8 encoding is just another code page. Down in the RTL, the conversions to/from UnicodeString or WideString use the Windows API functions WideCharToMultiByte() and MultiByteToWideChar(). One of the parameters is the code page identifier to or from which the data is converted. If you pass CP_UTF8 (65001) to those functions, they'll convert between UTF16 and UTF8. This is a lossless conversion. In Tiburón, we're introducing an enhancement to declaring your own "typed" AnsiString. You have always been able to create a unique type based on any intrinsic type by declaring the new type with the "typed type" syntax:

MyString = type AnsiString;

This would create a new type that is assignment compatible* with AnsiString, but with a unique type name and a unique RTTI structure. We've used this in VCL to distinguish normal "strings" from special strings such as TFileName or TCaption. By creating these unique string types, it was possible to create property editors that would be associated with a specific type of property, regardless of which component it was used on or what the property name was. This is how the Caption property on many components, and only that property, automatically updates the live design-time component as you type in the Caption value.

The thing is, with the above declaration, MyString will continue to always have an affinity for whatever the current runtime code page is. So, we introduced the following syntax for AnsiString only:

MyString = type AnsiString(<1..65534>);

You can now control the code page affinity of any "typed type" AnsiString at compile time. The "parameter" to AnsiString must be a word constant expression. The values 0 and 65535 have special meanings. 0 is a normal AnsiString, and 65535 ($FFFF) means "no affinity." Note that an AnsiString with an affinity of $FFFF is already declared for you as RawByteString. When assigning between AnsiStrings or passing them as parameters, if the code page affinities of the source and destination strings differ, an automatic conversion is done. In order to minimize potential data loss during the conversion, all conversions go "through" UTF16. However, a string with an affinity of $FFFF tells both the compiler and the RTL that none of these conversions should be done and that the payload should just be moved over as-is. In practice, there should be only a few instances where you need RawByteString, but it is there for your use.
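To make the affinity rules a bit more concrete, here is a minimal sketch of a console program. The type name Latin1String and the choice of code page 1252 are only assumptions for illustration; they are not declared anywhere in the RTL. StringCodePage is the System routine that reports the code page a payload carries.

```pascal
program CodePageAffinity;
{$APPTYPE CONSOLE}

type
  // Hypothetical type for illustration: an AnsiString with a fixed
  // affinity for Windows code page 1252 (Western European).
  Latin1String = type AnsiString(1252);

var
  U: UnicodeString;
  L: Latin1String;
  U8: UTF8String;
  Raw: RawByteString;
begin
  U := 'caf'#$00E9;        // "cafe" with an accented e, 4 UTF16 code units

  L := Latin1String(U);    // UTF16 -> code page 1252
  Writeln(Length(L));      // 4: one byte per character in 1252

  U8 := L;                 // affinities differ, so the RTL converts
                           // 1252 -> UTF16 -> UTF8 (the compiler may
                           // warn about the implicit string cast)
  Writeln(Length(U8));     // 5: the accented e takes two bytes in UTF8

  Raw := U8;               // $FFFF affinity: payload is moved as-is
  Writeln(StringCodePage(Raw)); // code page carried by the payload
end.
```

Note that Length on these types counts bytes in the payload, not characters, which is why the UTF8 version comes out one longer.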

So we now have the following declarations in the System unit:

UTF8String = type AnsiString(CP_UTF8);
RawByteString = type AnsiString($FFFF);
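For the curious, this is roughly what the conversion into a UTF8String boils down to at the Windows API level. It is only a sketch of the usual two-call WideCharToMultiByte() pattern, not the actual RTL source; the function name EncodeUtf8 is made up for this example.

```pascal
program Utf8ViaWinApi;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: encodes UTF16 data as UTF8 by calling the
// Windows API directly, the same API the RTL conversions rely on.
function EncodeUtf8(const S: UnicodeString): RawByteString;
var
  Len: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // First call: ask how many bytes the UTF8 form needs.
  Len := WideCharToMultiByte(CP_UTF8, 0, PWideChar(S), Length(S),
    nil, 0, nil, nil);
  if Len <= 0 then
    Exit;
  SetLength(Result, Len);
  // Second call: perform the conversion into the buffer.
  WideCharToMultiByte(CP_UTF8, 0, PWideChar(S), Length(S),
    PAnsiChar(Result), Len, nil, nil);
  // Tag the payload as UTF8 without converting it again.
  SetCodePage(Result, CP_UTF8, False);
end;

var
  Encoded: RawByteString;
begin
  Encoded := EncodeUtf8('caf'#$00E9);
  Writeln(Length(Encoded));   // 5 bytes for 4 characters
end.
```

With UTF8String declared as above, you never have to write this yourself; a plain assignment does the same job.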

As I stated previously, on any assignment (or when passing a parameter to a procedure or function) where the code page affinities of the source and destination differ, the payload will automatically be converted. Say you have a function that must only take UTF8 data. You can declare it like this:

procedure WriteUTF8Data(const Data: UTF8String);
// write UTF8 data to stream, file, socket, etc...

Now, no matter what type of string you pass to this procedure, you know that the payload will have been coerced into UTF8. Pass a normal AnsiString, and the data arrives as the UTF8 version of that AnsiString, converted from whatever the active code page was or whatever that AnsiString's affinity was set to. Pass a UnicodeString or WideString to the procedure and it too will be converted to UTF8. Pretty cool, no?
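Here is one way such a procedure might be fleshed out. This is a sketch only: the extra Stream parameter and the use of TMemoryStream are assumptions added so the example can run on its own, and the unit name Classes is the unscoped name from this era of Delphi.

```pascal
program WriteUtf8Demo;
{$APPTYPE CONSOLE}

uses
  Classes;

// Variation on the declaration above; the Stream parameter is an
// assumption added for this example.
procedure WriteUTF8Data(Stream: TStream; const Data: UTF8String);
begin
  // By the time we get here, Data is already UTF8, whatever the
  // caller passed in.
  if Length(Data) > 0 then
    Stream.WriteBuffer(Data[1], Length(Data));
end;

var
  MS: TMemoryStream;
  U: UnicodeString;
  A: AnsiString;
begin
  MS := TMemoryStream.Create;
  try
    U := 'na'#$00EF've';      // 5 UTF16 code units
    WriteUTF8Data(MS, U);     // UTF16 -> UTF8 at the call site
                              // (the compiler may warn about the
                              // implicit string cast)
    Writeln(MS.Size);         // 6: the accented i takes two bytes

    A := 'plain text';        // ASCII only, so safe in any code page
    WriteUTF8Data(MS, A);     // active code page -> UTF16 -> UTF8
    Writeln(MS.Size);         // 16: ten more single-byte characters
  finally
    MS.Free;
  end;
end.
```

The body of the procedure never has to check or convert anything; the parameter type does all the work.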

With the title to this post... All those physicists out there are going to hate me and Google.. hehe ;-)


*"Assignment compatible" means that you can assign or pass as a parameter from one "typed type" to another "typed type" or the intrinsic type on which it was based. They are not "var/out parameter" compatible. This means you can pass them by value to a function but not by reference.
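A small sketch of what that footnote means in practice. The procedure names are made up for the example, and the commented-out call is the one I would expect the compiler to reject because the actual and formal var parameter types are not identical.

```pascal
program TypedTypeCompat;
{$APPTYPE CONSOLE}

type
  MyString = type AnsiString;   // same declaration as in the post

// Hypothetical procedures for illustration only.
procedure ByValue(S: AnsiString);
begin
  Writeln(Length(S));
end;

procedure ByReference(var S: AnsiString);
begin
  S := S + '!';
end;

var
  M: MyString;
  A: AnsiString;
begin
  M := 'hello';
  A := M;             // assignment between the two types is fine
  ByValue(M);         // passing by value is fine too; prints 5
  ByReference(A);     // exact type match, so this compiles
  // ByReference(M);  // not var-parameter compatible: expected to fail
                      // to compile if uncommented
  Writeln(A);         // prints hello!
end.
```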

Tuesday, July 1, 2008

Brand New Day...

Well, my access card worked.  I guess I still have a job :-).

I just finished listening to a company-wide (Embarcadero, not Borland) conference call announcing the closure of the Embarcadero+CodeGear deal. Wayne Williams, our new boss, made some very encouraging statements. Most notable was that, just like ER/Studio, RapidSQL, and DBArtisan, Delphi/C++Builder are core Embarcadero product offerings. This means that these products are key to the business. Yes, there are many other products being sold, incubated and introduced, but it is the aforementioned products that form the pillars on which the company is based. Without them, many of the other products would not be possible. Like the foundation of a home, you just don't take a metaphoric jackhammer to them and expect the structure (the company) to remain sound.

Another encouraging (or maybe scary, depending upon your perspective) point was that we are the last independent tools vendor with the breadth of offerings we have out there. This means no vendor or stack lock-in. We have tools for nearly every database and OS platform out there. We are also one of the very few software companies that offer very strong non-Open Source tools right alongside tools built either on or for Open Source stacks. JBuilder and 3rdRail are built on top of the very popular Eclipse framework. Delphi for PHP and 3rdRail are built for the very popular and widely used Open Source PHP and Ruby/Rails environments, respectively.

How things will change or even stay the same is still being planned and scoped. A lot of work had been done between the announcement of this deal and its close, but now is when most of the work can actually take place. Now that we're no longer joined to Borland, we can chart a new course under a new captain. I wish all the best for Borland, as I've seen many happy and exciting days while there.

An interesting anomaly is that many of the CodeGear folks have had their service bridged, which means that some of us have, on paper, now worked for Embarcadero longer than it has existed :-). This is now day 6022 for me.