Wednesday, January 9, 2008

DPL & Unicode - a toss up.

So far it's looking like a toss-up between folks wanting more information on the Delphi Parallel Library and those wanting more information about the shift to Unicode.  I think both are extremely important and it is no surprise given the feedback.  Since it is still not clear whether or not DPL will make it into the next release, I may opt to begin talking more about Unicode... then again, maybe not :-).

Right now, I'm wrestling with some compiler issues related to debugging when a generic type is instantiated... needless to say it's making the work on DPL a little tough.  This is par for the course when you're trying to retrofit the airplane while it is still in flight :-).  If it takes more than a few days before this is resolved, I'll probably jump back over to Unicode.  That area is working and the team is full speed ahead on it.

Just to clear some things up, I'm going to answer a few of the common questions folks have about the move to Unicode.

Is there a new Unicode string type or are you just using WideString?

Yes, there is a new data type.  UnicodeString.  It will be reference-counted just like AnsiString and unlike WideString which is a BSTR.  This new data type will be mapped to string which is currently an AnsiString underlying type depending on how it is declared.  Any string declared using the [] operator (string[64]) will still be a ShortString which is a length prefixed AnsiChar array.  The UnicodeString payload will match that of the OS, which is UTF16.  This means you can, at times, have surrogate pairs for characters.  For characters that fall outside the Basic Multilingual Plane (BMP).

Will I be able to still use the AnsiString type?

Yes.  No existing types are being taken away.

What about Char and PChar?

Char will be an alias for WideChar and PChar will be an alias for PWideChar.

Will I have to explicitly call the "W" versions of the Windows API?

For all the Windows API header translations that CodeGear provides, your code should not have to change to call the "W" version.  Since it has always been our intent to make this change at some point in the future, we have been specially processing the header translations (since Delphi 2 if you must know ;-) to ease this transition.  If you want more details on how we do this you can visit the JEDI website for guidelines on how to use these tools.  We'll be providing some updates for these tools in order to properly process a header to use the "W" versions.

Why didn't you just use UTF8?  It's more compact than UTF16.

This was considered.  However, this would have forced far more conversions throughout the VCL code as it talks to the Windows API, and it would have introduced a lot of very subtle breakages in much of user code.  While a lot of code out there already handles DBCS (Double-byte character sets), that same code does not correctly handle characters that consist of > 2 bytes.  In UTF8 a single character can be represented by as many as 6 bytes. [Correction: This is not the case in true UTF8.  5 and 6 byte sequences are illegal in UTF8 (thanks Aleksander)]  In UTF8 a single character can be represented by as many as 4 bytes.  Finally, UTF16 is the native format used internally by Windows itself.  By calling directly to the "W" APIs, the "A" translation layer that Windows has is bypassed and should, in theory, increase performance in some cases.

OMG!!  All my code is going to break!  I can't handle this!!

Now hold on there.  Before you get your knickers in a knot,  please take a moment to fully understand the impact of this change and how to best prepare for it today.  As we're in this process of working in Tiburon, we've been capturing a lot of the common pitfalls and idioms many of you are likely to encounter.  We'll also be working on ways to get this information out to our customers.  Blogs, Whitepapers, and other articles will be the vehicles by which we'll provide this information.  We do understand that there are some types of applications that will be affected more than others.  Many of you have written your own handy-dandy library of string processing functions and classes.  The top categories of things you'll need to watch out for are:

  • Assumptions about the size of Char.
  • Use of string as a storage medium for data other than character data.
  • SizeOf(Buffer) <> Length(Buffer) where Buffer: array[0..x] of Char;
  • File I/O (console I/O will still be down converted to Ansi data since you it can be redirected)
  • set of Char; should be changed to set of AnsiChar;
    • You should also consider starting to use the new character classification functions in Tiburon.
  • If your code must still operate on Ansi string data, then simply be more explicit about it.  Change the declaration to AnsiString, AnsiChar, and PAnsiChar.  This can be done today and will recompile unchanged in Tiburon.

What about the Windows 9x OS?

Not going to happen.  If you absolutely must continue to support those operating systems, RAD Studio 2007 is a great choice.  I realize this may not be a popular decision for some markets, but it is getting harder and harder to support an operating system that is barely even tacitly supported by MS themselves.  We've even looked into MSLU (Microsoft Layer for Unicode) and that is not a very viable option since in order to get it to work with Delphi we'd have to duplicate a lot of the code that is in the COFF based .LIB file that is provided only for VC++.  Yes there is the unicows.dll, but that is not where the "magic" happens.  So, Windows 2000 and newer will be the supported target platforms.

In the coming months, I'll try and show some common code constructs that will need to be modified along with a lot of common code that will just work either way.  It is has been pleasantly surprising how much code works as the latter, and how easy it has been to get the former to behave like the latter.