Thursday, January 10, 2008

And now the $100,000 question.

Will there be a switch to control when string = UnicodeString?

The current assumption about that is, no.  Let me explain why.

DCU compatibility and linking problems - Suppose you built unit A with the switch in Unicode mode.  Then you built unit B with it off.  You cannot link unit A with unit B because of the type mismatches.  Sure, some things might work, but lots of things won't.  Things like assigning event handlers, passing var/out parameters, and others would fail with type mismatches.  It gets even more confusing and frustrating for the user when they look at the source code and see string declared everywhere, yet things don't work.  Even if we delivered two complete sets of DCUs, one for AnsiString and one for UnicodeString, there is still the problem with packages, design-time components, and third-parties.
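The var-parameter case can be sketched inside a single program by spelling out the two string types explicitly, which is roughly what the compiler would see if two units disagreed about what string means.  FillName and the variable names are hypothetical, made up for illustration only.

```pascal
program VarMismatch;
{$APPTYPE CONSOLE}

// Stands in for a routine from "unit B", compiled with string = AnsiString.
procedure FillName(var S: AnsiString);
begin
  S := 'unit B';
end;

var
  Narrow: AnsiString;    // what "unit B" thinks string is
  Wide: UnicodeString;   // what "unit A" thinks string is
begin
  FillName(Narrow);      // types match, compiles
  // FillName(Wide);     // rejected by the compiler: the actual and formal
                         // var parameter types must be identical
  Wide := UnicodeString(Narrow);  // a by-value conversion is fine
  Writeln(Wide);
end.
```

With both declarations written simply as string in the source, the rejected call would look perfectly reasonable to the reader, which is exactly the confusion described above.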

The IDE would be natively built in Unicode mode - This requires that all design-time components be built in Unicode mode.  You could not install a component package or IDE enhancement built in Ansi mode.  The upshot of this is that the design-time experience could be very different than runtime.

Third-parties would have to deliver Ansi and Unicode versions of their components - I don't want to impose this kind of effort on our valuable and critical third-party market.  From a source code standpoint, there is a lot (a majority) of code that can be compiled either way, but that doesn't help with testing, packaging, and install size.  Delivering two versions of the same library for the same version of Delphi doubles the testing effort.

Piecemeal moves to Unicode seem "safe" and are easier to "sell" to management, but there are real land mines - When you look at Unicode as merely a "way to display Kanji or Cyrillic characters on the screen," it is easy to think Unicode is only about display rendering.  Unicode is really a storage and data format for human-readable text.  Display rendering is only one part of it.  If you strictly relegate Unicode to the visual portions of your application, you're ignoring the real "meat" of your application.  A holistic approach is needed.  If you take the seemingly "safe" route and only do a portion of the application at a time, you run the risk of hiding an untold number of "pinch-points" throughout your application where implicit or even explicit conversions are taking place.  If a Unicode code-point does not map to the currently active Ansi codepage, that character (code-point) is lost during the conversion, typically replaced by a placeholder such as '?'.  When the text comes back out of the system and needs to be rendered, the original data is gone.  The trick is finding all those "pinch-points" and figuring out what to do with them.
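A single pinch-point can be sketched as a round trip through AnsiString.  This assumes a compiler where UnicodeString exists as described in this post; the program and variable names are hypothetical.  Whether the comparison fails depends on the active Ansi codepage of the machine it runs on, so no fixed output is claimed.

```pascal
program PinchPoint;
{$APPTYPE CONSOLE}

var
  Original, RoundTrip: UnicodeString;
  Narrow: AnsiString;
begin
  // Two Kanji code-points (U+65E5, U+672C) appended to plain ASCII text.
  Original := 'Delphi ' + #$65E5#$672C;

  // The pinch-point: text is squeezed through the active Ansi codepage.
  Narrow := AnsiString(Original);

  // Widening back cannot restore anything the codepage could not represent.
  RoundTrip := UnicodeString(Narrow);

  if RoundTrip = Original then
    Writeln('Round trip preserved the text on this codepage')
  else
    Writeln('Round trip lost data on this codepage');
end.
```

On a Japanese codepage the text would survive; on most Western codepages it would not.  That machine-dependence is part of what makes these spots hard to find by testing alone.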

Character indexing and normal string manipulation remain unchanged - The UnicodeString is still a 1-based, reference-counted, lifetime-managed entity.  It is no different from AnsiString in this regard.  The difference is that it has a payload of UTF16 (word-sized) elements instead of byte-sized elements.  String assignments, indexing, implicit conversions, etc. all continue to work as expected.  Length(UnicodeStringVar) returns the number of elements, the same as Length(AnsiStringVar).
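A small sketch of what "number of elements" means in practice, assuming the UnicodeString type described here.  The program and variable names are hypothetical.  The last case uses a code-point outside the Basic Multilingual Plane, which takes two UTF16 elements.

```pascal
program LengthDemo;
{$APPTYPE CONSOLE}

var
  Wide: UnicodeString;
  Narrow: AnsiString;
  Clef: UnicodeString;
begin
  Wide := 'Delphi';
  Narrow := 'Delphi';

  Writeln(Length(Wide));                       // 6 elements
  Writeln(Length(Narrow));                     // 6 elements
  Writeln(Length(Wide) * SizeOf(WideChar));    // 12 bytes of payload
  Writeln(Length(Narrow) * SizeOf(AnsiChar));  // 6 bytes of payload
  Writeln(Wide[1]);                            // D (indexing is still 1-based)

  // U+1D11E (musical G clef) is encoded as a surrogate pair.
  Clef := #$D834#$DD1E;
  Writeln(Length(Clef));                       // 2 elements, one code-point
end.
```

Length counts elements, not code-points or bytes, so code that sizes buffers in bytes needs to multiply by the element size.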

Code that must use AnsiStrings should be explicit - If your code absolutely must use AnsiStrings, you can explicitly change the declarations to AnsiString.  You can do this right now with your existing code.

string is already a Unicode string - In the Delphi for .NET compiler, string has been equivalent to System.String which is a UTF16 element based string.  Many of our customers have already had to deal with this fact, and have survived the transition very well.

An example.

As we've been working on making sure things compile with the new Unicode compiler, it has been surprising, even for us, how much of our code simply just works.  SysUtils.pas contains a lot of filename manipulation functions that do a lot of string manipulation internally.  Take ExtractFileExt(), for example:

function ExtractFileExt(const FileName: string): string;
var
  I: Integer;
begin
  I := LastDelimiter('.' + PathDelim + DriveDelim, FileName);
  if (I > 0) and (FileName[I] = '.') then
    Result := Copy(FileName, I, MaxInt) else
    Result := '';
end;

This function simply recompiled and worked as is.  Granted, it is only a few lines of code, but what I'm not showing here is the code path for the LastDelimiter function.  This function casts the FileName parameter to a PChar, then calls StrScan.  Since functions that take PChar parameters do not do implicit conversions, we've provided overloaded versions of these functions.  So even if you do a lot of "PChar" buffer manipulation, we've got those functions covered.
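From the caller's side nothing changes.  A minimal usage sketch on Windows, with made-up paths:

```pascal
program ExtDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

begin
  // The last delimiter is the '.', so the extension is returned with its dot.
  Writeln(ExtractFileExt('C:\docs\report.txt'));   // .txt

  // The last delimiter is the '\' after "docs.old", not a '.', so the
  // result is an empty string and an empty line is printed.
  Writeln(ExtractFileExt('C:\docs.old\report'));
end.
```

The second call shows why PathDelim and DriveDelim are in the delimiter set: a dot in a directory name is not mistaken for the start of an extension.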

Beefed up warnings and hints.

Another thing we're doing to help folks easily identify sections of their code that may need inspection is adding more warnings.  When the compiler sees certain code constructs such as implicit string conversions, strange pointer casting, etc., extra diagnostic information will be output.  Another compiler feature we've added is the ability to elevate any one or all warnings to an error.  We've actually been going through our own code (and I'm a little embarrassed to say we haven't been particularly "warning free"), eliminating all warnings and then elevating the warnings to errors.  Now our own build processes will literally fail when someone checks in code that generates a warning.
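A sketch of how elevating one warning might look in source.  The directive form and the IMPLICIT_STRING_CAST_LOSS identifier are the ones used by later shipping Unicode compilers; nothing in this post names them, so treat them as an assumption here.  The program and variable names are hypothetical.

```pascal
program WarnDemo;
{$APPTYPE CONSOLE}

// Assumed identifier: turns the "implicit string cast with potential
// data loss" warning into a compile error for this unit.
{$WARN IMPLICIT_STRING_CAST_LOSS ERROR}

var
  Wide: UnicodeString;
  Narrow: AnsiString;
begin
  Wide := 'text';

  // Narrow := Wide;          // implicit narrowing: would now fail the build

  Narrow := AnsiString(Wide); // explicit cast states the intent and compiles
  Writeln(Length(Narrow));    // 4
end.
```

The explicit cast does not make the conversion any safer.  It only records that someone looked at this spot and decided the narrowing was intended.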

Illusions, Delusions, Fantasy and Reality.

I hold no delusions that this change will be bump-free and every lick of code out there will work without a hitch.  There will be a class of applications and libraries that will be affected far more than others.  Our goal is to ensure that the vast majority of our users out there will see as little disruption as possible.  Also, for those cases where disruption is bound to happen, we're working on providing tooling and education to assist in this transition.  The cold-hard reality is that this change is arguably late in coming. This has been a perennial request for at least the last 5 - 6 years (probably longer).  Getting on track, focusing our efforts, and addressing a real need for a large segment of our customers is sometimes a little painful.

Maintaining a strong bias for backward compatibility has come at a price.  There are segments of customers clamoring for major sweeping changes to things like VCL (add skinning, a new data binding model, XML streaming, etc.).  We, too, fantasize regularly about things like "what if we could ignore the past and just pick up the pieces and go for it?"  The cold-hard reality is that we've built over the last 13 years high expectations about what customers have come to rely on from release to release.  I know that.

For many of you, your first exposure to Delphi was maybe version 3 or later.  Many of you never experienced the largest transition in Delphi's history, the move from 16bit Windows to 32bit Windows in the Delphi 1 to 2 cycle.  Lots of changes happened there.  The Integer data type grew to a 32 bit entity.  string became a managed, reference counted, heap-allocated, entity that also managed to maintain a lot of "value semantics" encapsulated in an underlying reference type.  The change was embraced because finally a string could hold huge amounts of data.  You could "cast" the string to a PChar and call a Windows API function since it was also null-terminated.

Today, I realize that the landscape is different.  There are many, many years of history.  The Internet has permeated our very existence.  The world has shrunk on account of this new level of connectivity.  Countless millions of lines of code have been written.  Given all of that, the need to communicate using a common, unified and standard encoding is paramount.

As Delphi moves into emerging markets, especially in the Far East, if we are to continue to find acceptance and carve out our place, strong, clear Unicode support is paramount.  In many of those markets, governments are beginning to legislate and enforce how applications interact with character data.  While this doesn't necessarily apply to the private sector, that sector does take a cue from those requirements and cannot afford to one day be shut out of certain jobs and markets.  They too see the value and reasoning behind these rules and elect to follow suit.

Finally, I do not intend to fully shut the door on this issue as I know it will have (is having) a polarizing effect.  I do, however, want to make sure people get as informed as possible.  Agree or disagree, that's fine.  One thing I learned early on in my career here at CodeGear (and Borland) was to truly think about a problem.  Don't just pop off with the first reason you can find about why something won't work.  Also, continue to challenge your own conclusions and position.  Don't be afraid to be wrong (and don't assume that is advice only for the "other guy" either).  Get the facts.  I'll help by presenting as many facts as I'm able.  Let the games begin :-).