Wednesday, January 9, 2008

DPL & Unicode - a toss up.

So far it's looking like a toss-up between folks wanting more information on the Delphi Parallel Library and those wanting more information about the shift to Unicode.  I think both are extremely important and it is no surprise given the feedback.  Since it is still not clear whether or not DPL will make it into the next release, I may opt to begin talking more about Unicode... then again, maybe not :-).

Right now, I'm wrestling with some compiler issues related to debugging when a generic type is instantiated... needless to say it's making the work on DPL a little tough.  This is par for the course when you're trying to retrofit the airplane while it is still in flight :-).  If it takes more than a few days before this is resolved, I'll probably jump back over to Unicode.  That area is working and the team is full speed ahead on it.

Just to clear some things up, I'm going to answer a few of the common questions folks have about the move to Unicode.

Is there a new Unicode string type or are you just using WideString?

Yes, there is a new data type: UnicodeString.  It will be reference-counted just like AnsiString, and unlike WideString, which is a BSTR.  This new data type will be mapped to string, whose underlying type is currently AnsiString (or ShortString, depending on how it is declared).  Any string declared using the [] operator (string[64]) will still be a ShortString, which is a length-prefixed AnsiChar array.  The UnicodeString payload will match that of the OS, which is UTF16.  This means that characters that fall outside the Basic Multilingual Plane (BMP) will be stored as surrogate pairs.
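As a rough illustration of the surrogate-pair point, here is a minimal sketch.  It assumes a compiler in which UnicodeString already exists as described above, and CodePointCount is a hypothetical helper written for this example, not an RTL routine.

```delphi
program SurrogateSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper (not part of the RTL): counts code points in a
// UTF-16 string by treating each well-formed surrogate pair as one.
function CodePointCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    // A high surrogate ($D800..$DBFF) followed by a low surrogate
    // ($DC00..$DFFF) together encode one character outside the BMP.
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // 'A' followed by U+1D11E (musical G clef), written as its surrogate pair.
  S := 'A' + #$D834#$DD1E;
  // Length reports 16-bit elements, so the pair counts twice here.
  Writeln('Length (UTF-16 code units): ', Length(S));
  Writeln('Code points: ', CodePointCount(S));
end.
```

The point of the sketch is only that Length counts 16-bit elements, so code that equates "one element" with "one character" needs a second look wherever characters outside the BMP can appear.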

Will I be able to still use the AnsiString type?

Yes.  No existing types are being taken away.

What about Char and PChar?

Char will be an alias for WideChar and PChar will be an alias for PWideChar.

Will I have to explicitly call the "W" versions of the Windows API?

For all the Windows API header translations that CodeGear provides, your code should not have to change to call the "W" version.  Since it has always been our intent to make this change at some point in the future, we have been specially processing the header translations (since Delphi 2 if you must know ;-) to ease this transition.  If you want more details on how we do this you can visit the JEDI website for guidelines on how to use these tools.  We'll be providing some updates for these tools in order to properly process a header to use the "W" versions.
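To picture what calling the "W" version means at the declaration level, here is a minimal, Windows-only sketch.  It does not show how the CodeGear header tooling works; MessageBoxSketch is a hypothetical name chosen for this example, and the only real pieces are the MessageBoxW export of user32.dll and the types from the Windows unit.

```delphi
program ApiAliasSketch;

uses
  Windows;

// Hypothetical declaration for illustration only: an unsuffixed Pascal name
// bound to the wide-character ("W") entry point exported by user32.dll.
function MessageBoxSketch(hWnd: HWND; lpText, lpCaption: PWideChar;
  uType: UINT): Integer; stdcall; external 'user32.dll' name 'MessageBoxW';

begin
  // The caller never writes the "W" suffix; the binding above selects it.
  MessageBoxSketch(0, 'Hello from the W entry point', 'Sketch', MB_OK);
end.
```

Because the calling code only ever sees the unsuffixed name, switching the binding from the "A" export to the "W" export is a change in the declaration, not in the caller.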

Why didn't you just use UTF8?  It's more compact than UTF16.

This was considered.  However, this would have forced far more conversions throughout the VCL code as it talks to the Windows API, and it would have introduced a lot of very subtle breakages in a lot of user code.  While a lot of code out there already handles DBCS (Double-byte character sets), that same code does not correctly handle characters that consist of > 2 bytes.  In UTF8 a single character can be represented by as many as 4 bytes.  [Correction: an earlier version of this post said 6 bytes.  This is not the case in true UTF8.  5 and 6 byte sequences are illegal in UTF8 (thanks Aleksander)]  Finally, UTF16 is the native format used internally by Windows itself.  By calling directly to the "W" APIs, the "A" translation layer that Windows has is bypassed and should, in theory, increase performance in some cases.
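The size trade-off can be made concrete with a small sketch.  Utf8ByteCount and Utf16UnitCount are hypothetical helpers written for this example (they are not RTL routines), and they only apply the standard encoding ranges; surrogate values are not rejected.

```delphi
program EncodingSizeSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
// Returns 0 for values beyond U+10FFFF, which are not valid code points.
function Utf8ByteCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $7F then
    Result := 1
  else if CodePoint <= $7FF then
    Result := 2
  else if CodePoint <= $FFFF then
    Result := 3
  else if CodePoint <= $10FFFF then
    Result := 4
  else
    Result := 0;
end;

// Hypothetical helper: 16-bit code units needed in UTF-16.
// Anything outside the BMP takes a surrogate pair, i.e. two units.
function Utf16UnitCount(CodePoint: Cardinal): Integer;
begin
  if CodePoint <= $FFFF then
    Result := 1
  else if CodePoint <= $10FFFF then
    Result := 2
  else
    Result := 0;
end;

begin
  Writeln('U+0041:  UTF-8 bytes = ', Utf8ByteCount($41),
    ', UTF-16 units = ', Utf16UnitCount($41));
  Writeln('U+20AC:  UTF-8 bytes = ', Utf8ByteCount($20AC),
    ', UTF-16 units = ', Utf16UnitCount($20AC));
  Writeln('U+1D11E: UTF-8 bytes = ', Utf8ByteCount($1D11E),
    ', UTF-16 units = ', Utf16UnitCount($1D11E));
end.
```

UTF8 is smaller for plain ASCII, but a character can span anywhere from 1 to 4 elements, whereas UTF16 only ever needs 1 or 2.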

OMG!!  All my code is going to break!  I can't handle this!!

Now hold on there.  Before you get your knickers in a knot, please take a moment to fully understand the impact of this change and how to best prepare for it today.  As we're in this process of working on Tiburon, we've been capturing a lot of the common pitfalls and idioms many of you are likely to encounter.  We'll also be working on ways to get this information out to our customers.  Blogs, Whitepapers, and other articles will be the vehicles by which we'll provide this information.  We do understand that there are some types of applications that will be affected more than others.  Many of you have written your own handy-dandy library of string processing functions and classes.  The top categories of things you'll need to watch out for are:

  • Assumptions about the size of Char.
  • Use of string as a storage medium for data other than character data.
  • SizeOf(Buffer) <> Length(Buffer) where Buffer: array[0..x] of Char;
  • File I/O (console I/O will still be down converted to Ansi data since it can be redirected)
  • set of Char; should be changed to set of AnsiChar;
    • You should also consider starting to use the new character classification functions in Tiburon.
  • If your code must still operate on Ansi string data, then simply be more explicit about it.  Change the declaration to AnsiString, AnsiChar, and PAnsiChar.  This can be done today and will recompile unchanged in Tiburon.
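A few of the items in this list can be sketched in code that compiles the same way regardless of how wide Char is.  This is a minimal sketch, not a migration recipe; the names CharBytes, Buffer and Digits are hypothetical and exist only for this example.

```delphi
program CharSizeSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper: byte size of Count characters, written so that it
// does not assume SizeOf(Char) = 1.
function CharBytes(Count: Integer): Integer;
begin
  Result := Count * SizeOf(Char);
end;

const
  // Being explicit about Ansi data: a set of AnsiChar rather than set of Char.
  Digits: set of AnsiChar = ['0'..'9'];

var
  Buffer: array[0..63] of Char;
  S: string;
  A: AnsiChar;
begin
  // SizeOf counts bytes, Length counts elements.  The two agree only
  // while SizeOf(Char) = 1.
  Writeln('SizeOf(Buffer) = ', SizeOf(Buffer));
  Writeln('Length(Buffer) = ', Length(Buffer));

  // Clear the buffer, then copy using a byte count derived from SizeOf(Char)
  // instead of passing Length(S) as if it were a byte count.
  FillChar(Buffer, SizeOf(Buffer), 0);
  S := 'Hello';
  Move(S[1], Buffer[0], CharBytes(Length(S)));
  Writeln(PChar(@Buffer[0]));

  // Code that really works on Ansi data says so in its declarations.
  A := '7';
  if A in Digits then
    Writeln('7 is a digit');
end.
```

The common thread is to multiply by SizeOf(Char) whenever a routine wants bytes, and to name AnsiChar, AnsiString or PAnsiChar explicitly wherever the data really is Ansi.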

What about the Windows 9x OS?

Not going to happen.  If you absolutely must continue to support those operating systems, RAD Studio 2007 is a great choice.  I realize this may not be a popular decision for some markets, but it is getting harder and harder to support an operating system that is barely even tacitly supported by MS themselves.  We've even looked into MSLU (Microsoft Layer for Unicode) and that is not a very viable option since in order to get it to work with Delphi we'd have to duplicate a lot of the code that is in the COFF based .LIB file that is provided only for VC++.  Yes there is the unicows.dll, but that is not where the "magic" happens.  So, Windows 2000 and newer will be the supported target platforms.

In the coming months, I'll try and show some common code constructs that will need to be modified along with a lot of common code that will just work either way.  It has been pleasantly surprising how much code works as the latter, and how easy it has been to get the former to behave like the latter.

42 comments:

  1. This sounds really interesting, but I have big problems with

    "This new data type will be mapped to string which is currently an AnsiString underlying type depending on how it is declared."

    and

    "If your code must still operate on Ansi string data, then simply be more explicit about it. Change the declaration to AnsiString, AnsiChar, and PAnsiChar. This can be done today and will recompile unchanged in Tiburon."

    This is totally unacceptable for us (the company I'm working in). We have to support many applications with millions of source lines of code, some of which can still be compiled with BP7 (with the help of many IFDEFs, of course). There is no way this code can be cleaned up in time for Tiburon. And I'm totally sure it will break if string becomes a UTF-16 datatype.

    What we need is a compiler switch that will default to Ansi mode for existing applications and for Unicode mode for new applications. That way we can still support old code while we can start working from scratch on Unicode-supporting applications.

    I'm pretty sure that we will not upgrade to Tiburon if string will be aliased to UnicodeString.

  2. Hi,

    what I miss is a compiler-switch to change the mapping for the string-type.
    It would make sense to choose the old behaviour, where string is mapped to AnsiString. That allows smoother step-by-step migration of existing applications.

    Michael

  3. You mention support for Windows 2000 and later. How about NT4?

  4. Thanks for the insight. For the first time since D7 I am actually excited about a new Delphi release.

    To convert my applications to Unicode I don't mind a little code breaking and it sure does not sound too bad.

    One good thing about getting us Unicode so late is that you do not have to support W98. Two years ago I would have screamed about it, but most of our customers are now on XP.

  5. For those such as gabr above who say they cannot use unicode strings yet, would a text search and replace from
    : string;
    to
    : AnsiString;
    to change declarations perhaps work?

  6. Glad to see you are finally not letting backwards compatibility hold the future hostage. I am very excited about the new features coming in Tiburon. It would be even more exciting if the Win64 compiler was included as a preview :-)

  7. [quote]In UTF8 a single character can be represented by as many as 6 bytes.[/quote]
    Not so according to http://www.utf-8.com, so I hope this is not really how Delphi's future UTF-8 algorithms are implemented.

    Maximum allowed byte span for a valid UTF-8 character is 4 bytes, with the following bit pattern:

    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    This pattern of 21 free bits covers codepoints in range $010000-$1FFFFF, and together with the 1-byte, 2-byte and 3-byte patterns gives a total of 2,097,152 possible encoded characters.

    Not all of these are valid, though - some are reserved UTF-16 surrogate pairs, as defined by the Unicode standard. Any decent UTF-8 parsing algorithm should account for those, too.

  8. Great news ! However, as a lot of people, I've to support "old" programs where changing all String types to AnsiString would be painful... and also clearly a big waste of time. I seriously hope there will be a flag or a checkbox somewhere that will map automatically the String data type to the AnsiChar one. Seriously. I really want to design new Unicode apps, but backward compatibility is also very important for legacy apps. If nothing is done about this issue, I'll stick to D2007 for a looong time ;-)

  9. I know you're not a C++ man, but what's going to happen to the automated HPP generation? Currently, Delphi "string" comes out as AnsiString in the HPPs, which is not terribly helpful. I assume that it will need to come out as String, and then String will be typedef'd to AnsiString or UnicodeString accordingly...?

  10. Woohooo! We've been waiting for proper unicode support in Delphi for ages, good work Codegear! :)

  11. People, are you realistic, asking for such things as inconsistent switch to unicode due to some issues with 15 year old products??

    I agree, there are still applications out in the market which still need to be supported and which need to run on 98. BUT they are not the majority, and these developers will still be able to use D7 or D2007.

    I'm convinced that this switch to Unicode and the decision to change/improve the VCL comes much too late, and that CodeGear lost much of its advantage due to this delay. We shouldn't try to delay this switch even more if we still hope that Delphi stays a semi-major player instead of degrading to a niche product for some software relics.

  12. DanB --

    It is not true that DevExpress has dropped C++Builder support. That's totally false.

    Nick

  13. This will be exciting. I am using WideString a lot currently and I will be looking forward to UnicodeString. I'm sure you are making the right decisions in terms of breaking only so much existing code as necessary.

  14. I think some companies cannot simply use a replace function to rename String to AnsiString, since sometimes there are IFDEFs inside the code to support legacy systems.

    An option to turn the string mapping to Unicode on or off would be nice, if CodeGear has time to implement it in the new Delphi.

    Of course, this is not a problem if CodeGear pushes it onto those companies to write a parser that does the replacement.

  15. - C Johnson,

    "what is the correct datatype for an 8 bit ascii CHAR, if its not CHAR??"

    the 8 bit char is AnsiChar and the 8 bit pchar is PAnsiChar.

  16. We definitely need that compiler-switch to change the mapping for the string-type, leaving old applications with no need of Unicode alone.

  17. Nick

    I did not say they dropped all C++ builder support, but what I did say is true: Their latest VCL product, ExpressSkins, does not support C++Builder.

    Here is what DevExpress CTO Julian M Bucknall has to say about it on their forums:

    "We decided, at a late stage admittedly, not to support C++Builder with ExpressSkins in the first release"

    While he does not rule out adding support later on, he does say:

    "It does mean though that it is *unlikely* that we'll be adding support
    for C++Builder in our new VCL products. Not unless there's some drastic changes to the product and in the market."

    It sounds like the decision is based on a) the perception that the C++Builder market is small, b) the compatibility problems that Delphi and C++ have in the current product, and c) a lack of effort on CodeGear's part to help:

    "Another thought and then I'll go. I am the CTO of CodeGear's
    (arguably) largest third-party control partner. Have I received an
    email, a phone call, a visit from the new C++Builder Product Manager
    at CodeGear? That would be no. From Nick Hodges, his Delphi
    equivalent, sure. But from Alisdair Meredith? Complete silence.
    Reflect on that."

  18. I'm sure there must be a compiler-switch to change the mapping for the string-type.

  19. 1. I am sure that there is a compiler-switch to change the mapping for the string-type, isn't there?

    2. Is it possible to declare string as AnsiString for some component libraries and as UnicodeString for the rest of a project's code? Those libraries without source might not be compatible with UnicodeString.

    3. The name UnicodeString is unprofessional. But there seems to be no alternative?

  20. A quick note to 22. comment:
    It looks really strange, when the string type is called Unicodestring and the Char type is called WideChar.

  21. I will have great problems with a product that does not have a compiler-switch for char / string default widths. When I look at some projects here ( 10M lines of code) it will be a hell of a task to make the code function correctly.

    Furthermore: the code that currently uses WideStrings needs to be altered to accommodate the UnicodeString / WideString name difference.

    Make a compiler switch! If not, we will be very uninterested in upgrading from D2007 due to incompatibilities, and upgrading is, I think, the action you want from your customers.

  22. Dear Allen,

    I'll have to agree with the guys who need a compiler switch for legacy source code compatibility. We'll need to turn Unicode off sometimes.
    It'll be a lot easier for you guys to add an option than for us to acquire permission from only God knows who to change millions of lines of code, and it won't go very well with Version Control Systems.

  23. I very much appreciate Delphi's move to Unicode; this is a long-awaited feature for me! I'm firmly convinced you'll get it right!

    I just advise you to add a compiler warning for implicit conversions between AnsiString and UnicodeString, and between AnsiChar and WideChar. It'll simplify the elimination of accidental bugs during an ANSI to Unicode conversion.

    Also I'd like to have the possibility to declare string constants in both encodings, i.e.
    const
      AnsiChar1 = AnsiChar('A'); // ANSI encoded character
      UnicodeChar1 = WideChar('A'); // UTF8 encoded character
      AnsiStr = AnsiString('ANSI encoded string');
      UnicodeStr = UnicodeString('UTF8 encoded string');

    - Something like this.

  24. Thanks for sharing this with us! I'm eagerly awaiting Unicode support in Delphi too.

    I can understand the RTL and VCL need to make the move to UnicodeString. But please, do that with explicit types!
    This way, the meaning of string and (P)Char can still be kept at Ansi - best done via a compiler option like the old $LONGSTRINGS

    As long as seamless transformations between ShortString, AnsiString, UnicodeString, UTF8String, UTF16String and UTF32String can be made, all will be good.

    IMHO, Tiburon should offer a type for all three major Unicode encodings (UTF8, UTF16 and UTF32) - including encoding-specific implementations for things like: Length(), Copy(), Delete(), CharPos(), StringPos(), StringReplace(), Lower/UpperCase(), etc.

    Maybe bind these to the type itself, as is done in DotNet? For example :

    type
      UTF8String = record(BaseString)
      public
        class function Length: Integer; inline; override; overload;
        function Length: Integer; inline; override; overload;
        // etc
      end;

    class function UTF8String.Length: Integer;
    begin
      Result := 0;
    end;

    function UTF8String.Length: Integer;
    begin
      Result := ReferenceCountedStringHeader(Self)._Length;
    end;


    Just a thought...

  25. Hi Allen, I am quite uncomfortable with the new name "UnicodeString".
    I have written my point of view on the Unicode stuff. Please take a quick look.
    http://stanleyxu2005.blogspot.com/2008/01/random-thoughts-on-unicode_10.html

  26. A compiler switch or a project option allowing to choose the String mapping onto AnsiString or WideString is a must.

  27. "This new data type will be mapped to string which is currently an AnsiString"

    This is simply unacceptable! There's no way you can check a multi-million line project for occurrences of string, Char, PChar and array of Char. Deprecated or not, many of us do use string as a memory buffer sometimes, PChar for some hacking, etc. IMHO the best way to make the change would be the following: provide a compiler option that would look like this:

    Legacy string types:
    Map to:
    (*) Unicode
    ( ) Ansi

    Compiler notifications:
    ( ) None
    (x) Warning
    ( ) Error

    If you set "Compiler notifications" to other than "None", the compiler should generate warnings/errors on occurrences of these legacy types (string, Char, PChar, array of Char, etc). This way, old applications could be ported gradually: instead of checking and rewriting everything, development would have time to go through the code and fix it (much better to let the compiler find weak points than simply searching for words in hundreds of files), but the projects would still compile and work as expected.

  28. Of course, the above mentioned procedure would work with the Ansi & Warning options set, Unicode & Warning should be default.

  29. Just another vote for a Win 9x compatible Delphi going forward. I don't need Unicode support right now but do need 9x support. If you can't build in a Unicode vs 9x compatibility switch, consider folding what new features/updates you can into D2007 and extending its life as a legacy product for a couple of years (i.e. dual versions). Otherwise you may see a long upgrade drought for many of your users like you did from D7 to D2006. Quite frankly, Delphi support all the way back to 9x is one of the few competitive advantages CodeGear has; otherwise I could just jump on the .Net bandwagon and be forced to only support the Microsoft latest, like everyone else.

  30. Mozilla regrets about UTF-16?
    http://weblogs.mozillazine.org/roc/archives/2008/01/string_theory.html

  31. A Mozilla developer prefers UTF-8 over UTF-16LE.
    For Delphi, UTF-16 makes a lot of sense because it is so close to the Win32 API, which uses UTF-16LE.

  32. We have to support and further develop legacy applications that also rely on 3rd-party components/libs. It would be unrealistic to do all the digging/correction/testing required. So to me it translates to: "no switch"="no upgrade"="no money for CodeGear"

  33. Leonard: Do you really have that many customers using Win9x still? Asking CodeGear to support Win95 is like asking Toyota to supply parts for a car from 1950. I think the customers CodeGear would lose by not supporting Unicode would be much greater than those they would keep by supporting Win9x.

  34. I agree with the requirement of a switch.
    I don't see why I have to buy both BDS2007 and BDS2008 just to support older projects and have both (bloated) versions installed on my machine.

  35. What about current errors in DBRTL with WideChar?
    See: http://qc.codegear.com/wc/qcmain.aspx?d=52511

  36. @Craig: I most certainly still target Win9x, and even develop on WinME. Quite a lot of functionality and convenience was lost in XP that I still use. Indeed, I was hoping for a fully .NET free W32 development environment in Tiburon.

    But I understand and agree with the reasons Allen presents. As long as CodeGear still allows me to register D7, that's fine by me. Maintaining legacy code has more problems than just Unicode. I'll have both sets of IDE installed in future, unavoidable.

  37. Hello, I have a problem in the Delphi 2007 Update Pack 3 trial edition.
    The result of "alt+0161", after exiting from Label->Caption, is converted to " ? ".
    what is the reason ?
    could you please help me , here are some pictures of that problem.
    thank you.
    kdarabi@gmail.com

  38. http://www.noavari.com/images/err.jpg

  39. I totally agree with CodeGear's plans for Unicode. Make it one big step to Unicode-only applications. string is mapped to UnicodeString. API calls are mapped to the W version.

    This will require checking every line of code we have, but in the long run it is the only way.

    There is no automatic or easy switch from an Ansi application to a Unicode application.

  40. Please keep it backward compatible. I don't think users, who have already spent years developing, want to spend another one or more years to correct their code, so it can work properly with this new Delphi version.

  41. Mehmet Erol Sanliturk, September 27, 2008 at 3:15 AM

    I tried the trial version of Delphi 2009. Programs using "CHAR" and "STRING" are mostly broken, and many generated executables raise run-time exceptions, although they worked perfectly with previous Delphi compilers.

    I think it is very disappointing to map String to UnicodeString and Char to a 2-byte Char, because this new mapping breaks almost all programs using these types. It is not an easy task to "FIX" existing source code, due to coding assumptions and usage which cannot be detected by simple "FIND" and "REPLACE" logic. It is necessary to write a complete compiler-like program to make the intended conversions.
    Instead of this new mapping, new types could have been introduced, such as UnicodeChar alongside AnsiChar, with String left in its old usage. New programs would use UnicodeString only for the parts that require Unicode. Such an approach would not necessitate a complete rewrite of programs.
    It is my opinion that this design decision will prevent the upgrading of existing Delphi and/or C++ Builder installations and the addition of new ones at existing software houses.

  42. Anyone remember using a computer pre-Windows? If so, how many of you still expect Borland/CodeGear to support this? No? - Then why would you expect the latest 2009 compiler to support Windows 9x? If you have a large product used by a big client who still uses 9x, continue to use the compiler from that era. Today however, it's about speed. The quicker you get to market, the better - simply because your great idea will be out of date before you can sneeze - and unless you're in first, you've lost the edge. Not only that, but to be saleable your product must also have the latest bells and whistles, and the same look and feel (in general) as all the other apps on the market, otherwise your clients simply complain and whinge about why it can't do that or this. This is what RAD is all about. And it follows that Unicode is part of that process, allowing developers to write multilanguage programs without all the pain. Unicode is a MUST - and if your personal gripe is that the latest IDE doesn't support some app you wrote 20 years ago, then a business reality pill is for you. And if your app is huge, and porting is going to cost you heaps of time - then invest it wisely, and re-design - replace what you once did manually in code with what you can now buy as COTS (components off the shelf).

    Everybody now raves about 64 bit. Big deal - as long as the OS will run our 32 bit program - I don't really care if we compile to 32 or 64. It's when the program stops functioning that we have to take note - say okay, technology's moving on, we either evolve with it - or we die. And if that means investment to bring us up to speed - and there's a financial business outcome - so be it. Painful, but there it is.


Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.