Wednesday, January 9, 2008

More FAQs about Unicode in Tiburón

Based on some of the comments I've already received, here are a few more common questions folks have:

Why not just provide a new set of controls for Unicode?

Because Unicode is not just about displaying characters.  If Unicode were relegated only to the display surface and all the rest of the program were left alone, the application wouldn't really be Unicode-enabled.  To do Unicode correctly, you really have to shed the whole ASCII/ANSI notion of character data.  Even one function remaining Ansi within an application introduces what I call a "pinch-point."  If you look at how character data flows through a system, any place that performs an implicit or explicit down/up conversion increases the potential for data loss.  The more pinch-points in an application, the less likely it will function properly with code points outside the active code page.  We simply cannot take a piecemeal approach to this change; as we did the analysis, the problems with an incremental approach far outweighed those of the complete approach.
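
To make the pinch-point idea concrete, here's a minimal sketch (assuming Tiburón's default string type maps to UnicodeString; the literal and the locale are just for illustration):

    program PinchPoint;
    {$APPTYPE CONSOLE}
    var
      U: string;      // in Tiburón, string is UnicodeString
      A: AnsiString;
    begin
      U := 'Grüße, Москва';   // Latin-1 plus Cyrillic code points
      A := AnsiString(U);     // down-conversion to the active Ansi code page;
                              // on a Western locale the Cyrillic letters are lost
      U := string(A);         // converting back up cannot recover what was dropped
      Writeln(U);             // prints something like 'Grüße, ??????'
    end.

Every such conversion along the data path is one more place where characters outside the active code page can silently turn into '?'.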

What about C++?  All the .HPP files and generated code use AnsiString.

This has been a major sticking point.  Truth be told, this problem nearly scuttled the whole project; for a while it looked like we might not find a workable solution.  The solution we came up with probably deserves a post all to itself.  Several other interesting things "fell out" of it that will benefit everyone.

All of my files are in ASCII/ANSI.  Will they have to be Unicode now? Can I still read the old versions?

For this we "stole" a few items from the .NET playbook.  There is now a TEncoding class that does exactly what its name implies; in fact, it closely matches the System.Text.Encoding class in interface and functionality.  We have also done a lot of work to single-source the VCL for Win32 and the VCL for .NET, and because these code bases are heavily shared (with very few IFDEFs), we can provide like functionality in both.  A lot of you use TStrings.[LoadFrom|SaveTo]File or TStrings.[LoadFrom|SaveTo]Stream; there are now overloaded versions that take an instance of the TEncoding class.  LoadFromFile also now defaults to checking for a Byte Order Mark (BOM) and will select the correct encoding based on it.  Other traditional file I/O, or any manual I/O using TStream.Read or TStream.Write, will need to be reviewed for possible adjustments.  Text File Device Drivers (TFDDs; if you don't know what they are, you can ignore this) will always down-convert from Unicode to Ansi, since that is what redirected console I/O expects and what most existing code that uses them already expects.
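
Putting those overloads together, reading a legacy Ansi file and writing it back out as UTF-8 might look like this (a sketch; the file names are hypothetical, and I'm assuming TEncoding exposes Default and UTF8 class properties in the same spirit as System.Text.Encoding):

    uses
      Classes, SysUtils;

    procedure ConvertLegacyFile;
    var
      Lines: TStringList;
    begin
      Lines := TStringList.Create;
      try
        // No encoding argument: LoadFromFile checks for a BOM and
        // selects the matching encoding, falling back to Ansi.
        Lines.LoadFromFile('legacy.txt');

        // Or state the encoding explicitly, e.g. for BOM-less Ansi data:
        Lines.LoadFromFile('legacy.txt', TEncoding.Default);

        // Save back out as UTF-8:
        Lines.SaveToFile('converted.txt', TEncoding.UTF8);
      finally
        Lines.Free;
      end;
    end;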

Do all my source files have to be Unicode now?

Nope.  You can continue to leave your source files encoded as Ansi.  The IDE code editor has supported many different encodings for several releases.  UTF-8 would be a good encoding to use if you plan on embedding code points beyond the ASCII range in string literals and/or comments.

What about my existing project and the DFM files?

The streaming format for DFM files has not changed; it has supported Unicode-encoded strings for several releases already.  That was required as far back as the Kylix/CLX days, and more recently because of VCL for .NET.  Preemptive reply to the inevitable "Hey! DFM files should really be XML!" comment: that's a different discussion, and not germane to this post.


I'm sure there will be plenty more comments and speculation, along with the odd panic-stricken remark from some.  I'll try to address as many as I can.