Wednesday, January 9, 2008

More FAQs about Unicode in Tiburón

Based on some of the comments I've already received, here are a few more common questions folks have:

Why not just provide a new set of controls for Unicode?

Because Unicode is not just about displaying characters.  If Unicode were relegated only to the display surface and all the rest of the program were left alone, the application wouldn't really be Unicode-enabled.  To do Unicode correctly, you really have to shed the whole ASCII/ANSI notion of character data.  Even with only one function remaining Ansi within an application, you've introduced what I call a "pinch-point."  If you look at how character data flows through a system, any place that performs an implicit or explicit down/up conversion increases the potential for data loss.  The more pinch-points in an application, the less likely it is to function properly with code points outside the active code page.  We simply cannot take a piecemeal approach to this change.  As we did the analysis, the problems with an incremental approach far outweighed those of the complete approach.
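
To make the "pinch-point" idea concrete, here is a contrived sketch (illustrative only; LogLine is a hypothetical Ansi-only routine, and the exact warnings and substitution behavior may differ in the final product):

procedure LogLine(const S: AnsiString);
begin
  // Hypothetical Ansi-only logging routine: writes S somewhere.
end;

procedure PinchPointDemo;
var
  Name: string;  // string is now a Unicode string type
begin
  Name := 'Mu' + #$4E2D + 'ller';  // embeds a CJK code point (U+4E2D)
  // Implicit Unicode-to-Ansi down-conversion at this call: the CJK
  // character cannot be represented in most Western code pages, so it
  // is silently replaced (typically with '?'). That is the data loss.
  LogLine(Name);
end;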

What about C++?  All the .HPP files and generated code use AnsiString.

This has been a major sticking point. Truth be told, this problem came close to scuttling the whole project had we not found a workable solution.  The solution we came up with probably deserves a post all to itself.  Several other interesting things "fell" out of this solution that will benefit everyone.

All of my files are in ASCII/ANSI.  Will they have to be Unicode now? Can I still read the old versions?

For this we "stole" a few items from the .NET playbook.  There is now a TEncoding class that does exactly what the name implies.  In fact, it closely matches the System.Text.Encoding class in interface and functionality.  We have also done a lot of work to single-source the VCL for Win32 and the VCL for .NET.  Because these code bases are heavily shared (with very few IFDEFs), we can provide like functionality on both.  A lot of you use TStrings.[LoadFrom|SaveTo]File or TStrings.[LoadFrom|SaveTo]Stream.  There are now overloaded versions that take an instance of the TEncoding class.  LoadFromFile also now defaults to checking for a Byte Order Mark (BOM) and will select the correct encoding based on it.  Other traditional file I/O, or any manual I/O using TStream.Read or TStream.Write, will need to be reviewed for possible adjustments.  Text File Device Drivers (TFDDs) will always down-convert from Unicode to Ansi (if you don't know what TFDDs are, you can ignore this), since that is what redirected console I/O expects and what most existing code that uses them already expects.
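
For example, this is roughly what loading and saving with an explicit encoding will look like (a sketch only; TEncoding.UTF8 is assumed here to be a standard instance along the lines of System.Text.Encoding.UTF8, and exact names, unit placement, and defaults may shift before we ship):

procedure EncodingDemo;
var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.Add('Grüße');  // text outside plain ASCII
    // Assumption: saving with an explicit encoding writes the matching BOM.
    SL.SaveToFile('greetings.txt', TEncoding.UTF8);

    // No encoding given: LoadFromFile checks for a BOM and picks the
    // matching encoding, so the UTF-8 file round-trips; plain Ansi files
    // without a BOM continue to load as they always have.
    SL.LoadFromFile('greetings.txt');
  finally
    SL.Free;
  end;
end;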

Do all my source files have to be Unicode now?

Nope.  You can continue to leave your source files encoded as Ansi.  The IDE code editor has supported many different encodings for several releases.  UTF-8 would be a good encoding to use if you plan on embedding higher-order code points in string literals and/or comments.

What about my existing project and the DFM files?

The streaming format for DFM files has not changed.  It has supported strings encoded as Unicode for several releases already.  This was required all the way back in the Kylix/CLX days and, more recently, because of VCL for .NET.  Preemptive response to the inevitable side comment ("Hey! DFM files should really be XML!"): that's a different discussion and not germane to this post.


I'm sure there will be plenty more comments and speculation along with the odd panic-stricken comment from some.  I'll try to address as many as I can.

23 comments:

  1. Some folks have complained about compatibility with old apps.
    In Delphi we have the {$H+} / {$H-} switch. Could we do the same with the Unicode stuff:

    $UC+:
    string = UnicodeString (UTF16)
    char = 2 byte char

    $UC-:
    string = old Delphi string
    UnicodeString not known
    char = 1 byte char

  2. Well, much as I hate to be the one making odd panic-stricken comments :) I did a "dir *.pas /s" on my components directory. 1377 files, 33 megabytes of source code. And that's not counting several component suites that install elsewhere, like DevExpress. Sure, some of those components are still supported and might be updated in a timely manner, but until I hear either "your existing components will work" or "CodeGear will fix them for you", I remain shaken.

    It seems much more sensible to support chars and wide chars side by side and let people migrate at their own pace. The community managed to make strings and Unicode strings work side by side without help from CodeGear (a la TNT)... CodeGear could do that even better and make it standard for people to build on. But forcing us all to migrate in an all-or-nothing manner... ouch...

  3. I agree with C. Johnson with regard to not using XML as the .DFM file format for the IDE.
    My experience (or rather, our collective experience within our company) has been that the software products we use to develop controllers slowed down dramatically after they were ported from Win32 to the .NET platform.
    Don't get me wrong - this is not a complaint against .NET in general, but it has become easy and fashionable there to put all data into XML files, which then have to be re-read during a build or compile run (and that takes longer; in our case around 2.5 to 4 times slower than before XML files were used).
    This costs us programming time (during a compile we can't code...)

  4. Peter Wrote:

    > $UC-:
    > UnicodeString not known

    IMO, the compiler should not map UnicodeString to a String type when the $UC- switch is in use, but the UnicodeString type should still be available.

  5. I have to convert dfm's from text back to binary on a weekly basis. When there is more than around 32MB of dfm-files to compile, the linker craps out with an internal error, especially when compiling bpl's. Converting a few dfm's to binary so that the total is under 32MB usually helps. This is in Delphi 6. Are newer versions better at this?

  6. C Johnson, HHoffmann, he explicitly said .dfm XML discussions should not be had here. But since you brought it up: I think what you say about a slowdown from .dfm text to XML is bull* (excuse the language). This is entirely dependent on how the parser is done. The old .dfm style is limiting, generates just as big or bigger files, and actually has less potential to be fast. But they should write their own (or use a free existing fast) traversal parser instead of using MSXML.

  7. I assume old string indexing will still work the same with the new string type?

    var
      s: string;
    begin
      s := 'Test';
    end;

    Will Length(s) = 4 and s[4] = 't' still hold?

  8. Like most, I am somewhat divided on .dfm's as XML (.dfmx?), but I see one clear advantage: future component versions that add/remove attributes could allow using the same .dfmx on two different Delphi versions without the unknown or missing property problems. Unknown/unused properties can be preserved.

    Another advantage: making tools that can interact with the .dfm files without actually knowing the classes on the form will be significantly easier with an XML-based format than working with today's typed .dfm contents.

    In theory, a .dfmx could support adding additional attributes to form content that don't necessarily map to Delphi class published properties. One idea would be to embed help comments linked to fields for a help generation tool.

    In addition, it would significantly improve the possibilities for seamless localisation of, e.g., form texts.

    As for space consumption - there is always the .dfmxz option? :)

    On the Unicode string mapping - what actually happens to Char? Will it also become Unicode? The impact of changing the implicit underlying string format to Unicode will probably need some new errors/warnings for those of us who like to fiddle with individual characters in strings. Maybe we need some new string tools to make it more efficient?

  9. I agree that the DFM should be XML and that opening up the XML parser as an open source project would mean the community would make it scream (like the FastMM project). But we are here to talk about Unicode.

    What I would like to see sooner rather than later, so we could start looking at our code now, is a tool that would point out possible problem areas in our code.

  10. Chris,

    That code would work just fine.

    Allen.

  11. Exactly what advantage do you get from a DFM as XML? The extra size and parser overhead seem to negate any advantages.

    Allen.

  12. Atle,

    Element index is no different than before. That code works fine.

    Allen.

  13. For a form which contains many images and/or other binary data, the binary DFM format will be preferable to the XML format.

  14. For now (D2007) you cannot write this in a UTF-8 encoded source:

    private
      var FLänge : Integer;
    published
      property Länge: Integer read FLänge write FLänge;

    The private var is OK, but the property identifier causes an error.
    I just wonder if Unicode will also be enabled for RTTI structures?

  15. Franz (excellent name ;-),

    That is correct. Right now, non-ASCII identifiers in published sections are not allowed. We're going to be reviewing this limitation for Tiburón.

    Allen.

  16. (In D2007)
    private
      var FLänge : Integer;
    published
      property Länge: Integer read FLänge write FLänge;

    Put your cursor before the "ä" in "var FLänge : Integer;", then press the down arrow key. It works as expected.

    Put your cursor after the "ä" in "var FLänge : Integer;", then press the down arrow key. The cursor is positioned one character to the right.

  17. Ref. #13 - Personally, I believe the possible advantages I mentioned in #9 outweigh the relatively minor parsing overhead.

    In my previous position we handled real-time data streams in XML for some 30,000 instruments from 25 exchanges on a low-end PC (tens of millions of packets over a day) without any noticeable impact on CPU load. Surely we can handle reading and manipulating a DOM tree during coding/design, and parsing a few changed .dfmx files during a compile?

    In the context of post #16, if you take a more .rc-like approach to .dfmx files and keep, e.g., bitmaps external to the file at design time, you might actually save space by referring to an external bitmap file from multiple .dfmx's rather than having an embedded copy in each .dfm.

    Before someone goes "but it's an advantage to have all resources in one place" and "you will forget what bitmaps you need"... Would you forget what units your project needs? Include a bitmap or include a unit - same concept. Actually, the "embeddedness" of, e.g., an ImageList has always been a challenge when you need to change its contents.

    IMO, having the descriptive .dfmx form as XML at design time has definite benefits that are not easily added to the existing .dfm format (again see #9).

    As for the linked .dfmx - why would it need to be XML in the .exe? A BSON (Binary JSON) approach would be possible - essentially removing the human-readable portion of XML and leaving a compact binary stream.

  18. If you are going to adjust RTTI for Unicode, why not also adjust RTTI to handle private and protected sections? That would be a stepping stone toward good object structure in TForm etc. descendants as well.

    Actually, the benefits of XML inside .DFM could fill a big article.

    You say XML means bigger files - then tell me how? I did a little sample with a lot of different types in a .DFM. The text-based version was 1413 bytes and the XML was 1027 bytes.

    Here's the text-based version:
    object fmMain: TfmMain
      Left = 416
      Top = 45
      ClientHeight = 745
      ClientWidth = 781
      Color = clBtnFace
      Font.Charset = DEFAULT_CHARSET
      Font.Color = clWindowText
      Font.Height = -11
      Font.Name = 'Verdana'
      object gridResults: TcxGrid
        Left = 0
        Top = 0
        Width = 781
        Height = 255
        object viewResults: TcxGridDBTableView
          OnMouseMove = viewResultsMouseMove
          OnCellDblClick = viewResultsCellDblClick
          OnCustomDrawCell = viewResultsCustomDrawCell
          DataController.DataSource = dSOAP.srcTrips
          DataController.Summary.DefaultGroupSummaryItems =
          DataController.Summary.FooterSummaryItems =
          DataController.Summary.SummaryGroups =
          OptionsBehavior.CellHints = True
          OptionsCustomize.ColumnFiltering = False
          OptionsCustomize.ColumnGrouping = False
          object viewResultsdeparture: TcxGridDBColumn
            Properties.Alignment.Horz = taCenter
            Properties.Alignment.Vert = taVCenter
            Properties.Items =

    And actually, when you update VCL RTTI to a better level, with object reference handling etc., then you really would want the XML model.

    But all that could only be covered in a big article.

    And what says XML loading has to be slower than the text-based format? It all depends on how you choose to load it. I do not recommend using MSXML for this; you need to do serial loading top to bottom, moving data to properties as you go (just as today). You don't need to create a complete object tree representing the values before you move them into the properties...

  19. Looks like the sample code did not get through.

    If you want to see the XML and DFM sample, I've posted them inside borland.public.attachments with "XML versus textbased DFM" as subject.

  20. XML size will only be a problem if you embed binary data (bitmaps, etc) in the XML file. It suddenly struck me that this is currently done in the .dfm files too... Hence, the size issue is not really a deciding factor.

    IMO, binary content in forms that is not generated by component streaming should be stored as separate file references where possible.

  21. Rasmus Møller Selsmark, January 15, 2008 at 4:18 PM

    Any information on when the Tiburón field test will start?

  22. Ad #1, #3:
    I think it is very important for us to have a "compatibility switch" between Unicode and non-Unicode (similar to {$H+}).
    I am not so concerned about my own applications, in which I could easily change all types from String to AnsiString, but about all my third-party tools, many of which are fairly old and whose inner workings I do not understand well.
    Without the compatibility switch, many of us would be stuck with D2007 for a very long time, which could have quite a negative effect on sales of new Delphi versions. It would also create demand for long-term support of D2007.
    Summing up: not having this compatibility switch would be a very unfortunate decision, IMO.

  23. It's nice to see that the Unicode support within Delphi is evolving.
    Therefore I only wanted to say something about that *.dfmX thingy.
    I'd prefer it to stay as it is; XML is overrated when it comes to storing information.
    After all, I like the way the *.dfm's look, as they can be handled intuitively (they look like Pascal code)...

