Monday, January 28, 2008

Meanwhile, back at the (Unicode) ranch

 

I use the TStrings.LoadFromFile/SaveToFile methods all the time to read/write text files and manipulate them.  Does this mean all these files are now Unicode and I can't read/write the files in the old format?

I've gotten this specific question a few times recently, so I figured I'd address it directly.  The bottom line here is that we've got you covered.  The defaults for those TStrings methods will now write the files ANSI-encoded based on the active code page, and will read the files based on whether or not the file contains a Byte-Order-Mark (BOM).  If a BOM is found, the data is read using the encoding the BOM indicates.  If no BOM is found, the data is read as ANSI and up-converted based on the current active code page.  All your files written with pre-Tiburón versions of Delphi will still be read in just fine, with the caveat that the active code page at read time must match the one in effect when the file was written.  Likewise, any file written with Tiburón (unless you override things, which I'll describe in a moment) should be readable by a pre-Tiburón version.  At this point, only the most common BOM formats are detected (UTF-16 Little-Endian, UTF-16 Big-Endian, and UTF-8).
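
To make that concrete, here's a minimal sketch of the default round trip (no encoding parameter anywhere; the file name is just for illustration):

var
  S: TStringList;
begin
  S := TStringList.Create;
  try
    S.Add('Hello');
    // No encoding specified: the file is written as ANSI text in the active
    // code page with no BOM, just like a pre-Tiburón version would write it.
    S.SaveToFile('legacy.txt');

    // No BOM is found on load, so the text is read back as ANSI and
    // up-converted to Unicode using the current active code page.
    S.LoadFromFile('legacy.txt');
  finally
    S.Free;
  end;
end;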

That's great news!  But what if I want these files to be written as Unicode?

We realize that many Delphi applications will need to continue to interact with other applications or data sources, many of which can only handle data in ANSI or even ASCII.  This is why TStrings exhibits the default behavior I described above.  However, many of you will want to read/write text data using the TStrings class in a loss-less Unicode format, be that UTF-16 Little-Endian, UTF-16 Big-Endian, UTF-8, UTF-7, etc.  This led us to introduce a new TEncoding class in SysUtils.pas.  This class is very similar in methods and functionality to the System.Text.Encoding class in the .NET Framework.  TEncoding contains several class static functions that return standard instances of TEncoding descendants handling many of the common encoding formats.  You can also create your own TEncoding descendant should you feel the need to handle encodings that may be unique to your situation (say, Base64 or Quoted-Printable).
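
As a quick sketch, grabbing one of those standard instances and using it looks roughly like this (GetUTF8 is the accessor used in the example further down; the GetBytes call is my assumption that TEncoding mirrors System.Text.Encoding here, so treat it as illustrative only):

var
  Bytes: TBytes;
begin
  // The standard encodings are singletons handed back by class static
  // functions on TEncoding, so you don't create or free them yourself.
  Bytes := TEncoding.GetUTF8.GetBytes('Tiburón');
end;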

The TEncoding class is key to how the TStrings class now handles reading/writing its contents in the various formats.  There are now overloaded versions of the TStrings.LoadFrom/SaveToXXXX methods that take an instance of a TEncoding class.  In most cases you'll just want to pass in one of the singleton instances that you can obtain by calling one of the class static functions on the TEncoding class.  For instance, if you want to write the data out as UTF-8, you'd call it like this:

var
  S: TStringList;
begin
  S := TStringList.Create;
  try
    ...
    // Writes the list out UTF-8 encoded instead of as ANSI text
    S.SaveToFile('config.txt', TEncoding.GetUTF8);
    ...
  finally
    S.Free;
  end;
end;

Without the extra parameter, 'config.txt' would simply be converted and written out ANSI-encoded based on the current active code page.  The interesting thing to note here is that you don't need to change the reading code, since TStrings will automatically detect the encoding based on the BOM and do the right thing.  If you want to force the file to be read and written using a specific code page, you can create an instance of TMBCSEncoding and pass the code page you want to use into the constructor.  You then use that instance for both reading and writing, since the specific code page may not match the user's active code page.
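
For instance, pinning a file to Windows code page 1251 might look something like the following sketch (the exact constructor signature hasn't been finalized, so take the details as assumptions; 'data.txt' and 1251 are just for illustration):

var
  S: TStringList;
  Enc: TEncoding;
begin
  S := TStringList.Create;
  try
    // An encoding tied to one specific ANSI code page, assuming TMBCSEncoding
    // takes the code page in its constructor as described above.
    Enc := TMBCSEncoding.Create(1251);
    try
      S.LoadFromFile('data.txt', Enc);
      // ... manipulate the strings ...
      S.SaveToFile('data.txt', Enc);
    finally
      Enc.Free;
    end;
  finally
    S.Free;
  end;
end;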


I use INI files for configuration and extensively use TIniFile and TMemIniFile.  What about those?


The same holds for these classes: the data will be read and written as ANSI.  Since INI files have traditionally been ANSI (or ASCII) encoded, it may not make sense to convert them; it will depend on the needs of your application.  If you do wish to switch to a Unicode format, we'll offer ways to use the TEncoding classes to accomplish that as well.
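
Nothing is final there yet, but purely as a hypothetical sketch of the sort of thing we have in mind (the constructor overload taking a TEncoding shown here does not exist today), it could look something like this:

var
  Ini: TMemIniFile;
begin
  // Hypothetical overload: pass the encoding to use when the file is read
  // and later written back by UpdateFile.
  Ini := TMemIniFile.Create('settings.ini', TEncoding.GetUTF8);
  try
    Ini.WriteString('User', 'Name', 'Tiburón');
    Ini.UpdateFile;
  finally
    Ini.Free;
  end;
end;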


In all the above cases the internal storage will be Unicode, and any data manipulation you do with strings will continue to function as expected (unless you're doing some of those odd data-payload tricks I've mentioned previously).  Conversions will happen automatically when reading and writing the data.  Those of you who are familiar with VCL for .NET already know how all of the above works, since the new TStrings overloads were introduced with VCL for .NET and use the System.Text.Encoding class for all these operations.

35 comments:

  1. Thanks for this great information Allen. I still do have two questions:

    1. Can TMemIniFile.ReadString / WriteString handle characters up to #255 which represent binary data instead of text? Maybe this can be accomplished with a custom TEncoding class?

    2. Do you have any thoughts on the Text type (WriteLn etc.)?

  2. Really good.

    In our application, we have our own UnicodeFile unit, and the INI file very much needs to be Unicode for the string parameters the user defines!!

    Hope we can test this early.

  3. What about files that are clearly Unicode but have no BOM? Many Unicode libs have functions that auto-detect Unicode. Will Delphi?

  4. Providing some mechanism for determining the encoding of streams/text without a BOM is essential, especially when dealing with external files. A case in point: XML does not require a BOM in many cases.
    One solution that would minimize overhead is to define an optional event (e.g. TStrings.OnFindEncoding?) which could be invoked if no BOM is detected and would allow applications to apply their own rules/heuristics to identify the encoding.

  5. >>A case in point: XML does not require a BOM in many cases.

    Yes! If you export XML from Excel, there is no BOM. Hope TStringList can handle this automatically.

  6. If my program tries to read a file without a BOM, I will check the encoding of the file myself instead of assuming it is ANSI.

  7. I'd prefer to have UTF-8 as the default encoding when saving a StringList. If the default is ANSI and somebody forgets the additional parameter, this would easily result in a "Unicode loss" bug.

  8. What if I want the strings stored internally in a TStringList to be ASCII/ANSI strings??? (i.e. I have a long list of keywords or an English dictionary word list which has no need for Unicode) - storing it in the new TStringList will immediately double my memory consumption for exactly the same data... how do we handle this?

    Or will there now be a TAnsiStringList class that we should use instead???

  9. When outputting files, is it possible to control whether or not a BOM is written? Some applications won't recognise/expect a BOM and will not work correctly if they encounter one.

  10. Thanks Allen,
    Unicode files without a BOM are a very important issue in some areas. For instance, a BOM is a problem in PHP files. So both BOM reading and writing should be controlled by an optional property or parameter.

  11. Concerned,

    "Or will there now be a TAnsiStringList class that we should use instead???"

    We'll certainly consider this.

    Allen.

  12. "When outputting files is it possible to control if a BOM is written or not?"
    "The unicode files without BOM is very important issue for some topics."

    I will have to check on whether or not we'll allow writing the files without the BOM. I'd imagine there is a way to do it, and if there isn't right now, I'll add it as a suggestion.

    Allen.

  13. Sebastian,

    The presumption here is that just because the strings are Unicode, your application doesn't magically start injecting Unicode characters into them. If your application was reading/writing ANSI data, the conversion is loss-less since the code page remains constant. The only time you may encounter loss is if there is a mismatch between the reader's and writer's code pages.

    Allen.

  14. Will methods like LoadFromStream and SaveToStream be overloaded as well?

  15. Bruce,

    "Will methods like LoadFromStream and SaveToStream will be overloaded as well?"

    Yes. I should have been clearer; I meant to cover that whole group of methods with the reference to "TStrings.LoadFrom/SaveToXXXX."

    Allen.

  16. Clinton,

    Yes. That function will move just fine. No changes are needed at all.

    Allen.

  17. Will the new Unicode work under Windows NT, or will it be strictly for Win2k and above? I don't mind that Win9x (Windows Playstation?) won't be supported.

  18. Bruce,

    We're still evaluating whether or not we'll certify targeting of NT. I would imagine that it should work, since NT was Unicode from the start. However, there may be some APIs and functionality that do not exist on earlier NT versions, which may make some things incompatible. I highly doubt anything before NT4 SP4+ would work well.

    Allen.

  19. Thanks. If NT4 is supported, then I think it's reasonable to expect users to have at least SP4(a) installed. The only ones I have to worry about are fully patched.

  20. "We’ll certainly consider this." (TAnsiStringList)

    This does not inspire confidence. Unicode is a "must have" for the future, but it's an utter irrelevance for most existing Delphi applications.

    TStringList should be ANSI.

    TUnicodeList should be introduced as a new list class for supporting Unicode strings (TUnicodeStringList would contain unnecessary redundancy in the name IMHO - what other Unicode things might be in a list? Unicode integers?).

    By definition ONLY NEW Delphi applications will make use of Delphi Unicode support - forcing existing applications to jump through hoops simply to work as they did before is a recipe for losing upgrade sales.

    I'd rather wait a little longer for a successful Unicode delivery than get an early one which does not provide a practicable transition for existing, strictly ANSI, applications (which again, by definition, is surely pretty much ALL existing Delphi applications).

    Unless there is a radical rethink, this could be a fatal misstep for Delphi/CodeGear. I fear it may be too late for the rethink that is required, though.

    :(

  21. "We'll certainly consider this." (TAnsiStringList)

    No. What must be added is
    1. TStringList.SaveToFileEx(FileName: string; BS: TBOMStrategy)
    2. TStringList.DefaultBOMStrategy: TBOMStrategy.

    Allen, you have mentioned in a previous post that a switch between UnicodeString and AnsiString means double effort. But you see, sometimes double effort cannot be avoided. ;-)

  22. Forcing existing Delphi applications to Unicode would be a VERY_BAD_THING. As Jolyon said, D2007 could be the last Delphi bought by a large percentage of your users.

  23. It'd be desirable to have ANSI-encoded string fields in a database:

    TFieldType = (ftUnknown, ftString, ftAnsiString{!!!}, ...

    In some cases it'd be a big memory saving.

  24. I will need some way to override the BOM, especially for XML.

    Otherwise, I'm really happy with how CodeGear is going about this, and I don't see any big migration problems. And trust me, I'm concerned about forward and backward compatibility.

    I suspect that my biggest issue will be replacing PChar with PByte in a couple of places and adding some IFDEFS for previous versions of the compiler.

  25. Thanks for the info Allen,

    More and more it seems that you need to provide some refactorings like those in QC #56885 and QC #56886 (to convert from 'string' to 'AnsiString', or 'PChar' to 'PByte', like when we use Ctrl+Shift+J - but over a more targeted scope). Also, can you give more insight into how we'll deal with TField descendants? I assume that .AsString will return a Unicode string, won't it? Will we have an .AsAnsiString? Then we need to be very careful not to corrupt data in the cases where we use pseudo-ANSI strings to store encoded data (blobs, RTF data, etc.). Good series of blog posts. Keep them coming.

  26. We have a lot of places with the following code:

    var
      S: String;
      I: Integer;
    begin
      Stream.ReadBuffer(I, SizeOf(I));
      SetLength(S, I);
      if I > 0 then
        Stream.ReadBuffer(S[1], I);
    end;

    Will this code compile under the new version?

  27. A K,

    Unfortunately, that code will compile fine, but the resulting string will be corrupted. If you change the declaration of S: String to S: AnsiString, it will function as expected now *and* in Tiburón. As I've always stated, file I/O will be one of the areas of largest impact and will require more careful examination of the code.
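
    Applying that one change to the snippet above (everything else stays exactly as it was):

    var
      S: AnsiString;  // was "S: String"; AnsiString keeps the one-byte-per-element layout
      I: Integer;
    begin
      Stream.ReadBuffer(I, SizeOf(I));
      SetLength(S, I);
      if I > 0 then
        Stream.ReadBuffer(S[1], I);
    end;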

    Allen.

  28. Very bad. We have a huge project, 1.1M lines of code, that has been in development for the last 7 years. Checking all the places and fixing all the strings will be a big problem.

    By the way, there are a lot of components without sources or with limited sources... It seems impossible to fix them.

    Maybe it would be better to have a global switch. Something like UNICODESTRINGS ON|OFF?

  29. C Johnson > I hope you meant LongInt. DWORD is LongWord, i.e. unsigned.

    Q: 64-bit signed integer will be .... LongerInt? ;) LOL


    But it won't be a "problem" if Integer becomes 64-bit, unless you have code that directly relies on the max/min of a 32-bit Integer. At least, not a problem in the same way as with ANSI/Unicode, i.e. inherent (and silent) data loss.

  30. What will make this migration so much easier is a utility that can analyze a file or set of files and highlight potential Unicode-related problems such as possible string/AnsiString/WideString misuse, use of PChar, etc. Combined with a special comment ({UNICODEOK}?) to indicate procedures already "cleared", this will alleviate a LOT of problems and programmer stress, and thus make Delphi 2008 a very successful upgrade.
    I know I'd find this invaluable...

  31. Sorry, it wasn't at all clear that you were referring specifically to string lengths. You appeared to be talking about the use of "Integer" type variables, not a specific instance of a current "Integer" value in the RTL.

    But anyway, how useful, really, is a 2GB string? (2 GB of WideChars is only 1 G of Unicode characters.)

    I don't think it would be unreasonable to leave the length component of the String RTTI as a 32-bit signed int if that helps with compatibility (although I agree the use of a signed value for something that can never be negative is something of an anomaly).

  32. OK, say I have a huge project which relies heavily on the assumption that a string is a chain of one-byte symbols. If I simply replace all occurrences of String with AnsiString, will my project compile and work well under Delphi 2008?

  33. A K,

    Yes, with some qualifications. Any RTL function you call that takes "var" string parameters may not work if the parameter is a Unicode string and you try to pass an AnsiString. Also, event handlers should not be changed. Other than that, most things should continue to work.
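
    For example, something along these lines (TrimInPlace is just a made-up routine standing in for any call with a "var" string parameter):

    procedure TrimInPlace(var S: string);  // "string" now means UnicodeString
    begin
      S := Trim(S);
    end;

    var
      A: AnsiString;
    begin
      TrimInPlace(A);  // no longer compiles: a var parameter's type must match exactly
    end;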

    Allen.

  34. C Johnson,

    String, PChar and Char will symmetrically promote to UnicodeString, PWideChar, and WideChar respectively. Just like today, where you have to be explicit about any call to a "W" function in terms of using the right WideChar, PWideChar and, currently, WideString, the inverse will be true with Tiburón; you must explicitly use the "A" version of the API and be consistent in the usage of AnsiString, PAnsiChar, and AnsiChar.

    In the above function, you should not change the declaration to AnsiString, but rather simply assign the result to an AnsiString variable which will implicitly convert it to Ansi.

    Allen.

  35. Hi, thanks for updating this blog.

    I studied your blogs a little and came to the conclusion that the whole conversion to Unicode has a major impact on the software I've written. Though I totally agree with the shift, is there any test version, alpha/beta, or trial available to start rewriting against, or is it all too early?

    Will it help to start rewriting everything to wide strings now and later change it all back to normal operations? I am using many functions that loop through characters in single-byte increments.

    Will there be any fast general find/replace function and sorting routine? That would help with a lot of the rewriting difficulties. Any reaction appreciated.

    Regards Jason


Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.