Monday, January 28, 2008

Meanwhile, back at the (Unicode) ranch

 

I use the TStrings.LoadFromFile/SaveToFile methods all the time to read/write text files and manipulate them.  Does this mean all these files are now Unicode and I can't read/write the files in the old format?

I've gotten this specific question a few times recently, so I figured I'd address it directly.  The bottom line here is that we've got you covered.  The defaults for those TStrings methods will now write the files ANSI-encoded based on the active code page, and will read the files based on whether or not the file contains a Byte-Order-Mark (BOM).  If a BOM is found, the data is read using the encoding the BOM indicates.  If no BOM is found, the data is read as ANSI and up-converted based on the current active code page.  All your files written with pre-Tiburón versions of Delphi will still be read in just fine, with the caveat that the active code page at read time must match the one in effect when the file was written.  Likewise, any file written with Tiburón (unless you override things, which I'll describe in a moment) should be readable by a pre-Tiburón version.  At this point, only the most common BOM formats are detected (UTF-16 Little-Endian, UTF-16 Big-Endian, and UTF-8).
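
To make that concrete, here's a minimal sketch of the default round trip (no encoding parameter anywhere; the file name is just for illustration):

var
  S: TStringList;
begin
  S := TStringList.Create;
  try
    S.Add('Hello');
    // No encoding specified: the file is written as ANSI text in the active
    // code page with no BOM, just like a pre-Tiburón version would write it.
    S.SaveToFile('legacy.txt');

    // No BOM is found on load, so the text is read back as ANSI and
    // up-converted to Unicode using the current active code page.
    S.LoadFromFile('legacy.txt');
  finally
    S.Free;
  end;
end;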

That's great news!  But what if I want these files to be written as Unicode?

We realize that many Delphi applications will need to continue to interact with other applications or data sources, many of which can only handle data in ANSI or even ASCII.  This is why TStrings exhibits the default behavior I described above.  However, many of you will want to read/write text data using the TStrings class in a loss-less Unicode format, be that UTF-16 Little-Endian, UTF-16 Big-Endian, UTF-8, UTF-7, etc.  This led us to introduce a new TEncoding class in SysUtils.pas.  This class is very similar in methods and functionality to the System.Text.Encoding class in the .NET Framework.  TEncoding contains several class static functions that return standard instances of TEncoding descendants handling many of the common encoding formats.  You can also create your own TEncoding descendant should you feel the need to handle encodings that may be unique to your situation (say, Base64 or Quoted-Printable).
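
As a quick sketch, grabbing one of those standard instances and using it looks roughly like this (GetUTF8 is the accessor used in the example further down; the GetBytes call is my assumption that TEncoding mirrors System.Text.Encoding here, so treat it as illustrative only):

var
  Bytes: TBytes;
begin
  // The standard encodings are singletons handed back by class static
  // functions on TEncoding, so you don't create or free them yourself.
  Bytes := TEncoding.GetUTF8.GetBytes('Tiburón');
end;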

The TEncoding class is key to how the TStrings class now handles reading/writing its contents in the various formats.  There are now overloaded versions of the TStrings.LoadFrom/SaveToXXXX methods that take an instance of a TEncoding class.  In most cases you'll just want to pass in one of the singleton instances that you can obtain by calling one of the class static functions on the TEncoding class.  For instance, if you want to write the data out as UTF-8, you'd call it like this:

var
  S: TStringList;
begin
  S := TStringList.Create;
  try
    ...
    // Writes the list out UTF-8 encoded instead of as ANSI text
    S.SaveToFile('config.txt', TEncoding.GetUTF8);
    ...
  finally
    S.Free;
  end;
end;

Without the extra parameter, 'config.txt' would simply be converted and written out ANSI-encoded based on the current active code page.  The interesting thing to note here is that you don't need to change the reading code, since TStrings will automatically detect the encoding based on the BOM and do the right thing.  If you want to force the file to be read and written using a specific code page, you can create an instance of TMBCSEncoding and pass the code page you want to use into the constructor.  You then use that instance for both reading and writing, since the specific code page may not match the user's active code page.
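
For instance, pinning a file to Windows code page 1251 might look something like the following sketch (the exact constructor signature hasn't been finalized, so take the details as assumptions; 'data.txt' and 1251 are just for illustration):

var
  S: TStringList;
  Enc: TEncoding;
begin
  S := TStringList.Create;
  try
    // An encoding tied to one specific ANSI code page, assuming TMBCSEncoding
    // takes the code page in its constructor as described above.
    Enc := TMBCSEncoding.Create(1251);
    try
      S.LoadFromFile('data.txt', Enc);
      // ... manipulate the strings ...
      S.SaveToFile('data.txt', Enc);
    finally
      Enc.Free;
    end;
  finally
    S.Free;
  end;
end;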


I use INI files for configuration and extensively use TIniFile and TMemIniFile.  What about those?


The same holds for these classes: the data will be read and written as ANSI.  Since INI files have traditionally been ANSI (or ASCII) encoded, it may not make sense to convert them; it will depend on the needs of your application.  If you do wish to switch to a Unicode format, we'll offer ways to use the TEncoding classes to accomplish that as well.
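
Nothing is final there yet, but purely as a hypothetical sketch of the sort of thing we have in mind (the constructor overload taking a TEncoding shown here does not exist today), it could look something like this:

var
  Ini: TMemIniFile;
begin
  // Hypothetical overload: pass the encoding to use when the file is read
  // and later written back by UpdateFile.
  Ini := TMemIniFile.Create('settings.ini', TEncoding.GetUTF8);
  try
    Ini.WriteString('User', 'Name', 'Tiburón');
    Ini.UpdateFile;
  finally
    Ini.Free;
  end;
end;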


In all the above cases the internal storage will be Unicode, and any data manipulation you do with strings will continue to function as expected (unless you're doing some of those odd data-payload tricks I've mentioned previously).  Conversions will happen automatically when reading and writing the data.  Those of you who are familiar with VCL for .NET already know how all of the above works, since the new TStrings overloads were introduced with VCL for .NET and use the System.Text.Encoding class for all these operations.

35 comments:

  1. Thanks for this great information Allen. I still do have two questions:

    1. Can TMemIniFile.ReadString / WriteString handle characters up to #255 which represent binary data instead of text? Maybe this can be accomplished with a custom TEncoding class?

    2. Do you have any thoughts on the Text type (WriteLn etc.)?

  2. Really good.

    In our application, we have our own UnicodeFile unit, and the INI file very much needs to be Unicode for the string parameters the user defines!!

    Hope we can test this early.

  3. What about files that are clearly Unicode but have no BOM? Many Unicode libs have functions that auto-detect Unicode. Will Delphi?

  4. Providing some mechanism for determining the encoding of streams/text without a BOM is essential, especially when dealing with external files. A case in point: XML does not require a BOM in many cases.
    One solution that would minimize overhead is to define an optional event (e.g. TStrings.OnFindEncoding?) which could be invoked if no BOM is detected and would allow applications to apply their own rules/heuristics to identify the encoding.

  5. >>A case in point: XML does not require a BOM in many cases.

    Yes! If you export XML from Excel, there is no BOM. Hope TStringList can handle this automatically.

  6. If my program tries to read a file without a BOM, I will check the encoding of the file myself instead of assuming it is ANSI.

  7. I'd prefer to have UTF-8 as the default encoding when saving a StringList. If the default is ANSI and somebody forgets the additional parameter, this would easily result in a "Unicode loss" bug.

  8. What if I want the strings stored internally in a TStringList to be ASCII/ANSI strings??? (i.e. I have a long list of keywords or an English dictionary word list which has no need for Unicode) - storing it in the new TStringList will immediately double my memory consumption for exactly the same data... how do we handle this?

    Or will there now be a TAnsiStringList class that we should use instead???

  9. When outputting files, is it possible to control whether or not a BOM is written? Some applications won't recognise/expect a BOM and will not work correctly if they encounter one.

  10. Thanks Allen,
    Unicode files without a BOM are a very important issue in some areas. For instance, a BOM is a problem in PHP files. So both BOM reading and writing should be controlled by an optional property or parameter.

  11. Concerned,

    "Or will there now be a TAnsiStringList class that we should use instead???"

    We'll certainly consider this.

    Allen.

  12. "When outputting files is it possible to control if a BOM is written or not?"
    "The unicode files without BOM is very important issue for some topics."

    I will have to check on whether or not we'll allow writing the files without the BOM. I'd imagine there is a way to do it, and if there isn't right now, I'll add it as a suggestion.

    Allen.

  13. Sebastian,

    The presumption here is that just because the strings are Unicode, your application doesn't magically start injecting Unicode characters into them. If your application was reading/writing ANSI data, the conversion is loss-less since the code page remains constant. The only time you may encounter loss is if there is a mismatch between the reader's and writer's code pages.

    Allen.

  14. Will methods like LoadFromStream and SaveToStream be overloaded as well?

  15. Bruce,

    "Will methods like LoadFromStream and SaveToStream will be overloaded as well?"

    Yes. I should have been clearer; I meant to cover that whole group of methods with the reference to "TStrings.LoadFrom/SaveToXXXX."

    Allen.

  16. Clinton,

    Yes. That function will move just fine. No changes are needed at all.

    Allen.

  17. Will the new Unicode work under Windows NT, or will it be strictly for Win2k and above? I don't mind that Win9x (Windows Playstation?) won't be supported.

  18. Bruce,

    We're still evaluating whether or not we'll certify targeting of NT. I would imagine that it should work, since NT was Unicode from the start. However, there may be some APIs and functionality that do not exist on earlier NT versions, which may make some things incompatible. I highly doubt anything before NT4 SP4+ would work well.

    Allen.

  19. Thanks. If NT4 is supported, then I think it's reasonable to expect users to have at least SP4(a) installed. The only ones I have to worry about are fully patched.

  20. "We’ll certainly consider this." (TAnsiStringList)

    This does not inspire confidence. Unicode is a "must have" for the future, but it's an utter irrelevance for most existing Delphi applications.

    TStringList should be ANSI.

    TUnicodeList should be introduced as a new list class for supporting Unicode strings (TUnicodeStringList would contain unnecessary redundancy in the name IMHO - what other Unicode things might be in a list? Unicode integers?).

    By definition ONLY NEW Delphi applications will make use of Delphi Unicode support - forcing existing applications to jump through hoops simply to work as they did before is a recipe for losing upgrade sales.

    I'd rather wait a little longer for a successful Unicode delivery than get an early one which does not provide a practicable transition for existing, strictly ANSI, applications (which again, by definition, is surely pretty much ALL existing Delphi applications).

    Unless there is a radical rethink, this could be a fatal misstep for Delphi/CodeGear. I fear it may be too late for the rethink that is required, though.

    :(

  21. "We'll certainly consider this." (TAnsiStringList)

    No. What must be added is
    1. TStringList.SaveToFileEx(FileName: string; BS: TBOMStrategy)
    2. TStringList.DefaultBOMStrategy: TBOMStrategy.

    Allen, you have mentioned in a previous post that a switch between UnicodeString and AnsiString means double effort. But you see, sometimes double effort cannot be avoided. ;-)

  22. Forcing existing Delphi applications to Unicode would be a VERY_BAD_THING. As Jolyon said, D2007 could be the last Delphi bought by a large percentage of your users.

  23. It'd be desirable to have ANSI-encoded string fields in a database:

    TFieldType = (ftUnknown, ftString, ftAnsiString{!!!}, ...

    In some cases it'd be a big memory saving.

  24. I will need some way to override the BOM, especially for XML.

    Otherwise, I'm really happy with how CodeGear is going about this, and I don't see any big migration problems. And trust me, I'm concerned about forward and backward compatibility.

    I suspect that my biggest issue will be replacing PChar with PByte in a couple of places and adding some IFDEFS for previous versions of the compiler.

  25. Thanks for the info Allen,

    More and more it seems that you need to provide some refactorings like those in QC #56885 and QC #56886 (to convert from 'string' to 'AnsiString', or 'PChar' to 'PByte', like when we use Ctrl+Shift+J - but over a more targeted scope). Also, can you give more insight into how we'll deal with TField descendants? I assume that .AsString will return a Unicode string, won't it? Will we have an .AsAnsiString? Then we need to be very careful not to corrupt data in the cases where we use pseudo-ANSI strings to store encoded data (blobs, RTF data, etc.). Good series of blog posts. Keep them coming.

  26. We have a lot of places with the following code:

    var
      S: String;
      I: Integer;
    begin
      Stream.ReadBuffer(I, SizeOf(I));
      SetLength(S, I);
      if I > 0 then
        Stream.ReadBuffer(S[1], I);
    end;

    Will this code compile under the new version?

  27. A K,

    Unfortunately, that code will compile fine, but the resulting string will be corrupted. If you change the declaration of S: String to S: AnsiString, it will function as expected now *and* in Tiburón. As I've always stated, file I/O will be one of the areas of largest impact and will require more careful examination of the code.
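
    Applying that one change to the snippet above (everything else stays exactly as it was):

    var
      S: AnsiString;  // was "S: String"; AnsiString keeps the one-byte-per-element layout
      I: Integer;
    begin
      Stream.ReadBuffer(I, SizeOf(I));
      SetLength(S, I);
      if I > 0 then
        Stream.ReadBuffer(S[1], I);
    end;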

    Allen.

  28. Very bad. We have a huge project, 1.1M lines of code, that has been in development for the last 7 years. Checking all the places and fixing all the strings will be a big problem.

    By the way, there are a lot of components without sources or with limited sources... It seems impossible to fix them.

    Maybe it would be better to have a global switch. Something like UNICODESTRINGS ON|OFF?

  29. C Johnson > I hope you meant LongInt. DWORD is LongWord, i.e. unsigned.

    Q: 64-bit signed integer will be .... LongerInt? ;) LOL


    But it won't be a "problem" if Integer becomes 64-bit, unless you have code that directly relies on the max/min of a 32-bit Integer. At least, not a problem in the same way as with ANSI/Unicode, i.e. inherent (and silent) data loss.

  30. What will make this migration so much easier is a utility that can analyze a file or set of files and highlight potential Unicode-related problems such as possible string/AnsiString/WideString misuse, use of PChar, etc. Combined with a special comment ({UNICODEOK}?) to indicate procedures already "cleared", this will alleviate a LOT of problems and programmer stress, and thus make Delphi 2008 a very successful upgrade.
    I know I'd find this invaluable...

  31. Sorry, it wasn't at all clear that you were referring specifically to string lengths. You appeared to be talking about the use of "Integer" type variables, not a specific instance of a current "Integer" value in the RTL.

    But anyway, how useful, really, is a 2GB string? (2 GB of WideChars is only 1 G of Unicode characters.)

    I don't think it would be unreasonable to leave the length component of the String RTTI as a 32-bit signed int if that helps with compatibility (although I agree the use of a signed value for something that can never be negative is something of an anomaly).

  32. OK, say I have a huge project which relies heavily on the assumption that a string is a chain of one-byte symbols. If I simply replace all occurrences of String with AnsiString, will my project compile and work well under Delphi 2008?

  33. A K,

    Yes, with some qualifications. Any RTL function you call that takes "var" string parameters may not work if the parameter is a Unicode string and you try to pass an AnsiString. Also, event handlers should not be changed. Other than that, most things should continue to work.
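
    For example, something along these lines (TrimInPlace is just a made-up routine standing in for any call with a "var" string parameter):

    procedure TrimInPlace(var S: string);  // "string" now means UnicodeString
    begin
      S := Trim(S);
    end;

    var
      A: AnsiString;
    begin
      TrimInPlace(A);  // no longer compiles: a var parameter's type must match exactly
    end;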

    Allen.

  34. C Johnson,

    String, PChar and Char will symmetrically promote to UnicodeString, PWideChar, and WideChar respectively. Just like today, where you have to be explicit about any call to a "W" function in terms of using the right WideChar, PWideChar and, currently, WideString, the inverse will be true with Tiburón; you must explicitly use the "A" version of the API and be consistent in the usage of AnsiString, PAnsiChar, and AnsiChar.

    In the above function, you should not change the declaration to AnsiString, but rather simply assign the result to an AnsiString variable which will implicitly convert it to Ansi.

    Allen.

  35. Hi, thanks for updating this blog.

    I studied your blogs a little and came to the conclusion that the whole conversion to Unicode has a major impact on the software I've written. Though I totally agree with the shift, is there any test version, alpha/beta, or trial available to start rewriting against, or is it all too early?

    Will it help to start rewriting everything to wide strings now and later change it all back to normal operations? I am using many functions that loop through characters in single-byte increments.

    Will there be any fast general find/replace function and sorting routine? That would help with a lot of the rewriting difficulties. Any reaction appreciated.

    Regards Jason


Please keep your comments related to the post on which you are commenting. No spam, personal attacks, or general nastiness. I will be watching and will delete comments I find irrelevant, offensive and unnecessary.