Monday, January 28, 2008

Meanwhile, back at the (Unicode) ranch


I use the TStrings.LoadFromFile/SaveToFile methods all the time to read/write text files and manipulate them.  Does this mean all these files are now Unicode and I can't read/write the files in the old format?

I've gotten this specific question a few times recently, so I figured I'd address it directly.  The bottom line here is that we've got you covered.  By default, those TStrings methods will now write files ANSI encoded based on the active code page, and will read files based on whether or not the file contains a Byte-Order-Mark (BOM).  If a BOM is found, the data is read in the encoding the BOM indicates.  If no BOM is found, the data is read as ANSI and up-converted based on the current active code page.  All your files written with pre-Tiburón versions of Delphi will still be read in just fine, provided the active code page when reading matches the one in effect when the file was written.  Likewise, any file written with Tiburón (unless you override things, which I'll describe in a moment) should be readable by pre-Tiburón versions.  At this point, only the most common BOM formats are detected (UTF16 Little-Endian, UTF16 Big-Endian and UTF8).
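
The read-side rule above can be sketched in a few lines.  This is only an illustration of the decision being made, not the RTL's actual code: DescribeBom is a hypothetical helper, and the byte patterns are the standard BOMs for the three formats listed.

```delphi
program BomSniff;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of the RTL): names the BOM, if any,
// found at the start of a byte buffer.
function DescribeBom(const Data: TBytes): string;
begin
  if (Length(Data) >= 3) and (Data[0] = $EF) and (Data[1] = $BB) and (Data[2] = $BF) then
    Result := 'UTF-8'
  else if (Length(Data) >= 2) and (Data[0] = $FF) and (Data[1] = $FE) then
    Result := 'UTF-16 little-endian'
  else if (Length(Data) >= 2) and (Data[0] = $FE) and (Data[1] = $FF) then
    Result := 'UTF-16 big-endian'
  else
    // No recognized BOM: fall back to ANSI in the active code page.
    Result := 'no BOM, treat as ANSI';
end;

begin
  Writeln(DescribeBom(TBytes.Create($EF, $BB, $BF, $41)));  // UTF-8 BOM + 'A'
  Writeln(DescribeBom(TBytes.Create($FF, $FE, $41, $00)));  // UTF-16 LE BOM + 'A'
  Writeln(DescribeBom(TBytes.Create($FE, $FF, $00, $41)));  // UTF-16 BE BOM + 'A'
  Writeln(DescribeBom(TBytes.Create($41, $42)));            // plain 'AB'
end.
```

The UTF-8 check comes first only for readability; the three patterns cannot be confused with one another because their first bytes differ or their second bytes differ.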

That's great news!  But what if I want these files to be written as Unicode?

We realize that many Delphi applications will need to continue to interact with other applications or data sources, many of which can only handle data in ANSI or even ASCII.  This is why TStrings exhibits the default behavior I described above.  However, many of you will want to read/write text data using the TStrings class in a lossless Unicode format, be that Little-Endian UTF16, Big-Endian UTF16, UTF8, UTF7, etc.  This led us to introduce a new TEncoding class in SysUtils.pas.  Its methods and functionality closely mirror those of the System.Text.Encoding class in the .NET Framework.  TEncoding contains several static functions that return standard instances of TEncoding class descendants that handle many of the common encoding formats.  You can also create your own TEncoding class descendant should you feel the need to handle encodings that may be unique to your situation (say, for Base64 or Quoted-Printable).
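
As a rough sketch of what working with the standard instances looks like, the program below encodes one string several ways and checks whether it survives the round trip.  It assumes the accessors are exposed as the class properties TEncoding.UTF8, TEncoding.Unicode, TEncoding.BigEndianUnicode and TEncoding.ASCII, along with the GetBytes, GetString and GetPreamble methods; the Show procedure and the sample text are made up for illustration.

```delphi
program EncodingDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Illustrative helper: encode S with Enc and report the encoded size,
// the size of the encoding's BOM (preamble), and whether decoding the
// bytes gives back the original string.
procedure Show(const Name: string; Enc: TEncoding; const S: string);
var
  Bytes: TBytes;
begin
  Bytes := Enc.GetBytes(S);
  Writeln(Format('%-9s %d bytes, BOM %d bytes, round-trips: %s',
    [Name, Length(Bytes), Length(Enc.GetPreamble),
     BoolToStr(Enc.GetString(Bytes) = S, True)]));
end;

const
  Sample = 'Tibur'#$00F3'n';  // contains one non-ASCII character

begin
  // The standard instances are shared singletons; do not free them.
  Show('UTF-8', TEncoding.UTF8, Sample);
  Show('UTF-16LE', TEncoding.Unicode, Sample);
  Show('UTF-16BE', TEncoding.BigEndianUnicode, Sample);
  Show('ASCII', TEncoding.ASCII, Sample);
end.
```

The ASCII line is the interesting one: since the sample contains a character outside the ASCII range, that encoding cannot be expected to return the original string, which is the loss the Unicode formats avoid.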

The TEncoding class is key to how the TStrings class can now read/write its contents in the various formats.  There are overloaded versions of the TStrings LoadFromXXXX/SaveToXXXX methods that take an instance of a TEncoding class.  In most cases you'll just want to pass in one of the singleton instances that you can obtain from the class static methods on the TEncoding class.  For instance, to write the data out as UTF8, you'd call it like this:

S := TStringList.Create;  // S is declared as TStringList
try
  S.SaveToFile('config.txt', TEncoding.UTF8);  // the extra argument selects UTF8
finally
  S.Free;
end;

Without the extra parameter, 'config.txt' would simply be converted and written out as ANSI encoded based on the current active code page.  The interesting thing to note here is that you don't need to change the read code, since TStrings will automatically detect the encoding based on the BOM and do the right thing.  If you want to force the file to be read and written using a specific code page, you can create an instance of TMBCSEncoding and pass the code page you want into the constructor.  You then use that instance for both reading and writing the file, since the specific code page may not match the user's active code page.
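
Both cases can be sketched together.  The program below assumes the TStrings methods are named LoadFromFile/SaveToFile, that the UTF8 singleton is reachable as TEncoding.UTF8, and that TMBCSEncoding has a constructor taking a code page; the file names and code page 1252 are arbitrary choices for the example.

```delphi
program EncodingRoundTrip;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  Sample = 'name=Tibur'#$00F3'n';

var
  Lines: TStringList;
  Latin1: TEncoding;
begin
  Lines := TStringList.Create;
  try
    Lines.Add(Sample);

    // Write as UTF8, then read back with no encoding argument.
    // The load is expected to pick UTF8 because of the BOM.
    Lines.SaveToFile('config-utf8.txt', TEncoding.UTF8);
    Lines.Clear;
    Lines.LoadFromFile('config-utf8.txt');
    Writeln('UTF8 round trip intact: ', Lines[0] = Sample);

    // Pin a specific code page for both directions, independent of
    // the user's active code page. Unlike the singletons, this
    // instance is ours to free.
    Latin1 := TMBCSEncoding.Create(1252);
    try
      Lines.SaveToFile('config-1252.txt', Latin1);
      Lines.Clear;
      Lines.LoadFromFile('config-1252.txt', Latin1);
      Writeln('Code page 1252 round trip intact: ', Lines[0] = Sample);
    finally
      Latin1.Free;
    end;
  finally
    Lines.Free;
  end;
end.
```

A file written through a code page carries no BOM, so nothing in the file records which code page was used.  That is why the same instance has to be supplied again when reading.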

I use INI files for configuration and extensively use TIniFile and TMemIniFile.  What about those?

The same thing holds for these classes in that the data will be read and written as ANSI data.  Since INI files have traditionally been ANSI (ASCII) encoded, it may not make sense to convert these.  It will depend on the needs of your application.  If you do wish to change to a Unicode format, we'll offer ways to use the TEncoding classes to accomplish that as well.
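
One way this could look is sketched below.  It assumes a TMemIniFile constructor overload that accepts a TEncoding alongside the file name, which is not described in this post; the file name, section and key are invented for the example.

```delphi
program IniUtf8;
{$APPTYPE CONSOLE}

uses
  SysUtils, IniFiles;

const
  City = 'Tibur'#$00F3'n';

var
  Ini: TMemIniFile;
begin
  // Assumed overload: TMemIniFile.Create(FileName, Encoding).
  Ini := TMemIniFile.Create('settings.ini', TEncoding.UTF8);
  try
    Ini.WriteString('User', 'City', City);
    Ini.UpdateFile;  // TMemIniFile only writes to disk on UpdateFile
  finally
    Ini.Free;
  end;

  // Reopen with the same encoding and confirm the value survived.
  Ini := TMemIniFile.Create('settings.ini', TEncoding.UTF8);
  try
    Writeln('Value intact: ', Ini.ReadString('User', 'City', '') = City);
  finally
    Ini.Free;
  end;
end.
```

TMemIniFile is the natural candidate here because it keeps the whole file in memory and reads and writes it in one pass, so an encoding can be applied at those two points.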

In all the above cases, the internal storage will be Unicode and any data manipulation you do with strings will continue to function as expected (unless you're doing some of those odd data-payload tricks I've mentioned previously).  Conversions will automatically happen when reading and writing the data.  Those of you who are familiar with VCL for .NET already know how all of the above works, since the new overloads added to TStrings were introduced with VCL for .NET and use the System.Text.Encoding class for all these operations.