Thursday, January 15, 2004

More IDE secrets - UTF8 and the Editor

In C#Builder and Delphi 8, the code editor operates internally on UTF8 characters only. This requires a "filtering" mechanism to translate to and from the various other text file formats when a file is loaded or saved. These are the "filters" you can select by right-clicking in the editor and choosing the encoding from the menu. By default, the editor stores the file in the locale-specific ANSI encoding, which can lose character data if a file from another locale, using a different code page, is loaded and saved. You can change the default encoding so that files are always saved as UTF8 by setting the following key:

C#Builder: HKCU\Software\Borland\BDS\1.0\Editor
Delphi 8: HKCU\Software\Borland\BDS\2.0\Editor
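
To make the data-loss risk with a locale-specific ANSI default concrete, here is a small Python sketch. It is not the IDE's code. The choice of Windows code page 1252 as the "local" ANSI encoding and the sample string are assumptions made purely for illustration.

```python
# Sketch only: simulates saving text through a locale-specific ANSI code page.
# cp1252 (Western European) stands in for "the current locale"; this is an
# assumption, not something the editor hard-codes.
text = "Größe = Ωμέγα"  # German letters exist in cp1252, Greek letters do not

# Saving as ANSI: characters missing from the code page are replaced by '?'.
ansi_bytes = text.encode("cp1252", errors="replace")
roundtrip_ansi = ansi_bytes.decode("cp1252")

# Saving as UTF8: every character survives the round trip.
utf8_bytes = text.encode("utf-8")
roundtrip_utf8 = utf8_bytes.decode("utf-8")
```

After the ANSI round trip the Greek word has been replaced by question marks, while the UTF8 round trip returns the original string unchanged.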


Using UTF8 as the in-memory encoding instead of straight UCS-2 (Unicode) was done for efficiency, both in implementation time and in memory usage. The editor kernel already knew how to manage Middle- and Far-Eastern multibyte code pages, so extending it to treat UTF8 as just another multibyte encoding was a relatively trivial exercise. Also, since the vast majority of source files contain only characters from the 0-127 ASCII range, each of those characters remains a single byte. Only embedded strings and comments would typically contain extended (>127) characters. UTF8 conversion is also a very fast bit-level transform that needs no look-up tables. This allows the editor's painting code to do a quick transform into UCS-2 and then use the Unicode APIs to paint the text. This way, a file created in one locale will render correctly when opened and edited in another, since the UTF8/Unicode character space includes encodings for all languages.
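
For readers curious what a table-free, bit-level UTF8 to UCS-2 transform looks like, here is a minimal Python sketch. It is not the editor's actual code; the function name `utf8_to_ucs2` is hypothetical. It handles only one- to three-byte sequences, which is all that UCS-2 (the Basic Multilingual Plane) can represent, and it does not reject overlong encodings.

```python
def utf8_to_ucs2(data):
    """Decode UTF-8 bytes into a list of 16-bit code units.

    Illustrative sketch: uses only shifts and masks, no look-up tables.
    Four-byte sequences (outside UCS-2) and malformed input raise ValueError.
    """
    def cont(index):
        # A continuation byte must look like 10xxxxxx; return its low 6 bits.
        if index >= len(data) or (data[index] & 0xC0) != 0x80:
            raise ValueError("malformed or truncated UTF-8 sequence")
        return data[index] & 0x3F

    units = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            # 0xxxxxxx: plain ASCII, stored as a single byte
            units.append(b0)
            i += 1
        elif (b0 & 0xE0) == 0xC0:
            # 110xxxxx 10xxxxxx: 11 payload bits
            units.append(((b0 & 0x1F) << 6) | cont(i + 1))
            i += 2
        elif (b0 & 0xF0) == 0xE0:
            # 1110xxxx 10xxxxxx 10xxxxxx: 16 payload bits
            units.append(((b0 & 0x0F) << 12) | (cont(i + 1) << 6) | cont(i + 2))
            i += 3
        else:
            raise ValueError("sequence is malformed or outside the UCS-2 range")
    return units


# Typical source text is pure ASCII, so UTF8 needs half the bytes of UCS-2.
source_line = "for i := 0 to 10 do"
utf8_size = len(source_line.encode("utf-8"))
ucs2_size = len(source_line.encode("utf-16-le"))

# A string literal with extended characters still decodes to the right units.
sample = "héllo €"
decoded = utf8_to_ucs2(sample.encode("utf-8"))
```

For the ASCII-only line, `ucs2_size` is exactly twice `utf8_size`, which is the memory argument made above. The decoded units for `sample` match the Unicode code points of its characters.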