Wednesday, November 25, 2015

Strings with a reference count > 1 are immutable

And have always been.

Let me explain. Assigning one string variable to another has always been merely a copy of the reference along with a bit of reference count management.

  var
    A, B: string;
  begin
    A := 'a string';
    B := A;
    // string instance reference count = 2
  end;
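The sharing can be observed directly. What follows is only a minimal sketch, assuming a Unicode-era Delphi where System.StringRefCount is available; the program and procedure names are made up for illustration. It builds the string at runtime rather than from a literal, because compiled-in literals carry a special reference count of -1 and are not counted in the usual way.

```pascal
program RefCountDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

procedure Demo;
var
  A, B: string;
begin
  A := StringOfChar('x', 8);        // built at runtime, so it is reference counted
  Writeln(StringRefCount(A));       // count before the assignment
  B := A;                           // copies the reference only, no character data
  Writeln(StringRefCount(A));       // one higher than the line above
  Writeln(Pointer(A) = Pointer(B)); // TRUE: both variables point at one instance
end;

begin
  Demo;
end.
```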

But what if you wanted to modify the content of "B"?

  B[2] := '_';

While the modification itself happens at runtime, the compiler knows that you are modifying the content of the string referenced by "B", so it generates code to ensure that "B" can be modified. Some of you already know what is happening: a call to the helper function _UniqueString() is issued before the assignment is done. This looks at the instance, and if the reference count > 1 it is not "safe" to modify the content because "A" is still referencing it. If the modification were allowed, "A" would "see" the change, which would not preserve the "value type" semantics.

Because the reference count > 1, the _UniqueString() function will create a copy of the string, modify "B" to reference the copy, and remove the reference from the original. Now "A" and "B" reference different instances, each with a reference count = 1. Because "B" is the only reference to one of those instances, it is "safe" to modify the string content.
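The copy-before-modify behavior described above can be seen in a short sketch like this one. The program and procedure names are hypothetical, and the string is built at runtime so that it is an ordinary reference-counted instance.

```pascal
program CopyOnWriteDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

procedure Demo;
var
  A, B: string;
begin
  A := StringOfChar('x', 8);
  B := A;                            // A and B share one instance
  B[2] := '_';                       // compiler makes B unique before writing
  Writeln(A);                        // xxxxxxxx
  Writeln(B);                        // x_xxxxxx
  Writeln(Pointer(A) <> Pointer(B)); // TRUE: two separate instances now
end;

begin
  Demo;
end.
```

"A" never observes the change, which is the "value type" behavior the compiler-generated call is there to protect.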

This is also true across threads. What if "A" is a variable being used by another thread? What if both threads try to modify the string at nearly the same time? There is a benign "race condition" in which both threads may try to make a copy of the string to ensure they have a unique instance, but because the reference count is "thread safe", the worst-case scenario is that both create a copy of the string and the original shared string instance is merely freed.
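As a rough sketch of the threaded case, assuming a Delphi version with the System.Classes unit scope (XE2 or later): each thread owns its own string variable, and only the instance is shared. TWorker and its members are hypothetical names invented for this example.

```pascal
program CowThreads; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  System.Classes;

type
  // Each worker has its own string variable (FText) that starts out
  // referencing the same instance as the main thread's variable.
  TWorker = class(TThread)
  private
    FText: string;
    FMark: Char;
  protected
    procedure Execute; override;
  public
    constructor Create(const AText: string; AMark: Char);
    property Text: string read FText;
  end;

constructor TWorker.Create(const AText: string; AMark: Char);
begin
  FText := AText;          // shares the instance, count is adjusted atomically
  FMark := AMark;
  inherited Create(False); // not suspended: runs once construction completes
end;

procedure TWorker.Execute;
begin
  FText[2] := FMark;       // made unique first, so only this worker's copy changes
end;

var
  Shared: string;
  W1, W2: TWorker;
begin
  Shared := StringOfChar('x', 8);
  W1 := TWorker.Create(Shared, '1');
  W2 := TWorker.Create(Shared, '2');
  try
    W1.WaitFor;
    W2.WaitFor;
    Writeln(Shared);  // xxxxxxxx
    Writeln(W1.Text); // x1xxxxxx
    Writeln(W2.Text); // x2xxxxxx
  finally
    W1.Free;
    W2.Free;
  end;
end.
```

Note that no string variable is touched by more than one thread while the workers run; sharing a variable (rather than an instance) would need its own synchronization.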

At no point in the above scenarios are the variables shared across threads, only the instance(s). Another thing to remember here is that any kind of typecasting can side-step the compiler-generated code that is there to ensure proper consistency. Without the proper code, all bets are off and you're on your own. Type casting is the developer's way of telling the compiler to "get out of the way! I know what I'm doing"... and the compiler believes you.
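To illustrate how a cast can side-step that protection, here is a small sketch of what not to do. It goes through an untyped Pointer cast, which is a plain reinterpretation of the reference, so no uniqueness check runs before the write. The names are hypothetical, and the string is deliberately built at runtime, since writing through a pointer into a compiled-in literal could fault.

```pascal
program CastDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

procedure Demo;
var
  A, B: string;
  P: PChar;
begin
  A := StringOfChar('x', 8);
  B := A;                 // A and B share one instance
  P := PChar(Pointer(B)); // raw pointer cast, nothing makes B unique
  P[1] := '_';            // PChar indexing is zero-based: second character
  Writeln(A);             // x_xxxxxx  (A "sees" the change)
  Writeln(B);             // x_xxxxxx
end;

begin
  Demo;
end.
```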

So when you think about it, a string instance with a reference count > 1 is never allowed to be modified using any normal constructs. The compiler also tries to be nice and understands one special kind of type cast: the "PChar(B)" type cast. In this case, the compiler doesn't quite know what you intend to do with that pointer, so it helps you out by calling _UniqueString(). This was done because it was a very common scenario when calling out to "C-style" external APIs which expect a "char *" or "wchar_t *" parameter.
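If you would rather not depend on what the compiler does for a particular cast, the same guarantee can be requested explicitly through the public System.UniqueString procedure before handing out a writable pointer. A minimal sketch, with hypothetical program and procedure names:

```pascal
program UniqueDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

procedure Demo;
var
  A, B: string;
begin
  A := StringOfChar('x', 8);
  B := A;
  Writeln(Pointer(A) = Pointer(B)); // TRUE: one shared instance
  UniqueString(B);                  // B now has an instance of its own
  Writeln(Pointer(A) = Pointer(B)); // FALSE
  Writeln(A = B);                   // TRUE: the content is still equal
end;

begin
  Demo;
end.
```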

Because of the very common scenario where the "PChar(B)" cast is used, the runtime also ensures that all strings are terminated with a #0 character, which is not included in the length. This has also led to some confusion. Because strings are "length-prefixed", there is no special "terminator character", so #0 is just as valid a character as "S". However, C-style strings rely on a #0 to indicate the end of the string. So, if you pass a string with an embedded terminator to such an API, it will be interpreted as shorter than it actually is. Not much can be done about that, other than ensuring there are no embedded #0 characters in the strings you pass as a "PChar(B)".
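The length mismatch is easy to reproduce. In this sketch, StrLen from System.SysUtils stands in for any C-style consumer, and a Pos check is shown as one possible guard before making the call; the program name is hypothetical.

```pascal
program NulDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
begin
  S := 'abc'#0'def';                 // #0 embedded in the middle
  Writeln(Length(S));                // 7: the prefix length counts every character
  Writeln(StrLen(PChar(S)));         // 3: a C-style scan stops at the first #0
  Writeln(Ord(PChar(S)[Length(S)])); // 0: the hidden terminator after the last character
  if Pos(#0, S) > 0 then
    Writeln('embedded #0 found; a C-style API would see a truncated string');
end.
```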