Wednesday, May 12, 2004

What is a "Breaking interface change?"

There seems to be a level of confusion regarding what constitutes a source code change that would result in symbol version mismatch errors. The rules are actually rather simple, but somewhat obscure and steeped in various levels of voodoo.

The "unit" structure of Delphi (or Borland Pascal) determines what other units can "use" from a particular unit. By placing declarations in the interface section of a unit, the user is publicizing the existence of a particular type, constant, variable, or function for external use. This declaration forms a sort of "contract" between the user of that symbol and the implementation of that symbol. For instance, a class declaration describes general structure of a particular class in addition to the actual physical layout of an instance of that class. When another unit (or program) references that class symbol and then it is compiled, certain bits of information are persisted to the binary .dcu/.dcuil file that describes in very fine detail the actual physical layout of an instance of that class, the class vmt (virtual method table), the RTTI, etc.. For instance, given a class TFoo with a virtual method Bar, a copy of the vmt is placed into the .dcu file with certain offsets. For a descendant of this TFoo, any virtuals it may introduce, will now be appended to the end of ancestor's vmt and the offsets to those entries are recorded. The same thing happens for fields as they have the offsets into the class instance recorded as well.

Now, say you decided to change the declaration of TFoo and wanted to add a new virtual method Baz. When the unit is recompiled, this method would now be inserted into the vmt and all the offsets are recalculated. This represents a classic "breaking" change. In most cases the user is totally unaware that this happened because the compiler is smart enough to recognize that the symbol's version (an internal compiler generated signature that allows the compiler to quickly determine that a symbol has changed), has changed it knows that all other units that refer to that symbol must now be recompiled.

Problems now arise from the fact that if you had a .dcu file that didn't include the original source, there is no way for the compiler to resolve the changes with the users of that symbol. All the compiler has is the final binary representation of that source with all the various offsets and fixups resolved in an intermediate format. This is what causes the symbol version mismatch errors folks may have seen.

All the above also applies to Delphi packages and their corresponding .dcp files as well. .dcp files are simply a concatenation of all the containing .dcu files sans any code blocks since that data is persisted into the .bpl file.

So what can you change that won't break .dcu compatibility? The rules are as follows:

You cannot:

  • change the ancestor type of an existing class

  • add/remove a field to/from a class or record

  • add/remove a new method, virtual, dynamic or otherwise to a class

  • add/remove a property decl to a class, including a property redecl

  • remove a constant, variable, type, or function from the interface

  • change the value of a constant

  • change the calling convention of a method or function

  • change the visibility of a field, method or property

  • change a method to/from virtual or dynamic

You can:

  • add a new global constant

  • add a new class type

  • add a new global variable

  • add a new global procedure or function

This list should cover most situations but there may be some cases that don't quite fit the above list, or at least appear to not fit. For instance, if you had unit A; unit B; and unit C;. Now if the "uses" order were C uses B uses C and a breaking change were made to C, that change may cause a corresponding breaking change in B, and so on. If you didn't have the source to B, there'd be no way to resolve that breaking change.

This is exactly the situation that Borland strives to prevent when we publish a patch. We cannot afford to cause this kind of grief for all the third-parties and ultimately their customers that so diligently support Delphi. Were we to be careless when providing fixes and constantly make breaking changes, the pain and frustration level would rise far above the pain we tried to relieve. Kinda like, "So we cured the patient of that deadly disease, however the problem is that he didn't survive the treatment!"

Now before you skeptics and "flame wolves" pounce and start off with things like, "Hey C++ code doesn't ever have this problem!" Au Contraire! C++ has this same problem in spades. The problem is that you get far less compiler help. Say you had a C++ DLL that exported some classes, or even a .LIB that contained some classes. If you simply replaced the DLL, and thought.. "I only added a private field to my base class," you have just set yourself up for a long debugging session of wondering why some of the fields in your base class are now being overwritten. The same holds true if you had a LIB B for which there were no source (headers don't count here folks), and it linked to a common LIB A. Then LIB A changed the declaration of a class which was used or descended from in LIB B. LIB B will contain instructions, and field offsets based solely on the old version of LIB A. The problem is with this scenario is that the compiler/linker won't tell you that there even is a problem. As long as you didn't change any the method declarations in LIB A, it will link just fine (because the mangled names haven't changed). That is purely insane!

For that other crowd out there that would like to point out that C# isn't afflicted with these problems, I'll have to say for the most part you're right. Since classes in .NET (CLR) are purely late-bound by virtue of the JIT compiler, the runtime execution engine can figure out what the class layout actually is. So you can certainly add virtuals, statics, and fields to an existing class without necessarily breaking existing code... but what if I needed to add a parameter to a method, or I wanted to remove a param? What if that old property or method is obsolete, so I removed it? Same issue folks. Murphy (of Murphy's law) would step in and promptly find that one user that called that changed/removed method and cause their app to break. The interesting thing is that even if that method were referenced in some code, the CLR, and thus the user, would not find out about the problem until the JIT'er actually had to compile the calling method. Microsoft knows all of this and has taken steps to assist the user in solving these issues. The first solution is the late-binding nature of the CLR, the other solutions involve strong names and side-by-side execution, which are out of the scope of this post.

Oh and finally, yes there have been a couple of cases where we did introduce a breaking change in a patch. These situations are met with very internal resistance while the costs and benefits are carefully weighed. These are generally cases where the change was deemed to affect a relatively small number of users, and the likelihood of there being a third-party component that depends on the unit being changed is small. A breaking change to TButton would certainly not likely even be considered. However a change to DDEMan, would certainly be entertained... if only to see who would even notice.