Wednesday, February 6, 2008

A Critical[Section] Difference: Windows XP vs. Windows Vista

No, this isn't one of those comparisons!  This is just something somewhat interesting about the difference in the implementation of a critical section in Windows XP vs. Windows Vista.  It seems that Windows Vista is much more resilient in how it handles the misuse of a critical section.  One degenerate, blatantly obvious case of misuse is doing the following:

LeaveCriticalSection(CSec);
EnterCriticalSection(CSec);

Yes, this is wrong on so many levels, but the interesting thing to note is that under Windows XP, the above code hangs at the EnterCriticalSection call because the LeaveCriticalSection call badly damaged the critical section structure.  The problem is that the RecursionCount field of the critical section is decremented and left non-zero (-1), which causes the LockCount field to also be decremented in order to keep the two counts in sync.  When the EnterCriticalSection call is made, the LockCount field is incremented, but it is still non-zero, which makes the call think there is another owner.  The OwningThread field is then compared with the calling thread's ID; since it is 0, it does not match the calling thread.  But no other thread owns the lock!  The call blithely forges ahead, allocates the wait event, and proceeds to block.  Oops!
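
To make the sequence easier to follow, here is a toy model of that bookkeeping in Python.  It is only a sketch of the behavior described above, not the real ntdll code: the class and function names are made up, a free section is assumed to start with LockCount at -1, and "blocking" is represented by returning False instead of actually waiting.

```python
from dataclasses import dataclass


@dataclass
class XpCriticalSection:
    """Toy model of the XP-era fields discussed in the post (hypothetical)."""
    lock_count: int = -1       # assumption: -1 means "free"
    recursion_count: int = 0
    owning_thread: int = 0     # 0 means "no owner"


def xp_enter(cs, thread_id):
    """Return True if the lock is acquired, False if the caller would block."""
    cs.lock_count += 1
    if cs.lock_count == 0:
        # Uncontended: the caller becomes the owner.
        cs.owning_thread = thread_id
        cs.recursion_count = 1
        return True
    if cs.owning_thread == thread_id:
        # Recursive acquisition by the current owner.
        cs.recursion_count += 1
        return True
    # Looks like someone else owns it: the real call would wait here.
    return False


def xp_leave(cs):
    cs.recursion_count -= 1
    if cs.recursion_count != 0:
        # Believed to be still held recursively: keep the two counts in sync.
        cs.lock_count -= 1
        return
    cs.owning_thread = 0
    cs.lock_count -= 1


cs = XpCriticalSection()
xp_leave(cs)                              # misuse: leave before enter
print(cs)                                 # lock_count=-2, recursion_count=-1, owning_thread=0
blocked = not xp_enter(cs, thread_id=1234)
print(blocked)                            # True: nobody owns the lock, yet we would wait
```

In this model a correctly paired enter/leave brings LockCount back to -1, while the stray leave pushes it to -2, so the following enter can never see the zero it needs.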


Windows Vista, in contrast, has changed how it manages the LockCount field.  Rather than accumulating the waiters (and recursions) by simply incrementing (or decrementing) the LockCount field, it uses the low bit as the actual indication of holding the lock and accumulates the waiter and recursion counts in the remaining bits.  This means that the critical section can now only have 2^31 recursions and waiters. (UPDATE: Upon further examination, only the waiters are indicated in the LockCount field.)  Still plenty of space.  The upshot of this is that the above degenerate case will no longer cause a complete hang in many cases where it once would have.  Why they made the change is anybody's guess; maybe Raymond Chen can pipe up and explain the reasons...  Seems to me that anyone doing the above deserves to be put into the penalty box.  Now all that is going to happen is that a program that once used to hang will appear to work, only to probably show some other flaw downstream.  Seems like a hang/crash now or hang/crash later kind of thing.  The end result is still the same.
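
A similar toy model shows why the same misuse can slip through under the scheme described above.  Again, this is only a sketch under assumptions and not the actual Vista implementation: the names are made up, the polarity of the low bit (set means "held") is chosen for readability, waiters are counted in the bits above it, and acquiring the lock is assumed to reset RecursionCount to 1.

```python
from dataclasses import dataclass

LOCKED_BIT = 0x1    # assumption: low bit set means "held"
WAITER_UNIT = 0x2   # each waiter adds one unit above the low bit


@dataclass
class VistaCriticalSection:
    """Toy model of the low-bit scheme described in the post (hypothetical)."""
    lock_count: int = 0
    recursion_count: int = 0
    owning_thread: int = 0


def vista_enter(cs, thread_id):
    """Return True if the lock is acquired, False if the caller would block."""
    if not cs.lock_count & LOCKED_BIT:
        # Ownership depends only on the low bit, not on the whole count.
        cs.lock_count |= LOCKED_BIT
        cs.owning_thread = thread_id
        cs.recursion_count = 1
        return True
    if cs.owning_thread == thread_id:
        cs.recursion_count += 1
        return True
    cs.lock_count += WAITER_UNIT   # register as a waiter; the real call would wait
    return False


def vista_leave(cs):
    cs.recursion_count -= 1
    if cs.recursion_count != 0:
        return                     # recursions are not reflected in lock_count
    cs.owning_thread = 0
    cs.lock_count &= ~LOCKED_BIT


v = VistaCriticalSection()
vista_leave(v)                     # misuse: leave before enter
acquired = vista_enter(v, thread_id=1234)
print(acquired)                    # True: the stray leave never touched the low bit
```

The bug is still there in the calling code; the model just shows how the damage to RecursionCount no longer leaks into the field that decides ownership.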

6 comments:

  1. Change debugging pattern? To what, 'awake'? The code that demonstrates the issue is ridiculous, the equivalent of closing a door before you walk through it. As to platform stability, this has nothing to do with that. It just means that if Joe Programmer isn't smart enough to understand critical sections, MS gives them a little free pass on this one bad approach.

  2. Well, I get this picture in my mind: Some applications will end up working fine in Vista, but crashing mysteriously on XP. I don't think this is intentional (in this case), but I don't think it's a small thing, either. It seems MS are torn between those for whom backward compatibility is still paramount (as it generally was a few years ago, essentially up to the .NET wave) and those that would like to "stimulate" a faster rate of technological change.
    One would have expected them to add a new API instead of breaking the old one.

  3. Mmmm, it increases the debugging cost when developing on Vista. Some applications that would fail on XP will seem to work on Vista. But the same app can hang later, and it will be harder to deduce what's wrong.

  4. It may appear as "improved robustness" in Vista, but in fact it just hides invalid program flow and makes debugging more complicated. The example code is certainly for demonstration only, in real world apps the flow will be a "little" more complex, with the same pattern though.
    Imo, if there is invalid code, then it should be exposed as an error/exception as soon as possible and not be artificially delayed, giving a false impression of robustness.
    It's a bit like setting variables to nil if the instances they were referencing have been freed. At least in unmanaged code ... ;-)

  5. Heheh,
    I found that out on my own - the hard way of course:

    http://blog.delphi-jedi.net/2008/04/23/the-case-of-the-unexplained-dead-lock-in-a-single-thread/

  6. I believe such risks (consequences of OS behavior) can easily be avoided by considering a virtual environment.

    Example:

    1) By targeting Windows environment and overlooking the release (32/64 bit, XP, Vista).
    2) Explicitly define the critical sections and use the appropriate methods to enter/leave critical sections.

    Consequently, sources applying such "low level virtualization" are more comprehensive.

