Exchange 2010 SP1 Hung I/O watchdog thread causes intentional BSOD

The other day I ran into an issue at a customer site where one of their Exchange 2010 mailbox servers was rebooting continuously, each time after what seemed to be a few minutes of uptime. I thought I would share the experience and the end result. This box was hosting the active copy of their mailbox database, and the passive copy had already been put into a “failed and suspended” state by the system.

Unsurprisingly, the system did not fail over to the passive copy, and there were more than 6,000 transaction logs in the copy queue. I had the customer shut down the guest in vSphere and vMotion it to another host while it was powered down. This seemed to stabilize the box and stop the automatic restarts after a few minutes of uptime. I was then able to grab the memory dump files and analyze them with WinDbg. The bug check reported in the memory.dmp file was:

BugCheck F4, {3, fffffa8004dc9b30, fffffa8004dc9e10, fffff80001bd8f40}
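For anyone who wants to pull these values apart in a script, here is a minimal Python sketch that splits a WinDbg bug check line into its code and four arguments. The function names are my own for illustration and are not part of any Microsoft tool. The argument meanings follow Microsoft's documentation for bug check 0xF4 (CRITICAL_OBJECT_TERMINATION), where a first argument of 3 means a process was terminated and 6 means a thread.

```python
import re

# First-argument meanings for bug check 0xF4 (CRITICAL_OBJECT_TERMINATION).
TERMINATED_OBJECT = {0x3: "process", 0x6: "thread"}


def parse_bugcheck(line):
    """Split a WinDbg 'BugCheck XX, {a, b, c, d}' line into (code, args)."""
    match = re.match(r"\s*BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}", line)
    if match is None:
        raise ValueError("not a BugCheck line: %r" % line)
    code = int(match.group(1), 16)
    args = [int(part.strip(), 16) for part in match.group(2).split(",")]
    return code, args


def describe_f4(line):
    """Label the four arguments of a 0xF4 bug check (hypothetical helper)."""
    code, args = parse_bugcheck(line)
    if code != 0xF4 or len(args) != 4:
        raise ValueError("expected a 0xF4 bug check with four arguments")
    return {
        "terminated_object_type": TERMINATED_OBJECT.get(args[0], "unknown"),
        "terminated_object": hex(args[1]),
        "image_file_name_ptr": hex(args[2]),
        "message_ptr": hex(args[3]),
    }


if __name__ == "__main__":
    sample = "BugCheck F4, {3, fffffa8004dc9b30, fffffa8004dc9e10, fffff80001bd8f40}"
    for key, value in describe_f4(sample).items():
        print(key, "=", value)
```

The first argument of 3 in this dump indicates that a critical process, rather than a thread, was terminated.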

The process behind the blue screen was msexchangerepl.exe. That seemed kind of odd, so I looked into known issues with SP1 and RU3 for SP1, which is what the customer is running. That brought me to a TechNet article explaining the new high availability features in Exchange 2010 SP1.

In Exchange 2010 SP1, ESE has a new feature that uses a watchdog thread to detect “hung I/O” and logs events that Active Manager sees and responds to accordingly. Active Manager is the component in Exchange 2010 that acts as the brains of high availability: it chooses which database copy to activate during best copy selection and handles recovery from database failures.

If the ESE watchdog detects a database I/O issue that lasts longer than four minutes, the msexchangerepl.exe service initiates a bug check by terminating wininit.exe.
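To make the idea concrete, here is a rough Python sketch of how a hung I/O watchdog can work in general. This is not how ESE or msexchangerepl.exe is actually implemented. The class and method names are made up for illustration, and the only detail taken from the behavior described above is the four-minute threshold.

```python
import time

# The four-minute figure from the behavior described above.
HUNG_IO_THRESHOLD_SECONDS = 4 * 60


class IoWatchdog:
    """Tracks outstanding I/O requests and reports those pending too long.

    Hypothetical illustration only; not the ESE implementation.
    """

    def __init__(self, threshold=HUNG_IO_THRESHOLD_SECONDS, clock=time.monotonic):
        self._threshold = threshold
        self._clock = clock      # injectable so the logic can be tested
        self._pending = {}       # io_id -> time the request was issued

    def io_started(self, io_id):
        self._pending[io_id] = self._clock()

    def io_completed(self, io_id):
        self._pending.pop(io_id, None)

    def hung_ios(self):
        """Return the ids of requests outstanding longer than the threshold."""
        now = self._clock()
        return sorted(
            io_id
            for io_id, started in self._pending.items()
            if now - started > self._threshold
        )


if __name__ == "__main__":
    fake_now = [0.0]
    watchdog = IoWatchdog(clock=lambda: fake_now[0])
    watchdog.io_started("db-write-1")
    fake_now[0] = 241.0  # just over four minutes later
    print(watchdog.hung_ios())
```

A real watchdog would run this check periodically on its own thread and escalate when it finds a hung request. In the Exchange case, that escalation is the deliberate bug check.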

So it seems that Exchange is trying to protect itself and recover from storage I/O issues by initiating a blue screen and restarting the system. I’m unsure, though, how the integrity of the Exchange database files and logs is ensured before the bug check occurs. It also appears that you can change the behavior of this feature in the registry or in Active Directory.

Some useful links below.

Active Manager technical details – http://technet.microsoft.com/en-us/library/dd776123.aspx

Blog Post on Exchange 2010 Hanging I/O  – http://thoughtsofanidlemind.wordpress.com/2011/02/10/sp1-and-bsod/

ESE on Hung IO at the bottom of the page – http://technet.microsoft.com/en-us/library/ff625233.aspx
