Failover Cluster Fails to Failover

  • We have SQL on an active/passive cluster with the following details:

    SQL Server 2005 Enterprise 64-bit

    Windows 2003 R2 64-bit

    MSA1500 CS disk array

    The cluster works fine until it is supposed to fail over. The first indication in the system log that there is a problem is the “Cluster resource ‘SQL Server’ in Resource Group ‘SQL Server Group’ failed” error. There is nothing in the application log or system log prior to this error that indicates there is a problem.

    When it occurs, the server does not automatically fail over. Instead, the active node goes “missing”. We can ping it, but cannot access it via RDP, remote shutdown, file explorer – anything. SQL is not broadcasting on the server. The application log shows “ODBC sqldriverconnect failed” and “Unable to complete login process due to delay in prelogin response” errors every 30 seconds.

    The only resolution is to physically shut the node off, which then causes the cluster to fail over.

    I have compared the cluster settings to other clusters we have. The only difference was that the defined cluster groups did not have a preferred owner, which shouldn’t matter since we do not fail back.

    Has anyone seen something like this before?

  • What does the SQL error log say?

  • Nothing. The last entry in the SQL error log was made 50 minutes prior to the failure, and it was a mundane backup-type entry. Nothing was written to the SQL error log again until the failover was forced.

  • So SQL is still running, I assume... but probably shutting down?

  • I do not know if SQL was still running or not. Nothing was able to connect to it and I can tell that the Agent was running because I put a monitoring job on it that sent a heartbeat to another server. This is the only way I knew it was down.

  • Agent can't run without SQL Server running so that answers the question.

    And if SQL Server is running you can't fail over.

    SQL Server needs to shut down on one node; the resources then move to the other node, and the cluster brings SQL up there.

    Look through the error logs for the shutdown command and any information about why it wouldn't shut down. You may have been able to log on and kill/rollback some active transactions to speed up the process.

    I've had SQL take 40+ minutes to shut down and fail over in one particularly unusual case where I just let it finish. That's because the cluster software I was using didn't specify SHUTDOWN WITH NOWAIT.

    http://msdn.microsoft.com/en-us/library/ms188767%28SQL.90%29.aspx

    Anyway... all that might not be your case, but use the SQL error log reader to check what was happening on the server.

    Good luck.
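
The heartbeat job mentioned earlier in the thread implies a check on the receiving server that notices when heartbeats stop arriving. A minimal sketch of that check might look like the following. The function name `is_node_down`, the 60-second interval, and the tolerance of three missed beats are all assumptions for illustration, not details from the original setup.

```python
from datetime import datetime, timedelta


def is_node_down(last_heartbeat, now,
                 interval=timedelta(seconds=60), missed_allowed=3):
    """Return True when more than `missed_allowed` intervals have passed
    since the last heartbeat was received (hypothetical thresholds)."""
    return now - last_heartbeat > interval * missed_allowed


# Illustrative timestamps only; in practice `last_seen` would come from
# wherever the receiving server records incoming heartbeats.
last_seen = datetime(2010, 1, 1, 12, 0, 0)
print(is_node_down(last_seen, datetime(2010, 1, 1, 12, 2, 0)))  # 2 minutes since last beat
print(is_node_down(last_seen, datetime(2010, 1, 1, 12, 5, 0)))  # 5 minutes since last beat
```

Allowing a few missed beats before raising an alert avoids false alarms from a single delayed job run, at the cost of slower detection.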
