Cluster failover?

  • Had a problem on a production server where the cluster failed over unexpectedly (when is it ever expected!). We are running SQL 2000 with the latest SP on a Windows 2000 Advanced Server running MSCS and GeoCluster. The problem started out with the following error: LogWriter: Operating system error 5(Access is denied.) encountered.

    This is followed by: The log for database 'tempdb' is not available. This repeats for every DB on the server. After all these errors I get: Error: 3414, Severity: 21, State: 1

    Database 'msdb' (database ID 4) could not recover. Contact Technical Support.

    When I looked, msdb was fine. The log shows that many transactions were rolled back in several DBs. I ran dbcc checkdb and had to repair a bunch of DBs, but of course the enterprise DB was marked as suspect. I fixed all the problems but need to know what caused these errors in the first place.

    The cluster service also reported: The Microsoft Clustering Service could not write file (E:\MSCS\chk77E0.tmp). The disk may be low on diskpace, or some other serious condition exists. There were a few more related errors as well that I can provide if anyone thinks it will help. All drives on the server had lots of space, so this was not the issue.

    Also, there were and still are lots of errors from ClusSvc losing connection to a node. Errors like this: The node lost communication with cluster node 'SQL1102' on network 'PCI-Fibre-1GbE-Dual-Port-A'. And: Cluster node SQL1102 was removed from the active cluster membership. The Clustering Service may have been stopped on the node, the node may have failed, or the node may have lost communication with the other active cluster nodes.

    I can't find anything that went wrong with SQL Server itself, so I am wondering if this is a hardware or network problem. Any help would be greatly appreciated.

    **Sorry if this is the wrong place for this, could not think of a better place in the forums**

  • Maybe the SQL Server service started before the disks were ready. Do you have any event log entries that show the disks were not ready after the SQL service started?

  • I can't find anything in the logs about the drive the DBs and logs are on (the F: drive), but the E: drive is the quorum drive, and from the logs the cluster service could not write to it.

  • When you said "it could not write to this drive" meaning the quorum, what was the exact error message?  (Permission issue, Resource issue, etc.)  Why don't you try manually failing it over again to the other node and see if you get the same error message....

    Linda

  • hi,

    did you have a look at the nt event log on _all_ nodes? This sounds as if the active node lost the disks! It's possible that during the network failures the passive node tried to go active and grabbed the disks, and so SQL Server on the active node had no more access to them.

    karl

    Best regards
    karl

  • There were several access denied errors on the quorum drive. Also, this shows up in the application log right before the trouble started: 17053 :

    LogWriter: Operating system error 5(Access is denied.) encountered.

    I will show here all the app errors that happened in sequence, just to have it all out there.

    1: Information: Service has stopped Replication, ID: 1

    2: Information: Service has disconnected from 10.1.1.42 : 1100 for Replication Set DRV-E, ID: 1

    3: Information: Service has stopped Replication, ID: 2

    4: Information: Service has disconnected from 10.1.1.42 : 1100 for Replication Set DRV-F, ID: 2

    5: Error: Windows cannot unload your registry file. If you have a roaming profile, your settings are not replicated. Contact your administrator.

    DETAIL - Access is denied. , Build number ((2195)).

    6: Error: 17053 :

    LogWriter: Operating system error 5(Access is denied.) encountered.

    And then every db on the server gives the following error

    7: Error: 9001, Severity: 21, State: 4

    The log for database 'msdb' is not available.

    Please help!

  • Any messages in system- or security log around that time?

    Did somebody delete or deactivate the service account?

    Best regards
    karl

  • Here are the system log entries from the same time; there are no security entries that apply. These are all errors from the source ClusSvc.

    1: The Microsoft Clustering Service could not write file (E:\MSCS\chk77E0.tmp). The disk may be low on diskpace, or some other serious condition exists. (Lots of room on this disk 6 gigs free)

    2: Microsoft Clustering Service failed to obtain a checkpoint from the cluster database for log file E:\MSCS\tqu77DF.tmp.

    3: Microsoft Clustering Service suffered an unexpected fatal error at line 2166 of source module D:\nt\private\cluster\service\dm\dmlog.c. The error code was 5.

    4: The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.

    These errors continue to show up periodically, but the server has not failed over again and we have not had any more DB problems.

  • It seems that you might have had some I/O issues. The cluster log might have some information on what triggered the failover. It might be difficult to post here, but you can send it to me at shah_a_g_74@yahoo.com and I can see if I can find more information.

  • The cluster log only seems to hold a day or two of data, so I will get the appropriate day from backups and e-mail it to you, sa24. Thanks!

  • Looks like this had to do with hardware problems on one of the servers. We had many errors again today to do with I/O, in particular with writing to disk. Our infrastructure team is looking at it, and I will post the exact cause when available in case anyone has similar problems.

  • I know this is an old subject, but did anyone find out what the issue was?  We are having the same issue on our machines.

     

    Thanks.
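
For anyone retracing a similar failure, the order in which the errors first appear is what the replies above keep asking about. The following is only a rough sketch in Python, not anything the posters ran: it tags log lines using the messages quoted in this thread and reports which kind of error showed up first. The category labels and the function names `classify` and `first_seen` are made up for illustration, and the thread gives no timestamps, so the sketch assumes the lines are already in chronological order.

```python
import re

# Patterns are taken from the messages quoted in this thread.
# The labels on the left are hypothetical names chosen for this sketch.
SIGNATURES = [
    ("os_access_denied", re.compile(r"Operating system error 5\b")),
    ("log_unavailable", re.compile(r"Error: 9001\b")),
    ("recovery_failed", re.compile(r"Error: 3414\b")),
    ("quorum_write_failed", re.compile(r"could not write file \(E:\\MSCS\\")),
    ("node_lost", re.compile(r"lost communication with cluster node")),
]


def classify(line):
    """Return the label of the first signature that matches, or None."""
    for label, pattern in SIGNATURES:
        if pattern.search(line):
            return label
    return None


def first_seen(lines):
    """Return the labels in the order each one first appears in the input."""
    order = []
    for line in lines:
        label = classify(line)
        if label is not None and label not in order:
            order.append(label)
    return order


if __name__ == "__main__":
    # Lines copied from the application log sequence posted above.
    sample = [
        "Error: 17053 : LogWriter: Operating system error 5(Access is denied.) encountered.",
        "Error: 9001, Severity: 21, State: 4 The log for database 'msdb' is not available.",
        "Error: 3414, Severity: 21, State: 1",
    ]
    print(first_seen(sample))
```

If the access-denied entry comes out ahead of the 9001 and 3414 entries, as it does in the sequence posted here, that is consistent with SQL Server losing access to its disks first and the database errors following from that.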
