Cluster automatic failover - could it be caused by something running within SQL?

  • We have an active-passive cluster configuration in our production environment, and yesterday afternoon there was an automatic failover of our SQL db-engine instance from the active to the passive node.

    The physical server hosting the active node had rebooted itself, but we cannot find much information about what happened in the SQL logs or the event logs on the server.

    During the brief period of the failover (a few minutes) there was disruption to the applications accessing the SQL databases, and our users noticed it and complained about it.

    My question is: could SQL ever cause a server to reboot itself? I have never heard of such a thing, but I need some corroboration so I can convince our system admins to take a closer look at the Windows side (Windows Server 2008) and our hardware.

    From what I know, SQL Server can gobble up memory and CPU resources to the max on a server, but cannot cause it to reboot itself.

    So there must have been something wrong on the O/S or hardware side (including the disk configuration).

    Any thoughts anyone?

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables
    Persisting SQL Server Index-Usage Statistics with MERGE
    Turbocharge Your Database Maintenance With Service Broker: Part 2

  • Marios Philippopoulos (9/25/2010)


    During the brief period of the failover (a few minutes) there was disruption to the applications accessing the SQL databases, and our users noticed it and complained about it.

    Wow, it's not surprising the users complained. I have clustered SQL Server instances which fail over in around 10 - 15 secs at the very most. In fact, the last time we had a failover (someone made a VLAN change to the switch port for the public NIC) it was around the middle of the day, and 90% of the users didn't even realise!

    If you're getting failover times that high, you probably want to check your cluster configuration.

    Marios Philippopoulos (9/25/2010)


    My question is: could SQL ever cause a server to reboot itself? I have never heard of such a thing, but I need some corroboration so I can convince our system admins to take a closer look at the Windows side (Windows Server 2008) and our hardware.

    From what I know, SQL Server can gobble up memory and CPU resources to the max on a server, but cannot cause it to reboot itself.

    Au contraire, mon père: if SQL Server gobbles up, or is allowed to use, excessive amounts of memory, the OS can become starved. SQL Server is asked to release memory, but if this doesn't occur quickly enough the server can suffer a BSOD.

    Check the event logs, primarily for anything odd up to the known time of failure.

    As I said before, definitely revisit your cluster config; failover should be a lot quicker than 5 minutes, IMHO.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I'll need to check more closely how long the failover lasted; how do I do that? Get last timestamp from SQL ERRORLOG prior to the failover and first timestamp after?

    Based on the SQL ERRORLOG timestamps before and after the failover, the difference is 9 minutes.
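    A minimal sketch of that timestamp comparison, in Python. It assumes each ERRORLOG line starts with a timestamp shaped like "2010-09-25 14:03:12.45", and that the entries from before the restart sit in the previous log file (typically ERRORLOG.1) while the entries after it sit in the current ERRORLOG. The names parse_timestamp and outage_window are hypothetical helpers, not part of any SQL Server tooling.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumption: every ERRORLOG entry begins with "YYYY-MM-DD HH:MM:SS.ff".
TS_LEN = 22
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_timestamp(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LEN], TS_FORMAT)
    except ValueError:
        # Continuation lines and blank lines carry no timestamp.
        return None


def outage_window(previous_log: Iterable[str],
                  current_log: Iterable[str]) -> timedelta:
    """Gap between the last entry before the restart and the first one after."""
    before = [t for t in map(parse_timestamp, previous_log) if t is not None]
    after = [t for t in map(parse_timestamp, current_log) if t is not None]
    if not before or not after:
        raise ValueError("both logs need at least one timestamped line")
    return min(after) - max(before)
```

    Feed it the lines of the two log files; the result is an upper bound on the outage, since the old instance may have stopped logging some time before the node actually went down.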


  • Arrange some downtime for the instance with your change manager. Go into Cluster Administrator and move the SQL resource group to a partner node, noting how long it takes to take the resources offline, move them, and bring them back online.


  • Any stack traces in your log directory?

    The longest part of a cluster failover is usually recovering the database(s). Check the time of the first message in the errorlog and the time when the 'recovery is complete' message occurs.

    As a test we failed over a cluster supporting about 20 databases whilst a large batch load process was running; total failover time was 15 mins. By failover time I mean ALL databases recovered and usable and all services up, including SQLAgent, which will not be usable until the last database is recovered.
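    As a rough illustration of that check, the Python sketch below measures the time from the first timestamped errorlog entry to the 'recovery is complete' entry. It assumes each entry starts with a timestamp shaped like "2010-09-25 14:12:00.00" and that the completion message contains the text "Recovery is complete"; leading_timestamp and recovery_duration are hypothetical names used only for this example.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"        # assumed errorlog timestamp layout
RECOVERY_MARKER = "recovery is complete"  # matched case-insensitively


def leading_timestamp(line: str) -> Optional[datetime]:
    """Return the timestamp at the start of a line, or None if there is none."""
    try:
        return datetime.strptime(line[:22], TS_FORMAT)
    except ValueError:
        return None


def recovery_duration(errorlog_lines: Iterable[str]) -> Optional[timedelta]:
    """Time from the first timestamped entry to the last recovery-complete entry."""
    first = None
    done = None
    for line in errorlog_lines:
        stamp = leading_timestamp(line)
        if stamp is None:
            continue  # skip continuation lines
        if first is None:
            first = stamp
        if RECOVERY_MARKER in line.lower():
            done = stamp
    if first is None or done is None:
        return None  # log is empty or recovery has not finished
    return done - first
```

    A result of a minute or so points away from database recovery as the cause of a long failover; a result of many minutes points towards it.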


  • Also check your cluster log!

    It may also help with establishing the facts.

    A long startup time for your SQL instance may be caused by an excessive number of virtual log files (VLFs): http://www.sqlskills.com/BLOGS/KIMBERLY/post/8-Steps-to-better-Transaction-Log-throughput.aspx

    Did someone add other middleware on the node(s)? E.g. I've seen an instance reboot because someone installed a new IBM UDB middleware version on that server. It wasn't a clustered one, so I don't know if it would actually cause a failover, but now we know we need to request downtime before doing the next server.

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere


  • The issue is that you don't just have the SQL Server service resources. You also have the storage, IP and name resources, and SQL Server will not start until these are all available! Any problems here can increase the failover time.


  • The description sounds a lot like cluster negotiation. This is when the inactive node thinks that the active node is offline and tries to fail over. The active node refuses to release its resources and the two nodes go into negotiation for control for an extended period of time until eventually one of the nodes wins out. When we were experiencing this, it caused an outage of about 8 to 9 minutes each time.

    You need to make sure you are using a dedicated heartbeat connection between the nodes and that it is set to private, not public. Also make sure your SAN kit, HBA drivers, and NIC drivers are all up to date. Any of these things can cause the heartbeat connection to error out and trigger the cluster negotiation.

    This would be indicated by messages in the event log and cluster log showing that the inactive node tried to take the active node offline.


    My blog: SQL Soldier
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management

  • Robert Davis (9/27/2010)


    The description sounds a lot like cluster negotiation.

    Hi

    This wouldn't cause the original active node to reboot though would it?


  • Perry Whittle (9/27/2010)


    Robert Davis (9/27/2010)


    The description sounds a lot like cluster negotiation.

    Hi

    This wouldn't cause the original active node to reboot though would it?

    No, it wouldn't. I'd love to see what the cluster log says.



  • Robert Davis (9/27/2010)


    No, it wouldn't. I'd love to see what the cluster log says.

    Me too!

    Marios, can you post details of the cluster.log?


  • I launch the failover cluster manager, go to "Cluster Events", but then I see no events there: "No events were found."

    How do I get the cluster log?


  • Cluster.log should be in the following folder

    %systemroot%\cluster


  • george sibbald (9/26/2010)


    Any stack traces in your log directory?

    The longest part of a cluster failover is usually recovering the database(s). Check the time of the first message in the errorlog and the time when the 'recovery is complete' message occurs.

    As a test we failed over a cluster supporting about 20 databases whilst a large batch load process was running; total failover time was 15 mins. By failover time I mean ALL databases recovered and usable and all services up, including SQLAgent, which will not be usable until the last database is recovered.

    There were no stack-trace dumps in the log directory.

    Recovery completed within a minute from the time of the first message in the ERRORLOG to the "recovery complete" message.


  • Perry Whittle (9/27/2010)


    Cluster.log should be in the following folder

    %systemroot%\cluster

    Hmm, I'm on Windows Server 2008, and cluster.log is no longer there. I'll use this link to get the log:

    http://blogs.technet.com/b/pfe-ireland/archive/2008/07/04/windows-2008-clustering-the-cluster-log.aspx

    ...

    Getting access-is-denied errors when I try to run the following (as cluster admin) from the command prompt:

    Cluster /Cluster:<myClusterName> log /gen /copy "C:\temp"

