Cluster monitoring

  • Over the last couple of weeks a failover has occurred because the cluster's CheckQueryProcessorAlive check queried SQL Server and got no response after about 6 tries (Checking for CheckQueryProcessorAlive: sqlexecdirect failed). The failover itself went smoothly, but it obviously affected users. There was very little useful information in the event log or the SQL Server log, and I am still left with the task of explaining to people why it happened. So I am looking for cluster-specific monitoring tools. Does anyone have any suggestions?

    Thanks in advance,

    Clea Boe

  • Can you give some detail on your cluster setup: which service packs are installed, whether it is a SCSI or fibre cluster, and whether the heartbeat runs over the public network or is segmented somehow? Usually these kinds of problems can be traced to a network issue. There are also some known bugs in clustering, so you may want to search TechNet for them.

    Wes

  • I had a three-month incident open with Microsoft concerning this issue. The most likely cause is that your server was too busy to respond to the cluster's monitor process, so you should trace CPU usage during the periods when your failover occurs.

    The cluster monitor periodically attempts to execute the command "SELECT @@SERVERNAME" via the ODBC API call SQLExecDirect. Microsoft's view of what you can do to resolve this issue is, in my opinion, entirely incorrect. They suggest limiting the server RAM and limiting the number of CPUs the server can run on. These give more resources to the cluster's monitoring thread and not to SQL Server itself, which is the thing that is resource bound. The obvious way to solve this problem would be to allow the user to tune the ODBC timeout the resource monitor is using, e.g. via the registry.

    What I did to resolve my issue was some hefty retuning of my tables and indexes, which allowed the stored procedure I was having problems with to execute without the server becoming resource bound.

    You're going to need to identify what the server is doing when this failover occurs and tune around it.

    Graham
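To see how the server behaves from the outside when these failures happen, the check Graham describes can be approximated with a small probe that runs the same "SELECT @@SERVERNAME" query, times it, and records consecutive failures. This is only a sketch under stated assumptions: it is not the cluster resource monitor's actual code, the names `run_probe`, `watch`, `would_fail_over` and `make_odbc_probe` are hypothetical, the default of 6 attempts comes from the "about 6 tries" in the original post, and the ODBC part assumes the third-party `pyodbc` package and a connection string you supply.

```python
import time
from typing import Callable, List, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool        # True if the query returned without raising
    seconds: float  # wall-clock time the attempt took
    error: str      # exception text, empty on success


def run_probe(execute: Callable[[], object]) -> ProbeResult:
    """Run one probe and time it; any exception counts as a failed probe."""
    start = time.monotonic()
    try:
        execute()
        return ProbeResult(True, time.monotonic() - start, "")
    except Exception as exc:
        return ProbeResult(False, time.monotonic() - start, str(exc))


def watch(execute: Callable[[], object], attempts: int = 6,
          interval: float = 1.0) -> List[ProbeResult]:
    """Probe up to `attempts` times, stopping at the first success.

    `attempts` and `interval` are assumptions; the real retry count and
    spacing used by the resource monitor are not given in this thread.
    """
    results: List[ProbeResult] = []
    for i in range(attempts):
        result = run_probe(execute)
        results.append(result)
        if result.ok:
            break
        if i < attempts - 1:
            time.sleep(interval)
    return results


def would_fail_over(results: List[ProbeResult], attempts: int = 6) -> bool:
    """True if every one of `attempts` consecutive probes failed."""
    return len(results) >= attempts and not any(r.ok for r in results)


def make_odbc_probe(connection_string: str,
                    timeout_seconds: int = 5) -> Callable[[], object]:
    """Build a probe that runs SELECT @@SERVERNAME over ODBC (needs pyodbc)."""
    import pyodbc  # third-party; imported lazily so the rest runs without it

    def execute() -> object:
        # `timeout` on connect is the login timeout in seconds.
        conn = pyodbc.connect(connection_string, timeout=timeout_seconds)
        try:
            conn.timeout = timeout_seconds  # query timeout in seconds
            return conn.cursor().execute("SELECT @@SERVERNAME").fetchone()[0]
        finally:
            conn.close()

    return execute
```

Running `watch(make_odbc_probe(conn_str))` on a schedule and logging each `ProbeResult.seconds` next to CPU counters would show whether slow or failed probes line up with busy periods. The query timeout here is adjustable, which is exactly the knob the post says the resource monitor lacks.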

  • Hi Wes,

    It is a fibre cluster with the heartbeat running on a dedicated private link, completely separate from the public network. SQL2K SP3 - not sure about OS patches.

    Thanks

    Clea

  • SELECT @@SERVERNAME, I completely forgot about that. We had to do the same thing on a very active server. We ended up running the cluster in active/active and shaping traffic around the spikes in usage. Another thing we had to do in SQL2K was set max parallelism to 1 instead of 0, which kept some processes from hogging all the processors. A side effect was timeouts on some jobs until we rewrote them. If I'm not mistaken, the parallelism problem was resolved in SP2; I would do some searching on the MS site for sure. Good catch, click-fund!

    Wes

  • We are using SQL Server 2000 clustering. Please help me: how do I monitor the cluster, and what are the different counters?
