Log shipping breaks after cluster failover

  • Hi all,

    Had some issues yesterday when we had to fail over the cluster. The failover was done on the hour, and I suspect that timing is what caused 10 of the 12 copy jobs to fail on the secondary. We went through the logs and found nothing useful.

    Anyway, we failed BACK and log shipping (LS) healed itself; no failures found. However, I was warned that the problem would recur if we fail over again.

    Is this likely to occur?

  • If you failed over and LS broke, then yes, when you fail over again it will break, until you fix the underlying issue.

    What is the likelihood of failing over? It depends. We had a highly unstable cluster which failed over at least three times a week. The right alerting, monitoring, and problem resolution are needed to ensure you're as stable as you can be.

    I have a job set up which fires when SQL Agent starts, to say: look, the agent has started, go and investigate as you might have had a failover; then check the logs, find the cause, and fix it. Touch wood, in the new environment and position I'm in now, the cluster is very stable.
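    For anyone wanting something similar, here is a minimal T-SQL sketch of such a job. The job name, operator name, and message text are all hypothetical, and it assumes Database Mail with a configured operator (on SS2K you would swap sp_notify_operator for xp_sendmail); freq_type = 64 is the documented schedule type for "run when SQL Server Agent starts".

        USE msdb;
        GO

        -- Hypothetical job: fires on agent start-up, which on a cluster
        -- usually means the instance restarted or failed over.
        EXEC dbo.sp_add_job
            @job_name = N'Alert - possible cluster failover';

        EXEC dbo.sp_add_jobstep
            @job_name  = N'Alert - possible cluster failover',
            @step_name = N'Notify DBA team',
            @subsystem = N'TSQL',
            @command   = N'EXEC msdb.dbo.sp_notify_operator
                               @name    = N''DBA Team'',  -- hypothetical operator
                               @subject = N''SQL Agent restarted - possible failover'',
                               @body    = N''Check the cluster and SQL error logs for the cause.'';';

        -- freq_type = 64: run whenever the SQL Server Agent service starts.
        EXEC dbo.sp_add_jobschedule
            @job_name  = N'Alert - possible cluster failover',
            @name      = N'On agent start',
            @freq_type = 64;

        -- Target the local server so the job actually runs.
        EXEC dbo.sp_add_jobserver
            @job_name = N'Alert - possible cluster failover';
        GO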

  • Looks like we have a problem then, because the only "plausible" error we found in the logs was:

    "The server's configuration parameter "irpstacksize" is too small for the server to use a local device. Please increase the value of this parameter .", and the MSKB indicates that this be done in the registry. Trouble is, neither the registry in question nor that of any of the other two SQL clusters contains any such param.

    We appear to be at an impasse and have to live with the likelihood of LS failing if we go to the other node, with me then having to spend 2-3 days exclusively rebuilding shipping for those 10 DBs, as MS won't support SS2K.

    Oh well... 🙂

  • You can manually add the key to the registry if you wish. It's something to do with the kernel, so I'm unsure whether you have a dodgy install of Windows Server on the second node. It might be overkill, but you might want to look at rebuilding that node.

    http://support.microsoft.com/kb/106167
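    If you want to check from inside SQL Server whether the value is there before touching regedit, here is a sketch using the undocumented xp_regread extended procedure; the key and value name are the ones the KB points at, and a NULL result just means the value is absent so the OS default applies.

        DECLARE @val int;

        -- IRPStackSize lives under the LanmanServer parameters key per the KB.
        EXEC master.dbo.xp_regread
            @rootkey    = N'HKEY_LOCAL_MACHINE',
            @key        = N'SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters',
            @value_name = N'IRPStackSize',
            @value      = @val OUTPUT;

        SELECT @val AS IRPStackSize;  -- NULL = value not present, default in effect

        -- Adding it could be done with the equally undocumented xp_regwrite,
        -- though editing the registry directly per the KB is the safer route:
        -- EXEC master.dbo.xp_regwrite
        --     N'HKEY_LOCAL_MACHINE',
        --     N'SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters',
        --     N'IRPStackSize', N'REG_DWORD', 15;  -- check the KB for the right value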

  • Hi Anthony, I'll use that as a last resort, but meanwhile, would I be insuring myself against failover if I go ahead and rebuild shipping on the 10 DBs NOW?

  • No, you won't be insuring against failover; nothing can insure against that. If it happens, it happens; you will need to find the cause of what made the cluster fail over.

    If log shipping is working now, you won't achieve anything by rebuilding it.

  • The classic reason for log shipping failing in a clustered environment after a failover is the log backup directory not being defined as a share that both nodes of the cluster can see and that fails over along with all the other resources.

    I would check that first.
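    A quick way to see which name the instance is running under, and which node currently owns it (a sketch; ComputerNamePhysicalNetBIOS arrived around SQL 2000 SP3, so it may return NULL on older builds):

        -- On a clustered instance MachineName returns the virtual network name,
        -- while ComputerNamePhysicalNetBIOS returns the node currently hosting it.
        -- A backup share path containing the physical node name rather than the
        -- virtual name is node-bound and will break on failover.
        SELECT SERVERPROPERTY('MachineName')                  AS virtual_name,
               SERVERPROPERTY('ComputerNamePhysicalNetBIOS')  AS current_node,
               SERVERPROPERTY('IsClustered')                  AS is_clustered;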


  • Another problem we have had is logship.exe (I think that's the name, but anyway, the executable) not being installed to exactly the same path on the two nodes.


  • george sibbald (5/31/2012)


    The classic reason for log shipping failing in a clustered environment after a failover is the log backup directory not being defined as a share that both nodes of the cluster can see and that fails over along with all the other resources.

    I would check that first.

    To my knowledge it is, but where should I check that?

  • Right-click the database and select Properties - Log Shipping, and check how it has been configured.

    It just sounds to me as if your log shipping configuration is not 'cluster-aware' and may be tied to the node name.
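    If you would rather check in T-SQL than the GUI, a sketch against msdb; the table names assume the SS2K maintenance-plan flavour of log shipping (on 2005 and later the equivalents are log_shipping_primary_databases and friends):

        -- SS2K log shipping configuration lives in msdb; look for any
        -- directory or share columns that contain a physical node name.
        SELECT * FROM msdb.dbo.log_shipping_plans;
        SELECT * FROM msdb.dbo.log_shipping_plan_databases;

        -- And on the monitor server:
        SELECT * FROM msdb.dbo.log_shipping_primaries;
        SELECT * FROM msdb.dbo.log_shipping_secondaries;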


  • george sibbald (5/31/2012)


    Right-click the database and select Properties - Log Shipping, and check how it has been configured.

    It just sounds to me as if your log shipping configuration is not 'cluster-aware' and may be tied to the node name.

    Again I would ask: "How do I check whether it's cluster-aware?"

  • The share the logs are backed up to has to be a cluster resource so that it fails over when the SQL instance does; that way it's visible on either node.

    If you are using the node name when specifying the log directory destination, or have not made the share a cluster resource, you would get the issue you describe.
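    One way to prove the share follows the instance is the undocumented xp_fileexist; the UNC below is hypothetical, so substitute your configured backup share:

        -- Run this before and after a failover: if the share is a proper
        -- cluster resource, the result should be the same on either node.
        EXEC master.dbo.xp_fileexist N'\\SQLVS1\LogShipping';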


