Patching Failover Clusters

  • I don't have much experience with clustering but I understand the concepts of clustering.  I understand that Microsoft has a recommended process for patching Windows, or even SQL Server when a failover cluster instance is involved as seen here.  Another group in my company handles Windows patching and their plan at this point is to install all the Windows patches on all nodes but leave the restarting of the nodes up to the DBA team so that we're in control of when a failover happens.  This seems like it could open us up to some unnecessary risk, but I don't know all the ways this could go wrong.  What problems could occur by installing OS patches on all the nodes but not rebooting?  What advantages, if any, would there be to having failover clustered instances in a test/dev environment?  Trying to convince people to do that so we have a place that mirrors prod to test patching.

  • Here is how we handle this; it may or may not work for you. I don't think it is safe to patch both nodes at once, because that defeats the purpose of having a cluster.
    Patching is automated with SCCM; I don't want to get up at night to restart a server.
    All clusters are split into Node 1 and Node 2 patching groups. Group 1 is patched first and Group 2 a week later (except when an emergency security patch needs to be pushed out).
    Patching starts with a reboot at a predetermined time. This way the failover happens in a controlled manner and the node is then available for patching (how long that takes can vary).
    After Node 1 is patched it is restarted again, and the next morning I can verify that it has rejoined the cluster and is ready to accept workload. The same process repeats a week later with Node 2.

    We have done this for years and never had problems.
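The staggered grouping described above can be sketched as a small planning function. This is only an illustration, not the SCCM configuration from the post: the function name, the cluster and node names, and the `gap_days` parameter are all hypothetical, and an emergency patch is modeled simply as a shorter gap.

```python
from datetime import date, timedelta


def build_patch_schedule(clusters, first_window, gap_days=7):
    """Return a sorted list of (patch_date, cluster, node) tuples.

    clusters: dict mapping a cluster name to its (node_1, node_2) pair.
    first_window: date on which Group 1 (every Node 1) is patched.
    gap_days: days Group 2 waits after Group 1 (assumed default: one week).
    """
    gap = timedelta(days=gap_days)
    schedule = []
    for name, (node_1, node_2) in clusters.items():
        # Group 1: the first node of every cluster.
        schedule.append((first_window, name, node_1))
        # Group 2: the second node, held back so both nodes are never patched together.
        schedule.append((first_window + gap, name, node_2))
    # Sort by date, then cluster name, so the plan reads in execution order.
    return sorted(schedule)


if __name__ == "__main__":
    # Hypothetical cluster and node names, for illustration only.
    demo = {"CLU1": ("CLU1-N1", "CLU1-N2"), "CLU2": ("CLU2-N1", "CLU2-N2")}
    for when, cluster, node in build_patch_schedule(demo, date(2017, 6, 21)):
        print(when.isoformat(), cluster, node)
```

Keeping the gap as a parameter makes the emergency case an explicit decision rather than a separate code path.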

  • In reply to sam rebus - Wednesday, June 21, 2017 1:31 PM:

    Thanks, that helps. How do you handle the possibility of a failover happening while the passive node is having patches installed? The process I saw was to remove that node from the possible owners of the cluster resources in Failover Cluster Manager. Is that something SCCM is doing for you? The patching and the reboot then happen, and Failover Cluster Manager is used to add the passive node back as a possible owner of cluster resources. Is that overkill?
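The remove, patch, reboot, re-add sequence in the question can be modeled as a guard that always restores the node afterwards. This is a minimal sketch against a made-up in-memory model: `ClusterRole` and `excluded_from_ownership` are hypothetical names and do not correspond to any real cluster or SCCM API.

```python
from contextlib import contextmanager


class ClusterRole:
    """Hypothetical in-memory stand-in for a clustered role (not a real cluster API)."""

    def __init__(self, name, owner, possible_owners):
        self.name = name
        self.owner = owner
        self.possible_owners = set(possible_owners)


@contextmanager
def excluded_from_ownership(roles, node):
    """Temporarily remove `node` from every role's possible owners."""
    # Refuse to exclude a node that is still hosting a role.
    for role in roles:
        if role.owner == node:
            raise RuntimeError(f"{node} currently owns {role.name}; fail it over first")
    affected = []
    for role in roles:
        if node in role.possible_owners:
            role.possible_owners.discard(node)
            affected.append(role)
    try:
        yield
    finally:
        # Always add the node back, even if the patching step raised an error.
        for role in affected:
            role.possible_owners.add(node)


if __name__ == "__main__":
    # Hypothetical role and node names, for illustration only.
    role = ClusterRole("SQL Server", owner="NODE1", possible_owners=["NODE1", "NODE2"])
    with excluded_from_ownership([role], "NODE2"):
        # Patch and reboot NODE2 here; the role cannot fail over to it meanwhile.
        print("during patching:", sorted(role.possible_owners))
    print("after patching:", sorted(role.possible_owners))
```

The `finally` clause is the point of the design: a failed patch run should not leave the cluster with one fewer possible owner than before.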

  • You could script it for sure, but I do not. It has not been a problem before, and the probability of it becoming one is very low. You know your environment better, of course. I have 6 clusters and all are on VMs, so the chance of a spontaneous failover during patching is very slim.

  • lmarkum - Wednesday, June 21, 2017 11:59 AM

    Another group in my company handles Windows patching and their plan at this point is to install all the Windows patches on all nodes but leave the restarting of the nodes up to the DBA team so that we're in control of when a failover happens.

    This isn't necessary if they patch any inactive nodes and reboot them. Does your cluster have multiple instances across the nodes?

    lmarkum - Wednesday, June 21, 2017 11:59 AM

    This seems like it could open us up to some unnecessary risk, but I don't know all the ways this could go wrong. What problems could occur by installing OS patches on all the nodes but not rebooting?

    Installing OS patches and not rebooting can cause some oddities. Generally, you'll find patches that force a reboot of the OS to complete the patch installation.
    Ensure that nodes to be patched are drained of roles, so that they can be rebooted without issue.
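The "drain before reboot" rule can be expressed as a simple pre-reboot check. The sketch below only models role ownership as a dictionary; the function names, role names, and node names are hypothetical, and a real drain would be performed by the cluster tooling rather than by this code.

```python
def nodes_safe_to_reboot(nodes, role_owners):
    """Return the nodes that currently own no roles, in input order."""
    busy = set(role_owners.values())
    return [node for node in nodes if node not in busy]


def drain(node, nodes, role_owners):
    """Return a new role-to-owner mapping with every role moved off `node`."""
    targets = [other for other in nodes if other != node]
    if not targets:
        raise ValueError("no other node available to take the roles")
    moved = dict(role_owners)  # copy, so the caller's mapping is left untouched
    for role, owner in role_owners.items():
        if owner == node:
            # Simplest possible placement: send everything to the first other node.
            moved[role] = targets[0]
    return moved


if __name__ == "__main__":
    # Hypothetical layout: the database engine on one node, SSAS on the other.
    nodes = ["NODE1", "NODE2"]
    owners = {"SQL Server": "NODE1", "SSAS": "NODE2"}
    print(nodes_safe_to_reboot(nodes, owners))
    after = drain("NODE2", nodes, owners)
    print(nodes_safe_to_reboot(nodes, after))
```

A check like this is most useful when roles are spread across nodes, since in that case no node is truly passive until it has been drained.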

    lmarkum - Wednesday, June 21, 2017 11:59 AM

    What advantages, if any, would there be to having failover clustered instances in a test/dev environment? Trying to convince people to do that so we have a place that mirrors prod to test patching.

    WSFCs in test and dev are generally only useful for replicating the production environment. For a dev environment, I would say you don't need one.

  • In reply to Perry Whittle - Thursday, June 22, 2017 5:25 AM:

    The cluster does not have multiple instances of the DB engine installed.  There are, however, at least two instances of SSAS, one Multi-dimensional and a Tabular instance that somehow has two names associated with it, but that's another issue for a different post.  I do appreciate hearing that installing patches but not rebooting can create some problems.

  • In reply to sam rebus - Wednesday, June 21, 2017 4:37 PM:

    Sam, thanks for sharing your experience with this issue.  It has helped.
