Unplanned failover of AG

  • Last night our AG failed over to our passive node (same subnet). This isn't a problem as our data loads are configured to run from the current primary.

    The problem is that our listener stopped working for some reason. We spent today investigating and couldn't find anything. The listener worked locally on the box, but not remotely: from the box itself it would ping and connect to the instance, but it would not respond to pings from remote machines.

    I have just failed the AG back over to the primary and the listener started working again. I failed it back to the passive, and the listener still works.

    Any idea why this might have occurred?

  • Under what accounts do the SQL Server services run on each node?

    Does the listener name and IP come online at all on the second node? Check the cluster log for more detail on why it failed.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I would be interested in the configuration of the Listener too.  Is it in a multi-subnet environment? Which port is it listening on, etc.



    Ben Miller
    Microsoft Certified Master: SQL Server, SQL MVP
    @DBAduck - http://dbaduck.com

  • SQLAssAS - Thursday, February 9, 2017 10:02 AM

    Last night our AG failed over to our passive node (same subnet). This isn't a problem as our data loads are configured to run from the current primary.

    The problem is that our listener stopped working for some reason. We spent today investigating and couldn't find anything. The listener worked locally on the box, but not remotely: from the box itself it would ping and connect to the instance, but it would not respond to pings from remote machines.

    I have just failed the AG back over to the primary and the listener started working again. I failed it back to the passive, and the listener still works.

    Any idea why this might have occurred?

    Also, please provide the AG configuration details. Is this a synchronous or asynchronous group, and is automatic failover configured? This will dictate which nodes are possible owners of the AG cluster resources.

  • We have a 3-node AG:
    Node 1 - Subnet 1 - Sync - Automatic failover
    Node 2 - Subnet 1 - Sync - Automatic failover
    Node 3 - Subnet 2 - Async - Manual failover

    All services run under a service account created specifically for this AG. The account cannot be locked out.

    The listener uses a custom port number (not used by anything else).

    The unplanned failover happened between node 1 and node 2, which is when the listener stopped responding. The failover was caused by the quorum witness (file share) becoming unavailable for a period of time.

  • Perry Whittle - Thursday, February 9, 2017 10:05 AM

    Under what accounts do the SQL Server services run on each node?

    Does the listener name and IP come online at all on the second node? Check the cluster log for more detail on why it failed.

    Hi

    Yes, it was online. It was working absolutely fine locally, but not remotely.

  • Sounds to me like you have a firewall issue on that node that does not allow remote connections, especially if you can connect to the listener locally.
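
One way to tell a blocked port apart from a listener that is unreachable altogether is to attempt a plain TCP connection from a remote client and again from the node itself. The following is only a minimal sketch in Python; the function name `can_connect`, the host name, and the port number in the usage comment are hypothetical placeholders, not values from this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example usage (hypothetical listener name and custom port; substitute your own):
# print(can_connect("aglistener.example.local", 14330))
```

If this returns True on the node that owns the listener but False from an application server, the fault lies somewhere on the network path (firewall, routing, or ARP) rather than in SQL Server itself.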



    Ben Miller
    Microsoft Certified Master: SQL Server, SQL MVP
    @DBAduck - http://dbaduck.com

  • dbaduck - Thursday, February 9, 2017 10:28 AM

    Sounds to me like you have a firewall issue on that node that does not allow remote connections, especially if you can connect to the listener locally.

    Hi

    The listener is working fine now with no firewall changes.

    As mentioned above, I had to fail the AG over to node 2, then back again, and it now works as expected. But that doesn't explain why it stopped working.

  • Can you check that gratuitous ARP (GARP) packets are allowed on the networking devices hosting your VLAN? At my workplace, gratuitous ARP is blocked on the VLANs because of DISA STIG requirements. The problem is that gratuitous ARP is how a Windows cluster node announces the updated MAC/IP pair to the gateway for the subnet. When it is blocked, the device hosting the subnet gateway refuses the packet, so its ARP table still maps the listener address to the old node. External traffic (for example, application servers trying to connect to the AG listener) times out for about 15 minutes, until the ARP table entry expires and the gateway requests an update; then traffic to the AG listener starts working again.

    The only fixes I know of for this are:

    - Get your networking guys to allow gratuitous ARP packets on networking devices
    - Add NICs on clients connecting to the AG that are on the same subnet as the AG listener. This will allow them to receive the ARP packets directly and should allow for failover times like you expect
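
The behaviour described above can be illustrated with a toy model of a gateway's ARP table. This is a rough sketch only, not how any particular switch or router is implemented: the class `Gateway`, the function `outage_seconds`, the IP and MAC strings, and the fixed 15-minute expiry (taken from the observation above) are all assumptions for illustration. Real devices use vendor-specific timeouts and refresh rules.

```python
from dataclasses import dataclass, field

# Assumption: entries expire after about 15 minutes, as observed in this post.
ARP_TIMEOUT_S = 15 * 60


@dataclass
class Gateway:
    accepts_garp: bool
    arp_table: dict = field(default_factory=dict)  # ip -> (mac, learned_at)

    def learn(self, ip: str, mac: str, now: int) -> None:
        self.arp_table[ip] = (mac, now)

    def on_gratuitous_arp(self, ip: str, mac: str, now: int) -> None:
        # A gateway that blocks GARP ignores the unsolicited update.
        if self.accepts_garp:
            self.learn(ip, mac, now)

    def resolve(self, ip: str, now: int, owner_mac: str) -> str:
        """Return the MAC the gateway forwards to at time `now`.

        `owner_mac` is the node that really holds the IP and would
        answer a fresh ARP request once the old entry has expired.
        """
        entry = self.arp_table.get(ip)
        if entry is None or now - entry[1] >= ARP_TIMEOUT_S:
            self.learn(ip, owner_mac, now)
        return self.arp_table[ip][0]


def outage_seconds(accepts_garp: bool, failover_at: int = 60, step: int = 30) -> int:
    """Count how long remote traffic is sent to the old node after a failover."""
    listener_ip = "10.0.0.50"  # hypothetical listener address
    gw = Gateway(accepts_garp)
    gw.learn(listener_ip, "node1-mac", now=0)
    # Node 2 takes over the listener IP and announces it with a gratuitous ARP.
    gw.on_gratuitous_arp(listener_ip, "node2-mac", now=failover_at)
    outage = 0
    for now in range(failover_at, failover_at + 2 * ARP_TIMEOUT_S, step):
        if gw.resolve(listener_ip, now, owner_mac="node2-mac") != "node2-mac":
            outage += step
    return outage
```

In this model, `outage_seconds(True)` is 0 because the gateway updates its table immediately, while `outage_seconds(False)` is 840 seconds: remote clients keep being forwarded to the old node until the stale entry expires.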

    I did a post on this to TechNet forums a little while back.

    WSFC Virtual IPs and GARP
    https://social.technet.microsoft.com/Forums/en-US/5f0831b7-fef6-4efd-a6b5-6ddacd1c3f89/wsfc-virtual-ips-and-garp?forum=winserverClustering

    Joie Andrew
    "Since 1982"
