Always On AG behaviour and locking

  • Hi All,

    We are running SQL Server 2017 CU22 Enterprise Edition, with several databases configured in an Always On Availability Group.

    One of the databases, "SalesDW", went out of sync for more than two hours. There was an application outage, and the app team reached out to the database team.

    When I query state_desc from sys.databases, it shows ONLINE, but in SSMS the database shows as Not Synchronizing.
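    For reference, state_desc lives on the catalog view sys.databases (the older sys.sysdatabases compatibility view does not expose it), while the AG synchronization state comes from a HADR DMV. A hedged sketch that shows both side by side:

    ```sql
    -- Sketch: compare the database state with the AG synchronization state
    -- for the local replica. A database can report ONLINE in sys.databases
    -- while the replica state is NOT SYNCHRONIZING.
    SELECT d.name,
           d.state_desc,                       -- ONLINE, RESTORING, ...
           rs.synchronization_state_desc,      -- SYNCHRONIZED, NOT SYNCHRONIZING, ...
           rs.synchronization_health_desc
    FROM sys.databases AS d
    LEFT JOIN sys.dm_hadr_database_replica_states AS rs
           ON rs.database_id = d.database_id
          AND rs.is_local = 1
    WHERE d.name = N'SalesDW';
    ```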

    Checking the ERRORLOG, I see the messages below:

    LogDate ProcessInfo Text

    ----------------------- ------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    2020-12-28 02:14:42.150 spid277s Nonqualified transactions are being rolled back in database SalesDW for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.

    2020-12-28 02:14:44.150 spid277s Nonqualified transactions are being rolled back in database SalesDW for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required.

    :

    The messages ended at:

    2020-12-28 04:26:27.660 spid277s Nonqualified transactions are being rolled back in database SalesDW for an Always On Availability Groups state change. Estimated rollback completion: 99%. This is an informational message only. No user action is required.

    2020-12-28 04:26:29.660 spid277s Nonqualified transactions are being rolled back in database SalesDW for an Always On Availability Groups state change. Estimated rollback completion: 99%. This is an informational message only. No user action is required.

    2020-12-28 04:26:31.670 spid277s Nonqualified transactions are being rolled back in database SalesDW for an Always On Availability Groups state change. Estimated rollback completion: 99%. This is an informational message only. No user action is required.

    2020-12-28 04:26:33.670 spid277s Nonqualified transactions are being rolled back in database SalesDW for an Always On Availability Groups state change. Estimated rollback completion: 99%. This is an informational message only. No user action is required.

    On further checking, I also see:

    LogDate ProcessInfo Text

    ----------------------- ------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    2020-12-28 02:14:40.140 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:42.150 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:44.150 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:48.180 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:50.180 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:52.190 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:54.190 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 02:14:56.190 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    :

    2020-12-28 04:19:22.220 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 04:19:24.220 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 04:19:26.230 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 04:19:28.240 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    2020-12-28 04:19:30.230 spid277s Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    Questions:

    1) Firstly, why did the database go out of sync? Until the rollback completed, we were not able to access the database. If we tried, we got the error below:

    Msg 983, Level 14, State 1, Line 38

    Unable to access availability database 'SalesDW' because the database replica is not in the PRIMARY or SECONDARY role.

    Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.

    2) Secondly, about the KILL messages: we don't know who killed the process. What does the message below mean?

    Process ID 765 was killed by an ABORT_AFTER_WAIT = BLOCKERS DDL statement on database_id = 8, object_id = 0.

    I was reading something about online operations. What happens if someone uses an index rebuild with ABORT_AFTER_WAIT = BLOCKERS in their code? Does it mean that any SPIDs blocking that SPID will get killed?

    https://sqlserver-help.com/tag/managed-lock-priority/

    Unfortunately, I didn't see any ALTER INDEX message logged in the ERRORLOG as mentioned in the above article.
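    For context, this is the kind of statement that produces that errorlog message. The WAIT_AT_LOW_PRIORITY option (SQL Server 2014+) lets an online rebuild wait politely for its schema lock, and with ABORT_AFTER_WAIT = BLOCKERS it kills the sessions holding conflicting locks once the wait expires. A hedged sketch; the index and table names are illustrative only:

    ```sql
    -- Sketch: online rebuild that waits up to 5 minutes for its lock,
    -- then kills whatever sessions are blocking it. Victims show up in
    -- the errorlog as "Process ID ... was killed by an
    -- ABORT_AFTER_WAIT = BLOCKERS DDL statement".
    ALTER INDEX IX_Sales_OrderDate ON dbo.FactSales
    REBUILD WITH
    (
        ONLINE = ON
        (
            WAIT_AT_LOW_PRIORITY
            (
                MAX_DURATION = 5 MINUTES,      -- how long to wait at low priority
                ABORT_AFTER_WAIT = BLOCKERS    -- then kill the blocking sessions
            )
        )
    );
    ```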

    3) This situation might occur again. How can I monitor the SPIDs to get a full picture during that time frame?

    Thanks,

    Sam


  • Just my 2 cents: it seems to me that an index rebuild job was aborted, which caused the timeout and rollback. I have also experienced an out-of-sync secondary. If you look into the text log file under the Log folder and dig into it (I think it only logs to a .txt file if you tick the box on the Agent job), it will tell you which index failed to rebuild. That may give you a clue. Good luck.

  • This sounds a bit like there was a state change in your AG, meaning a failover occurred, and there may have been a large redo queue on the secondary that had to be worked through; perhaps that included the rollback you saw. The log on the secondary has to be redone before the database can come online after a failover.

    If you don't have a third-party SQL Server monitoring tool already, I would encourage you to consider one; it will help you answer questions like #3. In its absence, you can run sp_WhoIsActive and log the results to a table every minute or so. That provides a history of activity on the server that can be examined later.
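    A hedged sketch of that logging pattern, using sp_WhoIsActive's own schema-generation option; the table name dbo.WhoIsActive_Log is illustrative:

    ```sql
    -- One-time setup: let sp_WhoIsActive emit the CREATE TABLE statement
    -- matching the columns it will capture, then execute it.
    DECLARE @schema NVARCHAR(MAX);
    EXEC dbo.sp_WhoIsActive
         @get_plans = 1,
         @return_schema = 1,
         @schema = @schema OUTPUT;
    SET @schema = REPLACE(@schema, '<table_name>', 'dbo.WhoIsActive_Log');
    EXEC (@schema);

    -- Recurring Agent job step (e.g. every minute): snapshot activity.
    EXEC dbo.sp_WhoIsActive
         @get_plans = 1,
         @destination_table = 'dbo.WhoIsActive_Log';
    ```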

    If the query store is enabled for the database, you could look in the query store during the time frame when this occurred to get a picture of activity at that time that was significant enough in terms of performance that it would show up in there.

    The system_health Extended Events session captures information about queries waiting on locks for an extended period of time. You can query it with sys.fn_xe_file_target_read_file, and the data in the file can also be sorted and filtered in the SSMS UI.
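    A hedged sketch of reading those files; the relative 'system_health*.xel' pattern normally resolves against the instance's Log directory, and the event names filtered on are the ones system_health records for long lock waits and deadlocks:

    ```sql
    -- Sketch: read the system_health session's rolling .xel files and
    -- keep the long-wait and deadlock events; shred the XML as needed.
    SELECT object_name AS event_name,
           CAST(event_data AS XML) AS event_xml,
           file_name
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
    WHERE object_name IN (N'wait_info', N'xml_deadlock_report');
    ```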

  • Thank you Tac11 and Imarkun for the inputs.

    Best Regards,

    Sam

