All Queries Performing Much Worse

  • I have a SQL Server 2014 Enterprise Edition machine on SP1 CU4 that developed severe performance issues at the end of last week. It is a virtual machine that was moved from an older VMware cluster to a new UCS environment between 11pm and midnight on 5/11. The path to the storage stayed the same, and the newer host should have provided better performance.

    Someone contacted me about the performance of the SQL instance on Monday. The primary purpose of this machine is to host data from Azure locally, because we still have some other data here that the Azure data needs to be aggregated with. The instance owner uses a bulk copy process in C# to run BULK INSERT statements against several tables on the local instance, which I'll call instance E. In our monitoring software I can see that, suddenly, beginning with the 6am hour on 5/12, every query on instance E started taking 3-4 times longer to process. That showed up in a chart of all expensive queries in the monitoring tool (see the DMV sketch at the end of this post for one way to double-check those elapsed times).

    In terms of perfmon metric changes, here is what I observed. Log Bytes Flushed/sec went from 12 million to 17 million over the course of the day on 5/11, then jumped from 17 million to 28 million between 11pm on 5/11 and midnight on 5/12, and has stayed around there. Log Flushes/sec went from 254 to 280 over the course of 5/11, then rose to 630 in that same window and has been in the 400 to 600 range ever since. Page Writes/sec climbed gradually from 1,700 to 2,300 up until 11pm on 5/11, hit 3,000 at midnight and 3,600 at 1am, stayed consistently higher than that on 5/13 and 5/14, and has been around 3,300 since the 16th.

    Procedure Cache Hit Ratio has gone from 93% on 5/11 down to 65%, and Plan Cache Size took a nosedive from 318 MB at 7pm on 5/11 down to 48 MB by 9pm; with one exception where it climbed back up to 200 MB, it has been around 50 MB ever since. SQL Compilations/sec has plummeted along with Batch Requests/sec. It's almost as if the instance is less busy from a transactions standpoint yet queries are taking longer to process, which is the opposite of what I would expect. (The counter query at the end of this post is one way to pull these same numbers straight from the instance.)

    So, I'm seeing a performance change in perfmon metrics between 11pm on the 11th and midnight on the 12th as well as a sudden performance change for queries at 6am on 5/12.  

    The person who manages this BULK INSERT process says nothing has changed about that process.  I'm not really sure where to go from here.
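
    As a sanity check on the per-query durations, here is a rough sketch of a plain DMV query (nothing to do with the monitoring tool) that looks at average elapsed time per execution for the most expensive cached statements. It only covers plans that are still in cache, so with the plan cache sitting around 50 MB it will miss anything that has been evicted.

      -- Most expensive cached statements by total elapsed time (times in microseconds);
      -- only reflects plans still in the plan cache.
      SELECT TOP (25)
             qs.execution_count,
             qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
             qs.total_logical_reads / qs.execution_count AS avg_logical_reads,
             qs.creation_time,
             SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                       ((CASE qs.statement_end_offset
                              WHEN -1 THEN DATALENGTH(st.text)
                              ELSE qs.statement_end_offset
                         END - qs.statement_start_offset) / 2) + 1) AS statement_text
      FROM sys.dm_exec_query_stats AS qs
      CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
      ORDER BY qs.total_elapsed_time DESC;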
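
    And this is a sketch of pulling the counters above straight from sys.dm_os_performance_counters instead of perfmon. A few caveats: the */sec counters here are cumulative, so you need two snapshots a known interval apart and the delta to get a real rate; the ratio counter has to be divided by its matching base counter; and I'm assuming the tool's "Procedure Cache Hit Ratio" and "Plan Cache Size" map to the Plan Cache object's Cache Hit Ratio and Cache Pages counters.

      -- Snapshot of the counters discussed above. The */sec values are cumulative
      -- totals since startup, so diff two snapshots to get a per-second rate.
      SELECT [object_name], counter_name, instance_name, cntr_value, cntr_type
      FROM sys.dm_os_performance_counters
      WHERE counter_name IN (N'Log Bytes Flushed/sec',
                             N'Log Flushes/sec',
                             N'Page writes/sec',
                             N'Batch Requests/sec',
                             N'SQL Compilations/sec',
                             N'Cache Hit Ratio',       -- divide by 'Cache Hit Ratio Base'
                             N'Cache Pages',           -- plan cache size = pages * 8 KB
                             N'Page life expectancy')
      ORDER BY [object_name], counter_name, instance_name;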

  • The other thing that changed is that, starting during the 6am hour on 5/12, waits on LCK_M_X became noticeably more prevalent. By Monday the 14th, queries had really long IO_COMPLETION waits associated with them and the OS disk queue length was through the roof. Write latency in SQL Server, as measured by the monitoring tool querying sys.dm_io_virtual_file_stats, has gone up from 150 ms to around 225 ms.
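
    For reference, this is roughly how those latency numbers come out of sys.dm_io_virtual_file_stats directly (the same DMV the monitoring tool reads), plus the cumulative totals for the two wait types mentioned above. Both DMVs report running totals since the instance started (or since the stats were last cleared), so the change really shows up in the delta between two samples.

      -- Per-file average I/O latency, cumulative since instance startup.
      SELECT DB_NAME(vfs.database_id) AS database_name,
             mf.physical_name,
             vfs.num_of_writes,
             CASE WHEN vfs.num_of_writes = 0 THEN 0
                  ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms,
             CASE WHEN vfs.num_of_reads = 0 THEN 0
                  ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_latency_ms
      FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
      JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
      ORDER BY avg_write_latency_ms DESC;

      -- Cumulative totals for the wait types called out above.
      SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms
      FROM sys.dm_os_wait_stats
      WHERE wait_type IN (N'LCK_M_X', N'IO_COMPLETION')
      ORDER BY wait_time_ms DESC;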

  • May not be a SQL issue but an issue with the VM and/or the host.  Have you had the sys admins check things on their end?

  • In reply to lmarkum - Friday, May 18, 2018 7:40 AM:

    Any chance you are having memory problems with the UCS servers? It almost sounds like the hardware could be degrading... I saw that happen before when not using the "approved" memory. It may or may not be the issue, but you may want to check. Here are some related docs:
    Troubleshoot DIMM memory issues in UCS
    DIMMs: Reasons to use only Cisco Qualified Memory on Cisco UCS Servers

    Sue

  • Lynn, Sue. We moved the guest VM back to the old host just as a test, and things have improved quite a bit. PLE is recovering, as is the plan cache size. SQL Compilations/sec has recovered from under 10/sec up to around 120. Batch Requests/sec has gone way up, to between 600 and 800. Transactions/sec has gone up. SQL Server write latency has dropped from around 200ms down to 20-30ms, and read latency has gone back down dramatically as well. I will say, though, that at the moment the number of active sessions is lower than it was before shutting the guest down and moving it to the old host. I'll keep watching it. The problem is that this machine can't stay on this old host; we have to evacuate it. So we need to figure out why there appears to be an issue with this VM on the new host. I'll take a look at the links Sue provided.

  • One of the sysadmins ended up creating a new VM on the UCS with the latest VMware Tools that were still compatible with the old host, in case we ended up moving the machine back to the old host again. The VM disks from the old guest were attached to the new guest. The INSERT BULK process was turned back on, and the SQL instance and VM are performing fine now. We're not really sure why this resolved the issue.

  • In reply to lmarkum - Monday, May 21, 2018 9:05 AM:

    It's always frustrating when things are just "fixed" by something unknown. Glad it's working though, and thanks for posting back.

    Sue
