Assessing performance issue

  • Hello. I have been trying to understand the source of some severe performance degradation on one of our servers. The symptoms we are seeing are excessive application timeouts and queries way taking longer to run than normal - even when the execution plan is optimal.

    I have been googling and binging for the last two weeks and have been monitoring key performance counters and I'm not 100% sure what to make of what I'm finding. I'm hoping one or a few of the gurus here can lend a hand.

    To start, our server configuration is as follows:

    Windows Windows 2003 Server R2 Standard Edition SP2 (32 bit)

    SQL Server 2005 SP2

    Data storage is on an iSCSI SAN with a 1 GB NIC (HP AIO600)

    4 GB memory

    The SAN is shared with other database servers but the drives are carved out and dedicated to each server (versus all the drives configured as one large virtual). Four of the logical disks are exposed to this one server (1-RAID5, 3-RAID1).

    These are the main performance counters I have been observing:

    Average Disk Queue Length (one for each drive, not _Total)

    Buffer Cache Hit Ratio

    Page Life Expectancy

    Checkpoint pages/sec

    Lazy writes/sec

    Combined with these I am monitoring some of the DMVs and finding a lot of PAGEIOLATCH_SH (and _EX).

    I am also monitoring CPU counters, page file, etc. All in all, the only two things that stand out are the PAGEIOLATCH wait types and the Page Life Expectancy counter. Everything else appears to be optimal - buffer cache 96%+, checkpoint pages/sec & lazy writes/sec extremely low, CPUs are not being over utilized, etc.

    As far as the PAGEIOLATCH wait types, I think these indicate an I/O issue. But we've even put databases on their own spindels and the same performance still exists. The Page Life Expectancy peaks at about 600 when the SQL Server is started but when we turn on our applications it drops down <75 almost immediately. And of course, when this happens, the Average Disk Queue Length spikes mid range double digits for the duration of the query. Which drive depends on which query. The thing is it is consistent across databases and spindels, etc. So something tells me it is either a SQL or hardware issue, not an application issue.

    Ok, so sorry for the wall of text but my question is, how do you determine what this problem is? At this point I am not sure if it is a memory issue and we should add more memory to SQL Server or if there is a hardware issue with the iSCSI device.

    So, the million dollar question - any suggestions?

    Thanks in advance!

  • It all sounds very much like a lack of buffer pool memory.

    Can you answer the following questions please?

    a) Is AWE enabled?

    b) Does the account that SQL Server runs under have the 'lock pages in memory' OS permission?

    c) What does the performance monitor counter SQLServer:Memory Manager: Total Server Memory (KB) show?

    d) What does the performance monitor counter SQLServer:Memory Manager: Target Server Memory (KB) show?

    e) What are the settings for minimum and maximum server memory set to in SQL Server?

    f) What figure for page life expectancy do you typically see?

    g) Do you see a 'using locked pages for buffer pool' or 'Address Windowing Extensions is enabled' message in the SQL Server error log on start-up?

    Once we know how much memory SQL is using, we'll know whether you need to tweak the settings, or simply add RAM.

    Thanks

    Paul

  • Paul White (9/13/2009)


    It all sounds very much like a lack of buffer pool memory.

    Can you answer the following questions please?

    a) Is AWE enabled?

    b) Does the account that SQL Server runs under have the 'lock pages in memory' OS permission?

    c) What does the performance monitor counter SQLServer:Memory Manager: Total Server Memory (KB) show?

    d) What does the performance monitor counter SQLServer:Memory Manager: Target Server Memory (KB) show?

    e) What are the settings for minimum and maximum server memory set to in SQL Server?

    f) What figure for page life expectancy do you typically see?

    g) Do you see a 'using locked pages for buffer pool' or 'Address Windowing Extensions is enabled' message in the SQL Server error log on start-up?

    Once we know how much memory SQL is using, we'll know whether you need to tweak the settings, or simply add RAM.

    Thanks

    Paul

    Paul, thanks for your reply. See my responses below:

    a) It was not prior to a few days ago but I did enable it. Either way this did not seem to impact the issue we are experiencing.

    b) Again, it did not have the permissions but with enabling AWE we turned this option on.

    c & d - I was not tracking these two counters prior so I added them. (c) is leveling off at about 1.72 million. (d) is staying at just about 3 million. These counts are the average for the duration being tracked (the default of 1:40). Also note on (c) - this value is slowly climbing as well.

    e) Default settings, 0 and 2147483647. This server is dedicated to SQL so this configuration made sense to me.

    f) I am unsure of what this normal value is. From what I understand it should be above 350 (based on my research).

    g) As in items (a) & (b) - AWE was not enabled before this weekend, but once we enabled it and restarted SQL I did confirm this message was in the log.

    Can I toss in a silly question here? How does this relate to the excessive PAGEIOLATCH(xxx) waits we are seeing?

  • Interesting outcome today. We had one of our RAID drives recreated so the partitions were aligned. I'm seeing a significant improvement in read/write times as well as the average disk queue length. Probably in the area of 45%. For example, just copying data files between drives, cut down from 35ish minutes to 18 minutes. However, I still don't want to write off any memory related issues. Let me put this out there, is seeing PAGEIOLATCH_(XXX) waits normal? IOW, should I only be concerned if I see an excessive amount of these? After all, there should be some level of waits that are normal, correct?

  • Tony Fountain (9/14/2009)


    Can I toss in a silly question here? How does this relate to the excessive PAGEIOLATCH(xxx) waits we are seeing?

    Hey Tony,

    It's not a silly question.

    Without AWE enabled properly, SQL Server would generally be limited to under 2GB memory for the buffer pool due to virtual address space issues on 32-bit. With a too-small data page cache, SQL Server will be going to disk to fetch pages much more frequently than necessary. The page IO latch waits you see are associated with SQL Server fecthing pages from disk. That's the other reason I wanted to know the value for page life expectancy you are seeing - it is a good indicator of whether SQL Server has enough buffer pool available for your workload.

    Please take a look at Linchi Shea's blog entry to see how the AWE mechanism helps SQL Server use memory appropriately.

    When using AWE, you need to set the server max memory, even for a dedicated SQL Server. With just 4GB RAM, I would probably set max server memory to around 3.5GB (3584 MB!). See http://blogs.msdn.com/slavao/archive/2006/11/13/q-a-does-sql-server-always-respond-to-memory-pressure.aspx

    One more memory-usage check: would you mind posting the results from the following?

    SELECT [name],

    SUM(single_pages_kb + multi_pages_kb) / 1024.0

    FROM sys.dm_os_memory_clerks

    WHERE [name] IN (N'Object Plans', N'SQL Plans', N'TokenAndPermUserStore')

    GROUP BY [name]

    ORDER BY SUM(single_pages_kb + multi_pages_kb) / 1024.0 DESC;

    This is to check one common issue.

    Paul

  • Tony Fountain (9/14/2009)


    We had one of our RAID drives recreated so the partitions were aligned. I'm seeing a significant improvement in read/write times as well as the average disk queue length. Probably in the area of 45%. For example, just copying data files between drives, cut down from 35ish minutes to 18 minutes.

    Partition alignment usually has around a 5% benefit - you may be seeing additional benefits from the data being reorganized on disk.

    Tony Fountain (9/14/2009)


    However, I still don't want to write off any memory related issues. Let me put this out there, is seeing PAGEIOLATCH_(XXX) waits normal? IOW, should I only be concerned if I see an excessive amount of these? After all, there should be some level of waits that are normal, correct?

    Yes, a certain level of IO waits is normal (and especially with iSCSI).

    I know I keep posting links at you, but the following is a pretty definitive white paper from Microsoft which you should find useful:

    Waits And Queues

  • I assume you have checked that you have correct indexes and they aren't really fragmented etc

  • I suspect Paul is pointing you in the right direction, but I would also add a few perfmon counters to rule out the I/O subsystem as the source of your troubles:-

    Physical Disk:Avg Disk Sec/Read

    Physical Disk:Avg Disk Sec/Write

    Add these for each of the drive letters containing your database or logs. For a well tuned I/O subsystem, these should be <10ms

    I would also add "Physical Disk:% Idle Time" to see how busy these disks are.

  • Paul, here are the counters you asked for. Between my last post and now here are the things that have changed:

    AWE is configured for a max of 3.5 GB RAM

    All SAN disks have been recreated so they are partition aligned

    The results you asked for:

    Object Plans - 60.429687

    SQL Plans - 4.109375

    TokenAndPermUserStore - 1.289062

    I have seen Object Plans up to 90ish just a few minutes ago.

  • Matthew Bryant (9/15/2009)


    I assume you have checked that you have correct indexes and they aren't really fragmented etc

    Matthew, you are correct. Our issue is a bit more broad than this situation would cause.

  • Paul - keep throwing links my way if you have them! The last one I think is good. Although it eludes to my issue could either be memory or disk subsystem I/O. I'll dig through this again and see if I can identify what counters can lead me in the right direction.

  • Another silly question. If I am on a 32 bit OS, do I need the /3GB switch so SQL Server can allocate > 2 GB or will the /PAE switch do?

  • Ok - so we did the partition alignment. Not the blockbuster breakthrough I was hoping to see. Definitely still seeing a lot of PAGEIOLATCH_(xx) types - mainly SH & EX. So now I think I exhausted all my options in looking at the hardware so the focus is back to memory. My question is this, how do I prove it is memory? Is seeing the page life expectancy drop below 300ish good enough or are there other specific indicators I should be concerned with?

    If I do need to upgrade the memory beyond 4 GB, it will require a rebuild to another version of the OS since Windows 2003 Standard only supports up to 4 GB.

  • Tony Fountain (9/15/2009)


    Paul, here are the counters you asked for. Between my last post and now here are the things that have changed:

    AWE is configured for a max of 3.5 GB RAM

    All SAN disks have been recreated so they are partition aligned

    The results you asked for:

    Object Plans - 60.429687

    SQL Plans - 4.109375

    TokenAndPermUserStore - 1.289062

    I have seen Object Plans up to 90ish just a few minutes ago.

    That's all good.

    Object Plans are generally good things (compiled plans for triggers, procedures and so on) so long as they are being re-used (typically yes). 60-90MB of object plans seems very reasonable.

    SQL Plans are things like ad-hoc SQL batches, and are generally not so good. It is good to see this being very small.

    The TokenAndPerm store contains combinations of permissions - an optimization. Again, nothing to worry about here.

    Paul

  • Tony Fountain (9/15/2009)


    Another silly question. If I am on a 32 bit OS, do I need the /3GB switch so SQL Server can allocate > 2 GB or will the /PAE switch do?

    The full answer would extremely long, so let me summarize:

    Never enable /3GB without being asked to by Microsoft Customer Support.

    /PAE allows Windows to see memory in excess of 4GB. No harm in enabling it AFAIK.

    AWE allows SQL Server to access any extra memory visible through the OS.

    BTW in case you missed Ian 'Capt' Scarlett's suggestions - I would encourage you to follow that good advice.

    Paul

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply