Adding Filegroups

  • Background

    System:

    Windows 2003 R2 Enterprise SP2

    24GB Ram

    8 Processors

    SQL Server Standard (64-bit) version 9.00.4035.00 SP3

    Data stored on Netapp LUNs

    AWE Enabled

    Problem

    - 1 CPU is I/O bound

    - 1 LUN contains all user databases as well as Log files.

    I've been recording some performance counters and creating daily averages. These are the figures I'm getting per day:

    - Average disk time of the LUN containing the production db ~300%.

    - Average Disk Sec/Transfer is ~20ms.

    - Average compilation/sec ~45

    - Average Page Lookups/sec ~6100

    - Average SQL Server:Buffer Manager: Page Reads/sec ~677

    - Average SQL Server:Buffer Manager: Page Life Expectancy ~270

    - Average SQL Server:Buffer Manager: Buffer Cache Hit Ratio ~99.5

    - Average Lock Wait Time ~ 1518 (down to an average of 150 the past two days)

    Other Issues

    - Different types of PAGEIOLATCH waits appear. I'm seeing on average 10 blocking chains where the blocking process is in this type of wait.

    - The Netapp aggregate is shared with other apps (including Exchange).

    Resolution

    We've done lots of work rewriting queries, re-indexing, and offloading reports to a replicated database on a different server. We've also changed MAXDOP to 1 because a ton of locks were occurring and the lock type was CXPACKET. These have basically stopped occurring, but I feel that we've limited SQL Server's ability to create parallel query plans. We're still monitoring the database for under-performing queries and tuning them as we go. I would like your expert feedback on our next moves, which are:

    - Moving the production log file as well as the data file to dedicated luns

    - Adding files to the production database. Each file will be placed on a separate LUN

    - Changing MAXDOP to match the number of files created.

    This move makes everyone worried but me, but I want to get someone else's take on it. Will adding files across multiple disks help prevent one CPU from being I/O bound? Will this also decrease the disk time on that LUN? Or do I need a wake-up call and to do something totally different?
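
    For reference, here is a rough sketch of the kind of statements we're considering (the database name, file names, sizes, and LUN paths below are just placeholders, not our real layout):

        -- Add a data file to the primary filegroup, placed on a separate LUN
        ALTER DATABASE ProductionDB
        ADD FILE
        (
            NAME = ProductionDB_Data2,
            FILENAME = 'F:\SQLData\ProductionDB_Data2.ndf',
            SIZE = 50GB,
            FILEGROWTH = 1GB
        )
        TO FILEGROUP [PRIMARY];

        -- Point the log file at a dedicated LUN; the database has to be taken
        -- offline and the physical file moved before the new path takes effect
        ALTER DATABASE ProductionDB
        MODIFY FILE
        (
            NAME = ProductionDB_Log,
            FILENAME = 'G:\SQLLogs\ProductionDB_Log.ldf'
        );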

    Sorry for the long post. I don't post often but when I do it's engaging.

    Thanks in advance

    James

    MCITP: DBA


  • Most of your questions can be answered from this TechNet best practices article.

    http://technet.microsoft.com/en-us/library/cc966412.aspx

  • Check the white paper, but definitely move the files to their own LUNs. That will let you better troubleshoot where the issues are.

  • Also, MAXDOP is based on the CPUs/cores, not files.
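
    For reference, the server-wide value is changed through sp_configure, along these lines (the value 4 below is purely illustrative, not a recommendation for this box):

        -- MAXDOP is an instance-wide setting tied to CPUs/cores
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max degree of parallelism', 4;  -- illustrative value only
        RECONFIGURE;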

  • Steve Jones - Editor (6/19/2009)


    Also, MAXDOP is based on the CPUs/cores, not files.

    Thank you all for your help.

    I was aware that MAXDOP was related to the number of CPUs.

    The white paper is excellent, however, it leads to more questions. I'll just post them one at a time.

    In order to utilize multiple CPUs and prevent one CPU from becoming I/O bound, should we add more files to the primary filegroup but place them on different LUNs?

  • Apollo74x (6/22/2009)


    Steve Jones - Editor (6/19/2009)


    Also, MAXDOP is based on the CPUs/cores, not files.

    I was aware that MAXDOP was related to the number of CPUs.

    I think Steve was responding to your statement about CPUs becoming I/O bound. I think I see what you mean, but CPUs don't become I/O bound of course 🙂

    I guess it is possible for one CPU to become saturated by processing I/O requests (particularly if all I/O is masked into one CPU) but that would be highly unusual - SQL Server generally issues asynchronous I/O requests and the I/O system becomes the bottleneck long before the CPU.

    Apollo74x (6/22/2009)


    In order to utilize multiple CPUs and prevent one CPU from becoming I/O bound, should we add more files to the primary filegroup but place them on different LUNs?

    Multiple files on separate LUNs is often a good idea - with caveats. Generally hardware RAID across the same LUNs provides bigger benefits.

    That said, some processes such as backup and file expansion may benefit from the extra thread started per file, but generally you will only see a benefit if the LUNs are serviced by physically separate disk spindles, or if some other feature of the I/O system makes this beneficial. An added complication is that the files should be the same size in order to distribute the load evenly.

    TempDB is a special case, and generally benefits from as many equally-sized files as there are cores, regardless of the physical considerations - this is because it reduces contention on the allocation structures associated with the database (tempdb tends to make many small allocations very rapidly - especially from mixed extents).
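
    As a rough sketch only (the file names, sizes, and paths here are placeholders), adding equally-sized tempdb files might look like this:

        -- Add extra tempdb data files, all the same size, ideally one per core
        ALTER DATABASE tempdb
        ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 4GB, FILEGROWTH = 512MB);

        ALTER DATABASE tempdb
        ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 4GB, FILEGROWTH = 512MB);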

    Paul

  • Paul (6/22/2009)


    I guess it is possible for one CPU to become saturated by processing I/O requests (particularly if all I/O is masked into one CPU) but that would be highly unusual - SQL Server generally issues asynchronous I/O requests and the I/O system becomes the bottleneck long before the CPU.

    So I need to resolve any issues with the I/O system and see if the % processing time goes down??

    Apollo74x (6/22/2009)


    In order to utilize multiple CPUs and prevent one CPU from becoming I/O bound, should we add more files to the primary filegroup but place them on different LUNs?

    Paul (6/22/2009)


    Multiple files on separate LUNs is often a good idea - with caveats. Generally hardware RAID across the same LUNs provides bigger benefits.

    That said, some processes such as backup and file expansion may benefit from the extra thread started per file, but generally you will only see a benefit if the LUNs are serviced by physically separate disk spindles, or if some other feature of the I/O system makes this beneficial. An added complication is that the files should be the same size in order to distribute the load evenly.

    What I'm getting from this is because I'm not getting separate spindles when a new LUN is created on the Netapp Storage Unit, I'm probably not going to see any performance benefits. Is this correct?

    If so, then do you see any problems with the disk time numbers that I'm getting? The numbers have gone down since all of the rework, but today the average % disk time is about ~132% per hour during our busy hours.

    I guess this brings me back to an issue I'm having with parallel query plans....

    When MAXDOP was set to 0, we were experiencing lots of cxpacket deadlocks. There were so many, not caused by any one query plan, that we changed this value to 1, and we've had zero cxpacket deadlocks since then. That seems to be a stopgap solution, because ideally I feel that we want to enable parallelism, which is one of the reasons I was thinking of adding more files to the database. What are your thoughts, and what would you do differently?
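
    One option I've been wondering about (just a sketch - the query and table below are placeholders) is leaving the server-wide setting at 1 and letting specific, well-behaved queries go parallel with a hint:

        -- Hypothetical reporting query allowed up to 4 threads,
        -- even though the server-wide MAXDOP is 1
        SELECT CustomerID, SUM(OrderTotal) AS Total
        FROM dbo.Orders
        GROUP BY CustomerID
        OPTION (MAXDOP 4);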

    I was thinking that a particular article shed some light on our issues, but now I'm not too sure.

  • Apollo74x (6/23/2009)


    So I need to resolve any issues with the I/O system and see if the % processing time goes down??

    I would start with the basics - you need to get the data and log separated as far as is possible.

    Some pretty in-depth and thorough advice can be found here on TechNet

    Apollo74x (6/23/2009)


    What I'm getting from this is because I'm not getting separate spindles when a new LUN is created on the Netapp Storage Unit, I'm probably not going to see any performance benefits. Is this correct?

    With any sort of NAS/SAN you just need to do the best you can. Again, the article I reference above should help you.

    Apollo74x (6/23/2009)


    When MAXDOP was set to 0, we were experiencing lots of cxpacket deadlocks.

    CXPACKET is a normal wait - not a deadlock. Problems with parallel execution most often come down to a data set which is not distributed evenly among the available threads, or occasionally where order-preserving exchanges are involved. It isn't always a simple matter. I would encourage you to read Craig Freedman's excellent articles on the subject.
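
    As a quick sanity check (a sketch using the standard wait-stats DMV), you can see how CXPACKET compares with your other waits - the figures are cumulative since the last restart:

        -- Top waits since the last restart; look at where CXPACKET falls
        SELECT TOP 10
            wait_type,
            wait_time_ms,
            waiting_tasks_count
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;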

    Paul

  • Page Life Expectancy is fairly low at 270 seconds, so you might consider adding more memory to reduce the IO. For example, up it to 64 GB.

    The Average Page Lookups/sec and the Page Reads/sec 677 seem low considering the low Page Life Expectancy. What does the CPU usage % look like?

    You should investigate disk performance in more detail. Look at the read and write disk queue lengths, the MB/sec for read, write, and total, and the disk reads, writes, and transfers/sec. If the disk queue length is staying long and the MB/sec is close to the limit for the HBA card, then you are probably IO bound.
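
    From inside SQL Server you can also get a rough per-file view of I/O stalls (a sketch - the numbers are cumulative since the last restart, so trend them rather than reading them as point-in-time values):

        -- Approximate average read/write latency per database file
        SELECT
            DB_NAME(database_id) AS database_name,
            file_id,
            num_of_reads,
            io_stall_read_ms  / NULLIF(num_of_reads, 0)  AS avg_read_ms,
            num_of_writes,
            io_stall_write_ms / NULLIF(num_of_writes, 0) AS avg_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL)
        ORDER BY io_stall DESC;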

    If you are seeing a lot of blocking, you should look into setting the databases to READ_COMMITTED_SNAPSHOT. This could eliminate a lot of the blocking if it is coming from lookup queries.
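
    Turning it on is a single statement per database (the name below is a placeholder), but note that it needs exclusive access to the database while it switches and it adds row-versioning load to tempdb:

        -- Requires no other active connections in the database when it runs
        ALTER DATABASE ProductionDB SET READ_COMMITTED_SNAPSHOT ON;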

  • I know I just seem to be posting links here, but http://technet.microsoft.com/en-us/library/cc966413.aspx is a Microsoft white paper on SQL Server 2005 Performance Tuning using the Waits and Queues methodology. It is probably the best guide out there to help identify any performance bottlenecks.

    I'm not sure I would agree about the page life expectancy, for example - the guide gives '< 300', but 270 is not significantly below that. You should probably take a look and run the example scripts to get an overview of the performance characteristics of your system. More often than not, a pattern will emerge which will guide you in the right direction - be it memory, I/O, CPU, or some other issue.

    Paul

  • Michael Valentine Jones (6/23/2009)


    The Average Page Lookups/sec and the Page Reads/sec 677 seem low considering the low Page Life Expectancy. What does the CPU usage % look like?

    The CPU Usage (%Processor Time) overall is at ~10% daily. Cores 3 and 7 are at ~40%. The other 6 are all under 3%.

    Michael Valentine Jones (6/23/2009)


    You should investigate disk performance in more detail. Look at the read and write disk queue lengths, the MB/sec for read, write, and total, and the disk reads, writes, and transfers/sec. If the disk queue length is staying long and the MB/sec is close to the limit for the HBA card, then you are probably IO bound.

    Is there a way to get the information that I need regarding the HBA card or do I have to contact our hardware team or better yet the Netapp geniuses?

  • Paul White (6/24/2009)


    I know I just seem to be posting links here, but http://technet.microsoft.com/en-us/library/cc966413.aspx is a Microsoft white paper on SQL Server 2005 Performance Tuning using the Waits and Queues methodology. It is probably the best guide out there to help identify any performance bottlenecks.

    Paul

    I appreciate the links. I'm in the process of digesting the other two. I'll keep you all posted before we decide to make any changes.

    Thanks again.

  • Apollo74x (6/24/2009)


    Michael Valentine Jones (6/23/2009)


    The Average Page Lookups/sec and the Page Reads/sec 677 seem low considering the low Page Life Expectancy. What does the CPU usage % look like?

    The CPU Usage (%Processor Time) overall is at ~10% daily. Cores 3 and 7 are at ~40%. The other 6 are all under 3%.

    Michael Valentine Jones (6/23/2009)


    You should investigate disk performance in more detail. Look at the read and write disk queue lengths, the MB/sec for read, write, and total, and the disk reads, writes, and transfers/sec. If the disk queue length is staying long and the MB/sec is close to the limit for the HBA card, then you are probably IO bound.

    Is there a way to get the information that I need regarding the HBA card or do I have to contact our hardware team or better yet the Netapp geniuses?

    If you mean the performance metrics I mentioned, they are available through the windows performance monitor.

    If you mean the HBA card limits, I don't know, but I would hope your hardware team or the vendor could supply that info.

  • The CPU Usage (%Processor Time) overall is at ~10% daily. Cores 3 and 7 are at ~40%. The other 6 are all under 3%.

    I think if you look further, you will find Cores 3 and 7 are handling all of the network IO, hence the higher CPU usage. There is a good write-up titled "Scaling Heavy Network Traffic with Windows" at http://blogs.msdn.com/sqlcat/archive/2008/09/18/scaling-heavy-network-traffic-with-windows.aspx

    The recommendation is to bind the NICs to particular cores using the Interrupt-Affinity Policy Tool and then set SQL Server to not use those CPUs.

    SQL = Scarcely Qualifies as a Language

  • Carl Federl (6/24/2009)


    I think if you look further, you will find Cores 3 and 7 are handling all of the network IO, hence the higher CPU usage. There is a good write-up titled "Scaling Heavy Network Traffic with Windows" at http://blogs.msdn.com/sqlcat/archive/2008/09/18/scaling-heavy-network-traffic-with-windows.aspx

    The recommendation is to bind the NICs to particular cores using the Interrupt-Affinity Policy Tool and then set SQL Server to not use those CPUs.

    An important caveat is that this advice is really aimed at 32-bit Windows 2000 and Windows 2003 RTM.

    These were unable to spread network-related CPU load over multiple cores.

    It is true that setting network adapter affinity may benefit very high-end systems with very high network I/O (think 1Gb/s and up) and very many cores - say 12 to 16 at the least. I doubt it applies here - though perhaps Apollo would confirm the network traffic level.

    The higher (though still comfortable) CPU utilization on those two cores could be explained by a couple of long-running MAXDOP 1 tasks, or perhaps I/O is affinitized. Who knows.

    Paul
