Adding Filegroups

  • Background

    System:

    Windows 2003 R2 Enterprise SP2

    24GB Ram

    8 Processors

    SQL Server Standard (64-bit) version 9.00.4035.00 SP3

    Data stored on Netapp LUNs

    AWE Enabled

    Problem

    - 1 CPU is I/O bound

    - 1 LUN contains all user databases as well as Log files.

    I've been recording some performance counters and creating daily averages. These are the figures I'm getting per day:

    - Average disk time of the LUN containing the production db ~300%.

    - Average Disk Sec/Transfer is ~20ms.

    - Average compilation/sec ~45

    - Average Page Lookups/sec ~6100

    - Average SQL Server:Buffer Manager: Page Reads/sec ~677

    - Average SQL Server:Buffer Manager: Page Life Expectancy ~270

    - Average SQL Server:Buffer Manager: Buffer Cache Hit Ratio ~99.5

    - Average Lock Wait Time ~ 1518 (down to an average of 150 the past two days)

    Other Issues

    - Different types of PAGEIOLATCH waits appear. I'm seeing on average 10 blocking chains where the blocking process is in this type of wait.

    - The Netapp aggregate is shared with other apps (including Exchange).

    Resolution

    We've done lots of work rewriting queries, re-indexing, and offloading reports to a replicated database on a different server. We've also changed MAXDOP to 1 because a ton of locks were occurring and the lock type was CXPACKET. These have basically stopped occurring, but I feel that we've limited SQL Server's ability to create parallel query plans. We're still monitoring the database for under-performing queries and tuning them as we go. I would like your expert feedback on our next moves, which are:

    - Moving the production log file as well as the data file to dedicated luns

    - Adding files to the production database. Each file will be placed on a separate LUN

    - Changing MAXDOP to match the number of files created.

    This move makes everyone worried but me, but I want to get someone else's take on it. Will adding files across multiple disks help prevent one CPU from being I/O bound? Will this also decrease the disk time on that LUN? Or do I need a wake-up call and to do something totally different?
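
    For reference, here is a rough sketch of the kind of statements we're considering (the database name, file names, sizes, and LUN paths below are just placeholders, not our real layout):

        -- Add a data file to the primary filegroup, placed on a separate LUN
        ALTER DATABASE ProductionDB
        ADD FILE
        (
            NAME = ProductionDB_Data2,
            FILENAME = 'F:\SQLData\ProductionDB_Data2.ndf',
            SIZE = 50GB,
            FILEGROWTH = 1GB
        )
        TO FILEGROUP [PRIMARY];

        -- Point the log file at a dedicated LUN; the database has to be taken
        -- offline and the physical file moved before the new path takes effect
        ALTER DATABASE ProductionDB
        MODIFY FILE
        (
            NAME = ProductionDB_Log,
            FILENAME = 'G:\SQLLogs\ProductionDB_Log.ldf'
        );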

    Sorry for the long post. I don't post often but when I do it's engaging.

    Thanks in advance

    James

    MCITP: DBA


  • Most of your questions can be answered from this TechNet best practices article.

    http://technet.microsoft.com/en-us/library/cc966412.aspx

  • Check the white paper, but definitely move the files to their own LUNs. That will let you better troubleshoot where the issues are.

  • Also, MAXDOP is based on the CPUs/cores, not files.
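
    For reference, the server-wide value is changed through sp_configure, along these lines (the value 4 below is purely illustrative, not a recommendation for this box):

        -- MAXDOP is an instance-wide setting tied to CPUs/cores
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max degree of parallelism', 4;  -- illustrative value only
        RECONFIGURE;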

  • Steve Jones - Editor (6/19/2009)


    Also, MAXDOP is based on the CPUs/cores, not files.

    Thank you all for your help.

    I was aware that MAXDOP was related to the number of CPUs.

    The white paper is excellent, however, it leads to more questions. I'll just post them one at a time.

    In order to utilize multiple CPUs and prevent one CPU from becoming I/O bound, should we add more files to the primary filegroup but place them on different LUNs?

  • Apollo74x (6/22/2009)


    Steve Jones - Editor (6/19/2009)


    Also, MAXDOP is based on the CPUs/cores, not files.

    I was aware that MAXDOP was related to the number of CPUs.

    I think Steve was responding to your statement about CPUs becoming I/O bound. I think I see what you mean, but CPUs don't become I/O bound of course 🙂

    I guess it is possible for one CPU to become saturated by processing I/O requests (particularly if all I/O is masked into one CPU) but that would be highly unusual - SQL Server generally issues asynchronous I/O requests and the I/O system becomes the bottleneck long before the CPU.

    Apollo74x (6/22/2009)


    In order to utilize multiple CPUs and prevent one CPU from becoming I/O bound, should we add more files to the primary filegroup but place them on different LUNs?

    Multiple files on separate LUNs is often a good idea - with caveats. Generally hardware RAID across the same LUNs provides bigger benefits.

    That said, some processes such as backup and file expansion may benefit from the extra thread started per file, but generally you will only see a benefit if the LUNs are serviced by physically separate disk spindles, or if some other feature of the I/O system makes this beneficial. An added complication is that the files should be the same size in order to distribute the load evenly.

    TempDB is a special case, and generally benefits from as many equally-sized files as there are cores, regardless of the physical considerations - this is because it reduces contention on the allocation structures associated with the database (tempdb tends to make many small allocations very rapidly - especially from mixed extents).
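
    As a rough sketch only (the file names, sizes, and paths here are placeholders), adding equally-sized tempdb files might look like this:

        -- Add extra tempdb data files, all the same size, ideally one per core
        ALTER DATABASE tempdb
        ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 4GB, FILEGROWTH = 512MB);

        ALTER DATABASE tempdb
        ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 4GB, FILEGROWTH = 512MB);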

    Paul

  • Paul (6/22/2009)


    I guess it is possible for one CPU to become saturated by processing I/O requests (particularly if all I/O is masked into one CPU) but that would be highly unusual - SQL Server generally issues asynchronous I/O requests and the I/O system becomes the bottleneck long before the CPU.

    So I need to resolve any issues with the I/O system and see if the % processing time goes down??

    Apollo74x (6/22/2009)


    In order to utilize multiple CPUs and prevent one CPU from becoming I/O bound, should we add more files to the primary filegroup but place them on different LUNs?

    Paul (6/22/2009)


    Multiple files on separate LUNs is often a good idea - with caveats. Generally hardware RAID across the same LUNs provides bigger benefits.

    That said, some processes such as backup and file expansion may benefit from the extra thread started per file, but generally you will only see a benefit if the LUNs are serviced by physically separate disk spindles, or if some other feature of the I/O system makes this beneficial. An added complication is that the files should be the same size in order to distribute the load evenly.

    What I'm getting from this is because I'm not getting separate spindles when a new LUN is created on the Netapp Storage Unit, I'm probably not going to see any performance benefits. Is this correct?

    If so, then do you see any problems with the disk time numbers that I'm getting? The numbers have gone down since all of the rework, but today the average % disk time is about ~132% per hour during our busy hours.

    I guess this brings me back to an issue I'm having with parallel query plans....

    When MAXDOP was set to 0, we were experiencing lots of cxpacket deadlocks. There were so many, not caused by any one query plan, that we changed this value to 1, and we've had zero cxpacket deadlocks since then. That seems to be a stopgap solution, because ideally I feel that we want to enable parallelism, which is one of the reasons I was thinking of adding more files to the database. What are your thoughts, and what would you do differently?
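
    One option I've been wondering about (just a sketch - the query and table below are placeholders) is leaving the server-wide setting at 1 and letting specific, well-behaved queries go parallel with a hint:

        -- Hypothetical reporting query allowed up to 4 threads,
        -- even though the server-wide MAXDOP is 1
        SELECT CustomerID, SUM(OrderTotal) AS Total
        FROM dbo.Orders
        GROUP BY CustomerID
        OPTION (MAXDOP 4);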

    I was thinking that a particular article shed some light on our issues, but now I'm not too sure.

  • Apollo74x (6/23/2009)


    So I need to resolve any issues with the I/O system and see if the % processing time goes down??

    I would start with the basics - you need to get the data and log separated as far as is possible.

    Some pretty in-depth and thorough advice can be found here on TechNet

    Apollo74x (6/23/2009)


    What I'm getting from this is because I'm not getting separate spindles when a new LUN is created on the Netapp Storage Unit, I'm probably not going to see any performance benefits. Is this correct?

    With any sort of NAS/SAN you just need to do the best you can. Again, the article I reference above should help you.

    Apollo74x (6/23/2009)


    When MAXDOP was set to 0, we were experiencing lots of cxpacket deadlocks.

    CXPACKET is a normal wait - not a deadlock. Problems with parallel execution most often come down to a data set which is not distributed evenly among the available threads, or occasionally where order-preserving exchanges are involved. It isn't always a simple matter. I would encourage you to read Craig Freedman's excellent articles on the subject.
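
    As a quick sanity check (a sketch using the standard wait-stats DMV), you can see how CXPACKET compares with your other waits - the figures are cumulative since the last restart:

        -- Top waits since the last restart; look at where CXPACKET falls
        SELECT TOP 10
            wait_type,
            wait_time_ms,
            waiting_tasks_count
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;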

    Paul

  • Page Life Expectancy is fairly low at 270 seconds, so you might consider adding more memory to reduce the IO. For example, up it to 64 GB.

    The Average Page Lookups/sec and the Page Reads/sec 677 seem low considering the low Page Life Expectancy. What does the CPU usage % look like?

    You should investigate disk performance in more detail. Look at the read and write disk queue lengths, the MB/sec for read, write, and total, and the disk reads, writes, and transfers/sec. If the disk queue length is staying long and the MB/sec is close to the limit for the HBA card, then you are probably IO bound.
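
    From inside SQL Server you can also get a rough per-file view of I/O stalls (a sketch - the numbers are cumulative since the last restart, so trend them rather than reading them as point-in-time values):

        -- Approximate average read/write latency per database file
        SELECT
            DB_NAME(database_id) AS database_name,
            file_id,
            num_of_reads,
            io_stall_read_ms  / NULLIF(num_of_reads, 0)  AS avg_read_ms,
            num_of_writes,
            io_stall_write_ms / NULLIF(num_of_writes, 0) AS avg_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL)
        ORDER BY io_stall DESC;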

    If you are seeing a lot of blocking, you should look into setting the databases to READ_COMMITTED_SNAPSHOT. This could eliminate a lot of the blocking if it is coming from lookup queries.
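
    Turning it on is a single statement per database (the name below is a placeholder), but note that it needs exclusive access to the database while it switches and it adds row-versioning load to tempdb:

        -- Requires no other active connections in the database when it runs
        ALTER DATABASE ProductionDB SET READ_COMMITTED_SNAPSHOT ON;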

  • I know I just seem to be posting links here, but http://technet.microsoft.com/en-us/library/cc966413.aspx is a Microsoft white paper on SQL Server 2005 Performance Tuning using the Waits and Queues methodology. It is probably the best guide out there to help identify any performance bottlenecks.

    I'm not sure I would agree about the page life expectancy, for example - the guide gives '< 300', but 270 is not significantly below that. You should probably take a look and run the example scripts to get an overview of the performance characteristics of your system. More often than not, a pattern will emerge which will guide you in the right direction - be it memory, I/O, CPU, or some other issue.

    Paul

  • Michael Valentine Jones (6/23/2009)


    The Average Page Lookups/sec and the Page Reads/sec 677 seem low considering the low Page Life Expectancy. What does the CPU usage % look like?

    The CPU Usage (%Processor Time) overall is at ~10% daily. Cores 3 and 7 are at ~40%. The other 6 are all under 3%.

    Michael Valentine Jones (6/23/2009)


    You should investigate disk performance in more detail. Look at the read and write disk queue lengths, the MB/sec for read, write, and total, and the disk reads, writes, and transfers/sec. If the disk queue length is staying long and the MB/sec is close to the limit for the HBA card, then you are probably IO bound.

    Is there a way to get the information that I need regarding the HBA card or do I have to contact our hardware team or better yet the Netapp geniuses?

  • Paul White (6/24/2009)


    I know I just seem to be posting links here, but http://technet.microsoft.com/en-us/library/cc966413.aspx is a Microsoft white paper on SQL Server 2005 Performance Tuning using the Waits and Queues methodology. It is probably the best guide out there to help identify any performance bottlenecks.

    Paul

    I appreciate the links. I'm in the process of digesting the other two. I'll keep you all posted before we decide to make any changes.

    Thanks again.

  • Apollo74x (6/24/2009)


    Michael Valentine Jones (6/23/2009)


    The Average Page Lookups/sec and the Page Reads/sec 677 seem low considering the low Page Life Expectancy. What does the CPU usage % look like?

    The CPU Usage (%Processor Time) overall is at ~10% daily. Cores 3 and 7 are at ~40%. The other 6 are all under 3%.

    Michael Valentine Jones (6/23/2009)


    You should investigate disk performance in more detail. Look at the read and write disk queue lengths, the MB/sec for read, write, and total, and the disk reads, writes, and transfers/sec. If the disk queue length is staying long and the MB/sec is close to the limit for the HBA card, then you are probably IO bound.

    Is there a way to get the information that I need regarding the HBA card or do I have to contact our hardware team or better yet the Netapp geniuses?

    If you mean the performance metrics I mentioned, they are available through the windows performance monitor.

    If you mean the HBA card limits, I don't know, but I would hope your hardware team or the vendor could supply that info.

  • The CPU Usage (%Processor Time) overall is at ~10% daily. Cores 3 and 7 are at ~40%. The other 6 are all under 3%.

    I think if you look further, you will find Cores 3 and 7 are handling all of the network IO, hence the higher CPU usage. There is a good write-up titled "Scaling Heavy Network Traffic with Windows" at http://blogs.msdn.com/sqlcat/archive/2008/09/18/scaling-heavy-network-traffic-with-windows.aspx

    The recommendation is to bind the NICs to particular cores using the Interrupt-Affinity Policy Tool and then set SQL Server to not use those CPUs.

    SQL = Scarcely Qualifies as a Language

  • Carl Federl (6/24/2009)


    I think if you look further, you will find Cores 3 and 7 are handling all of the network IO, hence the higher CPU usage. There is a good write-up titled "Scaling Heavy Network Traffic with Windows" at http://blogs.msdn.com/sqlcat/archive/2008/09/18/scaling-heavy-network-traffic-with-windows.aspx

    The recommendation is to bind the NICs to particular cores using the Interrupt-Affinity Policy Tool and then set SQL Server to not use those CPUs.

    An important caveat is that this advice is really aimed at 32-bit Windows 2000 and Windows 2003 RTM.

    These were unable to spread network-related CPU load over multiple cores.

    It is true that setting network adapter affinity may benefit very high-end systems with very high network I/O (think 1Gb/s and up) and very many cores - say 12 to 16 at the least. I doubt it applies here - though perhaps Apollo would confirm the network traffic level.

    The higher (though still comfortable) CPU utilization on those two cores could be explained by a couple of long-running MAXDOP 1 tasks, or perhaps I/O is affinitized. Who knows.

    Paul
