Disk Config issues

  • I found the following errors in our SQL Server.

    I/O error 1450(Insufficient system resources exist to complete the requested service.) detected during read at offset 0x0000017e358000 in file 'E:\Shared\Microsoft SQL Server\MSSQL\Data\2004Data.NDF'..

    I/O error 1450(Insufficient system resources exist to complete the requested service.) detected during read at offset 0x000001ec8ce000 in file 'F:\Shared\Microsoft SQL Server\MSSQL\Data\2004Q1Data2.NDF'..

    The SQL Server is a Dell 4-processor system with 4 GB RAM, SAN attached (~600 GB of storage on a 15-drive RAID 5). The server hosts around 15 databases averaging 15 GB each. These errors show up when there is heavy load on the server. So far no data has been corrupted. These databases have a history of pounding the server they were previously hosted on with similar errors; that prompted us to move to the SAN. But our vendor insisted that RAID 5 with 76 GB drives is good enough (I wanted RAID 10 and more, smaller drives). Now, 6 months after migration to the new server, I find myself facing the same errors. One obvious solution is to ask for more drives, and maybe RAID 10. Can you think of anything else?
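
    As a first diagnostic step, here is a minimal T-SQL sketch (assuming SQL Server 2000; the FileId value is only an example taken from sp_helpfile output) to map the affected .NDF files to their volumes and watch the cumulative I/O stall time per file. A large jump in IoStallMS during the heavy-load window points at the disk subsystem rather than at SQL Server itself:

    -- List the physical files so you know which FileId maps to the E: and F: volumes.
    EXEC sp_helpfile;

    -- Cumulative I/O statistics for one file (SQL 2000 syntax: both arguments required).
    -- IoStallMS is the total time, in milliseconds, that I/O requests waited on this file.
    SELECT  DbId, FileId, NumberReads, NumberWrites,
            BytesRead, BytesWritten, IoStallMS
    FROM    ::fn_virtualfilestats(DB_ID(), 3);   -- 3 = example FileId from sp_helpfile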

  • I'd look to increase the cache size on the SAN as well.



    Shameless self promotion - read my blog http://sirsql.net

  • I'd get MS to work with the SAN vendor on this. SQL is reporting a lower level OS error.

  • You might want to make sure that the disks are on different controllers. Also, how many HBAs do you have? We have HBAs with a throughput of 100 MB/s, but according to SAN experts, the best throughput is achieved at 50 MB/s.

  • We have exactly the same problems, running SQL 2000 with 2 GB memory, Windows 2000 Advanced Server, and 4 CPUs. We are using SAN storage connected by 2 fibre cards. The databases range from 10 GB to 400 GB in size for decision-support applications.

    Looking at:

    http://support.microsoft.com/default.aspx?scid=kb;en-us;828339

    it appears to be caused by a lower-level OS-related problem, possibly related to

    http://support.microsoft.com/default.aspx?scid=kb;en-us;555068

    Cause

    Having the /3GB and /PAE switches enabled in boot.ini on a Windows 2000 Advanced Server with over 4 GB of RAM, and running SQL Server in a cluster with AWE enabled.

    Resolution

    Use only the /PAE switch and the AWE setting by removing the /3GB switch from boot.ini. As an additional precaution, make sure to set the maximum memory available to SQL Server to prevent it from starving the operating system (see the configuration sketch at the end of this post).

    Please also refer to

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;247904

    which describes what happens when certain resources (System Page Table Entries, or System PTEs) fall below a certain threshold.

    You might want to monitor your System PTE levels while the box is under heavy load.
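
    A minimal sketch of the memory-cap side of that resolution (the 3000 MB value is only an example; pick a cap that leaves enough for the OS and any other services on the box):

    -- Enable the advanced options so the memory and AWE settings are visible.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Check whether AWE is currently on (0 = off, 1 = on).
    EXEC sp_configure 'awe enabled';

    -- Cap SQL Server's buffer pool so the OS is not starved under load.
    EXEC sp_configure 'max server memory (MB)', 3000;
    RECONFIGURE;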

  • If the server has 2 GB of memory, why would you have /3GB, /PAE and AWE enabled? You should only use /3GB when the server has 4 GB of memory, and /PAE & AWE once you get above that.



    Shameless self promotion - read my blog http://sirsql.net

  • I suppose my quote was misleading. I was trying to point to areas that influence the System PTEs available to the box. With 4 GB of memory, the /3GB switch impacts the number of System PTEs available.

    Right now, we don't have the /3GB switch enabled.

  • Thanks to all those who replied for your valuable input. Here is some additional info on this system configuration:

    I am using the /3GB switch only, since I have 3.5 GB RAM and the system is a dedicated SQL Server. I am not using the /PAE switch since I am not over 4 GB. I ran perfmon: SQL Server is using ~2.7 GB and there is ~500 MB of available memory, so I am discounting the possibility of the OS starving for memory. Nevertheless, I will remove the /3GB switch and see if that makes any difference.

    Last week I added 6 additional volumes to the system. Each volume is 65 GB carved out of a 7-drive RAID, so I effectively have 40 drives. Since this is a SAN, I am not the only user on those RAIDs. In fact, each of those RAID volumes is about 420 GB (72 GB drives) and I am allocated 65 GB. The SAN is an IBM Shark. It has 6 GB of read cache and 2 GB of write cache. Our SAN administrator has checked the SAN and drives and found no issues on his side.

    I started collecting some additional disk stats, and here they are. I ran fn_virtualfilestats against the database while the index recreation job that generated those insufficient-disk-resources errors was running, then took the delta between the disk stats from before and after the indexing job (a sketch of this delta measurement follows at the end of this post).

    The read volume was 15,788,867,584 bytes in a 15-minute period, which translates into an average of only 17 megabytes/sec. The write volume was 18,536,996,352 bytes in a 15-minute period, which translates into an average of only 20 megabytes/sec.

    Then I worked with our SAN admin. He checked the stats on the fibre channels. We have two 2 Gb fibre channels, and a 2 Gbit Fibre Channel can handle a maximum transfer rate of about 120 megabytes per second. Each fibre channel shows ~21 MB/sec in and 41 MB/sec out. The Shark has 2 fibre adapters, shared by three servers, and sees about 22 MB/sec in and 43 MB/sec out. These are less than 20% of the capacity (120 MB/sec), I am told. Hence the finger points back at SQL. We also did throughput testing on those RAID volumes using IOMeter and the Microsoft disk-stressing tool, and they do ~60 MB/sec for reads using 64 KB blocks. So we seem to be well within the capacity of all the subsystems.
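
    The delta measurement described above can be scripted. A minimal sketch, assuming SQL Server 2000 and a single hypothetical FileId (repeat per file): snapshot the cumulative stats before the index job, snapshot again afterwards, and convert the byte deltas to MB/sec over the elapsed interval:

    -- Snapshot cumulative file stats before the index job (hypothetical FileId 3).
    SELECT  BytesRead, BytesWritten, GETDATE() AS SampleTime
    INTO    #before
    FROM    ::fn_virtualfilestats(DB_ID(), 3);

    -- ... run the index recreation job here ...

    -- Snapshot again and convert the deltas to MB/sec over the elapsed seconds.
    SELECT  (a.BytesRead    - b.BytesRead)    / 1048576.0
                / DATEDIFF(ss, b.SampleTime, GETDATE()) AS ReadMBPerSec,
            (a.BytesWritten - b.BytesWritten) / 1048576.0
                / DATEDIFF(ss, b.SampleTime, GETDATE()) AS WriteMBPerSec
    FROM    ::fn_virtualfilestats(DB_ID(), 3) AS a
            CROSS JOIN #before AS b;

    DROP TABLE #before;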

  • I would be very surprised if SQL is the bottleneck. We have an EMC SAN connected to a 4-way machine (2 GHz processors) with 8 GB of RAM, Windows 2000, and 2 x 2 Gbit fibre channel with load-balancing software (PowerPath). We get a continuous throughput of 143 megabytes/sec, and I have seen 153 megabytes/sec.

    To get this throughput we use SQL Litespeed to back up a large database to another disk on the SAN. When Litespeed backs up the database it compresses the file to 25% of the original size, which keeps us from being constrained by writes back to disk. On an HP disk array we have had a throughput of 166 megabytes/sec on an 8-way machine. If I back up a smaller database twice in a row, the first one's throughput is 143 megabytes/sec and the second is over 300 megabytes/sec. We back up a 100 GB database in a little over 10 minutes, so I can confirm SQL Server is more than capable of high throughput. (A plain BACKUP DATABASE sketch for getting a comparable throughput number follows at the end of this post.)

    When we originally set up the SAN we were getting very similar performance to yours. Our configuration was 4 x 8 GB stripes to make up a 32 GB LUN. We now have 8 x 4 GB stripes to make up a 32 GB LUN, and we then stripe either 4 or 5 LUNs together at the OS level. You need to make your disk stripes smaller and spread them across more disks on the SAN.

    Get your SAN administrator to check the performance of the individual disks on the SAN that your database uses. You will probably find you are exceeding the throughput capacity of the individual disks.

    See some other posts.

    http://qa.sqlservercentral.com/forums/shwmessage.aspx?forumid=65&messageid=107712#bm111783

    http://qa.sqlservercentral.com/forums/shwmessage.aspx?forumid=5&messageid=111959&p=2
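
    Litespeed is a third-party tool, but a rough native-tools analogue for getting a throughput baseline is a plain full backup to a file on the SAN; the completion message reports its own MB/sec figure. A minimal sketch (database name and target path are hypothetical):

    -- Back up to a file on the SAN; the completion message reports throughput,
    -- e.g. "BACKUP DATABASE successfully processed ... (123.456 MB/sec)."
    BACKUP DATABASE SalesDW                                   -- hypothetical database name
    TO DISK = 'G:\Backups\SalesDW_throughput_test.bak'        -- hypothetical SAN path
    WITH INIT, STATS = 10;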

  • Can you check the perfmon counter Disk Queue Length and make sure it is not exceeding 2 x the number of processors? Regarding the Microsoft stress-test utility, below is what Microsoft has to say:

    One thing it is definitely good for is determining if there are problems with a caching controller. If the controller loses any reads or writes in its cache, we have problems.


  • Something you might want to check with the SAN team is whether you have striped or concatenated volumes. Also see whether they see any hotspots on the SAN.

  • I agree that the SQL Server software cannot be the bottleneck. We have another server with a low-end IBM FAStT with 40 x 17 GB drives in two RAID 5 volumes (each with 20 drives, effectively 320 GB). I have a 250 GB DB with 50 GB tables, and I reorganize them every week; each 50 GB table takes about 3 hours and never times out. That SAN is dedicated to that server only. Ironically, that SAN is over 4 years old and has older drives and probably other older components too. The drives on the Shark are 10K RPM, 76 GB drives. The disk queue gets to about 20 during high load (for a RAID volume with 6 or 7 drives), so it is not terrible. Another thing to note is that the MS IO testing tool reported 60 MB/sec of IO. As I have already mentioned, my indexing job is pushing about 37 MB/sec (17 MB read and 20 MB write), once again below that capacity. I would be happy if it took longer or ran a bit slower; what puzzles me is the timing out with the insufficient resources errors. I am cautious not to hold the SAN responsible, since the other systems sharing the SAN are not reporting any of these problems.

    Mark: You mentioned that you do striping at the OS level of the volumes presented by the SAN. I did not do that, since our OS admin and I thought it might impose some additional load (software striping). So I am presenting each LUN to the DB, with database files spread equally across those volumes. Since ours is a DW server, I can spread the data fairly evenly.


    Next week I am probably going to call MS (reaching the point where $250/hour may be worth it); it does seem to be the last resort.
