Slow SAN?

  • Hi

    SQL Server 2008 64bit enterprise allocated 56G RAM out of 64G. Data and Log files are sperated on seperate SAN disk. (Don't have the details of these!). OS is installed on a local drive togheter with tempdb (1 file).

    I have four quod core processors on this machine, although Degree of Parrellelism is set to 1 as per MS recomendations.

    Checked page life expectancy and buffer cache hit ratio and both are fine, i.e 800 and 98 respectively.

    1) During Index Rebuilds although I specify MAXDOP > 1, I have never seen more than one processor working! I do have enough CPU power avail but it's not being used. Can someone please guide me where to start looking?

    2) During a batch process, same problem. I'm only expecting one core to spike but this does not happen. Looking at sys.dm_os_wait_stats, my highest wait types are SOS_SCHEDULER_YIELD followed by OLEDB and PAGEIOLATCH_EX,TRACEWRITE,PAGEIOLATCH_SH, WRITELOG,IO_COMPLETION and PAGEIOLATCH_UP which got me to think that maybe I have an IO bottleneck.

    Checking from sys.dm_io_virtual_file_stats, I have noticed io_stalls_read_ms of 10742306 (48658055168 bytes read) and io_stalls_write_ms of 360106 (158105600 bytes written) on one of the data files. This seems quite high! Am I correct in suspecting that the IO can be a problem?

    Looking at the counters:

    Avg Disk Queue Length = 3.038

    Avg Disk sec/Read = 0 (Max = 0.006)

    Avg Disk sec/Write = 0.003 (Max = 0.019)

    Regards

    Brian

    Regards
    Brian Ellul
    ----------------------------------------------------------------------------------
    Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
    - Albert Einstein -

  • I'll take #2

    Assuming SAN/Network are working properly chances are you are facing some hot spots. A single file Tempdb sounds like a possible cause but I'll suggest to further investigate.

    Please follow link next for proper tools, credits to Louis Davidson and sources http://drsql.spaces.live.com/blog/cns!80677FB08B3162E4!1606.entry

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Is the same as this issue you posted about?

    http://qa.sqlservercentral.com/Forums/Topic1034234-391-1.aspx



    A.J.
    DBA with an attitude

  • A couple of things:

    1) Are you sure you are running SQL Server 2008 Enterprise Edition and not Standard Edition? Use one of the following to verify:

    Select @@version;

    Select serverproperty('Edition');

    Note: Standard Edition will not parallel process index rebuilds - no matter what settings you use.

    2) Lock pages in memory setting? Note: if you are running Standard Edition, make sure you have the right SP and CU installed to enable this right.

    3) Perform volume maintenance tasks right set? If you are running a domain user service account and that user is not in the local administrators group (it shouldn't be - according to best practices), this right would be required for instant initialization. This would only help your particular performance issues if your data files are auto growing during your index rebuilds.

    4) Are you really sure you want to disable parallelism? It might be a recommendation for OLTP applications, but it really depends on your application and the queries being run. I generally set this to 1/4 the number of CPU cores and increase the threshold up to about 60 seconds. And then I monitor the system to see if I have high CXPACKET waits - and if I do I can then address the issues for those specific queries or adjust the parallelism.

    5) Are you really getting 3ms average disk sec/write? And less than 6ms average disk sec/read? If so, then I cannot really see any issues with your IO subsystem.

    The wait stats are cumulative stats - so those numbers are not really indicative of a problem. What I would recommend is clearing the wait stats before the business day and capture the numbers at a regular interval throughout the day.

    Jeffrey Williams
    Problems are opportunities brilliantly disguised as insurmountable obstacles.

    How to post questions to get better answers faster
    Managing Transaction Logs

  • A.J. Wilbur (12/28/2010)


    Is the same as this issue you posted about?

    http://qa.sqlservercentral.com/Forums/Topic1034234-391-1.aspx

    Not exactly since this time I'm directly at the client site, equipped with a much better hardware, however yes I'm almost facing the same issues!

    Regards
    Brian Ellul
    ----------------------------------------------------------------------------------
    Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
    - Albert Einstein -

  • Jeffrey Williams-493691 (12/28/2010)


    A couple of things:

    1) Are you sure you are running SQL Server 2008 Enterprise Edition and not Standard Edition? Use one of the following to verify:

    Select @@version;

    Select serverproperty('Edition');

    Note: Standard Edition will not parallel process index rebuilds - no matter what settings you use.

    2) Lock pages in memory setting? Note: if you are running Standard Edition, make sure you have the right SP and CU installed to enable this right.

    3) Perform volume maintenance tasks right set? If you are running a domain user service account and that user is not in the local administrators group (it shouldn't be - according to best practices), this right would be required for instant initialization. This would only help your particular performance issues if your data files are auto growing during your index rebuilds.

    4) Are you really sure you want to disable parallelism? It might be a recommendation for OLTP applications, but it really depends on your application and the queries being run. I generally set this to 1/4 the number of CPU cores and increase the threshold up to about 60 seconds. And then I monitor the system to see if I have high CXPACKET waits - and if I do I can then address the issues for those specific queries or adjust the parallelism.

    5) Are you really getting 3ms average disk sec/write? And less than 6ms average disk sec/read? If so, then I cannot really see any issues with your IO subsystem.

    The wait stats are cumulative stats - so those numbers are not really indicative of a problem. What I would recommend is clearing the wait stats before the business day and capture the numbers at a regular interval throughout the day.

    1) Yes I checked and I'm using the Enterprise version

    2) Lock Pages in memory is set to a domain user while the SQL Service is logging in with the local system account. I'll have a word with their system admins and have this changed. However, both the data files and log files are already set to a reasonable size and the autogrow is set to large increments in MB.

    3) SQL Service is logging in with the local system account however the Perform volume maintenance tasks right is set to another Domain account + the administrors group. I'll have this checked as well!

    4) Parallelism has been disabled by a Microsoft guy and after reading about it have never thought of changing it. I do specify MAXDOP for Index rebuilds although it's ignored!

    5) Yes those are the reading from performance monitor. The wait stats have been cleared just before running the batch job I'm monitoring and therefore those reading all relate to the process. My suspicion for a slow SAN is due to the fact that I have these errors logged in the SQL log (sorry forgot to mention them in my original post).

    - SQL Server has encountered 2000 occurrences of IO requests taking longer than 15 seconds to complete on file.....

    Regards

    Brian

    Regards
    Brian Ellul
    ----------------------------------------------------------------------------------
    Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
    - Albert Einstein -

  • PaulB-TheOneAndOnly (12/28/2010)


    I'll take #2

    Assuming SAN/Network are working properly chances are you are facing some hot spots. A single file Tempdb sounds like a possible cause but I'll suggest to further investigate.

    Please follow link next for proper tools, credits to Louis Davidson and sources http://drsql.spaces.live.com/blog/cns!80677FB08B3162E4!1606.entry

    I'll create multiple tempdb files as recommended and see how it goes. In my original post I said that the tempdb was on the local disk when in fact they are on the same SAN drive as the data files!

    Regards
    Brian Ellul
    ----------------------------------------------------------------------------------
    Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
    - Albert Einstein -

  • brian118 (12/29/2010)


    PaulB-TheOneAndOnly (12/28/2010)


    I'll take #2

    Assuming SAN/Network are working properly chances are you are facing some hot spots. A single file Tempdb sounds like a possible cause but I'll suggest to further investigate.

    Please follow link next for proper tools, credits to Louis Davidson and sources http://drsql.spaces.live.com/blog/cns!80677FB08B3162E4!1606.entry

    I'll create multiple tempdb files as recommended and see how it goes. In my original post I said that the tempdb was on the local disk when in fact they are on the same SAN drive as the data files!

    1) you REALLY need to know details about your SAN LUN configuration and file placements to be able to know what to do here and to understand what you are seeing

    2) maxdop of 1 should only be used on an OPTIMALLY TUNED PURE OLTP system

    3) you need to do a differential file IO stall and wait stats analysis during high-load times. search web for track_waitstats_2005 and use the scripts you find to handle that task. for file IO stalls, put output from the dmv into a temp table, waitfor some delay (i usually do 5 or 10 mins) then take another snapshot and diff the two sets to get actual metrics during the interval.

    4) consider getting a perf tuning professional to give your system a review. a day or two from such an individual can pay tremendous returns for both your system and for your knowledge!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (12/29/2010)


    brian118 (12/29/2010)


    PaulB-TheOneAndOnly (12/28/2010)


    I'll take #2

    Assuming SAN/Network are working properly chances are you are facing some hot spots. A single file Tempdb sounds like a possible cause but I'll suggest to further investigate.

    Please follow link next for proper tools, credits to Louis Davidson and sources http://drsql.spaces.live.com/blog/cns!80677FB08B3162E4!1606.entry

    I'll create multiple tempdb files as recommended and see how it goes. In my original post I said that the tempdb was on the local disk when in fact they are on the same SAN drive as the data files!

    1) you REALLY need to know details about your SAN LUN configuration and file placements to be able to know what to do here and to understand what you are seeing

    2) maxdop of 1 should only be used on an OPTIMALLY TUNED PURE OLTP system

    3) you need to do a differential file IO stall and wait stats analysis during high-load times. search web for track_waitstats_2005 and use the scripts you find to handle that task. for file IO stalls, put output from the dmv into a temp table, waitfor some delay (i usually do 5 or 10 mins) then take another snapshot and diff the two sets to get actual metrics during the interval.

    4) consider getting a perf tuning professional to give your system a review. a day or two from such an individual can pay tremendous returns for both your system and for your knowledge!

    Thanks for your replies..

    1) I'll be siting down with the Administrators to gain a better understanding of their SAN configuration

    2) The application is Dynamicx AX 2009 and besides me having NO access to the application/code, I'm assuming that this application is already optimally tuned as an OLTP system. My only area where I can investigate is the DB and it's related hardware such as the SAN

    3) Thanks I'll be using that script to monitor IO stalls

    4) Will be considering this option as well!

    Regards

    Brian

    Regards
    Brian Ellul
    ----------------------------------------------------------------------------------
    Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
    - Albert Einstein -

  • Dynamics AX is hardly a well-tuned (or well designed/coded) SQL Server application from what I have heard. But there are things you can and probably should do for tuning, such as possbily adding indexes (although this is unsupported, you can remove them before you apply patches and then put them back on), using plan guides for suboptimal queries, evaluate optimize for ad hoc workloads and also forced parameterization. Note that all of those things are pretty advanced and use them only with proper knowledge and especially testing.

    Drop me a line if you need a script to to the file IO stall analysis, and best of luck with your troubleshooting!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply