Performance Degradation

  • I do not know if this is a SQL Server issue but I am hoping that maybe someone has experienced something similar and might be able to give me some advice on how diagnose this.

    For starters I have a clustered SQL Server environment running on Windows Enterprise Server 2008 r2 (64bit). We recently rebuilt the server to 2008 r2 (JULY)

    8 Quad Core 2.4Ghz processors

    256 GB RAM

    Version SQL Server 2005 SP3CU7 (4273)... SQL Server has 100 Gig RAM dedicated to it

    A couple of weeks ago we had some reports of sluggishness on our application.

    We identified the dedicated database server as the source of the sluggishness but we were unable to identify a smoking gun. We realized that the problem had to be the SQL Server machine because our remote desktop connections were taking very long to connect. We could not identify any resources that were starved on the machine.

    All my DBA monitors were showing a little high in values but nothing that I can say that would create the kind of degradation that we were experiencing.

    CPU utilization across the board was running at 30% to 35%, maybe 5% higher than normal.

    Memory looked good but I can not say that I did a good job of analysis here...I need some better monitors and I am researching that now (Any advice for researching memory issues would be appreciated)

    CPU Queue Waits looked fine

    Normal Blocking ocurrances

    SQL Traces showed nothing completely out of the ordinary as far as response times

    Overall nothing that I could see inside SQL Server that I could say was causing the issue.

    We ended up failing over SQL Server to the B side of the cluster and the problem resolved itself immediately for both the appliciations accessing SQL Server and the Remote Desktop Connection accessing the primary server. We re-booted the A side of the cluster just to be safe and then failed SQL Server back to its primary location and all was fine.

    Now, two weeks later we are starting to see slight signs of the same problem.

    I have opened a case with Microsoft But I have not had any satisfactory results yet.

    My first Question...What resources beyond the obvious should I be monitoring and how?

    Second Part...Part of a performance analysis that I did last year identified one stored procedure that was using high CPU cycles and being called quite frequently. I traced the problem to a FOR XML RAW statement. One of the applications that has been seeing the highest amount of performance degradation calls this procedure often. I have called this out to the developers in hopes that we can redesign this so that we do not use the FOR XML RAW. Note...peak usage for our sytem starts around September and I am starting to see a pattern ever two weeks.

    My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?

    I know this is a complicated issue.

    Any advice would be greatly appreciated.

  • Eric, I believe you're on the right track regarding memory.

    FOR XML by itself is not going to the problem, any more then any heavily used, large result set query would be. A quick question.

    What's the SQL Server's memory settings?


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Min server Memory (MB) - 30720

    Max Server Memory (MB) - 102400

    Perfmon shows SQL Server:memory Manager:Total Server Memory (KB) at 104,857,600 ( or 102400 MB)

  • First question:

    Are You set "lock pages in memory" in gpedit.msc for account sqlservices?

    Twice:

    what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?

    Best regards,

    Michal Marek

    Best regards,
    Michał Marek

  • michal_marek (10/12/2010)


    First question:

    Are You set "lock pages in memory" in gpedit.msc for account sqlservices?

    Twice:

    what are you set RAID for Database, TempDb, Log Files... ? It is for separates LUN and spindles?

    Yes, we are set for "lock pages in memory".

    One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?

    There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.

    We are 0+1 across the board accept for the c: which is 1.

    Tempdb has its own LUN with its own set of spindles as does log, data and system DB's

    thanks for the reply, let me know if there is anything else you can think of.

    I am working with Microsft to caputre statistics durring the next degradation occurance.

    Please keep the questions and comments flowing, any help is appreciated.

    I will post any solutions that I come across.

  • Yes, we are set for "lock pages in memory".

    One question...we have this set via the Domain Admins group which our SQL Service account is part of (not recommended I know). Would this be a problem? Do I need to set the policy on the domain account directly?

    No, i think so 🙂

    There is a lot of conflicting documentation surrounding this policy on 64bit machines. I have read that it is no longer needed on 64bit machines? We have it set to be safe.

    This set is for "lock" pages in memory and to avoid Windows and other application "memory pressure",

    We are 0+1 across the board accept for the c: which is 1.

    Tempdb has its own LUN with its own set of spindles as does log, data and system DB's

    thanks for the reply, let me know if there is anything else you can think of.

    I am working with Microsoft to caputre statistics durring the next degradation occurance.

    Please keep the questions and comments flowing, any help is appreciated.

    I will post any solutions that I come across.

    ...

    My suggestion:

    MS DTC, Check in PerfMon Counters for it, statistic, Operations IOs ...

    Other Check:

    Average disk queue lenght (recommendation is <2 ms per physical disk in LUN/RAID for Database, for Log and TempDb...

    Please, read this paper: http://www.servicesorientedstorage.com/assets/pdf/tuning-microsoft-sql-server-2005-performance-wp.pdf

    Use PAL tool to chceck bottlenack disk subsystem http://pal.codeplex.com/

    Best regards,
    Michał Marek

  • Whot is set in "sp_configure" ?

    sp_configure 'show advanced options', 1;

    GO

    RECONFIGURE WITH OVERRIDE;

    GO

    sp_configure

    GO

    Best regards,
    Michał Marek

  • 1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.

    2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues

    3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483

    4) have you analyzed network utilization?

    5) sound like you are doing a lot of remote desktop sessions. why?

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Name Min Max ConfigVal RunVal

    Ad Hoc Distributed Queries 0 1 0 0

    affinity I/O mask -2147483648 2147483647 0 0

    affinity mask -2147483648 2147483647 0 0

    affinity64 I/O mask -2147483648 2147483647 0 0

    affinity64 mask -2147483648 2147483647 0 0

    Agent XPs 0 1 1 1

    allow updates 0 1 0 0

    awe enabled 0 1 0 0

    blocked process threshold 0 86400 5 5

    c2 audit mode 0 1 0 0

    clr enabled 0 1 0 0

    common criteria compliance enabled 0 1 0 0

    cost threshold for parallelism 0 32767 5 5

    cross db ownership chaining 0 1 0 0

    cursor threshold -1 2147483647 -1 -1

    Database Mail XPs 0 1 1 1

    default full-text language 0 2147483647 1033 1033

    default language 0 9999 0 0

    default trace enabled 0 1 1 1

    disallow results from triggers 0 1 0 0

    fill factor (%) 0 100 0 0

    ft crawl bandwidth (max) 0 32767 100 100

    ft crawl bandwidth (min) 0 32767 0 0

    ft notify bandwidth (max) 0 32767 100 100

    ft notify bandwidth (min) 0 32767 0 0

    index create memory (KB) 704 2147483647 0 0

    in-doubt xact resolution 0 2 0 0

    lightweight pooling 0 1 0 0

    locks 5000 2147483647 0 0

    max degree of parallelism 0 64 4 4

    max full-text crawl range 0 256 4 4

    max server memory (MB) 16 2147483647 102400 102400

    max text repl size (B) 0 2147483647 65536 65536

    max worker threads 128 32767 0 0

    media retention 0 365 0 0

    min memory per query (KB) 512 2147483647 2048 2048

    min server memory (MB) 0 2147483647 30720 30720

    nested triggers 0 1 1 1

    network packet size (B) 512 32767 4096 4096

    Ole Automation Procedures 0 1 1 1

    open objects 0 2147483647 0 0

    PH timeout (s) 1 3600 60 60

    precompute rank 0 1 0 0

    priority boost 0 1 0 0

    query governor cost limit 0 2147483647 0 0

    query wait (s) -1 2147483647 -1 -1

    recovery interval (min) 0 32767 0 0

    remote access 0 1 1 1

    remote admin connections 0 1 0 0

    remote login timeout (s) 0 2147483647 20 20

    remote proc trans 0 1 0 0

    remote query timeout (s) 0 2147483647 600 600

    Replication XPs 0 1 0 0

    scan for startup procs 0 1 0 0

    server trigger recursion 0 1 1 1

    set working set size 0 1 0 0

    show advanced options 0 1 1 1

    SMO and DMO XPs 0 1 1 1

    SQL Mail XPs 0 1 1 1

    transform noise words 0 1 0 0

    two digit year cutoff 1753 9999 2049 2049

    user connections 0 32767 0 0

    user options 0 32767 0 0

    Web Assistant Procedures 0 1 1 1

    xp_cmdshell 0 1 0 0

  • 1) disk queue length is a useless measure of IO performance on modern hardware. avg disk sec/read and /write are best.

    Yes, am am now capturing and trending these numbers daily.

    2) do a fileIO stall analysis and wait stat analysis during the event (and before to get a baseline). andy kelly has a presentation/scripts to help with this, although I use track_waitstats_2005 for the waits. There is a whitepaper from microsoft on this too. search for sql server 2005 waits and queues

    Yes, I have set up perfmons to capure fileIO stats, wait stats etc. If this does re-occur, I am ready to compare healthy to degraded

    3) do NOT use lock pages in memory without reading this first: http://support.microsoft.com/kb/918483%5B/quote%5D

    I've looked into this as well as consulted MS on this, we have it set as a recomendation from MS.

    4) have you analyzed network utilization?

    That is monitored daily by NW engineer, but I have not monitored from the servers perspective. I have included Network stats in the perfmon traces.

    5) sound like you are doing a lot of remote desktop sessions. why?

    I wouldn't say we do a lot. This is a remote server and RDC is the primary means for logging into the server for debugging and administrator tasks. When we were debugging this issue it was obvious that the RDC was not responding as expected.

    I appreciate the input and questions.

    It sounds like I am on the right track for diagnosis. I found the MS White papper on tracking wait stats and I plan on giving that an in depth look.

    Tomorrow is the day where we might expect a re-occurance. I will post my findings if it does occur.

  • My Second question....Can frequent use of FOR XML RAW cause a SYSTEM wide issue? Maybe memory related?

    I know this is a complicated issue.

    Any advice would be greatly appreciated.

    I did a no-no and piggy backed two questions in the same topic...I am going to post this question in a more suitable forum.

  • hi, can you please leave a link to the new thread ?

    i am following this 🙂

  • Final followup to this thread in case anyone was interested....

    We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.

    We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.

    We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.

    I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.

    We plan on rebuilding this machine at a later time to see if the problems go away.

  • Eric1/2aB (12/20/2010)


    Final followup to this thread in case anyone was interested....

    We ended up having the problem re-occur and neither Microsoft nor I could determine the source of the issue even with all of the monitors I had going.

    We ended up failing SQL Server to the secondary side of the cluster and we have not seen the problem since.

    We are also seeing some very strange results in the form of extremely High Disk Response times coming from the simple Task Manager Resource Monitor. Microsoft has indicated that the monitor results are incorrect as they differ from the perf mon results. The second side of the cluster is built with the exact same configuration and we do not see these anomolies.

    I agree with Microsoft that the numbers in the Resource Monitor are probably off, but I believe that this could be an indication of something else going on either with a corrupted OS install/patch or hardware issue.

    We plan on rebuilding this machine at a later time to see if the problems go away.

    My advice for the "corrupted OS install/patch": If you're really concerned, do a binary (or at least a hash) comparison of all OS files between, in this case, your primary and secondary sides of the cluster. A corrupt file is by definition different than the original. Jacksum[/url] is a Java-based tool to do a variety of hashes, and there are many others (md5sum is an old, out of date Unix favorite).

    I hear a lot of "X must be corrupt" tossed around on issues that nobody's figured out; many times an actual investigation of X shows this to translate directly to "I have no idea", as X is identical on a working and a nonworking system (or a before/after snapshot, whichever).

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply