Crazy Problems - Cluster / Multiple SQL Instances / Performance Problems

  • Background:

    We have a cluster running SQL 2005 64-bit on two active nodes. Node 2 has 2 instances of SQL on it. The default instance of SQL on Node 2 has maybe 40-50 DB's on it.

    Problem:

    A week before I started with my new employer they started experiencing very slow performance (30+ seconds) on some queries that previously performed really well (<1 second). So the consultant DBA and I started digging in. We ran traces and perfmon watching all the usual suspects and then some. Page Faults, Stolen Pages, Disk Queues, Memory, CPU (all 16 on the node individually), transactions, Buffer stats, Compiles/Recompiles.

    Page faults were really high. 11k at times. Disk queues were really low. Processor usage was really low except during a few queries - and not every time those queries ran. It seemed to be happening randomly. It also corresponded with some very heaving page faults. Page Faults seemed to us to indicate memory pressure (although I've learned that's not always the case). Page life expectancy fluctuated rapidly from 0 to over 3000 at times.

    Traces would show us which queries were taking the longest but the ones we found were part of TFS. TFS encrypts their stored procedures so we could see exactly what was going on inside.

    Remember, this is all happening during my 1st week at a new job. I've sold myself as the DBA and I'm not fixing this problem. The only thing I have going for me at this point is the spendy consultant (a good guy) doesn't know it either. We're beating our heads against our desks. Neither one of us designed the environment that things are running in. It's complex enough that we aren't sure there is even just one problem.

    Two more weeks pass and we've recommended a RAM upgrade. I wasn't convinced it was the root of the problem but I do know the server needs it for other reasons. I thought I could leverage the RAM for SQL to see if it helped then later reallocate it for other purposes on the server if I could.

    We hadn't yet ordered the RAM. We were still getting crazy, random performance problems. Then someone asked me to take a DB offline. Completely unrelated to the issue I was working on. They just didn't need the DB anymore and thought it would be a good idea to shut it off. So I did. POOF! Performance problem goes away! The other DBA and I are sitting around going, "WTF?". So we start it up again. Problem returns. It's that DB that no one needs.

    But why? A look in the application log of the server shows that the DB in question was starting up and doing a checkdb about once a minute. Someone had it set it to auto-close. This shouldn't cause trouble on it's own with a DB that no one was using. Except the other DBA had the RedGate SQLBackup client open at all times on his local PC to monitor backups. Once a minute SQLBackup would poke all the DB's and force that auto-closed one to come online to say it was ok then it would go back to sleep.

    I'm sure there's a moral to this story. I just can't find it yet because I'm too tired and happy from working on and, if not exactly fixing, at least discovering the root cause of this problem.

    I felt I had to share. Take from this what you will. 🙂

  • Sounds like the story had a happy ending and at least you guys were "chasing his tail" and not your own. Even if you did "trip" over the solution you still added value by precisely identifying the solution once you tripped over it instead of just saying "that fixed it but we don't know why."

    Also, there is something terribly wrong with a gui altering any database in any way without explicit user say so and doing it via the user's account.

  • A look in the application log of the server shows that the DB in question was starting up and doing a checkdb about once a minute. Someone had it set it to auto-close.

    Please be aware of the Knowledge Base article titled "You may experience a decrease in query performance after you perform certain database maintenance operations or regular transaction operations in SQL Server 2005" http://support.microsoft.com/kb/917828

    Some root causes of procedure cache flush

    A database has the AUTO_CLOSE database option set to ON.

    You run several queries against a database that has default options. Then, the database is dropped.

    A database snapshot for a source database is dropped.

    You change any database state to OFFLINE or ONLINE.

    You successfully rebuild the transaction log for any database.

    You restore a database backup.

    You run the DBCC CHECKDB statement.

    You specify one of the following options when you run the ALTER DATABASE statement:

    OFFLINE, ONLINE, MODIFY FILEGROUP DEFAULT, MODIFY_NAME, MODIFY FILEGROUP READ_WRITE,

    COLLATE, MODIFY FILEGROUP READ_ONLY, READ_ONLY, READ_WRITE

    The whole procedure cache is cleared if one of the following server options is changed by the RECONFIGURE statement:

    cross db ownership chaining, index create memory (KB), remote query timeout (s), user options,

    max text repl size (B), cost threshold for parallelism, max degree of parallelism, min memory per query (KB), query wait (s), min server memory (MB), max server memory (MB), query governor cost limit

    SQL = Scarcely Qualifies as a Language

  • Thanks for sharing your experience!

    MJ

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply