SQL 2008 database in recovery for 22+ hours!!!?

  • Ah, my confusion. I was thinking this was a slow restore. My mistake.

  • Sorry for the confusion... it is at least partiall my fault.

    When the recovery had been running for 20+ hours, I kicked off a restore of a recent backup so that whichever finished first would 'win' and I could put the machine back into production.

    The restore was done without recovery so that I could apply transaction logs. The restore went well, but when I kicked off the first transaction log restore, it never finished, despite the fact that it said it was at 100%. That was where I saw the errors.

  • HowardW (10/21/2009)


    It was my understanding that regardless of recovery interval, the data files are not written to until the transaction is committed?

    Nope. When checkpoint runs it writes ALL dirty pages to disk, regardless of the state of the transaction that dirtied those pages. The rule is that the changes must be logged before the data pages are written and that the transaction is not complete until all of the associated log records have been hardened on disk. It's perfectly allowable for the data pages to be written back to disk before the commit occurs, provided the associated log records are written to disk first.

    If pages were only written after the commit, then there would be no need for rollback during recovery because uncommitted changes would never need to be undone because they wouldn't have been written to the disk.

    Easy to prove. Create a dummy database on a local machine. Run the following.

    CREATE TABLE Dummy (

    SomeStr varchar(50)

    )

    GO

    BEGIN TRANSACTION

    INSERT INTO Dummy (SomeString) Values ('This insert will never be committed')

    Then, from another query window

    CHECKPOINT

    GO

    SHUTDOWN WITH NOWAIT

    Grab a hex editor and open the data file of the DB that you just created. Search for the string that you inserted.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If i'm not wrong, i'd bet your transaction log is set to auto-grow by multiply of 1024 bytes. Kimberly Tripp has pointed out a bug in ms sql if you auto-grow by this value which causes excessive VLF creation. i.e setting your tran log to auto-grow by 4096MB.

    http://www.sqlskills.com/BLOGS/KIMBERLY/post/Transaction-Log-VLFs-too-many-or-too-few.aspx

    I've been able to obtain the piece of code for this bug but M$ has not come back saying they'll fix this in their future SP. Mind you, i've been hit by this bug where 900GB db which took 2 hours restore time took 12 hours after tran log auto-growth.

    If you do have your auto-grow option set to multiply of 1024, then i'd recommend you to shrink your transaction log to remove the excessive VLF. Then check the number of VLF and make sure its just like couple of hundred or something (but not like hundreds of thousands).

    dbcc loginfo(<db name>)

    Otherwise, i reckon there's no point for me to provide my "hunch" as it might just waste everyone's time reading my post. But if this is the case, then i could post more details about the bug.

    Good luck,

    Simon

    Simon Liew
    Microsoft Certified Master: SQL Server 2008

  • dbcc loginfo(<db name>) now results in 604 vlfs, growth is set to 5000MB. We did run a dbcc checkdb and got the following:

    CHECKDB found 0 allocation errors and 0 consistency errors in database 'Pipeline'.

    Msg 5269, Level 16, State 1, Line 1

    Check terminated. The transient database snapshot for database 'Pipeline' (database ID 9) has been marked suspect due to an IO operation failure. Refer to the SQL Server error log for details.

    Msg 0, Level 20, State 0, Line 0

    A severe error occurred on the current command. The results, if any, should be discarded.

    So it looks like the DB is okay, but google reveals little to nothing on msg 5269. Any clues on this or is it even anything to worry about, since new snaps occur whenever I setup a new replication?

  • It's something to worry about. IO errors are always something to worry about. Will see if I can pass this on to someone who will know the details of that error.

    There should be more info in the SQL error log. Is that the case and, if so, what errors are listed in the error log?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ugh.

    So here is the final response from MS. Not terribly helpful.

    I verified the logs and could not find the Root cause of this issue.

    Have my Technical lead reviewed those logs and he confirmed the same.

    This seems to be an issue with Memory corruption when the recovery for this database was in action. I missed taking the mini dump which would have help in finding the root cause.

    Next time when you encounter the same issue, Please take a minidump.

Viewing 7 posts - 31 through 36 (of 36 total)

You must be logged in to reply to this topic. Login to reply