SQL 2008 database in recovery for 22+ hours!!!?

Question

Post reply

SQL 2008 database in recovery for 22+ hours!!!?

Viewing 7 posts - 31 through 36 (of 36 total)

You must be logged in to reply to this topic. Login to reply

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 1

Steve Jones - SSC Editor

SSC Guru

Points: 728159

October 21, 2009 at 9:47 am

#1068969

Ah, my confusion. I was thinking this was a slow restore. My mistake.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Philip Millwood-419646 SSC Eights! Points: 873 More actions · Answer 2

Sorry for the confusion... it is at least partiall my fault.

When the recovery had been running for 20+ hours, I kicked off a restore of a recent backup so that whichever finished first would 'win' and I could put the machine back into production.

The restore was done without recovery so that I could apply transaction logs. The restore went well, but when I kicked off the first transaction log restore, it never finished, despite the fact that it said it was at 100%. That was where I saw the errors.

Gail Shaw SSC Guru Points: 1004494 More actions · Answer 3

HowardW (10/21/2009)

It was my understanding that regardless of recovery interval, the data files are not written to until the transaction is committed?

Nope. When checkpoint runs it writes ALL dirty pages to disk, regardless of the state of the transaction that dirtied those pages. The rule is that the changes must be logged before the data pages are written and that the transaction is not complete until all of the associated log records have been hardened on disk. It's perfectly allowable for the data pages to be written back to disk before the commit occurs, provided the associated log records are written to disk first.

If pages were only written after the commit, then there would be no need for rollback during recovery because uncommitted changes would never need to be undone because they wouldn't have been written to the disk.

Easy to prove. Create a dummy database on a local machine. Run the following.

CREATE TABLE Dummy (



SomeStr varchar(50)

)

GO

BEGIN TRANSACTION

  INSERT INTO Dummy (SomeString) Values ('This insert will never be committed')

Then, from another query window



CHECKPOINT

GO

SHUTDOWN WITH NOWAIT

Grab a hex editor and open the data file of the DB that you just created. Search for the string that you inserted.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

SimonLiew SSCertifiable Points: 7891 More actions · Answer 4

If i'm not wrong, i'd bet your transaction log is set to auto-grow by multiply of 1024 bytes. Kimberly Tripp has pointed out a bug in ms sql if you auto-grow by this value which causes excessive VLF creation. i.e setting your tran log to auto-grow by 4096MB.

http://www.sqlskills.com/BLOGS/KIMBERLY/post/Transaction-Log-VLFs-too-many-or-too-few.aspx

I've been able to obtain the piece of code for this bug but M$ has not come back saying they'll fix this in their future SP. Mind you, i've been hit by this bug where 900GB db which took 2 hours restore time took 12 hours after tran log auto-growth.

If you do have your auto-grow option set to multiply of 1024, then i'd recommend you to shrink your transaction log to remove the excessive VLF. Then check the number of VLF and make sure its just like couple of hundred or something (but not like hundreds of thousands).

dbcc loginfo(<db name>)

Otherwise, i reckon there's no point for me to provide my "hunch" as it might just waste everyone's time reading my post. But if this is the case, then i could post more details about the bug.

Good luck,

Simon

Simon Liew
Microsoft Certified Master: SQL Server 2008

Philip Millwood-419646 SSC Eights! Points: 873 More actions · Answer 5

dbcc loginfo(<db name>) now results in 604 vlfs, growth is set to 5000MB. We did run a dbcc checkdb and got the following:

CHECKDB found 0 allocation errors and 0 consistency errors in database 'Pipeline'.

Msg 5269, Level 16, State 1, Line 1

Check terminated. The transient database snapshot for database 'Pipeline' (database ID 9) has been marked suspect due to an IO operation failure. Refer to the SQL Server error log for details.

Msg 0, Level 20, State 0, Line 0

A severe error occurred on the current command. The results, if any, should be discarded.

So it looks like the DB is okay, but google reveals little to nothing on msg 5269. Any clues on this or is it even anything to worry about, since new snaps occur whenever I setup a new replication?

Gail Shaw SSC Guru Points: 1004494 More actions · Answer 6

It's something to worry about. IO errors are always something to worry about. Will see if I can pass this on to someone who will know the details of that error.

There should be more info in the SQL error log. Is that the case and, if so, what errors are listed in the error log?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Philip Millwood-419646 SSC Eights! Points: 873 More actions · Answer 7

Ugh.

So here is the final response from MS. Not terribly helpful.

I verified the logs and could not find the Root cause of this issue.

Have my Technical lead reviewed those logs and he confirmed the same.

This seems to be an issue with Memory corruption when the recovery for this database was in action. I missed taking the mini dump which would have help in finding the root cause.

Next time when you encounter the same issue, Please take a minidump.