DBCC CHECKDB fails (nightmare scenario)

  • Not repairable in any way. By the looks of the errors, every file in that database is damaged across most of its length. It looks like large portions of the DB have been zeroed (0x00 written across the sectors), so the data that was in the affected pages is no longer there.

    If you have no backup, this database is likely a complete and total loss.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gianluca Sartori (10/4/2011)


    I know that my comment is totally useless now, but, hell, why don't you have a backup?

    Failure in communicating would be the short version. 🙁

  • How exactly was the database "wiped" and what exactly was restored? All files from scratch or only some files/blocks?

    All your errors stem from the fact that your PFS pages seem blank.

    Msg 8909, Level 16, State 1, Line 1

    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:2491104) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).

    Page ID 2491104 is a PFS page, given that 2491104 % 8088 = 0. Since that PFS page is ruined, the 8088 pages it covers can't be checked by CheckDB:

    Msg 8998, Level 16, State 2, Line 1

    Page errors on the GAM, SGAM, or PFS pages prevent allocation integrity checks in database ID 6 pages from (1:2491104) to (1:2499191). See other errors for cause.

    This is basically the same error throughout all of your results. What's odd is that all of the PFS pages fail in the same way: an unexpected 0:0 PageID stored in the header. Had random blocks been written to the disk, these shouldn't all be 0:0, but random x:y values. As they're not, I'm suspecting the restore actually zeroed out those blocks, one way or the other - hence I'd like to know more about the exact nature of the "wipe" and following restore.

    You can use DBCC PAGE to manually spot-check some of the uncheckable data pages and see whether they've truly been zeroed out or still contain data. If they've been zeroed out and you have no other restore options, you're out of luck, game over. If they do contain data, you may be able to retrieve it through alternative means.
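
    Something along these lines should do it (page 2491105 is just an example - the first page after the damaged PFS page from the error above; substitute your own database name, file ID and page numbers). Print option 2 dumps the raw page, so a fully zeroed page is obvious at a glance:

    DBCC TRACEON (3604); -- send DBCC PAGE output to the client instead of the error log
    DBCC PAGE ('<db_name>', 1, 2491105, 2); -- database, file ID, page ID, print option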

    You mention the DB was 3.7TB - was this the allocated disk space or the actual in-use space of the file? Were the data files allocated using instant file initialization or by zeroing out the blocks on allocation? If the latter, it may be that SQL Server doesn't recognize the zeroed part of the file if other metadata is corrupt.
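
    If you're not sure whether instant file initialization is enabled on the instance, one way to check is trace flags 3004 and 3605, which make SQL Server log its file-zeroing activity to the error log. Roughly (IFICheck is just a throwaway database name):

    DBCC TRACEON (3004, 3605, -1); -- log zeroing operations to the error log
    CREATE DATABASE IFICheck; -- throwaway database just to trigger file creation
    EXEC sp_readerrorlog 0, 1, 'Zeroing'; -- if the data file (not just the log) was zeroed, IFI is off
    DROP DATABASE IFICheck;
    DBCC TRACEOFF (3004, 3605, -1);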

    Mark S. Rasmussen
    Blog: improve.dk
    Twitter: @improvedk
    Internals geek & author of OrcaMDF

  • Mark S. Rasmussen (10/4/2011)


    As they're not, I'm suspecting the restore actually zeroed out those blocks, one way or the other - hence I'd like to know more about the exact nature of the "wipe" and following restore.

    Restore or IO subsystem glitch (since the log was on a disk that's 'very gone', it's not a stretch to think that there are other IO subsystem problems). I think the majority of the DB has been zeroed out. The errors are across almost all the files and in several places it's a series of allocation pages that are wiped, so it's fairly likely that some or all of the pages between are zeroed as well.

    Also we have a system table corruption right at the beginning (Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Restore or IO subsystem glitch (since the log was on a disk that's 'very gone', it's not a stretch to think that there are other IO subsystem problems). I think the majority of the DB has been zeroed out. The errors are across almost all the files and in several places it's a series of allocation pages that are gone, so it's fairly likely that some or all of the pages between are zeroed as well.

    Probably the case, yes. However, all of the PFS page errors are in the 2,500,000+ PageID range, leaving at least the first ~20GB of data (2,500,000 pages x 8KB per page) unaffected by PFS page clearing. As such, there may be a large amount of recoverable data in there. And even in the 2,500,000+ range, there are many ranges with unaffected PFS pages too.

    Also we have a system table corruption right at the beginning (Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.)

    Question is what pages exactly are corrupted and how much it affects.

    OP - what is the nature of the data? Is this an "accounting"-like system where you need all data 100% correct, or would any amount of restorable data be helpful? While CheckDB probably won't help you, and you absolutely won't be able to get 100% correct data, it may be possible to salvage large amounts of data either way. No guarantees, but there are manual ways of doing so.

    Mark S. Rasmussen
    Blog: improve.dk
    Twitter: @improvedk
    Internals geek & author of OrcaMDF

  • Mark S. Rasmussen (10/4/2011)


    Also we have a system table corruption right at the beginning (Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.)

    Question is what pages exactly are corrupted and how much it affects.

    And what checks were terminated as a result. All the errors are allocation errors and you'd think with widespread damage there would be at least one non-allocation error. Could be that a lot of errors aren't showing because of the system table error.

    serr4no: Can you pick one table (a small one to start) and run this:

    DBCC CheckTable (<table name>)
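
    If the output gets long, it may be worth adding a couple of options so informational messages are suppressed and every error for the table is reported, not just the first batch:

    DBCC CHECKTABLE ('<table name>') WITH ALL_ERRORMSGS, NO_INFOMSGS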

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I chose a table and ran CheckTable:

    Msg 824, Level 24, State 2, Line 1

    SQL Server detected a logical consistency-based I/O error: incorrect pageid (expected 1:8743947; actual 0:0). It occurred during a read of page (1:8743947) in database ID 6 at offset 0x000010ad816000 in file 'F:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\DATA\<db_name>.mdf'.
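
    For what it's worth, that offset checks out: 0x000010ad816000 / 8,192 bytes per page = page 8,743,947, so the read landed exactly where page (1:8743947) should be - the page stored there simply has a zeroed header (0:0), consistent with the PFS pages above.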

  • How did you do the block level restore? Was it from a Windows Server Backup? You wouldn't normally expect this sort of backup to be so totally corrupted...

    Is it possible you're actually doing this restore onto suspect hardware or is this completely separate from the disks that the problem initially occurred on?

  • Mark S. Rasmussen (10/4/2011)


    Restore or IO subsystem glitch (since the log was on a disk that's 'very gone', it's not a stretch to think that there are other IO subsystem problems). I think the majority of the DB has been zeroed out. The errors are across almost all the files and in several places it's a series of allocation pages that are gone, so it's fairly likely that some or all of the pages between are zeroed as well.

    Probably the case, yes. However, all of the PFS page errors are in the 2,500,000+ PageID range, leaving at least the first ~20GB of data (2,500,000 pages x 8KB per page) unaffected by PFS page clearing. As such, there may be a large amount of recoverable data in there. And even in the 2,500,000+ range, there are many ranges with unaffected PFS pages too.

    Also we have a system table corruption right at the beginning (Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.)

    Question is what pages exactly are corrupted and how much it affects.

    OP - what is the nature of the data? Is this an "accounting"-like system where you need all data 100% correct, or would any amount of restorable data be helpful? While CheckDB probably won't help you, and you absolutely won't be able to get 100% correct data, it may be possible to salvage large amounts of data either way. No guarantees, but there are manual ways of doing so.

    The database is a statistics and analysis system, and actually the structure is more important than the data.

    Also, since the database is partitioned monthly (a primary mdf and secondary monthly ndf files), repairing and restoring the recent data (2 recent ndf's) is more important than the rest.

    A large portion of the data from the "older" months can be restored from an existing outdated backup, but not only does it lack recent data, the structure has also changed over time. Going back to that backup completely is the absolute last resort, since months of development and data would be lost.

  • HowardW (10/4/2011)


    How did you do the block level restore? Was it from a Windows Server Backup? You wouldn't normally expect this sort of backup to be so totally corrupted...

    Is it possible you're actually doing this restore onto suspect hardware or is this completely separate from the disks that the problem initially occurred on?

    The database files are on a NetApp WAFL LUN inside a volume; that single volume file was accidentally lost.

    A disk block restore was made, but it's obvious that not all the blocks were restored.

    Not sure how the NetApp restore works in detail - whether it zeroes the bad blocks, that is.

    Would a Windows chkdsk in full block-repair mode do much at this stage?

  • The database is a statistics and analysis system, and actually the structure is more important than the data.

    Structure as in schema? Can't you derive that from the application(s) accessing the database? O/RM configuration, etc.? If the schema has changed over time, I assume that means you've built new tables for new data while keeping archived data in older tables with different schemas?

    Also, since the database is partitioned monthly (a primary mdf and secondary monthly ndf files), repairing and restoring the recent data (2 recent ndf's) is more important than the rest.

    Do you know the FileIDs of the recent and most important NDFs? It would be interesting to correlate those with the error messages. How big are your files - and are they all the same size?
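
    If you don't have them handy, something like this, run in the context of the damaged database, should list the file IDs, names and sizes (assuming the file metadata itself is still readable):

    SELECT file_id, name, physical_name, size / 128 AS size_mb -- size is stored in 8KB pages
    FROM sys.database_files;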

    A large portion of the data from the "older" months can be restored from an existing outdated backup, but not only does it lack recent data, the structure has also changed over time. Going back to that backup completely is the absolute last resort, since months of development and data would be lost.

    Would any amount of restorable data, whether schema or actual data, be better than none? Or would partial data, possibly some of it corrupted, essentially be useless?

    Mark S. Rasmussen
    Blog: improve.dk
    Twitter: @improvedk
    Internals geek & author of OrcaMDF

    The database files are on a NetApp WAFL LUN inside a volume; that single volume file was accidentally lost.

    A disk block restore was made, but it's obvious that not all the blocks were restored.

    Not sure how the NetApp restore works in detail - whether it zeroes the bad blocks, that is.

    Would a Windows chkdsk in full block-repair mode do much at this stage?

    If your blocks have been zeroed, I doubt a chkdsk would do much, if anything. Until you know with complete certainty what you want to do with the corruption, I'd touch as little as possible to avoid unintentionally making things worse. Reading is OK; writing might just compound the damage.

    Mark S. Rasmussen
    Blog: improve.dk
    Twitter: @improvedk
    Internals geek & author of OrcaMDF

  • serr4no (10/4/2011)


    Would a Windows chkdsk in full block-repair mode do much at this stage?

    No, almost certainly not.

    I would be seeing if the storage team could do more at this stage (liaising with the SAN vendor) - clearly the restore hasn't worked as advertised. I could understand having some issues with a hot backup, but not to the level you've experienced.

  • GilaMonster (10/4/2011)


    Mark S. Rasmussen (10/4/2011)


    Also we have a system table corruption right at the beginning (Msg 8921, Level 16, State 1, Line 1

    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.)

    Question is what pages exactly are corrupted and how much it affects.

    And what checks were terminated as a result. All the errors are allocation errors and you'd think with widespread damage there would be at least one non-allocation error. Could be that a lot of errors aren't showing because of the system table error.

    serr4no: Can you pick one table (a small one to start) and run this:

    DBCC CheckTable (<table name>)

    Did this on a few more tables and some are actually readable - looks OK.

    Others not so.

    In process of finding out from devs what parts are most valuable.

  • At this point I'd recommend querying the tables and seeing what, if anything, you can get back from them. Some will fail, some won't. For tables that fail, some you'll be able to query ranges (of the clustered index), some you won't. This will be tedious and slow, but without a backup that's pretty much the only option left.
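
    A rough sketch of what that looks like (table, key and target names made up - use whatever clustered key each table actually has, and narrow the ranges around whichever ones throw an 824 until you've isolated the broken pages):

    SELECT *
    INTO SalvageDB.dbo.SomeTable_Part1 -- copy whatever is readable into a clean database
    FROM dbo.SomeTable
    WHERE SomeKey >= 1 AND SomeKey < 100000; -- then move on to the next key range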

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
