Monitoring Tool

  • Hello all, I need a good recommendation. Last week our co. lost a week worth of productivity due to a corrupt db on an enterprise server. Apparently we had a brown-out which caused the OS to write a bad block at the beginning of the file. The funny thing is that subsequent backups with verify did not fail and we were not aware of the issue for several days. So when the problems DID manifest and it LOOKED like we had good backups we could not get them to restore. That being said, if we had had the manpower to pour through the logs every day we would have noticed the i/o error, 1 entry among 100,000, and we could have caught the error much sooner. Now I'm being tasked by my project manager to find a tool that would alert us when we receive that 1 critical error among the thousands of legitimate entries. Does anyone know of such a tool?

    Thanks in advance,

    dab

  • http://www.quest.com/

    I think you need a good recovery plan, lost of one week productivity is way too much and unacceptable to me.

    How large is your database?

  • The database is one of our smaller ones, about 14 GB and we do perform a nightly backup. We do have 3 other systems in a failover environment with log shipping but we are still about 6 weeks out from this particular system begin added to one ( lack of money and harware ). Yes, the failover would have caught this the day it happened when the backup failed to restore. However, I've been tasked to come up with a tool to detect critical errors and alert us when they occur.

  • I guess I need to rephrase. What I am looking for is some kind of tool that will detect the error, even at 2:00 in the morning, as it's written to the log and notify myself and the other DBA by email that it occurred

  • I think in this case, you might want to set up an alert that would start a job that go back at that time to read the log file and send you whatever you are looking for that you think is fishy.

    For example, I set up an alert for transaction log full, when it got trigger, it start a job and that job would go back to read the current error log looking for that error and send out pager and email to me to let me know which database has transaction log full so that I could look at it right of way.

    I also set up a job that would gather all errorlog for the whole day and weed out the normal stuff and send an email to me only those error that I think is out of ordinary or I think is important to read.

     

    mom

  • So mom, you are suggesting writing a custom log tool.

    Is there anything on the market besides Log P.I. that performs a similar function?

  • MOM --> Microsoft Operations Management is the tool that we use. It provides lots (repeat lots) of monitoring. Here's the main link below (plan on spending a hour or two investigating and reading first though):

     

    http://www.microsoft.com/mom/default.mspx

     

    Another question for you though ... are your servers on UPS power ? If not, all the monitoring in the world will not help.

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

  • Yes, our servers are on UPS' and battery backups. However, and this opinion does not reflect that of upper management, they suck. We have maybe 5 minutes to gracefully shut down anything in the event of a true blackout.

    However, I've actually found an undocumented proc, sp_readerrorlog, and am working on a custom job to monitor the SQL Error log and alert appropriate personell.

    dab

  • If you want to pursue the SQL solution go the SQL Server Alert path they are very very powerful and can help you get where you want very easily (Every thing that goes in the log can be detected with Alerts).

    The other thing is to use third party which will usually imply big $$$ 

     

     


    * Noel

  • IMHO 5 minutes UPS time is not enough 'battery time' for a SQL server. This lack of time for a graceful shutdown is probably the root cause of your corruption. Even if you have automated shutdown integration (i.e. software in the UPS that senses power loss and then begins to shut things down on the server) 5 minutes is most probably not enough time. Good luck on xp_readerrorlog it's undocumented but it works well with a little massaging of the data it returns.

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

  • Five minutes is deifnitely not enough. We have 8 different enterprise db servers. As for the cause, it looks like a brown out since none of them recorded a shut down event and several show an i/o or RAID event at the same time. And I'm making fair progress with sp_readerrorlog. Enough so that in conjunction with SQL Alerts I should have a more robust monitoring system. Finally, we received news that the new server arrived at the airport and we will be replacing this within a few weeks. That means the existing server will become a failover server.

    dab

  • Without sufficient power 'failover' will not help unfortunately.

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply