Recover Master ( cluster,SAN dies )

  • How to handle this Disaster Recovery Scenario: active/passive cluster, master on SAN, backups on SAN (not good I know ), client databases on storage array,SAN dies.

    Since posting the following on the Enterprise forum ( no responses so far ), I realized that IF you copy your live master mdf/ldf files to the storage array nightly, you still can't start sql server unless the server finds the master files where it expects them ( G:\mssql\data on the SAN ), which may mean you have to remap G  to the array where you have been copying your master files nightly.)

    One other twist is that the sql oriented people in the company don't have access to the cluster management software and the system people who do don't know sql.  So this means you must have two people for any recovery.

    -----------------------------

    From Enterprise forum:

    I'm looking for input on using cluster.exe to stop the cluster, copy the master mdf/ldf and restart.

    We have an active/passive cluster for sql 2000.  The system databases and backups are on a SAN. The user/client databases will soon be moved from the SAN to a storage array which will be like a local drive to the cluster.

    In one disaster scenario, if the SAN goes down, the client databases will be available, but sql won't run until the system databases are restored.  Now I could have backups go to both the SAN and the storage array, but getting master back on line from backup is a little involved when master mdf/ldf are unavailable.  I'd rather be able to copy the master.mdf and ldf from SAN to array every night.

    Can't do that without buying a tool that can copy live files, or using cluster.exe in a batch file to stop the cluster, copy mdf,ldf and start the cluster every night. Since getting approval to buy tools is nearly impossible ( e.g. veritas cluster sql agent ), I'm leaning towards cluster.exe -- assuming it would have to be run by the windows scheduler, not sql agent.

    Randy Petty

  • You can always start up sql from the command line and pass the location to the system databases. It will also attempt to create tempdb where it says it should live according to sysdatabases in the master database. Another issue is if the cluster dies and the quarrim(sp?) drive is unavailable the cluster will not start ether. My advice to you would be not to bother copying the system databases but to copy the data you need out of them i.e. users, permissions, jobs, dts scripts and the like. If you have a total SAN failure you are still pretty much cooked even if you do copy the system drives due to the cluster failing to start anyway.

    Wes

  • You could stop SQL once, get a valid copy of a master, then just back it up from then on to an alternate location (that I would make sure goes to tape).  On a recovery, use the valid master.mdf to start SQL then restore your latest copy of master to get the right one.  You might want to consider making sure you get MSDB also.  This only bypasses the rebuilm utility to get a valid master to start SQL.

  • Jane, I like that idea.  SANs supposedly fail completely once in a lifetime, but we still have to prepare for that possibility.   So I'm exploring the steps and order of steps for:

    1. Take cluster offline

    2. stop sql ( if #1 doesn't do that  ??)

    3. use sqlservr.exe to start sql with parameters identifying new path for master and errorlog

    4. stop sql and restart from command line with trace flag.  This allows you to attach a new msdb and model database ( off the SAN)

    5. stop sql and restart without trace flag

    6. alter database command to repoint tempdb

    7. restore latest master ,model,msdb from backup

    8. Use cluster commands to repoint quorum drive off San, then restart everything. (ommitted in earlier post)

    Then the fun continues because if the SAN comes back up in 3 hours, you get to restore things to the way they were.

    Randy Petty

  • I like Wesley's suggesstion of copying the logins and passwords. Below is the script which you can use to script out the username and password. For DTS packages, you can save them in source safe or any other place.

    http://support.microsoft.com/kb/246133

     

    If you want to stop SQL, you need to take the SQL service only offline and not stop the cluster service. If you stop the cluster service, the disk resource would be offline and you won't be able to access it. But again, i won't like to go this path.

  • Old Hand,  if you stop sql on the active node of the cluster, with cluster active, wouldn't it just attempt an automatic failover to the passive node?  Of course with the primary cluster resource, the SAN, down, I'm not sure how much that matters.  With system DBs on the SAN, nothing is going to work without help.  Now the secondary cluster resource is a storage array device ( DEll Powervault ) which normally only houses the large client database.  If the SAN is down we're attempting to get back into business using the powervault only.  Are you saying with the cluster offline we can't access the powervault from the sql box -- it's directly connected.?

    I'm thinking the order of these steps is critical and we may need to hire a cluster consultant.  Very little cluster experience in house.

    Randy

  • Below are your answers.

    Old Hand,  if you stop sql on the active node of the cluster, with cluster active, wouldn't it just attempt an automatic failover to the passive node?  Of course with the primary cluster resource, the SAN, down, I'm not sure how much that matters.

    You don't want to stop SQL services. Yes, it will trigger a failover. But, You can go to the cluster administator and take the SQL service offline. This will stop the SQL Services but not trigger a failover.

     

    This is how Active/Passive cluster is typically laid out.

    Two Groups.

    1) Cluster Group

    Resources : Cluster I.P., Cluster Name, Quorum Drive, MSDTC(Optional)

    2) SQL Group.

    Resources : Disk1(SAN), Disk2(Powervault in your case),SQL Server, SQL Agent ........

    1) If your Cluster services are down, you can't access any disk. (Note :There are some ways by disabling some drivers but i don't want to go in that detail)

    2) If  say Disk1 from above is lost, in that case, SQL Group will fail a couple times and will eventually declare the disk1 resource to be not available. SQL Server will stop because it cannot find the system files. Now, if you want to start SQL Services, you can move your system databases to Disk2 and then startup SQL server.

    Note : Cluster being offline and SQL Services being offline are different.

  • Sounds like it's time for replication.  I'll have to look into what potential performance hit this would have, but it might be the quickest, least ulcer-prone way to get things back up quickly. 

    Just point the web servers at the standby sql box in the event of SAN or other failure.

    Randy

  • Hmm, perhaps a dumb question, but why don't you put master and msdb on the storage array too? putting them on a different storage system just adds another point of failure. if the storage array with the client db's fails nothing goes anyway...

    karl

    Best regards
    karl

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply