SQL Server 2000 and NetApp

  • We currently have SQL Server set up to use a NetApp NAS for log file and data file storage. However, in recent days and weeks, numerous equipment failures in our data center have caused our SQL Servers to behave erratically.

    Such outages produce SQL Server log messages like these:

    2004-03-03 19:42:32.65 spid1 LogWriter: Operating system error 121(The semaphore timeout period has expired.) encountered.

    2004-03-03 19:42:32.65 spid1 Write error during log flush. Shutting down server

    2004-03-03 19:42:32.84 spid1 LogWriter: Operating system error 64(The specified network name is no longer available.) encountered.

    2004-03-03 19:42:32.84 spid1 Write error during log flush. Shutting down server

    2004-03-03 19:42:32.84 spid52 Error: 9001, Severity: 21, State: 4

    2004-03-03 19:42:32.84 spid52 The log for database 'msdb' is not available..

    2004-03-03 19:42:33.60 logon Login failed for user '***************'.

    2004-03-03 19:42:33.85 spid14 Database 'msdb' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    2004-03-03 19:42:33.85 spid14 Database 'msdb' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    2004-03-03 19:42:33.85 spid14 Database 'msdb' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    2004-03-03 19:42:33.85 spid14 Database 'msdb' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    2004-03-03 19:42:33.91 spid14 fcb::close-flush: Operating system error 64(The specified network name is no longer available.) encountered.

    2004-03-03 19:42:33.91 spid14 fcb::close-flush: Operating system error 64(The specified network name is no longer available.) encountered.

    2004-03-03 19:42:33.91 spid14 Starting up database 'msdb'.

    2004-03-03 19:42:45.92 spid14 udopen: Operating system error 53(The network path was not found.) during the creation/opening of physical device \\example\datafile\msdbdata.mdf.

    2004-03-03 19:42:45.93 spid14 FCB:: Open failed: Could not open device \\example\datafile\msdbdata.mdf for virtual device number (VDN) 1.

    2004-03-03 19:42:45.93 spid14 Device activation error. The physical file name '\\example\datafile\msdbdata.mdf' may be incorrect.

    2004-03-03 19:42:45.93 spid14 Device activation error. The physical file name '\\example\datafile\msdblog.ldf' may be incorrect.

    2004-03-03 19:42:45.93 logon Login failed for user '***************'.

    +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    I realize that these outages are difficult to avoid if hardware fails, but I would be interested in learning of other technologies that would more gracefully handle the outage (iSCSI?), methodologies to either restart the service until it is restored successfully or alert me or someone else that the service has been terminated. Any ideas, tips or examples of your implementation of this technology would be appreciated.

    Thanks...

    Nathan
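    On the alerting part of the question, a minimal sketch of a log scanner is shown here. It assumes the errorlog can be read as plain text lines in the format quoted above; the function name `scan_errorlog` and the choice of which OS error codes count as storage-related are illustrative assumptions, not anything SQL Server provides.

```python
import re

# Matches entries such as:
#   ... spid1 LogWriter: Operating system error 121(The semaphore timeout ...) encountered.
OS_ERROR = re.compile(r"Operating system error (\d+)\(")
SUSPECT = re.compile(r"Database '([^']+)' cannot be opened\. It has been marked SUSPECT")

# OS error codes that appear in the log above (assumption: treat all three
# as signs that the network storage became unreachable).
NETWORK_ERRORS = {53, 64, 121}


def scan_errorlog(lines):
    """Return (storage-related OS error codes, suspect database names)."""
    codes = set()
    suspects = set()
    for line in lines:
        match = OS_ERROR.search(line)
        if match and int(match.group(1)) in NETWORK_ERRORS:
            codes.add(int(match.group(1)))
        match = SUSPECT.search(line)
        if match:
            suspects.add(match.group(1))
    return sorted(codes), sorted(suspects)


# Three lines taken from the log excerpt in this post.
sample = [
    "2004-03-03 19:42:32.65 spid1 LogWriter: Operating system error 121(The semaphore timeout period has expired.) encountered.",
    "2004-03-03 19:42:33.85 spid14 Database 'msdb' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.",
    r"2004-03-03 19:42:45.92 spid14 udopen: Operating system error 53(The network path was not found.) during the creation/opening of physical device \\example\datafile\msdbdata.mdf.",
]

codes, suspects = scan_errorlog(sample)
print(codes, suspects)
```

    A scheduled job could run such a scan and send mail or a page when either list is non-empty; how the alert is delivered is left open here.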

  • What method are you using to attach the SQL 2000 server: VLD or plain CIFS?

    From this log it seems that your network connection to the NetApp isn't up. Have you determined the root cause of the problem?  Have you implemented port aggregation, etc., to make sure you have redundant paths to the filer?

    iSCSI would give you the same problem, because the LUN would be missing and you'd still end up in a heap o' trouble.  The same goes for any other SAN implementation if the fiber connection failed or the switch rebooted or something.

    Glad to help in any way I can (although NetApp support is pretty good with this too).

    Glenn Dekhayser

    Voyant Strategies (Netapp partner)

  • Though I don't use "network storage" technology, haven't I read in other posts here on SSC that SAN was the preferred technology over NAS, with a link to Microsoft's web site indicating so? I don't want to debate pros and cons, I'm just trying to help.

  • Fellas,

    That is exactly what happened. The network datacenter hub took a dive and we lost connectivity to the NAS. Additionally, there have been occasions where the filer went down and connectivity was again lost. We do have redundant filers, but the interface did not fail over like it was supposed to. (There is a dedicated Gb interface to the NetApp.) I believe this is because SQL Server lost connectivity to the filer, and once it was lost it never regained it.

    On the NAS vs. SAN approach, the only article I found was this: http://www.sqlteam.com/item.asp?ItemID=128. I believe that the setup we currently have resembles a SAN more than a NAS.

    Is there any way or technology that would maintain a fluid connection to the datafiles on the NetApp during a snafu? I realize that not all outages are preventable, but to eliminate some of them would be helpful. Any additional resources or advice would be greatly appreciated.

    Thanks...

    Nathan

  • Well, if you used iSCSI, I guess there's a way to adjust the timeout values so that SQL Server will just hang until the storage comes back up, which is obviously preferable to a 'suspect' db.  I'd go down that route.  Actually, I'm pretty sure that CIFS has one too, somewhere in the registry on the servers.  Then I'd also look at the SQL Server settings.

    Yeah, the more I think about it, the timeout values are where you will win here.  Although, if it goes down for a longer period of time, you'd want that functionality.

    You have cluster failover on the NetApps and you're saying it didn't work for you?  I would definitely get support on the phone.  Something's not right there. It should fail over very quickly, but you have to have things set up just right.

    Good luck
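    The timeout settings mentioned in this post are not identified anywhere in the thread. A separate approach to the original question (restart the service once storage is back) is sketched here: poll the UNC path, then start the service. The service name `MSSQLSERVER`, the retry counts, and the helper names are assumptions for illustration, and this does not repair a database that has already been marked suspect.

```python
import os
import subprocess
import time


def wait_for_path(path, attempts=10, delay=30.0,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll until `path` is reachable.

    Returns True as soon as the path exists, or False once `attempts`
    checks have failed. `exists` and `sleep` are injectable for testing.
    """
    for attempt in range(attempts):
        if exists(path):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next check
    return False


def start_sql_server(service="MSSQLSERVER"):
    """Windows only: start the service with `net start`.

    The default service name is an assumption (default instance).
    Returns True when `net start` exits with code 0.
    """
    return subprocess.call(["net", "start", service]) == 0


# Intended use on the SQL Server host (not executed here):
#
#   if wait_for_path(r"\\example\datafile"):
#       start_sql_server()
#   else:
#       ...  # storage still unreachable: alert an operator
```

    Bounding the number of attempts is deliberate: for a long outage it seems better to give up and alert someone than to wait indefinitely.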

  • Do you have any idea where this setting would be?

    I think the SQL Server was disrupted for a brief moment, probably while doing a backup, and therefore noticed the loss of connectivity. Unless you are suggesting that it should be transparent to SQL Server?

    Thanks...

    nathan

  • Nathan,

    Sorry to be the bearer of bad news but all I can really suggest is start looking for another job unless you want to be constantly recovering suspect databases.

    I worked with a SQL Server 2000 installation on a NetApp NAS and it was the worst job in my career.  We had exactly the same problem as you have described, with the databases being marked suspect every few days for no apparent reason.  NetApp Asia Pacific tried to help, then passed it on to NetApp USA (I'm in Australia, BTW), and it still wasn't resolved by the time I resigned.  From what NetApp could determine, there was a network problem between the filer and the SQL box. (Duh!) Whenever there was a slight network glitch, SQL Server couldn't locate the database files on the filer and marked them suspect.  Occasionally running sp_resetstatus would fix the problem, but most times I blew the db away and restored it from backup.

    Which is another thing to be aware of.  Snapshot backups from the filer do not work for SQL Server databases.  I hope you are doing normal SQL Server backups instead.

    Good luck, I think you may need it.

    Angela

  • We are experiencing the exact same problem with our SQL Server 2k backing onto a NetApp filer.

    Whenever we reboot our SQL Server box (Windows 2003, SQL 2k with SP3a), the databases which are stored on the NetApp box are marked as suspect.

    If we re-attach the databases, all is restored, but this is not an acceptable solution for us. We would much prefer an option to do this automatically.

    Other than writing scripts to automatically re-attach the databases, are there any other options available? And can SQL Server Agent alert on suspect databases?

    In regards to changing the timeout value for SQL Server, where is this value kept, and how can it be edited?

    Any help would be much appreciated,

    Regards,
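    On detecting suspect databases: SQL Server Agent alerts can be defined on an error number or severity level, and the `DATABASEPROPERTYEX` function reports a `Status` of `SUSPECT` for such databases. A minimal sketch of a check that a scheduled script could run is shown here. It assumes a Python DB-API cursor connected to the server (the driver and connection are not shown), and `FakeCursor` is only a stand-in so the example runs without a server.

```python
# Query against the SQL Server 2000 system table; DATABASEPROPERTYEX
# returns 'SUSPECT' in the Status property for suspect databases.
SUSPECT_QUERY = (
    "SELECT name FROM master.dbo.sysdatabases "
    "WHERE DATABASEPROPERTYEX(name, 'Status') = 'SUSPECT'"
)


def find_suspect_databases(cursor):
    """Run the status query on a DB-API cursor and return database names."""
    cursor.execute(SUSPECT_QUERY)
    return [row[0] for row in cursor.fetchall()]


class FakeCursor:
    """Stand-in for a real DB-API cursor (illustration only)."""

    def __init__(self, rows):
        self._rows = rows
        self.executed = None

    def execute(self, sql):
        self.executed = sql  # remember what was asked

    def fetchall(self):
        return self._rows


cursor = FakeCursor([("msdb",), ("sales",)])
print(find_suspect_databases(cursor))
```

    With a real connection, a non-empty result could trigger a notification or the re-attach script mentioned above.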

  • Chris,

    Are you connecting via UNC with the trace flag -T1807 set at startup? Also, be sure that your SQL Service startup account is the same account used to map the drive for SQL Server data storage. I am still not sure where this timeout value is for SQL Server, but I would be interested in hearing any suggestions on how to improve the handling of network disconnects.

    Thanks...

    Nathan

  • Just to update everyone

    We had a consultant from NetApp onsite yesterday, and he suggested the following fix:

    - To make SQL Server wait for iSCSI to start, you need to edit this key:

    - HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSSQLSERVER

    - Use Regedt32 (because that's the only one which lets you define REG_MULTI_SZ values) and add this value:

    - DependOnService REG_MULTI_SZ MSiSCSI

    - If you now look at the service in the MMC and open the Dependencies tab, you should see iSCSI listed
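    The registry steps above can also be scripted. A minimal sketch using Python's standard `winreg` module is shown here; the helper names are hypothetical, the key path assumes the default `MSSQLSERVER` instance as in the steps above, and the script would need to run with administrative rights on the SQL Server host. Existing dependencies are kept rather than overwritten.

```python
SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\MSSQLSERVER"


def merge_dependency(existing, service):
    """Return the dependency list with `service` added once.

    Service names are compared case-insensitively, so an entry that is
    already present is not duplicated.
    """
    if any(name.lower() == service.lower() for name in existing):
        return list(existing)
    return list(existing) + [service]


def add_service_dependency(service="MSiSCSI", key_path=SERVICE_KEY):
    """Windows only: add `service` to the DependOnService value."""
    import winreg  # standard library, available only on Windows

    access = winreg.KEY_QUERY_VALUE | winreg.KEY_SET_VALUE
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, access) as key:
        try:
            current, _value_type = winreg.QueryValueEx(key, "DependOnService")
        except FileNotFoundError:
            current = []  # the value has not been created yet
        updated = merge_dependency(current or [], service)
        winreg.SetValueEx(key, "DependOnService", 0,
                          winreg.REG_MULTI_SZ, updated)
    return updated


# Pure-Python part, runnable anywhere:
print(merge_dependency(["LanmanWorkstation"], "MSiSCSI"))
```

    The merge step is kept separate from the registry write so it can be checked without touching a real machine.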

    We have not tried this fix yet, but hope to do so this afternoon. I just wanted to post it here and see if anyone had seen a similar fix.

    As for your questions, Nathan, I'm not sure. Are UNC and trace flag -T1807 related to NetApp or SQL Server?

    Regards,

  • Hi Chris,

    Trace flag -T1807 should be documented in your NetApp doco.  It is, AFAIK, an undocumented, unsupported trace flag that can be set in SQL Server to allow it to connect to a filer upon startup.  Microsoft isn't very helpful in this department; for anything to do with a NAS they always say to "refer to your NAS vendor".  You'll need to set this in the startup parameters for SQL Server if it's not already set.

    In my situation we had this enabled and it didn't resolve the problem.

    Just curious, are all your databases being marked as suspect or only some?  When I experienced this problem it affected random databases, and they never all went down at the same time.

    Please let us know if the fix NetApp gave you resolves the problem.

    Cheers,

    Angela

  • Hi Angela,

    We have tried the fix from NetApp but unfortunately this did not solve the problem.

    iSCSI is listed under the Dependencies tab for the MSSQLSERVER service, but the databases are still coming up as suspect.

    In our SQL Server, we have some databases stored on the NAS and some still stored locally. It is only the NAS ones which come up as (Suspect).

    I will consult with my colleague who is in charge of the NetApp filer, and we have a NetApp consultant back in on Monday, who hopefully will be able to shed some light on the problem.

    Originally we created the NAS drives local to the SQL machine, copied over the mdf and ldf files, and re-attached the databases.

    We are now looking at a product from NetApp called SnapManager for SQL, which does the conversion from locally stored databases to NAS-stored databases automatically, so maybe this will also help with the situation.

    I will keep you updated with any progress.

    Kind Regards,

    Chris

     

  • The registry fix worked for me. I had the same issue with the iSCSI initiator not being loaded before MS SQL Server came up. The dependency fixed the issue, and my suspect databases now start up when the server reboots.

  • I thought that the trace flag was for SQL Server to the NetApp via CIFS, not iSCSI.

    The dependency thing is the fix for the boot-up suspect db problem.  Makes sense: if iSCSI isn't up and started when SQL Server starts, the dbs won't be there!  So you must create a dependency.  It would be nice for the iSCSI initiator install program to ask if you're using SQL Server or Exchange over iSCSI and do it all for you.

    To the person who said to start looking for another job: time to freshen up on your infrastructure knowledge and experience.  I've put in dozens of Exchange/SQL over iSCSI installations now, even with clustering, and if something goes wrong it's almost always a config or design error, not the technology.

    Cheers,

    Glenn

  • We have SQL Server 2000 with iSCSI and tried the solution of adding dependent services, and it fixed the problem. We have also made SQL Server a dependency of Cognos Metrics Manager so that it waits for SQL Server to start.

    Box comes up fine now after reboot.

    Many thanks to Chris Scott
