Application troubleshooting (SQL Server dropping connection?)

  • We have a 24/7 type app that is having trouble.  Its a vendor supplied pile of s**t, IMO, but its on me to support it in what limited fashion I'm able.  It was originally written for Oracle, but when the contract was negotiated, we apparently demanded SQL Server for cost reasons. (The contract was finalized before I got here, but I've been here since it was installed.)

    The problem we have is the app will stop communicating with the server and sit there until someone tries to interact with the app.  At this point the app simply exits.  No errors, no dialogs, nothing.  Just closes its window.  I have no access to the source (don't really want it as I'm sure its one giant mess), but from what I've gathered so far, the app opens one connection and shares it to get at all of the data.  It opens the connection and leaves it open for the duration of the app. 

    Their claim is that its a SQL Server problem.  According to them the server drops the connection, and that's the end of it.  The only thing they can hope to do is provide better feedback to the operators so they know when the app is dead. 

    In my mind, this is all utter fucking bullshit (Pardon my french, but I've been dealing with this for a year and a half).  I'm guessing they're wanting to keep their codebase as similar as possible w/ the Oracle version, but they took the damned contract on SQL Server and I'm fed up with their shit saying they can't fix it.  I want to go to them w/ solid evidence that their code is shit, and that's where I need help.  I've run a trace, and captured damned near everything possible when "SQL Server drops the connection".  I can see that the connection was dead for about an hour and a half before the operators came in and tried to do something.  The app exited, and they restarted the app.

    Here's the odd thing and the point where my first question comes up.  I see that once the app restarted, it got the exact same PID.  Is there a way to trace when a process is dropped/exits?  Also, I have to ask, is there any validity to their claim? I may be barking up the wrong tree by looking at the PIDs, too, but I don't really know how to go about disecting the trace.  Any suggestions would be appreciated.

  • 1. You could turn on trace for a small window of time and look at what is going on.

    2. find out how many connection does this application create in the morning, afternoon and evening.  If you find that the number increase in the afternoon and you have more connection in the evening shows that the application close less connection then it create thus you might be running out of connections as set by license for your server thus it refuse to create more connection until you restart the server.

    3. If it is the problem created by the 2nd senerio, you might be able to convince your boss to stop and restart SQL Server from time to time on a schedule basis.... I know it will break your 24/7, but you won't miss much if you do it at 12:00 AM or 1:00 AM at night.  Just give it 30 seconds to restart SQL Server and it will give you a little bit of  peace.

    Good luck.

     

    mom

  • Hi Tim,

    The last time I was trying to express my opinion in the appropriate for the case expressions I was told to get some soap and go and wash my <d..y> mouth. Now when I really have to express myself I start with telling people to go and get some soap and water for me. Or use Russian. But you made your point.

    Tell me what is the application: Web, Classic Client (Desktop Client), 3 tiers, 2 tiers, you must know.

    What it is using to connect? ODBC, OLEDB, JDBC, SQLClient or other .NET, ASP Code, All Of Above ? Whwn we know the technology making the connection we will tell you what logs you may use.

    You may get some info from Current Activity window. Tell us what authentication is used. Other then that listed to what mom is saying.

     

     

     

     

    Regards,Yelena Varsha

  • The particular app in question is more of an interface into the system.  It just displays data out of the database and allows the operators to create charges and update the status of bases (Its an MMI (Man-Machine-Interface) for a Batch Annealing shop).  The processes that do the actual work don't have problems like this because they open/close their connections like good little programs.

    I'm pretty sure that performance would suck very badly if the MMI did this (opened/closed its connection) since there's so many requests coming in.

    I have a trace from a few hours before the app died through the apps restart.  I set it up to capture nearly everything possible.  Is the PID behavior in the original post normal (having the same PID assigned after restart)? 

    In the trace, i can see the app doing its things, but it just mysteriously stops talking.  I don't see any indication that SQL Server experienced a problem, and that's really want I'm looking for help on.  What do I need to enable to capture these problems?  Can I log when a particular process exits?

     

  • Its a desktop client app.  I'm 99% sure that its VC++ 6, but I'm not sure what the data access method is (ADO, ODBC, etc...).  (Can you tell from the server?)

    It opens 2 connections.  One is used to execute this:

    select substring('NY',status/1024&1+1,1) from master..sysdatabases where name=DB_NAME()

    and then never does anything else again (at least that's what it looks like from the trace).

    Let me know if you need to know anything else about it.

     

  • I just noticed something very odd... 

    I have an Audit Login event for SPID 54 (the "workhorse" process) at "2005-08-29 11:15:14.470".  This is the same SPID the workhorse had before everything died at around 9:30.

    Here's the wierd part (remember that everything started talking at 9:30AM).  Preceeding the Login, I have an Audit Logout event for Process 54.  The Time recorded is "2005-08-29 03:30:19.410".  The events before and after this have times of "2005-08-29 11:14:25.713" and "2005-08-29 11:14:35.710" respectively.  Is this somehow normal, or does this need more investigation?

  • Tim,

    Times may be OK. It depends how it sorted in the trace. Do you have trace fields starttime and endtime? For example all events in the table show StartTime : ...15:17.. for Logout,  ...15:16...for Logon,  15:17...for another Logout. Looks like 15:16 is between two 5:17s. But if you look at EndTime for 5:16 it would be 5:17. So it was sorted by EndTime.

    ODBC versus OLEDB: if you have ODBC source on the machine for the app  it is probably ODBC. If you have ODBC source on each client then connection is open using ODBC from each client. If you don't have ODBC source on each client but have ODBC source on the server then you must have a server part of the application that connects to SQL Server database. Also, if it is ONLY 1 process in Current Activity window Processes view then it is one connection for all access. ADO can use both OLEDB and ODBC

    My other suggestion is to look for the app logs that may or may not exist.  Search for the files that were modified when the application is stopped.

    Also you have mentioned re-starting the application. How do you do that?

     

     

    Regards,Yelena Varsha

  • No app logs ... I've looked. I looked a year and half ago when this ball of shit got here on site. I wish I were so lucky. Also, I wasn't capturing the End Time ... I am now. I'm running a new trace, and I'm capturing based on the PID. I think I was filtering out server errors the first time by filtering on the HostName only. Hopefully I'll catch errors that I missed the first time.

    The logout time would make sense if the start time of the event were the original login time and the end time were the actual logout event time. I'm not looking at in a table; I'm looking at in profiler, so its events are sorted by the time received.

    Its a desktop app. Just close it and open it to "restart".

    I'm guessing its ODBC? To run the app on a partcular machine, you HAVE to create a DSN on the machine that points to the server. The DSN's name is irrelevant, since it asks you to pick a DSN when you launch the app. After the DSN selection, you have to type in a login. Its worth mentioning, too, that the dialogs used are standard Windows dialogs. The DSN dialog has the File/Machine Data Source Tabs. The login screen is a standard SQL Server login as well. It has the "Use Trusted Connection" check box, even though Windows Authentication won't work. Nothing at all custom about the login code.

  • Tim

    1. It is ODBC I guess.

    2. SQL Server does not drop connetions. It is a b.. s.. as you say

    3. Them third-parties always say it is your SQL Server. I had a case when an application owner requested Microsoft Support to make determination if it was SQL Server that was producing their front end VB errors (???!!!). We had a lot of fun after a big Microsoft memo that it was their application

    4. What drops connections may be your network. We are using some indusry-leader applications of famous vendors that would notify you if the database connection is dropped and please exit and restart. Usually after my investigation I would find out we had network problems on this subnet. Or most often users get this message when they unplug their laptops withoout shutting down applications and take laptops to the meetings. Then network connection is lost. Which is expected.

    5. To prove your point you may produce your traces, any of them to prove that the database was up and accessible from this time to that time.(when the trace was running)

    6. Check Event Logs on the client computers when the connection broke. Ask your Network Support. I mean, if you are not your own Network Support as well. Reporoduce the problem by unplugging Network Cable on a workstation. Prove that other clients are working fine at the same time.

     

    Regards,Yelena Varsha

  • The network is a train wreck. The plant was built and then people started adding computers, then they started plugging them together. It has slowly evolved into the conglomeration we suffer with today. And because of the dependence the units have on the said network and by the nature of the plant (runs 24/7), there's no hope of truly fixing it. I don't think its the problem, though. We have the exact same app running from other PCs with no problems.

    The particular computer that exhibits these behaviors is on 2 networks. One is the plant ("L3") network, and the other is the PLC (the devices that control the annealing bases) network ("L1"). Both machines (server/client)are on both networks. The problematic client is a dual monitor box. On one monitor is the app in question (App1 to protect the not-so-innocent). On the other monitor is the legacy MMI application (App2). App2 talks directly to the PLCs on the L1 network. App2 also needs the L3 network for certain things, too. App1 really only needs L3, but since it co-exists with App2, both networks are connected to the box.

    Maybe the client is getting confused on the network overwhich to talk to the server? I don't really see how that's a possibility, though, since once the address is resolved, its always the same. And I seriously doubt the routing is getting that confused on the client.

    App1 is in heavy use on the problematic client. It runs without these problems on other machines, but its not used nearly as heavily on those boxen.

    I have the trace running again... Just waiting on a crash. Hopefully I'll find something. If not, I'll look at splitting the boxes apart. I seem to remember these problems when they were separate very early on, though. Its been over a year, though, so I can't say for sure. They needed to be on the same box for convenience (mousing back and forth), but I've since discovered Synergy, so it really won't be that bad if they are separated again...

    No matter what turns up, it still burns my ass that I'm going through all of this and the vendor is still blaming SQL Server. That's just retarded. Like you said in point 2 ... SQL Server doesn't just drop connections for no apparent reasons. And the fact that they keep saying that proves beyond doubt that they're a bunch of lazy, incompetent bastards.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply