Unresponsive sql server - high CPU usage

  • Preface: We are running SQL 2005 x64 Standard edition version 9.0.4035 on Windows Server 2003. SQL is the only thing running on that server but it hosts many databases including Microsoft CRM, SharePoint, and a host of other databases that are used with internal applications. We also have some users who run ad-hoc queries against the database.

    Earlier today right at 5pm the database server became unresponsive. I couldn’t connect through management studio and all of our applications were timing out. I asked the usual suspects that run ad-hoc queries that usually hurt the server performance if they were running any queries, but they were not.

    I logged into the server and saw that the SQL service was using up around 70% of the CPU. I couldn’t connect to the SQL server through management studio while logged into the server, either. I checked the memory and network through task manager and didn’t see anything out of the ordinary. I did not check disk I/O though.

    Whatever had a hold of the SQL server wasn’t letting go, so we decided to restart the SQL service. I went into the server manager and attempted to restart the service. It froze, and the SQL instance was still chugging along at around 70% CPU usage. Whatever had a hold of the instance REALLY didn’t want to let go. I had to go end the service task through task manager which worked to shut it down; it started back up just fine after that.

    I looked through the logs and found absolutely nothing of value. Neither the SQL logs nor the server logs provided any useful information. There are no jobs that run at 5pm so that can’t be it. This has never happened in production before and there haven’t been any major changes recently.

    So here are my questions:

    1. Has anyone had an issue like this before? Did you find the root cause? If so, what was it? If not, were you able to find out anything at all about what caused it?

    2. If the server is hosed up like that, is there any way to connect at all or do any kind of troubleshooting while it’s happening? I was thinking I could have turned off tcp/ip and then tried to connect locally but that would have required a service restart which didn’t work anyway.

    3. If there was some weird query that was the culprit, is there any way to limit the amount of resources one query can utilize at once?

    Thank you for your help in advance. I am a beginner DBA and I appreciate any help you guys and gals can give me! Also please let me know if I can provide any more information that may be useful in figuring out what happened.

    -nick

  • Do you have DAC enabled on the server? If not, you may want to enable it; if you do, good. If this happens again, you may want to try cnnecting using the DAC. I have tested the DAC connections on my servers, but I'd have to check Books Online to see how to connect to it again. This connection my allow you to run light weight queries to determine what may be going on with SQL Server when it becomes unresponsive.

    Beyond that, I got nothing for you at the moment.

  • Probably a long shot, but the only time I've seen anything like this before was when a heavily used stored procedure was rewritten to use temporary tables.

    The stored procedure was HUGE. The change to temporary tables was causing the procedure to recompile every few statements, and as the procedure was heavily used, the CPU ended up at 100%, doing nothing but compiling, and the whole server became unresponsive.

  • I've had issues where tempdb has become full because of a rogue query and so SSMS wont respond and i've used sqlcmd to connect and kill the query. might be worth a try next time?

  • Ok so this issue is happening again RIGHT NOW. I am unable to connect through DAC - it says timeout expired. There is very low disk I/O but cpu is hovering around 70-80%. Any ideas on how to troubleshoot this???

  • The DAC times out?

    On a local connection, that is surprising. I think you might want to reboot the server, and if this happens again, perhaps hold a DAC connection permanently open on the server.

  • Ok so it finally stopped - I let it run until it finished. Looking at the reports I have that show queries with high CPU usage, the offending query is obvious. Problem is, there's no way this ran the last time the server locked up with high CPU. I'm re-running the exact same job to see if it'll do the same high CPU usage lockup, but I doubt it.

  • Steve Jones - Editor (7/1/2009)


    The DAC times out?

    On a local connection, that is surprising. I think you might want to reboot the server, and if this happens again, perhaps hold a DAC connection permanently open on the server.

    Yeah, DAC timed out. I didn't try to connect using DAC locally though... darn it I, should have tried that. I'm definitely opening a permanent DAC connection on the server though - that is a good idea.

  • I have experienced SQL 2005 64bit Standard Edition doing a similar thing as you describe especially when our data analysts are crunching very large datasets. The results are; All connectivity to the server was lost, even the dedicated admin connection couldn't connect nor could the SQL Server service be stopped manually.

    After much umming and ermming this was pinned down to the way SQL 2005 64bit standard edition and the underlying o/s handles available RAM. Basically, if the o/s is under memory pressure it can page out the SQL Server to virtual memory, aka the paging file. Not exactly an efficient way to run SQL Server as you can understand.

    I may well be barking up the wrong tree, but double check your windows event logs. You may find something similar to the below in the event log.

    "A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): 13128, committed (KB): 6239692, memory utilization: 0%%."

    The SQL Server Errolog also usually has something along the lines of "Non-yielding IOCP Listener.. bla, bla.."

    This support article documents the problem: http://support.microsoft.com/kb/918483.

    This link discusses it.

    http://groups.google.co.uk/group/microsoft.public.sqlserver.server/browse_frm/thread/4b03cc132e0550e3/a1171fc3bcbd4e43?lnk=raot&hl=en#a1171fc3bcbd4e43

    One of the solutions is to Lock Pages in memory, don't bother with the Standard Editiom - this only works for the Enterprise Edition so all you can really do is make sure the O/S has enough fixed memory (about 2-4GB) by limiting the RAM SQL Server can use (sounds counter-intuitive but it worked for me). This will not entirely fix the issue but I've not had a re-occurrence.

  • Richard Bradford (7/2/2009)


    I have experienced SQL 2005 64bit Standard Edition doing a similar thing as you describe especially when our data analysts are crunching very large datasets. The results are; All connectivity to the server was lost, even the dedicated admin connection couldn't connect nor could the SQL Server service be stopped manually.

    After much umming and ermming this was pinned down to the way SQL 2005 64bit standard edition and the underlying o/s handles available RAM. Basically, if the o/s is under memory pressure it can page out the SQL Server to virtual memory, aka the paging file. Not exactly an efficient way to run SQL Server as you can understand.

    I may well be barking up the wrong tree, but double check your windows event logs. You may find something similar to the below in the event log.

    "A significant part of sql server process memory has been paged out. This may result in a performance degradation. Duration: 0 seconds. Working set (KB): 13128, committed (KB): 6239692, memory utilization: 0%%."

    The SQL Server Errolog also usually has something along the lines of "Non-yielding IOCP Listener.. bla, bla.."

    This support article documents the problem: http://support.microsoft.com/kb/918483.

    This link discusses it.

    http://groups.google.co.uk/group/microsoft.public.sqlserver.server/browse_frm/thread/4b03cc132e0550e3/a1171fc3bcbd4e43?lnk=raot&hl=en#a1171fc3bcbd4e43

    One of the solutions is to Lock Pages in memory, don't bother with the Standard Editiom - this only works for the Enterprise Edition so all you can really do is make sure the O/S has enough fixed memory (about 2-4GB) by limiting the RAM SQL Server can use (sounds counter-intuitive but it worked for me). This will not entirely fix the issue but I've not had a re-occurrence.

    Thank you for this post - I set the maximum on the sql server memory usage to 3.5 gigs and we're going to put more physical memory in the box tonight. Right now it only has 4 gigs and we're going to put another 8 gigs in there for a total of 12. Hopefully this solves the issue!

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply