CPU Usage Reported By Profiler

  • I am trying to track down odd resource utilization on one of my SQL Server 2005 machines, and I am starting to question some of the data I am getting from Profiler. Here's a very high-level overview of my issue.

    The issue:

    I have a SQL 2005 box with 125 DBs and many apps that connect to it. After a server restart, CPU utilization is very low. In fact, this is a VERY beefy box, so I wouldn't expect actual CPU use to be very high. Also, this is not a primary OLTP server, so there are not nearly as many transactions as 125 DBs might normally imply. Anyway, the server will hum along for one week, maybe two, averaging about 10% total CPU use. Eventually, "something" happens in my environment and the server begins to average 40 to 80% CPU use. After this "something" has occurred, CPU averages remain high until the sqlserver.exe service is restarted. Even in the middle of the night it will average 40% CPU use with no one even on it.

    Troubleshooting:

    Weeks ago I started capturing baseline Profiler traces...and problematic traces. My problem is I have never been able to find a smoking gun. Over similar time periods, the high-CPU trace is VERY similar to the baselines, within roughly ±15% in average statement counts, etc. Never being able to find something obvious, I figured I might be suffering from a bug of sorts, so I got the machine up to date with SP2 for Windows Server 2003 and SP3 for SQL 2005.

    Unfortunately, my issue still exists, so I am back to Profiler...which brings me to the actual point of this post (sorry it has been so drawn out). I have always been under the impression that I should be able to capture all RPC:Completed events (yes, I know I will miss some ad hoc SQL, but there should be very little of this; I have a couple of apps that don't always use procs but at least use sp_executesql from the client) and that the SUM of the CPU column should essentially equal what the OS is reporting for sqlserver.exe. I know it wouldn't be exact...but close, right?

    In my case it is not close at all. And I have lots of records where CPU use is zero. Should I be looking at this differently? Is there a better method for reconciling Profiler CPU use with the OS stats so I can figure out what this SQL Server is wasting its time doing when things go awry?
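
    In case it helps, this is roughly the kind of rollup I mean when I say I'm summing the trace CPU (just a sketch - the trace file path is made up, and the event class list is only the two event types I mentioned):

    -- Sketch: total the Profiler-captured CPU per 5-minute bucket so it can be lined up
    -- against Perfmon's "% Processor Time" for the SQL Server process.
    -- 'C:\Traces\Baseline.trc' is an example path, not my real trace.
    SELECT DATEADD(minute, (DATEDIFF(minute, 0, StartTime) / 5) * 5, 0) AS TimeBucket,
           SUM(CPU) AS TotalCpuMs,
           COUNT(*) AS Events
    FROM sys.fn_trace_gettable('C:\Traces\Baseline.trc', DEFAULT)
    WHERE EventClass IN (10, 12)   -- 10 = RPC:Completed, 12 = SQL:BatchCompleted
    GROUP BY DATEDIFF(minute, 0, StartTime) / 5
    ORDER BY TimeBucket;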

    Thanks. RH

  • Maybe your stats are going out of date and lots of bad plans/recompiles are happening. Check how much memory you have. If there is not enough memory, then it will have to do lots of IO operations as well.

    Just my 2 cents

    -Roy

  • CPU usage within Profiler would not be a good measure of overall CPU use within the system. It's only going to give you information about the execution of the procedures that you're capturing, not any other processes that are occurring, and there could be a lot of those. I'd suggest looking up one of the articles on using wait states and queues as a performance metric. These let you identify what is causing a slowdown. From there you can determine which queries are causing it.
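
    If it helps, a quick starting point on the waits side would be something like this (just a sketch; the excluded wait types are only a few of the obvious idle ones, not an exhaustive list):

    -- Sketch: top wait types accumulated since the wait stats were last cleared / the instance restarted.
    SELECT TOP 10
           wait_type,
           wait_time_ms,
           signal_wait_time_ms,
           waiting_tasks_count
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR')
    ORDER BY wait_time_ms DESC;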

    ----------------------------------------------------
    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." - Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Yes, I've considered that...let me bounce this off you. This server is Win2003 R2 32-bit Enterprise with the /3GB switch and AWE enabled, and 10GB of RAM.

    Two things I've found recently (recently...as in, I don't have a good baseline to compare to currently). The first is that my page life expectancy keeps building up...and then every so often (a few times an hour) the expectancy drops from about 1000 seconds to 30 seconds and starts climbing again. So, something is totally purging the buffer and rebuilding...not sure what, though.

    Also, I just realized that my SQLServer:Plan Cache:Cache Hit Ratio is averaging between 40 and 45 percent. I've seen a guideline indicating that anything below 70% is something to pay attention to.
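
    Just for reference, both counters can be pulled straight from the DMV instead of Perfmon (a sketch; the Plan Cache hit ratio has to be computed from the ratio/base counter pair):

    -- Sketch: the two counters mentioned above, read from inside SQL Server.
    SELECT object_name, counter_name, instance_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE (counter_name = 'Page life expectancy' AND object_name LIKE '%Buffer Manager%')
       OR (object_name LIKE '%Plan Cache%' AND instance_name = '_Total'
           AND counter_name IN ('Cache Hit Ratio', 'Cache Hit Ratio Base'));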

    I have no idea what these counters look like when the server is behaving.

    So, something is purging my buffer cache...and on my to-do list is to find out why so much of my workload is missing plans. This may be a characteristic of SharePoint workloads, as 50 of my 125 DBs are SharePoint DBs and SharePoint is easily my top consumer of DB resources on this server.

    Regarding your mention of the bad stats...my expectation is that the server would queue up a ton of auto-update-stats requests and eventually churn through them. Recall...once this server starts to misbehave it never seems to recover.
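
    If it does turn out to be a stats issue, something like this (just a sketch, run per database) would at least show how stale things are:

    -- Sketch: age of the statistics on user tables in the current database, oldest first.
    SELECT OBJECT_NAME(s.object_id) AS table_name,
           s.name AS stats_name,
           STATS_DATE(s.object_id, s.stats_id) AS last_updated
    FROM sys.stats AS s
    JOIN sys.objects AS o ON o.object_id = s.object_id
    WHERE o.type = 'U'   -- user tables only
    ORDER BY STATS_DATE(s.object_id, s.stats_id);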

    Thanks. RH

  • Grant, service levels are currently being adhered to...no queries are excessively blocking...or being blocked. My users never notice when CPU usage goes awry. Recall that the Profiler workload and its characteristics when the server is having a problem are very similar to the baseline. Although, I must admit that culling through 400K to 600K Profiler records from a 60-minute trace looking for a smoking gun is a bit tough.

    As I am typing this, my page life expectancy went from 1200 seconds to 71 in one second. Maybe all the processor use I am seeing is the system feeding the buffer pool over and over again.

    I think I need to find what is causing that buffer purge.
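
    One thing I plan to try (just a sketch) is ranking the cached statements by how much data they drag through the buffer pool:

    -- Sketch: cached statements ranked by physical reads, to spot what is churning the buffer pool.
    SELECT TOP 20
           qs.total_physical_reads,
           qs.total_logical_reads,
           qs.execution_count,
           SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1, 200) AS statement_start
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_physical_reads DESC;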

  • rhunt (10/5/2009)


    Yes, I've considered that...let me bounce this off you. This server is Win2003 R2 32-bit Enterprise with the /3GB switch and AWE enabled, and 10GB of RAM.

    Two things I've found recently (recently...as in, I don't have a good baseline to compare to currently). The first is that my page life expectancy keeps building up...and then every so often (a few times an hour) the expectancy drops from about 1000 seconds to 30 seconds and starts climbing again. So, something is totally purging the buffer and rebuilding...not sure what, though.

    Also, I just realized that my SQLServer:Plan Cache:Cache Hit Ratio is averaging between 40 and 45 percent. I've seen a guideline indicating that anything below 70% is something to pay attention to.

    I have no idea what these counters look like when the server is behaving.

    So, something is purging my buffer cache...and on my to-do list is to find out why so much of my workload is missing plans. This may be a characteristic of SharePoint workloads, as 50 of my 125 DBs are SharePoint DBs and SharePoint is easily my top consumer of DB resources on this server.

    Regarding your mention of the bad stats...my expectation is that the server would queue up a ton of auto-update-stats requests and eventually churn through them. Recall...once this server starts to misbehave it never seems to recover.

    Thanks. RH

    30 seconds of page life expectancy is still pretty good, but yeah, something is swacking your cache. I have a server that does something similar, but it's on a 24-hour cycle. It just keeps growing until about 6AM and then drops to 3 seconds and starts rebuilding again. Mine is a data load running at that time. Look for large data loads or reports... something along those lines that is moving a lot of data.
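
    One way to correlate it, if the loads run as SQL Agent jobs (just a sketch), is to line the job history up against the times your page life expectancy drops:

    -- Sketch: recent Agent job executions, newest first, to compare against the PLE drops.
    SELECT j.name,
           h.run_date,
           h.run_time,
           h.run_duration
    FROM msdb.dbo.sysjobhistory AS h
    JOIN msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
    WHERE h.step_id = 0   -- job outcome rows only
    ORDER BY h.run_date DESC, h.run_time DESC;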

    A cache hit ratio that low means almost nothing coming in is getting plan reuse. I haven't watched our SharePoint DB to see if this is normal or not. Without knowing more, I'd say that's an indicator of a problem, but I'd need to test it to be sure.

    From the sounds of it, again, I'd go after waits & queues to identify what is slowing down the system.

    ----------------------------------------------------
    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood..." - Theodore Roosevelt
    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Owing to the random nature of the problem and the fact that RPC:Completed isn't showing anything (the bad thing has to actually complete for RPC:Completed to capture it), it sounds like one of two things...

    Either a user is asking for something very large and ungainly, or a particular task has an accidental cross join in it.

    The first thing to do is determine if it's actually SQL Server or SharePoint or something else. Sounds stupid simple, but the first thing I look at is Task Manager, particularly the Processes tab. Follow your nose after that.
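
    From inside SQL Server, a quick snapshot along these lines (just a sketch) will also show whether the CPU is attributable to requests that are actually executing:

    -- Sketch: currently executing requests with their accumulated CPU, busiest first.
    SELECT r.session_id,
           r.status,
           r.cpu_time,
           r.total_elapsed_time,
           r.command,
           st.text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
    WHERE r.session_id > 50   -- skip system sessions
    ORDER BY r.cpu_time DESC;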

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • It's all sqlserver.exe. This is a dedicated SQL box, with a couple of SharePoint boxes (and other IIS boxes) connecting over the network.

    Thanks. RH

  • We had a similar problem last year with CPU usage - when the CPU usage was high we found that Windows would randomly complain of 'insufficient resources' when performing simple tasks such as network file transfers. The CPU usage would be fine for a couple of weeks, then suddenly ramp up with no indication as to the cause. The OS is Windows 2003 Server R2 Enterprise 32-bit.

    When we removed the /3GB switch and used just /PAE the problem went away and hasn't occurred since. I can't explain why this should be the case, but this fixed the problem that we were experiencing.

    Chris

  • Chris Howarth-536003 (10/7/2009)


    We had a similar problem last year with CPU usage - when the CPU usage was high we found that Windows would randomly complain of 'insufficient resources' when performing simple tasks such as network file transfers. The CPU usage would be fine for a couple of weeks, then suddenly ramp up with no indication as to the cause. The OS is Windows 2003 Server R2 Enterprise 32-bit.

    When we removed the /3GB switch and used just /PAE the problem went away and hasn't occurred since. I can't explain why this should be the case, but this fixed the problem that we were experiencing.

    Chris

    BWAA-HAA!!! Wouldn't it be ironic if the /3GB switch had a "memory leak"? 😀

    Good tip. Thanks for posting it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • You had me excited...sounded like the ticket...I just couldn't remember if we officially took the /3GB dive back when we enabled PAE and AWE. So, I checked my boot.ini, and it's looking to me like we're not using the /3GB switch.

    Boot.ini:

    [boot loader]
    timeout=30
    default=multi(0)disk(0)rdisk(0)partition(1)\WINDOWS
    [operating systems]
    multi(0)disk(0)rdisk(0)partition(1)\WINDOWS="Windows Server 2003, Enterprise" /noexecute=optout /fastdetect /PAE

    Maybe I'll do the opposite...enable it and see if the behavior changes.

  • Just to check the obvious - have you rebooted the server since removing the /3GB switch?

    Thinking back there was actually another occasion when we experienced a similar problem with elevated CPU usage with no apparent SQL Server activity, and that was after applying SP2 to SQL Server 2005. That time we logged, every couple of minutes, the VAS-region memory levels (Free 'Total' and 'Maximum') within SQL Server for a month or so and found that the CPU usage jumped when the maximum contiguous block size first fell below approximately 12MB - the CPU usage would then remain high until SQL Server was restarted.

    I'll dig out the monitoring query that we used, but you can obtain the same data by running the 'sqbmemory' extended procedure that is shipped with Red-Gate SQL Backup. [EDIT: Here's the query that we used...]

    SET NOCOUNT ON
    ;WITH CTE AS
    (
        SELECT CASE WHEN region_state = 0x00001000 THEN 'Commit'
                    WHEN region_state = 0x00002000 THEN 'Reserve'
                    WHEN region_state = 0x00010000 THEN 'Free'
               END AS 'Type',
               MIN(region_size_in_bytes) AS Minimum,
               MAX(region_size_in_bytes) AS Maximum,
               SUM(region_size_in_bytes) / COUNT(*) AS Average,
               COUNT(*) AS [Blk count],
               SUM(region_size_in_bytes) AS Total,
               CASE WHEN region_state = 0x00001000 THEN 1
                    WHEN region_state = 0x00002000 THEN 2
                    WHEN region_state = 0x00010000 THEN 3
               END AS SortOrder
        FROM sys.dm_os_virtual_address_dump
        GROUP BY region_state
        UNION ALL
        SELECT CASE WHEN region_type = 0x00020000 THEN 'Private'
                    WHEN region_type = 0x00040000 THEN 'Mapped'
                    WHEN region_type = 0x01000000 THEN 'Image'
               END AS 'Type',
               MIN(region_size_in_bytes) AS Minimum,
               MAX(region_size_in_bytes) AS Maximum,
               SUM(region_size_in_bytes) / COUNT(*) AS Average,
               COUNT(*) AS [Blk count],
               SUM(region_size_in_bytes) AS Total,
               CASE WHEN region_type = 0x00020000 THEN 4
                    WHEN region_type = 0x00040000 THEN 5
                    WHEN region_type = 0x01000000 THEN 6
               END
        FROM sys.dm_os_virtual_address_dump
        WHERE region_allocation_base_address <> 0x0
        GROUP BY region_type
    )
    SELECT Type, Minimum, Maximum, Average, [Blk count], Total
    FROM CTE
    ORDER BY SortOrder
    GO

    It would be good if you could run the above query when your server is showing the symptoms, then post the resultset back to the forum for examination.

    In our case the problem turned out to be a memory leak associated with the sp_OAMethod and the sp_OAGetProperty extended procedures which were being called every few minutes by, ironically enough, our 3rd-party monitoring software. After raising a call with MS, they suggested a fix of applying SP3 CU1, which resolved the problem. Here's a link to the KB article describing this particular issue:

    http://support.microsoft.com/kb/937277
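
    If it helps, the exact build you're on can be confirmed with something like this (just a sketch):

    -- Sketch: confirm the exact build/patch level, to compare against the build the KB says is fixed.
    SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
           SERVERPROPERTY('ProductLevel') AS ProductLevel,
           SERVERPROPERTY('Edition') AS Edition;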

    There are other causes of VAS fragmentation, such as the use of Linked Servers, and the use of sp_xml_preparedocument without a corresponding sp_xml_removedocument, to name a couple. Also, do you have CLR enabled?
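
    (That last one is quick to check - just a sketch:)

    -- Sketch: check whether CLR integration is enabled on the instance.
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name = 'clr enabled';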

    Cheers

    Chris

  • Try running the following

    select name, SUM(single_pages_kb + multi_pages_kb)/1024.0 MBUsed
    from sys.dm_os_memory_clerks
    group by name
    order by 2 desc

    with particular emphasis on the figures for TokenAndPermUserStore, SQL Plans and Object Plans.

  • Eoin (10/8/2009)


    Try running the following

    select name, SUM(single_pages_kb + multi_pages_kb)/1024.0 MBUsed
    from sys.dm_os_memory_clerks
    group by name
    order by 2 desc

    with particular emphasis on the figures for TokenAndPermUserStore, SQL Plans and Object Plans.

    There were 300 items. Here are the top 10:

    NAME                      MB Used
    Default                   55.578125
    SOS_Node                  48.312500
    SchemaMgr Store           12.093750
    SNIPacket                  5.664062
    Bound Trees                2.554687
    TokenAndPermUserStore      2.304687
    SystemRowsetStore          1.093750
    Lock Manager : Node 0      1.093750
    Broker dormant rowsets     0.859375
    Object Plans               0.601562

  • Chris Howarth-536003 (10/7/2009)


    Just to check the obvious - have you rebooted the server since removing the /3GB switch?

    Thinking back there was actually another occasion when we experienced a similar problem with elevated CPU usage with no apparent SQL Server activity, and that was after applying SP2 to SQL Server 2005. That time we logged, every couple of minutes, the VAS-region memory levels (Free 'Total' and 'Maximum') within SQL Server for a month or so and found that the CPU usage jumped when the maximum contiguous block size first fell below approximately 12MB - the CPU usage would then remain high until SQL Server was restarted.

    I'll dig out the monitoring query that we used, but you can obtain the same data by running the 'sqbmemory' extended procedure that is shipped with Red-Gate SQL Backup. [EDIT: Here's the query that we used...]

    SET NOCOUNT ON
    ;WITH CTE AS
    (
        SELECT CASE WHEN region_state = 0x00001000 THEN 'Commit'
                    WHEN region_state = 0x00002000 THEN 'Reserve'
                    WHEN region_state = 0x00010000 THEN 'Free'
               END AS 'Type',
               MIN(region_size_in_bytes) AS Minimum,
               MAX(region_size_in_bytes) AS Maximum,
               SUM(region_size_in_bytes) / COUNT(*) AS Average,
               COUNT(*) AS [Blk count],
               SUM(region_size_in_bytes) AS Total,
               CASE WHEN region_state = 0x00001000 THEN 1
                    WHEN region_state = 0x00002000 THEN 2
                    WHEN region_state = 0x00010000 THEN 3
               END AS SortOrder
        FROM sys.dm_os_virtual_address_dump
        GROUP BY region_state
        UNION ALL
        SELECT CASE WHEN region_type = 0x00020000 THEN 'Private'
                    WHEN region_type = 0x00040000 THEN 'Mapped'
                    WHEN region_type = 0x01000000 THEN 'Image'
               END AS 'Type',
               MIN(region_size_in_bytes) AS Minimum,
               MAX(region_size_in_bytes) AS Maximum,
               SUM(region_size_in_bytes) / COUNT(*) AS Average,
               COUNT(*) AS [Blk count],
               SUM(region_size_in_bytes) AS Total,
               CASE WHEN region_type = 0x00020000 THEN 4
                    WHEN region_type = 0x00040000 THEN 5
                    WHEN region_type = 0x01000000 THEN 6
               END
        FROM sys.dm_os_virtual_address_dump
        WHERE region_allocation_base_address <> 0x0
        GROUP BY region_type
    )
    SELECT Type, Minimum, Maximum, Average, [Blk count], Total
    FROM CTE
    ORDER BY SortOrder
    GO

    It would be good if you could run the above query when your server is showing the symptoms, then post the resultset back to the forum for examination.

    In our case the problem turned out to be a memory leak associated with the sp_OAMethod and the sp_OAGetProperty extended procedures which were being called every few minutes by, ironically enough, our 3rd-party monitoring software. After raising a call with MS, they suggested a fix of applying SP3 CU1, which resolved the problem. Here's a link to the KB article describing this particular issue:

    http://support.microsoft.com/kb/937277

    There are other causes of VAS fragmentation, such as the use of Linked Servers, and the use of sp_xml_preparedocument without a corresponding sp_xml_removedocument, to name a couple. Also, do you have CLR enabled?

    Cheers

    Chris

    Thanks. I'm actually not at all familiar with VAS fragmentation so I'll have to research the topic in general. Thanks for the tip. Answers to some of your questions:

    1) We're not using any extended stored procedures in our custom code. However, I'm not sure if SharePoint makes use of them...I'd have to research that.

    2) We're on SP3 (original...no post-SP3 CUs)

    3) We do make use of linked servers. The linked server with the most activity is to our AS400. SSIS does a lot of work across that link each night.

    4) Turns out the /3GB switch was never enabled on this server...just PAE and AWE.

    5) Here are the results of your query...although I'm not entirely sure how I should use these results:

    TYPE      MIN     MAX         AVG       Blk Count   Total
    Commit    4096    84672512    48258     10609       511979520
    Reserve   4096    4157440     183412    8179        1500131328
    Free      4096    3145728     224018    604         135307264
    Private   4096    84672512    105801    18208       1926438912
    Mapped    4096    1060864     107751    62          6680576
    Image     4096    24948736    141561    558         78991360
