Performance problems on Live server

  • We have some very strange performance problems and we are struggling to get a handle on what might be wrong.

    We took a backup of the live databases and restored them on our test server. We developed some new code and ran it against the databases on the test server - no problems. We then implemented the code in Live and started having problems. Examples include:

    - steps which take a few minutes on Test take hours on Live (if we have the patience to let them complete - we gave up on one step after it had been running for 10 hours, compared with a few minutes on Test)

    - a step which took a few seconds to complete one day took 3 minutes when rerun a day later.

    Both Live and Test servers have a very similar set-up: RAID 5 array for the data; 2 GB RAM on Test vs 6 GB on Live (AWE was enabled on Live but we have turned that off for the present to see if it was causing any problems); etc.

    The Live server gets to a step and we see the CPU usage max out and stay there. Using sp_who2 we see the process gobbling up CPU but little or no I/O activity.

    I accept that we could tune our queries some more, but the fact remains that the queries run fine (within acceptable time frames) on Test (which is a restore of Live) but don't run well on the Live server. I don't believe that the code or the amount of data (a 20-million-row table joined with two other tables of 1 million rows each) is the problem, as it runs on the test server.

    What should we be looking at?

    Jez

  • Update.

    I have split a proc with 6 steps into individual procs and am still getting similar problems. However, if I copy the SQL statement out of the offending proc and execute it in a Query Analyzer window, it runs in 2 seconds.

    Every time I go back and execute the proc containing this statement, it takes a lot longer (I normally kill it after 5 minutes).

    Is there anything different about running a query in QA rather than within an SP?

    Jez

  • I've seen similar issues and struggled with them. For some reason the query optimizer runs things differently in a proc at times.

    If possible, it helps to clear the plan that's cached for the sproc. A WITH RECOMPILE might help, but then you're compiling on every execution.

    Can you post a little code or let us know what the bad proc does?

    Also, are you sure the indexes are up to date, as well as the stats? I'm assuming the data sizes are the same in test and prod? Do you send a wide variety of parameters to the proc?
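
    For instance - a minimal sketch, assuming a proc called usp_process_revenue (hypothetical name):

    -- Flag the proc so its cached plan is discarded and rebuilt on next use
    EXEC sp_recompile 'usp_process_revenue'

    -- Or force a fresh plan for a single call (trades plan reuse for plan accuracy)
    EXEC usp_process_revenue @uni_location_id = 104 WITH RECOMPILE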

  • This is the code that was causing us problems this morning:

    insert into account_revenue
    (
        Account_id,
        balance_date,
        revenue_type_code,
        no_of_trans,
        discount_value,
        revenue_at_standard_tariff,
        actual_revenue,
        ytd_actual_revenue,
        revenue_ccy_id,
        ytd_discount_value,
        ytd_revenue_at_standard_tariff,
        ytd_revenue_ccy_id,
        revenue_type_def_id,
        input_batchid
    )
    select
        previous_month.uni_account_id,
        @current_processing_date,
        previous_month.uni_revenue_type,
        0 as no_of_trans,
        0 as discount_value,
        0 as revenue_at_standard_tariff,
        0 as actual_revenue,
        case when datepart(month, @current_processing_date) = 1 -- if January then YTD Revenue = 0
            then 0
            else previous_month.YTD_Actual_Revenue
        end as YTD_Actual_Revenue,
        previous_month.uni_act_rev_ccy,
        0 as ytd_discount_value,
        0 as ytd_revenue_at_standard_tariff,
        previous_month.Uni_YTD_Act_Rev_CCy, -- v2.0
        15, -- not sure what this should be?!
        previous_month.batchid
    from wealth_lk..mer_sbi_batch_revenue previous_month
    join account account
        on account.account_id = previous_month.uni_account_id
        and account.location_id = @uni_location_id
        and account.account_id < 30000000 -- exclude adjustments
    left join wealth_lk..mer_sbi_batch_revenue this_month
        on this_month.uni_account_id = previous_month.uni_account_id
        and this_month.uni_revenue_type = previous_month.uni_revenue_type
        and this_month.month_date = @current_processing_date
    where previous_month.month_date = @previous_month_end
        and this_month.uni_account_id is null

    This is part of the execution plan:

    StmtText
    declare @current_processing_date datetime, @uni_location_id int, @previous_month_end datetime
    set @current_processing_date = '2007-10-31'
    set @uni_location_id = 104
    set @previous_month_end = '2007-09-30'
    insert into account_revenue ( Account_id, balance_date, revenue_type_code, no_of_trans, discount_value, revenue_at_standard_tariff, actual_revenue, ytd_actual_revenue, revenue_ccy_id, ytd_discount_value, ytd_revenue_at_standard_tariff, ytd_revenue_ccy_id, revenue_type_def_id, input_batchid ) select previous_month.uni_account_id, @current_processing_date, previous_month.uni_revenue_type, 0 as no_of_trans, 0 as discount_value, 0 as revenue_at_standard_tariff, 0 as actual_revenue, case when datepart(month,@current_processing_date)= 1 --If JANUARY then YTD Revenue = 0 then 0 else previous_month.YTD_Actual_Revenue end as YTD_Actual_Revenue, previous_month.uni_act_rev_ccy, 0 as ytd_discount_value, 0 as ytd_revenue_at_standard_tariff, previous_month.Uni_YTD_Act_Rev_CCy, --v2.0 15, --not sure what this should be?! previous_month.batchid from wealth_lk..mer_sbi_batch_revenue previous_month join acco
    |--Assert(WHERE:(If (NOT([Pass1027]) AND ([Expr1026] IS NULL)) then 0 else If (NOT([Pass1029]) AND ([Expr1028] IS NULL)) then 1 else If (NOT([Pass1031]) AND ([Expr1030] IS NULL)) then 2 else If (NOT([Pass1033]) AND ([Expr1032] IS NULL)) then 3 else NULL))
    |--Nested Loops(Left Semi Join, WHERE:(([previous_month].[Uni_Account_Id] IS NULL) OR ([@current_processing_date] IS NULL))OUTER REFERENCES:([previous_month].[Uni_Account_Id]), DEFINE:([Expr1032] = [PROBE VALUE]))
    |--Nested Loops(Left Semi Join, WHERE:(([previous_month].[Uni_Revenue_Type] IS NULL) OR (15 IS NULL))OUTER REFERENCES:([previous_month].[Uni_Revenue_Type], 15), DEFINE:([Expr1030] = [PROBE VALUE]))
    | |--Nested Loops(Left Semi Join, WHERE:([previous_month].[Uni_YTD_Act_Rev_CCy] IS NULL)OUTER REFERENCES:([previous_month].[Uni_YTD_Act_Rev_CCy]), DEFINE:([Expr1028] = [PROBE VALUE]))
    | | |--Nested Loops(Left Semi Join, WHERE:([previous_month].[Uni_Act_Rev_Ccy] IS NULL)OUTER REFERENCES:([previous_month].[Uni_Act_Rev_Ccy]), DEFINE:([Expr1026] = [PROBE VALUE]))
    | | | |--Table Insert(OBJECT:([WEALTH_DW].[dbo].[Account_Revenue]), SET:([Account_Revenue].[YTD_Revenue_at_Standard_Tariff]=RaiseIfNull(0.00), [Account_Revenue].[timestamp]=[Expr1013], [Account_Revenue].[Userid]=RaiseIfNull(user_id(NULL)), [Account_Revenue].[Last_Changed_Date]=RaiseIfNull(getdate()), [Account_Revenue].[Input_batchID]=[previous_month].[BatchId], [Account_Revenue].[Revenue_Type_Def_Id]=RaiseIfNull(15), [Account_Revenue].[YTD_Revenue_Ccy_Id]=[previous_month].[Uni_YTD_Act_Rev_CCy], [Account_Revenue].[YTD_Discount_Value]=0.00, [Account_Revenue].[Revenue_Ccy_Id]=RaiseIfNull([previous_month].[Uni_Act_Rev_Ccy]), [Account_Revenue].[YTD_Actual_Revenue]=[Expr1007], [Account_Revenue].[Actual_Revenue]=0.00, [Account_Revenue].[Revenue_at_Standard_Tariff]=0.00, [Account_Revenue].[Discount_Value]=0.00, [Account_Revenue].[No_Of_Trans]=0, [Account_Revenue].[Revenue_Type_Code]=[previous_month].[Uni_Revenue_Type], [Account_Revenue].[Balance_Date]=RaiseIfNull([@current_processing_date]), [Acc
    | | | | |--Top(ROWCOUNT est 0)
    | | | | |--Compute Scalar(DEFINE:([Expr1007]=If (datepart(month, [@current_processing_date])=1) then 0.00 else [previous_month].[YTD_Actual_Revenue], [Expr1022]=getdate(), [Expr1024]=user_id(NULL), [Expr1013]=gettimestamp(9)))
    | | | | |--Nested Loops(Inner Join, OUTER REFERENCES:([previous_month].[Uni_Account_Id]))
    | | | | |--Filter(WHERE:([this_month].[Uni_Account_Id]=NULL))
    | | | | | |--Merge Join(Right Outer Join, MERGE:([this_month].[Uni_Account_Id], [this_month].[Uni_Revenue_Type])=([previous_month].[Uni_Account_Id], [previous_month].[Uni_Revenue_Type]), RESIDUAL:([previous_month].[Uni_Account_Id]=[this_month].[Uni_Account_Id] AND [previous_month].[Uni_Revenue_Type]=[this_month].[Uni_Revenue_Type]))
    | | | | | |--Clustered Index Seek(OBJECT:([WEALTH_LK].[dbo].[Mer_Sbi_Batch_Revenue].[PK_Mer_Sbi_Batch_Revenue] AS [this_month]), SEEK:([this_month].[Month_Date]=[@current_processing_date]) ORDERED FORWARD)
    | | | | | |--Clustered Index Seek(OBJECT:([WEALTH_LK].[dbo].[Mer_Sbi_Batch_Revenue].[PK_Mer_Sbi_Batch_Revenue] AS [previous_month]), SEEK:([previous_month].[Month_Date]=[@previous_month_end]) ORDERED FORWARD)
    | | | | |--Index Seek(OBJECT:([WEALTH_DW].[dbo].[Account].[IK_Account_Location_Id] AS [account]), SEEK:([account].[Location_Id]=[@uni_location_id] AND [account].[Account_Id]=[previous_month].[Uni_Account_Id]), WHERE:([account].[Account_Id]<30000000) ORDERED FORWARD)
    | | | |--Row Count Spool
    | | | |--Clustered Index Seek(OBJECT:([WEALTH_DW].[dbo].[Currency].[PK_Currency]), SEEK:([Currency].[Currency_Id]=[previous_month].[Uni_Act_Rev_Ccy]) ORDERED FORWARD)
    | | |--Row Count Spool
    | | |--Clustered Index Seek(OBJECT:([WEALTH_DW].[dbo].[Currency].[PK_Currency]), SEEK:([Currency].[Currency_Id]=[previous_month].[Uni_YTD_Act_Rev_CCy]) ORDERED FORWARD)
    | |--Row Count Spool
    | |--Index Seek(OBJECT:([WEALTH_DW].[dbo].[Code_Table].[IK_Code_Table_Code_Def_Id]), SEEK:([Code_Table].[Code_Def_Id]=15 AND [Code_Table].[Code_Id]=[previous_month].[Uni_Revenue_Type]) ORDERED FORWARD)
    |--Row Count Spool
    |--Index Seek(OBJECT:([WEALTH_DW].[dbo].[Account_Balance].[PK_Account_Balance]), SEEK:([Account_Balance].[Account_Id]=[previous_month].[Uni_Account_Id] AND [Account_Balance].[Balance_Date]=[@current_processing_date]) ORDERED FORWARD)

    (and a bit more)

    PhysicalOp            LogicalOp
    NULL                  NULL
    NULL                  NULL
    NULL                  NULL
    NULL                  NULL
    Assert                Assert
    Nested Loops          Left Semi Join
    Nested Loops          Left Semi Join
    Nested Loops          Left Semi Join
    Nested Loops          Left Semi Join
    Table Insert          Insert
    Top                   Top
    Compute Scalar        Compute Scalar
    Nested Loops          Inner Join
    Filter                Filter
    Merge Join            Right Outer Join
    Clustered Index Seek  Clustered Index Seek
    Clustered Index Seek  Clustered Index Seek
    Index Seek            Index Seek
    Row Count Spool       Lazy Spool
    Clustered Index Seek  Clustered Index Seek
    Row Count Spool       Lazy Spool
    Clustered Index Seek  Clustered Index Seek
    Row Count Spool       Lazy Spool
    Index Seek            Index Seek
    Row Count Spool       Lazy Spool
    Index Seek            Index Seek

    We are processing for different source systems (@uni_location_id) for the current/previous month (@current_processing_date/@previous_month_end) in our warehouse and this procedure processes the data for a specific source system (specified by a parameter coming into the proc). There are about a dozen different source systems.

    We have WITH RECOMPILE in the proc header already - as it is a warehouse, the overhead of recompiling the proc each time is insignificant. The proc might take between 1 and 10 minutes to run depending on which source system we are processing (some have a much bigger universe of data than others), so a second to recompile is not worth worrying about. The size of the data varies from a few thousand rows to half a million depending on the source system, and SQL Server might choose a different execution plan for different data sizes - hence the recompile, I guess.

    We haven't rebuilt the indexes on the tables in this particular query - there was another query causing us problems yesterday and we rebuilt its index using DBCC DBREINDEX, but this had no effect. We can update the stats and see what happens.
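
    For reference, the maintenance would be something like this - a sketch using one of the tables from the query above, run in the database that owns it:

    -- Rebuild all indexes on the table (SQL Server 2000 syntax)
    DBCC DBREINDEX ('mer_sbi_batch_revenue')

    -- Refresh the statistics with a full scan rather than a sample
    UPDATE STATISTICS mer_sbi_batch_revenue WITH FULLSCAN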

    Currently, I have extracted all the T-SQL from the proc and pasted it into QA, and it runs fine.

    We have tried restoring the database onto a different RAID array (mirrored rather than RAID 5) to see if there was a hardware problem with the array, but it made no difference.

    What is so frustrating is that there seems to be no consistency in where it goes awry. There are 3 procs, each with 7 or 8 steps, and there is no single point at which it 'fails'. It also works 'fine' on the test server, which was a restore of the live databases from the previous month - so full size, same indexes, same statistics etc.

    Any more ideas?

    Jez

  • OK, just a suggestion, but have you taken a look at the patch levels of the RAID controller, motherboard bus, NICs etc.?

    I'm not a server engineer, so all of that is over my head. I cannot believe that it is something wrong with SQL Server, as we have the same set-up on the test server (SP4), so I can only imagine it is either the hardware on the prod server or a Windows issue. Either way, I don't know where to look. We have had a look at the array controllers and they report everything is OK (no failed disc, battery cache working etc) but apart from that I am lost.

    Jez

  • As you say, it is not consistently happening in the same place, so I would get the engineers to compare the test and live hardware.

    You can't stuff a large sock down a small pipe quickly, and that may be your issue here. The only other test you can do is to put your Test system under load and then run the proc against it, better simulating the live system.

    Also try the trial edition of Spotlight (Quest Software) to give you a visual indication of what is going on at both the SQL and Windows level - I find it very useful at times.

  • Hi,

    Have you tried running SQL Profiler on the server while the process is running, to confirm that the statements you are expecting to run are in fact running?

    Have you also run sp_who2 while the application is running and checked for locks? Could the application somehow be locking itself by not committing transactions correctly?
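
    Something as simple as this will show any blocking - both are standard system procs:

    -- The BlkBy column shows which SPID, if any, is blocking each session
    EXEC sp_who2

    -- Lists the locks currently held, per SPID
    EXEC sp_lock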

    Hope the above helps!

    Pete

    Peter Gadsby
    Business Intelligence Consultant
    www.hgconsult.co.uk

    We've tried running Profiler but haven't had much success (we have no real experience of what to record).

    We have run sp_who2; ours was the only process running on the box at the time, and there were no locks. The process did split into parallel processes, and for some of these the CPU time kept going up with no change in DiskIO. I cannot think what type of processing (from within the procedure) would cause the CPU usage to hit the roof. It's almost as if it has entered an infinite loop.

    Jez

  • I would try the maxdop option (maximum degree of parallelism) on your select query and see if that helps. It limits the query to a specific number of processors; I would test with 1 to start. I've seen that make a huge difference with the type of problem you are describing - sometimes a query runs faster when it's not running in parallel. I'm not sure why, but it's worth a shot. If I remember correctly, you put the command at the end of the query: "option (maxdop 1)". See the "OPTION clause" in BOL. The degree of parallelism can also be set at the server level, but that would affect every query, so be careful changing that.
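
    A sketch of both approaches (the query hint is shown on a cut-down version of your select; the server-wide setting is an advanced sp_configure option):

    -- Per-query: append the hint to the offending statement
    select previous_month.uni_account_id
    from wealth_lk..mer_sbi_batch_revenue previous_month
    where previous_month.month_date = @previous_month_end
    option (maxdop 1)

    -- Server-wide (affects every query, so be careful)
    exec sp_configure 'show advanced options', 1
    reconfigure
    exec sp_configure 'max degree of parallelism', 1
    reconfigure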

  • So, double check one more time...

    All indexes (including primary keys) on the test box are also available on the production box.

    All tables have the same row count on both boxes.

    All tables have the same, if any, triggers.

    Both boxes have the same amount of memory available.

    Both boxes have the same amount of TempDB available.

    You're 100% sure the proc is identical on both boxes.

    Default fill factor on both boxes is identical.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    The database on the test server is a restore of live, so the tables will have the same rowcount, the indexes are identical, and the triggers (one to update the timestamp column and one to record details in an archive table) are the same.

    The production box was using AWE (~3 GB SQL RAM) but the test box is using normal RAM (~2 GB) - we have currently disabled AWE on the prod box and it is now using 2 GB RAM, so essentially the same.

    Tempdb availability is similar on both boxes - unlimited growth - but in test the tempdb is on the same RAID array as the database (RAID 5), whereas in live it is on a separate array (RAID 0+1).

    What is strange is that if I run the code from the sp in Query Analyzer on the prod box, it works fine - it completed in less than a minute, compared with the sp which was still running after a couple of hours (it is reading 20-million-row tables and producing a 500,000-row result set, so I never expect a sub-second response). I had to do this so that we could complete the month-end processing. When running the sp and watching sp_who2, we see the CPUTime column go up and up and up but there is no DiskIO activity. This is what is confusing us: the code seems to be fine when submitted through QA but 'fails' when executed via a procedure.

    Jez

  • Are you being hit by "Parameter sniffing"? Search here for a description of typical problems attributed to this 'feature' and simple solutions to same.

  • Andrew has hit the nail on the head.

    I have amended the procs so that they use temporary (#) tables and local variables rather than the parameters passed into the proc, and they all run much better (a sketch of the local-variable pattern is below).
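
    The pattern is roughly this (hypothetical proc name, simplified body) - the optimizer cannot sniff a local variable's value at compile time, so the plan is built from average statistics rather than one unlucky parameter value:

    create procedure usp_process_revenue -- hypothetical name
        @uni_location_id int
    as
    begin
        -- copy the incoming parameter into a local variable
        declare @location_id int
        set @location_id = @uni_location_id

        -- the plan for this select is no longer tied to the sniffed value
        select account_id, location_id
        from account
        where location_id = @location_id
    end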

    It still doesn't answer the question of why we didn't experience these problems on the test server - maybe the sequence in which we executed the procedures for the different platforms (and the different numbers of rows affected) was different and generated different execution plans.

    It doesn't really matter, as we have an answer to our performance problems. Parameter sniffing is definitely one to keep in mind in future.

    Thanks everyone for your ideas.

    Jez

  • As an alternative to local variables, I would suggest that you change the procs back to using parameters and mark each proc WITH RECOMPILE.

    With variables you'll get average performance - never bad, but also not the best you could have. If you recompile on every execution, you're guaranteed to get the best plan for each exec.

    There will be a CPU overhead for the recompiles, but it may be acceptable. Try it and see.
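
    For example (same hypothetical proc as above):

    -- WITH RECOMPILE in the header: every execution gets a plan
    -- optimized for the parameter values actually passed in
    create procedure usp_process_revenue
        @uni_location_id int
    with recompile
    as
    select account_id, location_id
    from account
    where location_id = @uni_location_id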

    p.s. Param sniffing is only a problem if different calls to the procs have different execution paths or affect wildly differing numbers of rows.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
