How to troubleshoot a sudden drop in Page Life Expectancy?

  • We have a monitoring tool, SolarWinds DPA, that shows sudden drops in PLE at different times. For example, it can drop from 3,000 sec to about 100 sec. During those times, some applications and processes experience delays and timeouts.

    My goal is to find the process or procedure that causes this sudden drop. I tried tracing the server with a filter of 5,000,000 microseconds (5 seconds) and greater on duration, but could not pin down what causes it. (A sketch of the kind of filtered capture I mean follows this post.)

    Any ideas on how to find the reason for it?

    Thanks
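
    For reference, here is a minimal sketch of the kind of filtered capture I mean - written as an Extended Events session rather than a trace, with the same 5,000,000-microsecond (5 second) duration filter. The session name is just an example and the syntax assumes SQL Server 2012 or later:

    -- Capture statements running 5 seconds or longer (duration is reported in microseconds)
    CREATE EVENT SESSION LongRunningStatements ON SERVER
    ADD EVENT sqlserver.sql_statement_completed (
        ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
        WHERE duration >= 5000000
    )
    ADD TARGET package0.event_file (SET filename = N'LongRunningStatements');
    GO

    ALTER EVENT SESSION LongRunningStatements ON SERVER STATE = START;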

  • In DPA you can click on a day, then an hour, and then using Timeslice you can drill down to the processes that were running in a particular minute. I use that fairly often to identify problem queries & processes.

  • I also tried this, but within each timeslice there are dozens of processes. How can I identify the culprit among them?

  • On the SQL tab they are usually sorted by the ones that took the longest or had the most waits associated with them. I start there.

  • I'll suggest that it's almost a waste of time. It's certainly an indication that something wasn't in memory when it was needed, but that's not necessarily a problem, especially if it's a temporary sudden drop from a high norm.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I would agree if it were a small drop, for example from 3,000 to 2,000 sec. But with a drop from 3,000 to 100 sec, my understanding is that some enormous process requested a huge amount of memory and thereby forced SQL Server to flush a big chunk out of the cache.

  • SQL Guy 1 (12/22/2016)


    I would agree if it were a small drop, for example from 3,000 to 2,000 sec. But with a drop from 3,000 to 100 sec, my understanding is that some enormous process requested a huge amount of memory and thereby forced SQL Server to flush a big chunk out of the cache.

    I'm sure there will be those who disagree, but the only time I truly concern myself with a vertical drop in PLE is if it drops and stays there for an uncomfortable period of time. Oddly enough, if you run a cache line on Perfmon, you usually won't see a problem. I'd be more concerned about a large drop in the cache line.

    If you are concerned about occasional drops (and they should only be occasional during a given 24-hour period), then take a look at the following article to figure out what you can do about it. The first line of defense is to figure out which code is doing it and fix it. I regularly monitor the top 10 worst performers on my systems and fix them if they're actually a problem.

    --Jeff Moden



  • It would probably be something doing a scan of a large table. Look at execution plans for table scan/index scan.
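
    Something along these lines against the plan cache will surface cached plans that contain scans. It is just a sketch, and it can be expensive on a busy server, so keep the TOP and run it off-peak:

    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT TOP (20)
        cp.usecounts,
        cp.objtype,
        st.[text] AS batch_text,
        qp.query_plan
    FROM sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
    -- Look for scan operators anywhere in the plan XML
    WHERE qp.query_plan.exist('//RelOp[@PhysicalOp = "Table Scan"
                                       or @PhysicalOp = "Clustered Index Scan"
                                       or @PhysicalOp = "Index Scan"]') = 1
    ORDER BY cp.usecounts DESC;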

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If it is a recent development, I would suggest having a look at the changes made around the time the "problem" started occurring.

    Chatting with the developers responsible will help narrow it down to a more precise cause.

    Gail's idea that a full table scan may be the culprit is also a very good place to start.

    If nothing further can be done and other applications are being affected during this period of drop, do what my last employer did and throw more memory at it.

  • We can start by looking into which database is taking the most memory, and then which indexes are taking the most memory, using sys.dm_os_buffer_descriptors. This will help drill down to the queries taking up the most memory in the buffer pool. (A sketch follows.)
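
    A rough sketch of both steps (the per-index query assumes you run it in the database identified in the first step, and only counts in-row and row-overflow allocation units to keep the join simple):

    -- 1) Buffer pool usage by database
    SELECT DB_NAME(bd.database_id) AS database_name,
           COUNT(*) * 8 / 1024     AS buffer_mb
    FROM sys.dm_os_buffer_descriptors bd
    GROUP BY bd.database_id
    ORDER BY buffer_mb DESC;

    -- 2) In the current database, buffer pool usage by object and index
    SELECT OBJECT_NAME(p.[object_id]) AS object_name,
           p.index_id,
           COUNT(*) * 8 / 1024       AS buffer_mb
    FROM sys.dm_os_buffer_descriptors bd
    JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
    JOIN sys.partitions p        ON p.hobt_id = au.container_id
    WHERE bd.database_id = DB_ID()
      AND au.[type] IN (1, 3)   -- IN_ROW_DATA and ROW_OVERFLOW_DATA
    GROUP BY p.[object_id], p.index_id
    ORDER BY buffer_mb DESC;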

  • Jeff Moden (12/22/2016)

    but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time.

    After a sudden drop it steadily grows over time, until the next drop. But the problem is that we have an ETL process, and if it happens to run during those low-PLE periods (usually 50 - 200), it experiences PAGEIOLATCH_EX waits and times out as a result. I tried to remove as many indexes as possible from the table it loads into (it has 200+ columns), but I can't remove all of them because they are used in subsequent selects.
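
    (For what it's worth, this is roughly how I confirm the PAGEIOLATCH waits at the instance level; just a sketch against sys.dm_os_wait_stats, which is cumulative since the last restart or stats clear.)

    -- Cumulative PAGEIOLATCH waits for the instance
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE N'PAGEIOLATCH%'
    ORDER BY wait_time_ms DESC;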

    ... then take a look at the following article to figure out what you can do about it.

    Could you please give a link for the article? Thanks

    GilaMonster (12/23/2016)


    It would probably be something doing a scan of a large table. Look at execution plans for table scan/index scan.

    I like this idea. How can I catch the processes that are doing large table scans? Can I trace them? And should I include clustered index scans as well?

    kevaburg (12/23/2016)


    If it is a recent development, I would suggest having a look at the changes made around the time the "problem" started occurring.

    It is a recent migration from 2008 to 2014. We also had drops in PLE in our old system, of course, but not as steep. What I suspect is that SQL Server 2014 probably overestimates the number of pages it needs under certain circumstances, which results in flushing large areas of memory.
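
    (One way I try to check this theory is to look at the active memory grants when PLE dips - grants are workspace memory for sorts and hashes rather than data cache, but a big grant can still squeeze the buffer pool. Just a sketch.)

    -- Active memory grants, largest first
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb,
           query_cost
    FROM sys.dm_exec_query_memory_grants
    ORDER BY granted_memory_kb DESC;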

    Tushar Kanti (12/23/2016)


    We can start by looking into which database is taking the most memory, and then which indexes are taking the most memory, using sys.dm_os_buffer_descriptors. This will help drill down to the queries taking up the most memory in the buffer pool.

    This is what I did in the first place. But that query is itself resource-consuming and does not return results immediately; we have about 120 GB of memory. Another problem is that my tool (DPA) displays reports with some delay, so if I run the query after the fact, it is almost useless.

  • That is odd - I had to look to make sure that I didn't post this myself, as I started to post something similar earlier in the week.

    I actually posted something related to this, but not specifically PLE, on Thwack.

    So I am watching this with interest...

    We too are running SQL 2014, using DPA for monitoring, and have almost exactly the same amount of memory specified (so throwing memory at this issue, as somebody suggested, is not an option, as we are running Standard Edition).

    I have been tracking low PLE values for protracted periods during the day for some months now, but have been at a loss to account for it.

    I expected to see additional IO and an impact on Cache Hit Ratio, but this is not the case as far as I can tell.

    Curious to know: how long has the server (hardware or SQL) been up, what size is the buffer cache reported by DPA now (as opposed to the target), and, assuming it has been up for some time, does it show a decline over the month view?
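
    (Outside of DPA, the current and target figures can also be pulled from sys.dm_os_performance_counters; a sketch.)

    -- PLE plus current vs. target server memory
    SELECT RTRIM([object_name]) AS [object_name],
           RTRIM(counter_name)  AS counter_name,
           cntr_value
    FROM sys.dm_os_performance_counters
    WHERE (counter_name = N'Page life expectancy'
           AND [object_name] LIKE N'%Buffer Manager%')
       OR (counter_name IN (N'Total Server Memory (KB)', N'Target Server Memory (KB)')
           AND [object_name] LIKE N'%Memory Manager%');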

    I run some DMV code (it may even have originated from somebody on here) that shows what is in the buffer pool, and I have specifically run it on an interval, and immediately when I see this drop in PLE, to see if the contents change (i.e. which indexes, or portions of them, are in there), and oddly they don't seem to change.

    I too had expected the contents of the buffer pool to be flushed to accommodate something else, but that seems not to be the case - unless it switches back before I can look at it?

    What I can say is that I have been unable to attribute the low PLE values to any timeouts we see - in our case they seem to be caused by locking, blocking and deadlocks, but not by a low PLE value.

    Steve O.

  • Another thing to keep in mind is that some of what you are describing can be totally normal and nothing to worry about. The low PLEs can show up when the lazy writer goes through its cycle and sweeps the cache - you may just be capturing the values at the beginning of this.

    If it is caused by a query, you should have what you need in sys.dm_exec_query_stats and queries based off it. I'd look first at the physical reads. You'd also want to include the query plan and look for missing indexes alongside the high physical reads. There are several of those queries available - Glenn Berry's site has some. I think I got this one from a post by Jonathan Kehayias:

    SELECT TOP 50
        qs.execution_count,
        AvgPhysicalReads        = ISNULL(qs.total_physical_reads / qs.execution_count, 0),
        MinPhysicalReads        = qs.min_physical_reads,
        MaxPhysicalReads        = qs.max_physical_reads,
        AvgPhysicalReads_kbsize = ISNULL(qs.total_physical_reads / qs.execution_count, 0) * 8,
        MinPhysicalReads_kbsize = qs.min_physical_reads * 8,
        MaxPhysicalReads_kbsize = qs.max_physical_reads * 8,
        CreationDateTime        = qs.creation_time,
        -- Extract just the statement text from the batch (offsets are in bytes, hence the /2)
        SUBSTRING(qt.[text], (qs.statement_start_offset / 2) + 1,
            ((CASE
                  WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                  ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1
        ) AS query_text,
        qt.[dbid],
        qt.objectid,
        tp.query_plan,
        -- Flag plans that report missing indexes
        tp.query_plan.exist('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
            /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/MissingIndexes') AS missing_index_info
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
    OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
    ORDER BY AvgPhysicalReads DESC;

    Sue

  • SQL Guy 1 (12/23/2016)


    Jeff Moden (12/22/2016)

    but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time.

    After a sudden drop it steadily grows over time, until the next drop. But the problem is that we have an ETL process, and if it happens to run during those low-PLE periods (usually 50 - 200), it experiences PAGEIOLATCH_EX waits and times out as a result. I tried to remove as many indexes as possible from the table it loads into (it has 200+ columns), but I can't remove all of them because they are used in subsequent selects.

    ... then take a look at the following article to figure out what you can do about it.

    Could you please give a link for the article? Thanks

    Apologies for the missing link. Had several interruptions.

    The first link is by Paul Randal, where he explains some of the things going on and some of the reasons to worry about low PLE or not.

    https://sqlperformance.com/2014/10/sql-performance/knee-jerk-page-life-expectancy

    A great article on just about everything one might think of that can affect PLE is the following one by Steve Hood, who has spoken on the subject many times (if you get the chance to see his presentation on PLE, you won't be disappointed). For example, did you know that the size of TempDB can affect PLE?

    https://simplesqlserver.com/2013/08/19/fixing-page-life-expectancy-ple/

    Steve's article is a bit long but well structured and well worth the read.

    Let me say that everything SQL Server does has to go through memory in one form or another. One of the largest wastes of memory is performance-challenged code, which can waste huge amounts of memory on things like accidental many-to-many joins (do you have DISTINCT in your code? It might be a bigger problem than you think), triangular joins, non-SARGable code (there's a tiny sketch of that one at the end of this post), code that simply returns too many unused columns, code that does a lot of index scans, etc. And don't forget that TempDB also starts off in memory and only spills to disk if something gets too big.

    The bottom line is, performance and a decent PLE are in the code... or not. Find and fix the performance-challenged code, and don't forget there are two types: batch code and normal everyday GUI code. Either can require too much memory and, surprisingly, it's normally not the batch code. It frequently looks like it might be the batch code because it runs relatively infrequently, but the normal GUI code might have a whole lot of memory tied up because it wasn't written well and uses too many resources.
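
    To make the non-SARGable point concrete, here's a tiny sketch (the table and column names are made up purely for illustration):

    -- Non-SARGable: wrapping the column in a function typically forces a scan
    -- (hypothetical table and column names)
    SELECT OrderID
    FROM dbo.Orders
    WHERE YEAR(OrderDate) = 2016;

    -- SARGable rewrite: an open-ended range on the bare column can use an index seek
    SELECT OrderID
    FROM dbo.Orders
    WHERE OrderDate >= '20160101'
      AND OrderDate <  '20170101';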

    --Jeff Moden


