How to troubleshoot a sudden drop in Page Life Expectancy?

  • We have a monitoring tool, SolarWinds DPA, that shows sudden drops in PLE at different times. For example, it can drop from 3,000 sec to about 100 sec. During those times, some applications and processes experience delays and timeouts.

    My goal is to find the process or procedure that causes this sudden drop. I tried tracing the server with a filter of 5,000,000 microseconds (5 seconds) and greater on duration, but could not pin down what causes it. (A sketch of the kind of filtered capture I mean follows this post.)

    Any ideas on how to find the reason for it?

    Thanks
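
    For reference, here is a minimal sketch of the kind of filtered capture I mean - written as an Extended Events session rather than a trace, with the same 5,000,000-microsecond (5 second) duration filter. The session name is just an example and the syntax assumes SQL Server 2012 or later:

    -- Capture statements running 5 seconds or longer (duration is reported in microseconds)
    CREATE EVENT SESSION LongRunningStatements ON SERVER
    ADD EVENT sqlserver.sql_statement_completed (
        ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
        WHERE duration >= 5000000
    )
    ADD TARGET package0.event_file (SET filename = N'LongRunningStatements');
    GO

    ALTER EVENT SESSION LongRunningStatements ON SERVER STATE = START;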

  • In DPA you can click on a day, then an hour, and then using Timeslice you can drill down to the processes that were running in a particular minute. I use that fairly often to identify problem queries & processes.

  • I also tried this, but within each timeslice there are dozens of processes. How can I identify the culprit among them?

  • On the SQL tab they are usually sorted by the ones that took the longest or had the most waits associated with them. I start there.

  • I'll suggest that it's almost a waste of time. It's certainly an indication that something wasn't in memory when it was needed, but that's not necessarily a problem, especially if it's a temporary sudden drop from a high norm.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I would agree if it were a small drop, for example from 3,000 to 2,000 sec. But with a drop from 3,000 to 100 sec, my understanding is that some enormous process requested a huge amount of memory and thereby forced SQL Server to flush a big chunk out of the cache.

  • SQL Guy 1 (12/22/2016)


    I would agree if it were a small drop, for example from 3,000 to 2,000 sec. But with a drop from 3,000 to 100 sec, my understanding is that some enormous process requested a huge amount of memory and thereby forced SQL Server to flush a big chunk out of the cache.

    I'm sure there will be those who disagree, but the only time I truly concern myself with a vertical drop in PLE is if it drops and stays there for an uncomfortable period of time. Oddly enough, if you run a cache line on Perfmon, you usually won't see a problem. I'd be more concerned about a large drop in the cache line.

    If you are concerned about occasional drops (and they should only be occasional during a given 24-hour period), then take a look at the following article to figure out what you can do about it. The first line of defense is to figure out which code is doing it and fix it. I regularly monitor the top 10 worst performers on my systems and fix them if they're actually a problem.

    --Jeff Moden



  • It would probably be something doing a scan of a large table. Look at execution plans for table scan/index scan.
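
    Something along these lines against the plan cache will surface cached plans that contain scans. It is just a sketch, and it can be expensive on a busy server, so keep the TOP and run it off-peak:

    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT TOP (20)
        cp.usecounts,
        cp.objtype,
        st.[text] AS batch_text,
        qp.query_plan
    FROM sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
    -- Look for scan operators anywhere in the plan XML
    WHERE qp.query_plan.exist('//RelOp[@PhysicalOp = "Table Scan"
                                       or @PhysicalOp = "Clustered Index Scan"
                                       or @PhysicalOp = "Index Scan"]') = 1
    ORDER BY cp.usecounts DESC;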

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If it is a recent development, I would suggest having a look at the changes made around the time the "problem" started occurring.

    Chatting with the developers responsible will help narrow it down to a more precise cause.

    Gail's idea that a full table scan may be the culprit is also a very good place to start.

    If nothing further can be done and other applications are being affected during this period of drop, do what my last employer did and throw more memory at it.

  • We can start by looking into which database is taking the most memory, and then which indexes are taking the most memory, using sys.dm_os_buffer_descriptors. This will help drill down to the queries taking up the most memory in the buffer pool. (A sketch follows.)
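
    A rough sketch of both steps (the per-index query assumes you run it in the database identified in the first step, and only counts in-row and row-overflow allocation units to keep the join simple):

    -- 1) Buffer pool usage by database
    SELECT DB_NAME(bd.database_id) AS database_name,
           COUNT(*) * 8 / 1024     AS buffer_mb
    FROM sys.dm_os_buffer_descriptors bd
    GROUP BY bd.database_id
    ORDER BY buffer_mb DESC;

    -- 2) In the current database, buffer pool usage by object and index
    SELECT OBJECT_NAME(p.[object_id]) AS object_name,
           p.index_id,
           COUNT(*) * 8 / 1024       AS buffer_mb
    FROM sys.dm_os_buffer_descriptors bd
    JOIN sys.allocation_units au ON au.allocation_unit_id = bd.allocation_unit_id
    JOIN sys.partitions p        ON p.hobt_id = au.container_id
    WHERE bd.database_id = DB_ID()
      AND au.[type] IN (1, 3)   -- IN_ROW_DATA and ROW_OVERFLOW_DATA
    GROUP BY p.[object_id], p.index_id
    ORDER BY buffer_mb DESC;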

  • Jeff Moden (12/22/2016)

    but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time.

    After a sudden drop it steadily grows over time, until the next drop. But the problem is that we have an ETL process, and if it happens to run during those low-PLE periods (usually 50 - 200), it experiences PAGEIOLATCH_EX waits and times out as a result. I tried to remove as many indexes as possible from the table it loads into (it has 200+ columns), but I can't remove all of them because they are used in subsequent selects.
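
    (For what it's worth, this is roughly how I confirm the PAGEIOLATCH waits at the instance level; just a sketch against sys.dm_os_wait_stats, which is cumulative since the last restart or stats clear.)

    -- Cumulative PAGEIOLATCH waits for the instance
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms,
           signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE N'PAGEIOLATCH%'
    ORDER BY wait_time_ms DESC;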

    ... then take a look at the following article to figure out what you can do about it.

    Could you please give a link for the article? Thanks

    GilaMonster (12/23/2016)


    It would probably be something doing a scan of a large table. Look at execution plans for table scan/index scan.

    I like this idea. How can I catch the processes that are doing large table scans? Can I trace them? And should I include clustered index scans as well?

    kevaburg (12/23/2016)


    If it is a recent development, I would suggest having a look at the changes made around the time the "problem" started occurring.

    It is a recent migration from 2008 to 2014. We also had drops in PLE in our old system, of course, but not as steep. What I suspect is that SQL Server 2014 probably overestimates the number of pages it needs under certain circumstances, which results in flushing large areas of memory.
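
    (One way I try to check this theory is to look at the active memory grants when PLE dips - grants are workspace memory for sorts and hashes rather than data cache, but a big grant can still squeeze the buffer pool. Just a sketch.)

    -- Active memory grants, largest first
    SELECT session_id,
           requested_memory_kb,
           granted_memory_kb,
           used_memory_kb,
           query_cost
    FROM sys.dm_exec_query_memory_grants
    ORDER BY granted_memory_kb DESC;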

    Tushar Kanti (12/23/2016)


    We can start by looking into which database is taking the most memory, and then which indexes are taking the most memory, using sys.dm_os_buffer_descriptors. This will help drill down to the queries taking up the most memory in the buffer pool.

    This is what I did in the first place. But that query is itself resource-consuming and does not return results immediately; we have about 120 GB of memory. Another problem is that my tool (DPA) displays reports with some delay, so if I run the query after the fact, it is almost useless.

  • That is odd - I had to look to make sure that I didn't post this myself, as I started to post something similar earlier in the week.

    I actually posted something related to this, but not specifically PLE, on Thwack.

    So I am watching this with interest...

    We too are running SQL 2014, using DPA for monitoring, and have almost exactly the same amount of memory specified (so throwing memory at this issue, as somebody suggested, is not an option, as we are running Standard Edition).

    I have been tracking low PLE values for protracted periods during the day for some months now, but have been at a loss to account for it.

    I expected to see additional IO and an impact on Cache Hit Ratio, but this is not the case as far as I can tell.

    Curious to know: how long has the server (hardware or SQL) been up, what size is the buffer cache reported by DPA now (as opposed to the target), and, assuming it has been up for some time, does it show a decline over the month view?
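
    (Outside of DPA, the current and target figures can also be pulled from sys.dm_os_performance_counters; a sketch.)

    -- PLE plus current vs. target server memory
    SELECT RTRIM([object_name]) AS [object_name],
           RTRIM(counter_name)  AS counter_name,
           cntr_value
    FROM sys.dm_os_performance_counters
    WHERE (counter_name = N'Page life expectancy'
           AND [object_name] LIKE N'%Buffer Manager%')
       OR (counter_name IN (N'Total Server Memory (KB)', N'Target Server Memory (KB)')
           AND [object_name] LIKE N'%Memory Manager%');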

    I run some DMV code (it may even have originated from somebody on here) that shows what is in the buffer pool, and I have specifically run it on an interval, and immediately when I see this drop in PLE, to see if the contents change (i.e. which indexes, or portions of them, are in there), and oddly they don't seem to change.

    I too had expected the contents of the buffer pool to be flushed to accommodate something else, but that seems not to be the case - unless it switches back before I can look at it?

    What I can say is that I have been unable to attribute the low PLE values to any timeouts we see - in our case they seem to be caused by locking, blocking and deadlocks, but not by a low PLE value.

    Steve O.

  • Another thing to keep in mind is that some of what you are describing can be totally normal and nothing to worry about. The low PLEs can show up when the lazy writer goes through its cycle and sweeps the cache - you may just be capturing the values at the beginning of this.

    If it is caused by a query, you should have what you need in sys.dm_exec_query_stats and queries based off it. I'd look first at the physical reads. You'd also want to include the query plan and look for missing indexes alongside the high physical reads. There are several of those queries available - Glenn Berry's site has some. I think I got this one from a post by Jonathan Kehayias:

    SELECT TOP 50
        qs.execution_count,
        AvgPhysicalReads        = ISNULL(qs.total_physical_reads / qs.execution_count, 0),
        MinPhysicalReads        = qs.min_physical_reads,
        MaxPhysicalReads        = qs.max_physical_reads,
        AvgPhysicalReads_kbsize = ISNULL(qs.total_physical_reads / qs.execution_count, 0) * 8,
        MinPhysicalReads_kbsize = qs.min_physical_reads * 8,
        MaxPhysicalReads_kbsize = qs.max_physical_reads * 8,
        CreationDateTime        = qs.creation_time,
        -- Extract just the statement text from the batch (offsets are in bytes, hence the /2)
        SUBSTRING(qt.[text], (qs.statement_start_offset / 2) + 1,
            ((CASE
                  WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                  ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1
        ) AS query_text,
        qt.[dbid],
        qt.objectid,
        tp.query_plan,
        -- Flag plans that report missing indexes
        tp.query_plan.exist('declare default element namespace "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
            /ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/MissingIndexes') AS missing_index_info
    FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_sql_text(qs.[sql_handle]) AS qt
    OUTER APPLY sys.dm_exec_query_plan(qs.plan_handle) tp
    ORDER BY AvgPhysicalReads DESC;

    Sue

  • SQL Guy 1 (12/23/2016)


    Jeff Moden (12/22/2016)

    but the only time I truly concern myself with a vertical drop with PLE is if it drops and stays there for an uncomfortable period of time.

    After a sudden drop it steadily grows over time, until the next drop. But the problem is that we have an ETL process, and if it happens to run during those low-PLE periods (usually 50 - 200), it experiences PAGEIOLATCH_EX waits and times out as a result. I tried to remove as many indexes as possible from the table it loads into (it has 200+ columns), but I can't remove all of them because they are used in subsequent selects.

    ... then take a look at the following article to figure out what you can do about it.

    Could you please give a link for the article? Thanks

    Apologies for the missing link. Had several interruptions.

    The first link is by Paul Randal, where he explains some of the things going on and some of the reasons to worry about low PLE or not.

    https://sqlperformance.com/2014/10/sql-performance/knee-jerk-page-life-expectancy

    A great article on just about everything one might think of that can affect PLE is the following one by Steve Hood, who has spoken on the subject many times (if you get the chance to see his presentation on PLE, you won't be disappointed). For example, did you know that the size of TempDB can affect PLE?

    https://simplesqlserver.com/2013/08/19/fixing-page-life-expectancy-ple/

    Steve's article is a bit long but well structured and well worth the read.

    Let me say that everything SQL Server does has to go through memory in one form or another. One of the largest wastes of memory is performance-challenged code, which can waste huge amounts of memory on things like accidental many-to-many joins (do you have DISTINCT in your code? It might be a bigger problem than you think), triangular joins, non-SARGable code (there's a tiny sketch of that one at the end of this post), code that simply returns too many unused columns, code that does a lot of index scans, etc. And don't forget that TempDB also starts off in memory and only spills to disk if something gets too big.

    The bottom line is, performance and a decent PLE are in the code... or not. Find and fix the performance-challenged code, and don't forget there are two types: batch code and normal everyday GUI code. Either can require too much memory and, surprisingly, it's normally not the batch code. It frequently looks like it might be the batch code because it runs relatively infrequently, but the normal GUI code might have a whole lot of memory tied up because it wasn't written well and uses too many resources.
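
    To make the non-SARGable point concrete, here's a tiny sketch (the table and column names are made up purely for illustration):

    -- Non-SARGable: wrapping the column in a function typically forces a scan
    -- (hypothetical table and column names)
    SELECT OrderID
    FROM dbo.Orders
    WHERE YEAR(OrderDate) = 2016;

    -- SARGable rewrite: an open-ended range on the bare column can use an index seek
    SELECT OrderID
    FROM dbo.Orders
    WHERE OrderDate >= '20160101'
      AND OrderDate <  '20170101';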

    --Jeff Moden


