Page Life Expectancy drops like a rock at several points throughout the day

  • We just went live on 2014 a few weeks ago. Previously on 2008R2, we've noticed a few small differences in behavior for the different DB Engines.

    This one I have been going mad trying to understand. PLE will just take a complete dive at several points during the day and I can't explain why yet. This morning for example, our busiest server was rolling along with a PLE about 3000 then out of nowhere hits 114. Yesterday one of our less busy servers was rolling along with a PLE around 11,145 then out of nowhere, 86.

    These servers are granted RAM of between 120GB to 240GB and when I start to inspect the plan cache there isn't an obvious smoking gun like a DROP CLEAN BUFFERS that stands out, or any queries with a massive memory commitment. There are a whole bunch of 1 off ad-hoc queries though, but that is probably a different issue to deal with.

    We use Idera's SqlDM to monitor and catalog the server health 24/7 so I have some history to look at, but as of yet I have not been able to find anything useful to go on.

    Anyone have a thought on this?

  • Are you restoring databases on the server, or rebuilding indexes?

    John

  • Ok I probably should have specified that, index rebuilds are off hours.

    As for restoring db's good point, i'll verify that isn't happening. Our users have a process that can do that for a new customer. I'll see if that is happening at that time.

    EDIT: looks like the user process DOES restore a db, but only on one server. Still unexplained for the remaining servers.

  • How big is the largest table in each database? Is it possible that a query is coming along every now and then that does a table/clustered index scan against a 200GB table, thus knocking everything else out of the buffer?

    John

  • It could be a number of things , like John mentioned the most likely issue is user queries. However you need to monitor non user related operations as well.

    the next time the issue happens you could run sys.dm_os_memory_clerks. to check if the RAM is being allocated outside the buffer pool this could help you eliminate a llot of external factors.

    You could also configure Resource governor to limit the RAM available to Adhoc SQL and this way eliminate application related issues and ad hoc user related issues.

    Jayanth Kurup[/url]

  • mortalic (7/10/2015)


    We just went live on 2014 a few weeks ago. Previously on 2008R2, we've noticed a few small differences in behavior for the different DB Engines.

    This one I have been going mad trying to understand. PLE will just take a complete dive at several points during the day and I can't explain why yet. This morning for example, our busiest server was rolling along with a PLE about 3000 then out of nowhere hits 114. Yesterday one of our less busy servers was rolling along with a PLE around 11,145 then out of nowhere, 86.

    These servers are granted RAM of between 120GB to 240GB and when I start to inspect the plan cache there isn't an obvious smoking gun like a DROP CLEAN BUFFERS that stands out, or any queries with a massive memory commitment. There are a whole bunch of 1 off ad-hoc queries though, but that is probably a different issue to deal with.

    We use Idera's SqlDM to monitor and catalog the server health 24/7 so I have some history to look at, but as of yet I have not been able to find anything useful to go on.

    Anyone have a thought on this?

    Yes... did you or your users notice any difference in perceived performance? I'm thinking probably not. PLE is pretty sensitive to just about anything being pushed out of memory. Usually, it's an indication that something different than what's normal just happened. A lot of times, whatever it was that occurred happened so fast that it's not really going to make a difference. If it drops low and stays low for a fair bit of time, then you might have a problem that can and should be looked into.

    Paul Randal has a good write up on PLE at the following URL. In particular, look at the section of the article titled "What Can You Do About PLE Dropping?"

    http://sqlperformance.com/2014/10/sql-performance/knee-jerk-page-life-expectancy

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Let's see if I can get everyone's responses covered.

    John, great advice, I'll look closely at that, there are a few tables that are extremely large, I'll see if I can find a query plan that hits one of those and see how big the query plan is. Last time I looked, I never saw a query larger than a few MB. But you never know!

    Jayanth, after looking at sys.dm_os_memory_clerks, buffer pool has two entries, and are by far the largest. Nothing else even close.

    There are a lot of ad-hoc queries, but after inspecting them, it appears they are direct calls from our application. I've started talks with the devs to work on correcting them. Could this fact be the whole issue?

    Jeff, no, there was no performance issue reported. Our user base is pretty vocal about that too. Cache hit ratio didn't drop very much during that time, went from like 81% to 79%. So maybe this isn't really a problem? Good tip on Paul's article, I'll read this this morning.

    Thanks!

  • mortalic (7/13/2015)


    Jeff, no, there was no performance issue reported. Our user base is pretty vocal about that too. Cache hit ratio didn't drop very much during that time, went from like 81% to 79%. So maybe this isn't really a problem? Good tip on Paul's article, I'll read this this morning.

    Agreed. Cache hit ratio is a much better measurement. I only use PLE as an indication that something different just happened and that I might need to watch things for a minute or two.

    As to a cache hit ratio of only 81%, now THAT might be a problem. I get nervous when it drops to 94% and I turn on the war room lights when it drops to 90%. My very strong recommendation would be to open up SSMS, open the Object Explorer (if not already open), right click on the instance, select {reports}, {Standard reports}, and run the {Performance - Top Queries by Total IO} report. Chances are, you have a lot of code that's doing unnecessary scans which will possibly cause a whole lot of unnecessary data to reside in memory.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks Jeff, Yeah we are aware of the cache hit ratio issue. We have some REALLY bad dynamic sql queries coming out of the application that cause this. I ran the report you suggested and it showed a few more queries I hadn't seen before. I also sent them over to our dev team to work on.

    This is probably the larger issue isn't it?

  • mortalic (7/14/2015)


    This is probably the larger issue isn't it?

    Yes. I'd say so.

    You might also want to run the similar report for CPU.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks everyone for the brain storming session. I really appreciate it.

  • Just wanted to pint out a very detailed explaination on BCHR and why it is misunderstood as a memory pressure counter.

    https://www.simple-talk.com/sql/database-administration/great-sql-server-debates-buffer-cache-hit-ratio/

    In this case I feel the root cause is adhoc sql where a user must have written a query to fetch a very large volume of data.

    Jayanth Kurup[/url]

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply