Execution plan estimates, join hints and spilling.

  • Hi,

    I've been troubleshooting a performance problem with a stored procedure where, depending on the parameters, the row count in two temporary tables can be very small (1 row) or very large (50,000 rows).

    The stored procedure works nicely when the cached execution plan has estimated a large row count, because it uses a hash match when joining the two temporary tables, but when the cached execution plan has estimated a small row count it uses a nested loops join.

    To ensure consistent performance I've used a join hint to force a hash match, which works, but in some instances this appears to cause spilling to disk because the estimated row count and memory grant are not big enough. The joins that suffer are specifically the two temporary table joins.
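    For illustration, this is roughly the kind of hint I mean - a minimal sketch with placeholder table and column names, not the actual procedure:

    SELECT t1.SomeKey, t2.SomeValue
    FROM #Temp1 AS t1
    INNER HASH JOIN #Temp2 AS t2      -- join hint: force a hash match for this join only
        ON t2.SomeKey = t1.SomeKey;

    -- Alternatively, OPTION (HASH JOIN) at the end of the statement forces
    -- hash joins for every join in the query.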

    Are row counts and memory grants cached or re-calculated per execution? 

    If the row counts and memory grant are cached with the execution plan, and the first execution produces the low row count while a later execution produces a large row count, we see spilling to disk.

    Thanks,

  • Given that memory grants are based on row counts, and row counts are assessed to actually build the execution plan itself, it's pretty safe to assume these are cached.

    It sounds to me like a potential issue with table statistics - are there multiple actions being performed on the temporary table within the procedure? If so, this could cause statistics to be out of date, leading to your bad row estimates. You could look at forcing a statistics update on those tables prior to the actual workload? You would need to have a look at the table statistics involved to understand whether this is the case or not.
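    As a rough sketch (the temp table name is just a placeholder), that can be as simple as refreshing the statistics once the table has been loaded:

    -- Hypothetical temp table name; refresh statistics after the load, before the big join
    UPDATE STATISTICS #Temp1 WITH FULLSCAN;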

  • george.p - Friday, January 27, 2017 8:59 AM

    Given that memory grants are based on row counts, and row counts are assessed to actually build the execution plan itself, it's pretty safe to assume these are cached.

    It sounds to me like a potential issue with table statistics - are there multiple actions being performed on the temporary table within the procedure? If so, this could cause statistics to be out of date, leading to your bad row estimates. You could look at forcing a statistics update on those tables prior to the actual workload? You would need to have a look at the table statistics involved to understand whether this is the case or not.

    My thoughts exactly. 

    The temporary tables are created and filled, with no further modifications or multiple actions being performed, so I'm happy that it's not the statistics.

    It is only with one specific combination of parameters that the temp tables have 1 row each; that plan was cached, and subsequent executions with different parameter combinations performed badly.

    I'm not sure there is anything I can do to avoid the spilling to disk?

  • If the parameter can be linked back to the statistics, it's still going to be related to the row count.

    I'm assuming in an ideal scenario you want correct row counts for both tables? I'd be interested in understanding how the parameters impact the data coming out of those tables, and then having a look at what (I assume are) automatically generated statistics are being used. It might be that some manually created statistics will do a better job?
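    For example (column names here are placeholders, not from the actual tables), single- and multi-column statistics on the join columns are quick to try:

    -- Hypothetical columns; manually created statistics on the temp table
    CREATE STATISTICS st_Temp1_JoinKey ON #Temp1 (JoinKey);
    CREATE STATISTICS st_Temp1_JoinKey_Filter ON #Temp1 (JoinKey, FilterColumn);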

  • Please will you post the query and execution plan(s)?  How about eliminating the temp tables and working directly from the source tables?  Not always possible and/or desirable, but worth considering.

    John

    @george.p - On review, the parameters don't affect the queries used to fill the temporary tables. The parameters are used in a WHERE clause on the final query, the one that pulls the result sets together, to filter the results. The WHERE clause is unrelated to the fields in the temporary tables; it is the joins that filter the row count from the temporary tables.

    @john-2 Mitchell-245523 - I would like to provide the query and execution plans but I'm not permitted to. I understand that makes it very difficult, and I don't expect someone to identify the specific problem; I'm more looking for guidance. I had to introduce the temporary tables to break the query up, because the excessive number of joins was causing the query optimisation to time out and perform poorly, but I might be able to use CTEs instead. I will try that and let you know.
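    Something along these lines, as a rough sketch with placeholder names rather than the real query - with the caveat that the optimiser will inline the CTE back into the outer query, so the timeout could resurface:

    WITH Summary AS
    (
        -- previously SELECT ... INTO #Temp1
        SELECT s.SomeKey, SUM(s.Amount) AS TotalAmount
        FROM dbo.SourceTable AS s
        GROUP BY s.SomeKey
    )
    SELECT f.OtherColumn, sm.TotalAmount
    FROM dbo.FinalTable AS f
    JOIN Summary AS sm
        ON sm.SomeKey = f.SomeKey;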

    Thank you both for your suggestions and guidance.

  • Is there no way you could anonymise the query plan and tables before sending them up?

  • Yes, SQL Sentry Plan Explorer made that very easy. 

    Please find attached the plan. The spilling is in the last statement.

  • Quick thought, if it is spilling to tempdb at times then why not simply remove the join hints and add "with recompile"?
    😎

    Further, have you looked into those missing indices?
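    If it helps, a statement-level OPTION (RECOMPILE) on just the statement that spills is often a lighter touch than WITH RECOMPILE on the whole procedure - a rough sketch with placeholder names:

    -- Statement-level recompile, applied only to the statement that spills
    SELECT t1.SomeKey, t2.SomeValue
    FROM #Temp1 AS t1
    JOIN #Temp2 AS t2
        ON t2.SomeKey = t1.SomeKey
    OPTION (RECOMPILE);

    -- Or procedure-level:
    -- CREATE PROCEDURE dbo.MyReport @Param1 int WITH RECOMPILE AS ...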

  • Eirikur Eiriksson - Saturday, January 28, 2017 12:40 AM

    Quick thought, if it is spilling to tempdb at times then why not simply remove the join hints and add "with recompile"?
    😎

    Further, have you looked into those missing indices?

    'With recompile' would definitely fix the occasional spilling to tempdb, but I would like to avoid that if possible.

    At the moment, the join hints and 'with recompile' are an acceptable solution to achieve consistent performance of the stored procedure, but as a challenge, learning exercise (whatever one may call it) I wanted to see if it were possible to stop the disk spilling without the 'with recompile' query hint.

    I've not spent a huge amount of time investigating the missing indices, as they appear to be unrelated to the joins with the spilling.

    The stored procedure completes in between 1 and 5 seconds, which may seem a long time, but it is a report.

    Thanks for the help!

  • Looking at a part of your last query, I'd start by looking at all those function calls.  If I read it right and you really have one function being called multiple times, try to eliminate the redundant calls.  Without knowing the data types, there's no way to tell if you have any implicit casts going on, so take a look for those.
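    As a contrived example of the kind of implicit cast to look for (names and types are made up, not taken from the plan):

    -- If the column is varchar and the parameter is nvarchar, the column side is
    -- implicitly converted (CONVERT_IMPLICIT), which hurts estimates and index seeks.
    DECLARE @Account nvarchar(20) = N'ABC123';

    SELECT t1.SomeKey
    FROM #Temp1 AS t1
    WHERE t1.AccountCode = @Account;     -- implicit cast on t1.AccountCode

    -- Matching the parameter's type to the column avoids the conversion:
    DECLARE @Account2 varchar(20) = 'ABC123';

    SELECT t1.SomeKey
    FROM #Temp1 AS t1
    WHERE t1.AccountCode = @Account2;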

  • Ed Wagner - Saturday, January 28, 2017 9:28 PM

    Looking at a part of your last query, I'd start by looking at all those function calls.  If I read it right and you really have one function being called multiple times, try to eliminate the redundant calls.  Without knowing the data types, there's no way to tell if you have any implicit casts going on, so take a look for those.

    Hi Ed,

    The function calls are a little misleading once SQL Sentry Plan Explorer obfuscates the plan; they are mostly aggregate function calls, and all of them are necessary built-in functions.

    I will have a review of the plan today to see if there are any implicit casts happening.

    Thanks for the tips.

  • Hoping to be able to update this thread later today with results from all the suggestions.

  • The optimiser is timing out. If you can simplify the query you would almost certainly get a better plan.
    You have duplication of effort: the group of tables Object4, Object14, Object13 are queried twice. Consider simplifying the CTE version using SUM(CASE..., and take a look at SUM() OVER() in books online.
    You've mentioned that the query can be run with either very low rowcounts or much higher rowcounts. Unless you're sure that one plan will suit both scenarios, use two queries. It won't be too much effort if the query can be simplified - which seems likely.
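    For example, conditional aggregation or a windowed SUM lets you read that group of tables once instead of twice (the table and column names here are placeholders for the obfuscated objects):

    -- One pass with conditional aggregation instead of two queries over the same tables
    SELECT o.GroupKey,
           SUM(CASE WHEN o.StatusCode = 'A' THEN o.Amount ELSE 0 END) AS ActiveAmount,
           SUM(CASE WHEN o.StatusCode = 'C' THEN o.Amount ELSE 0 END) AS ClosedAmount
    FROM dbo.Object4 AS o
    GROUP BY o.GroupKey;

    -- Or, where the detail rows need to be kept, a windowed aggregate avoids a self-join
    SELECT o.GroupKey,
           o.Amount,
           SUM(o.Amount) OVER (PARTITION BY o.GroupKey) AS GroupTotal
    FROM dbo.Object4 AS o;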

    Edit:

    Remove your existing join hints. Experiment with hash join hints on Object16 and Object17: estimated cost is 23%, with 25 executions. In fact there are about 30,000 executions and almost the entire table is pulled in each case.
    Examine the generated plan for accuracy and see if the time out has been eliminated.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work - Monday, January 30, 2017 5:11 AM

    The optimiser is timing out. If you can simplify the query you would almost certainly get a better plan.
    You have duplication of effort: the group of tables Object4, Object14, Object13 are queried twice. Consider simplifying the CTE version using SUM(CASE..., and take a look at SUM() OVER() in books online.
    You've mentioned that the query can be run with either very low rowcounts or much higher rowcounts. Unless you're sure that one plan will suit both scenarios, use two queries. It won't be too much effort if the query can be simplified - which seems likely.

    Edit:

    Remove your existing join hints. Experiment with hash join hints on Object16 and Object17: estimated cost is 23%, with 25 executions. In fact there are about 30,000 executions and almost the entire table is pulled in each case.
    Examine the generated plan for accuracy and see if the time out has been eliminated.

    Strange - I had spotted that and thought my optimisation efforts so far had resolved the query timeout; clearly more optimisation is required. Do you happen to know whether the query optimiser still has time to generate sensible estimates when it times out? This could explain some of the spilling.
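    I suppose I can check which cached plans were compiled under an optimiser timeout with something like this rough sketch (the plan XML carries a StatementOptmEarlyAbortReason attribute):

    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT TOP (50)
           st.text,
           qp.query_plan
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE qp.query_plan.exist('//StmtSimple[@StatementOptmEarlyAbortReason="TimeOut"]') = 1;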

    When trying to stop the query timeouts I created Objects 1 and 6 (the temporary tables) to break up the query and avoid more repetition. I've been considering breaking the query up into reusable objects (views, inline table-valued functions); although that would make the code simpler and more readable, I expect the optimiser to inline the objects back together, creating a large query that would still suffer from optimisation timeouts. But I don't know, and will need to test this.

    I will re-write that CTE.

    Thanks,
