outer join performance question

  • We have a query that aggregates data ready for reporting in our MIS system.

    It ran in approx 30 secs last week. When we revisited it this week it was taking 30+ mins to execute.

    The only difference I can think of is that the day_sales table now has approx 5m rows (as opposed to 4.5m last week).

    The query plan uses a nested loop join with a full table scan on sls_music_hierarchy2, to which it allocates 1% of the query cost.

    49% is then allocated to a bookmark lookup and 49% to an index scan on day_sales(week_key).

    There are non-clustered indexes on store_key and week_key on day_sales, and no indexes on sls_music_hierarchy2 (815 rows).

    Query below; any ideas appreciated.

    select
        h.format_key,
        h.sdept_key,
        h.store_key,
        h.week_key,
        h.week_comm,
        sum(r.retail_value) actual
    into sls_mus_tmp1_2
    from sls_music_hierarchy2 h
    LEFT OUTER JOIN day_sales r ON
        h.format_key = r.format_key AND
        h.store_key = r.store_key AND
        h.week_key = r.week_key
    group by
        h.format_key,
        h.sdept_key,
        h.store_key,
        h.week_key,
        h.week_comm

  • Forgot to mention - we do have auto update statistics turned on.

    thanks

  • You should manually update the statistics. Auto update statistics does not run immediately after you dump data into the database.

    Yes, SQL 2000 may jump in and update the stats but it certainly doesn't look like it has in your case, and why leave it to chance?

    Checking BOL, the auto update runs "periodically" based on the number of rows in the table. SQL 2000 may consider your new 500,000 records not enough to trigger an update.

    Run Update Stats after every data dump and save yourself a lot of worry.
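    A minimal sketch of what that step could look like, using the table names from the query above (FULLSCAN is optional - it is slower but gives the most accurate statistics):

    -- Refresh statistics on the big table straight after the load.
    UPDATE STATISTICS day_sales WITH FULLSCAN

    -- Or, to refresh every statistic in the database in one pass:
    EXEC sp_updatestats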

    Patrick

    Quand on parle du loup, on en voit la queue

  • After lots of trial and error I managed to fix the problem.

    When I reduced the number of rows in the dominant table from 815 to 10, the query optimizer chose a hash join and the query ran in 5 secs. BOL suggested that nested loop joins are better when joining a small table to a larger table, without giving any specifics, so I assumed the optimizer had chosen correctly. However, when I specified a hash join in the original query using a hint:

    LEFT OUTER hash JOIN day_sales r ON

    the query ran in under 10 secs.
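    The same thing can also be forced at the query level rather than on the individual join. A sketch against the original query (the INTO target name here is hypothetical, just so it doesn't collide with the table already created):

    select h.format_key, h.sdept_key, h.store_key, h.week_key, h.week_comm,
           sum(r.retail_value) actual
    into   sls_mus_tmp1_2_hash            -- hypothetical target table name
    from   sls_music_hierarchy2 h
    LEFT OUTER JOIN day_sales r ON
           h.format_key = r.format_key AND
           h.store_key = r.store_key AND
           h.week_key = r.week_key
    group by h.format_key, h.sdept_key, h.store_key, h.week_key, h.week_comm
    OPTION (HASH JOIN)                    -- applies to every join in the query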

    Is this a case of the optimizer choosing incorrectly (does this happen often?), or is it down to the auto statistics update and the optimizer having out-of-date data to work with?

    thanks for your help.

    Kevin

  • It could be a mixture. From this distance it's hard to tell.

    A nested loop join joins the tables in a series of nested iterations. In a two-table join, every selected row in the outer table causes the inner table to be accessed. This is a scan (but not a table scan). The number of times the inner table is accessed is called the scan count. The outer table will have a scan count of 1; the inner table will have a scan count equal to the number of rows of the outer table. If the outer table supplies only a few rows to the query (it's a small table, or is filtered by the WHERE criteria) and the inner table has an index on the join column, then nested loop joins are very efficient. But if large tables are being joined then a hash or merge join would be more efficient.
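    You can watch those scan counts for yourself with STATISTICS IO. A sketch using the tables from the query above (the column list is cut down just for illustration):

    SET STATISTICS IO ON

    SELECT h.format_key, SUM(r.retail_value)
    FROM sls_music_hierarchy2 h
    LEFT OUTER JOIN day_sales r ON
         h.format_key = r.format_key AND
         h.store_key = r.store_key AND
         h.week_key = r.week_key
    GROUP BY h.format_key

    SET STATISTICS IO OFF

    -- The messages pane reports "Scan count" and "logical reads" per table.
    -- With a nested loop plan the inner table's scan count is roughly the
    -- number of rows coming out of the outer table.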

    A hash join has two inputs - the build and the probe. The build is typically the smaller table. An interesting point of a hash join is that there need be no useful indexes on the tables being joined. The query optimizer chooses (I hope) the smaller table as the build and reads every row in turn from it. For each row the value in the join column is processed by a hashing algorithm. This value is used to identify a cell in memory - a hash bucket. The number of hash buckets is a function of the size of the build input. Once that is finished the probe input starts. Each row in the probe, the bigger table, is read and the value of its join column is put through the same hash algorithm. This identifies a hash bucket, and the engine checks whether any rows from the build input landed in the same bucket. If there are, the build row is retrieved and returned to the query together with the probe row. If there is no match the row may be discarded (it depends on the type of join).

    With a hash join the rows are visited just once. Memory is needed for the hash buckets, so hash joins are memory and CPU intensive. They are better than nested loops if both tables are large. However, because the entire build input is processed before any rows are returned, hash joins are not efficient when the first row must be retrieved quickly.

    I would still recommend that you manually update your statistics.

    Patrick

    Quand on parle du loup, on en voit la queue

  • Thanks for the explanation.

    I read (sqlsever-perfomance.com) that for a table with over 500 rows, stats are auto-updated once 500 + 20% of the rows have been updated/inserted. In this case that's 500 + 1m = 1,000,500 rows, so pretty infrequently.

    I will update stats manually from now on.

    Kevin

  • Kevin,

    Glad I could help.

    Yes, just add a step to the job that loads the data in, to update statistics afterwards. And don't rely on auto anything...


    Patrick

    Quand on parle du loup, on en voit la queue

  • The bookmark lookup can be eliminated altogether by modifying your index to be a covered index; I noticed that no-one had mentioned that. There's no guarantee that it would run faster, but it's definitely worth a look.

  • You know, I didn't think of that. Too focused on indexing and joins.

    Patrick

    Quand on parle du loup, on en voit la queue

  • Hello Patrick,

    Please forgive my ignorance, but can you explain what a covered index is?

    Thanks,

    Julie

  • A covered index is one which contains all fields referenced by the query. In this case the query gets ALL its data from the index and never touches the table itself. This saves the SQL Server engine from having to find the index value and then do a bookmark lookup to retrieve the data values; it simply gets the values when it finds the index row. In many cases this technique can reduce execution time many times over. Of course, in some cases it can show no difference, or actually increase the time. Generally, and especially where there is no clustered index, it will show a huge improvement. However, the flip side also has to be considered, which is that the index must be maintained, and it adds to the overall cost of updates or inserts on the table, especially when the indexes are very wide (involving many fields). Using this technique is always something that has to be evaluated thoroughly and carefully, but as I said, it's well worth it in many cases.

  • I think that if you create a compound index on all the join columns in day_sales (and the aggregate column), the optimizer can get rid of the bookmark lookup and do an index seek on the table, which should be quicker. Something along the lines of the sketch below.
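    This is only a sketch - the index name is made up and the key column order is a guess that would need testing. SQL 2000 has no INCLUDE clause, so retail_value goes into the key to cover the SUM():

    -- Covering non-clustered index on day_sales: the query can then be
    -- answered from the index alone, with no bookmark lookup.
    CREATE NONCLUSTERED INDEX ix_day_sales_covering
        ON day_sales (week_key, store_key, format_key, retail_value)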

    ducking for cover.

    kevin

  • Can anyone tell me how to automate the process of Update Statistics with a 100 percent sample?

    The default behaviour of update statistics is a 10 percent sample. In my database there are more than 2000 tables and it is impossible for me to update them manually with a 100 percent sample.

    If anyone has an idea, please let me know as a priority.

    Mansoor

  • If you have any time when the database is not in use or has low usage, set up weekly jobs to update stats at those times. You may need to stagger the jobs through the week to cover all your tables. You could also look at triggers on tables which may have large updates - the trigger could check whether count(*) from deleted or inserted is greater than, say, 5000 (an arbitrary value for illustration only), and then update a table to tell the scheduled task to update stats for that table.
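    A sketch of what such a scheduled job step could look like for the 100 percent sample question above - no error handling, and it assumes all tables are owned by dbo:

    -- Update statistics with a full scan on every user table.
    DECLARE @tbl sysname

    DECLARE tbl_cur CURSOR FOR
        SELECT TABLE_NAME
        FROM   INFORMATION_SCHEMA.TABLES
        WHERE  TABLE_TYPE = 'BASE TABLE'

    OPEN tbl_cur
    FETCH NEXT FROM tbl_cur INTO @tbl

    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC ('UPDATE STATISTICS [' + @tbl + '] WITH FULLSCAN')
        FETCH NEXT FROM tbl_cur INTO @tbl
    END

    CLOSE tbl_cur
    DEALLOCATE tbl_cur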

    Alternatively, and better, if you have clustered primary key indexes on your tables, you could check the column 'rowmodctr' in the 'sysindexes' system table.
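    Something along these lines would do it (a sketch; the 10000 threshold is arbitrary and just for illustration):

    -- Tables with a large number of row modifications recorded since their
    -- statistics were last updated (SQL 2000 sysindexes).
    SELECT OBJECT_NAME(i.id) AS table_name,
           i.rowmodctr
    FROM   sysindexes i
    WHERE  i.indid IN (0, 1)                       -- heap or clustered index row
      AND  OBJECTPROPERTY(i.id, 'IsUserTable') = 1
      AND  i.rowmodctr > 10000                     -- arbitrary threshold
    ORDER BY i.rowmodctr DESC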

    Don't forget to rebuild indexes as well.

    Tim

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • For very small tables, and for lookup tables with static data (these are likely to be the same tables in practice), there is no need to update stats regularly - so these could be left out of the process.

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant
