group by optimize

  • I have a table, tShortTerm, that I need to store a crunched down version of for the long term, tLongTerm.

    So the query that crunches it down looks like this:

    insert into tLongTerm

    select userId, max(date) from tShortTerm

    group by userId

    On tShortTerm I have a unique clustered index on userId and date desc but looking at the execution plan, I can see that it always does a scan (not a seek) on the clustered index.

    I thought that if the unique clustered index was on userId and date desc that maybe query optimizer would be smart enough not to search further once it found the first row with a new userId since that row would be the row with the max(date) for that userID??

    How can I make a query with a group by do a seek?

    Thanks,

    -J

  • In most cases, Index selection is based on SARGs

    which is determined by your [WHERE] clause.

    The rule is that for an index to be considered by the optimizer, at least

    the first column of the index must match the SARG

    In your case you can write it the query

    select userId, max(date)

    from tShortTerm

    Where userid IS NOT NULL -- SARG

    group by userId

    MW


    MW

  • The covering index (all fields required are in the index) will be helping performance compared to a table scan, especially if the table is wide. This is because the query does not need to referenec the data pages individually, but ony the (more compact)information in the index, which is also sorted ready for the aggregating operation.

    But you can't expect an index seek, as the query requires that every date value be inspected for every userid.

    You could use a trigger to maintain your summary table, but this is very likely to impact the performance of updates.

    This query doesn't look like a killer. It also looks like the kind of thing you might schedule to run at a quiet time (if there is one).

    Tim

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

  • quote:


    In most cases, Index selection is based on SARGs

    which is determined by your [WHERE] clause.


    Forgive the newbility, but WTF is a SARG? I've never seen it in WROX, and BOL is giving me a blank stare.

  • From the Sybase site.

    quote:


    SARG (search argument) ­ the query's qualifying clause(s).


    In other words, the WHERE clause ...

  • quote:


    Tim:

    Index covering is for non-clustered indexes, but in this case the aggregate

    function would force index covering b/c the optimizer in almost all cases

    would resort to index covering whenever it encounters aggregates

    quote:


    talleman:

    but WTF is a SARG?


    What is the point


    MW

    Edited by - mworku on 06/19/2003 08:26:10 AM

    Edited by - mworku on 06/19/2003 08:26:42 AM


    MW

  • The results are in:

    quote:


    WTF is a SARG?


    That's a searchable argument I think... when you look at the execution plan at the bottom of the bubbles that pop up over each icon there is a section that says arguments. It first lists OBJECT indicating what index it used. Then if a SARG is present it may list SEEK: for instance where userID = @userid. Based off of the where clause, the optimizer will guess which index to use.

    I ran the following three statements:

    insert into tImpressionViewThroughBuilder

    select userID, max(time)

    from tImpressionWithID1

    where userId is not null

    group by userID

    truncate table tImpressionViewThroughBuilder

    insert into tImpressionViewThroughBuilder

    select userID, max(time)

    from tImpressionWithID1

    group by userID

    There are no null's in the userId column so the where clause has no effect on the actual data returned however, it had a big effect on the execution plans.

    A clustered index seek was performed in the first block. 58% of the query runtime

    A clustered index scan was performed in the second block. 83% of the query runtime

    The first then used parallelism when performing the stream aggregate (4 processors on this machine) while the second didn't use parallelism.

    At the end the runtime percentage was 41% vs 59% of the total time which cut the query from 3 minutes 43 seconds to 2 minutes 7 seconds... (I know the math doesn't work out, but there are some other things going on on the server)

    Thanks all, but I'll be posting some more. Hopefully, my learning can help you all out aswell.

    -J

    Edited by - jraha on 06/19/2003 08:52:42 AM

  • jraha's query forces the index to be used, but you may want to rethink that composite index.

    For example, tShortTerm has a one to many relationship with tLongTerm, but as long as new rows are inserted into tShortTerm sequentially by date then you should add an identity column to tShortTerm and use that instead.

    For example, say you have a composite clustered index on identity (column stid) and userid. (having stid first will speed up your inserts to tLongTerm, but having userid first would probably be more useful when selecting.)

    insert into tLongTerm

    Select st1.userid, st1.date

    From

    (

    select userId, max(stid) as maxStid

    from tShortTerm (nolock)

    Where stid is not null --only necessary if stid if 1st in index

    Group by userid

    ) st

    JOIN tShortTerm st1 (nolock) on st.maxStid= st1.stid

    Having a nonclustered index on date is most likely a good idea as well, but experiment. Stay away from max(date) as often as you can!

    cl

    Signature is NULL

  • Thank you NPeters,

    People love to use big vocabulary when it is unecessary. I am with you.

    What a stupidity! Let's get real and let's speak in English! Only DBA with complex will start talking with weird language.

  • Hey guys,

    This is very interesting. The where statement influencing the quesry optimization.

    Let's get real!!!! MIcrosoft is very stupid... they can have had forecasted that; adding a where statement just to improve the performance is a non-sense but well, I guess it pays...sometimes Microsoft gives me a big laugh 🙂

  • quote:


    but as long as new rows are inserted into tShortTerm sequentially by date then you should add an identity column to tShortTerm and use that instead.


    Good point! unless he also frequently has to query on a date-range, in which case I'd think you would want the clustered index on the date column, but a non-clustered index on an identity column would probably still be a good idea for the max() query even then.

    Thanks for the idea!

  • quote:


    Hey guys,

    This is very interesting. The where statement influencing the quesry optimization.

    Let's get real!!!! MIcrosoft is very stupid... they can have had forecasted that; adding a where statement just to improve the performance is a non-sense but well, I guess it pays...sometimes Microsoft gives me a big laugh 🙂


    Come down bro.

    How would you do it ?

    MW

    Edited by - mworku on 06/20/2003 11:19:24 AM


    MW

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply