4 hour plus query

  • Hi there,

    I have the following tables:

    UserActivity (userId, date, col1, col2, col3)

    UserImpression (userId, date, cola, colb, colc)

    UserDate (userId, latestDate)

    UserActivity has 10 million rows.

    UserImpression has 100 million rows.

    UserDate has 70 million rows.

    UserImpression and UserActivity store information based on the joint primary key of userId and date

    UserDate stores the latest date that any information for a particular userId has been stored thus a primary key on userId alone.

    These tables need to be joined together as follows:

    select I.*

    from UserActivity A

    join UserDate D on D.userId = A.userId

    and D.date = A.date

    join UserImpression I on D.userId= I.userId and D.latestDate = I.date

    If this query returns 70 million rows, what is a reasonable expected runtime?

    I'm currently running this on a machine with 2 Gigs memory and 4 PIII Xeon 700MHz processors

    Due to the enormous size and frequency of data turnover of UserImpression, I have created that table as a partitioned view with the check constraint on the date column. Each day has approximately 30 million rows and 30 days of data are stored.

    Please help.

    Edited by - jraha on 06/24/2003 11:32:27 AM

  • No idea. Kinda overwhelmed at the idea of actually returning 70 million rows - what do you do with them? You could try just returning a subset and extrapolating from there.

    Andy

    http://qa.sqlservercentral.com/columnists/awarren/

  • Actually I didn't express that correctly... I would return the 70 million rows into a new table and eventually run some aggregating queries on that.

  • the ultimate wet dream would be getting this one sub-second πŸ˜‰ I'd recomment on matching the indexes (clustered ? ) and trying to get them in ram (more than 2gb). Of course you'll be disk dependant because of massive IO-ops (at least for userimpression).

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution πŸ˜€

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • the ultimate wet dream would be getting this one sub-second πŸ˜‰ I'd recomment on matching the indexes (clustered ? ) and trying to get them in ram (more than 2gb). Of course you'll be disk dependant because of massive IO-ops (at least for userimpression).

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution πŸ˜€

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Normally such operations are required for data warehousing sort of applications. But even there we can better represent information using some statistical ideas like summing, frequency table etc. If you can post the exact requirement I think many will be there to help you out

    Instead of join if you are for β€œexists β€œ in where clause then it will not block the tables as a whole since the sub query runs row by row, even it may take one more hour (joke)

    Regards

    John

    
    

    select I.*
    from
    fom UserImpression I
    where
    exists (
    select *
    from
    UserActivity A ,
    UserDate D
    where
    D.userId= I.userId and D.latestDate = I.date
    and UserDate D on D.userId = A.userId
    )

  • First off based on

    quote:


    UserDate (userId, latestDate)


    should this line

    quote:


    join UserDate D on D.userId = A.userId

    and D.date = A.date


    actually be

    join UserDate D on D.userId = A.userId

    and D.latestDate = A.date

    and if so then why not just do

    
    
    SELECT
    I.*
    FROM
    UserActivity A
    INNER JOIN
    UserImpression I
    ON
    A.userId= I.userId AND
    A.[date] = I.[date]

    Even so based on this

    quote:


    UserImpression (userId, date, cola, colb, colc)


    70 million rows may or may not fit in 2GB as I don't know what userid, date, cola, colb, colc are as far as width so I can't tell you if your running into any memory bottlenecks here.

    But you come back and say

    quote:


    I would return the 70 million rows into a new table


    To get maximum insert performance make sure all constraints, triggers and indexes are disabled or removed from this table then add or enable after insert is done.

    Also I assume your indexes would be

    UserDate (userId, latestDate) CLUSTERED

    UserImpression (userId, date) CLUSTERED

    UserActivity (userId, date) CLUSTERED

    To maximize join performance, if you always use userid and latestdate/date to join these tables and work with them I would consider also making them

    datefield, userid as the order of the columns in the index. It is best to have you most unique value first in a composite (multicolumn) index because only the first column is sampled to build the statistics value.

    Another thing which I assume you have already done is the columns used in the join work fastest as long as they are the same datatype. Otherwise SQL will auto-type or you will need to perform an explicit CAST (which is better than SQL auto-typing) if they are not.

    Other than that it is a matter of execution plan and tweaking to see what can be done.

  • If you do run into performance problems due to a lack of memory, consider breaking up the whole insert into smaller chunks.

    Should be fairly easy, based on the description you gave of the UserDate table.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply