How do you limit records using DISTINCT and COUNT when TOP and ROWCOUNT don't work?

  • Lynn Pettis (1/13/2010)


    ifila (1/13/2010)


CREATE TABLE #OutputEmails
(
    Email1 nvarchar(1000),
    Email2 nvarchar(1000),
    Email3 nvarchar(1000),
    Email4 nvarchar(1000),
    Email5 nvarchar(1000),
    Email6 nvarchar(1000),
    Email7 nvarchar(1000),
    Email8 nvarchar(1000),
    Email9 nvarchar(1000),
    Email10 nvarchar(1000)
);

INSERT INTO #OutputEmails
SELECT Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
FROM OutputResume3
WHERE CONTAINS(originalresume, '"livelink" and "java"')

SELECT COUNT(DISTINCT Email) FROM
(
    SELECT TOP 1000 Email1 AS Email FROM #OutputEmails
    UNION
    SELECT Email2 AS Email FROM #OutputEmails
    UNION
    SELECT Email3 AS Email FROM #OutputEmails
    UNION
    SELECT Email4 AS Email FROM #OutputEmails
    UNION
    SELECT Email5 AS Email FROM #OutputEmails
    UNION
    SELECT Email6 AS Email FROM #OutputEmails
    UNION
    SELECT Email7 AS Email FROM #OutputEmails
    UNION
    SELECT Email8 AS Email FROM #OutputEmails
    UNION
    SELECT Email9 AS Email FROM #OutputEmails
    UNION
    SELECT Email10 AS Email FROM #OutputEmails
) AS T
WHERE Email LIKE '%@%'

    You do realize that the TOP 1000 only applies to the first query in the union, not all the rest of them, right?

Yes, I was doing some testing.
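To limit the whole union rather than just the first branch, one sketch (using the #OutputEmails table from the post above) is to move the TOP outside the derived table:

```sql
-- Sketch only: a TOP inside one branch of a UNION applies to that
-- branch alone, so wrap the whole union and apply TOP to the result.
SELECT TOP 1000 Email
FROM (
    SELECT Email1 AS Email FROM #OutputEmails
    UNION       -- UNION (not UNION ALL) already removes duplicates
    SELECT Email2 AS Email FROM #OutputEmails
    UNION
    SELECT Email3 AS Email FROM #OutputEmails
    -- ... remaining branches for Email4 through Email10 ...
) AS T
WHERE Email LIKE '%@%'
```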

  • Lynn Pettis (1/13/2010)


Could you please explain what you are trying to achieve? What is the business case for this query (now I'm starting to sound like Jeff Moden)? How are you using it?

When a user wants to know how many resumes / candidate emails we have for a certain skill set, they can press the Count button to see the size of the group.

e.g. They search for Java and get a count of 300k prospective candidates, so they add Oracle, which takes it down to 100k; then they add SharePoint, down to 10k; then j2ee, down to 1k, which is now a workable group.

Occasionally on very large counts we get a timeout error, so I wanted to limit the count: give a number if under 50k, and just say 'Over 50k records' if over 50k records, instead of risking a timeout scenario.

Make sense?
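One sketch of that "cap the count" idea (assuming the #OutputEmails layout from the earlier post) is to fetch at most 50,001 distinct addresses, so the server never has to count the full result set:

```sql
-- Sketch only: cap the work at 50,001 rows, then decide what to show.
-- If the cap is reached, all we know is the true count exceeds 50,000.
DECLARE @cnt int;

SELECT @cnt = COUNT(*)
FROM (
    SELECT DISTINCT TOP 50001 Email
    FROM (
        SELECT Email1 AS Email FROM #OutputEmails
        UNION
        SELECT Email2 AS Email FROM #OutputEmails
        -- ... remaining branches for Email3 through Email10 ...
    ) AS T
    WHERE Email LIKE '%@%'
) AS capped;

SELECT CASE WHEN @cnt > 50000
            THEN 'Over 50k records'
            ELSE CAST(@cnt AS varchar(10))
       END AS DisplayCount;
```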

  • ifila (1/13/2010)


    Lynn Pettis (1/13/2010)


Could you please explain what you are trying to achieve? What is the business case for this query (now I'm starting to sound like Jeff Moden)? How are you using it?

When a user wants to know how many resumes / candidate emails we have for a certain skill set, they can press the Count button to see the size of the group.

e.g. They search for Java and get a count of 300k prospective candidates, so they add Oracle, which takes it down to 100k; then they add SharePoint, down to 10k; then j2ee, down to 1k, which is now a workable group.

Occasionally on very large counts we get a timeout error, so I wanted to limit the count: give a number if under 50k, and just say 'Over 50k records' if over 50k records, instead of risking a timeout scenario.

Make sense?

    Why does it take so long?

    Here's the query...

SELECT TOP 15 Email
FROM (
    SELECT RowID, Emails, Email
    FROM (
        -- substitute this for your own query
        SELECT RowID, Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
        FROM #Sample
        -- /substitute this for your own query
    ) p
    UNPIVOT
    (Email FOR Emails IN
        (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)
    ) AS unpvt
) d
GROUP BY Email
ORDER BY NEWID()
-- (15 row(s) affected) / 00:00:01
-- (1 second to extract a random set of 15 distinct email addresses)

    Here's some sample data:

-- create some sample data
DROP TABLE #Sample
CREATE TABLE #Sample (RowID INT IDENTITY (1,1),
    Email1 VARCHAR(60), Email2 VARCHAR(60), Email3 VARCHAR(60), Email4 VARCHAR(60), Email5 VARCHAR(60),
    Email6 VARCHAR(60), Email7 VARCHAR(60), Email8 VARCHAR(60), Email9 VARCHAR(60), Email10 VARCHAR(60))

INSERT INTO #Sample (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)
SELECT TOP 100000
    Email1 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email2 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email3 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email4 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email5 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email6 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email7 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email8 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email9 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email10 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com'
FROM master.dbo.syscolumns a, master.dbo.syscolumns b

UPDATE u SET
    Email2 = CASE WHEN d2.RowID = u.RowID THEN NULL ELSE Email2 END,
    Email3 = CASE WHEN d3.RowID = u.RowID THEN NULL ELSE Email3 END,
    Email4 = CASE WHEN d4.RowID = u.RowID THEN NULL ELSE Email4 END,
    Email5 = CASE WHEN d5.RowID = u.RowID THEN NULL ELSE Email5 END,
    Email6 = CASE WHEN d6.RowID = u.RowID THEN NULL ELSE Email6 END,
    Email7 = CASE WHEN d7.RowID = u.RowID THEN NULL ELSE Email7 END,
    Email8 = CASE WHEN d8.RowID = u.RowID THEN NULL ELSE Email8 END,
    Email9 = CASE WHEN d9.RowID = u.RowID THEN NULL ELSE Email9 END,
    Email10 = CASE WHEN d10.RowID = u.RowID THEN NULL ELSE Email10 END
FROM #Sample u
LEFT JOIN (SELECT TOP 30 PERCENT RowID FROM #Sample ORDER BY NEWID()) d2 ON d2.RowID = u.RowID
LEFT JOIN (SELECT TOP 40 PERCENT RowID FROM #Sample ORDER BY NEWID()) d3 ON d3.RowID = u.RowID
LEFT JOIN (SELECT TOP 50 PERCENT RowID FROM #Sample ORDER BY NEWID()) d4 ON d4.RowID = u.RowID
LEFT JOIN (SELECT TOP 60 PERCENT RowID FROM #Sample ORDER BY NEWID()) d5 ON d5.RowID = u.RowID
LEFT JOIN (SELECT TOP 70 PERCENT RowID FROM #Sample ORDER BY NEWID()) d6 ON d6.RowID = u.RowID
LEFT JOIN (SELECT TOP 80 PERCENT RowID FROM #Sample ORDER BY NEWID()) d7 ON d7.RowID = u.RowID
LEFT JOIN (SELECT TOP 90 PERCENT RowID FROM #Sample ORDER BY NEWID()) d8 ON d8.RowID = u.RowID
LEFT JOIN (SELECT TOP 95 PERCENT RowID FROM #Sample ORDER BY NEWID()) d9 ON d9.RowID = u.RowID
LEFT JOIN (SELECT TOP 99 PERCENT RowID FROM #Sample ORDER BY NEWID()) d10 ON d10.RowID = u.RowID

SELECT * FROM #Sample -- 100,000 rows / 00:00:18 (18 seconds to prepare sample data)

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

A count of unique email addresses, I assume. Does the user have the option to restrict the date range on the search?

    If so, I would also recommend a change to your table. One, drop and recreate the primary key as a nonclustered index. Two, create a clustered index on currentdateout.
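As a sketch of that change (the constraint name PK_OutputResume3 and key column ResumeID here are made up; check the real names with sp_helpindex before running anything):

```sql
-- Hypothetical object names: substitute the actual PK constraint name
-- and key column from your table before running.
ALTER TABLE OutputResume3 DROP CONSTRAINT PK_OutputResume3;

ALTER TABLE OutputResume3
    ADD CONSTRAINT PK_OutputResume3 PRIMARY KEY NONCLUSTERED (ResumeID);

CREATE CLUSTERED INDEX CIX_OutputResume3_currentdateout
    ON OutputResume3 (currentdateout);
```

The idea is that date-range predicates can then become a clustered index seek over a contiguous range rather than a scan of the whole table.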

  • GilaMonster (1/13/2010)


    Try this then

SELECT COUNT(Email) AS NumberOfEmails -- get the count
FROM (
    SELECT DISTINCT TOP 15 Email -- get 15 unique emails
    FROM (
        SELECT Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
        FROM outputresume3
        WHERE CONTAINS(originalresume, '"j2ee" and "java"')
          AND currentdateout BETWEEN '2000-01-01' AND '2010-01-06'
    ) p
    UNPIVOT (Email FOR Emails IN (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)) AS unpvt
    WHERE LEN(Email) > 0
) sub

    I'm curious as to why. You just want the number 15? For what purpose?

    We have a winner:-)

    The 15 was for test purposes.

    Your solution works - thank you!

  • Dave Ballantyne (1/10/2010)


I recently went to buy a simple car, of all things. I went into the showroom and said "I want a car!".

God, you would not believe the stuff that came out of this guy's mouth. "What was my budget?", "What luggage capacity would I require?", "What MPG would I be expecting?" MPG!!!! WTF does that even mean? I told him that he was wasting my time, and that if he could answer my original question correctly he should just have ignored me.

Anyway, long story short, I phoned an offshore car expert, who recommended a Trabant. So I hopped on the first available plane to the old East Germany and bought the first one I saw. Shipping fees were quite high to bring it home, but still, job done, I've got my car.

Now my problem is, my pet tiger, who accompanies me everywhere, can't fit on the back seat. What should I do?

    That is awesome!

  • Lynn Pettis (1/13/2010)


    A count of unique email addresses, I am assuming as well. Does the user have the option to restrict the date range on the search? YES

    If so, I would also recommend a change to your table. One, drop and recreate the primary key as a nonclustered index. Two, create a clustered index on currentdateout.

    At the moment the 'Primary Key' is clustered, so you suggest changing to nonclustered.

    At the moment 'currentdateout' is nonclustered and you suggest changing to clustered.

    What performance improvement might that provide?

BTW, based on some initial performance testing with Gail's solution:

If I set TOP 60000 using her query, it takes 8 seconds to get the count.

If I set TOP 100000 on Email1, which gives approx 60000 records, it takes 2 seconds to get the count.

So I guess the subquery slows things down quite a bit.

Yes, I am recommending that the primary key be changed to a nonclustered index and that currentdateout be made the clustered index.

    If you look at your actual execution plan, you will probably see that this query is doing a clustered index scan, which is essentially a table scan. Even if you reduce the date range in the query to say 2009-01-01 to 2009-12-31 from the current 10 year range, it probably will still do a clustered index scan.

    This is one reason I asked if the user can limit the date range of the query as well.

  • Chris Morris-439714 (1/13/2010)



I don't have a 'rowid' column on my table. Is the only solution to add this to my table as a workaround?

  • No.

  • ifila (1/13/2010)


    What performance improvement might that provide?

BTW, based on some initial performance testing with Gail's solution:

If I set TOP 60000 using her query, it takes 8 seconds to get the count.

If I set TOP 100000 on Email1, which gives approx 60000 records, it takes 2 seconds to get the count.

So I guess the subquery slows things down quite a bit.

    Post the execution plans please (as a .sqlplan file, zipped and attached to the thread) and I'll give you some tuning advice. I didn't use much data to test, plus I don't have the full text catalog, so no way to do proper tuning here.

    If you're unsure how to get the execution plans, have a look at this article. http://qa.sqlservercentral.com/articles/SQLServerCentral/66909/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • ifila (1/13/2010)


I don't have a 'rowid' column on my table. Is the only solution to add this to my table as a workaround?

No, you don't need it at all. But I like it there. You can remove it if you want. In any case, that was just the sample data. This is the query, with the column removed:

SELECT TOP 15 Email
FROM (
    SELECT Emails, Email
    FROM (
        -- substitute this for your own query
        SELECT Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
        FROM #Sample
        -- /substitute this for your own query
    ) p
    UNPIVOT
    (Email FOR Emails IN
        (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)
    ) AS unpvt
) d
GROUP BY Email
ORDER BY NEWID()
-- (15 row(s) affected) / 00:00:01
-- (1 second to extract a random set of 15 distinct email addresses)

    Cheers

    ChrisM@home



  • GilaMonster (1/13/2010)



    Post the execution plans please (as a .sqlplan file, zipped and attached to the thread) and I'll give you some tuning advice. I didn't use much data to test, plus I don't have the full text catalog, so no way to do proper tuning here.

    If you're unsure how to get the execution plans, have a look at this article. http://qa.sqlservercentral.com/articles/SQLServerCentral/66909/

    Please find attached the execution plan.

    Thanks

Again, I have to ask, can the user specify the date range used in the query?

  • Lynn Pettis (1/13/2010)


Again, I have to ask, can the user specify the date range used in the query?

Apologies, I missed your question - yes they can.

Selection criteria can be:

Email

From Date

To Date

Skills (comma-delimited); the resume is stored as a full-text index (FTI)
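Putting those criteria together, a rough sketch of the parameterised count (parameter names are illustrative; the CONTAINS search string would be built from the comma-delimited skills list):

```sql
-- Sketch only: @skills is assumed to be assembled by the application
-- from the user's comma-delimited skills, e.g. '"java" AND "oracle"'.
DECLARE @skills nvarchar(400), @from datetime, @to datetime;
SET @skills = N'"java" AND "oracle"';
SET @from = '20000101';
SET @to = '20100106';

SELECT COUNT(*) AS Matches
FROM OutputResume3
WHERE CONTAINS(originalresume, @skills)
  AND currentdateout BETWEEN @from AND @to;
```

With currentdateout as the clustered index key (as suggested above), the date predicate can narrow the rows before the full-text results are joined in.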
