How do you limit records using DISTINCT and COUNT when TOP and ROWCOUNT don't work?

  • Lynn Pettis (1/13/2010)


    ifila (1/13/2010)


CREATE TABLE #OutputEmails
(
    Email1 nvarchar(1000),
    Email2 nvarchar(1000),
    Email3 nvarchar(1000),
    Email4 nvarchar(1000),
    Email5 nvarchar(1000),
    Email6 nvarchar(1000),
    Email7 nvarchar(1000),
    Email8 nvarchar(1000),
    Email9 nvarchar(1000),
    Email10 nvarchar(1000)
);

INSERT INTO #OutputEmails
SELECT Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
FROM OutputResume3
WHERE CONTAINS(originalresume, '"livelink" and "java"')

SELECT COUNT(DISTINCT Email) FROM
(
    SELECT TOP 1000 Email1 AS Email FROM #OutputEmails
    UNION
    SELECT Email2 AS Email FROM #OutputEmails
    UNION
    SELECT Email3 AS Email FROM #OutputEmails
    UNION
    SELECT Email4 AS Email FROM #OutputEmails
    UNION
    SELECT Email5 AS Email FROM #OutputEmails
    UNION
    SELECT Email6 AS Email FROM #OutputEmails
    UNION
    SELECT Email7 AS Email FROM #OutputEmails
    UNION
    SELECT Email8 AS Email FROM #OutputEmails
    UNION
    SELECT Email9 AS Email FROM #OutputEmails
    UNION
    SELECT Email10 AS Email FROM #OutputEmails
) AS T
WHERE Email LIKE '%@%'

    You do realize that the TOP 1000 only applies to the first query in the union, not all the rest of them, right?

Yes, I was doing some testing.
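To limit the whole union rather than just the first branch, one sketch (using the #OutputEmails table from the post above) is to move the TOP outside the derived table:

```sql
-- Sketch only: a TOP inside one branch of a UNION applies to that
-- branch alone, so wrap the whole union and apply TOP to the result.
SELECT TOP 1000 Email
FROM (
    SELECT Email1 AS Email FROM #OutputEmails
    UNION       -- UNION (not UNION ALL) already removes duplicates
    SELECT Email2 AS Email FROM #OutputEmails
    UNION
    SELECT Email3 AS Email FROM #OutputEmails
    -- ... remaining branches for Email4 through Email10 ...
) AS T
WHERE Email LIKE '%@%'
```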

  • Lynn Pettis (1/13/2010)


Could you please explain what you are trying to achieve? What is the business case for this query (now I'm starting to sound like Jeff Moden)? How are you using it?

When a user wants to know how many resumes / candidate emails we have for a certain skill set, they can press the Count button to see the size of the group.

e.g. They search for Java and get a count of 300k prospective candidates, so they add Oracle, which takes it down to 100k; then they add SharePoint, down to 10k; then j2ee, down to 1k, which is now a workable group.

Occasionally on very large counts we get a timeout error, so I wanted to limit the count: give a number if under 50k, and just say 'Over 50k records' if over 50k records, instead of risking a timeout scenario.

Make sense?
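One sketch of that "cap the count" idea (assuming the #OutputEmails layout from the earlier post) is to fetch at most 50,001 distinct addresses, so the server never has to count the full result set:

```sql
-- Sketch only: cap the work at 50,001 rows, then decide what to show.
-- If the cap is reached, all we know is the true count exceeds 50,000.
DECLARE @cnt int;

SELECT @cnt = COUNT(*)
FROM (
    SELECT DISTINCT TOP 50001 Email
    FROM (
        SELECT Email1 AS Email FROM #OutputEmails
        UNION
        SELECT Email2 AS Email FROM #OutputEmails
        -- ... remaining branches for Email3 through Email10 ...
    ) AS T
    WHERE Email LIKE '%@%'
) AS capped;

SELECT CASE WHEN @cnt > 50000
            THEN 'Over 50k records'
            ELSE CAST(@cnt AS varchar(10))
       END AS DisplayCount;
```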

  • ifila (1/13/2010)


    Lynn Pettis (1/13/2010)


Could you please explain what you are trying to achieve? What is the business case for this query (now I'm starting to sound like Jeff Moden)? How are you using it?

When a user wants to know how many resumes / candidate emails we have for a certain skill set, they can press the Count button to see the size of the group.

e.g. They search for Java and get a count of 300k prospective candidates, so they add Oracle, which takes it down to 100k; then they add SharePoint, down to 10k; then j2ee, down to 1k, which is now a workable group.

Occasionally on very large counts we get a timeout error, so I wanted to limit the count: give a number if under 50k, and just say 'Over 50k records' if over 50k records, instead of risking a timeout scenario.

Make sense?

    Why does it take so long?

    Here's the query...

SELECT TOP 15 Email
FROM (
    SELECT RowID, Emails, Email
    FROM (
        -- substitute this for your own query
        SELECT RowID, Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
        FROM #Sample
        -- /substitute this for your own query
    ) p
    UNPIVOT
    (Email FOR Emails IN
        (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)
    ) AS unpvt
) d
GROUP BY Email
ORDER BY NEWID()
-- (15 row(s) affected) / 00:00:01
-- (1 second to extract a random set of 15 distinct email addresses)

    Here's some sample data:

-- create some sample data
DROP TABLE #Sample
CREATE TABLE #Sample (RowID INT IDENTITY (1,1),
    Email1 VARCHAR(60), Email2 VARCHAR(60), Email3 VARCHAR(60), Email4 VARCHAR(60), Email5 VARCHAR(60),
    Email6 VARCHAR(60), Email7 VARCHAR(60), Email8 VARCHAR(60), Email9 VARCHAR(60), Email10 VARCHAR(60))

INSERT INTO #Sample (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)
SELECT TOP 100000
    Email1 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email2 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email3 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email4 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email5 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email6 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email7 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email8 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email9 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com',
    Email10 = LEFT(NEWID(), ABS(CHECKSUM(NEWID()))%20) + '@Hotmail.com'
FROM master.dbo.syscolumns a, master.dbo.syscolumns b

UPDATE u SET
    Email2 = CASE WHEN d2.RowID = u.RowID THEN NULL ELSE Email2 END,
    Email3 = CASE WHEN d3.RowID = u.RowID THEN NULL ELSE Email3 END,
    Email4 = CASE WHEN d4.RowID = u.RowID THEN NULL ELSE Email4 END,
    Email5 = CASE WHEN d5.RowID = u.RowID THEN NULL ELSE Email5 END,
    Email6 = CASE WHEN d6.RowID = u.RowID THEN NULL ELSE Email6 END,
    Email7 = CASE WHEN d7.RowID = u.RowID THEN NULL ELSE Email7 END,
    Email8 = CASE WHEN d8.RowID = u.RowID THEN NULL ELSE Email8 END,
    Email9 = CASE WHEN d9.RowID = u.RowID THEN NULL ELSE Email9 END,
    Email10 = CASE WHEN d10.RowID = u.RowID THEN NULL ELSE Email10 END
FROM #Sample u
LEFT JOIN (SELECT TOP 30 PERCENT RowID FROM #Sample ORDER BY NEWID()) d2 ON d2.RowID = u.RowID
LEFT JOIN (SELECT TOP 40 PERCENT RowID FROM #Sample ORDER BY NEWID()) d3 ON d3.RowID = u.RowID
LEFT JOIN (SELECT TOP 50 PERCENT RowID FROM #Sample ORDER BY NEWID()) d4 ON d4.RowID = u.RowID
LEFT JOIN (SELECT TOP 60 PERCENT RowID FROM #Sample ORDER BY NEWID()) d5 ON d5.RowID = u.RowID
LEFT JOIN (SELECT TOP 70 PERCENT RowID FROM #Sample ORDER BY NEWID()) d6 ON d6.RowID = u.RowID
LEFT JOIN (SELECT TOP 80 PERCENT RowID FROM #Sample ORDER BY NEWID()) d7 ON d7.RowID = u.RowID
LEFT JOIN (SELECT TOP 90 PERCENT RowID FROM #Sample ORDER BY NEWID()) d8 ON d8.RowID = u.RowID
LEFT JOIN (SELECT TOP 95 PERCENT RowID FROM #Sample ORDER BY NEWID()) d9 ON d9.RowID = u.RowID
LEFT JOIN (SELECT TOP 99 PERCENT RowID FROM #Sample ORDER BY NEWID()) d10 ON d10.RowID = u.RowID

SELECT * FROM #Sample -- 100,000 rows / 00:00:18 (18 seconds to prepare sample data)

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

A count of unique email addresses, I assume. Does the user have the option to restrict the date range on the search?

    If so, I would also recommend a change to your table. One, drop and recreate the primary key as a nonclustered index. Two, create a clustered index on currentdateout.
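As a sketch of that change (the constraint name PK_OutputResume3 and key column ResumeID here are made up; check the real names with sp_helpindex before running anything):

```sql
-- Hypothetical object names: substitute the actual PK constraint name
-- and key column from your table before running.
ALTER TABLE OutputResume3 DROP CONSTRAINT PK_OutputResume3;

ALTER TABLE OutputResume3
    ADD CONSTRAINT PK_OutputResume3 PRIMARY KEY NONCLUSTERED (ResumeID);

CREATE CLUSTERED INDEX CIX_OutputResume3_currentdateout
    ON OutputResume3 (currentdateout);
```

The idea is that date-range predicates can then become a clustered index seek over a contiguous range rather than a scan of the whole table.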

  • GilaMonster (1/13/2010)


    Try this then

SELECT COUNT(Email) AS NumberOfEmails -- get the count
FROM (
    SELECT DISTINCT TOP 15 Email -- get 15 unique emails
    FROM (
        SELECT Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
        FROM outputresume3
        WHERE CONTAINS(originalresume, '"j2ee" and "java"')
          AND currentdateout BETWEEN '2000-01-01' AND '2010-01-06'
    ) p
    UNPIVOT (Email FOR Emails IN (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)) AS unpvt
    WHERE LEN(Email) > 0
) sub

    I'm curious as to why. You just want the number 15? For what purpose?

    We have a winner:-)

    The 15 was for test purposes.

    Your solution works - thank you!

  • Dave Ballantyne (1/10/2010)


I recently went to buy a simple car, of all things. I went into the showroom and said "I want a car!".

God, you would not believe the stuff that came out of this guy's mouth. "What was my budget?", "What luggage capacity would I require?", "What MPG would I be expecting?" MPG!!!! WTF does that even mean? I told him that he was wasting my time, and that if he could answer my original question correctly he should just have ignored me.

Anyway, long story short, I phoned an offshore car expert, who recommended a Trabant. So I hopped on the first available plane to the old East Germany and bought the first one I saw. Shipping fees were quite high to bring it home, but still, job done, I've got my car.

Now my problem is, my pet tiger, who accompanies me everywhere, can't fit on the back seat. What should I do?

    That is awesome!

  • Lynn Pettis (1/13/2010)


    A count of unique email addresses, I am assuming as well. Does the user have the option to restrict the date range on the search? YES

    If so, I would also recommend a change to your table. One, drop and recreate the primary key as a nonclustered index. Two, create a clustered index on currentdateout.

    At the moment the 'Primary Key' is clustered, so you suggest changing to nonclustered.

    At the moment 'currentdateout' is nonclustered and you suggest changing to clustered.

    What performance improvement might that provide?

BTW, based on some initial performance testing with Gail's solution:

If I set TOP 60000 using her query, it takes 8 seconds to get the count.

If I set TOP 100000 on Email1, which gives approx 60000 records, it takes 2 seconds to get the count.

So I guess the subquery slows things down quite a bit.

Yes, I am recommending that the primary key be changed to a nonclustered index and that currentdateout be made the clustered index.

    If you look at your actual execution plan, you will probably see that this query is doing a clustered index scan, which is essentially a table scan. Even if you reduce the date range in the query to say 2009-01-01 to 2009-12-31 from the current 10 year range, it probably will still do a clustered index scan.

    This is one reason I asked if the user can limit the date range of the query as well.

  • Chris Morris-439714 (1/13/2010)



I don't have a 'rowid' column on my table. Is the only solution to add this to my table as a workaround?

  • No.

  • ifila (1/13/2010)


    What performance improvement might that provide?

BTW, based on some initial performance testing with Gail's solution:

If I set TOP 60000 using her query, it takes 8 seconds to get the count.

If I set TOP 100000 on Email1, which gives approx 60000 records, it takes 2 seconds to get the count.

So I guess the subquery slows things down quite a bit.

    Post the execution plans please (as a .sqlplan file, zipped and attached to the thread) and I'll give you some tuning advice. I didn't use much data to test, plus I don't have the full text catalog, so no way to do proper tuning here.

    If you're unsure how to get the execution plans, have a look at this article. http://qa.sqlservercentral.com/articles/SQLServerCentral/66909/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • ifila (1/13/2010)


I don't have a 'rowid' column on my table. Is the only solution to add this to my table as a workaround?

No, you don't need it at all. But I like it there. You can remove it if you want. In any case, that was just the sample data. This is the query, with the column removed:

SELECT TOP 15 Email
FROM (
    SELECT Emails, Email
    FROM (
        -- substitute this for your own query
        SELECT Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10
        FROM #Sample
        -- /substitute this for your own query
    ) p
    UNPIVOT
    (Email FOR Emails IN
        (Email1, Email2, Email3, Email4, Email5, Email6, Email7, Email8, Email9, Email10)
    ) AS unpvt
) d
GROUP BY Email
ORDER BY NEWID()
-- (15 row(s) affected) / 00:00:01
-- (1 second to extract a random set of 15 distinct email addresses)

    Cheers

    ChrisM@home



  • GilaMonster (1/13/2010)



    Post the execution plans please (as a .sqlplan file, zipped and attached to the thread) and I'll give you some tuning advice. I didn't use much data to test, plus I don't have the full text catalog, so no way to do proper tuning here.

    If you're unsure how to get the execution plans, have a look at this article. http://qa.sqlservercentral.com/articles/SQLServerCentral/66909/

    Please find attached the execution plan.

    Thanks

Again, I have to ask, can the user specify the date range used in the query?

  • Lynn Pettis (1/13/2010)


Again, I have to ask, can the user specify the date range used in the query?

Apologies, I missed your question - yes they can.

Selection criteria can be:

Email

From Date

To Date

Skills (comma-delimited); the resume is stored as a full-text index (FTI)
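Putting those criteria together, a rough sketch of the parameterised count (parameter names are illustrative; the CONTAINS search string would be built from the comma-delimited skills list):

```sql
-- Sketch only: @skills is assumed to be assembled by the application
-- from the user's comma-delimited skills, e.g. '"java" AND "oracle"'.
DECLARE @skills nvarchar(400), @from datetime, @to datetime;
SET @skills = N'"java" AND "oracle"';
SET @from = '20000101';
SET @to = '20100106';

SELECT COUNT(*) AS Matches
FROM OutputResume3
WHERE CONTAINS(originalresume, @skills)
  AND currentdateout BETWEEN @from AND @to;
```

With currentdateout as the clustered index key (as suggested above), the date predicate can narrow the rows before the full-text results are joined in.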
