Remove duplicate values

Question

Post reply

Remove duplicate values

daveriya

SSCrazy

Points: 2352
More actions
February 25, 2011 at 4:26 pm

#235231

hi

i have tried this query to remove duplicate values.

but getting error like 'incorrect syntax near keyword where"

select *

2 from

3 (select d.*

4 , count(*) over

5 (partition by empno) cnt

6 from dup_emp d

7 )

8 where cnt > 1;

what is wrong with this one?

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

diamondgm SSCrazy Points: 2824 More actions · Answer 1

daveriya (2/25/2011)

hi

i have tried this query to remove duplicate values.

but getting error like 'incorrect syntax near keyword where"

select *

2 from

3 (select d.*

4 , count(*) over

5 (partition by empno) cnt

6 from dup_emp d

7 )

8 where cnt > 1;

what is wrong with this one?

The problem with your syntax seems to be that the sub-query is not named.



select *

from

  (select d.*, count(*) over (partition by empno) cnt

   from dup_emp d

  )SQ1

where cnt > 1;

This will probably parse as the sub-query has now been named/aliased (SQ1).

However, so far I have found the following method to be the simplest and cheapest for identifying and deleting duplicates:



;WITH Duplicates (RankX) AS

(

    SELECT	RANK() OVER (PARTITION BY empno ORDER BY NEWID()) RankX

	FROM dup_emp

)

Delete FROM Duplicates WHERE RankX > 1

I came across this code some time ago on SQL Server Central.

PS.

If anyone could shed some light on how to do away with (or minimize the cost of) the ORDER BY clause, I would appreciate it.

Paul White SSC Guru Points: 150457 More actions · Answer 2

diamondgm (2/26/2011)

However, so far I have found the following method to be the simplest and cheapest for identifying and deleting duplicates:

SELECT RANK() OVER (PARTITION BY empno ORDER BY NEWID()) RankX

You could write PARTITION BY empno ORDER BY empno - there's really no call to generate a new GUID per row.

Better still:



DECLARE @T TABLE

        (

        A   INT NOT NULL

        )

;

INSERT  @T (A)

VALUES  (1), (1), (2), (2), (2), (2), (3)

;



SELECT  *

FROM    @T

;

DELETE  SQ

FROM    (

        SELECT  *,

                rn = ROW_NUMBER() OVER (PARTITION BY A ORDER BY A)

        FROM    @T AS T

        ) AS SQ

WHERE   rn > 1

;

SELECT  *

FROM    @T

;

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

diamondgm SSCrazy Points: 2824 More actions · Answer 3

SQLkiwi (2/26/2011)

diamondgm (2/26/2011)

However, so far I have found the following method to be the simplest and cheapest for identifying and deleting duplicates:

SELECT RANK() OVER (PARTITION BY empno ORDER BY NEWID()) RankX

You could write PARTITION BY empno ORDER BY empno - there's really no call to generate a new GUID per row.

Better still:

DECLARE @T TABLE ( A INT NOT NULL ) ; INSERT @T (A) VALUES (1), (1), (2), (2), (2), (2), (3) ;
SELECT * FROM @T ; DELETE SQ FROM ( SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY A ORDER BY A) FROM @T AS T ) AS SQ WHERE rn > 1 ; SELECT * FROM @T
;

I was trying to indicate that the order by clause seems inconsequential.

Thanks for answering my question; there seems to be no way to avoid the the clause - even though it will cause unneeded processing.

Paul White SSC Guru Points: 150457 More actions · Answer 4

diamondgm (2/26/2011)

I was trying to indicate that the order by clause seems inconsequential.

Yes but calling NEWID() is not for free, and neither is sorting by a uniqueidentifier.

Thanks for answering my question; there seems to be no way to avoid the the clause - even though it will cause unneeded processing.

The optimizer is smart enough to ignore the ORDER BY when it matches the PARTITION BY clause, but if you prefer, you can write:

PARTITION BY A ORDER BY (SELECT 0).

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

diamondgm SSCrazy Points: 2824 More actions · Answer 5

SQLkiwi (2/26/2011)

diamondgm (2/26/2011)

I was trying to indicate that the order by clause seems inconsequential.

Yes but calling NEWID() is not for free, and neither is sorting by a uniqueidentifier.

Thanks for answering my question; there seems to be no way to avoid the the clause - even though it will cause unneeded processing.

The optimizer is smart enough to ignore the ORDER BY when it matches the PARTITION BY clause, but if you prefer, you can write:

PARTITION BY A ORDER BY (SELECT 0).

Yeah, I'm aware of the cost of sorting the guid, but it was lame of me not to see what you pointed out about the optimiser - thank you!

Paul White SSC Guru Points: 150457 More actions · Answer 6

Paul White

SSC Guru

Points: 150457

February 27, 2011 at 12:07 am

#1291991

diamondgm (2/26/2011)

Yeah, I'm aware of the cost of sorting the guid, but it was lame of me not to see what you pointed out about the optimiser - thank you!

'Lame' is a bit harsh, but you are welcome.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi