Remove duplicate values

  • hi

    i have tried this query to remove duplicate values.

    but getting error like 'incorrect syntax near keyword where"

    select *

    2 from

    3 (select d.*

    4 , count(*) over

    5 (partition by empno) cnt

    6 from dup_emp d

    7 )

    8 where cnt > 1;

    what is wrong with this one?

  • daveriya (2/25/2011)


    hi

    i have tried this query to remove duplicate values.

    but getting error like 'incorrect syntax near keyword where"

    select *

    2 from

    3 (select d.*

    4 , count(*) over

    5 (partition by empno) cnt

    6 from dup_emp d

    7 )

    8 where cnt > 1;

    what is wrong with this one?

    The problem with your syntax seems to be that the sub-query is not named.

    select *

    from

    (select d.*, count(*) over (partition by empno) cnt

    from dup_emp d

    )SQ1

    where cnt > 1;

    This will probably parse as the sub-query has now been named/aliased (SQ1).

    However, so far I have found the following method to be the simplest and cheapest for identifying and deleting duplicates:

    ;WITH Duplicates (RankX) AS

    (

    SELECT RANK() OVER (PARTITION BY empno ORDER BY NEWID()) RankX

    FROM dup_emp

    )

    Delete FROM Duplicates WHERE RankX > 1

    I came across this code some time ago on SQL Server Central.

    PS.

    If anyone could shed some light on how to do away with (or minimize the cost of) the ORDER BY clause, I would appreciate it.

  • diamondgm (2/26/2011)


    However, so far I have found the following method to be the simplest and cheapest for identifying and deleting duplicates:

    SELECT RANK() OVER (PARTITION BY empno ORDER BY NEWID()) RankX

    You could write PARTITION BY empno ORDER BY empno - there's really no call to generate a new GUID per row.

    Better still:

    DECLARE @T TABLE

    (

    A INT NOT NULL

    )

    ;

    INSERT @T (A)

    VALUES (1), (1), (2), (2), (2), (2), (3)

    ;

    SELECT *

    FROM @T

    ;

    DELETE SQ

    FROM (

    SELECT *,

    rn = ROW_NUMBER() OVER (PARTITION BY A ORDER BY A)

    FROM @T AS T

    ) AS SQ

    WHERE rn > 1

    ;

    SELECT *

    FROM @T

    ;

  • SQLkiwi (2/26/2011)


    diamondgm (2/26/2011)


    However, so far I have found the following method to be the simplest and cheapest for identifying and deleting duplicates:

    SELECT RANK() OVER (PARTITION BY empno ORDER BY NEWID()) RankX

    You could write PARTITION BY empno ORDER BY empno - there's really no call to generate a new GUID per row.

    Better still:

    DECLARE @T TABLE

    (

    A INT NOT NULL

    )

    ;

    INSERT @T (A)

    VALUES (1), (1), (2), (2), (2), (2), (3)

    ;

    SELECT *

    FROM @T

    ;

    DELETE SQ

    FROM (

    SELECT *,

    rn = ROW_NUMBER() OVER (PARTITION BY A ORDER BY A)

    FROM @T AS T

    ) AS SQ

    WHERE rn > 1

    ;

    SELECT *

    FROM @T

    ;

    I was trying to indicate that the order by clause seems inconsequential.

    Thanks for answering my question; there seems to be no way to avoid the the clause - even though it will cause unneeded processing.

  • diamondgm (2/26/2011)


    I was trying to indicate that the order by clause seems inconsequential.

    Yes but calling NEWID() is not for free, and neither is sorting by a uniqueidentifier.

    Thanks for answering my question; there seems to be no way to avoid the the clause - even though it will cause unneeded processing.

    The optimizer is smart enough to ignore the ORDER BY when it matches the PARTITION BY clause, but if you prefer, you can write:

    PARTITION BY A ORDER BY (SELECT 0).

  • SQLkiwi (2/26/2011)


    diamondgm (2/26/2011)


    I was trying to indicate that the order by clause seems inconsequential.

    Yes but calling NEWID() is not for free, and neither is sorting by a uniqueidentifier.

    Thanks for answering my question; there seems to be no way to avoid the the clause - even though it will cause unneeded processing.

    The optimizer is smart enough to ignore the ORDER BY when it matches the PARTITION BY clause, but if you prefer, you can write:

    PARTITION BY A ORDER BY (SELECT 0).

    Yeah, I'm aware of the cost of sorting the guid, but it was lame of me not to see what you pointed out about the optimiser - thank you!

  • diamondgm (2/26/2011)


    Yeah, I'm aware of the cost of sorting the guid, but it was lame of me not to see what you pointed out about the optimiser - thank you!

    'Lame' is a bit harsh, but you are welcome.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply