Deduplication Help

  • Hi,

    My goal is to dedupe (with a twist) the sample table below with the following criteria. For each duplicate record I want to:

    1) If [item_ordered?] column is 'Yes' for all the deplicate records keep any one of the records - doesn't matter which record.

    2) If [item_ordered?] column is 'No' for all the duplicate records keep any one of the records - doesn't matter with record.

    3) If [item_ordered?] column is 'Yes' for some duplicate records and 'No' for others, keep one of the 'Yes' records - doesn't matter which record.

    /* sample dataset */

    CREATE TABLE #TestTable (id INT, order_status VARCHAR(10), [item_ordered?] VARCHAR(3))

    INSERT #TestTable (id, order_status, [item_ordered?])

    SELECT 100, 'Ordered', 'Yes' UNION ALL

    SELECT 100, 'Staged', 'Yes' UNION ALL

    SELECT 100, 'Shipped', 'Yes' UNION ALL

    SELECT 200, 'Hold', 'No' UNION ALL

    SELECT 200, 'Cancelled', 'No' UNION ALL

    SELECT 200, 'Hold', 'No' UNION ALL

    SELECT 200, 'Returned', 'No' UNION ALL

    SELECT 300, 'Ordered', 'Yes' UNION ALL

    SELECT 300, 'Ordered', 'Yes' UNION ALL

    SELECT 300, 'Staged', 'Yes' UNION ALL

    SELECT 300, 'Shipped', 'Yes' UNION ALL

    SELECT 300, 'Shipped', 'Yes' UNION ALL

    SELECT 400, 'Cancelled', 'No' UNION ALL

    SELECT 400, 'Review', 'No' UNION ALL

    SELECT 400, 'Hold', 'No' UNION ALL

    SELECT 400, 'Returned', 'No' UNION ALL

    SELECT 500, 'Ordered', 'Yes' UNION ALL

    SELECT 500, 'Pending', 'No' UNION ALL

    SELECT 500, 'Staged', 'Yes' UNION ALL

    SELECT 500, 'Shipped', 'Yes' UNION ALL

    SELECT 600, 'Hold', 'No' UNION ALL

    SELECT 600, 'Review', 'No' UNION ALL

    SELECT 600, 'Pending', 'Yes' UNION ALL

    SELECT 700, 'Ordered', 'Yes' UNION ALL

    SELECT 700, 'Hold', 'No' UNION ALL

    SELECT 700, 'Cancelled', 'No' UNION ALL

    SELECT 700, 'Returned', 'No' UNION ALL

    SELECT 800, 'Hold', 'No' UNION ALL

    SELECT 800, 'Returned', 'No' UNION ALL

    SELECT 900, 'Staged', 'Yes' UNION ALL

    SELECT 900, 'Shipped', 'Yes'

    GO

    Thanks for your help.

    MCTS: BI 2008, MCITP: BI 2008
    Stay Thirsty My Friends

  • WITH CTE AS (

    SELECT *,

    ROW_NUMBER() OVER(PARTITION BY id, order_status

    ORDER BY CASE WHEN [item_ordered?]='Yes' THEN 0 ELSE 1 END) AS rn

    FROM #TestTable)

    DELETE FROM CTE

    WHERE rn>1;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • First of all you have to define what "duplicate" means to you. Duplicate records for the same Id? Duplicate records for the same id+status?

    If the former, this should do:

    SELECT *

    FROM (

    SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY id ORDER BY [item_ordered?] DESC, id ASC)

    FROM #TestTable

    ) AS src

    WHERE RN = 1

    -- Gianluca Sartori

  • Gianluca Sartori (1/30/2012)


    First of all you have to define what "duplicate" means to you. Duplicate records for the same Id? Duplicate records for the same id+status?

    If the former, this should do:

    SELECT *

    FROM (

    SELECT *, RN = ROW_NUMBER() OVER (PARTITION BY id ORDER BY [item_ordered?] DESC, id ASC)

    FROM #TestTable

    ) AS src

    WHERE RN = 1

    Sorry I wasn't clear - Deplicate for the same ID not ID+status.

    MCTS: BI 2008, MCITP: BI 2008
    Stay Thirsty My Friends

  • I think you should be ok, then.

    -- Gianluca Sartori

  • This is exactly the solution I was looking for..works like a charm. Thanks for all your help.

    MCTS: BI 2008, MCITP: BI 2008
    Stay Thirsty My Friends

  • Hi,

    I was just following this thread and understood the requirement and I also executed your solution and found that it satisfies the requirement...but not able to understand how you achieved it.

    Can you please explain..?

    Thanks!

    Siva.

  • ROW_NUMBER() OVER (PARTITION BY id ORDER BY [item_ordered?] DESC, id ASC)

    means "give each row in the table a row number (rank) sorting rows by [item_ordered?] in reverse order (so that "yes" ranks higher than "no") and then by id, partitioning by id (means that the rank will be "reset" for each different id)".

    Applied to the test data, it gives this output:

    id order_status item_ordered? RN

    ----------- ------------ ------------- --------------------

    100 Ordered Yes 1

    100 Staged Yes 2

    100 Shipped Yes 3

    200 Hold No 1

    200 Cancelled No 2

    200 Hold No 3

    200 Returned No 4

    300 Ordered Yes 1

    300 Ordered Yes 2

    300 Staged Yes 3

    300 Shipped Yes 4

    300 Shipped Yes 5

    400 Cancelled No 1

    400 Review No 2

    400 Hold No 3

    400 Returned No 4

    500 Ordered Yes 1

    500 Staged Yes 2

    500 Shipped Yes 3

    500 Pending No 4

    600 Pending Yes 1

    600 Hold No 2

    600 Review No 3

    700 Ordered Yes 1

    700 Hold No 2

    700 Cancelled No 3

    700 Returned No 4

    800 Hold No 1

    800 Returned No 2

    900 Staged Yes 1

    900 Shipped Yes 2

    Hope this answers your question.

    -- Gianluca Sartori

  • Siva Ramasamy (1/30/2012)


    Hi,

    I was just following this thread and understood the requirement and I also executed your solution and found that it satisfies the requirement...but not able to understand how you achieved it.

    Can you please explain..?

    Thanks!

    Siva.

    The following may answer your question;

    http://www.simple-talk.com/sql/t-sql-programming/sql-server-cte-basics/

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Thanks Gianluca Sartori and Thanks Ron.

    I understood the logic.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply