Help avoiding RBAR

  • There must be a set based answer to this ugly situation but my brain isn't finding it and time is running out. So I'm appealing for help.

    There are some duplicates names in a table (without a primary key and 600,000 rows) which are distinguished by date. I need to delete all but the newest dated record for each person. Ie the person can become the primary key.

    I can easily get a list of those names which are duplicated into a temporary table, but it's cycling through these names to get the MAX date for each one that's causing me to run to the WHILE loop. Please help me avoid the RBAR solution if possible. If it's not then do I use a CURSOR or stick with my C# programming and use a WHILE loop.

    CREATE TABLE #Table1

    (

    NameOfSomeone nvarchar(100),

    DateOfSomeone datetime

    )

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Paul', '2009-01-01')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Paul', '2009-01-02')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Peter', '2009-01-03')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Peter', '2009-01-04')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Peter', '2009-01-07')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('James', '2009-01-05')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('James', '2009-01-06')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Amy', '2009-01-01')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Grace', '2009-01-01')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Jane', '2009-01-01')

  • There are many methods by which you can delete duplicate rows from one table. Here is one of the ways you can do it.

    It can also be done using temporary table with IDENTITY column and inserting the data into this table in

    sorted order as explained here

    IF ( OBJECT_ID( 'tempdb..#Table1' ) IS NOT NULL )

    DROP TABLE #Table1

    CREATE TABLE #Table1

    (

    NameOfSomeone nvarchar(100),

    DateOfSomeone datetime

    )

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Paul', '2009-01-01')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Paul', '2009-01-02')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Peter', '2009-01-03')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Peter', '2009-01-04')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Peter', '2009-01-07')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('James', '2009-01-05')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('James', '2009-01-06')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Amy', '2009-01-01')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Grace', '2009-01-01')

    INSERT INTO #Table1 (NameOfSomeone, DateOfSomeone) VALUES ('Jane', '2009-01-01')

    DELETE T1

    FROM #Table1 T1

    INNER JOIN

    (

    SELECT NameOfSomeone, MAX( DateOfSomeone ) AS DateOfSomeone

    FROM #Table1

    GROUP BY NameOfSomeone

    ) T2 ON T1.NameOfSomeone = T2.NameOfSomeone AND T1.DateOfSomeone != T2.DateOfSomeone

    SELECT * FROM #Table1

    --Ramesh


  • ---- Edit ----------------------------------------------------------

    Sorry ignore this: I've just noticed its the SQL2000 group, and ROW_NUMBER() isn't available. But I've left it here for information only.

    -------------------------------------------------------------------

    Another de-duping method, using the ROW_NUMBER PARTITION BY feature.

    WITH DUPS AS (

    SELECT NameOfSomeone, DateOfSomeone, ROW_NUMBER() OVER (PARTITION BY NameOfSomeone ORDER BY DateOfSomeone DESC) AS RN

    FROM #Table1

    )

    DELETE FROM #Table1

    FROM #Table1 JOIN DUPS

    ON #Table1.NameOfSomeone = DUPS.NameOfSomeone AND

    #Table1.DateOfSomeone = DUPS.DateOfSomeone

    WHERE DUPS.RN > 1

    This creates A CTE with each name in a separate group (partition) - and the dates are numbered in descending sequence within each group, so you can delete all those with a row number greather than 1.

    As with any potentially dangerous changes, try it out in a transaction with rollback first.

  • Tom Brown (5/19/2009)


    Another de-duping method, using the ROW_NUMBER PARTITION BY feature.

    It does work in 2K5 but we are in 2K forum. It happened to me many times in the past:-D

    --Ramesh


  • A huge thankyou to all who replied. 😀 I was so busy looking at memory tables and then joining them with the original I lost sight of the root cause - cr*p data.

    I've run the script by SSCrazy and it worked a treat! :w00t: The 3rd party's data is now in a fit state to be imported into live, the boss is happy and I can get to work on the next stages of integration.

    Now where's the emoticon for a beer...

  • FNS (5/19/2009)


    Now where's the emoticon for a beer...

    Steve ... ?

    We could do with a few more emoticons

    these ones don't display properly :beer: :rocks: :thumbsup:

  • FNS (5/19/2009)


    A huge thankyou to all who replied. 😀 I was so busy looking at memory tables and then joining them with the original I lost sight of the root cause - cr*p data.

    I've run the script by SSCrazy and it worked a treat! :w00t: The 3rd party's data is now in a fit state to be imported into live, the boss is happy and I can get to work on the next stages of integration.

    Now where's the emoticon for a beer...

    You are welcome, and I am glad that it worked out well for you. "BeerIcon", I wandered around for years but still couldn't find one:w00t::w00t:

    BTW, I am not SSCrazy:hehe:

    --Ramesh


Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply