Removing duplicates

  • Hi,

    I work in a place where two of our subsidiaries merged to the parent company. My problem is all companies used same database schema so all tables (& constraints) are the same (as they use Macola they are restricted to using the schema dictated by Macola). After merger, I have duplicate customers (thousands of them)

    hence I cannot use the trick I know (group by...having), to make matters worse some of the unique constraints are on composite (5 or more) columns. I have removed the unique constraints and merged the records. Is there a way to somewhat easily eliminate the duplicates so I can rebuild the unique constraints on the merged tables. Please suggest.

    Thanks for any help I can get

    Raghu


    Raghu

  • I wrote a loop that looked for duplcates (using COUNT and HAVING COUNT(*) > 1) by name or other criteria. Then issue a SET ROWCOUNT 1 and a DELETE for each name. Continue looping until you have no more duplcates.

    Kludgy and slow (er than set ops), but it worked well.

    Steve Jones

    steve@dkranch.net

  • Can you explain more on that raghu, might be post your table schema .

    if am not missing anything(though am sure i am:( ) ,you can use a distinct clause and dump the data in a temp table.

    or

    if therez no pk , you can add a identity column .

    and try something like this

    select col1,col2,col3,Recono = max(identitycol)

    into #temptable

    from tablename

    group by col1,col2,col3

    having count(*) > 1

    and

    delete from tablename

    where identitycol not in (select recono from

    #temptable)

    later you can drop the temp table and as well the identity column too.

    HTH

  • Normally I would suggest create a composite index clustered on UniCol1, UniCol2, UniCol3 (your columns that should be unique) then add an IDENTITY column so you have a number reference. Then use something like

    SELECT * FROM tblX oX

    WHERE idX IN (SELECT TOP 1 idX FROM tblX iX

    WHERE ix.UniCOl1 = ox.UniCol1 AND ix.UniCOl2 = ox.UniCol2 AND ix.UniCOl3 = ox.UniCol3)

    This should only return one row per item you will want to test this on known rows to make sure not there. Once you are sure change ...idX IN (SELECT... to idX NOT IN (SELECT... And make sure you get the others instead. If test is fine the change SELECT * FROM tblX... to DELETE tblX... and let run. This will clear the duplicates. Once done drop the IDENTITY column we added and change make you unique column items a PK and once done make sure your index was saved (may have create a non-clustered as index for PK if so delete yours and change PK index to cluserted, not much should happen).

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Thanks for all your suggestions, I can start thinking on each one of them and try it out...I will let you guys know how I made out.

    Thanks again

    Raghu


    Raghu

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply