Help needed with removing duplicates

  • Hi All

    I have a table of approx 500,000 members which I need to dedupe only by email address. and i dont mind which records get removed. as long as there is a unique email address.

    e.g.

    tom, harrow, tom@tom.com, 30/09/1978, blue, £456

    thomas, harrow, tom@tom.com, <NULL>, blue, £456

    tom, harrow, tom@tom.com, 30/09/1978, <NULL>, <NULL>

    I dont mind which record remains... as long as there are no duplicate email addresses.

    I have tried a number of ways but none seem to work properly and the only effective method opnly keeps one field(email) i.e. creates a table with only distinct email addresses.

    Any help hugely appreciated.

    Thanks

    Tom

  • There's a script that someone anonymous posted in the scripts section. Do a search on this site for remove duplicates:

    Remove Duplicates from Table - No Cursors

    This script deletes all duplicates from a table.  It keeps the first instance of the record, and discards all others.  If you want to keep the last instance of the record, use MAX instead of MIN. 

    The script can be modified very easily to include more/less fields for comparison.  I've successfully used this SP to find dupes in over 100 fields.  The only requirement is that the table must have a unique ID field.

  • hey chris thanks for your help ... i found the script ... just having a little trouble unbdrestanding the pseudo code - is t1 a tablew which i need to ctreate first? my table is called customer do i replace MyTable with customer?

    DELETE FROM

     t1

    FROM

     MyTable t1

    INNER JOIN

     (

      SELECT

       MIN(FieldID) AS FieldID,

       FieldWithDupes1,

       FieldWithDupes2,

       FieldWithDupes3

      FROM

       MyTable

      GROUP BY

       FieldWithDupes1,

       FieldWithDupes2,

       FieldWithDupes3

      HAVING

       COUNT(*) > 1

    &nbsp t2

     ON(

      t1.FieldWithDupes1 = t2.FieldWithDupes1

      AND t1.FieldWithDupes2 = t2.FieldWithDupes2

      AND t1.FieldWithDupes3 = t2.FieldWithDupes3

      AND t1.FieldID <> t2.FieldID

    &nbsp

     

     

  • t1 and t2 are the same aliased table. The table you want to remove dups from.

  • so do i need to creat t1 and t2 before i start?

     

    DELETE FROM t1

    FROM         customer t1 INNER JOIN

                              (SELECT     MIN(autoID) AS autoID, Email

                                FROM          customer

                                GROUP BY Email

                                HAVING      COUNT(*) > 1 t2 ON (t1.email = t2.email AND t1.email = t2.email AND t1.email = t2.email AND t1.AutoID <> t2.AutoID

     

    does this look correct?

  • I would try something like the following (sorry, I haven't had a chance to test it):

     

    DELETE FROM t1

    FROM         customer t1

    INNER JOIN

                              (SELECT     MIN(autoID) AS autoID, Email

                                FROM          customer

                                GROUP BY Email

                                HAVING      COUNT(*) > 1) t2

    ON(

     t1.email = t2.email AND t1.AutoID <> t2.AutoID

    )

    Since you are just filtering on 1 field (email), I don't think you need those additional email fields in the last section (after ON). I couldn't make the formatting look any better, so SQL Server QA would probably complain 🙁

  • You don't need to create anything.

    t1 and t2 are aliases for the table name.

    It's the same as doing this :

    Select * from dbo.SysObjects O1 inner join dbo.SysObjects O2 on O1.id = O2.id

  • thanks chris that was very clear ...i  executed this successfully but affected 0 records??

    there are definately duplicates

    any ideas?

  • How many records does this returns??

    SELECT MIN(autoID) AS autoID, Email

    FROM dbo.customer

    GROUP BY Email

    HAVING COUNT(*) > 1

  • In my opinion

    >>SELECT MIN(autoID) AS autoID, Email

    FROM dbo.customer

    GROUP BY Email

    HAVING COUNT(*) > 1 <<<

     will return ALWAYS 0 records affected !

     


    * Noel

  • Still don't see it...

    what did I miss Noeld???

    This returns a lot of records for me >>>

    Select name, min(id) as AutoID, count(*) from dbo.SysColumns group by name having count(*) > 1

  • Well,

    I GOOFED! my count(*) > 1 was count(*) > 11

    then I copied an pasted the SAME query twice to change only the columns 

     

    Sorry for the confusion


    * Noel

  • Hehe... I thaught I was going crazy missing something so simple in that query...

    But that still doesn't tell us what's wrong with the delete statement. Any ideas?

  • I noticed this thread because of the news letter! 

    delete

    FROM         Customer

    WHERE     (autoid >

                              (SELECT     MIN(autoid)

                                FROM          Customer t1

                                WHERE      Customer.email = t1.email))

  • Tom, If you are having trouble understanding the SQL syntax in the scripts above, you definitely need some more training in this area.  Get to it!  You will benefit by the effort!

    Another way (isn't there always) is to create a matching table structure and create a unique index on the email field with the "ignore duplicates" box checked.  Then use DTS to append the data from your existing table into the new structure.   This will prevent the duplicates from entering the table.  If you incorporate this index in your table, it will stop this problem from happening again in the future.  ( a very good thing! )

    Mike

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply