Finding and Deleting Duplicate Data

  • I have a table that has many duplicate records and I need to delete all but

    one of the duplicate records. Is there a way I can identify these records with a sql

    statement?

    Example records Where one needs to stay while the other two get deleted:

    (Comment_id, site_id, date_entered, user_id, comment)

    6978, AL28492, 07/17/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

    8161, AL28492, 08/15/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

    8702, AL28492, 09/11/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

    The comment field is a data type text, if that makes a difference.

     

  • Mike - you don't say which 2 get deleted - would it be the min or max comment_id or the min or max date_entered that would stay and the others deleted ?!?!

    Here's a select you can "play around" with..and then you can work on your "delete" once you get the "select" to give you the results you want...

    select t2.comment_id, t1.site_id, t1.date_entered, t1.user_id, t1.comment
    from tblTest t1
    join 
    (select min(comment_id) as comment_id from tblTest 
    group by site_id, user_id)t2
    on t1.comment_id = t2.comment_id
    







    **ASCII stupid question, get a stupid ANSI !!!**

  • I tried what you said and I'm getting 3 errors (all the same) "line 1: The column prefix 't1' does not match with a table name or alias name used in the query."...And my little brain says it looks correct.

    BTW, According to my boss, I can keep whichever comment row I decide...just as long as the others are gone. I was planning on keeping the ones with the max comment_id.

    select t2.comment_id, t1.site_id, t1.date_entered, t1.user_id, t1.comment

    from statusrptcomments1 t1

    join

    (select min(t1.comment_id) as comment_id from statusrptcomments1

    group by t1.site_id, t1.user_id) t2 on t1.comment_id = t2.comment_id

  • Once you decide on which fields you can use to denote the duplicate records i.e. is it combination of user_id and site_id itself that makes it a duplicate ?

    If so, then you can do something like this:

    CREATE TABLE TESTCASE (COL1 INT, COL2 DATETIME, COL3 VARCHAR(100), SITE_ID VARCHAR(20), USER_ID INT, COMMENT TEXT)

    GO

    INSERT INTO TESTCASE VALUES (1, GETDATE(), 'FIRST ROW', 'AL28492', 408, 'SAME ROW')

    INSERT INTO TESTCASE VALUES (2, DATEADD(mi, 100, GETDATE()), 'SECOND ROW', 'AL28492', 408, 'SAME ROW')

    INSERT INTO TESTCASE VALUES (3, DATEADD(mi, 1000, GETDATE()), 'THIRD ROW', 'AL28492', 408, 'SAME ROW')

    GO

    --PRESERVE THE LAST ROW

    DELETE FROM TESTCASE

     WHERE EXISTS (SELECT 1 FROM TESTCASE DUPSINNER

       WHERE DUPSINNER.SITE_ID = TESTCASE.SITE_ID

       AND DUPSINNER.USER_ID = TESTCASE.USER_ID

       AND DUPSINNER.COL1 > TESTCASE.COL1)

    --PRESERVE THE FIRST ROW

    DELETE FROM TESTCASE

     WHERE EXISTS (SELECT 1 FROM TESTCASE DUPSINNER

       WHERE DUPSINNER.SITE_ID = TESTCASE.SITE_ID

       AND DUPSINNER.USER_ID = TESTCASE.USER_ID

       AND DUPSINNER.COL1 < TESTCASE.COL1)

    And once you have deleted and fixed the data, you should enforce a unique constraint on the combination that should be unique in this table. 

  • select t2.comment_id, t1.site_id, t1.date_entered, t1.[user_id], t1.comment

    from dbo.statusrptcomments1 t1 inner join

    (select min(comment_id) as comment_id from dbo.statusrptcomments1

    group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

    DELETE t1

    FROM  dbo.statusrptcomments1 t1 LEFT OUTER JOIN

    (select min(comment_id) as comment_id from dbo.statusrptcomments1

    group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

    WHERE t2.Comment_ID IS NULL


    Kindest Regards,

    Vasc

  • Thanks Vasc, but I just finished using the example that Journeyman gave me and it did exactly what I needed.

    Thanks to all of you for your help...now that I look at the examples, it seems so simple. Basically, as I read the examples, I need an inner sql statement that makes a row unique and then an outer statement that deletes all the other rows...Or in the case of Journeyman's example, an inner statement that pulls all the non-unique rows and deletes them. Is this correct?

  • Well rsharma is deleting all the post made by a user in one dep but keeps 1 (a user will have MAX 1 post for each dep) while I m deleting all duplicate posts made by a user is a certain dep : )

     

    DECLARE  @TESTCASE TABLE(COL1 INT, COL2 DATETIME, COL3 VARCHAR(100), SITE_ID VARCHAR(20), USER_ID INT, COMMENT TEXT)

    INSERT INTO @TESTCASE VALUES (1, GETDATE(), 'FIRST ROW', 'AL28492', 408, 'SAME ROW')

    INSERT INTO @TESTCASE VALUES (2, DATEADD(mi, 100, GETDATE()), 'SECOND ROW', 'AL28492', 408, 'SAME ROW')

    INSERT INTO @TESTCASE VALUES (3, DATEADD(mi, 1000, GETDATE()), 'THIRD ROW', 'AL28492', 408, 'SAME ROW')

    INSERT INTO @TESTCASE VALUES (4, DATEADD(mi, 2000, GETDATE()), 'THIRD ROW', 'AL28492', 408, 'NEW ROW')

    --PRESERVE THE LAST ROW

    /*

    DELETE FROM @TESTCASE

     WHERE EXISTS (SELECT 1 FROM @TESTCASE DUPSINNER

       WHERE DUPSINNER.SITE_ID = @TESTCASE.SITE_ID

       AND DUPSINNER.USER_ID = @TESTCASE.USER_ID

       AND DUPSINNER.COL1 > @TESTCASE.COL1)

    */

    --PRESERVE THE FIRST ROW

    DELETE T FROM @TESTCASE T

     WHERE EXISTS (SELECT 1 FROM @TESTCASE DUPSINNER

       WHERE DUPSINNER.SITE_ID = T.SITE_ID

       AND DUPSINNER.USER_ID = T.USER_ID

       AND DUPSINNER.COL1 < T.COL1)

    SELECT * FROM @TESTCASE

     

    VS

     

    declare @T table (Comment_id integer, site_id varchar(10), date_entered datetime, [user_id] integer, comment varchar(50))

    --replace @T with @T

    INSERT INTO @T(Comment_id,  date_entered, site_id, [user_id], comment) VALUES (1, GETDATE(), 'AL28492', 408, 'SAME ROW')

    INSERT INTO @T(Comment_id,  date_entered, site_id, [user_id], comment) VALUES (2, DATEADD(mi, 100, GETDATE()),  'AL28492', 408, 'SAME ROW')

    INSERT INTO @T(Comment_id,  date_entered, site_id, [user_id], comment) VALUES (3, DATEADD(mi, 1000, GETDATE()), 'AL28492', 408, 'SAME ROW')

    INSERT INTO @T(Comment_id,  date_entered, site_id, [user_id], comment) VALUES (4, DATEADD(mi, 2000, GETDATE()), 'AL28492', 408, 'NEW ROW')

    SELECT * FROM @T

    select t2.comment_id, t1.site_id, t1.date_entered, t1.[user_id], t1.comment

    from @T t1 inner join

    (select min(comment_id) as comment_id from @T

    group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

    DELETE t1

    FROM  @T t1 LEFT OUTER JOIN

    (select min(comment_id) as comment_id from @T

    group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

    WHERE t2.Comment_ID IS NULL

     


    Kindest Regards,

    Vasc

  • If if doesn't matter which record you keep you could just create a cursor that includes each duplicate and the number of times the duplicate appears in the table.  Loop through the cursor and each time set rowcount to the number of duplicates minus 1 then do a delete from the table where the site id is equal to your cursor value.

    First just to see your duplicates run this;

    Select site_id,count(*) as count from duplicate_tab

    group by site_id

    having count(*) > 1

     

    Then you can run this;

    declare

    @site varchar(50),

    @cnt int,

    @rowcnt int

    /** This get all Site_ID's that have more than one in the table   **/

    Declare c_site cursor FOR

    Select site_id,count(*) as count from duplicate_tab

    group by site_id

    having count(*) > 1

    Open c_site

    Fetch Next from c_site into @site, @cnt

    While (@@Fetch_Status = 0)

    Begin

     set @rowcnt = @cnt - 1

     set rowcount @rowcnt

     

     Delete from duplicate_tab

     Where site_id = @site

     set rowcount 0

     Fetch Next From c_site into @site, @cnt

    End

    close c_site

    deallocate c_site

    --------------------------------------------------------

    This always worked for me.

    Thanks,

    Jeannine

  • Thanks...I try to stay away from cursors, but I like the select statement you did.

  • What a wonderfully diplomatic response Mike - applause!

    Jeannine - I wanted to warn you before the "CursorsAreEvil" police come swooping down on you - please search this site for cursors etc.. and you'll find an abundance of articles, posts etc.. that explain at great length about why cursors are evil and how you must avoid them whenever possible...







    **ASCII stupid question, get a stupid ANSI !!!**

  • Yes cursors are nasty if you have a lot of data.  But as a quick looping example they are good. 

    I usually loop through the table using one of the column identifiers.  Usually they are the PK or at least indexed. 

    Below is an example that uses the method of determining the minimum and maximum values for the id column and then augmenting the ID each time through the loop.

    Declare

    @maxsite varchar(50),

    @minsite varchar(50),

    @cnt int,

    @rowcnt int

    -- Get the min and max site_id values where the site ID has dups

    select @minsite = Min(site_id), @maxsite = Max(Site_ID) from duplicate_tab Where Exists(Select site_id,count(*) as count from duplicate_tab

    group by site_id

    having count(*) > 1)

     

    -- Loop through the site_id's based on the min and max

    While (@minsite <= @maxsite)

    Begin

     select  @cnt = count(*) from duplicate_tab Where 

     site_id = @minsite

     group by site_id

     having count(*) > 1

     set @rowcnt = @cnt - 1

     set rowcount @rowcnt

     

     Delete from duplicate_tab

     Where site_id =  @minsite

     set rowcount 0

     

    -- determine the next site_id that has duplicates but is still the minimum in the remaining list

     select @minsite = Min(site_id) from duplicate_tab Where

     site_id > @minsite and

     Exists(Select site_id,count(*) as count from duplicate_tab

     group by site_id

     having count(*) > 1)

    End

     -Jeannine

  • Hi friend,

                 check out this query

    (Comment_id, site_id, date_entered, user_id, comment)

    6978, AL28492, 07/17/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

    8161, AL28492, 08/15/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

    8702, AL28492, 09/11/2005, 408, No longer a Franchise, replace w/ non-franchise signage

    Delete from tblTest where Comment_id not in(select max(Comment_id) from tblTest group by site_id, user_id)

    bye

    prabhanjan

    prabhanjanbr@yahoo.com

  • Just to let you know that this has been a real help for me. I was looking to eliminate duplicates with different timestamp and I got stuck for a while as I'm just doing SQL once in a while, so I'm very rusty. Thanks for the query.

    Regards,

    Martin

     

     

  • Glad my post could help. I was surprised to see activity on this post a year later

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply