Finding and Deleting Duplicate Data

Question

Post reply

Finding and Deleting Duplicate Data

Mike Collins-249907

SSC Enthusiast

Points: 122
More actions
October 20, 2005 at 5:23 pm

#89798

I have a table that has many duplicate records and I need to delete all but

one of the duplicate records. Is there a way I can identify these records with a sql

statement?

Example records Where one needs to stay while the other two get deleted:

(Comment_id, site_id, date_entered, user_id, comment)

6978, AL28492, 07/17/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

8161, AL28492, 08/15/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

8702, AL28492, 09/11/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

The comment field is a data type text, if that makes a difference.

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply

sushila SSC-Dedicated Points: 35293 More actions · Answer 1

Mike - you don't say which 2 get deleted - would it be the min or max comment_id or the min or max date_entered that would stay and the others deleted ?!?!

Here's a select you can "play around" with..and then you can work on your "delete" once you get the "select" to give you the results you want...

select t2.comment_id, t1.site_id, t1.date_entered, t1.user_id, t1.comment
from tblTest t1
join 
(select min(comment_id) as comment_id from tblTest 
group by site_id, user_id)t2
on t1.comment_id = t2.comment_id

**ASCII stupid question, get a stupid ANSI !!!**

Mike Collins-249907 SSC Enthusiast Points: 122 More actions · Answer 2

I tried what you said and I'm getting 3 errors (all the same) "line 1: The column prefix 't1' does not match with a table name or alias name used in the query."...And my little brain says it looks correct.

BTW, According to my boss, I can keep whichever comment row I decide...just as long as the others are gone. I was planning on keeping the ones with the max comment_id.

select t2.comment_id, t1.site_id, t1.date_entered, t1.user_id, t1.comment

from statusrptcomments1 t1

join

(select min(t1.comment_id) as comment_id from statusrptcomments1

group by t1.site_id, t1.user_id) t2 on t1.comment_id = t2.comment_id

rsharma SSCarpal Tunnel Points: 4865 More actions · Answer 3

Once you decide on which fields you can use to denote the duplicate records i.e. is it combination of user_id and site_id itself that makes it a duplicate ?

If so, then you can do something like this:

CREATE TABLE TESTCASE (COL1 INT, COL2 DATETIME, COL3 VARCHAR(100), SITE_ID VARCHAR(20), USER_ID INT, COMMENT TEXT)

GO

INSERT INTO TESTCASE VALUES (1, GETDATE(), 'FIRST ROW', 'AL28492', 408, 'SAME ROW')

INSERT INTO TESTCASE VALUES (2, DATEADD(mi, 100, GETDATE()), 'SECOND ROW', 'AL28492', 408, 'SAME ROW')

INSERT INTO TESTCASE VALUES (3, DATEADD(mi, 1000, GETDATE()), 'THIRD ROW', 'AL28492', 408, 'SAME ROW')

GO

--PRESERVE THE LAST ROW

DELETE FROM TESTCASE

WHERE EXISTS (SELECT 1 FROM TESTCASE DUPSINNER

WHERE DUPSINNER.SITE_ID = TESTCASE.SITE_ID

AND DUPSINNER.USER_ID = TESTCASE.USER_ID

AND DUPSINNER.COL1 > TESTCASE.COL1)

--PRESERVE THE FIRST ROW

DELETE FROM TESTCASE

WHERE EXISTS (SELECT 1 FROM TESTCASE DUPSINNER

WHERE DUPSINNER.SITE_ID = TESTCASE.SITE_ID

AND DUPSINNER.USER_ID = TESTCASE.USER_ID

AND DUPSINNER.COL1 < TESTCASE.COL1)

And once you have deleted and fixed the data, you should enforce a unique constraint on the combination that should be unique in this table.

Vasc SSCarpal Tunnel Points: 4623 More actions · Answer 4

select t2.comment_id, t1.site_id, t1.date_entered, t1.[user_id], t1.comment

from dbo.statusrptcomments1 t1 inner join

(select min(comment_id) as comment_id from dbo.statusrptcomments1

group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

DELETE t1

FROM dbo.statusrptcomments1 t1 LEFT OUTER JOIN

(select min(comment_id) as comment_id from dbo.statusrptcomments1

group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

WHERE t2.Comment_ID IS NULL

Kindest Regards,

Vasc

Mike Collins-249907 SSC Enthusiast Points: 122 More actions · Answer 5

Thanks Vasc, but I just finished using the example that Journeyman gave me and it did exactly what I needed.

Thanks to all of you for your help...now that I look at the examples, it seems so simple. Basically, as I read the examples, I need an inner sql statement that makes a row unique and then an outer statement that deletes all the other rows...Or in the case of Journeyman's example, an inner statement that pulls all the non-unique rows and deletes them. Is this correct?

Vasc SSCarpal Tunnel Points: 4623 More actions · Answer 6

Well rsharma is deleting all the post made by a user in one dep but keeps 1 (a user will have MAX 1 post for each dep) while I m deleting all duplicate posts made by a user is a certain dep : )

DECLARE @TESTCASE TABLE(COL1 INT, COL2 DATETIME, COL3 VARCHAR(100), SITE_ID VARCHAR(20), USER_ID INT, COMMENT TEXT)

INSERT INTO @TESTCASE VALUES (1, GETDATE(), 'FIRST ROW', 'AL28492', 408, 'SAME ROW')

INSERT INTO @TESTCASE VALUES (2, DATEADD(mi, 100, GETDATE()), 'SECOND ROW', 'AL28492', 408, 'SAME ROW')

INSERT INTO @TESTCASE VALUES (3, DATEADD(mi, 1000, GETDATE()), 'THIRD ROW', 'AL28492', 408, 'SAME ROW')

INSERT INTO @TESTCASE VALUES (4, DATEADD(mi, 2000, GETDATE()), 'THIRD ROW', 'AL28492', 408, 'NEW ROW')

--PRESERVE THE LAST ROW

/*

DELETE FROM @TESTCASE

WHERE EXISTS (SELECT 1 FROM @TESTCASE DUPSINNER

WHERE DUPSINNER.SITE_ID = @TESTCASE.SITE_ID

AND DUPSINNER.USER_ID = @TESTCASE.USER_ID

AND DUPSINNER.COL1 > @TESTCASE.COL1)

*/

--PRESERVE THE FIRST ROW

DELETE T FROM @TESTCASE T

WHERE EXISTS (SELECT 1 FROM @TESTCASE DUPSINNER

WHERE DUPSINNER.SITE_ID = T.SITE_ID

AND DUPSINNER.USER_ID = T.USER_ID

AND DUPSINNER.COL1 < T.COL1)

SELECT * FROM @TESTCASE

VS

declare @T table (Comment_id integer, site_id varchar(10), date_entered datetime, [user_id] integer, comment varchar(50))

--replace @T with @T

INSERT INTO @T(Comment_id, date_entered, site_id, [user_id], comment) VALUES (1, GETDATE(), 'AL28492', 408, 'SAME ROW')

INSERT INTO @T(Comment_id, date_entered, site_id, [user_id], comment) VALUES (2, DATEADD(mi, 100, GETDATE()), 'AL28492', 408, 'SAME ROW')

INSERT INTO @T(Comment_id, date_entered, site_id, [user_id], comment) VALUES (3, DATEADD(mi, 1000, GETDATE()), 'AL28492', 408, 'SAME ROW')

INSERT INTO @T(Comment_id, date_entered, site_id, [user_id], comment) VALUES (4, DATEADD(mi, 2000, GETDATE()), 'AL28492', 408, 'NEW ROW')

SELECT * FROM @T

select t2.comment_id, t1.site_id, t1.date_entered, t1.[user_id], t1.comment

from @T t1 inner join

(select min(comment_id) as comment_id from @T

group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

DELETE t1

FROM @T t1 LEFT OUTER JOIN

(select min(comment_id) as comment_id from @T

group by site_id, [user_id],comment) t2 on t1.comment_id = t2.comment_id

WHERE t2.Comment_ID IS NULL

Kindest Regards,

Vasc

jvarunok SSC Enthusiast Points: 104 More actions · Answer 7

If if doesn't matter which record you keep you could just create a cursor that includes each duplicate and the number of times the duplicate appears in the table. Loop through the cursor and each time set rowcount to the number of duplicates minus 1 then do a delete from the table where the site id is equal to your cursor value.

First just to see your duplicates run this;

Select site_id,count(*) as count from duplicate_tab

group by site_id

having count(*) > 1

Then you can run this;

declare

@site varchar(50),

@cnt int,

@rowcnt int

/** This get all Site_ID's that have more than one in the table **/

Declare c_site cursor FOR

Select site_id,count(*) as count from duplicate_tab

group by site_id

having count(*) > 1

Open c_site

Fetch Next from c_site into @site, @cnt

While (@@Fetch_Status = 0)

Begin

set @rowcnt = @cnt - 1

set rowcount @rowcnt

Delete from duplicate_tab

Where site_id = @site

set rowcount 0

Fetch Next From c_site into @site, @cnt

End

close c_site

deallocate c_site

--------------------------------------------------------

This always worked for me.

Thanks,

Jeannine

Mike Collins-249907 SSC Enthusiast Points: 122 More actions · Answer 8

Mike Collins-249907

SSC Enthusiast

Points: 122

October 21, 2005 at 9:50 am

#599180

Thanks...I try to stay away from cursors, but I like the select statement you did.

sushila SSC-Dedicated Points: 35293 More actions · Answer 9

What a wonderfully diplomatic response Mike - applause!

Jeannine - I wanted to warn you before the "CursorsAreEvil" police come swooping down on you - please search this site for cursors etc.. and you'll find an abundance of articles, posts etc.. that explain at great length about why cursors are evil and how you must avoid them whenever possible...

**ASCII stupid question, get a stupid ANSI !!!**

jvarunok SSC Enthusiast Points: 104 More actions · Answer 10

Yes cursors are nasty if you have a lot of data. But as a quick looping example they are good.

I usually loop through the table using one of the column identifiers. Usually they are the PK or at least indexed.

Below is an example that uses the method of determining the minimum and maximum values for the id column and then augmenting the ID each time through the loop.

Declare

@maxsite varchar(50),

@minsite varchar(50),

@cnt int,

@rowcnt int

-- Get the min and max site_id values where the site ID has dups

select @minsite = Min(site_id), @maxsite = Max(Site_ID) from duplicate_tab Where Exists(Select site_id,count(*) as count from duplicate_tab

group by site_id

having count(*) > 1)

-- Loop through the site_id's based on the min and max

While (@minsite <= @maxsite)

Begin

select @cnt = count(*) from duplicate_tab Where

site_id = @minsite

group by site_id

having count(*) > 1

set @rowcnt = @cnt - 1

set rowcount @rowcnt

Delete from duplicate_tab

Where site_id = @minsite

set rowcount 0

-- determine the next site_id that has duplicates but is still the minimum in the remaining list

select @minsite = Min(site_id) from duplicate_tab Where

site_id > @minsite and

Exists(Select site_id,count(*) as count from duplicate_tab

group by site_id

having count(*) > 1)

End

-Jeannine

Top Star Grasshopper Points: 22 More actions · Answer 11

Hi friend,

check out this query

(Comment_id, site_id, date_entered, user_id, comment)

6978, AL28492, 07/17/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

8161, AL28492, 08/15/2005, 408, No longer a Franchise, replace w/ non-franchise signage.

8702, AL28492, 09/11/2005, 408, No longer a Franchise, replace w/ non-franchise signage

Delete from tblTest where Comment_id not in(select max(Comment_id) from tblTest group by site_id, user_id)

bye

prabhanjan

prabhanjanbr@yahoo.com

Manko T SSC Enthusiast Points: 147 More actions · Answer 12

Manko T

SSC Enthusiast

Points: 147

October 10, 2006 at 6:59 pm

#665110

Just to let you know that this has been a real help for me. I was looking to eliminate duplicates with different timestamp and I got stuck for a while as I'm just doing SQL once in a while, so I'm very rusty. Thanks for the query.

Regards,

Martin

Mike Collins-249907 SSC Enthusiast Points: 122 More actions · Answer 13

Mike Collins-249907

SSC Enthusiast

Points: 122

October 10, 2006 at 7:47 pm

#665115

Glad my post could help. I was surprised to see activity on this post a year later