Deleting duplicates from 37 million Records

Question

Post reply

Deleting duplicates from 37 million Records

DBTeam

SSC Eights!

Points: 956
More actions
November 20, 2009 at 2:50 am

#206692

Hi All,

We have one 37 million table having 20 + columns with all the person information, But almost, 7 million records having duplicate emails

But other fields are different.

Below is the way we have in our mind, Pls suggest me better than this.

Insert into New

select * from [37million]

where Rid in(select max(Rid) from [37million] group by email)

Pls suggest better than this..............

Thnx in advance

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Henrico Bekker One Orange Chip Points: 27652 More actions · Answer 1

We have one 37 million table having 20 + columns with all the person information, But almost, 7 million records having duplicate emails

But other fields are different.

Insert into New

select * from [37million]

where Rid in(select max(Rid) from [37million] group by email)

You'll need to do a select for distinct and count where the same address > 1...

select distinct from [37million]

group by

having count(distinct) > 1

note - this will identify all the emails appearing more than once, you will need an extra line of filtering when deleting it or inserting into, or else you'll move all the qualifying rows, not only the duplicates.

Use your RID to identify the rows to move...

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This thing is addressing problems that dont exist. Its solution-ism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle

DBTeam SSC Eights! Points: 956 More actions · Answer 2

DBTeam

SSC Eights!

Points: 956

November 20, 2009 at 3:19 am

#1082066

Deleting duplicates in that will take long time i guess,

so i want create a fresh table with all the records including non duplicates. once it got inserted , after that i'll delete that table.

Please suggest me the best way to do that

DBTeam SSC Eights! Points: 956 More actions · Answer 3

Insert into New

select * from [37million]

where Rid in(select max(Rid) from [37million] group by email)

or

DELETE FROM [Table]

WHERE [RID] IN

(

SELECT a.[RID]

FROM [Table] a,

[Table] b

WHERE a.[RID]!= b.[RID]

and a.[RID]< b.[RID]

AND a.[Email]= b.[Email]

)

Please tell me which one is better among the above two.......

Lynn Pettis SSC Guru Points: 442447 More actions · Answer 4

Lynn Pettis

SSC Guru

Points: 442447

November 20, 2009 at 6:59 am

#1082184

Can you provide a partial DDL for the table, primary key(s) and e-mail columns, and some sample data for the table. Also, how do you want to identify which record of the duplicate e-mail address do you want to keep?

DBTeam SSC Eights! Points: 956 More actions · Answer 5

DBTeam

SSC Eights!

Points: 956

November 20, 2009 at 7:07 am

#1082194

Sample Data is

Id FirstName LastName Address1 Address2 City State Zip .......

20+ columns, Emailaddress

all having datatype nvarchar2(255) except ID Column

Any emailaddress is Ok No priority like that.......

Lynn Pettis SSC Guru Points: 442447 More actions · Answer 6



with EmailRecords as (

select

    row_number() over (partition by Emailaddress order by Id desc) as RowNumber -- Assuming Highest Id is most recent

    Id,

    FirstName,

    LastName[,...]

from

    dbo.[37MillionRecTable]

)

insert into dbo.[NewTable]

select

    Id,

    FirstName[,...] -- Note, do not include RowNumber

from

    EmailRecords

where

    RowNumber = 1;

/*

After running this, you can drop the original table then rename [NewTable] to the original table name.

*/

The above code is untested and truly incomplete as you didn't really provide us with anything to allow testing.

bwilliams-1049831 Mr or Mrs. 500 Points: 502 More actions · Answer 7

How many duplicates do you have?

This might do it for you. set @rcnt toi the number of duplicates.

I know this is a cursor. But I do come from the oracle world

You will need to change it for your tables but I just used this to delete over 600 duplicates in 600,000 records. You can use @@recordcount and not use the @rcnt.

declare @rcnt int

select @rcnt = 700

While @rcnt > 0

BEGIN

delete top (1)

from tmp_industry_stock

where customer in (Select Customer

from tmp_industry_stock

group BY Country,Sector,industry,Customer

having count(*) > 1);

select @rcnt = @rcnt - 1;

END

Vasc SSCarpal Tunnel Points: 4623 More actions · Answer 8

LOL had same issue (~100M) .... BUT do you have a proper way to say WHICH of those records are the VALID ones? It might not be the MAX(ID) take that into consideration...

Solution I used was to delete in batches the duplicates (not sure what servers other pple have but mine would just crawl if I'd try to move such big amounts in 1 transaction on live and the log file would just blow). For the starters try to remove the duplicates from every X MILS records (note that duplicate emails won't be too apart ID's speaking pple have the tendency to double register in short period of times not too far apart anyway, this is assuming no index on email column). When your done with this type of removing duplicates either go bigger XX or try the whole table. Watch your performance indicators before deciding to go bigger 🙂

Kindest Regards,

Vasc

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 9

bwilliams:

1) welcome to sql server and the sqlservercentral forums

2) PLEASE PLEASE PLEASE do yourself a favor and learn set-based operations. that code you posted would be an unbelievable DOG on sql server. massively inefficient compared to how sql server best operates. you are in a different world now and when in Rome . . . 🙂

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service