Removing partially duplicate records in SQL

Question

Post reply

Removing partially duplicate records in SQL

Coconut

Ten Centuries

Points: 1087
More actions
October 14, 2008 at 2:35 pm

#192227

Hello,

I have a table with 1 billion records and I'd like to delete some records that are partially duplicate.

Ex) the table below has three records

RECID COL1 COL2 COL3

---------------------------------------

1 A B C

2 A B 'blank'

3 A 'blank' C

As you can see, RECID 2 and RECID 3 are duplicates because RECID 1 has all the values that 2 and 3 have.

Since I have 1 Billion records, I don't want to go through row by row.

What would be the best way to eliminate the duplicates in this case?

Thanks a lot in advance.

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

Jack Corbett SSC Guru Points: 184388 More actions · Answer 1

Jack Corbett

SSC Guru

Points: 184388

October 14, 2008 at 2:44 pm

#884904

What are your rules for determining if a row is a duplicate?

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Coconut Ten Centuries Points: 1087 More actions · Answer 2

Thanks Jack,

I already removed the pure duplicates (same values in all the columns), so the basic rule is if record A has values (excluding 'blank') that all exists in another record, it's a duplicate.

RECID COL1 COL2 COL3

---------------------------------------

1 A B C

2 A B 'blank'

3 A 'blank' C

So, RECID 2 and RECID 3 are duplicates because the values (A, B in RECID 2 and A, C in RECID 3 excluding 'blank') all exist in RECID 1.

below is to create the sample table

create table dbo.TEMP (RECID int identity(1,1), COL1 varchar(1), COL2 varchar(1), COL3 varchar(1))

insert into TEMP select 'A','B','C'

insert into TEMP select 'A','B',''

insert into TEMP select 'A','','C'

I need to find out how to select RECID 2 and RECID 3, so that I can delete them from the table.

Once again, unfortunately, I can't do row by row comparison just because the input table is too big.

Thanks a lot

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 3

Max,

Can there be a 4th possibility or not?

RECID COL1 COL2 COL3

---------------------------------------

1 A B C

2 A B 'blank'

3 A 'blank' C

4 'blank' B C

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Garadin One Orange Chip Points: 29613 More actions · Answer 4

And if you have something like:

1 A B C

2 A D C

3 A E C

4 A B F

Do you want to keep all of them, or...? If you don't keep them all, how do you While using real data isn't necessary, it tends to be a lot more intuitive if you supply "dummy" data, rather than just random letters.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Garadin One Orange Chip Points: 29613 More actions · Answer 5

And if you have something like:

1 A B C

2 A D C

3 A E C

4 A B F

Do you want to keep all of them, or...? If you don't keep them all, how do you merge the different values? If you can just merge the values and they aren't relational to the rows, why is the table set up like that in the first place? While using real data isn't necessary, it tends to be a lot more intuitive if you supply "dummy" data, rather than just random letters.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Coconut Ten Centuries Points: 1087 More actions · Answer 6

thanks for your reply.

To answer Jeff's question, it's yes. The table contains about 1 billion records and about 10 differect columns (COL1 to COL10).

I'd like to delete the 4th record as well because Col2 = B and Col = C exist in the 1st record.

To answer Seth's question, yes, i have to keep them all since each record is unique.

For example)

[font="Courier New"]RECID SOURCE NAME PHONE SSN USER_ID

---------------------------------------------------------------------------

1 A4 JEFF 29489 98579 A535

2 B2 JEFF 98579 A535

3 G5 29489 98579 A535

4 H9 JEFF 98868 98579 A535[/font]

In the example above, there are four records for a same person (USER_ID = A535).

I can delete RECID 2 and 4, since all the information (NAME, PHONE, SSN) that I can get from them exists in RECID 1.

But I have to keep RECID 4 because it has new information about the person (a new phone number).

Unfortunately, I can't aggregate / propagate to keep only the unique values since I need to keep where the record came from (SOURCE).

Thanks a lot

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 7

I think you want to make multiple passes, remove those duplicates based on the combinations. Jeff might have a one pass method, but I think the simplicity might run quickly even with multiple passes.

Is this a one time thing or is it a recurring item?

Also, you have 1 and 2 as duplicates and 1 and 3 as duplicates, so are you saying that in your example, only row 1 remains?

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Jack Corbett SSC Guru Points: 184388 More actions · Answer 8

Something I think needs to be asked, source is different in all the rows, why doesn't this matter? The data was stored for some reason originally, do you no longer care about source? If you do, why are you throwing out rows without considering the source column? If you don't why not remove the column?

Jack Corbett
Consultant - Straight Path Solutions
Check out these links on how to get faster and more accurate answers:
Forum Etiquette: How to post data/code on a forum to get the best help
Need an Answer? Actually, No ... You Need a Question

Garadin One Orange Chip Points: 29613 More actions · Answer 9

Oh yeah, that's much better than letters =). Jeff might have been doing this for so long that he just intuitively grasps it all, but I still need my training wheels. What is the feasibility of changing this table structure to keep 1 row in this table for items that will always be unique to the employee (such as SSN and EmployeeID), and creating a second table to store phone numbers that can be a 1 to many relationship? If you could apply that type of formatting, that would likely make a huge impact on the amount of rows that table contains, and give you more normalized data to boot.

Seth Phelabaum

Consistency is only a virtue if you're not a screwup. 😉

Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

Coconut Ten Centuries Points: 1087 More actions · Answer 10

Thanks Steve,

It should be a recurring process. Each week, I add about 3 million records to the big table but most of them are duplicates.

Yes, in the first example, I will have only one record (RECID = 1).

I am not quite sure what the pass method is, but since the table is big, the deduping process should be fast enough, so I can finish it over the weekend.

Below is another example,

create table CF_TEMP.dbo.TEMP (

RECID int identity(1,1),

COL1 varchar(1),

COL2 varchar(2),

COL3 varchar(3),

USERID int)

insert into CF_TEMP.dbo.TEMP select 'A', 'B', 'C', 1

insert into CF_TEMP.dbo.TEMP select 'A', 'B', '', 1

insert into CF_TEMP.dbo.TEMP select 'A', '', 'C', 1

insert into CF_TEMP.dbo.TEMP select 'F', '', 'C', 1

insert into CF_TEMP.dbo.TEMP select '', 'M', '', 2

insert into CF_TEMP.dbo.TEMP select 'T', 'M', 'O', 2

insert into CF_TEMP.dbo.TEMP select '', 'M', 'O', 2

insert into CF_TEMP.dbo.TEMP select 'X', 'M', 'O', 2

USER_ID indicates they are for a same person. So I remove the duplicate within a same user_ID.

RECID COL1 COL2 COL3 USERID

-----------------------------------------------

1 A B C 1

2 A B 1 -- DUPLICATE to RECID 1

3 A C 1 -- DUPLICATE to RECID 1

4 F C 1

5 M 2 -- DUPLICATE to RECID 6

6 T M O 2

7 M O 2 -- DUPLICATE to RECID 7

8 X M O 2

In the end, I'd like to have only 4 records as below

RECID COL1 COL2 COL3 USERID

-----------------------------------------------

1 A B C 1

4 F C 1

6 T M O 2

8 X M O 2

Again, thanks a lot

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 11

How do you know which one to keep? The first one? Or the one that has all the fields?

That's important otherwise you could say row 2 duplicates row 1.

Multiple passes would be looking for a pattern like Row 2, where Col1 and 2 match, but Col3 is a null/blank. Then a second pattern would be col1 and 3 match, but 2 has a null/blank.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Coconut Ten Centuries Points: 1087 More actions · Answer 12

Thanks Steve,

I need to keep a record that has more informationl like RECID 1.

I need to keep both records in the example below.

RECID COL1 COL2 COL3

--------------------------------------

1 A B (blank)

2 (blank) B C

It seems I can't use the multiple pass method because it will take too much time to process the big table.

Thanks Jack and Seth,

Unfortunately, I can't normalize the table.

It should contain all the records. The purpose of this table is to assign the unique person ID.

Let's say I have a record as below

MARY JANE 123 MAIN ST 203-498-4982 498489889 1999-01-02 ... and many more

later I receive another record

MARY SMITH 456 SOUTH ST 245-364-4112 498489889 1999-01-02 ... and many more

I need to find the records above are for a same person even though she changed her name, address and phone number and assign a same unique ID for both records.

But most of cases, the records I recieve are duplicates or have lesser values as show below

MARY JANE 123 MAIN ST 203-498-4982 1999-01-02 ... and many more

It's not a pure duplicate since the third record doesn't have SSN populated but I have all the information from the 3rd record in the 1st record, so I don't have to keep it in the table.

The reason why I need to keep the source column and keep every record separately is I need to do some audit and I should be able to recreate / refresh the person ID in case there is a problem with the input data.

For exmple, the ETL group messed up a file and the first name and last name are reversed in the file received yesterday. I need to go back to the big table and should be able to correct the issue.

thanks for your help! I really appreciate it.

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 13

It's doable in one step.

See here...



DECLARE	@Sample TABLE

	(

		recID INT IDENTITY(1, 1),

		col1 VARCHAR(1),

		col2 VARCHAR(2),

		col3 VARCHAR(3),

		userID INT

	)

INSERT	@Sample

SELECT	'A', 'B', 'C', 1 UNION ALL

SELECT	'A', 'B', '', 1 UNION ALL

SELECT	'A', '', 'C', 1 UNION ALL

SELECT	'F', '', 'C', 1 UNION ALL

SELECT	'', 'M', '', 2 UNION ALL

SELECT	'T', 'M', 'O', 2 UNION ALL

SELECT	'', 'M', 'O', 2 UNION ALL

SELECT	'X', 'M', 'O', 2

SELECT	*

FROM	@Sample

DELETE		s

FROM		@Sample AS s

INNER JOIN	(

			SELECT		MIN(recID) AS minID,

					col1,

					col2,

					col3,

					userID

			FROM		@Sample

			WHERE		'' NOT IN (col1, col2, col3)

			GROUP BY	col1,

					col2,

					col3,

					userID

		) AS w ON w.userID = s.userID

			AND w.minID <> s.recID

WHERE		w.col1 = COALESCE(NULLIF(s.col1, ''), w.col1)

		AND w.col2 = COALESCE(NULLIF(s.col2, ''), w.col2)

		AND w.col3 = COALESCE(NULLIF(s.col3, ''), w.col3)

SELECT	*

FROM	@Sample

N 56°04'39.16"
E 12°55'05.25"

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 14

Did some work for a mailing shop some years ago, we used a method like this for the same thing:

[font="Courier New"]DROP TABLE #Test



CREATE TABLE #Test (RECID INT, COL1 CHAR(1), COL2 CHAR(1), COL3 CHAR(1))

INSERT INTO #Test (RECID, COL1, COL2, COL3)

SELECT 1, 'A', 'B', 'C' UNION ALL

SELECT 2, 'A', 'B', ' ' UNION ALL

SELECT 3, 'A', ' ', 'C' UNION ALL

SELECT 4, ' ', 'B', 'C' UNION ALL

SELECT 5, ' ', 'B', 'C' 

SELECT a.RECID AS RECIDmaster, b.RECID AS COL1BlankDupe, c.RECID AS COL2BlankDupe, d.RECID AS COL3BlankDupe

FROM #Test a

INNER JOIN #Test b

   ON a.RECID <> b.RECID

   AND (b.COL1 = ' ' AND a.COL1 <> ' ')

   AND b.COL2 = a.COL2

   AND b.COL3 = a.COL3

INNER JOIN #Test c

   ON a.RECID <> c.RECID

   AND c.COL1 = a.COL1

   AND (c.COL2 = ' ' AND a.COL2 <> ' ')

   AND c.COL3 = a.COL3

INNER JOIN #Test d

   ON a.RECID <> d.RECID

   AND d.COL1 = a.COL1

   AND d.COL2 = a.COL2

   AND (d.COL3 = ' ' AND a.COL3 <> ' ')

ORDER BY a.RECID

[/font]

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden