Removing partially duplicate records in SQL

Question

Post reply

Removing partially duplicate records in SQL

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 1

SwePeso

SSC-Dedicated

Points: 39725

October 15, 2008 at 9:04 am

#885299

What about this sample data?



SELECT	'', '', 'Y', 3 UNION ALL

SELECT	'', 'X', 'Y', 3

Should first line be deleted since it is a subset of second line?

N 56°04'39.16"
E 12°55'05.25"

Coconut Ten Centuries Points: 1087 More actions · Answer 2

Coconut

Ten Centuries

Points: 1087

October 15, 2008 at 9:10 am

#885303

Thanks Peso,

Yes, The first line should be deleted. Thanks for your codes. I am still trying to understand it.

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 3

Steve Jones - Editor (10/15/2008)

Multiple passes would be looking for a pattern like Row 2, where Col1 and 2 match, but Col3 is a null/blank. Then a second pattern would be col1 and 3 match, but 2 has a null/blank.

Max (10/15/2008)

Yes, The first line should be deleted. Thanks for your codes. I am still trying to understand it.

In that case, use this more generic one step delete algorithm



DECLARE	@Sample TABLE

	(

		recID INT IDENTITY(1, 1),

		col1 VARCHAR(1),

		col2 VARCHAR(2),

		col3 VARCHAR(3),

		userID INT

	)

INSERT	@Sample

SELECT	'A', 'B', 'C', 1 UNION ALL

SELECT	'A', 'B', '', 1 UNION ALL

SELECT	'A', '', 'C', 1 UNION ALL

SELECT	'F', '', 'C', 1 UNION ALL

SELECT	'', 'M', '', 2 UNION ALL

SELECT	'T', 'M', 'O', 2 UNION ALL

SELECT	'', 'M', 'O', 2 UNION ALL

SELECT	'X', 'M', 'O', 2 UNION ALL

SELECT	'', '', 'Y', 3 UNION ALL

SELECT	'', 'X', 'Y', 3

SELECT	*

FROM	@Sample

DELETE		s

FROM		(

			SELECT	recID,

				col1,

				col2,

				col3,

				userID,

				CASE WHEN col1 = '' THEN 0 ELSE 1 END

				+ CASE WHEN col2 = '' THEN 0 ELSE 1 END

				+ CASE WHEN col3 = '' THEN 0 ELSE 1 END AS theItems

			FROM	@Sample

		) AS s

INNER JOIN	(

			SELECT		MIN(recID) AS minID,

					col1,

					col2,

					col3,

					userID,

					CASE WHEN col1 = '' THEN 0 ELSE 1 END

					+ CASE WHEN col2 = '' THEN 0 ELSE 1 END

					+ CASE WHEN col3 = '' THEN 0 ELSE 1 END AS theItems

			FROM		@Sample

			GROUP BY	col1,

					col2,

					col3,

					userID,

					CASE WHEN col1 = '' THEN 0 ELSE 1 END

					+ CASE WHEN col2 = '' THEN 0 ELSE 1 END

					+ CASE WHEN col3 = '' THEN 0 ELSE 1 END

		) AS w ON w.userID = s.userID

			AND w.minID <> s.recID

			AND w.theItems >= s.theItems

WHERE		w.col1 = COALESCE(NULLIF(s.col1, ''), w.col1)

		AND w.col2 = COALESCE(NULLIF(s.col2, ''), w.col2)

		AND w.col3 = COALESCE(NULLIF(s.col3, ''), w.col3)

SELECT	*

FROM	@Sample

Also see http://weblogs.sqlteam.com/peterl/archive/2008/10/15/Delete-all-subset-records.aspx

N 56°04'39.16"
E 12°55'05.25"

Coconut Ten Centuries Points: 1087 More actions · Answer 4

Coconut

Ten Centuries

Points: 1087

October 15, 2008 at 9:56 am

#885323

Thanks Peso. You're brilliant. The codes look really great. I haven't fully tested it but I believe it will work.

Thanks a lot all others who participated. Have a good day!!

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 5

It turns out you can write the algorithm even cleaner.

I got a comment on my blog to use an algorithm like this instead.



DELETE		a

FROM		@Sample AS a

INNER JOIN	@Sample AS b ON b.userID = a.userID

			AND b.recID <> a.recID

WHERE		(a.col1 = b.col1 OR a.col1 = '')

		AND (a.col2 = b.col2 OR a.col2 = '')

		AND (a.col3 = b.col3 OR a.col3 = '')

It's very nice but it creates several cross joins for each userID (n * m).

I tried to create a triangular join with help of the "count of values" column ( n * m / 2).

N 56°04'39.16"
E 12°55'05.25"

Coconut Ten Centuries Points: 1087 More actions · Answer 6

Thanks Peso,

The new algorithm looks good, but I need to make sure There are no pure duplicates. If there are, it will delete them all.

INSERT @SAMPLE

SELECT 'A', 'B', 'C', 1 UNION ALL <-- duplicate

SELECT 'A', 'B', ' ', 1 UNION ALL

SELECT 'A', ' ', 'C', 1 UNION ALL

SELECT 'F', ' ', 'C', 1 UNION ALL

SELECT ' ', 'M', ' ', 2 UNION ALL

SELECT 'T', 'M', 'O', 2 UNION ALL

SELECT ' ', 'M', 'O', 2 UNION ALL

SELECT 'X', 'M', 'O', 2 UNION ALL

SELECT ' ', ' ', 'Y', 3 UNION ALL

SELECT ' ', 'X', 'Y', 3

Thanks.

p.s. Maybe I can group it by col1, col2, and col3 again and select only one of them to delete. I will run some tests. Thanks.

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 7

Easy fixed when you understand the code.



DELETE		a

FROM		@Sample AS a

INNER JOIN	(

			SELECT		MIN(recID) AS recID,

					col1,

					col2,

					col3,

					userID

			FROM		@Sample

			GROUP BY	col1,

					col2,

					col3,

					userID

		) AS b ON b.userID = a.userID

			AND b.recID <> a.recID

WHERE		(a.col1 = b.col1 OR a.col1 = '')

		AND (a.col2 = b.col2 OR a.col2 = '')

		AND (a.col3 = b.col3 OR a.col3 = '')

And now the code looks very similar to my first suggestion. The thing missing is the WHERE '' NOT IN part 😀

N 56°04'39.16"
E 12°55'05.25"