T-SQL Puzzler - identify sets that have the same items (where set ID and items in same table)

Question

Post reply

T-SQL Puzzler - identify sets that have the same items (where set ID and items in same table)

Laurence Neville

SSC Eights!

Points: 809
More actions
February 25, 2015 at 11:22 am

#298044

I am struggling to come up with a set-based solution for this problem (i.e. that doesn't involve loops/cursors) .... any help gratefully appreciated!

A table contains items (identified by an ItemCode) and the set they belong to (identified by a SetId). Here is some sample data:

SetId ItemCode

1 A

1 B

2 4

2 8

2 6

3 10

3 12

4 10

4 18

5 8

6 12

7 16

8 16

9 16

10 A

10 B

11 4

11 6

11 8

You can see that there are some sets that have the same members:

- 1 and 10

- 2 and 11

- 7, 8 & 9

What I want to do is identify the sets that have the same members, by giving them the same ID in another column called UniqueSetId.

Any ideas?

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Luis Cazares SSC Guru Points: 183686 More actions · Answer 1

This is a possible option.



CREATE TABLE SetsItems(

    SetId   int,

    ItemCode varchar(10))

INSERT INTO SetsItems

    SELECT 1,	'A' UNION ALL

    SELECT 1,	'B' UNION ALL

    SELECT 2,	'4' UNION ALL

    SELECT 2,	'8' UNION ALL

    SELECT 2,	'6' UNION ALL

    SELECT 3,	'10' UNION ALL

    SELECT 3,	'12' UNION ALL

    SELECT 4,	'10' UNION ALL

    SELECT 4,	'18' UNION ALL

    SELECT 5,	'8' UNION ALL

    SELECT 6,	'12' UNION ALL

    SELECT 7,	'16' UNION ALL

    SELECT 8,	'16' UNION ALL

    SELECT 9,	'16' UNION ALL

    SELECT 10,	'A' UNION ALL

    SELECT 10,	'B' UNION ALL

    SELECT 11,	'4' UNION ALL

    SELECT 11,	'6' UNION ALL

    SELECT 11,	'8';

WITH cteSets AS(

    SELECT DISTINCT SetID,

        CAST( (SELECT ',' + ItemCode

                FROM SetsItems i

                WHERE s.SetId = i.SetId

                ORDER BY ItemCode

                FOR XML PATH('')) AS varchar(max)) Items

    FROM SetsItems s

)

SELECT *,

    RANK() OVER( ORDER BY Items) Option1,

    DENSE_RANK() OVER( ORDER BY Items) Option2

FROM cteSets

ORDER BY Option1

GO

DROP TABLE SetsItems

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Mark Cowne One Orange Chip Points: 26865 More actions · Answer 2

Here's another (using Luis's setup)



WITH Results(SetId1,SetId2) AS (

SELECT a.SetId,b.SetId

FROM SetsItems a

INNER JOIN SetsItems b ON b.SetId > a.SetId

                 AND b.ItemCode = a.ItemCode

GROUP BY a.SetId,b.SetId

HAVING COUNT(*) = (SELECT COUNT(*) FROM SetsItems c WHERE c.SetId = a.SetId)

   AND COUNT(*) = (SELECT COUNT(*) FROM SetsItems d WHERE d.SetId = b.SetId))

SELECT DISTINCT ca.SetId,

       DENSE_RANK() OVER(ORDER BY a.SetId1) AS UniqueSetId

FROM Results a

CROSS APPLY(VALUES(a.SetId1),(a.SetId2)) ca(SetId)

WHERE NOT EXISTS(SELECT * FROM Results b WHERE b.SetId2 = a.SetId1)

ORDER BY UniqueSetId,SetId;

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

Laurence Neville SSC Eights! Points: 809 More actions · Answer 3

Laurence Neville

SSC Eights!

Points: 809

February 26, 2015 at 12:20 pm

#1780361

Both solutions worked, thanks very much. For some reason Luis's is much faster on my large data set (millions of rows).

siggemannen SSCrazy Points: 2436 More actions · Answer 4

siggemannen

SSCrazy

Points: 2436

February 28, 2015 at 4:15 am

#1780868

I usually do something like this:



SELECT	SetId, CHECKSUM_AGG(CHECKSUM(ItemCode)) ch

FROM	SetsItems

GROUP BY SetId

ORDER BY ch

If you're **really** unlucky you might get a false positive because CHECKSUM isn't perfect, but i doubt it. You can do a HASHBYTES on the ItemCode to pretty much avoid any collisions.

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 5

This is a simple case of Relational Division, of which I am particular fond of.

The result isSetID SetID

8 9By using this piece of codeSELECT kc.SetID,



		nc.SetID

FROM		(

			SELECT		SetID,

					COUNT(*) AS cnt

			FROM		dbo.SetsItems

			GROUP BY	SetID

		) AS kc

INNER JOIN	(

			SELECT		SetID,

					COUNT(*) AS cnt

			FROM		dbo.SetsItems

			GROUP BY	SetID

		) AS nc ON nc.cnt = kc.cnt

			AND nc.SetID > kc.SetID

INNER JOIN	dbo.SetsItems AS t ON t.SetID = kc.SetID

INNER JOIN	dbo.SetsItems AS n ON n.SetID = nc.SetID

WHERE		t.ItemCode = n.ItemCode

GROUP BY	kc.SetID,

		nc.SetID

HAVING COUNT(*) = MIN(nc.cnt);

N 56°04'39.16"
E 12°55'05.25"

SwePeso SSC-Dedicated Points: 39725 More actions · Answer 6

-- SwePeso - Preaggregation/filtering step



SELECT		SetID,

		COUNT(*) AS cnt,

		MIN(ItemCode) AS mn,

		MAX(ItemCode) AS mx,

		CHECKSUM_AGG(CHECKSUM(ItemCode)) AS chk

INTO		#Temp

FROM		dbo.SetsItems WITH (NOLOCK)

GROUP BY	SetID;

-- SwePeso - projection step

SELECT		ROW_NUMBER() OVER (ORDER BY w.chk) AS UniqueSetID,

		STUFF(f.Data, 1, 2, N'') AS SetIDs

FROM		(

			SELECT		cnt,

					mn,

					mx,

					chk

			FROM		#Temp

			GROUP BY	cnt,

					mn,

					mx,

					chk

			HAVING		COUNT(*) >= 2

		) AS w

CROSS APPLY	(

			SELECT		', ' + CAST(x.SetID AS NVARCHAR(12))

			FROM		#Temp AS x

			WHERE		x.cnt = w.cnt

					AND x.mn = w.mn

					AND x.mx = w.mx

					AND x.chk = w.chk

			FOR XML		PATH('')

) AS f(Data);

N 56°04'39.16"
E 12°55'05.25"

Laurence Neville SSC Eights! Points: 809 More actions · Answer 7

Laurence Neville

SSC Eights!

Points: 809

March 1, 2015 at 7:49 am

#1780958

Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

Dwain Camps SSC Guru Points: 86913 More actions · Answer 8

Laurence Neville (3/1/2015)

Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

:blush:

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St