February 25, 2015 at 11:22 am
I am struggling to come up with a set-based solution for this problem (i.e. that doesn't involve loops/cursors) .... any help gratefully appreciated!
A table contains items (identified by an ItemCode) and the set they belong to (identified by a SetId). Here is some sample data:
SetId ItemCode
1 A
1 B
2 4
2 8
2 6
3 10
3 12
4 10
4 18
5 8
6 12
7 16
8 16
9 16
10 A
10 B
11 4
11 6
11 8
You can see that there are some sets that have the same members:
- 1 and 10
- 2 and 11
- 7, 8 & 9
What I want to do is identify the sets that have the same members, by giving them the same ID in another column called UniqueSetId.
Any ideas?
February 25, 2015 at 12:14 pm
This is a possible option.
CREATE TABLE SetsItems(
SetId int,
ItemCode varchar(10))
INSERT INTO SetsItems
SELECT 1, 'A' UNION ALL
SELECT 1, 'B' UNION ALL
SELECT 2, '4' UNION ALL
SELECT 2, '8' UNION ALL
SELECT 2, '6' UNION ALL
SELECT 3, '10' UNION ALL
SELECT 3, '12' UNION ALL
SELECT 4, '10' UNION ALL
SELECT 4, '18' UNION ALL
SELECT 5, '8' UNION ALL
SELECT 6, '12' UNION ALL
SELECT 7, '16' UNION ALL
SELECT 8, '16' UNION ALL
SELECT 9, '16' UNION ALL
SELECT 10, 'A' UNION ALL
SELECT 10, 'B' UNION ALL
SELECT 11, '4' UNION ALL
SELECT 11, '6' UNION ALL
SELECT 11, '8';
WITH cteSets AS(
SELECT DISTINCT SetID,
CAST( (SELECT ',' + ItemCode
FROM SetsItems i
WHERE s.SetId = i.SetId
ORDER BY ItemCode
FOR XML PATH('')) AS varchar(max)) Items
FROM SetsItems s
)
SELECT *,
RANK() OVER( ORDER BY Items) Option1,
DENSE_RANK() OVER( ORDER BY Items) Option2
FROM cteSets
ORDER BY Option1
GO
DROP TABLE SetsItems
February 25, 2015 at 3:12 pm
Here's another (using Luis's setup)
WITH Results(SetId1,SetId2) AS (
SELECT a.SetId,b.SetId
FROM SetsItems a
INNER JOIN SetsItems b ON b.SetId > a.SetId
AND b.ItemCode = a.ItemCode
GROUP BY a.SetId,b.SetId
HAVING COUNT(*) = (SELECT COUNT(*) FROM SetsItems c WHERE c.SetId = a.SetId)
AND COUNT(*) = (SELECT COUNT(*) FROM SetsItems d WHERE d.SetId = b.SetId))
SELECT DISTINCT ca.SetId,
DENSE_RANK() OVER(ORDER BY a.SetId1) AS UniqueSetId
FROM Results a
CROSS APPLY(VALUES(a.SetId1),(a.SetId2)) ca(SetId)
WHERE NOT EXISTS(SELECT * FROM Results b WHERE b.SetId2 = a.SetId1)
ORDER BY UniqueSetId,SetId;
____________________________________________________
Deja View - The strange feeling that somewhere, sometime you've optimised this query before
How to get the best help on a forum
http://www.sqlservercentral.com/articles/Best+Practices/61537February 26, 2015 at 12:20 pm
Both solutions worked, thanks very much. For some reason Luis's is much faster on my large data set (millions of rows).
February 28, 2015 at 4:15 am
I usually do something like this:
SELECT SetId, CHECKSUM_AGG(CHECKSUM(ItemCode)) ch
FROM SetsItems
GROUP BY SetId
ORDER BY ch
If you're **really** unlucky you might get a false positive because CHECKSUM isn't perfect, but i doubt it. You can do a HASHBYTES on the ItemCode to pretty much avoid any collisions.
March 1, 2015 at 2:18 am
This is a simple case of Relational Division, of which I am particular fond of.
The result isSetID SetID
1 10
2 11
7 8
7 9
8 9By using this piece of codeSELECT kc.SetID,
nc.SetID
FROM (
SELECT SetID,
COUNT(*) AS cnt
FROM dbo.SetsItems
GROUP BY SetID
) AS kc
INNER JOIN (
SELECT SetID,
COUNT(*) AS cnt
FROM dbo.SetsItems
GROUP BY SetID
) AS nc ON nc.cnt = kc.cnt
AND nc.SetID > kc.SetID
INNER JOIN dbo.SetsItems AS t ON t.SetID = kc.SetID
INNER JOIN dbo.SetsItems AS n ON n.SetID = nc.SetID
WHERE t.ItemCode = n.ItemCode
GROUP BY kc.SetID,
nc.SetID
HAVING COUNT(*) = MIN(nc.cnt);
N 56°04'39.16"
E 12°55'05.25"
March 1, 2015 at 2:27 am
-- SwePeso - Preaggregation/filtering step
SELECT SetID,
COUNT(*) AS cnt,
MIN(ItemCode) AS mn,
MAX(ItemCode) AS mx,
CHECKSUM_AGG(CHECKSUM(ItemCode)) AS chk
INTO #Temp
FROM dbo.SetsItems WITH (NOLOCK)
GROUP BY SetID;
-- SwePeso - projection step
SELECT ROW_NUMBER() OVER (ORDER BY w.chk) AS UniqueSetID,
STUFF(f.Data, 1, 2, N'') AS SetIDs
FROM (
SELECT cnt,
mn,
mx,
chk
FROM #Temp
GROUP BY cnt,
mn,
mx,
chk
HAVING COUNT(*) >= 2
) AS w
CROSS APPLY (
SELECT ', ' + CAST(x.SetID AS NVARCHAR(12))
FROM #Temp AS x
WHERE x.cnt = w.cnt
AND x.mn = w.mn
AND x.mx = w.mx
AND x.chk = w.chk
FOR XML PATH('')
) AS f(Data);
N 56°04'39.16"
E 12°55'05.25"
March 1, 2015 at 7:49 am
Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server
March 2, 2015 at 5:03 pm
Laurence Neville (3/1/2015)
Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server
:blush:
My thought question: Have you ever been told that your query runs too fast?
My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St
Viewing 9 posts - 1 through 8 (of 8 total)
You must be logged in to reply to this topic. Login to reply