T-SQL Puzzler - identify sets that have the same items (where set ID and items in same table)

  • I am struggling to come up with a set-based solution for this problem (i.e. that doesn't involve loops/cursors) .... any help gratefully appreciated!

    A table contains items (identified by an ItemCode) and the set they belong to (identified by a SetId). Here is some sample data:

    SetId ItemCode

    1 A

    1 B

    2 4

    2 8

    2 6

    3 10

    3 12

    4 10

    4 18

    5 8

    6 12

    7 16

    8 16

    9 16

    10 A

    10 B

    11 4

    11 6

    11 8

    You can see that there are some sets that have the same members:

    - 1 and 10

    - 2 and 11

    - 7, 8 & 9

    What I want to do is identify the sets that have the same members, by giving them the same ID in another column called UniqueSetId.

    Any ideas?

  • This is a possible option.

    CREATE TABLE SetsItems(

    SetId int,

    ItemCode varchar(10))

    INSERT INTO SetsItems

    SELECT 1, 'A' UNION ALL

    SELECT 1, 'B' UNION ALL

    SELECT 2, '4' UNION ALL

    SELECT 2, '8' UNION ALL

    SELECT 2, '6' UNION ALL

    SELECT 3, '10' UNION ALL

    SELECT 3, '12' UNION ALL

    SELECT 4, '10' UNION ALL

    SELECT 4, '18' UNION ALL

    SELECT 5, '8' UNION ALL

    SELECT 6, '12' UNION ALL

    SELECT 7, '16' UNION ALL

    SELECT 8, '16' UNION ALL

    SELECT 9, '16' UNION ALL

    SELECT 10, 'A' UNION ALL

    SELECT 10, 'B' UNION ALL

    SELECT 11, '4' UNION ALL

    SELECT 11, '6' UNION ALL

    SELECT 11, '8';

    WITH cteSets AS(

    SELECT DISTINCT SetID,

    CAST( (SELECT ',' + ItemCode

    FROM SetsItems i

    WHERE s.SetId = i.SetId

    ORDER BY ItemCode

    FOR XML PATH('')) AS varchar(max)) Items

    FROM SetsItems s

    )

    SELECT *,

    RANK() OVER( ORDER BY Items) Option1,

    DENSE_RANK() OVER( ORDER BY Items) Option2

    FROM cteSets

    ORDER BY Option1

    GO

    DROP TABLE SetsItems

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Here's another (using Luis's setup)

    WITH Results(SetId1,SetId2) AS (

    SELECT a.SetId,b.SetId

    FROM SetsItems a

    INNER JOIN SetsItems b ON b.SetId > a.SetId

    AND b.ItemCode = a.ItemCode

    GROUP BY a.SetId,b.SetId

    HAVING COUNT(*) = (SELECT COUNT(*) FROM SetsItems c WHERE c.SetId = a.SetId)

    AND COUNT(*) = (SELECT COUNT(*) FROM SetsItems d WHERE d.SetId = b.SetId))

    SELECT DISTINCT ca.SetId,

    DENSE_RANK() OVER(ORDER BY a.SetId1) AS UniqueSetId

    FROM Results a

    CROSS APPLY(VALUES(a.SetId1),(a.SetId2)) ca(SetId)

    WHERE NOT EXISTS(SELECT * FROM Results b WHERE b.SetId2 = a.SetId1)

    ORDER BY UniqueSetId,SetId;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Both solutions worked, thanks very much. For some reason Luis's is much faster on my large data set (millions of rows).

  • I usually do something like this:

    SELECT SetId, CHECKSUM_AGG(CHECKSUM(ItemCode)) ch

    FROM SetsItems

    GROUP BY SetId

    ORDER BY ch

    If you're **really** unlucky you might get a false positive because CHECKSUM isn't perfect, but i doubt it. You can do a HASHBYTES on the ItemCode to pretty much avoid any collisions.

  • This is a simple case of Relational Division, of which I am particular fond of.

    The result isSetID SetID

    1 10

    2 11

    7 8

    7 9

    8 9By using this piece of codeSELECT kc.SetID,

    nc.SetID

    FROM (

    SELECT SetID,

    COUNT(*) AS cnt

    FROM dbo.SetsItems

    GROUP BY SetID

    ) AS kc

    INNER JOIN (

    SELECT SetID,

    COUNT(*) AS cnt

    FROM dbo.SetsItems

    GROUP BY SetID

    ) AS nc ON nc.cnt = kc.cnt

    AND nc.SetID > kc.SetID

    INNER JOIN dbo.SetsItems AS t ON t.SetID = kc.SetID

    INNER JOIN dbo.SetsItems AS n ON n.SetID = nc.SetID

    WHERE t.ItemCode = n.ItemCode

    GROUP BY kc.SetID,

    nc.SetID

    HAVING COUNT(*) = MIN(nc.cnt);


    N 56°04'39.16"
    E 12°55'05.25"

  • -- SwePeso - Preaggregation/filtering step

    SELECT SetID,

    COUNT(*) AS cnt,

    MIN(ItemCode) AS mn,

    MAX(ItemCode) AS mx,

    CHECKSUM_AGG(CHECKSUM(ItemCode)) AS chk

    INTO #Temp

    FROM dbo.SetsItems WITH (NOLOCK)

    GROUP BY SetID;

    -- SwePeso - projection step

    SELECT ROW_NUMBER() OVER (ORDER BY w.chk) AS UniqueSetID,

    STUFF(f.Data, 1, 2, N'') AS SetIDs

    FROM (

    SELECT cnt,

    mn,

    mx,

    chk

    FROM #Temp

    GROUP BY cnt,

    mn,

    mx,

    chk

    HAVING COUNT(*) >= 2

    ) AS w

    CROSS APPLY (

    SELECT ', ' + CAST(x.SetID AS NVARCHAR(12))

    FROM #Temp AS x

    WHERE x.cnt = w.cnt

    AND x.mn = w.mn

    AND x.mx = w.mx

    AND x.chk = w.chk

    FOR XML PATH('')

    ) AS f(Data);


    N 56°04'39.16"
    E 12°55'05.25"

  • Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

  • Laurence Neville (3/1/2015)


    Thanks for the extra solutions. Got me reading about relational division: https://www.simple-talk.com/sql/learn-sql-server/high-performance-relational-division-in-sql-server

    :blush:


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply