Compare Lists within a Table

  • I am trying to figure out how to compare lists in a table.

    I want to find if any two Lists have the exact same members.

    Here is some sample code. The real data is proprietary.

    DECLARE @Color TABLE

    (

    ID INT IDENTITY(1,1)

    ,Name VARCHAR(50)

    )

    INSERT INTO @Color

    SELECT 'White'

    UNION ALL SELECT 'Black'

    UNION ALL SELECT 'Purple'

    UNION ALL SELECT 'Pink'

    UNION ALL SELECT 'Red'

    UNION ALL SELECT 'Blue'

    UNION ALL SELECT 'Green'

    UNION ALL SELECT 'Yellow'

    UNION ALL SELECT 'Orange'

    UNION ALL SELECT 'Brown'

    DECLARE @List TABLE

    (

    ID INT IDENTITY(1,1)

    ,Code CHAR(3)

    )

    INSERT INTO @List

    SELECT 'ABC'

    UNION ALL SELECT 'XYZ'

    UNION ALL SELECT 'JKL'

    DECLARE @ColorList TABLE

    (

    ColorID INT

    ,ListID INT

    )

    INSERT INTO @ColorList(ListID, ColorID)

    SELECT 1,1

    UNION ALL SELECT 1,5

    UNION ALL SELECT 1,8

    UNION ALL SELECT 1,3

    UNION ALL SELECT 1,4

    UNION ALL SELECT 2,4

    UNION ALL SELECT 2,6

    UNION ALL SELECT 2,7

    UNION ALL SELECT 2,8

    UNION ALL SELECT 2,10

    UNION ALL SELECT 3,4

    UNION ALL SELECT 3,6

    UNION ALL SELECT 3,7

    UNION ALL SELECT 3,8

    UNION ALL SELECT 3,10

    SELECT *

    FROM @List L

    JOIN @ColorList CL ON L.ID = CL.ListID

    JOIN @Color C ON CL.ColorID = C.ID

    -- expected results: Lists 2 & 3 are identical

  • Here's the same issue with some answers

    http://qa.sqlservercentral.com/Forums/Topic1101072-392-1.aspx

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Thanks for pointing me in the right direction, Mike!

    SwePeso's solution worked for me.

    Posting the solution here in case this helps others:

    --10 times the record, 7 times the duration.

    /* Total time

    10k 1 sec

    100k 7 sec

    1000k 50 sec

    */

    -- 0 sec (10k), 4 sec (100k), 39 sec (1000k)

    SELECT x.ListID AS LowID,

    y.ListID AS HighID,

    MIN(x.ProductItems) AS ProductItems

    INTO #Stage

    FROM (

    SELECT ListID,

    COUNT(*) AS ProductItems

    FROM @ColorList

    GROUP BY ListID

    ) AS x

    INNER JOIN (

    SELECT ListID,

    COUNT(*) AS ProductItems

    FROM @ColorList

    GROUP BY ListID

    ) AS y ON y.ProductItems = x.ProductItems

    INNER JOIN @ColorList AS x1 ON x1.ListID = x.ListID

    INNER JOIN @ColorList AS y1 ON y1.ListID = y.ListID

    WHERE x1.ColorID = y1.ColorID

    AND x.ListID < y.ListID

    GROUP BY x.ListID,

    y.ListID

    HAVING COUNT(*) = MIN(x.ProductItems)

    -- 0 sec (10k), 1 sec (100k), 2 sec (1000k)

    CREATE INDEX IX_Low ON #Stage (LowID) INCLUDE (HighID, ProductItems)

    CREATE INDEX IX_High ON #Stage (HighID) INCLUDE (LowID, ProductItems)

    -- 0 sec (10k), 1 sec (100k), 8 sec (1000k)

    ;WITH cteDuplicate

    AS (

    SELECT s.LowID AS theGrp,

    s.LowID AS ListID,

    MIN(s.ProductItems) AS ProductItems,

    1 AS SeqID

    FROM #Stage AS s

    WHERE NOT EXISTS (SELECT * FROM #Stage AS x WHERE x.HighID = s.LowID)

    GROUP BY s.LowID

    UNION ALL

    SELECT d.theGrp,

    f.ID AS ListID,

    d.ProductItems,

    d.SeqID + 1 AS SeqID

    FROM cteDuplicate AS d

    CROSS APPLY (

    SELECT s.HighID,

    ROW_NUMBER() OVER (ORDER BY s.HighID) AS SeqID

    FROM #Stage AS s

    WHERE s.LowID = d.ListID

    ) AS f(ID, SeqID)

    WHERE f.SeqID = 1

    )

    SELECT DENSE_RANK() OVER (ORDER BY theGrp) AS theGroup,

    SeqID,

    ListID

    INTO #Temp

    FROM cteDuplicate

    OPTION (MAXRECURSION 0)

    SELECT *

    FROM #Temp

    -- 0 sec

    DROP TABLE #Stage

    DROP TABLE #Temp

  • Hi Goldie,

    Before we start on a solution, let's make more than just a handful of rows of data to test with so we can see what happens when scale increases.

    --===== Conditionally drop the test tables to make reruns easier in SSMS

    IF OBJECT_ID('tempdb..#Color' ,'U') IS NOT NULL DROP TABLE #Color;

    IF OBJECT_ID('tempdb..#List' ,'U') IS NOT NULL DROP TABLE #List;

    IF OBJECT_ID('tempdb..#ColorList' ,'U') IS NOT NULL DROP TABLE #ColorList;

    --===== Create and populate the color table on the fly

    SELECT ID = IDENTITY(INT,1,1),

    Color = d.Color

    INTO #Color

    FROM (

    SELECT 'White' UNION ALL

    SELECT 'Black' UNION ALL

    SELECT 'Purple' UNION ALL

    SELECT 'Pink' UNION ALL

    SELECT 'Red' UNION ALL

    SELECT 'Blue' UNION ALL

    SELECT 'Green' UNION ALL

    SELECT 'Yellow' UNION ALL

    SELECT 'Orange' UNION ALL

    SELECT 'Brown'

    ) d (Color)

    ;

    --===== Create and populate the list of codes on the fly

    -- This creates all 17,576 possibilities

    WITH

    cteTally AS

    (

    SELECT TOP (POWER(26,3)) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    )

    SELECT ID = IDENTITY(INT,1,1),

    Code = CHAR((N/26/26)%26+65) --26^2

    + CHAR((N/26) %26+65) --26^1

    + CHAR((N) %26+65) --26^0

    INTO #List

    FROM cteTally

    ;

    --===== Create and populate the ColorList junction table on the fly

    -- This creates approximately 20,000

    SELECT TOP 100000

    ColorID = ABS(CHECKSUM(NEWID()))%10+1,

    ListID = ABS(CHECKSUM(NEWID()))%POWER(26,3)+1

    INTO #ColorList

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    We don't actually need the Color table or the List table at this point. What we're really interested in is figuring out the answer to your question of "find if any two Lists have the exact same members". Now, there are a whole lot of ways to answer that single question and some of them can be pretty darned slow. It turns out that one of the easiest ways to find out is also one of the fastest ways. The easy way is to simply concatenate each color into an ordered CSV list for each ListID and then number the lists using DENSE_RANK.

    The following code only takes 5 seconds on my 9 year old single CPU desktop...

    --===== Conditionally drop the test tables to make reruns easier in SSMS

    IF OBJECT_ID('tempdb..#ColorGroup','U') IS NOT NULL DROP TABLE #ColorGroup;

    --===== Group lists that have precisely the same colors

    WITH

    cteAggColors AS

    (

    SELECT cl2.ListID,

    ItemCount = COUNT(*),

    Colors =

    (--==== Create an ordered CSV list of colors

    -- for the current cl2.ListID.

    SELECT ','+CAST(cl1.ColorID AS VARCHAR(10))

    FROM #ColorList cl1

    WHERE cl1.ListID = cl2.ListID

    ORDER BY cl1.ColorID

    FOR XML PATH('')

    )

    + ','

    FROM #ColorList cl2

    GROUP BY cl2.ListID

    )

    SELECT ColorGroup = DENSE_RANK() OVER (ORDER BY Colors),

    ColorGroupCount = CAST(NULL AS INT),

    ListID,

    ItemCount,

    Colors

    INTO #ColorGroup

    FROM cteAggColors

    ORDER BY ColorGroup, ListID

    ;

    --===== Count the members of each "color group" and store that

    -- in the table, as well.

    WITH

    cteCountMembers AS

    (

    SELECT ColorGroup, ColorGroupCount = COUNT(*)

    FROM #ColorGroup

    GROUP BY ColorGroup

    )

    UPDATE cg

    SET ColorGroupCount = cm.ColorGroupCount

    FROM #ColorGroup cg

    LEFT JOIN cteCountMembers cm

    ON cg.ColorGroup = cm.ColorGroup

    ;

    Of course, you can then easily find ListIDs that have precisely the same colors very easily...

    SELECT * FROM #ColorGroup WHERE ColorGroupCount > 1

    Then, unlike many of the other methods, we can now do some pretty fancy stuff. For example, we can now easily answer the question of "Which ListID's have unique color combinations compared to all other lists"?

    SELECT * FROM #ColorGroup WHERE ColorGroupCount = 1

    Because of the large amount of random data we used, it turns out that the answer to the question above is "None" but now we have the proof.

    If the "lists" were purchases of something like "flowers" and each list represented a purchase of a small group of flowers, we can also easily answer questions like "What are the most popular color combinations of 4 flowers and which lists have those combinations"?

    WITH

    cteFindMostPopular AS

    (

    SELECT Popularity = MAX(ColorGroupCount) FROM #ColorGroup WHERE ItemCount = 4

    )

    SELECT color.*

    FROM #ColorGroup color

    INNER JOIN cteFindMostPopular popular

    ON color.ColorGroupCount = popular.Popularity

    WHERE color.ItemCount = 4

    ;

    If you did, indeed, own a flower shop, let's say you're getting ready to make up flower-sets for the 4th of July. What are the 2 most popular combinations of ONLY Red, White, and Blue flowers that you sold for the previous 4th of July and which lists are they on?

    WITH

    cteFindMostPopular AS

    (

    SELECT DISTINCT TOP 2 Popularity = ColorGroupCount

    FROM #ColorGroup

    WHERE DistinctColors = ',1,5,6,'

    ORDER BY ColorGroupCount DESC

    )

    SELECT color.*

    FROM #ColorGroup color

    INNER JOIN cteFindMostPopular popular

    ON color.ColorGroupCount = popular.Popularity

    WHERE color.DistinctColors = ',1,5,6,'

    ;

    How about this. You got a huge deal on Pink flowers and bought a lot of them. What are the top selling color combinations that have at least 2 Pink flowers in them?

    WITH

    cteFindMostPopular AS

    (

    SELECT DISTINCT TOP 3 Popularity = ColorGroupCount

    FROM #ColorGroup

    WHERE Colors LIKE '%,4,4,%'

    ORDER BY ColorGroupCount DESC

    )

    SELECT color.*

    FROM #ColorGroup color

    INNER JOIN cteFindMostPopular popular

    ON color.ColorGroupCount = popular.Popularity

    WHERE color.Colors LIKE '%,4,4,%'

    ORDER BY ColorGroupCount DESC, ColorGroup, ListID

    ;

    With a little imagination, the combinations are just about endless.

    The other thing to note on this method is the performance. For example, the code you posted as the solution you used takes more than 25 MINUTES to run on my machine where the method I used above takes only seconds to run and leaves behind a table where many, many different and "fairly odd" questions asked of the data can be easily and quickly answered.

    Of course, using the "1960's style pointers" to do it all made it a whole lot easier to demonstrate and use. 😛 You could, however, use the color name and the list name to do the same thing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • CELKO (5/13/2011)


    I know this is a skeleton, but you got so much so wrong in so little code. Colors would be encoded with an industry standard (Land numbers, Pantone, etc), but never a count of physical insertion attempts like you did. What you are doing is faking 1960's pointer chains and not yet writing RDBMS at all

    It would also help if you knew ISO-11179 naming rules and did not use vague names

    T-SQL has row constructors so stop doing dialect. It is both ugly and slow.

    [...]

    As I mentioned, these tables are just samples to try to figure out how to solve an issue.

    The real table structure data is proprietary and has nothing to do with colors. I have two tables with primary keys, and another "intersection" table that joins them together. Primary & foreign keys are in place, I just did not include them in this small example.

    I was trying to figure out how to find out if any two "lists" were identical.

    This is a kind of relational division. You can do it with proprietary APPLY and other fancy stuff. Let's make it easy. If two sets are equal, then their cardinality is the same. That is a great filter.

    I'm not sure what you are saying here. Just because the counts are the same, does not mean that the lists are identical.

  • ...

    We don't actually need the Color table or the List table at this point. What we're really interested in is figuring out the answer to your question of "find if any two Lists have the exact same members". Now, there are a whole lot of ways to answer that single question and some of them can be pretty darned slow. It turns out that one of the easiest ways to find out is also one of the fastest ways. The easy way is to simply concatenate each color into an ordered CSV list for each ListID and then number the lists using DENSE_RANK.

    ...

    Thanks for your help, Jeff.

    This is a really straightforward approach. A lot easier to understand than the solution I posted.

    I am surprised that it is so fast. I would have thought that all of that string concatenation would end up slow.

    (It's a shame I don't run a flower shop :hehe: Sounds interesting!)

  • Thanks for the feedback, Goldie. I needed to do such a thing with 300,000 lists that had 1 to 12 items in each list. I tried other folks methods and they just didn't cut it performance-wise. I even tried Celko's method of building 300,000 "micro" Triangular Joins which wasn't bad but was still too slow, IMHO. To be honest, I was surprised at how fast the concatenate and group-counting actually was. Having it be simple to understand was definitely a bonus.

    Then I starting playing with scenarios such as "People who bought x also bought y" and was amazed at how simple it was to do so (used an inner join back to the original table for a different take on that) and I was hooked.

    Since I knew your data was a made up scenario, I went to the "flower" example just to show you that, in the absence of my knowledge of your data, that other things were possible with your data if you needed to do something different.

    Again, thank you very much for the feedback. I'm glad I could help.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply