Compare Lists within a Table

Question

Post reply

Compare Lists within a Table

Goldie Lesser

SSCertifiable

Points: 6155
More actions
May 12, 2011 at 9:33 am

#237284

I am trying to figure out how to compare lists in a table.

I want to find if any two Lists have the exact same members.

Here is some sample code. The real data is proprietary.

DECLARE @Color TABLE ( ID INT IDENTITY(1,1) ,Name VARCHAR(50) ) INSERT INTO @Color SELECT 'White' UNION ALL SELECT 'Black' UNION ALL SELECT 'Purple' UNION ALL SELECT 'Pink' UNION ALL SELECT 'Red' UNION ALL SELECT 'Blue' UNION ALL SELECT 'Green' UNION ALL SELECT 'Yellow' UNION ALL SELECT 'Orange' UNION ALL SELECT 'Brown' DECLARE @List TABLE ( ID INT IDENTITY(1,1) ,Code CHAR(3) ) INSERT INTO @List SELECT 'ABC' UNION ALL SELECT 'XYZ' UNION ALL SELECT 'JKL' DECLARE @ColorList TABLE ( ColorID INT ,ListID INT ) INSERT INTO @ColorList(ListID, ColorID) SELECT 1,1 UNION ALL SELECT 1,5 UNION ALL SELECT 1,8 UNION ALL SELECT 1,3 UNION ALL SELECT 1,4 UNION ALL SELECT 2,4 UNION ALL SELECT 2,6 UNION ALL SELECT 2,7 UNION ALL SELECT 2,8 UNION ALL SELECT 2,10 UNION ALL SELECT 3,4 UNION ALL SELECT 3,6 UNION ALL SELECT 3,7 UNION ALL SELECT 3,8 UNION ALL SELECT 3,10 SELECT * FROM @List L JOIN @ColorList CL ON L.ID = CL.ListID JOIN @Color C ON CL.ColorID = C.ID -- expected results: Lists 2 & 3 are identical

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply

Mike01 SSChampion Points: 11381 More actions · Answer 1

Here's the same issue with some answers

http://qa.sqlservercentral.com/Forums/Topic1101072-392-1.aspx

For better, quicker answers, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Goldie Lesser SSCertifiable Points: 6155 More actions · Answer 2

Thanks for pointing me in the right direction, Mike!

SwePeso's solution worked for me.

Posting the solution here in case this helps others:



--10 times the record, 7 times the duration.

/*  Total time

  10k    1 sec

 100k    7 sec

1000k   50 sec

*/

-- 0 sec (10k), 4 sec (100k), 39 sec (1000k)

SELECT          x.ListID AS LowID,

                y.ListID AS HighID,

                MIN(x.ProductItems) AS ProductItems

INTO            #Stage

FROM            (

                        SELECT          ListID,

                                        COUNT(*) AS ProductItems

                        FROM            @ColorList

                        GROUP BY        ListID

                ) AS x

INNER JOIN      (

                        SELECT          ListID,

                                        COUNT(*) AS ProductItems

                        FROM            @ColorList

                        GROUP BY        ListID

                ) AS y ON y.ProductItems = x.ProductItems

INNER JOIN      @ColorList AS x1 ON x1.ListID = x.ListID

INNER JOIN      @ColorList AS y1 ON y1.ListID = y.ListID

WHERE           x1.ColorID = y1.ColorID

                AND x.ListID < y.ListID

GROUP BY        x.ListID,

                y.ListID

HAVING          COUNT(*) = MIN(x.ProductItems)

-- 0 sec (10k), 1 sec (100k), 2 sec (1000k)

CREATE INDEX IX_Low ON #Stage (LowID) INCLUDE (HighID, ProductItems)

CREATE INDEX IX_High ON #Stage (HighID) INCLUDE (LowID, ProductItems)

-- 0 sec (10k), 1 sec (100k), 8 sec (1000k)

;WITH cteDuplicate

AS (

        SELECT          s.LowID AS theGrp,

                        s.LowID AS ListID,

                        MIN(s.ProductItems) AS ProductItems,

                        1 AS SeqID

        FROM            #Stage AS s

        WHERE           NOT EXISTS (SELECT * FROM #Stage AS x WHERE x.HighID = s.LowID)

        GROUP BY        s.LowID

        UNION ALL

        SELECT          d.theGrp,

                        f.ID AS ListID,

                        d.ProductItems,

                        d.SeqID + 1 AS SeqID

        FROM            cteDuplicate AS d

        CROSS APPLY     (

                                SELECT  s.HighID,

                                        ROW_NUMBER() OVER (ORDER BY s.HighID) AS SeqID

                                FROM    #Stage AS s

                                WHERE   s.LowID = d.ListID

                        ) AS f(ID, SeqID)

        WHERE           f.SeqID = 1

)

SELECT          DENSE_RANK() OVER (ORDER BY theGrp) AS theGroup,

                SeqID,

                ListID

INTO            #Temp

FROM            cteDuplicate

OPTION          (MAXRECURSION 0)

SELECT *

FROM #Temp

-- 0 sec

DROP TABLE      #Stage

DROP TABLE		#Temp

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 3

Hi Goldie,

Before we start on a solution, let's make more than just a handful of rows of data to test with so we can see what happens when scale increases.

--===== Conditionally drop the test tables to make reruns easier in SSMS



     IF OBJECT_ID('tempdb..#Color'     ,'U') IS NOT NULL DROP TABLE #Color;

     IF OBJECT_ID('tempdb..#List'      ,'U') IS NOT NULL DROP TABLE #List;

     IF OBJECT_ID('tempdb..#ColorList' ,'U') IS NOT NULL DROP TABLE #ColorList;

--===== Create and populate the color table on the fly

 SELECT ID = IDENTITY(INT,1,1),

        Color = d.Color

   INTO #Color

   FROM (

         SELECT 'White'  UNION ALL

         SELECT 'Black'  UNION ALL

         SELECT 'Purple' UNION ALL

         SELECT 'Pink'   UNION ALL

         SELECT 'Red'    UNION ALL

         SELECT 'Blue'   UNION ALL

         SELECT 'Green'  UNION ALL

         SELECT 'Yellow' UNION ALL

         SELECT 'Orange' UNION ALL

         SELECT 'Brown'

        ) d (Color)

;

--===== Create and populate the list of codes on the fly

     -- This creates all 17,576 possibilities

WITH

cteTally AS

(

 SELECT TOP (POWER(26,3)) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1

   FROM sys.all_columns ac1

  CROSS JOIN sys.all_columns ac2

)

 SELECT ID   = IDENTITY(INT,1,1),

        Code = CHAR((N/26/26)%26+65) --26^2

             + CHAR((N/26)   %26+65) --26^1

             + CHAR((N)      %26+65) --26^0

   INTO #List

   FROM cteTally

;

--===== Create and populate the ColorList junction table on the fly

     -- This creates approximately 20,000

 SELECT TOP 100000

        ColorID = ABS(CHECKSUM(NEWID()))%10+1,

        ListID  = ABS(CHECKSUM(NEWID()))%POWER(26,3)+1

   INTO #ColorList

   FROM sys.all_columns ac1

  CROSS JOIN sys.all_columns ac2

;

We don't actually need the Color table or the List table at this point. What we're really interested in is figuring out the answer to your question of "find if any two Lists have the exact same members". Now, there are a whole lot of ways to answer that single question and some of them can be pretty darned slow. It turns out that one of the easiest ways to find out is also one of the fastest ways. The easy way is to simply concatenate each color into an ordered CSV list for each ListID and then number the lists using DENSE_RANK.

The following code only takes 5 seconds on my 9 year old single CPU desktop...

--===== Conditionally drop the test tables to make reruns easier in SSMS



     IF OBJECT_ID('tempdb..#ColorGroup','U') IS NOT NULL DROP TABLE #ColorGroup;

--===== Group lists that have precisely the same colors

WITH

cteAggColors AS

(

 SELECT cl2.ListID,

        ItemCount = COUNT(*),

        Colors =

                 (--==== Create an ordered CSV list of colors

                      -- for the current cl2.ListID.

                  SELECT ','+CAST(cl1.ColorID AS VARCHAR(10))

                    FROM #ColorList cl1

                   WHERE cl1.ListID = cl2.ListID

                   ORDER BY cl1.ColorID

                     FOR XML PATH('')

                 )

               + ','

   FROM #ColorList cl2

  GROUP BY cl2.ListID

)

 SELECT ColorGroup      = DENSE_RANK() OVER (ORDER BY Colors),

        ColorGroupCount = CAST(NULL AS INT),

        ListID,

        ItemCount,

        Colors

   INTO #ColorGroup

   FROM cteAggColors

  ORDER BY ColorGroup, ListID

;

--===== Count the members of each "color group" and store that

     -- in the table, as well.

WITH

cteCountMembers AS

(

 SELECT ColorGroup, ColorGroupCount = COUNT(*)

   FROM #ColorGroup

  GROUP BY ColorGroup

)

 UPDATE cg

    SET ColorGroupCount = cm.ColorGroupCount

   FROM #ColorGroup cg

   LEFT JOIN cteCountMembers cm

     ON cg.ColorGroup = cm.ColorGroup

;

Of course, you can then easily find ListIDs that have precisely the same colors very easily...

SELECT * FROM #ColorGroup WHERE ColorGroupCount > 1

Then, unlike many of the other methods, we can now do some pretty fancy stuff. For example, we can now easily answer the question of "Which ListID's have unique color combinations compared to all other lists"?

SELECT * FROM #ColorGroup WHERE ColorGroupCount = 1

Because of the large amount of random data we used, it turns out that the answer to the question above is "None" but now we have the proof.

If the "lists" were purchases of something like "flowers" and each list represented a purchase of a small group of flowers, we can also easily answer questions like "What are the most popular color combinations of 4 flowers and which lists have those combinations"?

WITH



cteFindMostPopular AS

(

 SELECT Popularity = MAX(ColorGroupCount) FROM #ColorGroup WHERE ItemCount = 4

)

 SELECT color.*

   FROM #ColorGroup color

  INNER JOIN cteFindMostPopular popular

     ON color.ColorGroupCount = popular.Popularity

  WHERE color.ItemCount = 4

;

If you did, indeed, own a flower shop, let's say you're getting ready to make up flower-sets for the 4th of July. What are the 2 most popular combinations of ONLY Red, White, and Blue flowers that you sold for the previous 4th of July and which lists are they on?

WITH



cteFindMostPopular AS

(

 SELECT DISTINCT TOP 2 Popularity = ColorGroupCount

   FROM #ColorGroup

  WHERE DistinctColors = ',1,5,6,'

 ORDER BY ColorGroupCount DESC

)

 SELECT color.*

   FROM #ColorGroup color

  INNER JOIN cteFindMostPopular popular

     ON color.ColorGroupCount = popular.Popularity

  WHERE color.DistinctColors = ',1,5,6,'

;

How about this. You got a huge deal on Pink flowers and bought a lot of them. What are the top selling color combinations that have at least 2 Pink flowers in them?

WITH



cteFindMostPopular AS

(

 SELECT DISTINCT TOP 3 Popularity = ColorGroupCount

   FROM #ColorGroup

  WHERE Colors LIKE '%,4,4,%'

 ORDER BY ColorGroupCount DESC

)

 SELECT color.*

   FROM #ColorGroup color

  INNER JOIN cteFindMostPopular popular

     ON color.ColorGroupCount = popular.Popularity

  WHERE color.Colors  LIKE '%,4,4,%'

 ORDER BY ColorGroupCount DESC, ColorGroup, ListID

;

With a little imagination, the combinations are just about endless.

The other thing to note on this method is the performance. For example, the code you posted as the solution you used takes more than 25 MINUTES to run on my machine where the method I used above takes only seconds to run and leaves behind a table where many, many different and "fairly odd" questions asked of the data can be easily and quickly answered.

Of course, using the "1960's style pointers" to do it all made it a whole lot easier to demonstrate and use. 😛 You could, however, use the color name and the list name to do the same thing.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Goldie Lesser SSCertifiable Points: 6155 More actions · Answer 4

CELKO (5/13/2011)

I know this is a skeleton, but you got so much so wrong in so little code. Colors would be encoded with an industry standard (Land numbers, Pantone, etc), but never a count of physical insertion attempts like you did. What you are doing is faking 1960's pointer chains and not yet writing RDBMS at all

It would also help if you knew ISO-11179 naming rules and did not use vague names

T-SQL has row constructors so stop doing dialect. It is both ugly and slow.

[...]

As I mentioned, these tables are just samples to try to figure out how to solve an issue.

The real table structure data is proprietary and has nothing to do with colors. I have two tables with primary keys, and another "intersection" table that joins them together. Primary & foreign keys are in place, I just did not include them in this small example.

I was trying to figure out how to find out if any two "lists" were identical.

This is a kind of relational division. You can do it with proprietary APPLY and other fancy stuff. Let's make it easy. If two sets are equal, then their cardinality is the same. That is a great filter.

I'm not sure what you are saying here. Just because the counts are the same, does not mean that the lists are identical.

Goldie Lesser SSCertifiable Points: 6155 More actions · Answer 5

...

We don't actually need the Color table or the List table at this point. What we're really interested in is figuring out the answer to your question of "find if any two Lists have the exact same members". Now, there are a whole lot of ways to answer that single question and some of them can be pretty darned slow. It turns out that one of the easiest ways to find out is also one of the fastest ways. The easy way is to simply concatenate each color into an ordered CSV list for each ListID and then number the lists using DENSE_RANK.

...

Thanks for your help, Jeff.

This is a really straightforward approach. A lot easier to understand than the solution I posted.

I am surprised that it is so fast. I would have thought that all of that string concatenation would end up slow.

(It's a shame I don't run a flower shop :hehe: Sounds interesting!)

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 6

Thanks for the feedback, Goldie. I needed to do such a thing with 300,000 lists that had 1 to 12 items in each list. I tried other folks methods and they just didn't cut it performance-wise. I even tried Celko's method of building 300,000 "micro" Triangular Joins which wasn't bad but was still too slow, IMHO. To be honest, I was surprised at how fast the concatenate and group-counting actually was. Having it be simple to understand was definitely a bonus.

Then I starting playing with scenarios such as "People who bought x also bought y" and was amazed at how simple it was to do so (used an inner join back to the original table for a different take on that) and I was hooked.

Since I knew your data was a made up scenario, I went to the "flower" example just to show you that, in the absence of my knowledge of your data, that other things were possible with your data if you needed to do something different.

Again, thank you very much for the feedback. I'm glad I could help.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions