How To Find or Remove Duplicate Groups of Records

  • I have table I am "processing" from imported text data:

     Create Table #TempGroups (GroupNum Int, Product Varchar(10))

    I must export text data dealing with these "product" groups. I want to

    greatly reduce the number of records in this table. There are many groups of

    products that are identical. i.e. Group #1 contains the same 100 products as

    in Group #55, so I just want to keep Group #1, and have other "parent" records

    using Group #55 to point to Group #1.

    How do I (quickly) identify the Groups that are the same?

    I do not need to DELETE any rows, just need results so that when data is exported I can JOIN with

    these results to substitute remaining Group #s.

    My example Table Stats:

     ~37000 recs, ~1000 Groups, 260 Products in largest Group,

     85 different Group sizes, 716 "Duplicate" Groups with ~28000 recs

     Create Clustered Index #TempGroups on #TempGroups (GroupNum, Product)

    My 1st step was to...

     Create Table #TempGroupCount (GroupNum Int, Many Int)

     Insert into #TempGroupCount Select GroupNum, Count(*) as Many from #TempGroups Group by GroupNum

     Create Clustered Index #TempGroupCount on #TempGroupCount (GroupNum)

    2nd Step to identify the Groups where ALL the Groups of the same size had the same products...

     Select Outside.GroupNum, Inside.Many, Inside.MinGroup as GroupToKeep

     From ( Select Many, Min(TG.GroupNum) as MinGroup,

       from #TempGroups TG

       Join #TempGroupCount TGC

        On TG.GroupNum = TGC.GroupNum

      Group by Many) Inside

     Join #TempGroupCount Outside

      On Inside.Many = Outside.Many and Inside.MinGroup <> Outside.GroupNum

     Where Inside.Many = Inside.Prods

    This results in identifying 87 of the 716 duplicate Groups, and would let me eliminate ~10000 recs,

    and runs quickly (~1 sec.).

    3rd step. This one I want to be able to deal with in a "set" operation instead of a cursor / loop...

    Need to identify most if not all the (remaining) duplicate groups.

    I'm currently using a cursor driven by:

     Select G1.GroupNum, G2.GroupNum, G1.Many

     From  (Select GroupNum, Many

       from #TempGroupCount

       Where GroupNum not In (Select GroupNum From [Those Groups in Step#2]) ) G1

     Join  (Select GroupNum, Many

       from #TempGroupCount

       Where GroupNum not In (Select GroupNum From [Those Groups in Step#2]) ) G2

      On G1.GroupNum < G2.GroupNum and G1.Many = G2.Many

     Where G1.Many > 1

    Then looping with Dynamic SQL Comparing Count([UNION of 2 Groups of Products]) with Count([Single Group]) etc.

    The "set" SELECTS I've tried are all slower than the cursor-loop-dynamic code. They do however eliminate Step 2.

    I've had to limit my loop to "Where G1.Many > 50" to remain within my 3-5 second desired processing time. The

    "set" SQLs are taking 15-25+ sec. or so, so not fast enough .

    So I'm looking for some forum help for ideas.

    I'll email dump of my example data, just "Private Message" me.

    Thanks in advance.



    Once you understand the BITs, all the pieces come together

  • I found a solution that works for what I need.

    Since the "Product" data in the #TempGroups table is numeric,

    I could use this to my advantage:

    Instead of (1st Step)

     Insert into #TempGroupCount Select GroupNum, Count(*) as Many from #TempGroups Group by GroupNum

    I used

     Insert into #TempGroupCount Select GroupNum, Count(*) as Many, Sum(Convert(Int, Product)) as AllProd from #TempGroups Group by GroupNum Order by GroupNum

    This resulted in more descerning records since now Groups with

    differing products could most likely have rows where AllProd value was different.

    Then I could change Step 2 to something like

     Select Outside.GroupNum, Inside.Many, Inside.MinGroup as GroupToKeep

     From ( Select Many, AllProd, Min(TG.GroupNum) as MinGroup, Count(Distinct TG.GroupNum) as Groups

       from #TempGroups TG

       Join #TempGroupCount TGC

        On TG.GroupNum = TGC.GroupNum and Many <> 1

      Group by Many, AllProd Having Count(Distinct TG.GroupNum) <> 1

     &nbsp Inside

     Join #TempGroupCount Outside

      On Inside.Many = Outside.Many and Inside.AllProd = Outside.AllProd and Inside.MinGroup <> Outside.GroupNum

     Where Inside.Many = Inside.Prods

    This results in catching nearly all the duplicate Groups in my data situation,

    and takes no more time to execute than the original Step 2 .

    I guess the key was to try to make a #Temp... data a "better" test of difference. Even

    if my Product data was not pure numeric, I could have still done something to fill my

    AllProd column of data, such as maybe concatenating the MIN() and MAX() Product etc.,

    anything to be able to better ID the differnces.

     



    Once you understand the BITs, all the pieces come together

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply