INSERT INTO Statement w/ Duplicate Data (Recurring)

  • Here is the basis of my problem: what I'm looking for is to INSERT duplicate data ID's from one table to another table on a daily basis.

    I have a table, say tblData, that contains millions of records and has a Primary Key named DataID.  I have another table, say tblDuplicateData w/ a Primary Key named DuplicateDataID and a Foreign Key named DataID, where I would like all NEW duplicate data from tblData to be dumped into tblDuplicateData on a daily basis.

    Here are more details: I have initially imported old data into tblData and it currently contains over 3 million rows.  On a daily basis, I will be importing a text file from Oracle that will add around 6,000 records via a DTS package. 

    I have used the following code to locate all of the duplicate data (which works nicely):

    SELECT Field1, Field2, Field3, count(*) as [How Many?]

    FROM tblData

    GROUP BY Field1, Field2, Field3

    HAVING COUNT(*) > 1

    Where I am having trouble is my second table, tblDuplicateData, which has DataID as a link (foreign key) to tblData, so I am having difficulty trying to figure out how to populate that table given the above SQL statement. 

    So I was thinking of enhancing my DTS package that has the daily file Oracle upload to include 'ON SUCCESS', an INSERT statement something like this (psuedo code):

    INSERT INTO tblDuplicateData (DataID)

      SELECT DataID

      FROM tblData

      WHERE DataID EXISTS?

        (SELECT Field1, Field2, Field3, count(*) as [How Many?]

         FROM tblData

         GROUP BY Field1, Field2, Field3

         HAVING COUNT(*) > 1)

      AND DataID NOT IN tblDuplicateData

    So to iterate once more, what I'm basically looking for is to run a Stored Procedure after my daily file upload that will INSERT into tblDuplicateData all DataID's that are DUPLICATES AND do not yet belong to that table, i.e. tblDuplicateData.

    Thanks in advance for any assistance!

  • I think you need:

    insert into tblduplicatedata (dataid)

    select dataid from tbldata d

    left join tblduplicatedata dup on dup.dataid = d.dataid

    inner join (select field1, field2, field3, min(dataid) as dataid

                from tbldata group by field1, field2, field3

                having count(*) > 1) as f

                on d.field1 = f.field1 and d.field2 = f.field2 and d.field3 = f.field3 and d.dataid <> f.dataid

    WHERE

    and dup.dataid is null

    I haven't been able to test it and there is no guarentees it will run quickly against millions of records. But it should give you a guide.

    Regards

    Peter Tillotson

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply