Duplicates

  • I have an interesting scenario. I am getting ASCII files from our mainframe system that contain duplicate records (they say they can't weed them out without causing severe system slowdowns), so I am looking for the best way to remove the dups. I was reading an old article about the IGNORE_DUP_KEY option for indexes. Does that option automatically discard duplicates of newly added rows, or does it just throw an error saying I am violating the primary key? I think I am going to create a new table, populate it with the data, and weed the extras out.

    Matt

  • I think you are headed in the right direction.

  • You could try using code like this to weed out duplicates. You need to ensure that the cursor's SELECT is ordered on the field you are comparing, so that duplicate values arrive on consecutive rows.

    It is usually recommended that cursors be avoided where possible; this is one of the only circumstances in which I would normally use one.

    Hope this helps

    -- 1. Declare variables used to compare vendor keys
    DECLARE @kdnr_var1 char(6), @kdnr_var2 char(6)

    -- 2. Declare and run cursor against the vendor work table
    --    (the ORDER BY puts duplicate keys on consecutive rows)
    DECLARE dedupe_wrkVM3 CURSOR
    FOR SELECT KDNR FROM wrkVM3 ORDER BY KDNR FOR UPDATE

    OPEN dedupe_wrkVM3
    FETCH NEXT FROM dedupe_wrkVM3 INTO @kdnr_var1

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Remember the key we just read, then fetch the next row
        SELECT @kdnr_var2 = @kdnr_var1
        FETCH NEXT FROM dedupe_wrkVM3 INTO @kdnr_var1

        -- If the new key matches the previous one, log it and delete the duplicate row
        IF @@FETCH_STATUS = 0 AND @kdnr_var2 = @kdnr_var1
        BEGIN
            INSERT INTO errSourceData
            VALUES (GETDATE(), 'wrkVM3', 'KDNR', @kdnr_var2, 'Duplicate Vendor Record')

            DELETE FROM wrkVM3 WHERE CURRENT OF dedupe_wrkVM3
        END
    END

    CLOSE dedupe_wrkVM3
    DEALLOCATE dedupe_wrkVM3
    GO

  • Hi Matt101,

    IGNORE_DUP_KEY will check each attempted insert or update to verify that its key(s) are unique. Rows that violate the constraint are ignored, i.e. they are not inserted into the table; a warning is raised rather than an error, and the rest of the batch proceeds. You will take a performance hit when using this option, and you will also be unable to acquire a BU (bulk update) lock on bulk inserts if there are indexes on the table.
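
    As a quick illustration of that behaviour (just a sketch; the table and index names here are made up for the demo, and the syntax shown is the SQL Server 2000 form):

    -- Throwaway table used only for this demo
    CREATE TABLE dup_demo (KDNR char(6) NOT NULL)

    -- Unique index that discards duplicate keys instead of failing the batch
    CREATE UNIQUE INDEX ix_dup_demo_kdnr ON dup_demo (KDNR) WITH IGNORE_DUP_KEY

    INSERT INTO dup_demo (KDNR) VALUES ('000001')  -- inserted
    INSERT INTO dup_demo (KDNR) VALUES ('000001')  -- discarded with a "Duplicate key was ignored." warning
    INSERT INTO dup_demo (KDNR) VALUES ('000002')  -- inserted; the rest of the batch keeps running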

    Deleting the duplicate records after the fact is also an option, but depending on the size and number of the duplicates you may dramatically exercise the transaction log.

    Another option might be to allow the duplicates: following the load, put an index on the table and build a view that uses DISTINCT to access the data.
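
    Something along these lines (a sketch only; the view name is made up, and the table/column names reuse the wrkVM3 / KDNR names from the cursor example above):

    -- Load everything, duplicates included, then index the key column to help the DISTINCT
    CREATE INDEX ix_wrkVM3_kdnr ON wrkVM3 (KDNR)
    GO

    -- Readers query the view rather than the base table
    CREATE VIEW vw_wrkVM3
    AS
        SELECT DISTINCT KDNR FROM wrkVM3
    GO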

    And another option might be to load the data into a staging table and then use SELECT DISTINCT to insert into the final table. Remember to drop or truncate your staging table once you are done. Bulk loading without indexes and using the TABLOCK hint will be your fastest load option.
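
    Roughly like this (a sketch; the file path, table names, and column list are placeholders for whatever your feed actually looks like):

    -- Bulk load into an unindexed staging table; TABLOCK gives the fastest load
    BULK INSERT stgVendor
    FROM '\\mainframe\extracts\vendor.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)

    -- Move only the distinct rows into the final table
    INSERT INTO Vendor (KDNR)
    SELECT DISTINCT KDNR FROM stgVendor

    -- Clean up once the load has been verified
    TRUNCATE TABLE stgVendor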

    -- “it depends” on your data (types, columns, record count)

    Regards,

  • Matt,

    I've had to do something similar, and the staging table approach worked best for me. It's very clear, easy to validate, and easy to clean up.

    -Kel

  • I've used the stage table technique for a long time to weed out dupes. It is by far the easiest way to go. Plus, you can archive the stage table for safekeeping in case you need to redo the import or validate that you got the correct data. I'm actually in the middle of exactly this scenario right now as I upgrade my database to no longer use managed identities. Such fun! 🙂

    Gary Johnson

    Microsoft Natural Language Group

    DBA, Sr. DB Engineer
