Dupes issue

  • Hello,

    I have an app where I import txt (tab delimited) file(s) into tables. To avoid duplicates I have a constraint of 5 key fields. This works perfectly. But I need to identify duplicates and store them in a seperate table for auditing. To identify duplicates I do 2 things:

    1. I check the file (stored in a temporary table _A) against the existing table A using the 5 key fields. this works to identify if the data is already in ths system. This is the first thing that is done.

    2. I check _A to see if it has any duplicates contained in it. I use the following:

    INSERT INTO _tblDupes ( ....)

    SELECT * FROM _A a

    WHERE EXISTS  (

     SELECT NULL FROM _A b

     WHERE

      b.[fld_1] = a.[fld_1]

      AND b.[fld_2] = a.[fld_2]

      AND b.[fld_3] = a.[fld_3]

      AND b.[fld_4] = a.[fld_4]

     GROUP BY

      b.[fld_1], b.[fld_2],

      b.[fld_3], b.[fld_4]

     HAVING

      COUNT(b.[ID]) > 1

    )

    This also works perfectly. BUT the problem is, when I do an insert into A from _A it will take the 1st row and disregard the 2nd row as a duplicate because of the constraint. Where as the above query inserts both rows in _A in the _tbleDupes table. So when I go to sum up all the data I recieved from the file then the dupes are getting counted 2 or even 3 times. So what I want is:

    _A     A     _tblDupes

    ab    ab

    ac    ac

    ac               ac

    ad    ad

    ad               ad

    right now I am getting:

    _A     A     _tblDupes

    ab    ab

    ac    ac        ac

    ac               ac

    ad    ad        ad

    ad               ad

    any help will be greatly appreciated.

    Thank you

  • You can use the DISTINCT keywords like

    INSERT INTO _tblDupes ( ....)

    SELECT DISTINCT * FROM _A a

    WHERE EXISTS  (

     SELECT NULL FROM _A b

     WHERE

      b.[fld_1] = a.[fld_1]

      AND b.[fld_2] = a.[fld_2]

      AND b.[fld_3] = a.[fld_3]

      AND b.[fld_4] = a.[fld_4]

     GROUP BY

      b.[fld_1], b.[fld_2],

      b.[fld_3], b.[fld_4]

     HAVING

      COUNT(b.[ID]) > 1

    )

    Patrick Duflot

  • I have tried the DISTINCT select but to no avail. I thought that maybe because the duplicate rows dont have anything else common outside of the fields in the constraint but taht is not the case. I have 1 instance where every single field is identical and it still selected both. I have no idea how to fix this!!!

  • "Where exists (Select NULL From....)" should probably be changed to "Where Exists (Select 1 From...)", but I'm not sure that will change anything.

    I think your problem has to do with getting the unique row number for duplicates.  This can be done this way:

    Insert _tblDupes

    select a.UniqueID

    From _A a

    JOIN _A b on

      b.[fld_1] = a.[fld_1]

      AND b.[fld_2] = a.[fld_2]

      AND b.[fld_3] = a.[fld_3]

      AND b.[fld_4] = a.[fld_4]

    where a.UniqueID < b.UniqueID

    Update d

    set Field0 = a.Field0,

     Field1 = a.Field1

    From _tblDupes d

    JOIN _A a on d.UniqueID = a.UniqueID

    If you don't have "UniqueID" then you can add it to the table as a identity column.

    Signature is NULL

  • I have an idea on this. seems to work in my 3 three test cases. 1st I create a view X which selects the constraint fields and group them along wi COUNT(ID) > 1. This gives me the records that will show up as dupes. then I do a view Y where i join _A and X on the mobileID and select top 1 w/ id in desc order. this will give me the 2nd record. this assumes that the 1st one got in. But if there is more then 2 duplicate records then this will not work. any thoughts on how i can fix that?

    Let me know if this sounds like a good work around (sort of)?

     

     

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply