Remove duplicate records based on record id, annoying GROUP BY statement

  • mister.magoo (11/29/2010)


    @Brandie and GSquared,

    There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

    Explain, please, what it was we were missing.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (11/30/2010)


    mister.magoo (11/29/2010)


    @Brandie and GSquared,

    There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

    Explain, please, what it was we were missing.

    It is all in the OP's post.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (11/30/2010)


    Brandie Tarvin (11/30/2010)


    mister.magoo (11/29/2010)


    @Brandie and GSquared,

    There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

    Explain, please, what it was we were missing.

    It is all in the OP's post.

    Do you mean the part where it says:

    What am I trying to achieve:

    A duplicate record is found and removed based on exact match on value in column "companyname"

    Or the part that says:

    1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

    - all records, except for 1, are deleted from p table

    2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

    - all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

    3 duplicate records ALL have reference in pic table:

    - no records are deleted

    Which can be satisfied by adding a very, very simple Where clause to the code-skeleton that I wrote?

    ;with DupeCheck as

    (select row_number() over (partition by company_name order by ID) as Row, ID

    from dbo.prospects)

    delete from DupeCheck

    where Row > 1

    and ID not in (select ID from WhateverTheOtherTableIs);

    Why all the added complexity in your solution? It's doing a bunch of aggregations and a lot of extra work, for no reason I can see. What am I missing?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • @GSsquared: your query:

    ;with DupeCheck as

    (select row_number() over (partition by companyname order by ID) as Row, ID

    from dbo.prospects)

    delete from DupeCheck

    where Row > 1

    and ID not in (select ID from ProspectsInCampaigns);

    deletes 0 rows. MrMagoos query does delete the correct rows (he has also mentioned which rows should be deleted). Dont get me wrong: Im all for simplicity and short efficient code, but yours is (right now) not deleting anything as you can see. I have provided some sample data in earlier posts, did you see that? Also the ID col in prospects is a PK, perhaps that's causing issues with your code?

  • @mister.magoo: Love the comments in your code 🙂

    As you mentioned you do provide the ouput as well: output deleted.*

    But im still unsure which command to use to select the records that were deleted.

    I tried using the INTO keyword:

    DELETE p

    output deleted.*

    from dupes

    cross apply (

    SELECT

    COALESCE(

    case

    when totalpic =0 and rownum>1 then 'DELETE ME'

    else null

    end, --== no pic records = delete all but first (ordered by id)

    case

    when minpic>0 then NULL

    when totalpic>0 and pic=0 then 'DELETE ME'

    else null

    end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

    ) AS deleteFlag ) x

    join prospects p

    on p.id = dupes.id

    where counter>1

    AND deleteFlag='DELETE ME'

    INTO deletedrecords

    but that just gives me: "Incorrect syntax near the keyword 'INTO'."

    ps. deletedrecords table is just an exact copy of prospects table

  • petervdkerk (11/30/2010)


    @mister.magoo: Love the comments in your code 🙂

    As you mentioned you do provide the ouput as well: output deleted.*

    But im still unsure which command to use to select the records that were deleted.

    I tried using the INTO keyword:

    DELETE p

    output deleted.*

    INTO deletedrecords

    from dupes

    cross apply (

    SELECT

    COALESCE(

    case

    when totalpic =0 and rownum>1 then 'DELETE ME'

    else null

    end, --== no pic records = delete all but first (ordered by id)

    case

    when minpic>0 then NULL

    when totalpic>0 and pic=0 then 'DELETE ME'

    else null

    end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

    ) AS deleteFlag ) x

    join prospects p

    on p.id = dupes.id

    where counter>1

    AND deleteFlag='DELETE ME'

    but that just gives me: "Incorrect syntax near the keyword 'INTO'."

    ps. deletedrecords table is just an exact copy of prospects table

    Hi Peter, I have amended the code above, the INTO comes after the OUTPUT

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • GSquared (11/30/2010)


    ... the part that says:

    1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

    - all records, except for 1, are deleted from p table

    2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

    - all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

    3 duplicate records ALL have reference in pic table:

    - no records are deleted

    Which can be satisfied by adding a very, very simple Where clause to the code-skeleton that I wrote?

    ;with DupeCheck as

    (select row_number() over (partition by company_name order by ID) as Row, ID

    from dbo.prospects)

    delete from DupeCheck

    where Row > 1

    and ID not in (select ID from WhateverTheOtherTableIs);

    Why all the added complexity in your solution? It's doing a bunch of aggregations and a lot of extra work, for no reason I can see. What am I missing?

    Hi GSquared, I was referring to the second set of requirements, and the problem with the proposed solution you provide (or at least from a quick scan of it) is that you will never delete the companyname row that has [Row]=1 - which is a possibility with these requirements. 🙂

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (11/30/2010)


    Brandie Tarvin (11/30/2010)


    mister.magoo (11/29/2010)


    @Brandie and GSquared,

    There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

    Explain, please, what it was we were missing.

    It is all in the OP's post.

    MM, but if you're going to call someone out for missing something in the OP's post, it's helpful if you point out exactly what was missed. Saying "it's all in the post" doesn't help. Most people will still read what they read into the post the first time.

    That's why I asked you to explain. Your non-response response just hit one of my pet peeves. Please don't do that again.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie, I found your response to be curious - I was merely pointing out that you had not answered the question as asked - and I did not have time to go into the detail.

    Now that I have more time - here are the requirements - perhaps you could explain what part you do not understand??

    😉

    Ok, I think we've concluded now (you guys mostly;) my code is horrible 🙂

    Actually I just ripped it of some blog and used, since I didnt know cursors are that bad. Well...now I do know 🙂

    To answer all questions:

    - yes, im only removing duplicates on a SINGLE column value, because all other fields are the same (an earlier method has done that).

    - Please find the scripts to create tables and sample data below. In the sample data not all columsn are the same, but thats ok, since those other columns may be ignored in my SQL to remove the duplicates.

    - As you will see in the sample data and tables, some records occur only in "prospects" (called p from now), whilst others also have a reference in "prospectsincampaigns" (called pic from now)

    What am I trying to achieve:

    A duplicate record is found and removed based on exact match on value in column "companyname"

    Scenario's

    1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

    - all records, except for 1, are deleted from p table

    2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

    - all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

    3 duplicate records ALL have reference in pic table:

    - no records are deleted

    preferably the result of the quesries is the amount of records that were deleted, but if thats hard nevermind....

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (11/30/2010)


    GSquared (11/30/2010)


    ... the part that says:

    1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

    - all records, except for 1, are deleted from p table

    2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

    - all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

    3 duplicate records ALL have reference in pic table:

    - no records are deleted

    Which can be satisfied by adding a very, very simple Where clause to the code-skeleton that I wrote?

    ;with DupeCheck as

    (select row_number() over (partition by company_name order by ID) as Row, ID

    from dbo.prospects)

    delete from DupeCheck

    where Row > 1

    and ID not in (select ID from WhateverTheOtherTableIs);

    Why all the added complexity in your solution? It's doing a bunch of aggregations and a lot of extra work, for no reason I can see. What am I missing?

    Hi GSquared, I was referring to the second set of requirements, and the problem with the proposed solution you provide (or at least from a quick scan of it) is that you will never delete the companyname row that has [Row]=1 - which is a possibility with these requirements. 🙂

    or Row = 1

    and exists (...another record for the same company that has rows in the sub-table...) and not exists (...row for this ID in the sub-table...);

    Fill in the blank to fit whatever rules you like.

    Actually, if this were "my" database, I'd probably do the whole thing completely differently. This isn't a straight-up dedupe. There's more data cleaning needed on this.

    First, I'd look into consolidating all records for any given company name onto a single ID for each one. That's more important, most likely, than cleaning up empty records and pseudo-duplicates. So, if "ACME Inc" has 10 IDs, with the even numbered ones having records in the sub-table, I'd look at updating the sub-table to put all of those records onto the first ID for ACME. Then I'd run the original dedupe I proposed. Then I'd work out a unique index and some other tools that would prevent the data from getting dirty again. Anything less is, most likely, a half-solution.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • @GSquared: its like you read the mind of my boss...

    He said the exact same, and after thinking of it...it does make way more sense! 🙂

    So the code Mr Magoo does exactly what my initial requirements were.

    However, now Im left with duplicate records in my prospects table when those records have a reference to them from the pic table.

    What I need is that the duplicate records in the prospects table are merged into 1, and the id of the record that remains in the prospects table is applied to all records in the pic table that had a reference to any of the duplicate records that are now merged....

    Are we saying the same? If so, could you help me achieve this? Or you MrMagoo?

    Thanks again!!!

    Solving this would be a great christmas present....ok, thats kinda sad 😉

  • What you'd need to do would be find the first ID for each company name, and then update all the records in the subtable to match those.

    For example:

    ;with PrimaryIDs (PrimaryID, Company) as

    (select min(ID), company_name

    from dbo.MyCompaniesTable

    group by company_name),

    IDs (PrimaryID, SecondaryID) as

    (select PrimaryID, ID

    from PrimaryIDs

    inner join dbo.MyCompaniesTable

    on PrimaryIDs.Company = MyCompaniesTable.company_name

    and PrimaryIDs.PrimaryID < MyCompaniesTable.ID)

    update Sub

    set CompanID = PrimaryID

    from dbo.MySubTable as Sub

    inner join IDs

    on Sub.CompanyID = IDs.SecondaryID;

    What that does is find the first ID for each company, then find all the secondary IDs for the same company name, then update all the IDs in the subtable to match the "primary ID" for the company. Clear enough?

    Once you've run that, then you use the same CTEs to delete from the companies table, thus getting rid of all of the duplicate companies.

    Then you probably want to add a unique index to the company_name field, or find an better primary key than an ID value (there are lots of candidates for that), which will prevent duplicates from being put in the system in the first place.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Yes, its clear and it also ALMOST works...:)

    my requirement is also that the records that will be deleted are inserted in another table (for backup purposes).

    So, what I did is:

    ;with PrimaryIDs (PrimaryID, Company) as

    (select min(ID), companyname

    from prospects

    group by companyname),

    IDs (PrimaryID, SecondaryID) as

    (select PrimaryID, ID

    from PrimaryIDs

    inner join prospects

    on PrimaryIDs.Company = prospects.companyname

    and PrimaryIDs.PrimaryID < prospects.ID)

    update Sub

    set prospectid = PrimaryID

    from ProspectsInCampaigns as Sub

    inner join IDs

    on Sub.prospectid = IDs.SecondaryID;

    ;with DupeCheck as

    (select row_number() over (partition by companyname order by ID) as Row, ID

    from dbo.prospects)

    delete from DupeCheck

    output deleted.*

    INTO prospects_copy

    where Row > 1

    But then I get:

    "Column name or number of supplied values does not match table definition." on this line: ";with DupeCheck as"

    What am I doing wrong?

    UPDATE: I also tried replacing ID with *, so:

    select row_number() over (partition by companyname order by ID) as Row, *

    But that gives the same error...so my guess is now the Row column is causing the problem...but what I'd rather not have to do is individually insert the column names that are to be selected and inserted.

  • That is a strange request - if you don't want to manually enter all the column names, just open the object explorer, find your table and drag the "Columns" node to the query pane - all the columns are dumped into the query pane just ready for a select....

    Otherwise, you would have to add a column for the row number to your backup table.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Sorry if I wasnt clear, but by this "what I'd rather not have to do is individually insert the column names that are to be selected and inserted.", I mean:

    I want the statement to be flexible and if I later add columns to the database, I dont want to have to alter the query to include the newly added columns.

    So thats why I want to use: *, instead of the actual column names.

  • Viewing 15 posts - 16 through 30 (of 33 total)

    You must be logged in to reply to this topic. Login to reply