Remove duplicate records based on record id, annoying GROUP BY statement

Question

Post reply

Remove duplicate records based on record id, annoying GROUP BY statement

Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

Brandie Tarvin SSC Guru Points: 172918 More actions · Answer 1

mister.magoo (11/29/2010)

@Brandie and GSquared,

There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

Explain, please, what it was we were missing.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

mister.magoo SSC-Forever Points: 47068 More actions · Answer 2

Brandie Tarvin (11/30/2010)

mister.magoo (11/29/2010)

@Brandie and GSquared,

There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

Explain, please, what it was we were missing.

It is all in the OP's post.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

GSquared SSC Guru Points: 260824 More actions · Answer 3

mister.magoo (11/30/2010)

Brandie Tarvin (11/30/2010)

mister.magoo (11/29/2010)

@Brandie and GSquared,

There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

Explain, please, what it was we were missing.

It is all in the OP's post.

Do you mean the part where it says:

What am I trying to achieve:

A duplicate record is found and removed based on exact match on value in column "companyname"

Or the part that says:

1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

- all records, except for 1, are deleted from p table

2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

- all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

3 duplicate records ALL have reference in pic table:

- no records are deleted

Which can be satisfied by adding a very, very simple Where clause to the code-skeleton that I wrote?

;with DupeCheck as



  (select row_number() over (partition by company_name order by ID) as Row, ID

  from dbo.prospects)

delete from DupeCheck

where Row > 1

and ID not in (select ID from WhateverTheOtherTableIs);

Why all the added complexity in your solution? It's doing a bunch of aggregations and a lot of extra work, for no reason I can see. What am I missing?

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

petervdkerk SSCrazy Points: 2155 More actions · Answer 4

@GSsquared: your query:

;with DupeCheck as

(select row_number() over (partition by companyname order by ID) as Row, ID

from dbo.prospects)

delete from DupeCheck

where Row > 1

and ID not in (select ID from ProspectsInCampaigns);

deletes 0 rows. MrMagoos query does delete the correct rows (he has also mentioned which rows should be deleted). Dont get me wrong: Im all for simplicity and short efficient code, but yours is (right now) not deleting anything as you can see. I have provided some sample data in earlier posts, did you see that? Also the ID col in prospects is a PK, perhaps that's causing issues with your code?

petervdkerk SSCrazy Points: 2155 More actions · Answer 5

@mister.magoo: Love the comments in your code 🙂

As you mentioned you do provide the ouput as well: output deleted.*

But im still unsure which command to use to select the records that were deleted.

I tried using the INTO keyword:

DELETE p

output deleted.*

from dupes

cross apply (

SELECT

COALESCE(

case

when totalpic =0 and rownum>1 then 'DELETE ME'

else null

end, --== no pic records = delete all but first (ordered by id)

case

when minpic>0 then NULL

when totalpic>0 and pic=0 then 'DELETE ME'

else null

end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

) AS deleteFlag ) x

join prospects p

on p.id = dupes.id

where counter>1

AND deleteFlag='DELETE ME'

INTO deletedrecords

but that just gives me: "Incorrect syntax near the keyword 'INTO'."

ps. deletedrecords table is just an exact copy of prospects table

mister.magoo SSC-Forever Points: 47068 More actions · Answer 6

petervdkerk (11/30/2010)

@mister.magoo: Love the comments in your code 🙂

As you mentioned you do provide the ouput as well: output deleted.*

But im still unsure which command to use to select the records that were deleted.

I tried using the INTO keyword:

DELETE p

output deleted.*

INTO deletedrecords

from dupes

cross apply (

SELECT

COALESCE(

case

when totalpic =0 and rownum>1 then 'DELETE ME'

else null

end, --== no pic records = delete all but first (ordered by id)

case

when minpic>0 then NULL

when totalpic>0 and pic=0 then 'DELETE ME'

else null

end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

) AS deleteFlag ) x

join prospects p

on p.id = dupes.id

where counter>1

AND deleteFlag='DELETE ME'

but that just gives me: "Incorrect syntax near the keyword 'INTO'."

ps. deletedrecords table is just an exact copy of prospects table

Hi Peter, I have amended the code above, the INTO comes after the OUTPUT

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

mister.magoo SSC-Forever Points: 47068 More actions · Answer 7

GSquared (11/30/2010)

... the part that says:

1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

- all records, except for 1, are deleted from p table

2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

- all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

3 duplicate records ALL have reference in pic table:

- no records are deleted

Which can be satisfied by adding a very, very simple Where clause to the code-skeleton that I wrote?

;with DupeCheck as
(select row_number() over (partition by company_name order by ID) as Row, ID from dbo.prospects) delete from DupeCheck where Row > 1
and ID not in (select ID from WhateverTheOtherTableIs);

Why all the added complexity in your solution? It's doing a bunch of aggregations and a lot of extra work, for no reason I can see. What am I missing?

Hi GSquared, I was referring to the second set of requirements, and the problem with the proposed solution you provide (or at least from a quick scan of it) is that you will never delete the companyname row that has [Row]=1 - which is a possibility with these requirements. 🙂

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Brandie Tarvin SSC Guru Points: 172918 More actions · Answer 8

mister.magoo (11/30/2010)

Brandie Tarvin (11/30/2010)

mister.magoo (11/29/2010)

@Brandie and GSquared,

There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

Explain, please, what it was we were missing.

It is all in the OP's post.

MM, but if you're going to call someone out for missing something in the OP's post, it's helpful if you point out exactly what was missed. Saying "it's all in the post" doesn't help. Most people will still read what they read into the post the first time.

That's why I asked you to explain. Your non-response response just hit one of my pet peeves. Please don't do that again.

Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

mister.magoo SSC-Forever Points: 47068 More actions · Answer 9

Brandie, I found your response to be curious - I was merely pointing out that you had not answered the question as asked - and I did not have time to go into the detail.

Now that I have more time - here are the requirements - perhaps you could explain what part you do not understand??

😉

Ok, I think we've concluded now (you guys mostly;) my code is horrible 🙂

Actually I just ripped it of some blog and used, since I didnt know cursors are that bad. Well...now I do know 🙂

To answer all questions:

- yes, im only removing duplicates on a SINGLE column value, because all other fields are the same (an earlier method has done that).

- Please find the scripts to create tables and sample data below. In the sample data not all columsn are the same, but thats ok, since those other columns may be ignored in my SQL to remove the duplicates.

- As you will see in the sample data and tables, some records occur only in "prospects" (called p from now), whilst others also have a reference in "prospectsincampaigns" (called pic from now)

What am I trying to achieve:

A duplicate record is found and removed based on exact match on value in column "companyname"

Scenario's

1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

- all records, except for 1, are deleted from p table

2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

- all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

3 duplicate records ALL have reference in pic table:

- no records are deleted

preferably the result of the quesries is the amount of records that were deleted, but if thats hard nevermind....

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

GSquared SSC Guru Points: 260824 More actions · Answer 10

mister.magoo (11/30/2010)

GSquared (11/30/2010)

... the part that says:

1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

- all records, except for 1, are deleted from p table

2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

- all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

3 duplicate records ALL have reference in pic table:

- no records are deleted

Which can be satisfied by adding a very, very simple Where clause to the code-skeleton that I wrote?

;with DupeCheck as
(select row_number() over (partition by company_name order by ID) as Row, ID from dbo.prospects) delete from DupeCheck where Row > 1
and ID not in (select ID from WhateverTheOtherTableIs);

Why all the added complexity in your solution? It's doing a bunch of aggregations and a lot of extra work, for no reason I can see. What am I missing?

Hi GSquared, I was referring to the second set of requirements, and the problem with the proposed solution you provide (or at least from a quick scan of it) is that you will never delete the companyname row that has [Row]=1 - which is a possibility with these requirements. 🙂

or Row = 1

and exists (...another record for the same company that has rows in the sub-table...) and not exists (...row for this ID in the sub-table...);

Fill in the blank to fit whatever rules you like.

Actually, if this were "my" database, I'd probably do the whole thing completely differently. This isn't a straight-up dedupe. There's more data cleaning needed on this.

First, I'd look into consolidating all records for any given company name onto a single ID for each one. That's more important, most likely, than cleaning up empty records and pseudo-duplicates. So, if "ACME Inc" has 10 IDs, with the even numbered ones having records in the sub-table, I'd look at updating the sub-table to put all of those records onto the first ID for ACME. Then I'd run the original dedupe I proposed. Then I'd work out a unique index and some other tools that would prevent the data from getting dirty again. Anything less is, most likely, a half-solution.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

petervdkerk SSCrazy Points: 2155 More actions · Answer 11

@GSquared: its like you read the mind of my boss...

He said the exact same, and after thinking of it...it does make way more sense! 🙂

So the code Mr Magoo does exactly what my initial requirements were.

However, now Im left with duplicate records in my prospects table when those records have a reference to them from the pic table.

What I need is that the duplicate records in the prospects table are merged into 1, and the id of the record that remains in the prospects table is applied to all records in the pic table that had a reference to any of the duplicate records that are now merged....

Are we saying the same? If so, could you help me achieve this? Or you MrMagoo?

Thanks again!!!

Solving this would be a great christmas present....ok, thats kinda sad 😉

GSquared SSC Guru Points: 260824 More actions · Answer 12

What you'd need to do would be find the first ID for each company name, and then update all the records in the subtable to match those.

For example:

;with PrimaryIDs (PrimaryID, Company) as



  (select min(ID), company_name

  from dbo.MyCompaniesTable

  group by company_name),

IDs (PrimaryID, SecondaryID) as

  (select PrimaryID, ID

  from PrimaryIDs

  inner join dbo.MyCompaniesTable

    on PrimaryIDs.Company = MyCompaniesTable.company_name

    and PrimaryIDs.PrimaryID < MyCompaniesTable.ID)

update Sub

set CompanID = PrimaryID

from dbo.MySubTable as Sub

inner join IDs

on Sub.CompanyID = IDs.SecondaryID;

What that does is find the first ID for each company, then find all the secondary IDs for the same company name, then update all the IDs in the subtable to match the "primary ID" for the company. Clear enough?

Once you've run that, then you use the same CTEs to delete from the companies table, thus getting rid of all of the duplicate companies.

Then you probably want to add a unique index to the company_name field, or find an better primary key than an ID value (there are lots of candidates for that), which will prevent duplicates from being put in the system in the first place.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

petervdkerk SSCrazy Points: 2155 More actions · Answer 13

Yes, its clear and it also ALMOST works...:)

my requirement is also that the records that will be deleted are inserted in another table (for backup purposes).

So, what I did is:

;with PrimaryIDs (PrimaryID, Company) as

(select min(ID), companyname

from prospects

group by companyname),

IDs (PrimaryID, SecondaryID) as

(select PrimaryID, ID

from PrimaryIDs

inner join prospects

on PrimaryIDs.Company = prospects.companyname

and PrimaryIDs.PrimaryID < prospects.ID)

update Sub

set prospectid = PrimaryID

from ProspectsInCampaigns as Sub

inner join IDs

on Sub.prospectid = IDs.SecondaryID;

;with DupeCheck as

(select row_number() over (partition by companyname order by ID) as Row, ID

from dbo.prospects)

delete from DupeCheck

output deleted.*

INTO prospects_copy

where Row > 1

But then I get:

"Column name or number of supplied values does not match table definition." on this line: ";with DupeCheck as"

What am I doing wrong?

UPDATE: I also tried replacing ID with *, so:

select row_number() over (partition by companyname order by ID) as Row, *

But that gives the same error...so my guess is now the Row column is causing the problem...but what I'd rather not have to do is individually insert the column names that are to be selected and inserted.

mister.magoo SSC-Forever Points: 47068 More actions · Answer 14

That is a strange request - if you don't want to manually enter all the column names, just open the object explorer, find your table and drag the "Columns" node to the query pane - all the columns are dumped into the query pane just ready for a select....

Otherwise, you would have to add a column for the row number to your backup table.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

petervdkerk SSCrazy Points: 2155 More actions · Answer 15

Sorry if I wasnt clear, but by this "what I'd rather not have to do is individually insert the column names that are to be selected and inserted.", I mean:

I want the statement to be flexible and if I later add columns to the database, I dont want to have to alter the query to include the newly added columns.

So thats why I want to use: *, instead of the actual column names.