Remove duplicate records based on record id, annoying GROUP BY statement

  • I want to remove duplicates from the DB. here's my code for that:

    DECLARE @Count int

    DECLARE @companyname nvarchar(50)

    DECLARE @id int

    DECLARE dublicate_cursor CURSOR FAST_FORWARD FOR

    SELECT companyname, Count(*) - 1

    FROM prospects

    GROUP BY companyname

    HAVING Count(*) > 1

    OPEN dublicate_cursor

    FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count

    WHILE @@FETCH_STATUS = 0

    BEGIN

    DELETE TOP(@Count) FROM prospects WHERE companyname = @companyname

    FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count

    END

    CLOSE dublicate_cursor

    DEALLOCATE dublicate_cursor

    As you can see, it now deletes records based on the companyname, but thats not what I want. I want to delete it based on record id.

    So, Im trying to change a small part of the code:

    this gives me the companynames with as a second column the amount it occurs in the DB

    SELECT companyname, Count(*) - 1

    FROM prospects

    GROUP BY companyname

    HAVING Count(*) > 1

    I changed it to, the following, which gives me NO results (no error, just no results)

    SELECT id,companyname, Count(companyname) - 1

    FROM prospects

    GROUP BY companyname,id

    HAVING Count(companyname) > 1

    why does it give NO results? Propably because id is unique, but how can I select it anyway?

  • It doesn't make much sense to me.

    Can you post table script and some sample data?

    Take a look at the article linked in my signature line to see how you can post effectively.

    -- Gianluca Sartori

  • Look like this one. Duplicate post?

    -- Gianluca Sartori

  • CELKO (11/24/2010)


    >> I want to remove duplicates from the DB. here's my code for that: <<

    Your code is terrible and using a cursor is really bad SQL programming. What you have done is re-discover the way that we did data processing with magnetic tapes six decades ago.

    Why did you use a reserved word for a local variable? Can you be more vague than "id"? What entity is it an identifier of? Remember, mag tapes count records and RDBMS uses set of rows. Rows are not anything like records.

    Throw out this whole thing and use:

    CREATE VIEW Duplicate_Prospects (company_name, company_occurrences)

    AS

    SELECT company_name, COUNT(*) AS company_occurrences

    FROM Prospects

    GROUP BY company_name;

    The VIEW will always be current and correct as you add more data to the Prospects table. Unlike punch cards and magnetic tape files, a table can be virtual. Please read a book on basic RDBMS, so you can stop writing COBOL and assembly language in SQL and start using SQL for RDBMS.

    Uh-huh... let's see the code that will delete the duplicates using that view. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • SELECT id,companyname, Count(companyname) - 1

    FROM prospects

    GROUP BY companyname,id

    HAVING Count(companyname) > 1

    why does it give NO results? Propably because id is unique, but how can I select it anyway?

    You have the companyname and id in the Group By, so you won't get a count of more than 1 on the company name.

  • petervdkerk (11/24/2010)


    I want to remove duplicates from the DB. here's my code for that:

    DECLARE @Count int

    DECLARE @companyname nvarchar(50)

    DECLARE @id int

    DECLARE dublicate_cursor CURSOR FAST_FORWARD FOR

    SELECT companyname, Count(*) - 1

    FROM prospects

    GROUP BY companyname

    HAVING Count(*) > 1

    OPEN dublicate_cursor

    FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count

    WHILE @@FETCH_STATUS = 0

    BEGIN

    DELETE TOP(@Count) FROM prospects WHERE companyname = @companyname

    FETCH NEXT FROM dublicate_cursor INTO @companyname, @Count

    END

    CLOSE dublicate_cursor

    DEALLOCATE dublicate_cursor

    As you can see, it now deletes records based on the companyname, but thats not what I want. I want to delete it based on record id.

    So, Im trying to change a small part of the code:

    this gives me the companynames with as a second column the amount it occurs in the DB

    SELECT companyname, Count(*) - 1

    FROM prospects

    GROUP BY companyname

    HAVING Count(*) > 1

    I changed it to, the following, which gives me NO results (no error, just no results)

    SELECT id,companyname, Count(companyname) - 1

    FROM prospects

    GROUP BY companyname,id

    HAVING Count(companyname) > 1

    why does it give NO results? Propably because id is unique, but how can I select it anyway?

    Um... you're deleting dupes based on a single column... companyname. My question would be, with the exception of the ID, do all the other columns have identical information in them, as well? And what kind of column is ID??? Is it an INT and does it carry the IDENTITY property? If not all the other columns have identical information in them, which row for a given companyname do you want to keep?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Ok, I think we've concluded now (you guys mostly;) my code is horrible 🙂

    Actually I just ripped it of some blog and used, since I didnt know cursors are that bad. Well...now I do know 🙂

    To answer all questions:

    - yes, im only removing duplicates on a SINGLE column value, because all other fields are the same (an earlier method has done that).

    - Please find the scripts to create tables and sample data below. In the sample data not all columsn are the same, but thats ok, since those other columns may be ignored in my SQL to remove the duplicates.

    - As you will see in the sample data and tables, some records occur only in "prospects" (called p from now), whilst others also have a reference in "prospectsincampaigns" (called pic from now)

    What am I trying to achieve:

    A duplicate record is found and removed based on exact match on value in column "companyname"

    Scenario's

    1 duplicate record occurs ONLY in p table and none of the duplicate records have a reference to it in pic table:

    - all records, except for 1, are deleted from p table

    2 duplicate record occurs in p table and SOME of those duplicate records HAVE a reference to it in pic table:

    - all records that occur in p table are deleted EXCEPT the ones that have a reference in PIC table

    3 duplicate records ALL have reference in pic table:

    - no records are deleted

    preferably the result of the quesries is the amount of records that were deleted, but if thats hard nevermind....

    /****** Object: Table [dbo].[ProspectsInCampaigns] Script Date: 11/25/2010 18:36:26 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TABLE [dbo].[ProspectsInCampaigns](

    [prospectid] [int] NOT NULL,

    [campaignid] [int] NOT NULL

    ) ON [PRIMARY]

    GO

    /****** Object: Table [dbo].[prospects] Script Date: 11/25/2010 18:25:55 ******/

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE TABLE [dbo].[prospects](

    [id] [int] IDENTITY(1,1) NOT NULL,

    [companyname] [nvarchar](255) NULL,

    [companytype] [int] NOT NULL,

    [companyaddress] [nvarchar](255) NULL,

    [companyhousenr] [nvarchar](255) NULL,

    [companyhousenradd] [nvarchar](255) NULL,

    [companyzipcode] [nvarchar](255) NULL,

    [companycity] [nvarchar](255) NULL,

    CONSTRAINT [PK_intercity] PRIMARY KEY CLUSTERED

    (

    [id] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    ALTER TABLE [dbo].[prospects] ADD CONSTRAINT [DF_prospects_companytype_1] DEFAULT ((0)) FOR [companytype]

    GO

    ************************** INSERTING THE SAMPLE DATA **************************

    declare @newid int

    INSERT INTO prospects(companyname,companytype,companyaddress,companyzipcode) VALUES('MyCompany',2,'Mystreet','65434')

    set @newid=SCOPE_IDENTITY()

    INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)

    INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany','Mystreet','126','90210')

    set @newid=SCOPE_IDENTITY()

    INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)

    INSERT INTO prospects(companyname,companyhousenr,companyzipcode,companycity) VALUES('MyCompany','126','65474','New York')

    set @newid=SCOPE_IDENTITY()

    INSERT INTO ProspectsInCampaigns(prospectid,campaignid) VALUES (@newid,1)

    --not inserted in a campaign, so should be deleted if duplicate

    INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany',0,'','','')

    INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('MyCompany',5,'','','')

    INSERT INTO prospects(companyname,companytype,companyaddress,companyhousenr,companyzipcode) VALUES('Other company',1,'Rosestreet','1','1002')

    INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('Other company','Rosestreet','1','100245')

    INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('yahoo.com','dlane 676','1','15444')

    INSERT INTO prospects(companyname,companyaddress,companyhousenr,companyzipcode) VALUES('yahoo.com','dlane','1','15444')

    Thanks so much for the good replies so far 🙂

  • Here's my attempt. I'm confident there are easier solutions not requiring the multiple sort operations...

    Edit: please explain the difference between this thread and the one Gianluca mentioned earlier. At the other thread you told the issue has been resolved followed by your reply over here. Sounds strange...

    ; WITH cte AS

    (

    SELECT

    COUNT(*) OVER (PARTITION BY companyname,

    CASE WHEN prospectid IS NULL THEN 0 ELSE 1 END) AS cnt_part, -- number of rows per companyname with and without entry in ProspectsInCampaigns

    MIN(p.id) OVER (PARTITION BY companyname,

    CASE WHEN prospectid IS NULL THEN 0 ELSE 1 END) min_part, -- lowest prospects.id per companyname with and without entry in ProspectsInCampaigns

    COUNT(*) OVER (PARTITION BY companyname) cnt_total, -- total number of rows per companyname

    p.id,

    prospectid

    FROM prospects p

    LEFT OUTER JOIN ProspectsInCampaigns c ON p.id=c.campaignid

    )

    --DELETE prospects

    SELECT p.*

    FROM prospects p

    INNER JOIN cte ON p.id=cte.id

    WHERE

    (prospectid IS NULL -- only rows with no reference to the ProspectsInCampaigns table

    AND (cnt_part < cnt_total -- either there are still some rows left for that companyname with entries in ProspectsInCampaigns

    OR (cnt_part = cnt_total AND p.id>min_part) -- or use all rows except the one with the earliest entry

    )

    )



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • I think it should only delete rows with id in 4, 5, 7, 9

    It would help of course if we had expected results 😉

    Here is my version (which is similar to Lutz' one but only recommends deleting 4,5,7 and 9)

    ;with dupes

    as

    (

    select

    id

    ,companyname

    --== counter > 1 means there is at least one duplicate based on companyname

    ,COUNT(*) over(partition by companyname) AS counter

    --== rownum = 1 is the record to keep when there are no "pic" records

    ,ROW_NUMBER() OVER(partition by companyname order by id) AS rownum

    --== pic is the count of records in the ProspectsInCampaign table for the current "id"

    ,pic AS pic

    --== totalpic = 0 means we can delete all but the first company id

    ,SUM(pic) over(partition by companyname) AS totalpic

    --== minpic > 0 means all duplicate companys in prospects also have records in ProspectsInCampaigns, so don't DELETE any

    ,MIN(pic) over(partition by companyname) AS minpic

    from prospects

    outer apply (select COUNT(*) from ProspectsInCampaigns where prospectid = id) a(pic)

    )

    select *

    from dupes

    cross apply (

    SELECT

    COALESCE(

    case

    when totalpic =0 and rownum>1 then 'DELETE ME'

    else null

    end, --== no pic records = delete all but first (ordered by id)

    case

    when minpic>0 then NULL

    when totalpic>0 and pic=0 then 'DELETE ME'

    else null

    end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

    ) AS deleteFlag ) x

    where counter>1

    -- uncomment the following line to select only the rows that should be deleted.

    -- AND deleteFlag='DELETE ME'

    IF you want to know what is being deleted, either store the results of this query in a table and use it as the basis for the DELETE or use an OUTPUT clause to insert the deleted rows into another table when you do the DELETE.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Hi

    Check the following code. This may be useful.

    declare @test_tab table (slno int,name varchar(10))

    insert into @test_tab values (1,'a')

    insert into @test_tab values (1,'b')

    insert into @test_tab values (1,'c')

    insert into @test_tab values (2,'a')

    insert into @test_tab values (2,'b')

    insert into @test_tab values (2,'c')

    insert into @test_tab values (3,'a')

    insert into @test_tab values (4,'a')

    insert into @test_tab values (5,'a')

    --select * from @test_tab

    declare @test_tab1 table (slno int,name varchar(10),r int)

    insert into @test_tab1

    select * from (

    select slno,name,DENSE_RANK() over (partition by slno order by name) r

    from @test_tab ) a where r = 1

    select * from @test_tab1

    Thanks

    Siva Kumar J.

  • Mister Magoo, that is indeed what the output is supposed to be (also my thanks to LutzM)!

    I ran your query and it marks the correct records for deletion.

    1 question:

    you say I can use this output as a basis for deletion...(I dont want to copy the data to another table) but where exactly would I need to place the delete statement in that case?

  • It's always good to keep the original data of the table stashed in a different DB / different table just in case you have to restore data from that one table because you wrote your code wrong. I actually have a USER database where we keep data like this all the time. Easier to do than snapshots / backups because I don't lose all other production data when I restore (stuff that occurred after my fubar).

    If the IDs are different, but all other columns are the same, there are easier ways to do this code than the CTE. Here's my preferred method.

    SELECT *

    INTO TemporaryDumpTable

    FROM prospects;

    --This keeps the data in case you make big mistake and need to reload the table

    DELETE FROM prospects

    FROM Prospects p1

    INNER JOIN (Select MAX(ID) AS ID, CompanyName

    FROM Prospects

    GROUP BY CompanyName) p2

    ON p1.CompanyName = p2.CompanyName

    AND p1.ID = p2.ID;

    --Deletes only the MAX id.

    Now, if you have more than 1 duplicate, you might do a LEFT OUTER JOIN and delete where p1.ID <> p2.ID. That will work too. HOWEVER, this is code you should test in a dev environment before putting into production, as one misplaced symbol or JOIN statement will destroy your entire table's data.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • petervdkerk (11/27/2010)


    Mister Magoo, that is indeed what the output is supposed to be (also my thanks to LutzM)!

    I ran your query and it marks the correct records for deletion.

    1 question:

    you say I can use this output as a basis for deletion...(I dont want to copy the data to another table) but where exactly would I need to place the delete statement in that case?

    Hi Peter, just like this:

    ;with dupes

    as

    (

    select

    id

    ,companyname

    --== counter > 1 means there is at least one duplicate based on companyname

    ,COUNT(*) over(partition by companyname) AS counter

    --== rownum = 1 is the record to keep when there are no "pic" records

    ,ROW_NUMBER() OVER(partition by companyname order by id) AS rownum

    --== pic is the count of records in the ProspectsInCampaign table for the current "id"

    ,pic AS pic

    --== totalpic = 0 means we can delete all but the first company id

    ,SUM(pic) over(partition by companyname) AS totalpic

    --== minpic > 0 means all duplicate companys in prospects also have records in ProspectsInCampaigns, so don't DELETE any

    ,MIN(pic) over(partition by companyname) AS minpic

    from prospects

    outer apply (select COUNT(*) from ProspectsInCampaigns where prospectid = id) a(pic)

    )

    DELETE p

    output deleted.*

    from dupes

    cross apply (

    SELECT

    COALESCE(

    case

    when totalpic =0 and rownum>1 then 'DELETE ME'

    else null

    end, --== no pic records = delete all but first (ordered by id)

    case

    when minpic>0 then NULL

    when totalpic>0 and pic=0 then 'DELETE ME'

    else null

    end --== minpic>0 means don't delete any, some pic records means delete records that have no pic records (pic=0)

    ) AS deleteFlag ) x

    join prospects p

    on p.id = dupes.id

    where counter>1

    AND deleteFlag='DELETE ME'

    Note that I included an OUTPUT clause to show you the deleted records - if you want you can insert that result set into another table as long as the table already exists.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • There's a simpler way to dedupe since SQL 2005.

    ;with DupeCheck as

    (select row_number() over (partition by company_name order by ID) as Row

    from dbo.prospects)

    delete from DupeCheck

    where Row > 1;

    If you want to store the rows that are deleted, for auditing or recovery, use an Output statement in the delete clause.

    I recommend having an Output statement, and running the whole thing in a transaction that you roll back, then verifying what was deleted, then change the rollback to commit and run it again.

    begin transaction;

    create table #T (ID int);

    ;with DupeCheck as

    (select row_number() over (partition by company_name order by ID) as Row

    from dbo.prospects)

    delete from DupeCheck

    output deleted.ID into #T (ID)

    where Row > 1;

    select top 10 * from #T;

    rollback;

    Check the data you get from the Output clause, make sure you really want to get rid of those rows, and then you're good to go once you've confirmed that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • @Brandie and GSquared,

    There were additional requirements in the OP's question - and while it is helpful to have your excellent input, you are not addressing the OP's problem.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

    Viewing 15 posts - 1 through 15 (of 33 total)

    You must be logged in to reply to this topic. Login to reply