Update references to duplicate records before deleting duplicates

  • I have 2 tables:

    prospects and prospectsincampaigns

    CREATE TABLE [dbo].[prospects](

    [id] [int] IDENTITY(1,1) NOT NULL,

    [companyname] [nvarchar](255) NULL,

    [importcode] [nvarchar](50) NULL,

    CONSTRAINT [PK_intercity] PRIMARY KEY CLUSTERED

    (

    [id] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    ) ON [PRIMARY]

    GO

    CREATE TABLE [dbo].[ProspectsInCampaigns](

    [prospectid] [int] NOT NULL,

    [campaignid] [int] NOT NULL,

    [appointmentdate] [datetime] NOT NULL

    ) ON [PRIMARY]

    GO

    ALTER TABLE [dbo].[ProspectsInCampaigns] ADD CONSTRAINT [DF_ProspectsInCampaigns_appointmentdate] DEFAULT (((1)/(1))/(1900)) FOR [appointmentdate]

    GO

    in [prospects] I have all the details of a prospect.

    in [prospectsincampaigns] I have appointment details on prospects

    some records occur multiple times in [prospects] (e.g. companyname 'IBM' occurs 4 times', id=3,7,12 and 15).

    The problem is that in [prospectsincampaigns] the details also refer to 4 occurences of IBM in prospects (so prospecid is referring to 3,7,12 and 15).

    before I remove all duplicates from [prospects] I first need to update the references in [prospectsincampaigns] so that they all point to the same id in [prospects] (in this case id=3).

    What I want to do: check for each record in [prospectsincampaigns] if the record it refers to in [prospects] is part of a set of duplicate records (duplicates based on companyname). If it does, I want to update the reference in prospectsincampaigns to the top record of the duplicate set in [prospects]. Basically update the prospectid values in [prospectsincampaigns] so that they point to the top record of the duplicate set they refer to.

    Ok, I hope this makes sense, any questions let me know!

    Thanks a bunch.

  • you can use a CTE with the row_number() function to identify the ID of oneof the duplicate records and join that CTE with whatever you want to do the updates and delete the dups. There are examples of this all over the internet. There's a pretty good example here (about halfway down)

    Once you've straitened out your ProspectsInCampaigns, before you delete anything from prospects, add a foreign key from ProspectsInCampaigns to Prospects.

    The most important thing is that you constrain your tables and table relationships with a valuable primary key and foreign key relationship.

    I would recommend altering the prospects table after removing all duplicates, and adding a unique index on the natural key of the table (the columns that make a row truly unique... not a random ID number). That will prevent further pollution.

    If you're using surrogate keys in your design, it is always a good idea to further constrain the tables with a unique index to prevent the situation you're in now.

    Cheers!

  • Ok, thx, I looked at you example page. I now have created this:

    SELECT companyname, RANK() OVER (PARTITION BY companyname ORDER BY num) rnk

    FROM (SELECT *, ROW_NUMBER() OVER(ORDER BY companyname) num

    FROM prospects

    ) X;

    But Im unsure where to go from here...the above statement gives me a row in the result for each occurrence.

    And even if I would have the duplicates in prospects table, how can I update the prospectsincampaigns with the top row of the duplicate set?

    Sorry, am kinda lost here 🙂

  • Tell me what your natural key is for the prospects table. Is it just company?

  • The natural key of prospects is "id", more records might hold the same companyname value. The situation where multiple DUPLICATE records with same companyname are in the prospects table is always going to occur, and its not even a bad thing (has to do with requirements of my client). But I do want to have the option to clean the DB of duplicate records based on the companyname value.

  • id is a surrogate key, not a natural key.

    the natural key is the one (or more) columns that make a row unique.

    a surrogate key is a unique idenifier appended to the row that has nothing to do with the data...

    ex.

    employeeID would be an example of a surrogate key

    employeeName, DateOfBirth, Department, HireDate would be an example of a natural key

    Maybe yours is [companyname], [importcode] ?

    I find it difficult to understand a requirement that wants completely duplicated rows sometimes but not others. Usually a duplicate row is a problem.

  • My client wants to import records and then based on for example distribute some with a specific importcode to a certain campaign.

    If I would import the records and immediately remove duplicates, that newly imported set could no longer be uniquely identified.

    So for the duration of the campaign, the duplicates may exist, but after the campaign those duplicates will be merged with the other records.

    its because they run a callcenter where clients deliver records in batches that must be delivered back after the campaign.

    Ok, so what would be my natural key, (ps I did not provide the full table definition, since its too big), but I guess it would be:

    companyname, phonenumber, importdate

    Does that help?

    Thanks!

  • I think you need to pull apart the company information and the import information into 2 different tables.

    you would then link the campaign to the company and the import.

    anyway, you shouldn't dedup this table because you're going to be losing data... those rows aren't exact duplicates, are they? There are different import dates, etc.

    If you perform the maintenance you are proposing you may break other functionality and will definitely lose the ability to report on the data you are deleting.

    I don't think you should go down that road.

    You should focus efforts on remodeling the data so you don't have tons of redundant data in a table.

    You should normalize every table in an OLTP database to third normal form.

    That's my friendly advice.

    Good luck.

  • Ok, thats good advice.:)

    The thing is that my client needs the functionality fast, so while Im remodeling (which has impact on other functionality too), I have to work with what I have now.

    Do you have a solution for my CURRENT situation where I will be assuming the rows ARE duplicates? So ignoring importcode and importdate?

    Thanks again!

  • I'll provide the rough syntax for selecting out the information you want and leave it to you to pick up the pieces. I don't know if it will run or compile as written.

    Be sure to back up all the data before you change it, that way you can correct it later.

    I still think this is a really bad idea and you should inform the business of the risks before you do anything. They could lose the ability to track or use that data. Be sure all the developers know what is going on.

    ;with myCTE as (

    select id, companyname, phonenumber, importdate,

    row_number() over (partition by companyname, phonenumber, importdate, order by importdate desc) rowNum

    from [dbo].[prospects]

    )

    select c2.id as newProspectID, c1.*

    from [dbo].[prospects] c1 inner join

    (select id, companyname, phonenumber, importdate, from myCTE where rowNum = 1) c2

    on c1.companyname = c2.companyname

    and c1.phonenumber = c2.phonenumber

    and c1.importdate = c2.importdate

  • Thanks, will have a look at this.

    Also thanks for the tips, I will definitely discuss this with my client.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply