Merging Duplicate Records

  • Hello! I have a rather complex data integration challenge I'd love some input on. The scenario is a a system of record that has allowed for multiple duplicate records of the same customer. I matched duplicate records to each other in a single To-From table. My problem is to now single out one of the customer records as the 'parent' and map all of the others to it.

    Note: the de-duping process allows for a customer to be mapped to multiple other customer records. Also, I need the leaf level records to be directly mapped to the highest level in the branch possible.

    Lastly, I have solved this problem by looping through the data but I would like to make the solution more set-based, without cursors etc.

    Here's some input data:

    CREATE TABLE dbo.CustomerMatch (

    FromCustomerID INT NOT NULL

    , ToCustomerID INT NOT NULL

    , Matches INT NOT NULL

    , ToCustomerDateCreated DATETIME NOT NULL

    )

    INSERT dbo.CustomerMatch VALUES

    (1, 2, 4, '20100101')

    , (2, 1, 4, '20090101')

    , (1, 3, 3, '20110101')

    , (3, 1, 3, '20090101')

    , (4, 2, 2, '20100101')

    , (2, 4, 2, '20120101')

    , (3, 2, 1, '20100101')

    , (2, 3, 1, '20110101')

    , (3, 4, 1, '20120101')

    , (4, 3, 1, '20110101')

    , (1, 4, 1, '20120101')

    , (4, 1, 1, '20090101')

    'Matches' is the number of attributes on which the two customer records were associated.

    And here's the result I would like:

    CustomerID ParentCustomerID Matches

    ----------- ---------------- -----------

    1 1 4

    2 1 4

    3 1 3

    4 1 2

    Thanks for the help!

    Chris

  • chris.ross 34852 (10/9/2012)


    Hello! I have a rather complex data integration challenge I'd love some input on. The scenario is a a system of record that has allowed for multiple duplicate records of the same customer. I matched duplicate records to each other in a single To-From table. My problem is to now single out one of the customer records as the 'parent' and map all of the others to it.

    Note: the de-duping process allows for a customer to be mapped to multiple other customer records. Also, I need the leaf level records to be directly mapped to the highest level in the branch possible.

    Does your deduping process use WHERE a.CustomerID <> b.CustomerID? If so, you could eliminate the dupes in the output by using WHERE a.CustomerID < b.CustomerID. This would result in an output like the following:

    (1, 2, 4, '20100101')

    , (1, 3, 3, '20110101')

    , (4, 2, 2, '20100101')

    , (3, 2, 1, '20100101')

    , (3, 4, 1, '20120101')

    , (1, 4, 1, '20120101')

    you might want to include the FROMcustomerCreated to help you choose which row to keep.

    Lastly, I have solved this problem by looping through the data but I would like to make the solution more set-based, without cursors etc.

    Here's some input data:

    CREATE TABLE dbo.CustomerMatch (

    FromCustomerID INT NOT NULL

    , ToCustomerID INT NOT NULL

    , Matches INT NOT NULL

    , ToCustomerDateCreated DATETIME NOT NULL

    )

    INSERT dbo.CustomerMatch VALUES

    (1, 2, 4, '20100101')

    , (2, 1, 4, '20090101')

    , (1, 3, 3, '20110101')

    , (3, 1, 3, '20090101')

    , (4, 2, 2, '20100101')

    , (2, 4, 2, '20120101')

    , (3, 2, 1, '20100101')

    , (2, 3, 1, '20110101')

    , (3, 4, 1, '20120101')

    , (4, 3, 1, '20110101')

    , (1, 4, 1, '20120101')

    , (4, 1, 1, '20090101')

    'Matches' is the number of attributes on which the two customer records were associated.

    And here's the result I would like:

    CustomerID ParentCustomerID Matches

    ----------- ---------------- -----------

    1 1 4

    2 1 4

    3 1 3

    4 1 2

    Thanks for the help!

    Chris

    There have been a couple of threads like this recently, I'll try to track 'em down.


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply