How can I group near duplicate records under a new common field?

  • Jason A. Long (8/31/2015)


    craig.bobchin (8/31/2015)


    Thanks Jason, I do actually have the fields in the table, I just copied the earlier version of the create table script for dimMembers.

    I'll check the code against the earlier version tomorrow morning. The spreadsheet data looks good. I've been running the original version of the script today with various thresholds and as expected getting wildly varied results.

    One thing I'm seeing is that of the 30k rows in the table almost 50% have either a -1 or a NULL in the GroupID field. I'm not sure what the deal is with that.

    No problem... I did see where my original code was getting NULLs based on your sample data. I found the issue and correct that in the latest code. I have absolutely no idea how you're getting -1 for anything. The GroupID is based on existing dimMemberIDs... So unless dimMemberID = -1 exists in your data, that shouldn't be possible...

    In any case, take the new version of the code for a spin and see if it gets you a little closer.

    We do have a -1 in DimMemberID it's our default for non-matching members. I'll filter it out when I run it.

  • Jason,

    That works great. I've been fooling around with the thresholds to try and get the best match before I present it to the business. I also made a couple of other minot tweaks to handle the 'n/a' that can appear in any of the matching fields.

    Takes about 12 minutes to run against our full table of 30k records.

    One thing I got to thinking about is as we add more records to the dimMember table, and have to run this script against the table daily, member's groupId has the potential to change based on new/updated member data.

    I'm curious as to your take on this aspect. I'm beginning to feel it's turtles all the way down.

    I'm strongly advocating that the company get MDS in house so we can avoid this issue.

  • craig.bobchin (9/1/2015)


    Jason,

    That works great. I've been fooling around with the thresholds to try and get the best match before I present it to the business. I also made a couple of other minot tweaks to handle the 'n/a' that can appear in any of the matching fields.

    Takes about 12 minutes to run against our full table of 30k records.

    One thing I got to thinking about is as we add more records to the dimMember table, and have to run this script against the table daily, member's groupId has the potential to change based on new/updated member data.

    I'm curious as to your take on this aspect. I'm beginning to feel it's turtles all the way down.

    I'm strongly advocating that the company get MDS in house so we can avoid this issue.

    Glad you got it working!

    You concerns about scalability are well founded. Any time you're comparing every row in a table to every other row in the table it's not going scale very well. The use of a triangular join is a lot better than a full Cartesian join but fair bit of the pain is still there.

    If you're going to be doing this on a regular basis, I'd modify the script so that only new rows are being scored. There's no good reason to rescore a row that's currently a member of a good group. Assuming you have a fairly consistent number of new members added each day, scoring only new members should allow the solution to scale MUCH better than rescoring every row every day.

    I don't know what "MDS" is, unless you're talking about Melissa Data... I've heard good things about it but I have no personal experience with it. If you go that rout, please post your impressions.

  • Jason A. Long (9/1/2015)


    craig.bobchin (9/1/2015)


    Jason,

    That works great. I've been fooling around with the thresholds to try and get the best match before I present it to the business. I also made a couple of other minot tweaks to handle the 'n/a' that can appear in any of the matching fields.

    Takes about 12 minutes to run against our full table of 30k records.

    One thing I got to thinking about is as we add more records to the dimMember table, and have to run this script against the table daily, member's groupId has the potential to change based on new/updated member data.

    I'm curious as to your take on this aspect. I'm beginning to feel it's turtles all the way down.

    I'm strongly advocating that the company get MDS in house so we can avoid this issue.

    Glad you got it working!

    You concerns about scalability are well founded. Any time you're comparing every row in a table to every other row in the table it's not going scale very well. The use of a triangular join is a lot better than a full Cartesian join but fair bit of the pain is still there.

    If you're going to be doing this on a regular basis, I'd modify the script so that only new rows are being scored. There's no good reason to rescore a row that's currently a member of a good group. Assuming you have a fairly consistent number of new members added each day, scoring only new members should allow the solution to scale MUCH better than rescoring every row every day.

    I don't know what "MDS" is, unless you're talking about Melissa Data... I've heard good things about it but I have no personal experience with it. If you go that rout, please post your impressions.

    MDS is Master Data Services. Which is designed with this situation in mind. The idea of scoring just new rows makes sense. It will be easy to implement.

    Thanks again for your help.

Viewing 4 posts - 31 through 33 (of 33 total)

You must be logged in to reply to this topic. Login to reply