How can I group near duplicate records under a new common field?

  • Jayanth_Kurup (8/27/2015)


    I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

    Check out this video [/url]

    All you might need is to schedule the package to run frequently.

    Jayanth,

    I watched the video, and it does give me some more information. However how would I handle it in my situation where I have multiple columns that I need to match on and not all of them have valid nor consistent data between records?

    Do I need multiple fuzzy grouping transforms or do I do it in one?

    How is the canonical field value determined?

    How do I assign a new key unique to the to the grouping?

  • Jason A. Long (8/27/2015)


    The code I posted earlier shows how to handle the "how to assign a common value to a new field based on the matches" problem.

    The idea is to find the "best" single parent/child match that exists for any given "ChildID". (Note: every row is a "child" of every preceding row and every row is a "parent" of every row that follows.) It's just a matter of finding the best parent for each child.

    The rows that don't have parents (there are no rows with, the PersonID in the ChildID position, with a score high enough to meet the match threshold) end up being the "GroupID" rows. Simply put, they are the 1st occurrence of a unique individual.

    This causes a natural adjacency list type hierarchy, which can be converted to a nested set hierarchy. (Easily accomplished thanks to Jeff Moden[/url])

    Once the Left & Right values are updated, they are used to define a "Group".

    For example...

    PersonID = 1 isn't the child of any other rows so it becomes the start of a group.

    PersonID = 2 matches to PersonID = 1 with a score of 13 (above the threshold) so it will be a part of GroupID = 1

    PersonID = 3 is matched equally to both 1 & 2 with a score of 11 (again above the threshold) making it part of Group 1. (note... It doesn't matter which of the "parents" is selected. They are equally good.)

    PersonID = 4 is matched to 1,2,&3 but the best match is to 3 with a score of 13 (above the threshold) again making it part of Group 1.

    PersonID = 5 is matched to 1,2,3&4 but none of those scores are high enough to meet the threshold, so it becomes the 1st member of Group 5.

    ... and so on...

    Applying the Lft & Rht values looks something like this...

    ParentID ChildID BestScore Lft Rgt

    0 1 100 2 9

    1 2 13 3 4

    1 3 11 5 8

    3 4 13 6 7

    0 5 100 10 17

    Everything between 2 & 9 is part of Group 1

    Everything between 10 & 17 is part of group 5...

    Jason,

    I modified your code for our table/fields and it doesn't quite work. I set the threshold to >=4 and of the 30k records in the table 8818 came back as Nulls in both the GroupID and MatchQuality fields. 19377 had a -1 in group ID and MatchQuality between 4 and 100.

    The 2000 had positive integers as GroupID with some grouping correctly under a group ID and others not grouping correctly.

    Not sure how I can tweak this to fix it.

  • craig.bobchin (8/28/2015)


    Jason,

    I modified your code for our table/fields and it doesn't quite work. I set the threshold to >=4 and of the 30k records in the table 8818 came back as Nulls in both the GroupID and MatchQuality fields. 19377 had a -1 in group ID and MatchQuality between 4 and 100.

    The 2000 had positive integers as GroupID with some grouping correctly under a group ID and others not grouping correctly.

    Not sure how I can tweak this to fix it.

    Give me a few minutes... I'll break up the original code into more digestible chunks and add better comments. Hopefully that'll help you to see where you made a wrong turn.

    If that doesn't work, it may be necessary to post your actual table structure along with some test data (just make sure you aren't posting peoples real information).

    BRB...

  • This 1st section is just setting up the test data and creating a couple of temp tables that'll be needed further down in the script.

    The 1st table, #People, will obviously be replaced by you you own production table.

    The next two temp tables (#MatchScores & #BestMatches) are just working tables... So, there should be no need to create them as permanent tables.

    -- Create the necessary tables

    IF OBJECT_ID('tempdb..#People') IS NOT NULL

    DROP TABLE #People;

    CREATE TABLE #People (

    PersonID INT IDENTITY(1,1),

    FirstName VARCHAR(20),

    LastName VARCHAR(20),

    SSN VARCHAR(9),

    Address_1 VARCHAR(30),

    Address_2 VARCHAR(30),

    City VARCHAR(30),

    [State] CHAR(2),

    Zip CHAR(5),

    GroupID INT,

    MatchQuality INT

    );

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    FirstNameScore INT,

    LastNameScore INT,

    SSN_Score INT,

    Add1_Score INT,

    Add2_Score INT,

    CityScore INT,

    StateScore INT,

    ZipScore INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    INSERT #People (FirstName,LastName,SSN,Address_1,Address_2,City,State,Zip) VALUES

    ('Bob','Jones','333224444','258 Grants Trail','#3','Bowling Green','KY','45459'),

    ('Robert','Jones','333224444','258 Grants Trail','Ste 3','Bowling Green','KY','45459'),

    ('Rob','Jones','333224444','258 Grants Tr.',NULL,'Bowling Green','KY','45459'),

    ('R.','Jones','333224444','258 Grants Tr.','Ste 3','Bowling Green','KY','45459'),

    ('Louis','Armstrong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

    ('L.','Armstrng','258321478','123 Humming Bird Lane',NULL,'Jacksonville','FL','32207'),

    ('Louie','Armstrong','258321478','123 Humming Bird Lain',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

    ('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33304'),

    ('M.','Adams','321456987','555 Bay Meadows Dr.','# 42','Ashvlle','NC','33304'),

    ('Mark','Adams','321456987','555 Baymeadows Drive','Unit 42','Ashville','NC','33305'),

    ('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33306'),

    ('Bob','Jones','555229999','4227 Some Place Ln.','#3','Bowling Green','KY','45459'),

    ('Bob','Jones','555229999','4227 Some Place Ln.','# 3','Bowling Green','KY','45459'),

    ('Bob','Jones','555292999','4227 SomePlace Ln','Unit 3','Bowling Green','KY','45459'),

    ('Bob','Jones','555292999','4227 Some Place Ln.','Ste 3','Bowling Green','KY','45459'),

    ('Louis','Armstrong','147852369','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstrong','147852369','123 Baldwin Ave',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstrong','147852369','123 Baldwin Anenue',NULL,'Jachsonville','FL','32209'),

    ('Louis','Armstrong','147852639','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

    ('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

    ('Mark','Adam','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

    ('Mark','Adams','654789231','524 Main St.','Unit 42','Barkersville','NC','33304'),

    ('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304');

    -- Take a look at the base data.

    SELECT * FROM #People p;

    ----------------------------------------

    This next section works essentially on the same "triangular join" principal that other's have posted.

    The real key here is the assigning of weights. There will be some trial and error here, as you discover that certain attributes are stronger indicators of identity than others.

    As it's posted, I kept it very simple, looking only at exact matches. In real life, I've gone as far as splitting the SSN into individual characters and matching the individual characters (giving max points to perfect matches and reduced points for partial matches to allow for transposed characters)... and removing all non-alphanumeric characters from names & addresses to help reduce the impact of typos. Anyway, you get the point... You can make it a simple or complex as you'd like...

    WITH ColumnScores AS ( -- Assign weights to the various attributes

    SELECT

    p1.PersonID AS ParentID,

    p2.PersonID AS ChildID,

    CASE WHEN p1.FirstName = p2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN p1.LastName = p2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN p1.SSN = p2.SSN THEN 6 ELSE 0 END AS SSN_Score,

    CASE WHEN p1.Address_1 = p2.Address_1 THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN COALESCE(p1.Address_2, '') = COALESCE(p2.Address_2, '') THEN 1 ELSE 0 END AS Add2_Score,

    CASE WHEN p1.City = p2.City THEN 1 ELSE 0 END AS CityScore,

    CASE WHEN p1.State = p2.State THEN 1 ELSE 0 END AS StateScore,

    CASE WHEN p1.Zip = p2.Zip THEN 1 ELSE 0 END AS ZipScore

    FROM

    #People p1

    JOIN #People p2

    ON p1.PersonID < p2.PersonID

    )

    INSERT #MatchScores (ParentID,ChildID,FirstNameScore,LastNameScore,SSN_Score,

    Add1_Score,Add2_Score,CityScore,StateScore,ZipScore,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.CityScore,

    cs.StateScore,

    cs.ZipScore,

    cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

    cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

    cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore > 7 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    Here is what the data in #MatchScores should look like at this point...

    ParentID ChildID FirstNameScore LastNameScore SSN_Score Add1_Score Add2_Score CityScore StateScore ZipScore TotalScore

    ----------- ----------- -------------- ------------- ----------- ----------- ----------- ----------- ----------- ----------- -----------

    1 2 0 2 6 2 0 1 1 1 13

    1 3 0 2 6 0 0 1 1 1 11

    1 4 0 2 6 0 0 1 1 1 11

    2 3 0 2 6 0 0 1 1 1 11

    2 4 0 2 6 0 1 1 1 1 12

    3 4 0 2 6 2 0 1 1 1 13

    5 6 0 0 6 0 1 1 1 0 9

    5 7 0 2 6 0 1 1 1 1 12

    5 8 1 0 6 2 1 1 1 1 13

    6 7 0 0 6 0 1 1 1 0 9

    6 8 0 0 6 0 1 1 1 0 9

    7 8 0 0 6 0 1 1 1 1 10

    9 10 0 2 6 0 0 0 1 1 10

    9 11 1 2 6 0 1 1 1 0 12

    9 12 1 2 6 2 1 1 1 0 14

    10 11 0 2 6 0 0 0 1 0 9

    10 12 0 2 6 0 0 0 1 0 9

    11 12 1 2 6 0 1 1 1 0 12

    13 14 1 2 6 2 0 1 1 1 14

    13 16 1 2 0 2 0 1 1 1 8

    14 16 1 2 0 2 0 1 1 1 8

    15 16 1 2 6 0 0 1 1 1 12

    17 18 1 2 6 0 1 1 1 1 13

    17 19 1 2 6 0 1 0 1 1 12

    17 20 1 2 0 2 1 1 1 1 9

    18 19 1 2 6 0 1 0 1 1 12

    21 22 1 0 6 2 1 1 1 1 13

    21 23 1 2 0 2 1 0 1 1 8

    21 24 1 2 6 2 1 1 1 1 15

    22 24 1 0 6 2 1 1 1 1 13

    23 24 1 2 0 2 1 0 1 1 8

    Note that the original 24 rows of test data has become 31 rows AND all of the matches have a score of 8 or better...

    Lowering the threshold, would have included more rows and raising it would include fewer rows.

    My choice to use ">7" was fairly arbitrary. My suggestion would be to start with a lower threshold and raise it gradually until you agree with the final grouping results.

    My next post will cover the rest of the script... To be continued...

  • Ok... Now is when we start refining the results that landed in #MatchScores. Again, the idea is the we want the single best parent for each child and those that don't have a good parent, become the beginning of their own new group.

    The following section simply grabs the distinct list of child values and their best matched parents. Note, that it is quite possible that a given child mat have 2 equally good parent candidates. In this case it doesn't matter which one is selected, we only need one of them.

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    At this point, the data in #BestMatches should look like this...

    ParentID ChildID BestScore Lft Rgt

    ----------- ----------- ----------- ----------- -----------

    1 2 13 NULL NULL

    2 3 11 NULL NULL

    3 4 13 NULL NULL

    5 6 9 NULL NULL

    5 7 12 NULL NULL

    5 8 13 NULL NULL

    9 10 10 NULL NULL

    9 11 12 NULL NULL

    9 12 14 NULL NULL

    13 14 14 NULL NULL

    15 16 12 NULL NULL

    17 18 13 NULL NULL

    18 19 12 NULL NULL

    17 20 9 NULL NULL

    21 22 13 NULL NULL

    21 23 8 NULL NULL

    21 24 15 NULL NULL

    If you're paying attention, you'll note that there are only 17 rows of data, which means we're missing 7 from our original dataset. Don't worry, we'll take care of that in the next step...

    ----------------------------------------------------------------

    At this point we have all of the children in the #BestMatches table but we still need to get the remaining 7 rows into the table...

    The reason the 7 rows were omitted in the place is due to the fact that they don't have a suitable parent in the data... So we'll also create a fictional "master parent" row to act as a temporary parent for the parent-less...

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore) -- Creates the master parent row... Note the values being assigned...

    UNION ALL

    SELECT DISTINCT -- This will get the remaining "parent-less" rows. Note that the ParentID is set to 0 to match the Child = 0 of the master parent.

    0 AS ParentID,

    ms.ParentID AS ChildID,

    100 AS BestScore

    FROM #MatchScores ms

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE ms.ParentID = bm.ChildID);

    Important note on the "Master Parent" section... The ParentID MUST be set to NULL for the dbo.CreateNestedSets proc to work correctly.

    Also, the ParentID if the remaining 7 rows is set to 0 to match the ChildID of 0 on the master parent... It not imperative that you use 0, they just need to match and it has to be a value that doesn't exist in #People.PersonID.

    The reason we need to create the Master Parent is due to the fact that, while we want multiple, small hierarchies, the dbo.CreateNestedSets proc can only handle a singe hierarchy. The Master Parent simply provides temporary "common root" for processing.

    Anyway, this what #BestMatches should look like now... (the original 17 + 7 parent-less + 1 master parent)

    ParentID ChildID BestScore Lft Rgt

    ----------- ----------- ----------- ----------- -----------

    1 2 13 NULL NULL

    2 3 11 NULL NULL

    3 4 13 NULL NULL

    5 6 9 NULL NULL

    5 7 12 NULL NULL

    5 8 13 NULL NULL

    9 10 10 NULL NULL

    9 11 12 NULL NULL

    9 12 14 NULL NULL

    13 14 14 NULL NULL

    15 16 12 NULL NULL

    17 18 13 NULL NULL

    18 19 12 NULL NULL

    17 20 9 NULL NULL

    21 22 13 NULL NULL

    21 23 8 NULL NULL

    21 24 15 NULL NULL

    NULL 0 100 NULL NULL

    0 1 100 NULL NULL

    0 5 100 NULL NULL

    0 9 100 NULL NULL

    0 13 100 NULL NULL

    0 15 100 NULL NULL

    0 17 100 NULL NULL

    0 21 100 NULL NULL

    I'll cover the rest in just one more post... To be continued...

  • Ok... So this is where the real magic happens...

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets -- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL; -- We no longer need the master parent, so it gets deleted.

    There's lot's of good information about how nested sets work so I won't get into it here but I would encourage you to (at the very least) read Jeff's article Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets[/url], as it provided the basis for the dbo.CreateNestedSets proc.

    And... Love him or hate him... Joe Celko has some awesome articles on the subject...

    Now that the Lft & Rgt values have been set, and the master parent deleted, the #BestMatches table should look like this... (ORDER BY Lft)

    ParentID ChildID BestScore Lft Rgt

    ----------- ----------- ----------- ----------- -----------

    0 1 100 2 9

    1 2 13 3 8

    2 3 11 4 7

    3 4 13 5 6

    0 5 100 10 17

    5 6 9 11 12

    5 7 12 13 14

    5 8 13 15 16

    0 9 100 18 25

    9 10 10 19 20

    9 11 12 21 22

    9 12 14 23 24

    0 13 100 26 29

    13 14 14 27 28

    0 15 100 30 33

    15 16 12 31 32

    0 17 100 34 41

    17 18 13 35 38

    18 19 12 36 37

    17 20 9 39 40

    0 21 100 42 49

    21 22 13 43 44

    21 23 8 45 46

    21 24 15 47 48

    If you're not at all familiar with nested set hierarchies or how they work, take a few minutes to read the following Wikipedia article https://en.wikipedia.org/wiki/Nested_set_model, so that you have a basic understanding of how it works.

    ---------------------------------------------------

    The last & final step is to use the data from #BestMatches to update the #People table...

    UPDATE p SET

    p.GroupID = bm.GroupID,

    p.MatchQuality = bm.BestScore

    FROM

    #People p

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM ( -- Derived table that pulls out just the "Parent" rows. Their PersonID will become the groups GroupID.

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2 -- This is where the "Child" rows are brought into scope...

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt -- This where the power of nested sets really shines! Any row that has a Lft value (or right, both work) that's between

    -- the Grpous.Lft and Groups.Rgt is a member of that group. It's as simple as that.

    ) bm

    ON p.PersonID = bm.ChildID;

    Now that #People has been updated, lets have a look...

    PersonID FirstName LastName SSN Address_1 Address_2 City State Zip GroupID MatchQuality

    ----------- -------------------- -------------------- --------- ------------------------------ ------------------------------ ------------------------------ ----- ----- ----------- ------------

    1 Bob Jones 333224444 258 Grants Trail #3 Bowling Green KY 45459 1 100

    2 Robert Jones 333224444 258 Grants Trail Ste 3 Bowling Green KY 45459 1 13

    3 Rob Jones 333224444 258 Grants Tr. NULL Bowling Green KY 45459 1 11

    4 R. Jones 333224444 258 Grants Tr. Ste 3 Bowling Green KY 45459 1 13

    5 Louis Armstrong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 100

    6 L. Armstrng 258321478 123 Humming Bird Lane NULL Jacksonville FL 32207 5 9

    7 Louie Armstrong 258321478 123 Humming Bird Lain NULL Jacksonville FL 32209 5 12

    8 Louis Armstong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 13

    9 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33304 9 100

    10 M. Adams 321456987 555 Bay Meadows Dr. # 42 Ashvlle NC 33304 9 10

    11 Mark Adams 321456987 555 Baymeadows Drive Unit 42 Ashville NC 33305 9 12

    12 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33306 9 14

    13 Bob Jones 555229999 4227 Some Place Ln. #3 Bowling Green KY 45459 13 100

    14 Bob Jones 555229999 4227 Some Place Ln. # 3 Bowling Green KY 45459 13 14

    15 Bob Jones 555292999 4227 SomePlace Ln Unit 3 Bowling Green KY 45459 15 100

    16 Bob Jones 555292999 4227 Some Place Ln. Ste 3 Bowling Green KY 45459 15 12

    17 Louis Armstrong 147852369 123 Baldwin Ave. NULL Jacksonville FL 32209 17 100

    18 Louis Armstrong 147852369 123 Baldwin Ave NULL Jacksonville FL 32209 17 13

    19 Louis Armstrong 147852369 123 Baldwin Anenue NULL Jachsonville FL 32209 17 12

    20 Louis Armstrong 147852639 123 Baldwin Ave. NULL Jacksonville FL 32209 17 9

    21 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 100

    22 Mark Adam 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 13

    23 Mark Adams 654789231 524 Main St. Unit 42 Barkersville NC 33304 21 8

    24 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 15

    Scroll all the way over to the right to see how everything was matched up and the quality of the match.

    Also note... That row 15 split away (taking row 16 with it) for form a new group rather than joining rows 13 & 14.

    To the human eye, it's apparent that 13, 14, 15 & 16 are the same person, but the transposition in the SSN and other inconsistencies were enough to make SQL Server "think" they are two separate individuals.

    I did that intentionally to illustrate the necessity of analyzing your data and setting score values and the threshold accordingly.

    Anyway, I hold this helps.

    If you're still having problems let me know. If Erika hasn't blown my house down, I'll try to help.

  • craig.bobchin (8/28/2015)


    Jayanth_Kurup (8/27/2015)


    I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

    Check out this video [/url]

    All you might need is to schedule the package to run frequently.

    Jayanth,

    I watched the video, and it does give me some more information. However how would I handle it in my situation where I have multiple columns that I need to match on and not all of them have valid nor consistent data between records?

    Do I need multiple fuzzy grouping transforms or do I do it in one?

    How is the canonical field value determined?

    How do I assign a new key unique to the to the grouping?

    If you are looking to build a dictionary then you could probably do it in one fuzzy grouping task. I havent tried multiple columns but I'll try it out and post something on the blog.

    The canonical field value is the output of the Fuzzy Grouping task and is determined on 2 criteria. Similarity and Confidence. Similarity dictates how similar two words are for example

    James Smith and James Smiht will have a high similarity ( you can set this values as shown in the video) . Confidence is where SQL tells you how confident it feels that the two words are related e.g Road and Rd or Saint and St. this is not controlled by the user.

    Once you do fuzzy grouping , you can use Fuzzy lookup to perform the mathc on the columns you need.

    Jayanth Kurup[/url]

  • Jayanth & Jason, I am out of town this weekend, and will give your suggestions a try when I get back to the office on Monday. Thanks

  • Jason A. Long (8/28/2015)


    craig.bobchin (8/28/2015)


    Jason,

    I modified your code for our table/fields and it doesn't quite work. I set the threshold to >=4 and of the 30k records in the table 8818 came back as Nulls in both the GroupID and MatchQuality fields. 19377 had a -1 in group ID and MatchQuality between 4 and 100.

    The 2000 had positive integers as GroupID with some grouping correctly under a group ID and others not grouping correctly.

    Not sure how I can tweak this to fix it.

    Give me a few minutes... I'll break up the original code into more digestible chunks and add better comments. Hopefully that'll help you to see where you made a wrong turn.

    If that doesn't work, it may be necessary to post your actual table structure along with some test data (just make sure you aren't posting peoples real information).

    BRB...

    I'd posted the actual table structure in my 1st post, but here it is again for Convience.

    CREATE TABLE [dbo].[dimMember](

    [dimMemberId] [int] IDENTITY(1,1) NOT NULL,

    [dimSourceSystemId] [int] NOT NULL CONSTRAINT [DF_dimMember_dimSourceSystemId] DEFAULT ((-1)),

    [MemberCode] [nvarchar](50) NOT NULL,

    [FirstName] [nvarchar](250) NOT NULL,

    [LastName] [nvarchar](250) NOT NULL,

    [MiddleInitial] [nvarchar](5) NOT NULL,

    [dimDayidDOB] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayidDOB] DEFAULT ((-1)),

    [dimDayIdDeath] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdDeath] DEFAULT ((99991231)),

    [Gender] [nvarchar](10) NOT NULL,

    [SubcriberRelationship] [nvarchar](50) NOT NULL,

    [SSN] [nvarchar](25) NOT NULL CONSTRAINT [DF_dimMember_SSN] DEFAULT (N'n/a'),

    [ExecutionGUID] [uniqueidentifier] NOT NULL,

    [RowRecordedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowRecordedDate] DEFAULT (getdate()),

    [isCurrent] [bit] NOT NULL CONSTRAINT [DF_dimMember_isCurrent] DEFAULT ((1)),

    [SubscriberCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_SubscriberId] DEFAULT (N'n/a'),

    [RowUpdatedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowUpdatedDate] DEFAULT (getdate()),

    [dimDayIdSubscriberStart] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberStart] DEFAULT ((1)),

    [dimDayIdSubscriberEnd] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberEnd] DEFAULT ((99991231)),

    [Race] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Race] DEFAULT (N'n/a'),

    [Ethnicity] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Ethnicity] DEFAULT (N'n/a'),

    [PrimaryLanguage] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_PrimaryLanguage] DEFAULT (N'n/a'),

    [CurrentAddress1] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress] DEFAULT (N'n/a'),

    [CurrentAddress2] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress2] DEFAULT (N'n/a'),

    [CurrentZipCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_CurrentZipCode] DEFAULT (N'n/a'),

    CONSTRAINT [PK_dimMember] PRIMARY KEY CLUSTERED

    And here are 50 real records with SSNs, Names and Addresses cleaned.

    dimMemberId dimSourceSystemId MemberCode SSN_clean SubscriberCode FirstName LastName CurrentAddress1 CurrentAddress2 dimDayidDOB

    27197 9 000000000A n/a 131270013 DWIGHT DAVIS 13541 MCGyver n/a 19520719

    3133 3 000000000A n/a 131270013 DWIGHT DAVIS n/a n/a 19520719

    10187 3 001188830A n/a 83450002 Valerie Carson n/a n/a 19260224

    40296 10 002328534A 893682519 U1080380201 Lorraine Murphy 7553 JEAN ST n/a 19430617

    10477 3 26951540 185011698 112430004 Tim Johnston n/a n/a 19721221

    5089 3 003227819A 390181787 83370010 Abraham Nash n/a n/a 19320728

    48755 10 003423059A 960870920 U1080652401 Malcolm Chapman n/a n/a 19470212

    26166 9 003523491A n/a 143410021 Camille Hernandez 50339 DRAKE DR n/a 19430909

    1248 3 003523491A n/a 143410021 Camille Hernandez n/a n/a 19430909

    3597 3 005162758D6 884042620 70110001 Doyle Diaz n/a n/a 19231219

    4863 3 005164084A 117016871 81220010 Bradley Haynes n/a n/a 19130414

    5669 3 007129199C1 625782583 102450005 Marsha Watkins n/a n/a 19550519

    11833 3 74460274 934833093 122020001 Austin Curtis n/a n/a 19671007

    28224 9 008267625A n/a 142550001 Stuart Burton 632 Whitewood DR # 194 19340627

    5462 3 008267625A n/a 142550001 Stuart Burton n/a n/a 19340627

    3323 3 008300514A n/a 62140007 Wayne Torres n/a n/a 19410621

    4551 3 009140867A 111224735 82420016 Wallace Owen n/a n/a 19270323

    3222 3 010096833D6 720715700 81220019 Glen Stone n/a n/a 19141010

    3225 3 010101062D 541175675 80860001 Kathleen Garcia n/a n/a 19210728

    6201 3 010300959A 199173039 91210009 June Fields n/a n/a 19390425

    10159 3 011265529A 200804314 70240007 Pete Chambers n/a n/a 19100812

    12370 3 011707399A 211908581 81910007 Spencer Rogers n/a n/a 19850625

    48933 10 012806835M 149806974 U1080170301 Melody Nichols n/a n/a 19340122

    6035 3 013142597A 918319918 102710014 Edith Rivera n/a n/a 19210221

    27413 9 013760369M n/a 132050013 Larry Byrd 6120 MIDDLEBELT 610 n/a 19270616

    3610 3 013760369M n/a 132050013 Larry Byrd n/a n/a 19270616

    10392 3 014072336C1 n/a 103480006 Arthur Dean n/a n/a 19530901

    5060 3 014243998A n/a 111460001 Leo Guzman n/a n/a 19301127

    1771 3 015248827C1 199160226 103270003 Kim Fletcher n/a n/a 19580609

    26440 9 015248827C1 n/a 103270003 Kim Fletcher 1274 LIBRARY STREET n/a 19580609

    31593 9 015406256A n/a 143360013 Terence Lloyd 11391 NOTTINGHAM RD n/a 19420409

    12047 3 015406256A n/a 143360013 Terence Lloyd n/a n/a 19420409

    11016 3 015761904M 100199730 70430007 Harriet Barber n/a n/a 19210829

    4684 3 017166029D7 139045616 101390014 Martin Webb n/a n/a 19240418

    5307 3 018189654A 207606871 83010018 Andrew Delgado n/a n/a 19210108

    10399 3 019125304C1 n/a 111730022 Freda Price n/a n/a 19490821

    176 3 019267292A 124653445 102770004 Carolyn Wagner n/a n/a 19330426

    8906 3 019521356A n/a 112720015 Ron Taylor n/a n/a 19590407

    27316 9 020469264M n/a 131150001 Shelia Morales 13101 FREUD APT 309 n/a 19420428

    3384 3 020469264M n/a 131150001 Shelia Morales n/a n/a 19420428

    31804 9 021146794D2 n/a 141180006 Ernest Quinn 307 SALLIOTTE RD APT 205 19450708

    12501 3 021146794D2 n/a 141180006 Ernest Quinn n/a n/a 19450708

    609 3 021181540D 934638420 81190001 Neal Jenkins n/a n/a 19210310

    31822 9 021401723A n/a 111730011 Garrett Maxwell 26601 COERLIDGE WAY n/a 19520228

    12534 3 021401723A n/a 111730011 Garrett Maxwell n/a n/a 19520228

    57 3 023181017A 692892800 73650020 Sonia Little n/a n/a 19210929

    12406 3 023305600A 619625040 81500003 Noel Martinez n/a n/a 19420627

    29028 9 023626262A n/a 132120009 Marcella Franklin 16031 BEECH DALY RD TRLR 137 19811213

    6938 3 023626262A n/a 132120009 Marcella Franklin n/a n/a 19811213

    41919 10 023626262A 190816359 U1080337801 Marcella Franklin 16031 BEECH DALY RD TRLR 137 n/a 19811213

  • Please format your test data in the form an insert statement. I can't do anything with in the form it's in now.

  • Jason A. Long (8/31/2015)


    Please format your test data in the form an insert statement. I can't do anything with in the form it's in now.

    Sorry, Here it is:

    INSERT INTO [dbo].[dimMember]

    ([dimSourceSystemId]

    ,[MemberCode]

    ,[SSN]

    ,[SubscriberCode]

    ,[FirstName]

    ,[LastName]

    ,[CurrentAddress1]

    ,[CurrentAddress2]

    ,[dimDayidDOB]

    )

    VALUES

    ('9','000000000A','n/a','131270013','Miguel','Gardner','13541 MCGyver','n/a','19520719'),

    ('3','000000000A','n/a','131270013','Miguel','Gardner','n/a','n/a','19520719'),

    ('3','001188830A','n/a','83450002','Valerie','Carson','n/a','n/a','19260224'),

    ('10','002328534A','893682519','U1080380201','Lorraine','Murphy','7553 JEAN ST','n/a','19430617'),

    ('3','26951540','185011698','112430004','Tim','Johnston','n/a','n/a','19721221'),

    ('3','003227819A','390181787','83370010','Abraham','Nash','n/a','n/a','19320728'),

    ('10','003423059A','960870920','U1080652401','Malcolm','Chapman','n/a','n/a','19470212'),

    ('9','003523491A','n/a','143410021','Camille','Hernandez','50339 DRAKE DR','n/a','19430909'),

    ('3','003523491A','n/a','143410021','Camille','Hernandez','n/a','n/a','19430909'),

    ('3','005162758D6','884042620','70110001','Doyle','Diaz','n/a','n/a','19231219'),

    ('3','005164084A','117016871','81220010','Bradley','Haynes','n/a','n/a','19130414'),

    ('3','007129199C1','625782583','102450005','Marsha','Watkins','n/a','n/a','19550519'),

    ('3','74460274','934833093','122020001','Austin','Curtis','n/a','n/a','19671007'),

    ('9','008267625A','n/a','142550001','Stuart','Burton','632 Whitewood DR','# 194','19340627'),

    ('3','008267625A','n/a','142550001','Stuart','Burton','n/a','n/a','19340627'),

    ('3','008300514A','n/a','62140007','Wayne','Torres','n/a','n/a','19410621'),

    ('3','009140867A','111224735','82420016','Wallace','Owen','n/a','n/a','19270323'),

    ('3','010096833D6','720715700','81220019','Glen','Stone','n/a','n/a','19141010'),

    ('3','010101062D','541175675','80860001','Kathleen','Garcia','n/a','n/a','19210728'),

    ('3','010300959A','199173039','91210009','June','Fields','n/a','n/a','19390425'),

    ('3','011265529A','200804314','70240007','Pete','Chambers','n/a','n/a','19100812'),

    ('3','011707399A','211908581','81910007','Spencer','Rogers','n/a','n/a','19850625'),

    ('10','012806835M','149806974','U1080170301','Melody','Nichols','n/a','n/a','19340122'),

    ('3','013142597A','918319918','102710014','Edith','Rivera','n/a','n/a','19210221'),

    ('9','013760369M','n/a','132050013','Larry','Byrd','6120 MIDDLEBELT 610','n/a','19270616'),

    ('3','013760369M','n/a','132050013','Larry','Byrd','n/a','n/a','19270616'),

    ('3','014072336C1','n/a','103480006','Arthur','Dean','n/a','n/a','19530901'),

    ('3','014243998A','n/a','111460001','Leo','Guzman','n/a','n/a','19301127'),

    ('3','015248827C1','199160226','103270003','Kim','Fletcher','n/a','n/a','19580609'),

    ('9','015248827C1','n/a','103270003','Kim','Fletcher','1274 LIBRARY STREET','n/a','19580609'),

    ('9','015406256A','n/a','143360013','Terence','Lloyd','11391 NOTTINGHAM RD','n/a','19420409'),

    ('3','015406256A','n/a','143360013','Terence','Lloyd','n/a','n/a','19420409'),

    ('3','015761904M','100199730','70430007','Harriet','Barber','n/a','n/a','19210829'),

    ('3','017166029D7','139045616','101390014','Martin','Webb','n/a','n/a','19240418'),

    ('3','018189654A','207606871','83010018','Andrew','Delgado','n/a','n/a','19210108'),

    ('3','019125304C1','n/a','111730022','Freda','Price','n/a','n/a','19490821'),

    ('3','019267292A','124653445','102770004','Carolyn','Wagner','n/a','n/a','19330426'),

    ('3','019521356A','n/a','112720015','Ron','Taylor','n/a','n/a','19590407'),

    ('9','020469264M','n/a','131150001','Shelia','Morales','13101 FREUD APT 309','n/a','19420428'),

    ('3','020469264M','n/a','131150001','Shelia','Morales','n/a','n/a','19420428'),

    ('9','021146794D2','n/a','141180006','Ernest','Quinn','307 SALLIOTTE RD','APT 205','19450708'),

    ('3','021146794D2','n/a','141180006','Ernest','Quinn','n/a','n/a','19450708'),

    ('3','021181540D','934638420','81190001','Neal','Jenkins','n/a','n/a','19210310'),

    ('9','021401723A','n/a','111730011','Garrett','Maxwell','26601 COERLIDGE WAY','n/a','19520228'),

    ('3','021401723A','n/a','111730011','Garrett','Maxwell','n/a','n/a','19520228'),

    ('3','023181017A','692892800','73650020','Sonia','Little','n/a','n/a','19210929'),

    ('3','023305600A','619625040','81500003','Noel','Martinez','n/a','n/a','19420627'),

    ('9','023626262A','n/a','132120009','Marcella','Franklin','16031 BEECH DALY RD','TRLR 137','19811213'),

    ('3','023626262A','n/a','132120009','Marcella','Franklin','n/a','n/a','19811213'),

    ('10','023626262A','190816359','U1080337801','Marcella','Franklin','16031 BEECH DALY RD TRLR 137','n/a','19811213'),

  • See what you think of this...

    -- Add the two new columns to dimMember...

    --ALTER TABLE dbo.dimMember ADD GroupID INT;

    --ALTER TABLE dbo.dimMember ADD MatchQuality INT;

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    MemberCodeScore INT,

    FirstNameScore INT,

    LastNameScore INT,

    DOB_Score INT,

    SSN_Score INT,

    SubscriberCodeScore INT,

    Add1_Score INT,

    Add2_Score INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    -- Take a look at the base data.

    --SELECT * FROM dbo.dimMember dm;

    WITH ColumnScores AS ( -- Assign weights to the various attributes

    SELECT

    dm1.dimMemberId AS ParentID,

    dm2.dimMemberId AS ChildID,

    CASE WHEN dm1.MemberCode = dm2.MemberCode THEN 3 ELSE 0 END AS MemberCodeScore,

    CASE WHEN dm1.FirstName = dm2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN dm1.LastName = dm2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN dm1.dimDayidDOB = dm2.dimDayidDOB THEN 1 ELSE 0 END AS DOB_Score,

    CASE WHEN NULLIF(dm1.SSN, 'n/a') = NULLIF(dm2.SSN, 'n/a') THEN 6 ELSE 0 END AS SSN_Score, -- Using NULLIF to prevent "n/a" from causing valid matches.

    CASE WHEN dm1.SubscriberCode = dm2.SubscriberCode THEN 1 ELSE 0 END AS SubscriberCodeScore,

    CASE WHEN NULLIF(dm1.CurrentAddress1, 'n/a') = NULLIF(dm2.CurrentAddress1, 'n/a') THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN NULLIF(dm1.CurrentAddress2, 'n/a') = NULLIF(dm2.CurrentAddress2, 'n/a') THEN 1 ELSE 0 END AS Add2_Score

    FROM

    dbo.dimMember dm1

    JOIN dbo.dimMember dm2

    ON dm1.dimMemberId < dm2.dimMemberId

    )

    INSERT #MatchScores (ParentID,ChildID,MemberCodeScore,FirstNameScore,LastNameScore,SSN_Score,SubscriberCodeScore,

    Add1_Score,Add2_Score,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.MemberCodeScore,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.SubscriberCodeScore,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score > 4 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    --===============================================================

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)

    UNION ALL

    SELECT DISTINCT

    0 AS ParentID,

    dm.dimMemberId AS ChildID,

    100 AS BestScore

    FROM dbo.dimMember dm -- Change from the original script... Pulling the "Parent-less" from dimMember instead of #BestMatches.

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE dm.dimMemberId = bm.ChildID);

    --==========================================================================

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets -- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL;

    --==========================================================================

    UPDATE dm SET

    dm.GroupID = bm.GroupID,

    dm.MatchQuality = bm.BestScore

    FROM

    dbo.dimMember dm

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM (

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt

    ) bm

    ON dm.dimMemberId = bm.ChildID;

    SELECT * FROM dbo.dimMember dm

    See the attached spreadsheet for the results...

  • Jason A. Long (8/31/2015)


    See what you think of this...

    -- Add the two new columns to dimMember...

    --ALTER TABLE dbo.dimMember ADD GroupID INT;

    --ALTER TABLE dbo.dimMember ADD MatchQuality INT;

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    MemberCodeScore INT,

    FirstNameScore INT,

    LastNameScore INT,

    DOB_Score INT,

    SSN_Score INT,

    SubscriberCodeScore INT,

    Add1_Score INT,

    Add2_Score INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    -- Take a look at the base data.

    --SELECT * FROM dbo.dimMember dm;

    WITH ColumnScores AS ( -- Assign weights to the various attributes

    SELECT

    dm1.dimMemberId AS ParentID,

    dm2.dimMemberId AS ChildID,

    CASE WHEN dm1.MemberCode = dm2.MemberCode THEN 3 ELSE 0 END AS MemberCodeScore,

    CASE WHEN dm1.FirstName = dm2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN dm1.LastName = dm2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN dm1.dimDayidDOB = dm2.dimDayidDOB THEN 1 ELSE 0 END AS DOB_Score,

    CASE WHEN NULLIF(dm1.SSN, 'n/a') = NULLIF(dm2.SSN, 'n/a') THEN 6 ELSE 0 END AS SSN_Score, -- Using NULLIF to prevent "n/a" from causing valid matches.

    CASE WHEN dm1.SubscriberCode = dm2.SubscriberCode THEN 1 ELSE 0 END AS SubscriberCodeScore,

    CASE WHEN NULLIF(dm1.CurrentAddress1, 'n/a') = NULLIF(dm2.CurrentAddress1, 'n/a') THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN NULLIF(dm1.CurrentAddress2, 'n/a') = NULLIF(dm2.CurrentAddress2, 'n/a') THEN 1 ELSE 0 END AS Add2_Score

    FROM

    dbo.dimMember dm1

    JOIN dbo.dimMember dm2

    ON dm1.dimMemberId < dm2.dimMemberId

    )

    INSERT #MatchScores (ParentID,ChildID,MemberCodeScore,FirstNameScore,LastNameScore,SSN_Score,SubscriberCodeScore,

    Add1_Score,Add2_Score,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.MemberCodeScore,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.SubscriberCodeScore,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score > 4 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    --===============================================================

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)

    UNION ALL

    SELECT DISTINCT

    0 AS ParentID,

    dm.dimMemberId AS ChildID,

    100 AS BestScore

    FROM dbo.dimMember dm -- Change from the original script... Pulling the "Parent-less" from dimMember instead of #BestMatches.

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE dm.dimMemberId = bm.ChildID);

    --==========================================================================

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets -- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL;

    --==========================================================================

    UPDATE dm SET

    dm.GroupID = bm.GroupID,

    dm.MatchQuality = bm.BestScore

    FROM

    dbo.dimMember dm

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM (

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt

    ) bm

    ON dm.dimMemberId = bm.ChildID;

    SELECT * FROM dbo.dimMember dm

    See the attached spreadsheet for the results...

    Thanks Jason, I do actually have the fields in the table, I just copied the earlier version of the create table script for dimMembers.

    I'll check the code against the earlier version tomorrow morning. The spreadsheet data looks good. I've been running the original version of the script today with various thresholds and as expected getting wildly varied results.

    One thing I'm seeing is that of the 30k rows in the table almost 50% have either a -1 or a NULL in the GroupID field. I'm not sure what the deal is with that.

  • craig.bobchin (8/31/2015)


    Thanks Jason, I do actually have the fields in the table, I just copied the earlier version of the create table script for dimMembers.

    I'll check the code against the earlier version tomorrow morning. The spreadsheet data looks good. I've been running the original version of the script today with various thresholds and as expected getting wildly varied results.

    One thing I'm seeing is that of the 30k rows in the table almost 50% have either a -1 or a NULL in the GroupID field. I'm not sure what the deal is with that.

    No problem... I did see where my original code was getting NULLs based on your sample data. I found the issue and correct that in the latest code. I have absolutely no idea how you're getting -1 for anything. The GroupID is based on existing dimMemberIDs... So unless dimMemberID = -1 exists in your data, that shouldn't be possible...

    In any case, take the new version of the code for a spin and see if it gets you a little closer.

Viewing 15 posts - 16 through 30 (of 33 total)

You must be logged in to reply to this topic. Login to reply