Name variation matching

  • We have a number of jobs that perform matches across different tables. The initial match is based on the forename and surname of the person in question being equal. If the forename and surname is the same, a record is created in another table and further matching is performed to give a final match rating.

    We have just purchased a database containing million of name variations. These are in two tables (one for forenames and one for surnames) with the following columns:

    • Source_Name – This contains a list of Names that you can search on (E.g. ‘Steve’)

    • Name_Variation – This list the variation of the sources name. Each variation is on a new row. (I.e. there may be 2 rows with the source name ‘Steve’. One with the variation of ‘Stephen’ and the other with the variation ‘Steven’)

    • Score – This is a rating system with 100 being a close match and 75 being a loose match.

    I am having trouble getting the matching script to return results in an efficient manner. I believe that this is caused by the cross join that I have used in the below script. Is there a more efficient way of doing it?

    SELECT Distinct

    AVP.ID,

    PER.PersonID,

    1,

    1

    FROM PERSON P

    Cross JOIN C6..PERSON_SUBSET PER

    WHERE P.Forename <> ''

    AND

    P.Surname <> ''

    And (P.Forename = PER.EnglishForename Or PER.PERSONID in (

    Select PersonID

    From C6..Person_Subset

    Where EnglishForename In

    (Select Name_Variation

    from Names_Database..Forename

    Where [Source_Name] = P.Forename and Score >= 95)))

    AND (P.Surname = PER.EnglishSurname OR PER.PERSONID in (

    Select PersonID

    From C6..Person_Subset

    Where EnglishSurname in

    (Select Name_Variation

    from Names_Database..Surname

    Where [Source_Name] = P.Surname and Score >= 95)))

  • SSIS has fuzzy lookup and fuzzy grouping which will find potential matches, create a link between the two, score the matching and allow you to give give certain cut off levels for the matching score. e.g. only match on greater than 0.95

    It will probably be a lot easier to switch to the SSIS method than use T-SQL

  • Thanks for your answer Samuel.

    I know its a bit cheeky, but would you be able to provide some more information as to how i would use the fuzzy transformation to create the same results set as the t sql query.

    I am not an SSIS novice but not exactly an expert.

    Thanks

  • Thanks for the links. I have had a look and i do not think that the fuzzy matching is the correct route for me to go down.

    The names table already contains the explicit links between names and their variations. Also while the fuzzy searching would be able to catch things like 'Stephen' and 'Steven' it would have a harder time with names like 'Richard' and 'Dick'.

  • I actually don't understand your query well, but without taking a deep look I advice to use INNER JOINS to substitute the IN clauses in your query .... the use of the IN causes real performance issues when used with big result sets....

  • Thanks, i will try that. It wont effect the query at the mo becuase its the initial cross join that is causing the problems. Also each of the sub selects after the IN's will only re returning around 10 names.

  • The CROSS JOIN isn't your problem, it's just an INNER JOIN that has had it's ON conditions moved down to the WHERE clause. It'll act the same.

    Your real problem is that you have subqueries five deep.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hmmm, the multiple correlated subqueries of PERSON_SUBSET to itself are definitely a problem, and it does not appear that they are necessary.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hmm, also, this query is not valid:

    sperry (5/15/2009)


    SELECT Distinct

    AVP.ID, --<<< No AVP alias?

    PER.PersonID,

    1,

    1

    FROM PERSON P

    Cross JOIN C6..PERSON_SUBSET PER

    WHERE P.Forename ''

    AND

    P.Surname ''

    And (P.Forename = PER.EnglishForename Or PER.PERSONID in (

    ...

    Notice that there is no AVP table? We cannot reorganize this if we do not know how this table is getting into the query.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Sorry, my mistake. the AVP alias is the same as the P alias. I have changed the name of the table (and i thought all of the aliases) to remove potentially sensitive information.

    Is the cross join not responsible for creating a larger the necessary data set?

  • sperry (5/15/2009)[hrIs the cross join not responsible for creating a larger the necessary data set?

    No. Although syntactically it is a CROSS JOIN, logically it is not because the WHERE clause is full of P.* to PER.* conditions.

    You performance issues are almost certainly related to those PERSON_SUBSET to PERSON_SUBSET correlated subqueries, which you probably do not need (I am working on it now).

    It is also possible that you need to look at indexes, but lets get the query straight first...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Could you post the DDL (including keys & indexes) for the Surname and Forename tables?

    thnx,

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • CREATE TABLE [dbo].[Surname](

    [Source_Name] [varchar](20) NOT NULL,

    [Name_Variation] [varchar](20) NOT NULL,

    [Score] [int] NOT NULL

    ) ON [PRIMARY]

    Both surname and forename tables are the same. There is a clustered index on the source_name column. There are no other keys or indexes

  • OK, try this and see if it is any better:

    ;WITH cteSurname95 as (

    Select Name_Variation

    From Names_Database..Surname

    Where Score >= 95

    ), cteForeName95 as (

    Select Name_Variation

    From Names_Database..Forename

    Where Score >= 95

    )

    SELECT Distinct

    P.ID,

    PER.PersonID,

    1,

    1

    FROM PERSON P

    INNER JOIN C6..PERSON_SUBSET PER

    ON PER.EnglishForename IN (

    Select P.Forename

    UNION ALL

    Select fn.Name_Variation

    From cteForeName95 fn

    Where fn.Source_Name = P.Forename

    )

    AND PER.EnglishSurname IN (

    Select P.Surname

    UNION ALL

    Select fn.Name_Variation

    From cteSurName95 fn

    Where fn.Source_Name = P.Surname

    )

    WHERE P.Forename ''

    AND P.Surname ''

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply