Name variation matching

  • Thats for that. have run it and it does not seem to have helped. Without searching for name variations the query takes 17 seconds. I have let it run for three hours so far and there is no sign of any results being returned.

    I had to make a small change to add source-name into the functions so now it looks like this

    WITH cteSurname95 as (

    Select Source_Name,Name_Variation

    From Names_Database..Surname

    Where Score >= 95

    ), cteForeName95 as (

    Select Source_Name,Name_Variation

    From Names_Database..Forename

    Where Score >= 95

    With out this there was a syntax error.

    Is there any other information that would help?

  • sperry (5/16/2009)


    Thats for that. have run it and it does not seem to have helped. Without searching for name variations the query takes 17 seconds. I have let it run for three hours so far and there is no sign of any results being returned.

    ...

    Is there any other information that would help?

    Thanks for the feedback Sperry, and sorry that it isn't fixed so far. Logically it is not much different that before, but oftentimes just cleaning up the syntax will help the optimizer, plus I had hoped that getting rid of the duplicate self-joins might fix it.

    I am continuing to work on it. One thing that would help, would it be possible to get a copy of the execution plan? thnx.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Oops, thought of something else...

    sperry (5/16/2009)


    Thats for that. have run it and it does not seem to have helped. Without searching for name variations the query takes 17 seconds. I have let it run for three hours so far and there is no sign of any results being returned.

    How long did these two queries take for the original code? And what are you looking for as a "reasonable" time?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • sperry (5/16/2009)


    ...

    I had to make a small change to add source-name into the functions so now it looks like this

    WITH cteSurname95 as (

    Select Source_Name,Name_Variation

    From Names_Database..Surname

    Where Score >= 95

    ), cteForeName95 as (

    Select Source_Name,Name_Variation

    From Names_Database..Forename

    Where Score >= 95

    With out this there was a syntax error.

    Is there any other information that would help?

    Yeah, sorry about that. I really need to mock-up some dummy tables so that I can test for all of the errors ahead of time. Could you post the CREAT table DDL for the PERSON and PERSON_SUBSET tables, with keys and indexes? Thanks.

    Oh also, can you tell me about what percentage of the Forname & Surname records have a [Score] of 95% or greater?

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • OK, I made my own table definitions and generated random data to make a more real performance test. After that, it became apparent that the problem was coming from the OR's that include the PERSON Sur/Fore names in with the name_variants for matching to PERSON_SUBSET. The optimiser seems to be doing really bad things because of that and though I could reduce its effect, I could not find a way to make it go away in a single query.

    Therefore, I had to make some temp tables to hold the combined PERSON Sur/Fore names with the name_variants. This now performs extremely well on my system:

    create proc spSurnameMatching_TempTAnswer

    AS

    --====== Create the 95% Forename table + all Person names

    Create table #Forename95_U_P(

    Source_Name varchar(20)

    , Name_Variation varchar(20)

    , Primary Key (Source_name, Name_Variation)

    )

    INSERT into #Forename95_U_P

    Select Source_name, Name_variation

    --From Names_Database..Forename

    From Forename

    Where Score >= 95

    INSERT into #Forename95_U_P

    Select distinct Forename, Forename

    From PERSON P

    Left Join #Forename95_U_P U

    ON P.Forename = U.Source_name

    And P.Forename = U.Name_variation

    Where U.Source_name IS NULL

    ----====== Create the 95% Surname table + all Person names

    Create table #Surname95_U_P(

    Source_Name varchar(20)

    , Name_Variation varchar(20)

    , Primary Key (Source_name, Name_Variation)

    )

    INSERT into #Surname95_U_P

    Select Source_name, Name_variation

    --From Names_Database..Surname

    From Surname

    Where Score >= 95

    INSERT into #Surname95_U_P

    Select distinct Surname, Surname

    From PERSON P

    Left Join #Surname95_U_P U

    ON P.Surname = U.Source_name

    And P.Surname = U.Name_variation

    Where U.Source_name IS NULL

    --====== Find all Person-subsets that 95% match persons

    ;WITH ctePersToPersSubs as (

    Select *

    FROM PERSON P

    INNER JOIN PERSON_SUBSET PER

    ON PER.EnglishForename IN (

    Select fn.Name_Variation

    From #Forename95_U_P fn

    Where fn.Source_Name = P.Forename

    )

    AND PER.EnglishSurname IN (

    Select fn.Name_Variation

    From #Surname95_U_P fn

    Where fn.Source_Name = P.Surname

    )

    )

    SELECT Distinct

    P.ID,

    P.PersonID,

    1,

    1

    From ctePersToPersSubs P

    WHERE P.Forename ''

    AND P.Surname ''

    --====== clean-up temp tables

    drop table #Forename95_U_P

    drop table #Surname95_U_P

    --==================

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Here is the best result that I could get with a single query:

    --create proc spSurnameMatching_CTEAnswer AS

    --====== Find all Person-subsets that 95% match persons

    ;WITH cteSurname95 as (

    Select Source_name, Name_Variation

    --From Names_Database..Surname

    From Surname

    Where Score >= 95

    UNION

    Select Surname, Surname

    From PERSON P

    ), cteForeName95 as (

    Select Source_name, Name_Variation

    --From Names_Database..Forename

    From Forename

    Where Score >= 95

    UNION

    Select Forename, Forename

    From PERSON P

    ), ctePersToPersSubs as (

    Select *

    FROM PERSON P

    --INNER JOIN C6..PERSON_SUBSET PER

    INNER JOIN PERSON_SUBSET PER

    ON PER.EnglishForename IN (

    Select fn.Name_Variation

    From cteForeName95 fn

    Where fn.Source_Name = P.Forename

    )

    AND PER.EnglishSurname IN (

    Select fn.Name_Variation

    From cteSurname95 fn

    Where fn.Source_Name = P.Surname

    )

    )

    SELECT Distinct

    P.ID,

    --PER.PersonID,

    P.PersonID,

    1,

    1

    From ctePersToPersSubs P

    WHERE P.Forename ''

    AND P.Surname ''

    Although it is almost 80x faster than both the original query and my first attempted fix here, it is still about 50x slower than the temp table version above.

    Let me know how these work for you.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thank you SO MUCH for all of your hard work and over a weekend too. I am away from my computer until monday but will start testing it first thing and get back to you with answers to all of your questions if they are still needed.

    Once again, thank you so much. Now go enjoy the rest of your weekend!!!

  • Thanks, sperry, be sure to let us know how it works out.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hi RBarryYoung.

    Thanks so much for your hard work. I ran the script (the one that uses tmp tables) this morning and results were returned in just over 20mins. Since this will run as an overnight job, 20 mins is fine.

    I just want to clarify one part of the syntax to make sure i understand it all correctly.

    ;WITH ctePersToPersSubs as

    I have not see the above used in this context before. Does it create a tmp function?

    Thanks again

    Steve

  • sperry (5/18/2009)


    Hi RBarryYoung.

    Thanks so much for your hard work. I ran the script (the one that uses tmp tables) this morning and results were returned in just over 20mins. Since this will run as an overnight job, 20 mins is fine.

    Great thanks for the feedback! Though I can't help feeling that this should still be able to run faster...

    I just want to clarify one part of the syntax to make sure i understand it all correctly.

    ;WITH ctePersToPersSubs as

    I have not see the above used in this context before. Does it create a tmp function?

    No, this is the syntax for creating Common-Table Expressions or CTE's. CTEs are really just named table expressions that can be re-used (because they are named). Except for the naming/reuse with a single query and the ability to define recursive queries (not used here, see BOL) they are really no different than a subquery. However most people agree that they make it easier to organize a complex query and the naming makes them more readable.

    If it helps, think of them as temporary Views that only exist within a single query (normal (non-indexed) Views are also table expressions).

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • I forgot to mention, that because I was using my own made up data, I can be sure that it is correct, so you should definitely check it for validity before relying on it in production.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Thanks for the explination. Always good to learn something new.

    I am in the process of checking the data now.

    Thanks again

  • Forgot to mention, i have also made a post about the same issue on experts exchane. No one has come up with a solution as good as yours, so would you mind if i post your solution there to (Giving you full credit of course).

  • Yes, that's OK.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

Viewing 14 posts - 16 through 28 (of 28 total)

You must be logged in to reply to this topic. Login to reply