Name variation matching

Question

Post reply

Name variation matching

Viewing 14 posts - 16 through 28 (of 28 total)

You must be logged in to reply to this topic. Login to reply

sperry-750868 SSCommitted Points: 1505 More actions · Answer 1

Thats for that. have run it and it does not seem to have helped. Without searching for name variations the query takes 17 seconds. I have let it run for three hours so far and there is no sign of any results being returned.

I had to make a small change to add source-name into the functions so now it looks like this

WITH cteSurname95 as (



Select Source_Name,Name_Variation

From Names_Database..Surname

Where Score >= 95

), cteForeName95 as (

Select Source_Name,Name_Variation

From Names_Database..Forename

Where Score >= 95

With out this there was a syntax error.

Is there any other information that would help?

RBarryYoung SSC Guru Points: 143329 More actions · Answer 2

sperry (5/16/2009)

Thats for that. have run it and it does not seem to have helped. Without searching for name variations the query takes 17 seconds. I have let it run for three hours so far and there is no sign of any results being returned.

...

Is there any other information that would help?

Thanks for the feedback Sperry, and sorry that it isn't fixed so far. Logically it is not much different that before, but oftentimes just cleaning up the syntax will help the optimizer, plus I had hoped that getting rid of the duplicate self-joins might fix it.

I am continuing to work on it. One thing that would help, would it be possible to get a copy of the execution plan? thnx.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 3

Oops, thought of something else...

sperry (5/16/2009)

Thats for that. have run it and it does not seem to have helped. Without searching for name variations the query takes 17 seconds. I have let it run for three hours so far and there is no sign of any results being returned.

How long did these two queries take for the original code? And what are you looking for as a "reasonable" time?

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 4

sperry (5/16/2009)

...

I had to make a small change to add source-name into the functions so now it looks like this

WITH cteSurname95 as (
Select Source_Name,Name_Variation From Names_Database..Surname Where Score >= 95 ), cteForeName95 as ( Select Source_Name,Name_Variation From Names_Database..Forename Where Score >= 95

With out this there was a syntax error.

Is there any other information that would help?

Yeah, sorry about that. I really need to mock-up some dummy tables so that I can test for all of the errors ahead of time. Could you post the CREAT table DDL for the PERSON and PERSON_SUBSET tables, with keys and indexes? Thanks.

Oh also, can you tell me about what percentage of the Forname & Surname records have a [Score] of 95% or greater?

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 5

OK, I made my own table definitions and generated random data to make a more real performance test. After that, it became apparent that the problem was coming from the OR's that include the PERSON Sur/Fore names in with the name_variants for matching to PERSON_SUBSET. The optimiser seems to be doing really bad things because of that and though I could reduce its effect, I could not find a way to make it go away in a single query.

Therefore, I had to make some temp tables to hold the combined PERSON Sur/Fore names with the name_variants. This now performs extremely well on my system:



create proc spSurnameMatching_TempTAnswer

 AS

--====== Create the 95% Forename table + all Person names

Create table #Forename95_U_P(

	Source_Name varchar(20)

	, Name_Variation varchar(20)

 , Primary Key (Source_name, Name_Variation)

)

INSERT into #Forename95_U_P

 Select Source_name, Name_variation

  --From Names_Database..Forename

  From Forename

  Where Score >= 95

INSERT into #Forename95_U_P

 Select distinct Forename, Forename

  From PERSON P

	Left Join #Forename95_U_P U

		ON P.Forename = U.Source_name

		 And P.Forename = U.Name_variation

  Where U.Source_name IS NULL

----====== Create the 95% Surname table + all Person names

Create table #Surname95_U_P(

	Source_Name varchar(20)

	, Name_Variation varchar(20)

 , Primary Key (Source_name, Name_Variation)

)

INSERT into #Surname95_U_P

 Select Source_name, Name_variation

  --From Names_Database..Surname

  From Surname

  Where Score >= 95

INSERT into #Surname95_U_P

 Select distinct Surname, Surname

  From PERSON P

	Left Join #Surname95_U_P U

		ON P.Surname = U.Source_name

		 And P.Surname = U.Name_variation

  Where U.Source_name IS NULL

--====== Find all Person-subsets that 95% match persons

;WITH ctePersToPersSubs as (

	Select *

	 FROM PERSON P

	  INNER JOIN PERSON_SUBSET PER

		ON PER.EnglishForename IN (

		   Select fn.Name_Variation

			From #Forename95_U_P fn

		    Where fn.Source_Name = P.Forename

		  )

		 AND PER.EnglishSurname IN (

		   Select fn.Name_Variation

			From #Surname95_U_P fn

			Where fn.Source_Name = P.Surname

		  )

)

SELECT Distinct

   P.ID,

   P.PersonID,

   1,

   1

 From ctePersToPersSubs P

 WHERE P.Forename  ''

  AND  P.Surname  ''

--====== clean-up temp tables

drop table #Forename95_U_P

drop table #Surname95_U_P

--==================

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 6

Here is the best result that I could get with a single query:



--create proc spSurnameMatching_CTEAnswer AS

--====== Find all Person-subsets that 95% match persons

;WITH cteSurname95 as (

	Select Source_name, Name_Variation

     --From Names_Database..Surname

     From Surname

     Where Score >= 95

 UNION

	Select Surname, Surname

	 From PERSON P

), cteForeName95 as (

	Select Source_name, Name_Variation

     --From Names_Database..Forename

     From Forename

	 Where Score >= 95

 UNION

	Select Forename, Forename

	 From PERSON P

), ctePersToPersSubs as (

	Select *

	 FROM PERSON P

	  --INNER JOIN C6..PERSON_SUBSET PER

	  INNER JOIN PERSON_SUBSET PER

		ON PER.EnglishForename IN (

		   Select fn.Name_Variation

			 From cteForeName95 fn

		   Where fn.Source_Name = P.Forename

		  )

		 AND PER.EnglishSurname IN (

		   Select fn.Name_Variation

			 From cteSurname95 fn

			Where fn.Source_Name = P.Surname

		  )

)

SELECT Distinct

   P.ID,

   --PER.PersonID,

   P.PersonID,

   1,

   1

 From ctePersToPersSubs P

 WHERE P.Forename  ''

  AND  P.Surname  ''

Although it is almost 80x faster than both the original query and my first attempted fix here, it is still about 50x slower than the temp table version above.

Let me know how these work for you.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

sperry-750868 SSCommitted Points: 1505 More actions · Answer 7

sperry-750868

SSCommitted

Points: 1505

May 17, 2009 at 7:36 am

#995996

Thank you SO MUCH for all of your hard work and over a weekend too. I am away from my computer until monday but will start testing it first thing and get back to you with answers to all of your questions if they are still needed.

Once again, thank you so much. Now go enjoy the rest of your weekend!!!

RBarryYoung SSC Guru Points: 143329 More actions · Answer 8

RBarryYoung

SSC Guru

Points: 143329

May 17, 2009 at 9:01 am

#996016

Thanks, sperry, be sure to let us know how it works out.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

sperry-750868 SSCommitted Points: 1505 More actions · Answer 9

Hi RBarryYoung.

Thanks so much for your hard work. I ran the script (the one that uses tmp tables) this morning and results were returned in just over 20mins. Since this will run as an overnight job, 20 mins is fine.

I just want to clarify one part of the syntax to make sure i understand it all correctly.

;WITH ctePersToPersSubs as

I have not see the above used in this context before. Does it create a tmp function?

Thanks again

Steve

RBarryYoung SSC Guru Points: 143329 More actions · Answer 10

sperry (5/18/2009)

Hi RBarryYoung.

Thanks so much for your hard work. I ran the script (the one that uses tmp tables) this morning and results were returned in just over 20mins. Since this will run as an overnight job, 20 mins is fine.

Great thanks for the feedback! Though I can't help feeling that this should still be able to run faster...

I just want to clarify one part of the syntax to make sure i understand it all correctly.

;WITH ctePersToPersSubs as

I have not see the above used in this context before. Does it create a tmp function?

No, this is the syntax for creating Common-Table Expressions or CTE's. CTEs are really just named table expressions that can be re-used (because they are named). Except for the naming/reuse with a single query and the ability to define recursive queries (not used here, see BOL) they are really no different than a subquery. However most people agree that they make it easier to organize a complex query and the naming makes them more readable.

If it helps, think of them as temporary Views that only exist within a single query (normal (non-indexed) Views are also table expressions).

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

RBarryYoung SSC Guru Points: 143329 More actions · Answer 11

I forgot to mention, that because I was using my own made up data, I can be sure that it is correct, so you should definitely check it for validity before relying on it in production.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

sperry-750868 SSCommitted Points: 1505 More actions · Answer 12

sperry-750868

SSCommitted

Points: 1505

May 18, 2009 at 7:20 am

#996303

Thanks for the explination. Always good to learn something new.

I am in the process of checking the data now.

Thanks again

sperry-750868 SSCommitted Points: 1505 More actions · Answer 13

sperry-750868

SSCommitted

Points: 1505

May 18, 2009 at 7:24 am

#996307

Forgot to mention, i have also made a post about the same issue on experts exchane. No one has come up with a solution as good as yours, so would you mind if i post your solution there to (Giving you full credit of course).

RBarryYoung SSC Guru Points: 143329 More actions · Answer 14

RBarryYoung

SSC Guru

Points: 143329

May 18, 2009 at 7:45 am

#996325

Yes, that's OK.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]