Need a function to compare strings.

Question

Post reply

Need a function to compare strings.

The Dixie Flatline

SSC Guru

Points: 53353
More actions
March 18, 2009 at 4:22 pm

#135676

I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:

declare @sample1-2 table (name varchar(30), phone varchar(12)) insert into @sample select 'Hovious','123-456-7890'

My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.

Here is my working definition of "close"

String A has only one character different than String B: Hovious/Havious

or

String A has one character more or less than String B: Hovious/Hovios

or

String A has two characters transposed in String B: Hovious/Hoviuos

This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 15 posts - 1 through 15 (of 23 total)

You must be logged in to reply to this topic. Login to reply

Florian Reischl SSC-Dedicated Points: 37299 More actions · Answer 1

Florian Reischl

SSC-Dedicated

Points: 37299

March 18, 2009 at 4:30 pm

#962332

Hi Bob!

Before diving into a SQL parser... What about a CLR function? Would this be a possibility?

Greets

Flo

The more I learn, the more I know what I do not know
Blog: Things about Software Architecture, .NET development and T-SQL

How to Post Data/Code to get the best Help How to Post Performance Problems

noeld SSC Guru Points: 96590 More actions · Answer 2

noeld

SSC Guru

Points: 96590

March 18, 2009 at 4:32 pm

#962333

Google for: SSIS - Fuzzy Lookup

* Noel

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 3

Unfortunately CLR isn't an option for me for the same reason that front end editing isn't. Long story.

As for fuzzy lookup, I haven't looked at it for a while, but I thought it was basically a comparison of how many exact matches were found in columns, not characters. Did I overlook something?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Florian Reischl SSC-Dedicated Points: 37299 More actions · Answer 4

Are you able to compare on a table or do you have to match each single text?

How many data will come within one import action?

Greets

Flo

The more I learn, the more I know what I do not know
Blog: Things about Software Architecture, .NET development and T-SQL

How to Post Data/Code to get the best Help How to Post Performance Problems

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 5

The normal application will be a comparison between two tables, with a table of several thousand being compared to a table of 75,000 and growing.

I am coding as we chat about this, but any good ideas would still be welcomed. Thanks, Flo.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

RBarryYoung SSC Guru Points: 143329 More actions · Answer 6

Bob Hovious (3/18/2009)

I am having to deal with a lot of external data for which NO front end editing can be accomplished. Trust me, that is a fervent wish. I am trying to match this external data to our existing data, and will accept close matches. Here's an example:

declare @sample1-2 table (name varchar(30), phone varchar(12)) insert into @sample select 'Hovious','123-456-7890'

My name gets typoed a LOT and some people have numeric dyslexia that might make them miskey the phone number. I need to be able to find this record when the name is an exact match and the phone number is close, OR when the phone number is an exact match and the name is close.

Here is my working definition of "close"

String A has only one character different than String B: Hovious/Havious

or

String A has one character more or less than String B: Hovious/Hovios

or

String A has two characters transposed in String B: Hovious/Hoviuos

This may be used for large batch runs, so speed is a must. The existing function seems slow (32MS CPU to compare Hovious/Hoviouous), so I'm prepared to rewrite it from scratch. But before I do, I thought I'd ask if anyone out there already has a fast way to something similar to this?

SQL Server has two SOP answers for this problem. One is old and limited, by easy to use. The other is newer,very good, but involves administrative considerations to setup and manager.

The old one: SOUNDEX and DIFFERENCE functions. Both use the old Soundex algorithm, invented by Ma Bell in the 60's or 70's for the exact problem that you mention: matching names that might be misspelled. Look it up in BOL. It's easy, fast , and not too hard to make it really fast (you add a soundex column to the table and then index it). However, for anything more than matching spelling variations of one name against a column of single names, it may not work too well.

The new one: Full-Text Indexing. This is the Cadillac solution, it can find words and phrases inside large string columns of text, multiple tables or even entire databases. Tons of flexibility. The cost? Well, the price is free with SQL Server, but you will have to set it up, determine it's scope, and manage the indexer, which is going to want a lot of space for the full-text indexes that it will be making and maintaining. Your choice.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Bruce W Cassidy SSChampion Points: 12668 More actions · Answer 7

Bruce W Cassidy

SSChampion

Points: 12668

March 18, 2009 at 5:21 pm

#962357

[font="Verdana"]Yes, you have had a look at SOUNDEX() (and DIFFERENCE()) :-D[/font]

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 8

Thanks guys. 🙂

I'm aware of SOUNDEX. The problem is that some of these strings aren't names, or even proper words, so I eliminated that early. Full text search comes closer to general application but still has problems with transposed characters, and maybe speed. Remember, I don't know where in the string an error may appear.

All good thoughts though. Please keep them coming if others occur to you.

I am onto something in my coding although it may turn out to be an ITVF from hell. If I can get it to work, I will post it for critique back here when I have a working prototype.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Florian Reischl SSC-Dedicated Points: 37299 More actions · Answer 9

Hi Bob

I think in this case a usual inline function may be one of the fastest solutions.

Here my function



USE tempdb

IF (OBJECT_ID('dbo.ufn_CompareBH') IS NOT NULL)

   DROP FUNCTION dbo.ufn_CompareBH

GO

CREATE FUNCTION dbo.ufn_CompareBH (@s1 VARCHAR(100), @s2 VARCHAR(100))

   RETURNS BIT

--   RETURNS VARCHAR(100)

AS

BEGIN

   DECLARE @len1 SMALLINT

   DECLARE @len2 SMALLINT

   DECLARE @pos1 SMALLINT

   DECLARE @pos2 SMALLINT

   DECLARE @c1 CHAR(1)

   DECLARE @c2 CHAR(1)

   DECLARE @joker BIT

   SELECT @len1 = LEN(@s1), @len2 = LEN(@s2), @joker = 1, @pos1 = 1, @pos2 = 1

   IF (@len1 = @len2 - 1)

   BEGIN

      -- s1 is shorter than s2

      WHILE (@pos1 < @len1)

      BEGIN

         SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)

         -- Maybe the missing character

         IF (@c1 != @c2)

         BEGIN

            -- Joker is gambled away and reuse the current character in s1

            IF (@joker = 1)

               SELECT @joker = 0, @pos1 = @pos1 - 1

            ELSE

               RETURN 0

         END

         SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1

      END

   END

   ELSE IF (@len2 = @len1 - 1)

   BEGIN

      -- s2 is shorter than s1

      WHILE (@pos1 < @len1)

      BEGIN

         SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos2, 1)

         IF (@c1 != @c2)

         BEGIN

            -- Joker is gambled away and reuse current character in s2

            IF (@joker = 1)

               SELECT @joker = 0, @pos2 = @pos2 - 1

            ELSE

               RETURN 0

         END

         SELECT @pos1 = @pos1 + 1, @pos2 = @pos2 + 1

      END

      RETURN 1

   END

   ELSE IF (@len1 = @len2)

   BEGIN

      -- len is equal; check the characters

      WHILE (@pos1 < @len1)

      BEGIN

         SELECT @c1 = SUBSTRING(@s1, @pos1, 1), @c2 = SUBSTRING(@s2, @pos1, 1)

         IF (@c1 != @c2)

         BEGIN

            -- Check if two characters have been swapped

            IF (@pos1 < @len1 AND @c1 = SUBSTRING(@s2, @pos1 + 1, 1) AND @c2 = SUBSTRING(@s1, @pos1 + 1, 1))

            BEGIN

               IF (@joker = 1)

                  SELECT @joker = 0, @pos1 = @pos1 + 1

               ELSE

                  RETURN 0

            END

            -- Any other wrong character

            ELSE IF (@joker = 1)

               SET @joker = 0

            ELSE

               RETURN 0

         END

         SET @pos1 = @pos1 + 1

      END

      RETURN 1

   END

   RETURN 0

END

GO

My test script



SET NOCOUNT ON

GO

DROP TABLE tbh

GO

CREATE TABLE tbh (id INT IDENTITY NOT NULL, s1 VARCHAR(100), s2 VARCHAR(100), result BIT, PRIMARY KEY (id))

GO

INSERT INTO tbh VALUES ('Hovious', 'Havious', NULL)

INSERT INTO tbh VALUES ('Hovious', 'Hovios', NULL)

INSERT INTO tbh VALUES ('Hovious', 'Hoviuos', NULL)

WHILE ((SELECT COUNT(*) FROM tbh) < 75000)

   INSERT INTO tbh SELECT s1, s2, result FROM tbh

--SELECT COUNT(*) FROM tbh

GO

DECLARE @now DATETIME

SET @now = GETDATE()

UPDATE TOP(75000) tbh SET result = dbo.ufn_CompareBH(s1, s2)

SELECT DATEDIFF(MILLISECOND, @now, GETDATE())

My test result

For 75,000 records was 1,833 milliseconds. Hope this might be fast enough... 🙂

... time for finish now for me. It's quiet late here. See you tomorrow!

Greets

Flo

The more I learn, the more I know what I do not know
Blog: Things about Software Architecture, .NET development and T-SQL

How to Post Data/Code to get the best Help How to Post Performance Problems

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 10

The Dixie Flatline

SSC Guru

Points: 53353

March 18, 2009 at 6:12 pm

#962381

Thanks, Flo. I am going to hold time trials tonight.

Hope you have a good evening. 🙂

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Jesse McLain SSCommitted Points: 1782 More actions · Answer 11

Flo, using your function, when I test:

PRINT dbo.ufn_CompareBH('Jesse', 'Jessie')

I get zero returned, but I think it should be 1, right?

Bob, as far as performance is concerned, I think you're going to have to limit the comparisons you will test. If you have 1 million rows in your target, and 10,000 coming in daily, you really don't want to run 10 billion comparisons.

What about this - add an integer column to the tables holding values for comparison, and populate it with the sum of ASCII values of the upper-cased letters to be compared. If the difference between the two sums is less than or equal to (65 + 26), then the two words should be compared.

For example, comparing 'JESSE' and 'JESSIE':



SELECT ABS((ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('E'))

 - (ASCII('J') + ASCII('E') + ASCII('S') + ASCII('S') + ASCII('I') + ASCII('E'))) - 65

results in 8, indicating that the two words should be compared.

Jesse McLain
jesse@jessemclain.com
www.jessemclain.com
www.jessesql.blogspot.com

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 12

Interesting suggestion, and I will certainly look into it. Right now I minimize the number of rows by running separate queries, first joining for example on name (exact match) and number being close in the where clause, and then joining on number(exact match) with name being close in the where clause.

I haven't done time trials yet, but I came up with the following inline table-valued function last night, if anyone wants to try to break it (or improve it).

Thanks to everyone again for your input and suggestions.



ALTER FUNCTION [dbo].[itvf_StringsClose]

(

	 @string1 varchar(50)

	,@string2  varchar(50)

)

RETURNS TABLE

AS

-- ============================================================================

-- Author:             Bob Hovious

-- Create date:      3/19/2008

-- Description:       This function compares two strings and deems them to be "close" if:

--                             1.  One has one more character than the other

--                             2.  One character has been mistyped

--                             3.  Two characters are transposed

-- ============================================================================

RETURN

(

WITH

L0 AS (SELECT 1 AS C UNION ALL

       SELECT 1 union all

       SELECT 1 union all

       SELECT 1 union all

       SELECT 1 union all

       SELECT 1 union all

       SELECT 1 union all

       SELECT 1 )	-- 8 rows

,L1 AS (SELECT 1 as C from L0 A cross join L0 b)  --64 rows

,Tally AS (SELECT ROW_NUMBER() OVER(ORDER BY C) AS N FROM L1)   -- maximum test string size is 50, so 64 is plenty

,trimmedStrings as (select ltrim(rtrim(@string1)) as string1, ltrim(rtrim(@string2)) as string2)

,basevalues  as

(select  case when string1 >= string2 then string1 else string2 end as string1  -- long string   1

		,case when string1 >= string2 then string2 else string1 end as string2  -- short string  2

 from trimmedStrings

 where abs(len(string1)-len(string2)) < 2

)

,stuffing as

(select	 string1,ltrim(rtrim(stuff(' '+string2+' ',t2.N,0,'_'))) as S2

		,string2,stuff(string1,t1.N,1,'_') as S1

		,case when t1.N < len(string1) then stuff(string1,t1.N,2,reverse(substring(string1,t1.N,2)))

			  else 'askljfaqwieyhui'  -- highly unlikely to be compared

			  end as S1Transposition

 from basevalues b

 cross join tally t1

 cross join tally t2

 where t1.N <= len(string1)

   and t2.N <= len(string2)+2

 )

select isnull((select 'Y' as [StringsClose]

			   where exists(select 1 from stuffing

							where  string2 like S1

							   or  string1 like S2

							   or  string2 = S1Transposition

			               )

			  ),'N') as stringsClose

)

/* tests

set statistics time on;

select * from dbo.StringsClose('Hovious','Hoviouscc')

set statistics time off;

------------------------------------------------------

set statistics time on;

with names (name1,name2) as

(select 'Hovious','Hovious' union all

 select 'xHovious','Hovious' union all

 select 'Hovious','xHovious' union all

 select 'Hoviousx','Hovious' union all

 select 'Hovious','Hoviousx' union all

 select 'Xovious','Hovious' union all

 select 'Hovious','Xovious' union all

 select 'HoviouX','HoviouX' union all

 select 'HoXious','Hovious' union all

 select 'Hovious','HoXious' union all

 select 'HoviousSS','Hovious' union all

 select 'Hovious','HoviousSS' union all

 select 'HHHovious','Hovious' union all

 select 'Hovious','HHHovious' union all

 select 'OHvious','Hovious' union all

 select 'Hovious','OHvious' union all

 select 'HoIVous','Hovious' union all

 select 'Hovious','HoIVous' union all

 select 'HovioSU','Hovious' union all

 select 'Hovious','HovioSU' union all

 select 'Hovis','Hovious' union all

 select 'Hovious','Hovis' union all

 select 'HovIAs','Hovious' union all

 select 'Hovious','HovIAs' union all

 select 'HXviXus','Hovious' union all

 select 'Hovious','HXviXus'

)

select name1,name2,name.stringsClose

from names

cross apply dbo.itvf_StringsClose(name1,name2) name

where stringsClose = 'Y'

set statistics time off;

*/

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Lynn Pettis SSC Guru Points: 442447 More actions · Answer 13

Lynn Pettis

SSC Guru

Points: 442447

March 19, 2009 at 9:40 am

#962827

Bob,

If you are processing incoming data, you may really want to look at the fuzzy lookup in SSIS. I used it quite a while ago in a now defunct process and it worked well. It does require some testing and tuning during development, but it may just fit your requirements.

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 14

Thanks, Lynn. Apparently I missed that when I first skimmed through fuzzy lookup. I was under the misimpression that it involved how many columns were exact matches, rather than the character strings themselves. Frankly, I'm going to have to go back to school on SSIS. For the moment, doing it in pure SQL, I can get it through QA faster.

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills