Fuzzy-String Search: Find misspelled information with T-SQL

  • @dmusick: Interesting approach, and certainly useful. Note that DLD shows the same similarity for 'farm' and 'charm' (40% diff = 60% sim). For performance, I would try to rework your UDF to avoid a local table variable within the UDF. When benchmarking my OpHTML UDF, used in the final SELECT, I noticed that SQL opens transactions for all table variable activity. My Lim UDF, used in the WHERE clause, has better performance because it has no local table variable.

  • Interesting article and well worth a second read, and a play with the code.

    I've used two TSQL fuzzy-matching algorithms in the past, both work "well enough" - subjective and also data-dependant.

    The first uses token-matching, like dmusick's algorithm in a previous post. Grab the first three letters from the reference word, see if it matches in the target. Then do the same with letters 2,3,4, then 3,4,5 and so on. It's not overly complicated and can be squeezed into an iTVF. The downside is that matching on words of twice the token length or less isn't reliable, so matching algorithms will depend upon the word length. Here's a typical token-matching iTVF employing 3-character tokens for reference words of 7 characters or more, and two-character tokens for reference words between 3 and 6 characters long;

    CREATE FUNCTION [dbo].[FuzzyMatch_iTVF2k5]

    (@Reference VARCHAR(100) = NULL,

    @Target VARCHAR(100) = NULL)

    RETURNS table WITH SCHEMABINDING

    AS

    -- Chris Morris 2012

    -- Fuzzy-matching using tokens

    -- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

    RETURN

    SELECT

    d.Method,

    MatchRatio = CAST(CASE

    WHEN d.Method = 1 THEN 100

    WHEN d.Method = 3 THEN LenTarget*100.00/LenReference

    WHEN d.Method = 4 THEN LenReference*100.00/LenTarget

    WHEN d.Method = 5 THEN

    (

    SELECT

    MatchPC = (100.00 * ISNULL(NULLIF(SUM(

    CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END

    ),0)+2.00,0) / LenReference)

    * CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END

    FROM ( -- Tally

    SELECT TOP (CAST(LenReference AS INT)-2) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) a,

    (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) b

    ) Tally

    CROSS APPLY (SELECT PosInTarget = CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target)) x

    )

    WHEN d.Method = 6 THEN

    (

    SELECT

    MatchPC = (100.00 * ISNULL(NULLIF(SUM(

    CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END

    ),0)+1.00,0) / LenReference)

    * CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END

    FROM ( -- Tally

    SELECT TOP (CAST(LenReference AS INT)-1) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)

    ) Tally

    CROSS APPLY (SELECT PosInTarget = CAST(CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) AS DECIMAL(5,2))) x

    )

    ELSE NULL

    END AS DECIMAL(5,2))

    FROM (

    SELECT Method = CASE

    WHEN @Reference = @Target THEN 1

    WHEN @Reference IS NULL OR @Target IS NULL THEN 2

    WHEN @Reference LIKE '%'+@Target+'%' THEN 3

    WHEN @Target LIKE '%'+@Reference+'%' THEN 4

    WHEN DATALENGTH(@Reference) >= 7 AND DATALENGTH(@Target) >= 7 THEN 5

    WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6

    ELSE 7

    END,

    LenTarget = CAST(DATALENGTH(@Target) AS DECIMAL(5,2)),

    LenReference = CAST(DATALENGTH(@Reference) AS DECIMAL(5,2))

    ) d

    When comparing two columns in the same table, this function will take about 30s to process a million rows.

    The second method is superficially similar to the DLD algorithm but simplified for performance; for n = 1 to length of the reference word, run through the letters of the reference word one at a time, and measure where they are found in the target word and the reference word starting at position n-1. A little simple arithmetic on the resulting numbers yields a match score. Here's the code:

    DECLARE @Reference VARCHAR(25) = 'Joones'

    DECLARE @Target VARCHAR(25) = 'Joonse'

    ;WITH Tally AS (

    SELECT TOP (DATALENGTH(@Reference)) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) a,

    (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) b

    )

    SELECT

    -- * -- switch the comments around to see how it works

    MatchRatio = SUM(Matches)/DATALENGTH(@Reference)

    FROM Tally

    CROSS APPLY (SELECT TestLetter = SUBSTRING(@Reference, Tally.n, 1)) x

    CROSS APPLY (

    SELECT -- start search one character position *before* n to catch switched letters

    PosInReference = CAST(CHARINDEX(TestLetter, @Reference, n-1) AS DECIMAL(5,2)),

    PosInTarget = CAST(CHARINDEX(TestLetter, @Target, n-1) AS DECIMAL(5,2))

    ) z

    CROSS APPLY (

    SELECT Matches = CASE WHEN PosInReference > PosInTarget

    THEN PosInTarget/PosInReference ELSE PosInReference/PosInTarget END

    ) m

    It's substantially faster than the token-matching function, processing a million rows in about 4 seconds.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • @ChrisM@Work: Interesting approaches, but I'm wondering why you use DATALENGTH instead of LEN, are you trying to include trailing spaces? (In order to work with NVARCHAR instead of VARCHAR, you would need to divide DATALENGTH by two.)

  • I have been trying to think of good ways to get rid of the table variable. I was actually quite reluctant to use it at all, but I couldn't think of a better way to do it. I would have preferred to use arrays, but SQL doesn't have them. I may have to come up with a system for using string arrays to achieve what I need (like the good old days of programming!)

  • Thomas Keller (9/19/2012)


    @ChrisM@Work: Interesting approaches, but I'm wondering why you use DATALENGTH instead of LEN, are you trying to include trailing spaces? (In order to work with NVARCHAR instead of VARCHAR, you would need to divide DATALENGTH by two.)

    Habit, Tom, nothing more.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • @dmusick: That (replacing a table variable with string variables) is essentially what I've done in my Lim UDF. Since I only need to keep three rows, I only need three strings.

    In my OpHTML UDF, I need to keep all the rows, so it still uses a table variable. In Confio Ignite, I can see the direct relationship between transactions/sec and executions of OpHTML (once per row in search results, and in history of search results). This is true even WITH SCHEMABINDING.

    I have thought about replacing the table variable with one VARCHAR(MAX), or TEXT in SQL2000, but for most use cases, returning just a few rows from a search, it wouldn't make that much difference to tweak the UDF that's only used on the matched rows. (In my site, the history of search results was using too many resources, so I just limited it to 250 most recent.)

  • @all: Scripts updated 2012-09-21 from v1.0.0 to v1.1.0

    Performance of [udfDamerauLevenshteinOpHTML] has been improved, by doing one less INSERT and one less SELECT on the table variable. (The values of @Pos1 and @theOps at the end of the first loop are proper to start the second loop, so they do not need to be saved and reloaded.) All other changes are only in comments.

  • Out of interest, what other fuzzy search methods do people use?

    Metaphone comes to mind.

  • @martin-2 Bastable @fmendes Yes, the algorithms I've heard of most in a SQL context (using extended stored procedures) include Jaro-Winkler[/url] and Metaphone[/url] so I've linked those among others on my Reference[/url] page.

    Note also the SimMetrics link to Java implementations of various algorithms provided by @Davin21.

    But as far as I know, mine is the first published Transact-SQL implementation of a generic string-distance algorithm.

  • This is awesome. You really hit it!

    I took, modified and ran against my database of person's name. If I search for "Snith", it is not suggesting "Smith"!

    it seems to have some limitation on the size of columns that is being searched.

  • Is there an algorithm out there similar to this that will dicern from people's first name what gender they are? This may be asking a bit much, but I thought it wouldn't hurt to ask.

  • @amy.G I believe you need a database of first names; because the most an algorithm could do for you is to normalize the names. I googled "infer gender from first name" and found several useful pages including https://sites.google.com/site/facebooknamelist/namelist

  • @rajesh-2 R-424835: Thanks for saying it's awesome, although you had an issue! The only limitation on size of column being searched, is that only NVARCHAR(253) is passed into the UDF.

    But the assumption is that you are searching a column that only stores one search term per row.

    If you are trying to find misspelled terms inside a longer string, you need to extract the terms first, as described in Todd Fifield's article here[/url].

    Since you mention "database of person's name", do you have the first and last names separated? If you have a row with "Smith" in a column, and you use my code to search for "Snith" on that column, it should certainly find it.

    If you'd like me to help further, please post more details of the SQL you are trying. If you want to do that privately, you can click the green "Make a Suggestion" tab on the right side of http://fuzzy-string.com/[/url], then click Contact Support.

  • Good article.

  • What exactly do you mean in your article under the Confio Ignite graph picture where you state "Comments like "--Ignite=" are included in my code, to make it easy to name the individual SQL statements in the graph."? We use Ignite at our company and the only way I know how to name the specific SQL Hash that they provide is to insert a record into their table. Are you saying if the developer puts that in the comments of the stored procedure or view that it will be displayed that way? Just curious.

Viewing 15 posts - 16 through 30 (of 40 total)

You must be logged in to reply to this topic. Login to reply