Similarity Metrics

  • I found an open source function that computes a set of metrics, each returning a similarity coefficient for a pair of strings. Comparing its results with those produced by a package that uses the Fuzzy Lookup transformation, I found that the NeedlemanWunch and SmithWaterman metrics are the closest matches, but with a relatively limited amount of test data I still can't decide whether to use either. My question is: is there a specific metric that SQL Server uses in the Fuzzy Lookup/Fuzzy Grouping transformation?

    The metrics available in the mentioned function are the following:

    BlockDistance

    CosineSimilarity

    DiceSimilarity

    EuclideanDistance

    JaccardSimilarity

    MatchingCoefficient

    OverlapCoefficient

    ChapmanMeanLength

    QGramsDistance

    Levenstein

    MongeElkan

    SmithWaterman

    SmithWatermanGotoh

    SmithWatermanGotohWindowedAffine

    NeedlemanWunch

    Jaro

    JaroWinkler

    ChapmanLengthDeviation

    Thanks,

    Samer.

  • Levenshtein is the basic edit distance function: the distance is simply the minimum total cost of the edit operations that transform string1 into string2. The edit operations are:

    Copy character from string1 over to string2 (cost 0)

    Delete a character in string1 (cost 1)

    Insert a character in string2 (cost 1)

    Substitute one character for another (cost 1)

    D(i,j) = min( D(i-1,j-1) + d(si,tj),   // substitute/copy
                  D(i-1,j) + 1,            // insert
                  D(i,j-1) + 1 )           // delete

    Here d(si,tj) is the character cost function: d(a,b) = 0 if a = b, and 1 otherwise.

    There are many extensions to the Levenshtein distance function. Typically these alter the d(si,tj) cost function, but further extensions can be made, for instance the Needleman-Wunsch distance, to which Levenshtein is equivalent when the gap cost is 1. For the terms "sam chapman" and "sam john chapman", the Levenshtein distance is the value in the bottom-right cell of the dynamic programming matrix, i.e. 5. This score means that edit operations with a total cost of 5 are enough to turn one string into the other (for example, inserting the five characters "john ", although a number of other routes through the matrix are possible).
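
    As a minimal sketch of the recurrence above (this is not the code of the open source function discussed in this thread), the distance can be computed in Python with a row-by-row dynamic programming table. The name `edit_distance` and the `gap_cost` parameter are illustrative choices of mine; with `gap_cost=1` the function is the plain Levenshtein distance, and other values give the Needleman-Wunsch-style variant mentioned above.

```python
def edit_distance(s, t, gap_cost=1):
    """Minimum edit cost to turn s into t.

    gap_cost is a hypothetical parameter for illustration:
    1 gives Levenshtein, other values change the insert/delete cost.
    """
    # prev[j] holds D(i-1, j). Row 0: build t[:j] from "" with j gaps.
    prev = [j * gap_cost for j in range(len(t) + 1)]
    for i, sc in enumerate(s, start=1):
        curr = [i * gap_cost]  # D(i, 0): remove the first i characters of s
        for j, tc in enumerate(t, start=1):
            subst_or_copy = prev[j - 1] + (0 if sc == tc else 1)
            gap_from_above = prev[j] + gap_cost      # D(i-1, j) + gap
            gap_from_left = curr[j - 1] + gap_cost   # D(i, j-1) + gap
            curr.append(min(subst_or_copy, gap_from_above, gap_from_left))
        prev = curr
    return prev[-1]  # bottom-right cell of the matrix


print(edit_distance("sam chapman", "sam john chapman"))  # 5
```

    Only two rows are kept in memory at a time, which is enough because each cell depends only on the current and previous row.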

  • I compared the Fuzzy Lookup similarity results on 20,000 rows of names with the similarity scores returned by the following metrics: Levenstein, ChapmanLengthDeviation, Jaro, NeedlemanWunch, SmithWaterman and QGramsDistance. The Levenstein metric proved to be the best match, with a maximum difference of 0.25 in the similarity coefficient. I will settle for this for the time being, but I am still not sure whether SQL Server Fuzzy Lookup is based on that metric, possibly in a modified form.
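
    For anyone who wants to repeat this kind of comparison, here is a rough sketch. It assumes the similarity coefficient is obtained by normalizing the Levenshtein distance by the length of the longer string, which is a common convention but not confirmed to be what either the open source function or Fuzzy Lookup does. The function names are hypothetical, and the reference score in the usage lines is a placeholder, not real Fuzzy Lookup output.

```python
def levenshtein(s, t):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            curr.append(min(prev[j - 1] + (sc != tc),  # substitute/copy
                            prev[j] + 1,
                            curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def max_abs_difference(pairs, reference_scores):
    """Largest gap between our similarity and an external score per pair."""
    return max(abs(similarity(a, b) - ref)
               for (a, b), ref in zip(pairs, reference_scores))


# Placeholder inputs: the reference score below is made up for illustration.
pairs = [("sam chapman", "sam john chapman")]
reference_scores = [0.70]
print(max_abs_difference(pairs, reference_scores))
```

    In a real test, `reference_scores` would hold the similarity values exported from the Fuzzy Lookup output for the same pairs.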

  • Could you please provide the link to the open source similarity metric function?

    Thank you in advance.
