A search with a Rank

  • I have a bunch of duplicates within my system and I know there is an easy way to do what I need. I have a record with the name of "cat 414". I want to search a column and rank the results. I do not want to specifically look for Cat or cat 414 I want to base the results on how many of the character are used. The more similar the characters the higher the rank. Kind of a fuzzy search algorithm.

    thanks for the help

  • Sounds like an interesting problem, but we need a little more help before we can offer a solution.

    Could you give us a little more information?

    What would really help is an example showing several rows with different values in that column and how you want to see them ranked based on those values.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • I understand that I needed some detail. In the real process I will have an outer control loop that will feed in values like "BCD 414". I want to find the best match. This is how my feeble mind would process something like this. This will only be run once on a non production server so performance is really not a big issue. I am curious if there is a better way to do this and if there something like this that is more comprehensive that I can use in the future.

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[GetMatchingData]') AND type in (N'U'))

    DROP TABLE dbo.GetMatchingData

    DROP TABLE dbo.TheRank

    GO

    Create Table GetMatchingData (

    Id int identity(1,1),

    TheName varchar(10))

    CREATE TABLE TheRank (id int, TheName varchar(10), TheRank int)

    INSERT INTO GetMatchingData (TheName)

    SELECT 'bcd 414' UNION ALL

    SELECT 'bcd 414' UNION ALL

    SELECT 'ABDE' UNION ALL

    SELECT 'Blah' UNION ALL

    SELECT 'xyz414' UNION ALL

    SELECT 'abcd 414' UNION ALL

    SELECT 'test' UNION ALL

    SELECT 'test 414' UNION ALL

    SELECT 'bc44' UNION ALL

    SELECT 'test1' UNION ALL

    SELECT 'test2' UNION ALL

    SELECT 'test4' UNION ALL

    SELECT 'BCD414'

    DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000),@TrimmedValue varchar(1)

    SELECT @TheValue = 'bcd 414', @TheMidMin = 0

    WHILE @TheMidMin<=10

    BEGIN

    SET @TheMidMin = @TheMidMin+1

    SET @TrimmedValue = SUBSTRING(@TheValue,@TheMidMin, 1)

    IF @TheMidMin = 1

    BEGIN

    SET @SQL = '

    INSERT INTO TheRank

    SELECT ID, TheName,10 FROM GetMatchingData WHERE TheName LIKE ''%'+Convert(Varchar(1),@TrimmedValue)+'%'''

    EXECUTE (@SQL)

    END

    SET @SQL = '

    UPDATE TheRank

    SET TheRank = TheRank-1

    WHERE TheName NOT LIKE ''%'+Convert(Varchar(1),@TrimmedValue)+'%'''

    EXECUTE (@SQL)

    END

    Select * FROM TheRank

  • I don't really know of anything more "comprehensive". Ranking is based on a criteria of some sort which must be defined. In your example for instance, row 6 is given a rank of 10 even though it isn't an exact match for your string (although it contains an exact match). You are the only person who can define these details.

    I will bow out now, and maybe someone else will have some thoughts to add.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • You are correct and the problem I am working with is around 3K in records. So I am hoping for more of a robust solution besides the one I put together. I wish I could search this but I just keep coming up with gibberish

  • I am about to go out of town for several days, so this will be my last post, but you seem to be on the right track. The three things that could affect rank are (1) presence of matching characters, (2) absence of matching characters, (3) sequence of matching characters. In your situation, just generate a LOT of test combinations (including every edge case I could think of), rank them and then study the situations where a row ranked higher or lower than you thought it should. Gradually clarify your scoring rules. If you get away from integers you might subtract half a point for extraneous character(s) at the front or back, so that row 6 gets a 9.5, instead of a 10.

    Good luck.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • This looks like it might be a problem easily solved by a Tally Table. If I get the chance tonight, I'll see what I can do. Of course, don't let that stop anyone else from giving it a whirl, please.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I sent the code via email. For some reason SSC didnt like something within it so I couldnt post it here. Let me know if it works for you.

    K

  • Jeff, just what I was thinking...here is a starter for 10...

    --= Ranked search - requires table GetMatchingData to exist =--

    --= create some test searches to look for in GetMatchingData =--

    IF OBJECT_ID('tempdb..#searches') IS NOT NULL

    DROP TABLE #searches;

    CREATE TABLE #searches (skid INT IDENTITY(1,1) PRIMARY KEY NOT NULL,search_key VARCHAR(10) NOT NULL, search_key_length AS LEN(search_key) PERSISTED);

    INSERT INTO #searches ( search_key )

    SELECT 'BCD 414' UNION ALL

    SELECT 'ABC 123' UNION ALL

    SELECT 'bcd 14' UNION ALL

    SELECT 'Aft 234' UNION ALL

    SELECT '??? 224' UNION ALL

    SELECT 'thiswas' UNION ALL

    SELECT 'notgood' UNION ALL

    SELECT 'really!'

    --= end of test data =--

    --= Create a Tally table if required =--

    IF OBJECT_ID('dbo.Tally') IS NULL

    BEGIN

    CREATE TABLE dbo.Tally(N INT);

    CREATE UNIQUE CLUSTERED INDEX ix_cls ON dbo.Tally(N) WITH FILLFACTOR=100;

    INSERT dbo.Tally(N)

    SELECT TOP(11000) ROW_NUMBER() OVER(ORDER BY @@SPID) - 1

    FROM sys.all_columns AS c1,sys.all_columns AS c2

    END

    --= Now perform the search =--

    SELECT

    sch.skid,

    sch.search_key,

    gmd.id,

    SUM(calc.value) as value

    FROM

    (#searches as sch

    CROSS JOIN

    GetMatchingData AS gmd)

    CROSS APPLY

    (

    SELECT SUBSTRING(sch.search_key,N,1) AS C,N

    FROM dbo.Tally

    WHERE N>0 AND N<=sch.search_key_length

    ) lkup

    CROSS APPLY

    (

    SELECT CASE

    WHEN CHARINDEX(lkup.C,gmd.TheName)>0 THEN

    -- if the first character is matched, score 10

    CASE WHEN lkup.N=1 THEN 10

    -- you get nothing for a normal match on other characters

    ELSE 0

    END

    -- You lose a point for not having one of the characters

    ELSE -1

    END AS value

    ) calc

    GROUP BY

    sch.skid, sch.search_key, gmd.id

    HAVING

    MAX(calc.value)=10 -- this ensures we only see rows that match the first character of the search key

    ORDER BY

    sch.skid, gmd.id

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • K Smith Thanks that is pretty slick and gets the results to where abcd 414 drops in rank. So you go through every Ascii symbol between 33 and 126 and if it matches in the value (present) and it matches the table value it updates +1. Then for every character that is in the ascii and is not in the value (present) rank goes -1. I think I can work around this.

    Create Table #GetMatchingData (

    Id int identity(1,1),

    TheName varchar(10),

    Rank int default 0)

    INSERT INTO #GetMatchingData (TheName)

    SELECT 'bcd 414' UNION ALL

    SELECT 'bcd 414' UNION ALL

    SELECT 'ABDE' UNION ALL

    SELECT 'Blah' UNION ALL

    SELECT 'xyz414' UNION ALL

    SELECT 'abcd 414' UNION ALL

    SELECT 'test' UNION ALL

    SELECT 'test 414' UNION ALL

    SELECT 'bc44' UNION ALL

    SELECT 'test1' UNION ALL

    SELECT 'test2' UNION ALL

    SELECT 'test4' UNION ALL

    SELECT 'BCD414'

    DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000), @TrimmedValue varchar(1), @v int

    select @v = 33

    select @TrimmedValue = char(@v)

    SELECT @TheValue = 'bcd 414', @TheMidMin = 0

    WHILE @TheMidMin<=10

    BEGIN

    SET @TheMidMin = @TheMidMin+1

    while @v <= 126

    begin

    SET @TrimmedValue = char(@v)

    Update #GetMatchingData

    SET Rank = Rank + 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) > 0

    Update #GetMatchingData

    SET Rank = Rank - 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) <= 0

    select @v = @v + 1

    end

    END

    Select * FROM #GetMatchingData

    order by rank desc

  • M.Maggoo - That is an unbelievable query but the results are the same as my loop. Where if I am looking for BCD 414 I get

    search_key id thename value

    BCD 414 1 bcd 414 10

    BCD 414 2 bcd 414 10

    BCD 414 6 abcd 414 10

    When we want abcd 414 to be 9

    As for your query I am thinking I will wait to the morning to digest what you put together. You guys are really good.

  • JKSQL (7/21/2011)


    M.Maggoo - That is an unbelievable query but the results are the same as my loop. Where if I am looking for BCD 414 I get

    search_key id thename value

    BCD 414 1 bcd 414 10

    BCD 414 2 bcd 414 10

    BCD 414 6 abcd 414 10

    When we want abcd 414 to be 9

    As for your query I am thinking I will wait to the morning to digest what you put together. You guys are really good.

    The results are the same because I did not see you ask for them to be different and I tend to try and produce the requested result 😉

    To get better help, you need to define the rules better, because I based the logic in that query on your logic in the loop. If that logic is incorrect, please try to explain all the rules...

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Yeah K Smith changed my goal. Now I want it allllllllll. 🙂 Going into this I really thought there was a function that would just do this. Figured that there is always dirty data people need to clean. So my sample was just my feeble go at it. I was surprised by the results I have seen. I currently use a tally table and I am starting to see what you have done but it seems like another way to skin a cat in this case. I think this case is busted. I will have to do an outer loop and use ascii to loop through and get my ranking. Then at the end i will give as much information to the PM so they can make there duplicate matches. After I get that I get to obsolete records and remap them to active records. This path is a lot better than eye balling over 3K in bad data.

    Thanks again everyone for the effort

  • Kath Smith (7/21/2011)


    I sent the code via email. For some reason SSC didnt like something within it so I couldnt post it here. Let me know if it works for you.

    K

    It's normally something like an "injection blocker" on the user side when than happens. A lot of companies have them (as does the one I work for).

    The best way to get around it is to attach the code. That way others can see and benefit from what you've done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • JKSQL (7/21/2011)


    Yeah K Smith changed my goal. Now I want it allllllllll. 🙂 Going into this I really thought there was a function that would just do this. Figured that there is always dirty data people need to clean. So my sample was just my feeble go at it. I was surprised by the results I have seen. I currently use a tally table and I am starting to see what you have done but it seems like another way to skin a cat in this case. I think this case is busted. I will have to do an outer loop and use ascii to loop through and get my ranking. Then at the end i will give as much information to the PM so they can make there duplicate matches. After I get that I get to obsolete records and remap them to active records. This path is a lot better than eye balling over 3K in bad data.

    Thanks again everyone for the effort

    Whoa! Back the similarity engine up here! 😀 Using the following test data and the code you posted, ask yourself how much the following data actually resembles the search criteria. 😉

    drop table #getmatchingdata

    Create Table #GetMatchingData (

    Id int identity(1,1),

    TheName varchar(10),

    Rank int default 0)

    INSERT INTO #GetMatchingData (TheName)

    SELECT TOP 3000

    LEFT(NEWID(),10)

    FROM sys.all_columns

    set nocount on

    DECLARE @TheValue varchar(10), @TheMidMin int,@SQL varchar(1000), @TrimmedValue varchar(1), @v int

    select @v = 33

    select @TrimmedValue = char(@v)

    SELECT @TheValue = 'bcd 414', @TheMidMin = 0

    WHILE @TheMidMin<=10

    BEGIN

    SET @TheMidMin = @TheMidMin+1

    while @v <= 126

    begin

    SET @TrimmedValue = char(@v)

    Update #GetMatchingData

    SET Rank = Rank + 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) > 0

    Update #GetMatchingData

    SET Rank = Rank - 1 WHERE Charindex(@TrimmedValue, TheName) > 0 and charindex(@TrimmedValue, @TheValue) <= 0

    select @v = @v + 1

    end

    END

    Select * FROM #GetMatchingData

    order by rank desc

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Viewing 15 posts - 1 through 15 (of 16 total)

    You must be logged in to reply to this topic. Login to reply