Fuzzy Matching Barcodes

  • I have a list of 1 million+ products with numeric barcode values stored as VARCHAR(50) that can range in length in a table. I'm trying to find a way to do a fuzzy search query when a user enters a barcode to find a product. Let's say I have a barcode table like:

    Barcodes

    ----------

    98425103

    98425104

    99425103

    99564312

    If the user entered something like 98425102, I would expect it to return 98425103. However, if the user entered 97425103, I'd expect both 98425103 and 99425103 to be returned. I'm basically looking for the closest matches to the entered barcode, while factoring misspellings/mistypes as well. Would I implement this via RegEx CLR or is there a way to do this through the standard LIKE operator?

  • the barcodes i'm used to, like at the grocery store are basically a concatenation of a vendor number plus a product number;

    ie 41415 -00106 is Eggs, Large at Publix Supermarkets, if my memory is not off.

    so the 41415 is the vendor, and the other half is a product code, which is assigned by the vendor and can be anything;

    in your example, i can understand ignoring the last digit for similar items, but the modification of a number somewhere in the middle like your example implies to me that a different vendor with the same product? is that what your sku represents?

    regardless, i think i'd consider generating a related skus table with the criteria you you can identify, insert the related items based on almost matches and use that for the query;

    any string manipulation on a million row varchar is going to require a table scan every time, and performance will suffer.

    but if you add another table so you know that 101 thru 109 are related to each other, it'd be possible to use indexed and speed things up.

    back to my example:

    41415-00106 would be cross joined against *105 (medium eggs),*107(extra large eggs) and *108 (Jumbo eggs)

    then because i KNOW some other brand of Free Range Eggs are also related, i'd relate Free Range Eggs (99969-45789 )them to those as well, even though their numbers have nothing to do with each other, but their product IS related.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Unfortunately, the SKUs aren't all vendor specific, but for those that are I could implement narrowing down the product number to improve performance. This still leaves me with, from this pared down result, trying to match on say the last 5 numbers. And upon entry, if the user mis-entered one of the last 5 numbers and the product may not exist in the table. In this case, how would I bring back the closest matching results/products?

  • I don't suppose you're going to make it easy on use and mention that all the skus are comprised of numbers, right?

    If they are all numbers you can use something like this to get the closest matches.

    declare @userinput varchar(50)

    set @userinput = '965487312'

    select barcode

    from

    (

    select '965587312' barcode

    union

    select '965486312' barcode

    union

    select '965487322' barcode

    union

    select '965487311' barcode

    ) source

    order by abs(cast(@userinput as bigint) - cast(barcode as bigint)) asc

  • Actually they are all numbers... and that's an interesting approach. I had always examined it as a pattern matching problem, but I could use the difference to calculate which number is closest to what the user has inputted. I'm going to test this on my data and get back to this thread later.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply