Optimisation Help Requested

  • Hi all,

    I have a function to create pseudo soundex keys for text matching of names and addresses. The rules used to generate the key are as follows:

    Remove any non-alpha characters.

    Remove any vowels (or Y, as a pseudo-vowel), but retain any vowels where these are single isolated characters.

Replace any doubled characters within the input string with a single character.

    So for a given set of inputs, the output should be as follows:

    John Smith -> JHNSMTH

    John A Smith -> JHNASMTH -- note retention of isolated vowel

    Flat A -> FLTA -- note retention of isolated vowel

    Flat A, 1 Battlefield Road -> FLTABTLFLDRD -- note replacement of doubled 't'

Flat B, 1 Battlefield Road -> FLTBBTLFLDRD -- note double 'B' is acceptable as this is created by the keying process

    My tally table based function is as follows:

CREATE function [dbo].[ufn_ReturnTextPart] (
    @address_part varchar(50)
)
returns varchar(50)
as
begin

    declare @output varchar(50)
    set @output = ''

    -- standardise (keeping the surrounding spaces when swapping Apt for Flat,
    -- so that single-character detection still works afterwards)
    set @address_part = replace(replace(replace(@address_part, 'Apartment', 'Flat'), 'Apt ', 'Flat '), ' Apt', ' Flat')
    set @address_part = replace(replace(replace(replace(replace(@address_part, '1st', 'first'), '2nd', 'second'), '3rd', 'third'), '4th', 'fourth'), '5th', 'fifth')
    set @address_part = replace(replace(replace(replace(replace(@address_part, '6th', 'sixth'), '7th', 'seventh'), '8th', 'eighth'), '9th', 'ninth'), '10th', 'tenth')

    -- replace doubled characters within the input string
    set @address_part = replace(replace(replace(replace(replace(replace(replace(replace(replace(@address_part, 'AA', 'A'), 'BB', 'B'), 'CC', 'C'), 'DD', 'D'), 'EE', 'E'), 'FF', 'F'), 'GG', 'G'), 'HH', 'H'), 'II', 'I')
    set @address_part = replace(replace(replace(replace(replace(replace(replace(replace(replace(@address_part, 'JJ', 'J'), 'KK', 'K'), 'LL', 'L'), 'MM', 'M'), 'NN', 'N'), 'OO', 'O'), 'PP', 'P'), 'QQ', 'Q'), 'RR', 'R')
    set @address_part = replace(replace(replace(replace(replace(replace(replace(replace(@address_part, 'SS', 'S'), 'TT', 'T'), 'UU', 'U'), 'VV', 'V'), 'WW', 'W'), 'XX', 'X'), 'YY', 'Y'), 'ZZ', 'Z')

    -- extract only the text characters that are not vowels or pseudo-vowels,
    -- or any single characters, i.e. with no letter on either side
    ; with cte as (
        select n, substring(@address_part, n, 1) as PointChar
        from DBA.dbo.Tally
        where n <= len(@address_part)
    )
    select @output = @output + cte.PointChar
    from cte
    full outer join cte up
        on cte.n = up.n + 1
    full outer join cte down
        on cte.n = down.n - 1
    where (
            cte.PointChar like '[b-z]'
            and cte.PointChar not in ('A','E','I','O','U','Y') -- any text character that isn't a vowel
          )
       or ( -- any text character that is on its own, e.g. in Flat A
            isnull(up.PointChar, ' ') not like '[a-z]'
            and isnull(down.PointChar, ' ') not like '[a-z]'
            and cte.PointChar like '[a-z]'
          )
    order by cte.n -- keep the characters in string order

    -- uppercase for output
    return upper(@output)

end
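A quick check against the worked examples above (assuming the function and the DBA.dbo.Tally table are in place):

select dbo.ufn_ReturnTextPart('John Smith')                  -- JHNSMTH
select dbo.ufn_ReturnTextPart('John A Smith')                -- JHNASMTH
select dbo.ufn_ReturnTextPart('Flat A, 1 Battlefield Road')  -- FLTABTLFLDRD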

But two full outer joins of the input must be expensive. Is there a better way???

    Regards, Iain

    Edit: wrong where clause used

  • First, why join up and down? Surely, if a character follows the same character, the one it follows also precedes it, and you don't need to detect both things.
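Either way, you shouldn't need joins to look at the neighbours at all. Pad the string with a space at each end and read one position either side directly (an untested sketch, reusing your Tally table and variable names):

; with cte as (
    select n,
        substring(@address_part, n, 1) as PointChar,
        substring(' ' + @address_part, n, 1) as PrevChar,     -- character to the left, space-padded
        substring(@address_part + ' ', n + 1, 1) as NextChar  -- character to the right, space-padded
    from DBA.dbo.Tally
    where n <= len(@address_part)
)
select @output = @output + PointChar
from cte
where ( PointChar like '[b-z]'
        and PointChar not in ('A','E','I','O','U','Y') )
   or ( PointChar like '[a-z]'
        and PrevChar not like '[a-z]'
        and NextChar not like '[a-z]' )
order by n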

    Second, what is the actual end-purpose of this? If, like SoundEx, its purpose is to find strings that would "sound similar" to each other, I would think "1 B, Attfield Rd" would sound similar to "1 Battlefield Rd". Wouldn't they?

    Or is the purpose something else?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

But if soundex finds any number, or any non a-z/A-Z character, it returns something like "B000" (all zeros), so soundexing addresses becomes meaningless, since most addresses have numbers in them (1234 Main Street).

Why not run your addresses through an address service, free or otherwise, to standardize them instead?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Hi GSquared,

    The function is part of a matching routine. Inbound name and address records are matched to existing data. Unmatched rows are used to create new name and address records within the database. In order to avoid creating duplicate name and address records, I want to maximise the match rates as far as I can. The initial match key identifies likely matches and needs to be loose enough to allow for common spelling mistakes, but not so loose as to create lots of false positive matches.

    The up and down join is to identify single characters at any position in the string. Single characters have one of three patterns:

    Start of String : Character : Space

    Space : Character : Space

    Space : Character : End of String

    By doing the double offset, I can identify all three patterns by comparing the value of PointChar across each instance of the cte data:

    Up cte Value : Main cte Value : Down cte Value

    null : Character : space

    space : Character : space

    space : Character : null

    E.g.:

declare @address_part varchar(50)
set @address_part = 'A Single Character'

-- extract only the text characters that are not vowels or pseudo-vowels
-- or any single characters, i.e. spaced either side
; with cte as (
    select n, substring(@address_part, n, 1) as PointChar
    from DBA.dbo.Tally
    where n <= len(@address_part)
)
select up.*, cte.*, down.*
from cte
full outer join cte up
    on cte.n = up.n + 1
full outer join cte down
    on cte.n = down.n - 1
order by cte.n

Gives (the three n/PointChar pairs are the up, main and down instances of the cte; space characters are shown as <sp> so they don't vanish):

n           PointChar   n           PointChar   n           PointChar
----------- ---------   ----------- ---------   ----------- ---------
18          r           NULL        NULL        NULL        NULL
NULL        NULL        NULL        NULL        1           A
NULL        NULL        1           A           2           <sp>
1           A           2           <sp>        3           S
2           <sp>        3           S           4           i
3           S           4           i           5           n
4           i           5           n           6           g
5           n           6           g           7           l
6           g           7           l           8           e
7           l           8           e           9           <sp>
8           e           9           <sp>        10          C
9           <sp>        10          C           11          h
10          C           11          h           12          a
11          h           12          a           13          r
12          a           13          r           14          a
13          r           14          a           15          c
14          a           15          c           16          t
15          c           16          t           17          e
16          t           17          e           18          r
17          e           18          r           NULL        NULL

    Helpful, or clear as mud?

    Thanks, Iain

  • Hi Lowell,

    Re below:

    Lowell (7/16/2012)


But if soundex finds any number, or any non a-z/A-Z character, it returns something like "B000" (all zeros), so soundexing addresses becomes meaningless, since most addresses have numbers in them (1234 Main Street).

    I'm creating a pseudo-soundex key, not using the actual soundex() function. As you rightly say, this produces such broad matches as to be almost meaningless.

Why not run your addresses through an address service, free or otherwise, to standardize them instead?

    The aim of the process isn't to standardise, it's to append our address and person keys to some inbound data. The addresses have already been standardised with PAF.

    Thanks, Iain

  • Got it on the forward and backward bit. Seems like it would be easier to start at position 0 and query 2-character chunks for specific patterns.
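For the doubled letters, for instance, something like this (untested) compares each character with the one before it and keeps only the first of each pair:

; with chunks as (
    select n,
        substring(@address_part, n, 1) as ThisChar,
        substring(@address_part, n - 1, 1) as PrevChar  -- empty string when n = 1
    from DBA.dbo.Tally
    where n <= len(@address_part)
)
select @output = @output + ThisChar
from chunks
where ThisChar <> PrevChar  -- note this collapses any run of repeats to one character
order by n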

    As far as address matching goes, this won't do it. I've been working with address data for about 12 years, and you're missing important pieces.

For example, "123 Main St" is the same as "123 Main Street", but you'll end up with a no-match if you parse it the way you're planning to. Same for "123 Main St #1" and "123 Main St Apt 1" - same address, but your string parser won't get it. If you're doing CASS/PAVE validation, then you can do exact string comparison, because the addresses will be standardly formatted, and you don't need your parsing trick. If you aren't CASSing, you need to at least look into full-text indexing with a custom thesaurus.

Also, with the path you're going down, keep in mind that "Richard" and "Dick" and "Ricky" are all the same name, but both your function and soundex will treat them as different. Again, either a thesaurus or a lookup table is necessary.

    Nothing you do will catch everything, but the particular path you're heading down won't catch even the most common of input variations. You'll catch a small percentage where a vowel has been mistyped, and that's pretty much it, unless I'm badly misreading what you're doing.

A full-text index with the right custom thesaurus will catch a huge percentage of duplicate-but-not-typed-the-same addresses (there's a sketch of the set-up after the examples below). One I've set up catches all of these correctly:

    123 N Main St

    123 No Main Str

    123 North Main Street

    123 N Main St #1

    123 N Main St Apt 1

    123 N Main St Unit 1

    123 N Main St Suite 1

    123 N Main St (in Address1 column), 1 (in Address2 column)

    123 Calle Major No

    123 Norte Calle Major

    And so on. (Pardon my Spanish spelling. The thesaurus has it right, but I can't guarantee I typed it correctly here.)

    It will also catch:

    Rob

    Robby

    Robbie

    Bob

    Bobbie

    Bobby

    Robert

    It will treat "Bob" as short for "Roberta" as well, but won't treat "Robert" and "Roberta" as identical, just "highly similar, could be a typo".
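Roughly, the mechanics look like this (a simplified sketch; real thesaurus files get much bigger, and the table and column names here are made up). Expansion sets go into the thesaurus file for your language (tsenu.xml for US English):

<XML ID="Microsoft Search Thesaurus">
    <thesaurus xmlns="x-schema:tsSchema.xml">
        <diacritics_sensitive>0</diacritics_sensitive>
        <expansion>
            <sub>street</sub>
            <sub>st</sub>
            <sub>str</sub>
        </expansion>
        <expansion>
            <sub>robert</sub>
            <sub>rob</sub>
            <sub>bob</sub>
            <sub>bobby</sub>
        </expansion>
    </thesaurus>
</XML>

Then FORMSOF(THESAURUS, ...) does the lookup against a full-text indexed column:

-- reload the thesaurus (sys.sp_fulltext_load_thesaurus_file on newer builds;
-- on older ones it reloads when the full-text service restarts)
exec sys.sp_fulltext_load_thesaurus_file 1033

select AddressID
from dbo.Address
where contains(Address_1, 'FORMSOF(THESAURUS, street)')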

Setting up a system like that is made immensely more complicated if you can't run the addresses through a CASS/PAVE process first. We can't, because we can't get those for every country we do business in (for some it's too expensive for the payback; for others it's just plain not available, so far as we can tell). So we've got complicated systems like this for dealing with that kind of pattern-matching.

    We've even got the most common misspellings in our lookups and matches, where we can anticipate those.

    But it took months to set that up and get it right. Took years of experience with postal data and naming data. Took a LOT of testing, and business analysis of acceptable error rates (both false-positive and false-negative errors), and so on.

    So, don't get into this if you don't have to. It's a rocky road that's deceptively complex, and there are very few good maps out there for it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Interesting.

    My input is standardised using PAF for the addresses (I'm in the UK, this is the Postal Service's delivery address file, commonly used for address cleansing and standardising). I assume that CASS/PAVE is a similar kind of thing? PAF will always present Street as Street, not St; Avenue as Avenue not Av or Ave, etc. Some addresses don't match to PAF though. Keys are generated from the PAF address where one is available, otherwise from the raw address.

    Whilst the PAF address should always be standard and in the same format, over time address presentation can change within PAF. In addition, the address data on our master table can be changed by users, so we have to handle the problem of vanity addresses, e.g. PAF gives 1 Main Street, but the homeowner likes to call their house "Dunroamin" and the address on our master table now reads Dunroamin, 1 Main Street. So I can't rely on an exact match.

In addition to the example I gave, a separate number parts key is generated to match on, so Flat 1, 123 Main Street generates two keys (sketched below):

    1|123

    FLTMNSTRT
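Roughly, the number parts key falls out of the same Tally approach; this is a simplified sketch rather than the production code:

declare @address varchar(50), @numkey varchar(50)
set @address = 'Flat 1, 123 Main Street'
set @numkey = ''

; with cte as (
    select n,
        substring(@address, n, 1) as ThisChar,
        substring(@address, n + 1, 1) as NextChar  -- empty string at end of input
    from DBA.dbo.Tally
    where n <= len(@address)
)
select @numkey = @numkey + ThisChar
    + case when NextChar like '[0-9]' then '' else '|' end  -- separator at the end of each digit run
from cte
where ThisChar like '[0-9]'
order by n

-- drop the trailing separator
if right(@numkey, 1) = '|' set @numkey = left(@numkey, len(@numkey) - 1)

select @numkey  -- 1|123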

    Name matching uses a set of translation tables to standardise name synonyms and handle the Richard/Ricky/Dick problem. A key is generated for both the actual name and the standardised synonym where this is different.

    Initial match is at Postcode (ZIP code) level. A match score is then built up using the address and name key matches (both exact and "input key matches within master key" and vice versa), plus a Levenshtein distance for the original data on names/synonyms. A name synonym match scores less than an exact name match. Longer keys score higher than shorter ones. Where a score threshold is reached, a match is confirmed. Like you say, it isn't possible to pick up everything, but the approach works relatively well based on previous experience.
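Schematically, with made-up column names and weights, the scoring query looks something like this:

select i.InboundID, m.MasterID,
       case when i.AddressKey = m.AddressKey then 40                 -- exact address key match
            when m.AddressKey like '%' + i.AddressKey + '%'
              or i.AddressKey like '%' + m.AddressKey + '%' then 25  -- key-within-key match
            else 0 end
     + case when i.NameKey = m.NameKey then 40                       -- exact name key match
            when i.SynonymKey = m.NameKey then 25                    -- synonym match scores less
            else 0 end
     + case when len(i.AddressKey) > 8 then 5 else 0 end             -- longer keys score higher
       as MatchScore
from dbo.InboundBatch i
join dbo.MasterRecord m
    on m.Postcode = i.Postcode  -- initial match at postcode level
-- the real routine adds a Levenshtein term on the raw names/synonyms,
-- and a match is confirmed where MatchScore reaches the threshold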

Even given all of that slightly unnecessary justification of approach :-), I like the idea of using a full text index and a custom thesaurus, especially to handle the Street/St type problem for those records where we don't get a PAF match. As you know, it's problematic to standardise for this (usually involving some ugly replace statement during key generation).

    I've used FTIs before for searching against unstructured data held within a single column. But I'm struggling to visualise how it might be implemented in this circumstance. Can you give an example of how I might query this against a set of inbound data? I'm processing batches of inbound data running into millions of rows. I currently hold about 31m addresses in my master table.

    I have a basic address table structure of Address_1 - Address_5 plus Postcode. Individual elements of the data might be found in any column, although they are generally (but not always) populated left to right. So, you might find:

    Dunroamin | 123 | Main Street | Sometown | Null | SM1 1AA

    Dunroamin | 123 Main Street | Sometown | Null | Null | SM1 1AA

    Dunroamin, 123 Main Street | Sometown | Null | Null | Null | SM1 1AA

    Dunroamin | 123 Main Street | Null | Sometown | SomeCounty | SM1 1AA

    Side note: Yes, this is terrible. I've recently inherited this database. A full data quality audit and cleanup is in the planning stages. For the time being, I have to live with it as it is.

Thanks for your interest, hopefully I've not bored you to death with this novella of a post!

    Regards, Iain

    Edit: derp

  • PAF is the UK counterpart to US PAVE. Sounds like it's a looser standard, if I'm reading it correctly, which means you will get more offbeat variations within it.

I couldn't find whether it has something comparable to Delivery Point Validation, or US Zip+4. Either of those would make matching much easier, since those are numeric values instead of complex strings.

    As for FTI being run on millions of rows at a time, don't bother. Just run it on the percent that you can't get standardized, and only against previous addresses that couldn't standardize. You have to break them down into "contains" statements based on their component strings, and then look them up. It's very computationally expensive, and can be quite slow (tenth-of-a-second on my current hardware, per address). No big deal on a small number of addresses, but if you try it on millions ... well, hundreds of thousands of seconds is quite a long while.

But for the ones that do make it through standardization, you shouldn't need to jump through any hoops at all. Standardize, generate a checksum, compare the checksum to already-generated checksums of existing addresses in your table, and then an exact match on the strings, with the right indexes, can be incredibly fast and efficient even with complex, multi-column strings.
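Something like this is all it takes once the strings are standardized (hypothetical table and column names):

-- persisted checksum over the standardized address columns
alter table dbo.Address
    add AddressChecksum as checksum(Address_1, Address_2, Address_3, Address_4, Address_5, Postcode) persisted

create index IX_Address_Checksum on dbo.Address (AddressChecksum)

-- seek on the checksum, then confirm with the exact string compare
select a.AddressID, i.InboundID
from dbo.InboundAddress i
join dbo.Address a
    on a.AddressChecksum = checksum(i.Address_1, i.Address_2, i.Address_3, i.Address_4, i.Address_5, i.Postcode)
   and isnull(a.Address_1, '') = isnull(i.Address_1, '')
   and isnull(a.Address_2, '') = isnull(i.Address_2, '')
   and isnull(a.Address_3, '') = isnull(i.Address_3, '')
   and isnull(a.Address_4, '') = isnull(i.Address_4, '')
   and isnull(a.Address_5, '') = isnull(i.Address_5, '')
   and isnull(a.Postcode, '') = isnull(i.Postcode, '')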

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
