Splitting a non-uniform concatenated address column into 2 different columns

  • Hello,

    I have a very interesting problem in T-SQL coding for which I can't figure out the solution. Actually there is a Line_1_Address column in our data warehouse address table which is being populated from various sources. Some sources have already concatenated house number + street address fields in the Line_1_Address column whereas one source has separated columns for both data fields.

    Now I'm trying to extract data from this data warehouse table and I need to split the house number from street address and load it into separate columns in my destination table. In case there is no data for house number then I should load it as NULL.

    The issue is that data in this Line_1_Address column is very inconsistent so I don't know which functions to use. If anyone can suggest me a possible solution then that would be great. here is some sample data for your consideration:

    Line_1_Address

    101 E Commerce ST

    120 E Commerce ST

    2 Po Box

    301 W. Bel Air Ave

    West Main Street, PO Box 1388

    116 N Louisville ST

    903 Po Box

    2950 W Market ST

    300 N. Cleveland-Massillon Road

    820 Main Street

    521 1st ST N

    2301 Dawson RD

    611 Pointe North BLVD

    173 North 2nd Street

    410 Hillabee ST

    115 North St. Asaph Street

    115 North St. Asaph Street

    115 North St. Asaph Street

    115 North St. Asaph Street, PO Box 26248

    1800 N. Beauregard Street

    1800 N. Beauregard Street, #100

    1800 N. Beauregard Street, Suite 1800

    211 N. Union Street, Suite 154

    228 South Washington Street, Suite 200

    228 South Washington Street, Suite 200

    5500 Cherokee Avenue, Suite 300

    South Peyton Street

    1605 N. Cedar Crest Blvd. Suite 410

    2015 Hamilton ST STE 205

    2222 South 12th Street

    2895 Hamilton Blvd, #201

    2895 Hamilton Boulevard

    3570 Hamilton Blvd Ste 302

    5000 Tilghman Street, Suite 325

    5910 Hamilton Blvd

    Hamilton Boulevard, Suite 105, The Atriu

    3100 Royal BLVD S

    3655 North Point PKWY STE 425

    3375 Lynnwood Drive

    108 Main St S

    1831 E Three Notch ST

    1807 N. Boulevard

    1924 McConell Springs Road, PO Box 710

    2315 N Main Street Suite 100

    2315 N. Main Street, Suite 100

    2315 N. Main Street, Suite 100

    2315 N. Main Street, Suite 100, PO Box 2

    N Main Street Suite 100, PO Box 2307

    4200 Evergreen Lane

    1610 West Street, Suite 105

  • First of all, good luck. You have some really filthy data to deal with.

    Read up on PATINDEX. It may give you some ideas about parsing these strings. It searches for one string within another and returns an integer of the first occurrence. It can take wild card arguments which makes it useful.

    Here's an example that will let you find out if your address begins with digits or not:

    Declare @a varchar(30)

    Set @a = '2950 W Market ST'

    PRINT PATINDEX('%[^0-9]%',@a)

    Result = 5, which is the first character that is NOT a digit.

    [0-9] searches for a match of any digit. [^0-9] searches fpr a match on the first NON-digit.

    More examples for PATINDEX in BOL

    Good Luck,

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • You're going to have to do some sophisticated regular expression parsing to get about 90% of the way there. You probably won't be 100% successful at splitting all the records up correctly.

    That being said, I would focus on the bulk, and plan on reporting on the residue. SQL supports some regex (patindex, like etc) processing, but its not going to support the more sophisticated expressions one could use for this purpose.

  • Thanks for your replies guy. That's what I thought too. These are just 50 records of millions of records as the addresses are for different policy holders. The problem is in come cases house number fields are 123a, 123a-112, 123, a123 etc

    So I don't have any way to differentiate between whats write and whats wrong in the code 🙁

  • You're going to have to do it in stages, build some temp tables, and start plowing through it. As you get individual rows sorted, place them into a holding table. Also, I'd put an integer primary key on the temp tables to join As so you can easily separate what's been sorted and what hasn't.

    For the '123a' or '123-A' scenarios, once you know the position of the 1st non-digit, then look to see if the first character BEFORE or AFTER it is a space.

    Sigerson

    "No pressure, no diamonds." - Thomas Carlyle

  • You could spend an awful lot of money trying to develop this and still make some terrible mistakes that you might not ever find.

    My recommendation is to simply buy some CASS certification software. It might be a couple or even several hundred dollars but that's nothing if you divide that by how much the company is paying you by the hour and how much time you're going to spend on writing something homegrown.

    The CASS certification program will also do something that your homegrown solution cannot. It well play your list of addresses against actual address ranges and tell you if an address is invalid even if it's in perfect format.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks for your suggestions guys. I think it doesn't make any sense for me to spend that much time on this issue since it will take too much time. I will ask my modeler to look for a work around.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply