Splitting City and State with variable formats

  • I have a difficult task here. Say there's a column in a table called CityState. The column can have values such as:

    Las Angeles, CA

    Anaheim, C.A.

    San Diego CA

    Seattle W.A.

    I can't just do a simple substrings on this, because the state could potentially be 4 characters on the right. And I would want it so that it is always just the two character state (if it is C.A. it should be CA). And with a potential comma, the city name isn't as simple either.

    If I wanted to select the city name (without the comma) and the state name (without the period) to insert into another table, what would be the best approach here?

    Thanks in advance. 🙂



    Twitter: @SQLife
    Email: sqlsalt(at)outlook(dot)com

  • i think if you REVERSE the string, and grab up to a space, that might be pretty close to what you are after:

    you can always REPLACE() to remove any commas in the city, and REPLACE() to remove periods in the state

    SELECT

    CityState,

    REVERSE(CityState),

    CHARINDEX(' ',REVERSE(CityState)),

    SUBSTRING(REVERSE(CityState),1,CHARINDEX(' ',REVERSE(CityState))),

    SUBSTRING(REVERSE(CityState),CHARINDEX(' ',REVERSE(CityState))+ 1,30),

    REVERSE(SUBSTRING(REVERSE(CityState),1,CHARINDEX(' ',REVERSE(CityState)))),

    REVERSE(SUBSTRING(REVERSE(CityState),CHARINDEX(' ',REVERSE(CityState))+ 1,30))

    FROM

    (SELECT 'Las Angeles, CA' As CityState UNION ALL

    SELECT 'Anaheim, C.A.' UNION ALL

    SELECT 'San Diego CA' UNION ALL

    SELECT 'Seattle W.A.'

    )X

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • I took Lowell's code and tossed in a couple of bells and whistles. Like standardizing the commas.

    -- Dude, periods in state abbreviations... seriously?

    SELECT CityState,City+', '+st as fixed,city,st

    FROM

    (SELECT *,

    ltrim(rtrim(reverse(left(revcity,(CHARINDEX(' ',revcity,1)))))) as st,

    LTRIM(rtrim(reverse(right(revcity,len(revcity)-(CHARINDEX(' ',revcity,1)))))) as city

    FROM

    (SELECT

    CityState

    ,ltrim(rtrim(reverse(REPLACE(REPLACE(citystate,'.',space(0)),',',SPACE(1))))) as revcity

    FROM

    (SELECT 'Las Angeles, CA' As CityState UNION ALL

    SELECT 'Anaheim, C.A.' UNION ALL

    SELECT 'San Diego CA' UNION ALL

    SELECT 'Seattle WASH.'

    ) X

    ) Y

    ) Z

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • So it ended up being even worse than originally thought of. The data was even messier than that. In some entries, the zip code was appended to the CityState column. (got to love user input :rolleyes:)

    So I ended up having to do a REPLACE() on all numeric characters with an empty character. And there was tons of spaces all over, so I had to trim when I went to split them up.

    If anybody is interested in seeing the final resulting query then I will gladly post. Talk about a few hours of frustration...



    Twitter: @SQLife
    Email: sqlsalt(at)outlook(dot)com

  • trstringer (8/4/2011)


    So it ended up being even worse than originally thought of. The data was even messier than that. In some entries, the zip code was appended to the CityState column. (got to love user input :rolleyes:)

    So I ended up having to do a REPLACE() on all numeric characters with an empty character. And there was tons of spaces all over, so I had to trim when I went to split them up.

    If anybody is interested in seeing the final resulting query then I will gladly post. Talk about a few hours of frustration...

    Absolutely... post the code. I'd also be interested to see 10 or 20 of the really wierd entries you had to put up with so I can take a crack at simplifying the code for you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Here's my take on it. I don't know for sure if I have all the bases covered but I've avoided the very expense REVERSE function altogether thanks to my buddies ISNULL, NULLIF, and RTRIM.

    Here's the test table I set up...

    --===== Create and populate a test table on-the-fly

    -- This is NOT a part of the solution.

    SELECT d.MixedName

    INTO #CityState

    FROM (

    SELECT 'Las Angeles, CA' As CityState UNION ALL

    SELECT 'Anaheim, C.A.' UNION ALL

    SELECT 'San Diego CA' UNION ALL

    SELECT 'Seattle W.A.' UNION ALL

    SELECT 'Auburn Hills , M.I., 48325-1234 ' UNION ALL

    SELECT 'Some Really Long City Name , C.A 90210 '

    ) d (MixedName)

    ;

    Here's a possible solution...

    --===== Solve the problem without any REVERSE's

    WITH

    cteFirstPass AS

    (

    SELECT CityState =

    RTRIM(

    REPLACE(

    REPLACE(

    SUBSTRING(

    MixedName

    ,1,ISNULL(NULLIF(PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',MixedName)-1,-1),8000))

    ,',','')

    ,'.','')

    ),

    Zip =

    RTRIM(

    SUBSTRING(

    MixedName

    ,ISNULL(NULLIF(PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',MixedName),0),7000),10)

    )

    FROM #CityState

    )

    SELECT City = RTRIM(SUBSTRING(CityState,1, LEN(CityState)-2)),

    State = RIGHT(CityState,2),

    Zip

    FROM cteFirstPass

    ;

    Now... the premise of the code above is that the last two letters in the text will always be a State abbreviation whether they're separated by periods, spaces, or something else. The comma means nothing in the code above.

    It also assumes that Zip codes will consist of at least 5 contiguous digits to the right of the State. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • First off, here's some data that I saw:

    Anaheim CA

    Los Angeles, C.A.

    San Diego,C.A. (notice the lack of space between the comma and the state abbr)

    Tampa Bay, F L 33703 (notice the zip code)

    Miami F.L. 331 (yes, that's not a typo. It is a partial zip code)

    There are no typos there, sometimes there were zero spaces, sometimes there were excess spaces. Sometimes there was 5-digit zips, sometimes there was partial zips. Unbelievable. 🙂 Anyways, this is what I came up with. I'm sure it's not the most efficient piece of code but it seems to get the job done on 20,000+ records rather fast.

    select mailingcitystate

    into #MailingCityStateTemp1

    from OriginalData

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '0', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '1', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '2', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '3', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '4', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '5', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '6', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '7', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '8', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '9', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, '.', '')

    update #MailingCityStateTemp1

    set MailingCityState = REPLACE(mailingcitystate, ',', ' ')

    update #MailingCityStateTemp1

    set MailingCityState = RTRIM(mailingcitystate)

    select ltrim(LEFT(mailingcitystate, LEN(mailingcitystate) - 3)) as MailingCity,

    replace(RIGHT(mailingcitystate, 3), ' ', '') as MailingState

    into #MailingCityStateTemp2

    from #MailingCityStateTemp1

    A bug that took me a while to find was that I had originally replaced the comma (',') with nothing (''). But that was putting the city and the state's name with no space right next to each other, throwing off when I split the string. The simple solution was to just replace the comma with a space, because all leading and trailing spaces end up getting trimmed.

    Please be gentle on the code criticism! 😀



    Twitter: @SQLife
    Email: sqlsalt(at)outlook(dot)com

  • Nobody?! I had to further tweak that code because it was moreso a SELECTed column in an INSERT VALUES clause. But that's the gyst of it.



    Twitter: @SQLife
    Email: sqlsalt(at)outlook(dot)com

  • Did you give Jeff's solution a try?

  • trstringer (8/6/2011)


    Nobody?! I had to further tweak that code because it was moreso a SELECTed column in an INSERT VALUES clause. But that's the gyst of it.

    Heh... you've not tried my code, yet, I see. For the mess on Zip Codes, just alter my code to look for 3 digits or maybe even just one and Bob's your uncle. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply