Removing Non-Alphanumeric Characters in a Query

  • Problem:

    We have a database which holds address information in the following fields: AddressLine, City, State, ZIP. Because there is no additional field for 'Apartment', we need to parse that information from the AddressLine field. In many cases, the "Apt" is followed by '#', ':', '.', etc, which I would like to remove without having to use multiple nested REPLACE() functions.

    Assumptions (for the purposes of this example):

    1. Apartment information is always at the end of the Address Line

    2. Apartment is always denoted by "Apt" and is always preceded by a space

    Goal: To return only the apartment number in the Derived "Apartment" field

    The SQL so far:

    CREATE TABLE #Address

    (

    AddressLine VARCHAR(100)

    )

    INSERT INTO #Address

    VALUES ('123 Main St Apt 4')

    INSERT INTO #Address

    VALUES ('123 Main St Apt #4')

    INSERT INTO #Address

    VALUES ('123 Main St Apt. 4')

    INSERT INTO #Address

    VALUES ('123 Main St Apt: 4')

    INSERT INTO #Address

    VALUES ('123 Main St')

    SELECT AddressLine,

    CASE

    WHEN PATINDEX('% APT[^A-Z0-9]%', AddressLine) > 0 THEN SUBSTRING(AddressLine, 1, PATINDEX('% APT[^A-Z0-9]%', AddressLine) - 1)

    ELSE AddressLine

    END AS AddressLineWithoutApt,

    CASE

    WHEN PATINDEX('% APT[^A-Z0-9]%', AddressLine) > 0 THEN SUBSTRING(AddressLine, PATINDEX('% APT[^A-Z0-9]%', AddressLine), LEN(AddressLine))

    END AS Apartment

    FROM #Address

    DROP TABLE #Address

    Thanks for your help.

  • Because the AddressLine column always ends with the appartment number, this can be achieved using LEN. For example, the following code will return all appartment numbers up to 5 digits long:

    CREATE TABLE #Address

    (

    AddressLine VARCHAR(100)

    )

    INSERT INTO #Address

    VALUES ('123 Main St Apt 4')

    INSERT INTO #Address

    VALUES ('123 Main St Apt #4')

    INSERT INTO #Address

    VALUES ('123 Main St Apt. 4')

    INSERT INTO #Address

    VALUES ('123 Main St Apt: 4')

    INSERT INTO #Address

    VALUES ('123 Main St')

    SELECT

    CASE

    WHEN SUBSTRING(AddressLine,LEN(AddressLine)-4,5) LIKE '[0-9][0-9][0-9][0-9][0-9]'

    THEN RIGHT(AddressLine,5)

    WHEN SUBSTRING(AddressLine,LEN(AddressLine)-3,4) LIKE '[0-9][0-9][0-9][0-9]'

    THEN RIGHT(AddressLine,4)

    WHEN SUBSTRING(AddressLine,LEN(AddressLine)-2,3) LIKE '[0-9][0-9][0-9]'

    THEN RIGHT(AddressLine,3)

    WHEN SUBSTRING(AddressLine,LEN(AddressLine)-1,2) LIKE '[0-9][0-9]'

    THEN RIGHT(AddressLine,2)

    WHEN SUBSTRING(AddressLine,LEN(AddressLine),1) LIKE '[0-9]'

    THEN RIGHT(AddressLine,1)

    END

    FROM #Address

    DROP TABLE #Address

  • Why are you against using a function?

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Garadin (4/29/2010)


    Why are you against using a function?

    Good question. Not against using a function, but this particular server is a 2000 with a whole host of complicated dependencies already. We are hopefully transitioning it to 2K5 or 2K8 soon and the less we can put on the server to limit the complications, the better. That's kind of the idea anyway. If it could be done via a query, that would be nice for our particular situation.

  • How about a function that goes in a utility database. That way you can just recreate the utility database when you upgrade the server and don't have to worry about adding additional time to making sure all components are moved over. This is my preferred method of this, it is a function, but it's pretty general.

    http://phelabaum.com/archive/2010/01/tally-table-string-cleaning/

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Something like this?

    -- sample data

    CREATE TABLE #Address

    (

    AddressLine VARCHAR(100)

    )

    INSERT INTO #Address VALUES ('123 Main St Apt 4')

    INSERT INTO #Address VALUES ('123 Main St Apt #4')

    INSERT INTO #Address VALUES ('123 Main St Apt. 4809')

    INSERT INTO #Address VALUES ('123 Main St Apt: 14')

    INSERT INTO #Address VALUES ('123 Main St')

    -- solution

    SELECT a.AddressLine,

    AddressCleaned = ISNULL(NULLIF(LEFT(a.AddressLine, CHARINDEX(' Apt', a.AddressLine)), ''), a.AddressLine),

    ApartmentNumber = REVERSE(x.rebmuNtnemtrapA)

    FROM #Address a

    OUTER APPLY (

    SELECT rebmuNtnemtrapA = (SELECT SUBSTRING(REVERSE(a.AddressLine), n.n, 1)

    FROM (SELECT 1 AS [n] UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5) n

    WHERE SUBSTRING(REVERSE(a.AddressLine), n.n, 1) LIKE '[0-9]'

    FOR XML PATH(''))

    ) x

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • It's 2K =)

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. πŸ˜‰

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • After reading responses, this is the solution I've come up with so far:

    --Initialize Test Data

    CREATE TABLE #Address

    (

    Id INT,

    AddressLine VARCHAR(100)

    )

    INSERT INTO #Address

    VALUES (1,

    '123 Main St # 4')

    INSERT INTO #Address

    VALUES (2,

    '123 Main St Apt: 4')

    INSERT INTO #Address

    VALUES (3,

    '123 Main St UNIT 4')

    INSERT INTO #Address

    VALUES (4,

    '123 Main St Apt 4 ')

    INSERT INTO #Address

    VALUES (5,

    '123 Main St Apt #4 ')

    INSERT INTO #Address

    VALUES (6,

    '123 Main St Apt. #4-A ')

    INSERT INTO #Address

    VALUES (7,

    '123 Main St BLDG: #4-A ')

    INSERT INTO #Address

    VALUES (8,

    '123 Main St ')

    --Solution

    SELECT Id,

    AddressLine,

    CASE

    WHEN PATINDEX('% APT[^A-Z0-9]%', AddressLine) > 0

    THEN LEFT(AddressLine, PATINDEX('% APT[^A-Z0-9]%', AddressLine))

    WHEN PATINDEX('% #[^A-Z0-9]%', AddressLine) > 0

    THEN LEFT(AddressLine, PATINDEX('% #[^A-Z0-9]%', AddressLine))

    WHEN PATINDEX('% UNIT[^A-Z0-9]%', AddressLine) > 0

    THEN LEFT(AddressLine, PATINDEX('% UNIT[^A-Z0-9]%', AddressLine))

    WHEN PATINDEX('% BLDG[^A-Z0-9]%', AddressLine) > 0

    THEN LEFT(AddressLine, PATINDEX('% BLDG[^A-Z0-9]%', AddressLine))

    ELSE AddressLine

    END AS AddressWithoutApt,

    REPLACE(CASE

    --' APT' in Address Line

    WHEN PATINDEX('% APT[^A-Z0-9]%', AddressLine) > 0

    THEN LTRIM(RIGHT(RTRIM(AddressLine), LEN(AddressLine) - PATINDEX('% APT[^A-Z0-9]%', AddressLine) - 4))

    --' #' in Address Line

    WHEN PATINDEX('% #[^A-Z0-9]%', AddressLine) > 0

    THEN LTRIM(RIGHT(RTRIM(AddressLine), LEN(AddressLine) - PATINDEX('% #[^A-Z0-9]%', AddressLine) - 2))

    --' UNIT' in Address Line

    WHEN PATINDEX('% UNIT[^A-Z0-9]%', AddressLine) > 0

    THEN LTRIM(RIGHT(RTRIM(AddressLine), LEN(AddressLine) - PATINDEX('% UNIT[^A-Z0-9]%', AddressLine) - 5))

    --' BLDG' in Address Line

    WHEN PATINDEX('% BLDG[^A-Z0-9]%', AddressLine) > 0

    THEN LTRIM(RIGHT(RTRIM(AddressLine), LEN(AddressLine) - PATINDEX('% BLDG[^A-Z0-9]%', AddressLine) - 5))

    END, '#', '') AS AptNum

    FROM #Address

    --Drop Test Data

    DROP TABLE #Address

    This query considers other possible suffixes and is working pretty well, except in the case that there is an apartment number following an address, which has nothing preceding it. Many apartment numbers also include an alpha character or a dash, so this query takes some of these exceptions into account. I'm still working on cases where the number appears with no prefix, but this is the best I've come up with so far.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply