how to strip out all non alpha numeric characters?

  • --drop table #addr

    create table #addr

    ( id int identity (1,1)

    , address1 varchar(200)

    )

    insert into #addr

    select '532 40th ave. n.e. apt)104'

    union all

    select '1000 sector way st. apt)111-ac))'

    union all

    select '3333 summer-point place'

    union all

    select 'p.o.box 22000a-330042'

    union all

    select '5566 first stett n. lot.204-a apt)2-c'

    --should return only alpha numeric chars.

    '53240thaveneapt104'

    '1000sectorwaystapt111ac'

    '3333summerpointplace'

    'pobox22000a330042'

    '5566firststettnlot204aapt2c'

  • i think this will do what you are after; this leaves 0-9,A-Z,a-z and strips out everything else...commas,punctuation, hi ascii, etc.

    select dbo.StripNonAlphaNumeric(address1) from #addr

    --results

    53240thaveneapt104

    1000sectorwaystapt111ac

    3333summerpointplace

    pobox22000a330042

    5566firststettnlot204aapt2c

    select ascii('a'),ascii('z'),ascii('A'),ascii('Z')

    GO

    CREATE FUNCTION StripNonAlphaNumeric(@OriginalText VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    DECLARE @CleanedText VARCHAR(8000)

    ;WITH tally (N) as

    (SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)

    FROM Master.dbo.SysColumns sc1

    CROSS JOIN Master.dbo.SysColumns sc2)

    SELECT @CleanedText = ISNULL(@CleanedText,'') +

    CASE

    --ascii numbers are 48(for '0') thru 57 (for '9')

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    --ascii upper case letters A-Z is 65 thru 90

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 65 AND 90

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    --ascii lower case letters a-z is 97 thru 122

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 97 AND 122

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END

    FROM tally WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText

    END

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • That's exactly what i was looking for. Thanks Lowel. I knew there was a way via ascii but hadn't done it before. This ones going in the tool box.

    Again thank you!

  • Maybe you could make the CASE statement of Lowells nice function a little shorter:

    CASE

    WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END

    It seems like it'll return the same value.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (3/23/2010)


    Maybe you could make the CASE statement of Lowells nice function a little shorter:

    CASE

    WHEN (SUBSTRING(@OriginalText,Tally.N,1)) like '[0-Z]'

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END

    It seems like it'll return the same value.

    It seems, but it really depends on the collation being used. Check out the following script:

    DECLARE @OriginalText VARCHAR(MAX) = '0;:ëêà^AaZz'

    SELECT

    CASE

    WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-Z]'

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END,

    CASE

    WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AS,Tally.N,1)) like '[0-Z]'

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END,

    CASE

    WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_CI_AI,Tally.N,1)) like '[0-Z]'

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END

    FROM

    (

    SELECT Number N FROM master..spt_values WHERE Type = 'P'

    ) Tally

    where

    Tally.N <= LEN(@OriginalText)

    If you really want to remove any character but 0-9, a-z, A-Z compare the Ascii-codes. Also note that in a binary collation ':' and ';' fall between [0-Z].

    Peter

  • Thanx for clarification, Peter!

    I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement... :blush:



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • lmu92 (3/23/2010)


    Thanx for clarification, Peter!

    I knew there must have been a reason Lowell provided a method that seemed to leave room for improvement... :blush:

    Well, you were close. I think this one will do the job too

    CASE

    WHEN (SUBSTRING(@OriginalText COLLATE Latin1_General_BIN2,Tally.N,1)) like '[0-9a-zA-Z]'

    THEN SUBSTRING(@OriginalText,Tally.N,1)

    ELSE '' END,

    BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.

    Peter

  • You may also want to file away this version, which uses an inline table valued function. It performs significantly faster if you have to clean up a large number of rows in a table with one pass. I modified the function above to get rid of anything except 0-9 and tested it against an existing function that does the same thing.

    CREATE FUNCTION dbo.itvf_NumbersOnly

    (

    @Input Varchar(max)

    )

    RETURNS TABLE

    AS

    RETURN

    (

    WITH data as

    (select N,substring(@input,N,1) as element

    from dbo.tally

    where substring(@input,N,1) between '0' and '9' -- alter to include only characters needed

    and N <= len(@input)

    )

    Select(Select ''+element

    FROM data

    ORDER BY N

    FOR XML PATH('')

    ) as [NumberString]

    )

    GO

    Also, I should note that CLR is said to be significantly faster at string manipulation, but we don't do CLR here yet, so I can't test. There was a huge thread on the subject here. There may be a variation in performance depending on the size of the strings to be cleaned and the number of characters to be replaced.

    ---------------

    -- code to test

    ---------------

    create table #test (rowID int identity primary key, data varchar(max))

    insert into #test

    select top 100000 newID()

    from dbo.tally

    select top 10 * from #test

    set statistics io on;

    set statistics time on;

    select rowid,Data,NumberString

    into #test2

    from #test

    cross apply dbo.itvf_NumbersOnly(data)

    set statistics time off;

    set statistics io off;

    print '----------------------------------------------------------------'

    set statistics io on;

    set statistics time on;

    select rowid,data,dbo.StripNonNumeric(data) as NumberString

    into #test3

    from #test

    set statistics time off;

    set statistics io off;

    select top 10 * from #test2 order by rowid

    select top 10 * from #test3 order by rowid

    drop table #test

    drop table #test2

    drop table #test3

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • ...substring(@input,N,1) between '0' and '9' ...

    Still the same mistake. This will match characters like '³' and '²'.

  • I see what you mean, but my point wasn't the selection/omission criteria in the where clause. In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.

    The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth. 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/23/2010)


    I see what you mean, but my point wasn't the selection/omission criteria in the where clause.

    But that's what the thread was about. The same issue was raised in a couple of recent threads so I thought I should mention it.

    In fact, I learned something from the earlier posts and made a mental note to be aware of potential differences in collation.

    I am learning everyday here, both from the answers as the questions. This is a great site.

    The point is the slow performance of a user-defined function compared to an inline table-valued function. At 100,000 rows I was seeing a difference of seconds versus minutes. Take that for what its worth. 🙂

    Agreed.

  • @peter-2:

    BTW, where did you get [0-Z] from. I was surprised to see that ':' or ';' do not fall within that range when using a collation like Latin1_General_CI_Ax. I scanned BOL, but I can not find anything about this pattern.

    Actually, it's "developed" using the old fashioned trial and error method 😉

    I used Lowells code and modified it to have a case clause for numbers as well as characters. Then I just tried to make it even more simple, doing the "contertest" with CHAR(..) for some (but not all) numbers Lowells code excluded.

    So, there's no "reliable source"...:-)



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Interesting, again I learned something new today. Thanks for the reply.

    Peter

  • Collation is so important when coding a routine that does a large number of comparisons or sorts, and not just because of the potential for errors as shown in this thread. Performance can be absolutely shocking (up to 40 times worse) when using Windows collations, even with non-Unicode data. Binary collations are fastest of all, but the SQL_* collations are not very far behind at all.

    See https://connect.microsoft.com/SQLServer/feedback/details/512459/replace-function-extremely-slow-with-non-binary-windows-collation for details and a reproduction script.

  • Paul, I have to ask just when is Windows collation used?

    I would assume that since all the work is being done inside SQL that some default SQL collation sequence would be used, and you say that performance is not at all bad with SQL collation. Doesn't each server, and DB have a default collation setting?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply