Remove alpha characters

  • I have to data sent to me in a social security number field. I have to fix the data. I've used replcee to remove the dashes and change the ' ' to '' however users sometime enter alpha characters. The db field takes the characters because its varchar. I do not want to reject a record. So I'm cleaning up the data. If they add a character to a specific spot it would be easy but it can be any of the numbers. Is there a way to search for and remove alpha characters?

  • If you're using SQL Server 2000, this is easy with a User Defined Function (UDF), e.g.:

    
    
    CREATE FUNCTION dbo.f_RemoveChars(@Input varchar(1000))
    RETURNS varchar(1000) AS
    BEGIN
    DECLARE @pos smallint
    SET @Pos = PATINDEX('%[^0-9]%',@Input)
    WHILE @Pos > 0 BEGIN
    SET @Input = STUFF(@Input,@pos,1,'')
    SET @Pos = PATINDEX('%[^0-9]%',@Input)
    END
    RETURN @Input END

    --Jonathan

    Edited by - Jonathan on 11/25/2003 07:02:36 AM



    --Jonathan

  • Wow, that was great. Worked awesome

  • Can someone point me to a good reference describing (dumbed down) bitwise operators such as ^ and giving examples of their use?

    thanks

    k2

  • Don't know if you've seen it, but SQL Books online has good sections on each of the bitwise operators AND (&), NOT (~), OR (|) and exclusive OR (^), which are technical but give examples by way of explanation.

    Also, in the example code given by Jonathan, the ^ operator is used not as a bitwise operator but as a wildcard character telling PATINDEX to exclude characters in the range 0 - 9. Just in case you were confused by that particular usage!

    If you're looking for further articles on bitwise operators, it might be worth starting a new thread.

    Cheers,

    mia

    Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.


    Cheers,
    mia

    Life moves pretty fast. If you don't stop and look around once in a while, you could miss it.

  • You can also check with ISNUMERIC inside a loop (loop through the characters in the string). If not numeric, then replace with the charaters you want.

    Another method is checking the ASCII value of the characters. The ASCII table you can easily get from Internet.



    Regards,
    kokyan

  • For long strings with more than a small percentage of characters that must be stripped, the overhead of the PATINDEX and STUFF functions will degrade performance.

    For these cases, a substantial improvement can be had simply by looping through every character of the string and concatenating the desired characters onto a new string.

    An additional, albeit smaller, gain can be made by replacing the explicit loop with an auxiliary numbers table and an SQL set-based (select) statement as shown below.

    -- One-time numbers table creation and initialization

    create table Num(n int)

    declare @i int

    set @i = 0

    while @i <= 12

    begin

    insert Num values(@i)

    set @i = @i PLUS 1

    end

    --[where PLUS represents the plus sign]

    -- Using the table to strip non-digit characters from a social security number:

    declare @ss1 varchar(15)

    declare @ss2 varchar(15)

    set @ss1 = '111-22x3333'

    set @ss2 = ''

    select @ss2 = @ss2 PLUS case when substring(@ss1, n, 1) between '0' and '9' then substring(@ss1, n, 1) else '' end

    from Num

    where n <= len(@ss1)

    -- [where PLUS represents the plus sign]

    select 'Original SSN' = @ss1, 'Clean SSN' = @ss2

    Note that an auxiliary numbers table can be effectively used in a variety of situations.

  • quote:


    For long strings with more than a small percentage of characters that must be stripped, the overhead of the PATINDEX and STUFF functions will degrade performance.

    For these cases, a substantial improvement can be had simply by looping through every character of the string and concatenating the desired characters onto a new string.


    Did you test this? I did, and you're wrong.

    --Jonathan



    --Jonathan

  • Yes, and here are my test cases. Note that if you reverse the ordering of the second case the results are reversed - revealing the obvious ... that character distribution matters. Bottom line - test algorithms against representative data samples.

    declare @s1 varchar(500), @s2 varchar(500),

    @d datetime,

    @pos int, @i int

    declare @loop int, @max_loops int,

    @tl1 int, @tl2 int,

    @ts1 int, @ts2 int

    select @tl1 = 0, @tl2 = 0,

    @ts1 = 0, @ts2 = 0,

    @max_loops = 1000

    set @loop = 0

    -- Average each test over @max_loops runs

    while @loop < @max_loops

    begin

    --

    -- Test 1: Repeated Address data of length 480 (Digits comprise 35% of string)

    --

    -- Strip non-digits using concatenation

    select @s1 = replicate('123 Drury Lane Apt10 Alba, CA 83952-4830', 12),

    @d = getdate(),

    @i = 1,

    @s2 = ''

    while @i <= len(@s1)

    begin

    if substring(@s1, @i, 1) between '0' and '9'

    begin

    set @s2 = @s2 PLUS substring(@s1, @i, 1)

    end

    set @i = @i PLUS 1

    end

    select @tl1 = @tl1 PLUS datediff(ms, @d, getdate())

    -- Strip non-digits using STUFF/PATINDEX

    select @s1 = replicate('123 Drury Lane Apt10 Alba, CA 83952-4830', 12),

    @d = getdate(),

    @Pos = PATINDEX('%[^0-9]%',@s1)

    WHILE @Pos > 0

    BEGIN

    SET @s1 = STUFF(@s1,@pos,1,'')

    SET @Pos = PATINDEX('%[^0-9]%',@s1)

    END

    select @ts1 = @ts1 + datediff(ms, @d, getdate())

    if @s1 <> @s2 print 'Test 1 Error'

    --

    -- Test 2: 240 digits followed by 240 non-digits (Digits comprise 50% of string)

    --

    -- Strip non-digits using concatenation

    set @s1 = replicate('1', 240) PLUS replicate('a', 240)

    select @d = getdate()

    set @i = 1

    set @s2 = ''

    while @i <= len(@s1)

    begin

    if substring(@s1, @i, 1) between '0' and '9'

    begin

    set @s2 = @s2 PLUS substring(@s1, @i, 1)

    end

    set @i = @i PLUS 1

    end

    select @tl2 = @tl2 PLUS datediff(ms, @d, getdate())

    -- Strip non-digits using STUFF/PATINDEX

    set @s1 = replicate('1', 240) + replicate('a', 240)

    select @d = getdate()

    SET @Pos = PATINDEX('%[^0-9]%',@s1)

    WHILE @Pos > 0

    BEGIN

    SET @s1 = STUFF(@s1,@pos,1,'')

    SET @Pos = PATINDEX('%[^0-9]%',@s1)

    END

    select @ts2 = @ts2 PLUS datediff(ms, @d, getdate())

    if @s1 <> @s2 print 'Test 2 Error'

    set @loop = @loop PLUS 1

    end

    select @tl1 / @max_loops as test1_concat,

    @ts1/@max_loops as test1_stuff,

    @tl2 / @max_loops as test2_concat,

    @ts2 / @max_loops as test2_stuff

    test1_concat test1_stuff test2_concat test2_stuff

    ------------ ----------- ------------ -----------

    7 12 8 21

  • quote:


    Yes, and here are my test cases.


    Interesting; I stand corrected with data like this. Thank you. I tested on an actual table where there weren't as many embedded characters and so didn't reach your qualification percentage.

    I found that it's PATINDEX, not STUFF, that is relatively slow. It may do something akin to the way your code evaluates each character, except once each loop rather than once each character of the string. As STUFF is not the culprit, perhaps something like this would be better in cases like you tested:

    
    
    CREATE FUNCTION dbo.f_RemoveChars(@Input varchar(1000))
    RETURNS varchar(1000) AS BEGIN
    DECLARE @Pos smallint
    SET @Pos = 1
    WHILE @Pos <= LEN(@Input) BEGIN
    IF SUBSTRING(@Input, @Pos, 1) BETWEEN '0' AND '9'
    SET @Input = STUFF(@Input, @Pos, 1, '')
    SET @Pos = @Pos + 1
    END
    RETURN @Input END

    --Jonathan



    --Jonathan

  • Jonathan,

    I agree about using 'STUFF' - in fact, when I went back to my code to copy it into my message, I realized that I had used 'STUFF', rather than concatenatation, but re-wrote it to confirm my original claim.

  • Sorry I'm so late on this one (2 1/2 years aint too bad )... just ran across it...

    ISNUMERIC is not a good thing to check for "All Digits" because it allows periods, +/- signs, dollar signs, currency type idicators, spaces, hard spaces, tabs, line feeds, and a wad of other things.

    Use something like the following to find all values in a column that have some other than 0-9 in it...

     SELECT *
       FROM sometable
      WHERE somecolumn NOT LIKE '%[^0-9]%'

    You'll be surprised at how fast the double "NOT" works  (^ in the rational expression stands for "NOT") 

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply