Remove non alpha chars from a column

  • CELKO (9/30/2014)


    I use nested REPLACE() functions. Some day we might have the ANSI/ISO Standard TRANSLATE()

    Hi Joe,

    I agree that using nested replace functions are the berries for performance especially when replacing multiple adjacent spaces with just one. Would that be practical in this case? You'd only be keeping 52 out of 256 characters (ASCII characters) and that's if it's not Unicode. I haven't actually tested it but would seem that the 254 nested replaces required would be a bit of a performance problem even for nested replaces.

    Does your alpha-only function have that many replaces or is there another trick to it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Not sure if this query extract is helpful but it helped me in a recent situation where I was trying to deal with non-alpha characters:

    WHERE ([Bill Freq] IS NOT NULL) AND ([Bill Freq] NOT LIKE '^[A-Za-z0-9]+$') AND ([Bill Freq] NOT LIKE '')

  • The original version I wrote of this was in 2006 for Experts Exchange, and it was to strip nonnumeric chars (although of course the basic idea is the same):

    http://www.experts-exchange.com/Database/MS-SQL-Server/Q_21957163.html

    Of course unfortunately you'll now need an EE membership -- or free trial -- to see it, grrrr.

    Edit: And of course it's certainly possible someone else wrote similar code before that. If they did, I'm unaware of it.

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • bgalway (9/30/2014)


    Not sure if this query extract is helpful but it helped me in a recent situation where I was trying to deal with non-alpha characters:

    WHERE ([Bill Freq] IS NOT NULL) AND ([Bill Freq] NOT LIKE '^[A-Za-z0-9]+$') AND ([Bill Freq] NOT LIKE '')

    The idea isn't to find things... it's to fix things by changing the actual data. Also, your test full null is likely not necessary because, unless you've changed the server defaults or have the appropriate SET statement somewhere in your code, NULLs will naturally be excluded because they cannot be compared even to each other.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Some day we might have the ANSI/ISO Standard TRANSLATE()

    In the meantime you can use the one that I created.

    I am working on an article I plan to submit to SSC and was reluctant to share this without a little more testing but, perhaps, this would be a good opportunity to get some feedback/suggestions.

    One of them uses this simple getnums function:

    CREATE FUNCTION dbo.getnums(@rows int)

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    (

    WITH

    L1 AS (SELECT n=n FROM (values (1),(1)) t(n)),

    L2 AS (SELECT n=1 FROM L1 a CROSS APPLY L1 b),

    L3 AS (SELECT n=1 FROM L2 a CROSS APPLY L2 b),

    L4 AS (SELECT n=1 FROM L3 a CROSS APPLY L3 b),

    L5 AS (SELECT n=1 FROM L4 a CROSS APPLY L4 b),

    iTally AS

    (

    SELECT n = row_number() over (order by (select null))

    FROM L5 a CROSS APPLY L5 b

    )

    SELECT TOP (@rows) n

    FROM itally

    ORDER BY n

    );

    GO

    Here are the first three versions (yes, we all know how bad loops are, I'll explain why I created one with a loop in a moment...)

    ;-- (1) dbo.loopTranslate (scalar using a loop)

    CREATE FUNCTION dbo.loopTranslate

    ( @string varchar(8000),

    @replace varchar(100),

    @with varchar(100)

    )

    /*

    -- Use

    DECLARE @string varchar(100) = '##!!xyxxyyzz!!##';

    SELECT [@string]= @string, newstring = dbo.loopTranslate(@string,'xyz#!','abc')

    */

    RETURNS varchar(8000)

    AS

    BEGIN

    DECLARE @i int = 1;

    WHILE @i <= len(@replace)

    BEGIN

    SELECT @string = replace(@string,substring(@replace,@i,1),substring(@with,@i,1)),

    @i = @i+1

    END

    RETURN @string

    END;

    GO

    -- (2) dbo.svfTranslate (scalar using a tally table)

    CREATE FUNCTION dbo.svfTranslate

    ( @string varchar(8000),

    @replace varchar(100),

    @with varchar(100)

    )

    /*

    -- Use

    DECLARE @string varchar(100) = '##!!xyxxyyzz!!##';

    SELECT [@string]= @string, newstring = dbo.svfTranslate(@string,'xyz#!','abc')

    */

    RETURNS varchar(8000) AS

    BEGIN

    SELECT @string = replace(@string,substring(@replace,n,1),substring(@with,n,1))

    FROM dbo.getnums(len(@replace));

    RETURN @string;

    END;

    GO

    -- (3) itvfTranslate (inline tvf using recursive cte)

    CREATE FUNCTION dbo.itvfTranslate

    ( @string varchar(8000),

    @replace varchar(100),

    @with varchar(100)

    )

    /*

    -- Use

    DECLARE @string varchar(100) = '##!!xyxxyyzz!!##';

    SELECT [@string]= @string, newstring = x

    FROM dbo.itvfTranslate(@string,'xyz#!','abc')

    */

    RETURNS table AS

    RETURN

    (

    WITH a AS

    (

    SELECT n=1, x=replace(@string,substring(@replace,1,1),substring(@with,1,1))

    UNION ALL

    SELECT n+1, x=replace(x,substring(@replace,n+1,1),substring(@with,n+1,1))

    FROM a

    WHERE n < len(@replace)

    )

    SELECT x

    FROM a

    WHERE n = len(@replace)

    );

    GO

    --Note: I also did a couple variations using the "Quirky Update" method but they performed badly...

    ...and here's one using a scalar recursive function (note the commented section for examples of use):

    -- (4) rTranslate (recursive scalar)

    CREATE FUNCTION dbo.rTranslate

    ( @S varchar(1000)(8000),

    @r varchar(32),

    @w varchar(32),

    @i int=1

    )

    /*

    Created by: Alan Burstein

    Created on: 9/2/2014

    Usage Examples:

    -- (1) basic replace and remove (both functions)

    -- remove $ and #, Replace a with A, c with C, and b with x...

    DECLARE @string varchar(20)='###$$$aaabbbccc$$$###',

    @pre varchar(5)='acb#$',

    @post varchar(5)='ACx';

    SELECT original = @string,

    Translated = dbo.rTranslate(@string,@pre,@post,1);

    GO

    -- (2) format phone numbers

    -- (a) format phone (atomic value)

    DECLARE @string varchar(20)='(425)555-1212',

    @pre varchar(5)=')(', @post varchar(5)='-';

    SELECT original = @string,

    Translated = dbo.rTranslate(@string,@pre,@post,1);

    -- (b) format phone numbers(from table)

    WITH phoneNbrs(n,pn) AS

    ( SELECT 1, '(425)555-1212' UNION ALL

    SELECT 2, '425.555.1212' UNION ALL

    SELECT 3, '425_555_1212' UNION ALL

    SELECT 4, '(425)555.1212' )

    SELECT n, pn AS before, [after] = dbo.rTranslate(pn,x,y,1)

    FROM phoneNbrs

    CROSS APPLY (VALUES('.)_('+char(32),'---')) t(x,y);

    -- (c) hide phone numbers

    WITH phoneNbrs(n,pn) AS

    ( SELECT 1, '(425) 555-1212' UNION ALL

    SELECT 2, '425.555.1212' UNION ALL

    SELECT 3, '425-555-1212' UNION ALL

    SELECT 4, '4255551212' )

    SELECT n, pn AS before, [after] = dbo.rTranslate(pn,x,y,1)

    FROM phoneNbrs

    CROSS APPLY (VALUES('()-.0123456789','()-.**********')) t(x,y)

    GO

    -- (3) Replace accent characters with normal characters,

    DECLARE @string varchar(100)='Thë Quìck Greeñ Fox jumpëd over thë lázy dogs back!',

    @special1 varchar(32)='áâãäæèïéìëíîçåñòóôöõàøúüûùýÁÃÄ',

    @special2 varchar(32)='ÆÈÏÉÌËÍÎÅÑÒÓÔÖÕÀØÚÜÛÙÝ!',

    @normal1 varchar(32)='aaaaaeieieiicanoooooaouuuuyAAAAA',

    @normal2 varchar(32)='EIEIEIIANOOOOOAOUUUUY.';

    SELECT @string AS original,

    newstring = dbo.rTranslate(dbo.rTranslate(@string,@special1,@normal1,1),@special2,@normal2,1);

    -- note how I deal with the recursion limit, note the query plan

    GO

    -- (4) using each to find a pattern in a set of strings...

    -- looking for the string/pattern combination that will produce ABCABC

    DECLARE @strings TABLE(string varchar(20) not null);

    DECLARE @patterns TABLE(p_id int identity primary key,

    pattern varchar(10) not null,

    p_replace varchar(10) not null);

    DECLARE @SearchFor varchar(10)='ABCABC';

    INSERT @strings VALUES('123abcABC1'),('222defACC'),('123ABCmno');

    INSERT @patterns VALUES('123','E'),('mno','XYZ'),('abc123','ABC'),('ABC','XYZ');

    WITH pattern_search AS

    (

    SELECT string, pattern, p_replace, newstring = dbo.rTranslate(string,pattern,p_replace,1)

    FROM @strings s

    CROSS APPLY @patterns p

    )

    SELECT *, [matched] = case newstring when @SearchFor then 'yes' else 'no' end

    FROM pattern_search;

    */

    RETURNS varchar(8000)

    AS

    BEGIN

    RETURN

    ( CASE

    WHEN @i <= len(@r)

    THEN dbo.rTranslate(replace(@s,substring(@r,@i,1),substring(@w,@i,1)),@r,@w,@i+1)

    ELSE @S

    END

    )

    END

    GO

    The downside to this guy, of course, is the 32 char limit but this can be circumvented using nesting; note example #3 in my code

    dbo.rTranslate(dbo.rTranslate(@string,@special1,@normal1,1),@special2,@normal2,1);...

    The test harness I've been using:

    IF OBJECT_ID('tempdb..#vals') IS NOT NULL DROP TABLE #vals;

    SELECT val = replicate(cast(newid() as char(36)),200)

    INTO #vals

    FROM dbo.getnums(5000);

    GO

    --SELECT * FROM #vals

    SET NOCOUNT ON;

    dbcc freeproccache

    dbcc dropcleanbuffers

    DECLARE @r varchar(20)='ABCDEF190',

    @w varchar(20)='00000000';

    DECLARE @x varchar(8000);

    SET STATISTICS TIME ON;

    --SET STATISTICS IO ON;

    PRINT char(13)+'svf loop version:'

    SELECT @x = dbo.loopTranslate(val, @r, @w)

    FROM #vals;

    PRINT char(13)+'svf tally version:'

    SELECT @x = dbo.svfTranslate(val, @r, @w)

    FROM #vals;

    PRINT char(13)+'itvf recursive cte version:'

    SELECT @x = x

    FROM #vals

    CROSS APPLY dbo.itvfTranslate(val, @r, @w);

    PRINT char(13)+'recursive version:'

    SELECT @x = dbo.rTranslate(val, @r, @w, 1)

    FROM #vals;

    -- add a replace() function for testing

    SET STATISTICS TIME OFF;

    --SET STATISTICS IO OFF;

    GO

    and the results:

    svf loop version:

    SQL Server Execution Times:

    CPU time = 4742 ms, elapsed time = 4792 ms.

    svf tally version:

    SQL Server Execution Times:

    CPU time = 4665 ms, elapsed time = 4735 ms.

    itvf recursive cte version:

    SQL Server Execution Times:

    CPU time = 5460 ms, elapsed time = 5561 ms.

    recursive version:

    SQL Server Execution Times:

    CPU time = 812 ms 4665, elapsed time = 871 ms 4680.

    I included the loop version to show how badly the other other 2 performed. I have tested the recursive version with various string lengths and row counts and the results are the same: the recursive scalar version is 5-10 times faster. The code is simple as is the query plan: never mind, they're all basically the same - the recursive scalar version however is the cleanest as is the query plan.

    Edit: I realized that the recursive version was taking an input @string input of varchar(1000) vs. varchar(8000) for the other three. Nonetheless, here's a few examples of how to create a translate() function in SQL Server 2000 through SQL Server 2014. For anyone interested.

    Final edit: just read through the thread and saw Jeff's comment...

    "Note that we cannot use "SET STATISTICS" to measure performance here because there's a scalar function involved. "SET STATISTICS" greatly skews the performance results when scalar functions are present making the scalar function look a whole lot worse than it actually is. Please see the following article for more on that little testing nuance..."

    I'll need to re-do my test.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Ah... my apologies. I had to withdraw my post that was in this spot. At first, it looked like your CROSS APPLY "GetNums" was twice as fast as the traditional CROSS JOIN method. Much to my embarrassment, I found that I had phat-phingered the number "2000000" instead of "1000000" on the traditional method.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Maybe I'm missing something here, but this is simple and fast:

    declare @t varchar(5000) = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'

    declare @S varchar(5000) = ''

    select @S = @S + substring(@t, i, 1)

    from dbo.fnTalleyTable(len(isnull(@t, ''))) tt

    where substring(@t, i, 1) like '[a-z]'

    select @S

    Don

    Don Simpson



    I'm not sure about Heisenberg.

  • DonlSimpson (10/2/2014)


    Maybe I'm missing something here, but this is simple and fast:

    declare @t varchar(5000) = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'

    declare @S varchar(5000) = ''

    select @S = @S + substring(@t, i, 1)

    from dbo.fnTalleyTable(len(isnull(@t, ''))) tt

    where substring(@t, i, 1) like '[a-z]'

    select @S

    Don

    Not recommended.

    Multi-Row Variable Assignment and ORDER BY


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • The problem with your code, Don, is that it only handles scalar values.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • dwain.c (10/2/2014)


    DonlSimpson (10/2/2014)


    Maybe I'm missing something here, but this is simple and fast:

    declare @t varchar(5000) = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'

    declare @S varchar(5000) = ''

    select @S = @S + substring(@t, i, 1)

    from dbo.fnTalleyTable(len(isnull(@t, ''))) tt

    where substring(@t, i, 1) like '[a-z]'

    select @S

    Don

    Not recommended.

    Multi-Row Variable Assignment and ORDER BY

    Hmmm... Itzik used concatenation of a column of data in his example went haywire. Does anyone have an example of this type of code going crazy?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • All right then, how about replacing the explicit string concatenation with "for xml path"?

    declare @originalText varchar(7500), @includeCharacters varchar(20), @cleanText varchar(7500)

    select @originalText = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'

    , @includeCharacters = '[a-z]', @cleanText = ''

    select @cleanText =

    (

    select substring(@originalText, i, 1)

    from dbo.fnTalleyTable(len(isnull(@originalText, ''))) tt

    where substring(@originalText, i, 1) like @includeCharacters

    for xml path ('')

    )

    select @cleanText

    Don

    Don Simpson



    I'm not sure about Heisenberg.

  • Heh... here I go ago... necro'ing the same wicked old post.

    @DonlSimpson ... Try the following and see why I prefer not to use XML for this type of thing.


    declare @originalText varchar(7500), @includeCharacters varchar(20), @cleanText varchar(7500)

    select @originalText = 'RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row." <<&whoopsi&>> This-divided-by,that|&/multiplied\by(100)give*you,%;<percent>.'
       , @includeCharacters = '[a-z<>&]', @cleanText = ''

    select @cleanText =
       (
       select substring(@originalText, i, 1) 
       from dbo.fnTalleyTable(len(isnull(@originalText, ''))) tt
       where substring(@originalText, i, 1) like @includeCharacters
          for xml path ('')
       )

    select @cleanText

    Of course, it the code were for alpha-numeric only, that problem wouldn't show up.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Shifting gears a bit,.. Lordy, I miss Dwain.  :unsure:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 13 posts - 16 through 27 (of 27 total)

You must be logged in to reply to this topic. Login to reply