REPLACE Multiple Spaces with One

  • Michael,

    Just a quick reply for now, because I have something else to add to this thread in a moment.

    Collation determines the way character data is mapped to bit patterns, the rules for ordering, and also the rules for making comparisons. No data is ever changed - it's just the rules for ordering and comparison that differ depending on how the various collations involved resolve. The best references for this sometimes confusing topic are the following BOL entries:

    Collation Precedence - 2008

    Collation Precedence - 2005

    Collation Precedence - 2000

    If you work through some examples with the relevant BOL entry, it should become clearer.

    Paul

  • Jeff & Michael,

    :w00t: I now understand why Jeff's method broke! :w00t:

    Consider these very simple statements:

    SELECT REPLACE(' ', ' ', 'X') -- Example 1

    SELECT REPLACE(' ' COLLATE LATIN1_GENERAL_BIN, ' ', 'X') -- Example 2

    SELECT REPLACE(SPACE(2), SPACE(2), 'X') -- Example 3

    SELECT REPLACE(SPACE(2) COLLATE LATIN1_GENERAL_BIN, SPACE(2), 'X') -- Example 4

    There are two space characters in Example 1 and Example 2.

    Examples 3 and 4 are exactly the same except using SPACE(2) in place of the two space characters.

    In SQL 2000 (8.0.2171):

    Example 1 produces X

    Example 2 produces XX <== This is what breaks Jeff's method

    Example 3 produces X

    Example 4 produces X

    In SQL 2005 and 2008, all four examples give the expected output of a single 'X'.

    So...what is example 2 all about in 2000? Well, a full discussion would be very long and dull (it involves data type resolution, implicit conversions, trailing blank handling with VARCHAR, expression constant folding, and the order in which all these rules are applied).

    The bottom line is that literals that end with a space can be problematic in 2000 where an explicit collation is specified. It exposes what is probably a bug where trailing blanks are trimmed from one implicit VARCHAR literal, but not the other, due to a subtle difference in the order the implicit CONVERTs are applied, compared to the concatenations.

    Adding COLLATE clauses to everything certainly works, but now that I know why it is happening, it turns out we only need two - wherever there are trailing blanks:

    SELECT

    CleanString =

    REPLACE (

    REPLACE (

    REPLACE (

    --Changes 2 spaces to the OX model

    OriginalString COLLATE LATIN1_GENERAL_BIN, SPACE(2), SPACE(1) + CHAR(7)),

    --Changes the XO model to nothing

    CHAR(7) + SPACE(1) COLLATE LATIN1_GENERAL_BIN, SPACE(0)),

    --Changes the remaining X's to nothing

    CHAR(7), SPACE(0))

    FROM @Demo

    WHERE CHARINDEX(' ',OriginalString) > 0

    The above code works correctly on 2000, 2005, and 2008, regardless of server or database collations. I changed the space and empty string literals to SPACE(n) functions for clarity - it works just fine with literals too.

    For anyone interested, an example difference in the 2000 query plan where this bug is exposed is shown below:

    --[Expr1002]=replace(replace(replace(Convert(Convert(@Demo.[OriginalString])), Convert(space(2)), Convert(space(1)+char(7))), Convert(char(7))+Convert(space(1)), Convert(space(0))), Convert(char(7)), Convert(space(0)))

    --[Expr1002]=replace(replace(replace(Convert(Convert(@Demo.[OriginalString])), Convert(space(2)), Convert(space(1)+char(7))), Convert(char(7)+space(1)), Convert(space(0))), Convert(char(7)), Convert(space(0)))

    Paul

  • Dear All

    A fresh aproach :smooooth:

    If the problem was slightly different from Jeff's article, and we had to remove any bad character or pattern of characters then I thought holding those characters or patterns in a table would be cool - and then of course you can just add new patterns to the table as you like! This would allow a end user to have control over 'string cleaning'.

    For starters I have added the double space possibilities only, just to see if I could get things to work:

    This is a small script that you can copy and have a go with :

    -- Create a table of all the rotten stuff we want to remove

    DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))

    INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)

    SELECT SPACE(2), ''

    UNION ALL SELECT SPACE(3), ''

    UNION ALL SELECT SPACE(5), ''

    UNION ALL SELECT SPACE(9), ''

    UNION ALL SELECT SPACE(17), ''

    UNION ALL SELECT SPACE(33), ''

    -- Create a test table of dirty strings:

    DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))

    INSERT INTO @table_StringsToClean(MyString)

    SELECT 'a b c d e f g h i j k'

    UNION ALL SELECT ' x b c d e f g h i j k'

    -- Clean the strings

    UPDATE @table_StringsToClean

    SET MyString = REPLACE(MyString, RemovePattern, ReplaceWith)

    FROM @table_StringsToRemove, @table_StringsToClean

    WHERE CHARINDEX(RemovePattern, MyString) != -1

    -- Display result

    SELECT * FROM @table_StringsToClean

    -- output of avove SQL is

    a bc de f g h i jk

    xb cd ef gh ij k

    Prety cool ?

    [Edit : no its not the output is wrong]

    This you might like because:

    A/ its inline set based SQL (I might try tomorrow if time as a UDF)

    B/ you can add new patterns without recoding the SQL

    C/ you can replace with Stars *** maybe to indicate text removed: perhaps obcene words

    If you wanted to remove other stuff you could try this:

    DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))

    INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)

    SELECT SPACE(2), ''

    UNION ALL SELECT SPACE(3), ''

    UNION ALL SELECT SPACE(5), ''

    UNION ALL SELECT SPACE(9), ''

    UNION ALL SELECT SPACE(17), ''

    UNION ALL SELECT SPACE(33), ''

    UNION ALL SELECT REPLICATE(CHAR(9),2), '' -- Duplicate TABS

    UNION ALL SELECT REPLICATE(CHAR(9),5), ''

    UNION ALL SELECT REPLICATE(CHAR(9),9), ''

    UNION ALL SELECT REPLICATE(CHAR(9),17), ''

    UNION ALL SELECT REPLICATE(CHAR(9),33), ''

    UNION ALL SELECT 'boobs', 'knockers' --- Obcence stuff!!

    UNION ALL SELECT 'RBAR', 'R***' --- Another rude word!

    -- we can add anything to the above list

    DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))

    INSERT INTO @table_StringsToClean(MyString)

    SELECT 'a b c d e f g h i j k'

    UNION ALL SELECT ' x b c d e f g h i j k'

    UPDATE @table_StringsToClean

    SET MyString = REPLACE(MyString, RemovePattern, ReplaceWith)

    FROM @table_StringsToRemove, @table_StringsToClean

    WHERE CHARINDEX(RemovePattern, MyString) != -1

    SELECT * FROM @table_StringsToClean

    One last idea!!

    Wrap the above into an AFTER UPDATE trigger keeping the data clean, then massive batch updates not required - and performance less of an issue.

    P.S. As regards prior performance tests:

    What I am feeling from the advice/comments, is don't take ANY results for granted and test out all the different solutions in your own unique environment - where things may be quite different - for many different reasons!

    Mind you - if you run your entire enterprise on a crabby pattie laptop like mine - then expect similar results to those I have posted 🙂

    Hope you all had a fine weekend:-):-)

  • Oops I just noticed the output is wrong:

    a bc de f g h i jk

    Any ideas?

  • Here is the corrected script

    -- Create a table of all the rotten stuff we want to remove

    DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))

    INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)

    SELECT SPACE(2), ' '

    UNION ALL SELECT SPACE(3), ' '

    UNION ALL SELECT SPACE(5), ' '

    UNION ALL SELECT SPACE(9), ' '

    UNION ALL SELECT SPACE(17), ' '

    UNION ALL SELECT SPACE(33), ' '

    -- Create a test table of dirty strings:

    DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))

    INSERT INTO @table_StringsToClean(MyString)

    SELECT 'a b c d e f g h i j k'

    UNION ALL SELECT ' x b c d e f g h i j k'

    -- Clean the strings

    UPDATE @table_StringsToClean

    SET MyString = REPLACE(REPLACE(REPLACE(REPLACE(MyString, RemovePattern, ReplaceWith), RemovePattern, ReplaceWith) ,RemovePattern, ReplaceWith),RemovePattern, ReplaceWith)

    FROM @table_StringsToRemove, @table_StringsToClean

    WHERE CHARINDEX(RemovePattern, MyString) != -1

    -- Display result

    SELECT * FROM @table_StringsToClean

  • C# Screw (11/22/2009)


    Oops I just noticed the output is wrong:

    a bc de f g h i jkAny ideas?

    It's a nice idea - and very creative - but it won't work like that I'm afraid.

    In order to work, this method would need to be able to apply more than one REPLACE operation to each row, do it in a guaranteed order, and operate on the result of the previous operation.

    Sadly, none of these things are possible using UPDATE and a CROSS JOIN. If it were, we might not need recursive CTEs...

    The engine will do the join, but the stream aggregate includes an ANY internal aggregate:

    [RemovePattern] = Scalar Operator(ANY([RemovePattern])), [ReplaceWith] = Scalar Operator(ANY([ReplaceWith])), [MyString] = Scalar Operator(ANY([MyString]))

    ...so each row is only operated on once, by one REPLACE. The 'corrected' version simply applies the exact same REPLACE expression four times. In effect, it applies the SPACE(2) -> SPACE(1) replacement four times. Note that this is just a quirk of the current plan - it is not guaranteed to choose the SPACE(2) row over SPACE(9), SPACE(3), SPACE(17), or any other...!

    Sorry!

  • Yes thanks Paul, I noticed myself it was not replacng the 'rbar' text in the 2nd test (after I posted unfortuanately :blush:) - bother - I was so excited by the idea too! Back to the drawng board! 🙂

  • Hi

    not quite as exciting - but at least the replace strings are in a table as I wanted :

    😎

    CREATE FUNCTION fn_CleanString(@MyString VARCHAR(max))

    RETURNS VarChar(MAX)

    AS

    BEGIN

    -- Remove unwanted strings

    SELECT @MyString = REPLACE(@MyString, RemovePattern, ReplaceWith)

    FROM tPatterns

    WHERE CHARINDEX(RemovePattern, @MyString) != -1

    --Now Remove duplicate spaces & tabs(Jeff) - or other method

    SELECT @MyString = REPLACE(

    REPLACE(

    REPLACE(

    LTRIM(RTRIM(@MyString)) ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model

    ,CHAR(7)+' ','') --Changes the XO model to nothing

    ,CHAR(7),'') --Changes the remaining X's to nothing

    RETURN @MyString

    END

    go

    -- Create a table of all the rotten stuff we want to remove

    SELECT 'tablescan' RemovePattern, 'seek' ReplaceWith

    INTO tPatterns

    UNION ALL SELECT 'RBAR', 'R***'

    UNION ALL SELECT 'anyword', 'R*******'

    -- Create some dirty test data

    SELECT 'a b c d e f g h rbar i tablescan j k' myString

    INTO tMyStrings

    UNION ALL

    SELECT ' tablescan rbar x b c d e f g h i j k'

    -- Clean & output the our test data

    SELECT dbo.fn_CleanString(myString) FROM tMyStrings

    go

    -- Cleanup

    DROP TABLE tPatterns

    DROP TABLE tMyStrings

    DROP FUNCTION fn_CleanString

  • :satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

    This works nicely without the Nested replace:-):-):

    CREATE FUNCTION fn_CleanString(@MyString VARCHAR(max))

    RETURNS VarChar(MAX)

    AS

    BEGIN

    -- Remove unwanted strings

    SELECT @MyString = REPLACE(@MyString, RemovePattern, ReplaceWith)

    FROM tPatterns

    WHERE CHARINDEX(RemovePattern, @MyString) != -1

    RETURN @MyString

    END

    go

    -- Create a table of all the rotten stuff we want to remove

    SELECT 'tablescan' RemovePattern, 'seek' ReplaceWith

    INTO tPatterns

    UNION ALL SELECT 'RBAR', 'R***'

    UNION ALL SELECT 'anyword', 'R*******'

    UNION ALL SELECT SPACE(2), ' '

    UNION ALL SELECT SPACE(2), ' '

    UNION ALL SELECT SPACE(2), ' '

    UNION ALL SELECT SPACE(2), ' '

    UNION ALL SELECT SPACE(2), ' '

    -- Create some dirty test data

    SELECT 'a b c d e f g h rbar i tablescan j k' myString

    INTO tMyStrings

    UNION ALL

    SELECT ' tablescan rbar x b c d e f g h i j k'

    -- Clean & output the our test data

    SELECT dbo.fn_CleanString(myString) FROM tMyStrings

    go

    -- Cleanup

    DROP TABLE tPatterns

    DROP TABLE tMyStrings

    DROP FUNCTION fn_CleanString

  • Paul White (11/22/2009)


    Jeff & Michael,

    :w00t: I now understand why Jeff's method broke! :w00t:

    Consider these very simple statements:

    Heh... Yep... that's what I got out of it, too, but I didn't say any such thing because of the handrails that a lot of folks have been trying to install lately... It was easier for me to just say that the fix is to put the COLLATE statement on all string constants. You, of course, are doing it the right way. Another "right" way would be to use two "unlikely" characters but that would also require a 4th replace to restore the first unlikely character to a space, perhaps negating the speed gain.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • C# Screw (11/22/2009)


    :satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

    This works nicely without the Nested replace:-):-):

    Very cool revelation but I'd suggest using the same spacing scheme that Michael used for two reasons... it's faster and it will handle up to 65536 spaces.

    As a side bar, it's interesting that some folks in my group at work recently received some 3rd party files with (you won't believe this) a freakin' million spaces between some words. That's not an exaggeration... there were over 1 million contiguous spaces in several spots in the file. I'm thinking "road trip" to find out in person if those kids like pork chops. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • C# Screw (11/22/2009)


    :satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

    :ermm: That works, yes, but you know it'll suck performance-wise, right? Look at the RBAR!!!

    I'm trying to be helpful, but I know I sound like the Grinch:

    1. Try to avoid MAX data types as parameters and variables where possible

    2. Schema-bind your functions so the engine checks it for determinism

    3. Schema-qualify your objects

    4. Don't do row-by-agonizing-row data access in functions!

    5. Actually, don't do data access at all in functions!!

    Paul

  • WHERE CHARINDEX(RemovePattern, @MyString) != -1

    Also, can you tell me when CHARINDEX will ever = -1? Wouldn't it be more effective to just have ...

    WHERE CHARINDEX(RemovePattern, @MyString) > 0

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (11/22/2009)


    Heh... Yep... that's what I got out of it, too, but I didn't say any such thing because of the handrails that a lot of folks have been trying to install lately... It was easier for me to just say that the fix is to put the COLLATE statement on all string constants. You, of course, are doing it the right way. Another "right" way would be to use two "unlikely" characters but that would also require a 4th replace to restore the first unlikely character to a space, perhaps negating the speed gain.

    Appreciate the comments Jeff - thanks! I think introducing another replace would unfairly penalise your method (which I still admire BTW) simply due to a sneaky 2000 bug. I'm just glad it is fixed in 2005 and 2008.

  • Paul White (11/22/2009)


    C# Screw (11/22/2009)


    :satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

    :ermm: That works, yes, but you know it'll suck performance-wise, right? Look at the RBAR!!!

    I'm trying to be helpful, but I know I sound like the Grinch:

    1. Try to avoid MAX data types as parameters and variables where possible

    2. Schema-bind your functions so the engine checks it for determinism

    3. Schema-qualify your objects

    4. Don't do row-by-agonizing-row data access in functions!

    5. Actually, don't do data access at all in functions!!

    Paul

    I mostly agree with #5... but then there's things like Tally tables and lookup tables (both can be cached) that make it worth it especially when Cross Apply is used. I will agree that it certainly deserves a performance test.

    I'll also agree that nested replaces will still beat the RBAR of doing single replaces even if they are table driven in a single query... we've already seen that in testing.

    Just in case someone is confused (not you Paul), not all seemingly "loopless" code is Set Based... you also have to look at how many times each row or item is "touched". For example, a "Triangular Join" to do a running total on a lousy 10,000 rows will produce more than 50 MILLION rows of internal RBAR which is why it's so very slow compared to even a Cursor.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 15 posts - 241 through 255 (of 425 total)

You must be logged in to reply to this topic. Login to reply