REPLACE Multiple Spaces with One

Question

Post reply

REPLACE Multiple Spaces with One

Viewing 15 posts - 241 through 255 (of 425 total)

You must be logged in to reply to this topic. Login to reply

Paul White SSC Guru Points: 150457 More actions · Answer 1

Michael,

Just a quick reply for now, because I have something else to add to this thread in a moment.

Collation determines the way character data is mapped to bit patterns, the rules for ordering, and also the rules for making comparisons. No data is ever changed - it's just the rules for ordering and comparison that differ depending on how the various collations involved resolve. The best references for this sometimes confusing topic are the following BOL entries:

Collation Precedence - 2008

Collation Precedence - 2005

Collation Precedence - 2000

If you work through some examples with the relevant BOL entry, it should become clearer.

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Paul White SSC Guru Points: 150457 More actions · Answer 2

Jeff & Michael,

:w00t: I now understand why Jeff's method broke! :w00t:

Consider these very simple statements:



SELECT REPLACE('  ', '  ', 'X') -- Example 1

SELECT REPLACE('  ' COLLATE LATIN1_GENERAL_BIN, '  ', 'X') -- Example 2

SELECT REPLACE(SPACE(2), SPACE(2), 'X') -- Example 3

SELECT REPLACE(SPACE(2) COLLATE LATIN1_GENERAL_BIN, SPACE(2), 'X') -- Example 4

There are two space characters in Example 1 and Example 2.

Examples 3 and 4 are exactly the same except using SPACE(2) in place of the two space characters.

In SQL 2000 (8.0.2171):

Example 1 produces X

Example 2 produces XX <== This is what breaks Jeff's method

Example 3 produces X

Example 4 produces X

In SQL 2005 and 2008, all four examples give the expected output of a single 'X'.

So...what is example 2 all about in 2000? Well, a full discussion would be very long and dull (it involves data type resolution, implicit conversions, trailing blank handling with VARCHAR, expression constant folding, and the order in which all these rules are applied).

The bottom line is that literals that end with a space can be problematic in 2000 where an explicit collation is specified. It exposes what is probably a bug where trailing blanks are trimmed from one implicit VARCHAR literal, but not the other, due to a subtle difference in the order the implicit CONVERTs are applied, compared to the concatenations.

Adding COLLATE clauses to everything certainly works, but now that I know why it is happening, it turns out we only need two - wherever there are trailing blanks:



SELECT

    CleanString =

    REPLACE (

        REPLACE (

            REPLACE (

                    --Changes 2 spaces to the OX model

                    OriginalString COLLATE LATIN1_GENERAL_BIN, SPACE(2), SPACE(1) + CHAR(7)),

                --Changes the XO model to nothing

                CHAR(7) + SPACE(1) COLLATE LATIN1_GENERAL_BIN, SPACE(0)),

            --Changes the remaining X's to nothing

            CHAR(7), SPACE(0))

FROM @Demo

WHERE CHARINDEX('  ',OriginalString) > 0

The above code works correctly on 2000, 2005, and 2008, regardless of server or database collations. I changed the space and empty string literals to SPACE(n) functions for clarity - it works just fine with literals too.

For anyone interested, an example difference in the 2000 query plan where this bug is exposed is shown below:

--[Expr1002]=replace(replace(replace(Convert(Convert(@Demo.[OriginalString])), Convert(space(2)), Convert(space(1)+char(7))), Convert(char(7))+Convert(space(1)), Convert(space(0))), Convert(char(7)), Convert(space(0)))

--[Expr1002]=replace(replace(replace(Convert(Convert(@Demo.[OriginalString])), Convert(space(2)), Convert(space(1)+char(7))), Convert(char(7)+space(1)), Convert(space(0))), Convert(char(7)), Convert(space(0)))

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

C# Gnu SSCertifiable Points: 5973 More actions · Answer 3

Dear All

A fresh aproach :smooooth:

If the problem was slightly different from Jeff's article, and we had to remove any bad character or pattern of characters then I thought holding those characters or patterns in a table would be cool - and then of course you can just add new patterns to the table as you like! This would allow a end user to have control over 'string cleaning'.

For starters I have added the double space possibilities only, just to see if I could get things to work:

This is a small script that you can copy and have a go with :



-- Create a table of all the rotten stuff we want to remove

DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))

INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)

SELECT SPACE(2), ''

UNION ALL SELECT SPACE(3), ''

UNION ALL SELECT SPACE(5), ''

UNION ALL SELECT SPACE(9), ''

UNION ALL SELECT SPACE(17), ''

UNION ALL SELECT SPACE(33), ''

-- Create a test table of dirty strings:

DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))

INSERT INTO @table_StringsToClean(MyString)

SELECT 'a b  c   d    e     f       g       h                 i                   j                          k'

UNION ALL SELECT '                           x            b         c        d       e      f     g    h   i  j k'

-- Clean the strings

UPDATE @table_StringsToClean

SET MyString = REPLACE(MyString, RemovePattern, ReplaceWith)

FROM @table_StringsToRemove,  @table_StringsToClean

WHERE CHARINDEX(RemovePattern, MyString) != -1

-- Display result

SELECT * FROM @table_StringsToClean

-- output of avove SQL is

a bc de f g h i jk

xb cd ef gh ij k

Prety cool ?

[Edit : no its not the output is wrong]

This you might like because:

A/ its inline set based SQL (I might try tomorrow if time as a UDF)

B/ you can add new patterns without recoding the SQL

C/ you can replace with Stars *** maybe to indicate text removed: perhaps obcene words

If you wanted to remove other stuff you could try this:

DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))



INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)

SELECT SPACE(2), ''

UNION ALL SELECT SPACE(3), ''

UNION ALL SELECT SPACE(5), ''

UNION ALL SELECT SPACE(9), ''

UNION ALL SELECT SPACE(17), ''

UNION ALL SELECT SPACE(33), ''

UNION ALL SELECT REPLICATE(CHAR(9),2), '' -- Duplicate TABS

UNION ALL SELECT REPLICATE(CHAR(9),5), ''

UNION ALL SELECT REPLICATE(CHAR(9),9), ''

UNION ALL SELECT REPLICATE(CHAR(9),17), ''

UNION ALL SELECT REPLICATE(CHAR(9),33), ''

UNION ALL SELECT 'boobs', 'knockers'  --- Obcence stuff!!

UNION ALL SELECT 'RBAR', 'R***' --- Another rude word!

-- we can add anything to the above list

DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))

INSERT INTO @table_StringsToClean(MyString)

SELECT 'a b  c   d    e     f       g       h                 i                   j                          k'

UNION ALL SELECT '                           x            b         c        d       e      f     g    h   i  j k'

UPDATE @table_StringsToClean

SET MyString = REPLACE(MyString, RemovePattern, ReplaceWith)

FROM @table_StringsToRemove,  @table_StringsToClean

WHERE CHARINDEX(RemovePattern, MyString) != -1

SELECT * FROM @table_StringsToClean

One last idea!!

Wrap the above into an AFTER UPDATE trigger keeping the data clean, then massive batch updates not required - and performance less of an issue.

P.S. As regards prior performance tests:

What I am feeling from the advice/comments, is don't take ANY results for granted and test out all the different solutions in your own unique environment - where things may be quite different - for many different reasons!

Mind you - if you run your entire enterprise on a crabby pattie laptop like mine - then expect similar results to those I have posted 🙂

Hope you all had a fine weekend:-):-)

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 4

C# Gnu

SSCertifiable

Points: 5973

November 22, 2009 at 4:14 pm

#1082754

Oops I just noticed the output is wrong:

a bc de f g h i jk

Any ideas?

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 5

Here is the corrected script

-- Create a table of all the rotten stuff we want to remove



DECLARE @table_StringsToRemove TABLE (RemovePattern VARCHAR(50), ReplaceWith VARCHAR(30))

INSERT INTO @table_StringsToRemove(RemovePattern, ReplaceWith)

SELECT SPACE(2), ' '

UNION ALL SELECT SPACE(3), ' '

UNION ALL SELECT SPACE(5), ' '

UNION ALL SELECT SPACE(9), ' '

UNION ALL SELECT SPACE(17), ' '

UNION ALL SELECT SPACE(33), ' '

-- Create a test table of dirty strings:

DECLARE @table_StringsToClean TABLE (MyString VARCHAR(max))

INSERT INTO @table_StringsToClean(MyString)

SELECT 'a b  c   d    e     f       g       h                 i                   j                          k'

UNION ALL SELECT '                           x            b         c        d       e      f     g    h   i  j k'

-- Clean the strings

UPDATE @table_StringsToClean

SET MyString = REPLACE(REPLACE(REPLACE(REPLACE(MyString, RemovePattern, ReplaceWith), RemovePattern, ReplaceWith) ,RemovePattern, ReplaceWith),RemovePattern, ReplaceWith)

FROM @table_StringsToRemove,  @table_StringsToClean

WHERE CHARINDEX(RemovePattern, MyString) != -1

-- Display result

SELECT * FROM @table_StringsToClean

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

Paul White SSC Guru Points: 150457 More actions · Answer 6

C# Screw (11/22/2009)

Oops I just noticed the output is wrong:

a bc de f g h i jkAny ideas?

It's a nice idea - and very creative - but it won't work like that I'm afraid.

In order to work, this method would need to be able to apply more than one REPLACE operation to each row, do it in a guaranteed order, and operate on the result of the previous operation.

Sadly, none of these things are possible using UPDATE and a CROSS JOIN. If it were, we might not need recursive CTEs...

The engine will do the join, but the stream aggregate includes an ANY internal aggregate:

[RemovePattern] = Scalar Operator(ANY([RemovePattern])), [ReplaceWith] = Scalar Operator(ANY([ReplaceWith])), [MyString] = Scalar Operator(ANY([MyString]))

...so each row is only operated on once, by one REPLACE. The 'corrected' version simply applies the exact same REPLACE expression four times. In effect, it applies the SPACE(2) -> SPACE(1) replacement four times. Note that this is just a quirk of the current plan - it is not guaranteed to choose the SPACE(2) row over SPACE(9), SPACE(3), SPACE(17), or any other...!

Sorry!

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

C# Gnu SSCertifiable Points: 5973 More actions · Answer 7

Yes thanks Paul, I noticed myself it was not replacng the 'rbar' text in the 2nd test (after I posted unfortuanately :blush:) - bother - I was so excited by the idea too! Back to the drawng board! 🙂

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 8

Hi

not quite as exciting - but at least the replace strings are in a table as I wanted :

😎

CREATE FUNCTION fn_CleanString(@MyString VARCHAR(max))



RETURNS VarChar(MAX)

AS

BEGIN

	-- Remove unwanted strings

	SELECT @MyString = REPLACE(@MyString, RemovePattern, ReplaceWith)

	FROM tPatterns

	WHERE CHARINDEX(RemovePattern, @MyString) != -1

	--Now Remove duplicate spaces & tabs(Jeff) - or other method

	SELECT @MyString = REPLACE(

			REPLACE(

				REPLACE(

					LTRIM(RTRIM(@MyString))                ,'  ',' '+CHAR(7))  --Changes 2 spaces to the OX model

				,CHAR(7)+' ','')        --Changes the XO model to nothing

			,CHAR(7),'')  --Changes the remaining X's to nothing

	RETURN @MyString

END

go

-- Create a table of all the rotten stuff we want to remove

SELECT 'tablescan' RemovePattern, 'seek' ReplaceWith

INTO tPatterns

UNION ALL SELECT 'RBAR', 'R***'

UNION ALL SELECT 'anyword', 'R*******' 

-- Create some dirty test data

SELECT 'a b  c   d    e     f       g       h     rbar            i        tablescan           j                          k' myString

INTO tMyStrings

UNION ALL

SELECT '           tablescan        rbar        x            b         c        d       e      f     g    h   i  j k'

-- Clean & output the our test data

SELECT dbo.fn_CleanString(myString) FROM  tMyStrings

go

-- Cleanup

DROP TABLE tPatterns

DROP TABLE tMyStrings

DROP FUNCTION fn_CleanString

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 9

:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

This works nicely without the Nested replace:-):-):

CREATE FUNCTION fn_CleanString(@MyString VARCHAR(max))



RETURNS VarChar(MAX)

AS

BEGIN

        -- Remove unwanted strings

        SELECT @MyString = REPLACE(@MyString, RemovePattern, ReplaceWith)

        FROM tPatterns

        WHERE CHARINDEX(RemovePattern, @MyString) != -1

        RETURN @MyString

END

go

-- Create a table of all the rotten stuff we want to remove

SELECT 'tablescan' RemovePattern, 'seek' ReplaceWith

INTO tPatterns

UNION ALL SELECT 'RBAR', 'R***'

UNION ALL SELECT 'anyword', 'R*******'

UNION ALL SELECT SPACE(2), ' '

UNION ALL SELECT SPACE(2), ' '

UNION ALL SELECT SPACE(2), ' '

UNION ALL SELECT SPACE(2), ' '

UNION ALL SELECT SPACE(2), ' ' 

-- Create some dirty test data

SELECT 'a b  c   d    e     f       g       h     rbar            i        tablescan           j                          k' myString

INTO tMyStrings

UNION ALL

SELECT '           tablescan        rbar        x            b         c        d       e      f     g    h   i  j k'

-- Clean & output the our test data

SELECT dbo.fn_CleanString(myString) FROM  tMyStrings

go

-- Cleanup

DROP TABLE tPatterns

DROP TABLE tMyStrings

DROP FUNCTION fn_CleanString

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 10

Paul White (11/22/2009)

Jeff & Michael,

:w00t: I now understand why Jeff's method broke! :w00t:

Consider these very simple statements:

Heh... Yep... that's what I got out of it, too, but I didn't say any such thing because of the handrails that a lot of folks have been trying to install lately... It was easier for me to just say that the fix is to put the COLLATE statement on all string constants. You, of course, are doing it the right way. Another "right" way would be to use two "unlikely" characters but that would also require a 4th replace to restore the first unlikely character to a space, perhaps negating the speed gain.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 11

C# Screw (11/22/2009)

:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

This works nicely without the Nested replace:-):-):

Very cool revelation but I'd suggest using the same spacing scheme that Michael used for two reasons... it's faster and it will handle up to 65536 spaces.

As a side bar, it's interesting that some folks in my group at work recently received some 3rd party files with (you won't believe this) a freakin' million spaces between some words. That's not an exaggeration... there were over 1 million contiguous spaces in several spots in the file. I'm thinking "road trip" to find out in person if those kids like pork chops. 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Paul White SSC Guru Points: 150457 More actions · Answer 12

C# Screw (11/22/2009)

:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

:ermm: That works, yes, but you know it'll suck performance-wise, right? Look at the RBAR!!!

I'm trying to be helpful, but I know I sound like the Grinch:

1. Try to avoid MAX data types as parameters and variables where possible

2. Schema-bind your functions so the engine checks it for determinism

3. Schema-qualify your objects

4. Don't do row-by-agonizing-row data access in functions!

5. Actually, don't do data access at all in functions!!

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 13

WHERE CHARINDEX(RemovePattern, @MyString) != -1

Also, can you tell me when CHARINDEX will ever = -1? Wouldn't it be more effective to just have ...

WHERE CHARINDEX(RemovePattern, @MyString) > 0

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Paul White SSC Guru Points: 150457 More actions · Answer 14

Jeff Moden (11/22/2009)

Heh... Yep... that's what I got out of it, too, but I didn't say any such thing because of the handrails that a lot of folks have been trying to install lately... It was easier for me to just say that the fix is to put the COLLATE statement on all string constants. You, of course, are doing it the right way. Another "right" way would be to use two "unlikely" characters but that would also require a 4th replace to restore the first unlikely character to a space, perhaps negating the speed gain.

Appreciate the comments Jeff - thanks! I think introducing another replace would unfairly penalise your method (which I still admire BTW) simply due to a sneaky 2000 bug. I'm just glad it is fixed in 2005 and 2008.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 15

Paul White (11/22/2009)

C# Screw (11/22/2009)

:satisfied:!!!!Wait a minute - I thought I was onto something!!!!!!:satisfied:

:ermm: That works, yes, but you know it'll suck performance-wise, right? Look at the RBAR!!!

I'm trying to be helpful, but I know I sound like the Grinch:

1. Try to avoid MAX data types as parameters and variables where possible

2. Schema-bind your functions so the engine checks it for determinism

3. Schema-qualify your objects

4. Don't do row-by-agonizing-row data access in functions!

5. Actually, don't do data access at all in functions!!

Paul

I mostly agree with #5... but then there's things like Tally tables and lookup tables (both can be cached) that make it worth it especially when Cross Apply is used. I will agree that it certainly deserves a performance test.

I'll also agree that nested replaces will still beat the RBAR of doing single replaces even if they are table driven in a single query... we've already seen that in testing.

Just in case someone is confused (not you Paul), not all seemingly "loopless" code is Set Based... you also have to look at how many times each row or item is "touched". For example, a "Triangular Join" to do a running total on a lousy 10,000 rows will produce more than 50 MILLION rows of internal RBAR which is why it's so very slow compared to even a Cursor.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions