REPLACE Multiple Spaces with One

  • " ||*9*9|| " solution is funtastic. Great job and thanks a lot .

  • 😎 Thanks, Jeff!

  • Good solution Jeff, and well documented. However, I was a tad disappointed that you included the LTRIM(RTRIM(. That is unnecessary overhead that is unrelated to solving the stated problem (i.e. removal of multiple contiguous spaces). 😀

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I starting writing this reply saying that this isn't very efficient at all...

    However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s) which worked off the premise that the overhead of a function was better since I only went through the length of the string once in the loop. This premise was wrong, very wrong. The replace solution performed 4 times better than the slimmest while loop implementation I had.

    I will submit this finding: It performed even better when compiled as an inline-function

    The only limitation to this is there doesn't seem to be a way to clean other white space characters without using lots of replaces. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text charcters. So it will reduce LF, CR, and Tabs to a single space, and then simply ignore any non-space character that doesn't fall between 33 and 126 in the ascii table while converting multiple spaces to a single space. I realize it would only take a couple replaces to get fix the whitespace characters, but filtering the non standard text charcters is not so easy.

  • Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):

    SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)

    SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 - > 2009-11-16 17:11:02.277)

    SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 - > 2009-11-16 17:11:16.077)

    We might have expected c# to have the day, but the Query/Function result is surprising.

    Can anyone explain why? (is this to do with SQL optimization abilities)?

    Is this test Relaible?

    --PREPARE

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    WHILE CHARINDEX(' ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,' ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 100000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    SELECT GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    GO

    SELECT GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    GO

    SELECT GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)

    FROM #TEMP2

    GO

    SELECT GETDATE()

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

  • newjcb (11/16/2009)


    I starting writing this reply saying that this isn't very efficient at all...

    However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s) which worked off the premise that the overhead of a function was better since I only went through the length of the string once in the loop. This premise was wrong, very wrong. The replace solution performed 4 times better than the slimmest while loop implementation I had.

    I will submit this finding: It performed even better when compiled as an inline-function

    The only limitation to this is there doesn't seem to be a way to clean other white space characters without using lots of replaces. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text charcters. So it will reduce LF, CR, and Tabs to a single space, and then simply ignore any non-space character that doesn't fall between 33 and 126 in the ascii table while converting multiple spaces to a single space. I realize it would only take a couple replaces to get fix the whitespace characters, but filtering the non standard text charcters is not so easy.

    I wonder if CLR wouldn't be better to help you solve the flexible solution you require...

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I wonder if CLR wouldn't be better to help you solve the flexible solution you require...

    Yep, (if) my test results in my post above are correct we are looking at CLR at least 10x faster, or execution will take just one tenth of time.

    I have another really good use of CLR as solution to sp_OACreate security issues - I will have to post as an article!

  • I didn't see it later on, but my solution in the past has been very simple; nest REPLACE() two spaces with one space as deep as required for the maximum number of contiguous spaces in the field, i.e. for VARCHAR(8000), I nest it 13 deep, as 2^13=8192.

    Note that the OX style replacing, when X is multiple characters, theoretically can result in severe internal fragmentation, as the strings become longer (and thus fewer rows per page are allows... if the pages were already full, they have to split).

  • brigzy (11/16/2009)


    Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):

    SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)

    SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 - > 2009-11-16 17:11:02.277)

    SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 - > 2009-11-16 17:11:16.077)

    We might have expected c# to have the day, but the Query/Function result is surprising.

    Can anyone explain why? (is this to do with SQL optimization abilities)?

    Is this test Relaible?

    --PREPARE

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    WHILE CHARINDEX(' ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,' ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 100000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    SELECT GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    GO

    SELECT GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    GO

    SELECT GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)

    FROM #TEMP2

    GO

    SELECT GETDATE()

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

    I got very different results from yours: I used only 10000 rows and could already see the difference.

    SELECT GETDATE() == 2009-11-16 12:44:40.250

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    GO

    SELECT GETDATE() == 2009-11-16 12:44:40.967

    So the orginal version (which I don't like because it can drastically increase the size of the string) took 0.717 seconds

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    GO

    SELECT GETDATE() == 2009-11-16 12:44:43.297

    Your while loop version took: 43.297 - 40.967 = 2.33 seconds!

    UPDATE #TEMP2 SET COL5= dbo.tool_CleanTextSpaces(COL2)

    FROM #TEMP2

    GO

    SELECT GETDATE() == 2009-11-16 12:44:43.940

    This is my version which uses char(7) and Ltrim & rtrim before the replaces: 43.940 - 43.297 = 0.643 seconds!

    I didn't test your CLR, so it may have been faster, but without a doubt, the while loop replace will always be slower.

    EDIT: I tested with 100k and got

    2009-11-16 12:57:15.420

    2009-11-16 12:57:26.033 -- Means ||*9.. took 11.033 seconds

    2009-11-16 12:58:00.903 -- Means function While CharIndex replace version took 34.87 seconds

    2009-11-16 12:58:09.117 -- Means tool_CleanTextSpaces took 8.214 Seconds

    Confirms above even more.

  • TheSQLGuru (11/16/2009)


    newjcb (11/16/2009)


    I starting writing this reply saying that this isn't very efficient at all...

    However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s) which worked off the premise that the overhead of a function was better since I only went through the length of the string once in the loop. This premise was wrong, very wrong. The replace solution performed 4 times better than the slimmest while loop implementation I had.

    I will submit this finding: It performed even better when compiled as an inline-function

    The only limitation to this is there doesn't seem to be a way to clean other white space characters without using lots of replaces. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text charcters. So it will reduce LF, CR, and Tabs to a single space, and then simply ignore any non-space character that doesn't fall between 33 and 126 in the ascii table while converting multiple spaces to a single space. I realize it would only take a couple replaces to get fix the whitespace characters, but filtering the non standard text charcters is not so easy.

    I wonder if CLR wouldn't be better to help you solve the flexible solution you require...

    It probably would. I currently only use one CLR and that is the one to enable REGEX comparision. I have been debating doing more recently...

  • I have run the test again and I get:

    (milliseconds)

    Batch execution completed 10000 times.

    SQL:

    6026

    SQL:Looping

    2700

    SQL:CLR

    433

    It is weird as I too would have expectd looping to be slower

    Here is the script with time output messages:

    (please can you post your SQL function or add it to this script):

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    WHILE CHARINDEX(' ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,' ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL:'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

  • johan.vandamme (11/16/2009)


    One other point to consider is sentence endings. Normally there are two spaces at the end of sentences (for readability, etc., and yes, not everyone uses them. So a second character is necessary to "keep double spaces following a period".

    To get around this and extended characters problems, a regex solution through CLR seems to be the best generalized solution for all replace issues, not just spaces.

    Added the CLR code for regex, in case anyone's interested:

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    using System.Text.RegularExpressions;

    namespace sql.clr

    {

    public partial class sql2K8Functions

    {

    [Microsoft.SqlServer.Server.SqlFunction()]

    public static bool fRegExMatch(SqlString Input, SqlString Pattern)

    {

    Regex RegexInstance = new Regex(Pattern.ToString());

    if ((Input.IsNull) || Pattern.IsNull)

    {

    return false;

    }

    else

    {

    return RegexInstance.IsMatch(Input.ToString());

    }

    }

    [Microsoft.SqlServer.Server.SqlFunction()]

    public static string fRegExReplace(SqlString Input, SqlString Pattern, SqlString Replacement)

    {

    Regex RegexInstance = new Regex(Pattern.ToString());

    if (Input.IsNull || Pattern.IsNull || Replacement.IsNull)

    {

    return null;

    }

    else

    {

    return RegexInstance.Replace(Input.ToString(), Replacement.ToString());

    }

    }

    }

    }

    In terms of speed, CLR version performs faster than built-in replace. Queries that used to take minutes to finish take only seconds now.

  • @brigzy

    Before I go ahead and try again, can you first try switching the orders around? Try running the loop version first, then select, then clr. Your Raid controller may be caching the temp table and causing all queries after the first to be run with different IO circumstances.

  • newjcb (11/16/2009)


    @brigzy

    Before I go ahead and try again, can you first try switching the orders around? Try running the loop version first, then select, then clr. Your Raid controller may be caching the temp table and causing all queries after the first to be run with different IO circumstances.

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2706

    SQL:

    6206

    SQL:CLR

    440

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    WHILE CHARINDEX(' ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,' ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL:'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_CleanString_CLR(COL2)

    FROM #TEMP2

    PRINT 'SQL:CLR'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    go

  • @Jeff

    Nice article! Thanks for sharing. 🙂

    @brigzy

    Probably try to work without the "GO N" feature (which is a feature of SSMS, not a T-SQL functionality).

    I just tried a simple CLR function and Jeff's way to clean data with one table and 100,000 rows to be cleaned.

    CLR (for sure, compiled as Release version):

    [Microsoft.SqlServer.Server.SqlFunction(

    DataAccess=DataAccessKind.None,

    IsDeterministic=true,

    IsPrecise=false,

    SystemDataAccess=SystemDataAccessKind.None)

    ]

    public static SqlString RemoveDoubleSpacesClr(string inData) {

    while (inData.IndexOf(" ") != -1) {

    inData = inData.Replace(" ", " ");

    }

    return inData;

    }

    My test:

    SET NOCOUNT ON;

    ---==================================================================

    -- Table with data to be cleaned

    IF (OBJECT_ID('tempdb..#TestCleanSpaces') IS NULL)

    CREATE TABLE #TestCleanSpaces

    (

    Id INT NOT NULL IDENTITY(1,1)

    PRIMARY KEY CLUSTERED

    ,Data VARCHAR(8000)

    );

    ELSE

    TRUNCATE TABLE #TestCleanSpaces;

    ---==================================================================

    -- create some test data

    WITH random (r1, r2, r3) AS

    (

    SELECT

    ABS(BINARY_CHECKSUM(NEWID()))

    ,ABS(BINARY_CHECKSUM(NEWID()))

    ,ABS(BINARY_CHECKSUM(NEWID()))

    ),

    Numbers (Num) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))

    FROM dbo.Numbers n1

    CROSS JOIN dbo.Numbers n2

    )

    INSERT INTO #TestCleanSpaces

    -- configure count of test data

    SELECT TOP(100000)

    REPLICATE(LEFT(CONVERT(NCHAR(36), NEWID()), r1 % 36) + SPACE(r2 % 20), r3 % 20)

    FROM Numbers n1

    CROSS JOIN random

    ---==================================================================

    -- test clr

    PRINT ('---==========================================================');

    PRINT ('-- CLR');

    SET STATISTICS TIME ON;

    SELECT

    dbo.RemoveDoubleSpacesClr(Data)

    FROM #TestCleanSpaces;

    SET STATISTICS TIME OFF;

    ---==================================================================

    -- test Jeff's function

    PRINT ('---==========================================================');

    PRINT ('-- Jeff''s function');

    SET STATISTICS TIME ON;

    SELECT REPLACE(

    REPLACE(

    REPLACE(

    LTRIM(RTRIM(Data))

    ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model

    ,CHAR(7)+' ','') --Changes the XO model to nothing

    ,CHAR(7),'')

    FROM #TestCleanSpaces;

    SET STATISTICS TIME OFF;

    I get the following result:

    ---==========================================================

    -- CLR

    SQL Server Execution Times:

    CPU time = 1015 ms, elapsed time = 2667 ms.

    ---==========================================================

    -- Jeff's function

    SQL Server Execution Times:

    CPU time = 1125 ms, elapsed time = 2505 ms.

    Both functions are almost equal independent how often I execute the script. Sometimes CLR is slightly faster, sometimes Jeff's function is faster.

    Greets

    Flo

Viewing 15 posts - 46 through 60 (of 425 total)

You must be logged in to reply to this topic. Login to reply