REPLACE Multiple Spaces with One

  • Paul White (11/20/2009)


    My thanks to those who have voted for the Connect item about the collation performance thing. I'll post the response here as soon as I hear anything

    Paul,

    I have played around a bit and can't do any better other than confirm what you are saying.

    The only good news I have is that if your database collation is actually LATIN1_GENERAL_BIN or SQL_LATIN1_GENERAL_CP1_CI_AS then calling replace without a collate qualifier gives you the good performance you are looking for.

    Luckily we are dealing with SQL Server 2005 where replace always works no matter what combination you throw at it.

    In SQL Server 2000 it's a real nightmare. Performance is fairly stable but for many situations the calls to replace actually don't work, in particular when the database collation is different from the replace collate qualifier.

    There are even a couple of database collations (LATIN1_GENERAL_CI_AS and LATIN1_GENERAL_BIN) where Jeff's code will not work unless you place a collate qualifier next to the column using the same database collation.

    So what I'm wondering is if all this is specific to the replace function or if there are other things that are affected by collation in terms of performance and producing correct results.

  • Michael Meierruth (11/21/2009)


    In SQL Server 2000 it's a real nightmare. Performance is fairly stable but for many situations the calls to replace actually don't work, in particular when the database collation is different from the replace collate qualifier.

    Do you have some code and data that shows where "calls to replace actually don't work"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Joe Celko (11/21/2009)


    Back in 2004, we found that a Fibonacci series ran better a binary nesting of REPLACE() functions. It was like an old polyphase merge on a tape drive.

    That sounds very interesting, Joe. Would it be possible to show how this particular problem of replacing multiple spaces with one could be done in SQL Server using such a method? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (11/21/2009)


    Michael Meierruth (11/21/2009)


    In SQL Server 2000 it's a real nightmare. Performance is fairly stable but for many situations the calls to replace actually don't work, in particular when the database collation is different from the replace collate qualifier.

    Do you have some code and data that shows where "calls to replace actually don't work"?

    It's the code straight from your article.

    I have SQL Server 2000 (SP4) installed as a named instance with collation SQL_Latin1_General_CP1_CI_CA.

    I create a database with collation Latin1_General_CI_AS or Latin1_General_BIN. Running your code under either one causes the result to be unchanged. Adding the collation qualifier to your field (which is redundant) causes the result to be displayed correctly.

  • C# Screw (11/16/2009)


    That is interesting,

    do you have Visual Studio there ....or I can create T-SQL CREATE ASEMBLY line so u can also test the C# CLR version ... ?

    Cheers

    Technically, you don't need VS, but you'd need all of the libraries and a compiler, etc... so VS makes it a whole lot easier.

    I've extensively tested CLR regex functions in these cases (actually there's a bg thread a few years back with Jeff and I doing the friendly footrace competition on CLR versus native T-sql solutions. Rege can truly get up and going, and - its speed seems to have gotten even better in 2008 (over that in 2005). As lnog as you give it some memory to work with (so play on 64-bit to avoid the resource limitations), that sucker can get up and really go to town.

    That said - it still has some limitations, so it has some perf penalties with having to do a lot of data updates, so it was often a "neck and neck" photo finish..

    I like the regex for readbility, but the performance winner could in this case be a toss-up (sometimes CLR, sometimes, T-SQL).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (11/21/2009)


    actually there's a bg thread a few years back with Jeff and I doing the friendly footrace competition on CLR versus native T-sql solutions.

    Matt, that sounds interesting .. please could you add a link if poss...

    Thanks

  • C#Screw: I would check the query plans for each. I wouldn't be surprised that the non-function method uses a scan-based plan thus less IO but longer time while the function uses a lookup-based plan which can indeed be faster in time but take significantly more IO. It can be faster since those IO are many iterative hits on the same pages that are already in cache. Just a thought.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Michael Meierruth (11/21/2009)


    Jeff Moden (11/21/2009)


    Michael Meierruth (11/21/2009)


    In SQL Server 2000 it's a real nightmare. Performance is fairly stable but for many situations the calls to replace actually don't work, in particular when the database collation is different from the replace collate qualifier.

    Do you have some code and data that shows where "calls to replace actually don't work"?

    It's the code straight from your article.

    I have SQL Server 2000 (SP4) installed as a named instance with collation SQL_Latin1_General_CP1_CI_CA.

    I create a database with collation Latin1_General_CI_AS or Latin1_General_BIN. Running your code under either one causes the result to be unchanged. Adding the collation qualifier to your field (which is redundant) causes the result to be displayed correctly.

    Heh... confirmed... and it must be how collation is distributed across formulas in SQL Server 2000 because this doesn't work...

    --===== Create and populate a test table.

    -- This is NOT a part of the solution.

    DECLARE @Demo TABLE(OriginalString VARCHAR(8000))

    INSERT INTO @Demo (OriginalString)

    SELECT ' This has multiple unknown spaces in it. ' UNION ALL

    SELECT 'So does this!' UNION ALL

    SELECT 'As does this' UNION ALL

    SELECT 'This, that, and the other thing.' UNION ALL

    SELECT 'This needs no repair.'

    --===== Reduce each group of multiple spaces to a single space

    -- for a whole table without functions, loops, or other

    -- forms of slow RBAR. In the following example, CHAR(7)

    -- is the "unlikely" character that "X" was used for in

    -- the explanation.

    SELECT REPLACE(

    REPLACE(

    REPLACE(

    LTRIM(RTRIM(OriginalString COLLATE LATIN1_GENERAL_BIN ))

    ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model

    ,CHAR(7)+' ','') --Changes the XO model to nothing

    ,CHAR(7),'') AS CleanString --Changes the remaining X's to nothing

    FROM @Demo

    WHERE CHARINDEX(' ',OriginalString) > 0

    ... but this does...--===== Create and populate a test table.

    -- This is NOT a part of the solution.

    DECLARE @Demo TABLE(OriginalString VARCHAR(8000))

    INSERT INTO @Demo (OriginalString)

    SELECT ' This has multiple unknown spaces in it. ' UNION ALL

    SELECT 'So does this!' UNION ALL

    SELECT 'As does this' UNION ALL

    SELECT 'This, that, and the other thing.' UNION ALL

    SELECT 'This needs no repair.'

    --===== Reduce each group of multiple spaces to a single space

    -- for a whole table without functions, loops, or other

    -- forms of slow RBAR. In the following example, CHAR(7)

    -- is the "unlikely" character that "X" was used for in

    -- the explanation.

    SELECT REPLACE(

    REPLACE(

    REPLACE(

    LTRIM(RTRIM(OriginalString COLLATE LATIN1_GENERAL_BIN ))

    ,' 'COLLATE LATIN1_GENERAL_BIN,' '+CHAR(7)COLLATE LATIN1_GENERAL_BIN) --Changes 2 spaces to the OX model

    ,CHAR(7)+' 'COLLATE LATIN1_GENERAL_BIN,''COLLATE LATIN1_GENERAL_BIN) --Changes the XO model to nothing

    ,CHAR(7)COLLATE LATIN1_GENERAL_BIN,''COLLATE LATIN1_GENERAL_BIN) AS CleanString --Changes the remaining X's to nothing

    FROM @Demo

    WHERE CHARINDEX(' ',OriginalString) > 0

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    By the way, once you have attached the collate qualifier to OriginalString the others are not necessary.

    In any case, in my opinion this is a bug in SQL Server 2000 or maybe CHAR(7) has no meaning in those collations.

    This bug does not exist in SQL Server 2005.

    But here Paul has discovered this performance issue which is leggitimate because it can also be demonstrated without the explicit use of the collate qualifier.

    But what about this collate qualifier? Who is using it and when? Why has collate become an issue here?

    The only time I have ever used it was when I was running something in my database and had to refer to something in another database with a different collation.

    I'm starting to get very curious about this collation thing. I understand it in principle. But now I want to know the underlying details. Maybe we should start a new thread or someone can point me to one that has banged this around a bit.

  • C# Screw (11/21/2009)


    Matt Miller (#4) (11/21/2009)


    actually there's a bg thread a few years back with Jeff and I doing the friendly footrace competition on CLR versus native T-sql solutions.

    Matt, that sounds interesting .. please could you add a link if poss...

    Thanks

    Here's one (not the one I was thinking about it, but I can't get any meaningful results from 2.5 years ago on the forums which is when the original testing was.)

    http://qa.sqlservercentral.com/Forums/Topic470379-338-1.aspx

    The original one was interesting. It came down to: complicated patterns do better in CLR, but reasonably simple patterns (like this one) do better in T-sql (no overhead penalties, etc...). Which is kind of what this one brings out as well: even on large sets, Jeff's solution edges out the CLR functions (34 second for Jeff's solution vs 43 for the best CLR code currently posted on 3M rows) with simple patterns. Of course - when you start throwing in exceptions like "leave 2 spaces after a period", then the results start getting even closer.

    Edit: just noticed your comment on the data content. That's also a good point, the kind of data you have might also dictate the "best method".

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Michael Meierruth (11/21/2009)


    Jeff,

    By the way, once you have attached the collate qualifier to OriginalString the others are not necessary.

    In any case, in my opinion this is a bug in SQL Server 2000 or maybe CHAR(7) has no meaning in those collations.

    This bug does not exist in SQL Server 2005.

    But here Paul has discovered this performance issue which is leggitimate because it can also be demonstrated without the explicit use of the collate qualifier.

    But what about this collate qualifier? Who is using it and when? Why has collate become an issue here?

    The only time I have ever used it was when I was running something in my database and had to refer to something in another database with a different collation.

    I'm starting to get very curious about this collation thing. I understand it in principle. But now I want to know the underlying details. Maybe we should start a new thread or someone can point me to one that has banged this around a bit.

    Actually, on my machine, it was necessary to put it everywhere to get it to work correctly.

    The reason why the collate qualifier has become an issue is because the binary collation runs very fast compared to some of the others and you don't need to make a whole database or server or even a table case sensitive to do it. It's a simple performance enhancement.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    When you deal with two different but similar problems, things tend to get confusing.

    Problem 1: SQL Server 2000 and the replace function's reliability

    Problem 2: SQL Server 2005 and Paul's performance issue with the replace function

    Thus you might be interested to know that

    1. on SQL Server 2000 whenever you explicitly assign a collation to your column different from your database's collation neither your replace method nor my replace method works (and note, I only deal with blanks - shouldn't that be the same in the whole universe?)

    2- on SQL Server 2005 whenever you assign Latin1_General_BIN to your column, no matter what your database collation is, it always solves the performance problem (did you hear that Paul?)

    OK, now let me really dig into this collation thing...

    PS

    Jeff, try running a performance test in SS2K using a database collation of Latin1_General_CI_AS and assigning your column the same collation attribute. I couldn't believe it...

  • Michael Meierruth (11/21/2009)


    Jeff,

    When you deal with two different but similar problems, things tend to get confusing.

    Problem 1: SQL Server 2000 and the replace function's reliability

    Problem 2: SQL Server 2005 and Paul's performance issue with the replace function

    Thus you might be interested to know that

    1. on SQL Server 2000 whenever you explicitly assign a collation to your column different from your database's collation neither your replace method nor my replace method works (and note, I only deal with blanks - shouldn't that be the same in the whole universe?)

    Ummm... you think I'm lying or what? Maybe you're talking about something different but when I put the COLLATION clause everywhere, like in the second example of what I posted above, it works just fine in SQL Server 2000. At least on my 2k box it does. And, heh... there is more than one universe... there's the 2k universe and there's the 2k5 universe. 😉

    2- on SQL Server 2005 whenever you assign Latin1_General_BIN to your column, no matter what your database collation is, it always solves the performance problem (did you hear that Paul?)

    Heh... did you try them all? Never the less, based on what I've seen it do (or not do, in this case) in SQL Server 2000, I'd recommend the same thing I recommend with every code solution... test it and make sure it's working correctly whether you're using 2k, 2k5, 2k8..... 😉

    Jeff, try running a performance test in SS2K using a database collation of Latin1_General_CI_AS and assigning your column the same collation attribute. I couldn't believe it...

    Heh... save me a minute... What's it do? Same thing as the Latin1_General_BIN?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Michael Meierruth (11/21/2009)


    2- on SQL Server 2005 whenever you assign Latin1_General_BIN to your column, no matter what your database collation is, it always solves the performance problem (did you hear that Paul?)

    Yes, I mentioned it previously, and it is the 'workaround' given in my Connect item.

    As a matter of interest, Latin1_General_BIN is somewhat slower than true code-point Latin1_General_BIN2 (2008). The original behaviour of _BIN was to compare the first character as a WCHAR, the rest was a binary comparison. This is clearly slightly broken, but the behaviour is preserved for compatibility. _BIN2 collations do the job much better.

    In answer to the question of why collation is important before: collation is not just for sort order - collation must be considered when comparing two strings as well.

    Collation handling is much better in 2005 (and therefore 2008) than 2000 if I recall correctly. I don't have 2000 installed any more so can't test the stuff directly.

    Paul

  • Paul,

    BOL doesn't explain the guts of what goes on behind collate.

    I'm no expert on collate but I have a rough feeling of what it is all about.

    This little example makes your head spin a bit at first but after a while you'll feel pretty comfortable about collate.

    Note, the letter è, e with a grave accent, sitting somehwere in the above-ascii-127 zone, is very popular in the country where I'm living, Italy, whereas I haven't got a clue where upper case E with a grave accent is used, but I thought I'd throw it in for fun.

    --drop table t1

    go

    create table t1(

    rownum int identity(1,1),

    s varchar(10)

    )

    insert into t1 values('A')

    insert into t1 values('a')

    insert into t1 values('è')

    insert into t1 values('e')

    insert into t1 values (char(200)) -- grave accented E

    insert into t1 values('E')

    select * from t1 order by s collate Latin1_General_CS_AS,rownum

    select * from t1 order by s collate Latin1_General_CI_AI,rownum

    Now that you have tried the above, and on the same subject, if your db collation is Latin1_General_CS_AS (Case Sensitive - Accent Sensitive) and you do

    declare @S varchar(10)

    set @S = 'Aaèe'

    set @S = @S collate Latin1_General_CI_AI -- (Case Insensitive - Accent Insensitive)

    select @S

    do you think the value of @S will change?

    Obviously not!

    And the same is true the other way around, i.e. you are in a database with Latin1_General_CI_AI and you do

    declare @S varchar(10)

    set @S = 'Aaèe'

    set @S = @S collate Latin1_General_CS_AS

    select @S

    From this, my thinking is that collate is an attribute of the varchar variable or varchar column. When you compare two varchars in some context, SQL Server checks to see if they have the same collate attribute and complains if they don't. This is especially true when referencing varchar table columns in other databases with a different collation from the database you're in.

    But it's not easy to understand what SQL Server does behind the scenes when it needs to involve the string's collate attribute. Let's take CS vs CI. If my database is CS and I reference a column in a CI database like so:

    where mydatabase.'A' = otherdatabase.'a' collate CI

    or so:

    where mydatabase.'a' = otherdatabase.'A' collate CI

    Will it transform 'A' into 'a' or 'a' into 'A' and on which side? I'm stuck here.

    The problem is that even in a CI database you will have both 'a' and 'A' physically present as two different characters but if you compare them in the same CI database they are equal. How does it handle this behind the scenes?

    But the whole point of this comment is why should the replace function be impacted so heavily by the world of collation, especially when all three arguments have the same collation attribute? In my opinion, replace should be collation neutral if all three arguments have the same collation attribute.

    On the other hand, there may be no such thing as a varchar collation attribute. In which case: back to square one.

Viewing 15 posts - 226 through 240 (of 425 total)

You must be logged in to reply to this topic. Login to reply