Splitting Strings Based on Patterns

  • dwain.c (11/30/2012)


    Looks like I've got a bit of reading and exploring to do now. Looks like a heck of a library. Gonna keep me away from writing articles for a spell I guess. But then, people are probably tired of me prattling on anyway.

    I doubt it, your articles seem to have been very well received so far. It is a very cool commercial-quality library, yes.

  • SQL Kiwi (11/30/2012)


    dwain.c (11/30/2012)


    Looks like I've got a bit of reading and exploring to do now. Looks like a heck of a library. Gonna keep me away from writing articles for a spell I guess. But then, people are probably tired of me prattling on anyway.

    I doubt it, your articles seem to have been very well received so far. It is a very cool commercial-quality library, yes.

    Tough to respond to that without sounding like a conceited schmuck. 😛 But I'll take it as a compliment.

    BTW. I've said it before and I'll say it again. Your posts to the forums are sorely missed by those of us that relish the opportunity to learn from your skills.

    Happily, I was able to learn something from you today! 😀


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • gloing40 (11/30/2012)


    very good website:

    The website wholesale for many kinds of fashion shoes, like the nike, jordan, prada, also including the shirts, bags, hat and the decorations.

    WE ACCEPT CREDIT CARD /WESTERN UNION PAYMENT

    YOU MUST NOT MISS IT!!!

    😉

    SPAM reported.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Great work.

    I have another problem along the same lines.

    Instead of having all the resultant split pieces in one column across multiple rows, how about splitting the string into multiple columns on the same row, with the constraint of a limited number of columns - say 4 or 5, depending on the requirements or the maximum number of columns expected?

    For example:

    Input string = 'abcd9efghij9kl9mno'

    Pattern = '9'

    output would be one row with multiple columns as follows:

    abcd | efghij | kl | mno

    Would appreciate the help.

  • Use a cross-tab query to pivot the results, it's generally faster than PIVOT and more intuitive too:

    SELECT
        Col1 = MAX(CASE WHEN ItemNumber = 1 THEN Item END),
        Col2 = MAX(CASE WHEN ItemNumber = 3 THEN Item END),
        Col3 = MAX(CASE WHEN ItemNumber = 5 THEN Item END),
        Col4 = MAX(CASE WHEN ItemNumber = 7 THEN Item END),
        Col5 = MAX(CASE WHEN ItemNumber = 9 THEN Item END)
    FROM [dbo].[PatternSplitCM] ('abcd9efghij9kl9mno','[^9]')
    WHERE [Matched] = 1

    Results
    -------
    Col1  Col2    Col3  Col4  Col5
    abcd  efghij  kl    mno   NULL

    Personally, with this example, I'd use DelimitedSplit8K with '9' as the delimiter instead of the pattern-matching splitter.
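    For anyone unfamiliar with that alternative, here's a minimal sketch (assuming the standard dbo.DelimitedSplit8K function is installed, with its usual ItemNumber/Item output columns):

    SELECT
        Col1 = MAX(CASE WHEN ItemNumber = 1 THEN Item END),
        Col2 = MAX(CASE WHEN ItemNumber = 2 THEN Item END),
        Col3 = MAX(CASE WHEN ItemNumber = 3 THEN Item END),
        Col4 = MAX(CASE WHEN ItemNumber = 4 THEN Item END)
    FROM dbo.DelimitedSplit8K('abcd9efghij9kl9mno', '9')

    Note that DelimitedSplit8K numbers its items consecutively (1, 2, 3, 4), whereas PatternSplitCM interleaves matched and unmatched fragments - which is why the cross-tab above picks out the odd ItemNumbers.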

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Shalinder - Thanks for dropping by and having a read.

    Chris - Thanks for the cover. Back now from PNG so I hope to catch up on some earlier promises I made on this thread.



  • Thanks Chris for a quick reply.

    It looks to me like we are breaking the string into multiple rows and then putting the components back onto the same row using the MAX function. I could be wrong, but wouldn't this make it slow? I appreciate the quick reply though.

    Is it possible to break the string into multiple columns on the same row in one shot? Something like:

    SELECT col1, col2, col3
    FROM SplitTheStringToColumns ( string, delimiter, SplitsCount )

    or maybe something like:

    SELECT
        StringExtract ( string, delimiter, index = 1 ),
        StringExtract ( string, delimiter, index = 2 ),
        StringExtract ( string, delimiter, index = 3 ),
        StringExtract ( string, delimiter, index = 4 )

  • shalinder.verma (12/5/2012)


    Thanks Chris for a quick reply.

    It looks to me like we are breaking the string into multiple rows and then putting the components back onto the same row using the MAX function. I could be wrong, but wouldn't this make it slow? I appreciate the quick reply though.

    Is it possible to break the string into multiple columns on the same row in one shot? Something like:

    SELECT col1, col2, col3
    FROM SplitTheStringToColumns ( string, delimiter, SplitsCount )

    or maybe something like:

    SELECT
        StringExtract ( string, delimiter, index = 1 ),
        StringExtract ( string, delimiter, index = 2 ),
        StringExtract ( string, delimiter, index = 3 ),
        StringExtract ( string, delimiter, index = 4 )

    It is possible, but you will have severe performance issues beyond a certain number of columns. The trick is to have the splitting action generate multiple columns directly. The first result column does a pattern search and isolates the result. The second column does the same pattern search, but uses its result as the starting point for finding the second part. This means the expression for the second column must do two searches and correctly handle the situations where there is no first part and no second part. Moving on, a third column must do 3 nested searches and cope with even more exceptions.

    You can see that the complexity quickly rises as the number of columns does. Imagine having to return 20 columns this way! The 20th column would have to do 20 nested searches, and the total would be 210 searches, nearly all of them interdependent!
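    To make that concrete, here is a hedged sketch of the direct approach for just the first two columns, with a literal '9' delimiter (the names and sample string are illustrative, not from the article). Notice how the Col2 expression has to repeat the Col1 search:

    SELECT
        Col1 = LEFT(s, ISNULL(NULLIF(CHARINDEX('9', s), 0) - 1, LEN(s))),
        Col2 = CASE WHEN CHARINDEX('9', s) > 0 -- only if a first delimiter exists
               THEN SUBSTRING(s, CHARINDEX('9', s) + 1,
                        ISNULL(NULLIF(CHARINDEX('9', s, CHARINDEX('9', s) + 1), 0),
                               LEN(s) + 1) - CHARINDEX('9', s) - 1)
               END
    FROM (SELECT s = 'abcd9efghij9kl9mno') d

    Col1 needs one search; Col2 needs that same search twice over, plus guards for the missing-delimiter cases. A Col3 would nest it yet again.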

  • Shalinder, you could use a cross-apply cascade, something like this:

    SELECT d.TheString,
        Col1 = SUBSTRING(TheString,1,p1-1),
        Col2 = SUBSTRING(TheString,p1+1,p2-p1-1),
        Col3 = SUBSTRING(TheString,p2+1,p3-p2-1),
        Col4 = SUBSTRING(TheString,p3+1,p4-p3-1)
    FROM (SELECT TheString = 'abcd9efghij9kl9mno', TheDelimiter = '9') d
    CROSS APPLY (SELECT p1 = CHARINDEX(TheDelimiter,TheString,0)) d1
    CROSS APPLY (SELECT p2 = CHARINDEX(TheDelimiter,TheString,p1+1)) d2
    CROSS APPLY (SELECT p3 = CHARINDEX(TheDelimiter,TheString,p2+1)) d3
    CROSS APPLY (SELECT p4 = ISNULL(NULLIF(CHARINDEX(TheDelimiter,TheString,p3+1),0),LEN(TheString)+1)) d4

    - which will perform well with up to five or six fragments.
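    One caveat with the cascade above, in case the input can contain fewer than three delimiters: only the d4 step guards against CHARINDEX returning 0, so a short input would feed SUBSTRING a negative length and raise an error. A hedged sketch extending the same ISNULL(NULLIF(...)) guard to every step:

    SELECT d.TheString,
        Col1 = SUBSTRING(TheString, 1, p1 - 1),
        Col2 = CASE WHEN p1 <= ln THEN SUBSTRING(TheString, p1 + 1, p2 - p1 - 1) END,
        Col3 = CASE WHEN p2 <= ln THEN SUBSTRING(TheString, p2 + 1, p3 - p2 - 1) END,
        Col4 = CASE WHEN p3 <= ln THEN SUBSTRING(TheString, p3 + 1, p4 - p3 - 1) END
    FROM (SELECT TheString = 'abcd9ef', TheDelimiter = '9') d -- only one delimiter this time
    CROSS APPLY (SELECT ln = LEN(TheString)) d0
    CROSS APPLY (SELECT p1 = ISNULL(NULLIF(CHARINDEX(TheDelimiter,TheString,1),0),ln+1)) d1
    CROSS APPLY (SELECT p2 = ISNULL(NULLIF(CHARINDEX(TheDelimiter,TheString,p1+1),0),ln+1)) d2
    CROSS APPLY (SELECT p3 = ISNULL(NULLIF(CHARINDEX(TheDelimiter,TheString,p2+1),0),ln+1)) d3
    CROSS APPLY (SELECT p4 = ISNULL(NULLIF(CHARINDEX(TheDelimiter,TheString,p3+1),0),ln+1)) d4

    This returns Col1 = 'abcd', Col2 = 'ef' and NULL for the rest, instead of an "Invalid length parameter" error.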

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • dwain.c (11/29/2012)


    dplaut 49149 (11/29/2012)


    Is there a way to download these examples, copying text out of those scrolling boxes is a pain and carries along formatting code.

    Sorry but in this article I didn't post downloadable examples except for the final test harness for DelimitedSplit8K. I'll make an attempt to do so in a couple of days, as I am traveling out of the country over the weekend on business and I'll have limited time.

    I'll try to remember to PM you once that is done.

    Belatedly, I am making good on my promise to upload scripts for all the relevant SQL in the article.



  • SQL Kiwi (11/29/2012)


    dwain.c (11/29/2012)


    fregatepallada (11/29/2012)


    IMHO this task would be better implemented as a CLR UDF rather than in pure T-SQL. A CLR UDF easily allows you to apply REGULAR EXPRESSIONS.

    So how about an example?

    I'd love to put it through its paces and compare the performance against the pure SQL version.

    Hi Dwain,

    SSC author Solomon Rutzky has the free SQL# library available at http://www.sqlsharp.com. Among the many features available in the free version running under the SAFE permission set is the RegEx_Replace function. Taking the example from the original thread and the start of this article:

    Given an input string such as: 1234ABC123456XYZ1234567890ADS, I would like to replace any string of digits that is longer than 5 characters with some other character (e.g., ‘x’), while leaving the remaining characters in the string unchanged.

    The whole solution is:

    SELECT SQL#.RegEx_Replace(N'1234ABC123456XYZ1234567890ADS', N'\d{5,}', N'x', -1, 1, NULL);

    Producing the result:

    1234ABCxXYZxADS

    There are all sorts of powerful and useful functions in this library, for example, if you want to list the matches for a regular expression (which is a much more powerful superset of the CHARINDEX and PATINDEX syntax), you can use:

    SELECT * FROM SQL#.RegEx_Matches(N'1234ABC123456XYZ1234567890ADS', N'\d{5,}', 1, NULL);

    +----------+------------+----------+--------+--------+
    | MatchNum | Value      | StartPos | EndPos | Length |
    +----------+------------+----------+--------+--------+
    | 1        | 123456     | 8        | 13     | 6      |
    | 2        | 1234567890 | 17       | 26     | 10     |
    +----------+------------+----------+--------+--------+

    Finally found some time to play around with the SQL# library, so I thought I'd try a bit of a performance test against PatternSplitCM. I constructed the test harness below, after verifying that I was giving it the right RegEx match argument and that the returned results were identical:

    CREATE TABLE #t1 (MyString VARCHAR(8000))

    DECLARE @Pattern VARCHAR(500) = '[0-9]'

    ;WITH Tally (n) AS (
        SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.all_columns a CROSS JOIN sys.all_columns b)
    INSERT INTO #t1
    SELECT REPLICATE('abc0123hij456nopq789uvw1y', 1 + ABS(CHECKSUM(NEWID())) % 315)
    FROM Tally

    DECLARE @MyString VARCHAR(8000), @ItemNumber INT, @Item VARCHAR(8000)

    PRINT 'PatternSplitCM'
    SET STATISTICS TIME ON
    SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
    --SELECT MyString, ItemNumber, Item, [Matched]
    FROM #t1
    CROSS APPLY PatternSplitCM(MyString, @Pattern)
    WHERE [Matched]=1
    SET STATISTICS TIME OFF

    PRINT 'SQL#.RegEx_Matches'
    SET STATISTICS TIME ON
    SELECT @MyString=MyString, @ItemNumber=MatchNum, @Item=Value
    --SELECT MyString, MatchNum, Value
    FROM #t1
    CROSS APPLY SQL#.RegEx_Matches(MyString, N'\d{1,}', 1, NULL);
    SET STATISTICS TIME OFF

    DROP TABLE #t1

    Of course, the RegEx match function doesn't return the unmatched strings; however, you can limit the strings returned from PatternSplitCM using the [Matched] column. Here are the results.

    (1000 row(s) affected)

    PatternSplitCM
    SQL Server Execution Times:
       CPU time = 5102 ms, elapsed time = 5876 ms.

    SQL#.RegEx_Matches
    SQL Server Execution Times:
       CPU time = 28673 ms, elapsed time = 28997 ms.

    Even though PatternSplitCM emerged victorious in the CPU/elapsed time wars, I do concede that the SQL# RegEx match will be much more flexible.

    I really like some of the other functions in this library too. Anybody that hasn't downloaded it should do so now!

    Edit: As an afterthought, I realized I may have biased the tests by using VARCHAR(8000), so I changed all the strings defined as VARCHAR(8000) to NVARCHAR(4000) [used by the SQL# function] but not the internal variable type of PatternSplitCM. This was the revised result (SQL# was now a little faster, but not enough to beat PatternSplitCM, most likely because it is not as generalized):
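    For clarity, the revision amounts to changing the declarations in the harness above along these lines (a sketch of the change described; the revised script itself wasn't posted):

    CREATE TABLE #t1 (MyString NVARCHAR(4000))
    DECLARE @Pattern NVARCHAR(500) = N'[0-9]'
    -- ...rest of the harness unchanged...
    DECLARE @MyString NVARCHAR(4000), @ItemNumber INT, @Item NVARCHAR(4000)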

    (1000 row(s) affected)

    PatternSplitCM
    SQL Server Execution Times:
       CPU time = 4009 ms, elapsed time = 4198 ms.

    SQL#.RegEx_Matches
    SQL Server Execution Times:
       CPU time = 16957 ms, elapsed time = 17392 ms.



  • Dwain,

    I see rather different performance using your test rig on SQL Server 2012 (build 3321) and SQL Server 2008 R2 (build 4266):

    +------------+---------------------+-------------------------+
    | SQL Server | PatternSplitCM (ms) | SQL#.RegEx_Matches (ms) |
    +------------+---------------------+-------------------------+
    | 2012       | 24203               | 7889                    |
    | 2008 R2    | 23019               | 24303                   |
    +------------+---------------------+-------------------------+

    Taking the 2008 R2 result first, the RegEx is slightly slower, but there is a very important reason for this: the test rig does not produce unique strings. In a typical test, the count of distinct strings generated was 303 (out of 1000 rows total). The optimizer notices this and introduces a 'performance spool' into the query plan for the PatternSplitCM execution only. This optimization introduces a sort on the MyString column above the nested loops join, and a lazy table spool below it. The effect is to group identical MyString occurrences together, so only the first of any set of duplicates is processed by the PatternSplitCM logic. For subsequent rows with the same value of MyString, the spool replays the spooled result.

    You can see this in a post-execution plan by inspecting the rewinds and rebinds counter of the table spool. Rewinds occur where the spool replays saved results; a rebind occurs when the correlated parameter (MyString) changes and the PatternSplitCM logic needs to be performed for real. In the test where there were 303 distinct MyString values, the table spool showed 303 rebinds and 697 rewinds. Likewise, all the operators below the table spool show 303 actual executions, not 1000. It is possible to disable the 'performance spool' optimization with trace flag 8690. Doing this forces PatternSplitCM to run on all 1000 input rows, degrading performance from 23019 ms to 40588 ms.

    The optimizer does not apply this same optimization to the RegEx execution, because it has no idea how expensive that function is. By default it assumes a small cost and calculates (wrongly) that the performance spool would cost more (due to the sort and spool to tempdb) than it would save by avoiding cheap-looking function executions. We can introduce the 'performance spool' optimization to the RegEx execution with trace flag 8691. Doing this reduces the number of executions of the RegEx function from 1000 to 303, with a corresponding reduction in execution time from 24303 ms to 9978 ms.
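    For anyone wanting to reproduce those observations, both trace flags can be applied per-query with the QUERYTRACEON hint (a sketch reusing #t1 and the variables from the harness above; 8690 and 8691 are undocumented flags, so treat this as investigation-only):

    -- Disable the performance spool for the PatternSplitCM query (TF 8690)
    SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
    FROM #t1
    CROSS APPLY PatternSplitCM(MyString, @Pattern)
    WHERE [Matched]=1
    OPTION (QUERYTRACEON 8690);

    -- Force a performance spool for the RegEx query (TF 8691)
    SELECT @MyString=MyString, @ItemNumber=MatchNum, @Item=Value
    FROM #t1
    CROSS APPLY SQL#.RegEx_Matches(MyString, N'\d{1,}', 1, NULL)
    OPTION (QUERYTRACEON 8691);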

    The situation with SQL Server 2012 is slightly different because returning rows from CLR streaming table-valued functions has been significantly optimized in this release. Even without the performance spool, processing all 1000 rows through the RegEx takes only 7520 ms. With a performance spool added using TF 8691, the elapsed time is reduced further to 4886 ms.

    A secondary consideration is that I see significant sort spills to tempdb when running the PatternSplitCM solution. This is easy to see with the new warning symbols in SQL Server 2012; in previous releases we need to monitor for sort warnings using Profiler. The underlying reason is the guessed 100-row cardinality from the Top operator, which is used to estimate the sort memory requirement. The strings in this test are quite long, and so require more than 100 numbers from the on-the-fly numbers table.

  • Paul,

    You have aptly demonstrated why you are an MVP and I am but a plebe. 🙂

    My mistake, using the same replicated string for the test. I knew I shouldn't - I have read about that pitfall at least a couple of times. But I didn't think to apply it here for some reason (too much holiday spirit, I suspect).

    New test harness (excuse my crude attempt at generating more unique strings, there's probably a better way) and results.

    CREATE TABLE #t1 (MyString VARCHAR(8000))

    DECLARE @Pattern VARCHAR(500) = '[0-9]'

    ;WITH Tally (n) AS (
        SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM sys.all_columns a CROSS JOIN sys.all_columns b)
    INSERT INTO #t1
    SELECT REPLICATE(
        SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16)
        + SUBSTRING('00112233445566778899', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10)
        + REVERSE(SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16))
        + REVERSE(SUBSTRING('00112233445566778899', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10))
        + SUBSTRING('JUYWXFZHQAMLKNOPIRSTBVDECG', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16)
        + SUBSTRING('01012323454567678989', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10)
        + REVERSE(SUBSTRING('JUYWXFZHQAMLKNOPIRSTBVDECG', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16))
        + REVERSE(SUBSTRING('01012323454567678989', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10))
        + SUBSTRING('AAABBBCCCDDDEEEFFFGGGHHHII', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16)
        + SUBSTRING('00011122233344455566677', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10)
        + REVERSE(SUBSTRING('AAABBBCCCDDDEEEFFFGGGHHHII', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16))
        + REVERSE(SUBSTRING('00011122233344455566677', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10))
        , 1 + ABS(CHECKSUM(NEWID())) % 75)
    --SELECT REPLICATE('abc0123hij456nopq789uvw1y', 1 + ABS(CHECKSUM(NEWID())) % 315)
    FROM Tally

    --SELECT * FROM #t1
    --SELECT MAX(DATALENGTH(MyString)) FROM #t1

    DECLARE @MyString VARCHAR(8000), @ItemNumber INT, @Item VARCHAR(8000)

    PRINT 'PatternSplitCM'
    SET STATISTICS TIME ON
    SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
    --SELECT MyString, ItemNumber, Item, [Matched]
    FROM #t1
    CROSS APPLY PatternSplitCM(MyString, @Pattern)
    WHERE [Matched]=1
    SET STATISTICS TIME OFF

    PRINT 'SQL#.RegEx_Matches'
    SET STATISTICS TIME ON
    SELECT @MyString=MyString, @ItemNumber=MatchNum, @Item=Value
    --SELECT MyString, MatchNum, Value
    FROM #t1
    CROSS APPLY SQL#.RegEx_Matches(MyString, N'\d{1,}', 1, NULL);
    SET STATISTICS TIME OFF

    DROP TABLE #t1

    (1000 row(s) affected)

    PatternSplitCM
    SQL Server Execution Times:
       CPU time = 11343 ms, elapsed time = 6576 ms.

    SQL#.RegEx_Matches
    SQL Server Execution Times:
       CPU time = 5975 ms, elapsed time = 6008 ms.

    Unfortunately, I don't have SQL 2012 to confirm your results on. Not that they need confirmation given the source.

    I find it diabolically clever that the SQL optimizer is able to identify replicated patterns in the strings when constructing the execution plan. Your analysis of that was not only spot on, but extremely easy to comprehend.

    Thanks for taking the time to look into this and correct an inappropriate conclusion.

    Happy New Year to you sir!



    Has anybody tried regex and XQuery?

  • fregatepllada (9/12/2014)


    Has anybody tried regex and XQuery?

    For this problem I personally have not. The solution is designed for folks that can't use CLR on their servers.

    Thanks for taking a look though.


