Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Lynn Pettis (12/30/2012)


    John Hardin (12/30/2012)


    L' Eomot Inversé (12/30/2012)


    Jeff Moden (12/28/2012)


    If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.

    Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful.

    Don't become so blinded by CTEs that you forget about temporary tables or table-type variables; another problem with CTEs is they are not indexed.

    On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

    Page 27 passed on how many posts per page? I ask as I display 50 posts per page. It would help if you posted the url of your post so we could go directly to that post.

    D'oh! I didn't consider non-default posts-per-page settings. Sorry. Here's the direct link: http://qa.sqlservercentral.com/Forums/FindPost1129283.aspx

    The second code block is what I am referring to.

  • L' Eomot Inversé (12/30/2012)


    Why can't the optimiser look and see if spooling would enhance performance (only in the cases where it make no semantic difference, of course) and use it when it does? Or if that's too difficult a task for the optimiser (it shouldn't be, but they haven't done it so I can imagine them claiming it is) why can't we have a query hint that tells them to do it? I hope Paul or someone will jump in and explain all this - maybe tell me I've got it all wrong, but that's OK too, I like learning.

    You're right that the optimizer doesn't consider CTE materialization today (they are strictly in-line view definitions, expanded once per reference) and there is no hint for materialization either.

    There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

    Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

  • Tony.l (12/29/2012)


    Hi

    unbelievable - so good. I have been doing these puter things for 30 years now and i am still amazed how little i know or probably at just how clever other people are. Really well explained and totally useful.

    keep up the good work.

    APL. My initials not the language.

    Thanks for the great feedback, Tony. I feel the same way. I've been working with computers in one form or another since 1968 and I'm still learning something new about them everyday.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • John Hardin (12/30/2012)


    On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

    You'll have to forgive me a bit on that. If you look at even the most recent pages of this thread, there are still people saying "Try this" or "How about this"? Even you said...

    How about this to get rid of the CHARINDEX() string operation:

    {snip}

    ...how does that affect performance?

    Two days later (lots of water had flowed under the bridge by then and we're not notified by edits), you added...

    edited to add: Actually performance goes down a bit because the subquery in making cteStartEnd is not indexed. This version does exhibit performance gains in my very modest testing:

    ...and followed that up with...

    (Note: I'm doing my version as a SP so the above exact code hasn't actually been tested by me...)

    I just didn't (and still don't) have the time to test everyone's suggestion. That's why I spent so much time on building a test harness for everyone to use.

    If you think your method has merit in the area of performance, please test it and post the test results. Who knows? You might be on to something.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • SQL Kiwi (12/30/2012)


    There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

    Well, I've added my vote to that. It will probably make no difference, but that isn't an excuse for not doing so whem I think it's right.

    Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

    For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

    Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to qa.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

    Tom

  • L' Eomot Inversé (12/31/2012)


    For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

    Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to qa.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

    Thanks Tom, I have corrected the link in the original post.

  • Nice update to this article Jeff!

    I especially liked your mysteriously labelled black line (????). Nice trick with the 0 based Tally table too - will need to remember that one.

    I have already replaced the old DelimitedSplit8K in my sandbox and now I must seek out other places where it may have been used and do the same.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Interesting stuff, thank you.

    Table type parameters for stored procedures have eliminated much of my need for doing this within SQL, but it's always nice to have options.

  • Jeff or Paul:

    I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

    Paul, would you mind sharing that with us? I would love to see its guts.

    Kurt

  • klini (1/3/2013)


    I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

    It's in one of the zip files at the bottom of the original article: Tally OH! An Improved SQL 8K “CSV Splitter” Function[/url]

     

  • hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

    IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

    DROP FUNCTION [dbo].[DelimitedSplit8K]

    go

    CREATE FUNCTION [dbo].[DelimitedSplit8K]

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , cteStart(N1,L1) AS (

    SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

    UNION ALL

    SELECT t.N+1

    , ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

    )

    --select * from cteStart

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteStart l

  • telcogod (1/4/2013)


    hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

    IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

    DROP FUNCTION [dbo].[DelimitedSplit8K]

    go

    CREATE FUNCTION [dbo].[DelimitedSplit8K]

    (@pString VARCHAR(8000), @pDelimiter CHAR(1))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

    -- enough to cover NVARCHAR(4000)

    WITH E1(N) AS (

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

    ), --10E+1 or 10 rows

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

    -- for both a performance gain and prevention of accidental "overruns"

    SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , cteStart(N1,L1) AS (

    SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

    UNION ALL

    SELECT t.N+1

    , ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

    FROM cteTally t

    WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

    )

    --select * from cteStart

    SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

    Item = SUBSTRING(@pString, l.N1, l.L1)

    FROM cteStart l

    Pretty good performance compared to DelimitedSplit8K! However, when running the tests using Jeff Moden's test harness the function failed whenever there was only one element and was still not quite as fast. Very close in performance though, so the change you made didn't really make much difference.

    RowNum SplitterName NumberOfRows NumberOfElements MinElementLength MaxElementLength Duration MinLength AvgLength MaxLength

    1 DelimitedSplit8K 1000 1 1 10 0.12300 1 5 10

    2 DelimitedSplit_TELCO 1000 1 1 10 0.07600 1 5 10

    3 DelimitedSplit8K 1000 2 1 10 0.15300 3 11 21

    4 DelimitedSplit_TELCO 1000 2 1 10 0.18600 3 11 21

    5 DelimitedSplit8K 1000 4 1 10 0.26000 9 24 43

    6 DelimitedSplit_TELCO 1000 4 1 10 0.15000 9 24 43

    7 DelimitedSplit8K 1000 8 1 10 0.33000 21 50 73

    8 DelimitedSplit_TELCO 1000 8 1 10 0.13000 21 50 73

    9 DelimitedSplit8K 1000 16 1 10 0.34600 64 102 140

    10 DelimitedSplit_TELCO 1000 16 1 10 0.46600 64 102 140

    11 DelimitedSplit8K 1000 32 1 10 0.33300 158 206 268

    12 DelimitedSplit_TELCO 1000 32 1 10 0.41600 158 206 268

    13 DelimitedSplit8K 1000 64 1 10 0.53600 349 415 511

    14 DelimitedSplit_TELCO 1000 64 1 10 0.48600 349 415 511

    15 DelimitedSplit8K 1000 128 1 10 0.87600 701 830 920

    16 DelimitedSplit_TELCO 1000 128 1 10 0.96600 701 830 920

    17 DelimitedSplit8K 1000 256 1 10 1.59600 1530 1663 1835

    18 DelimitedSplit_TELCO 1000 256 1 10 1.96000 1530 1663 1835

    19 DelimitedSplit8K 1000 512 1 10 3.09300 3075 3323 3541

    20 DelimitedSplit_TELCO 1000 512 1 10 3.32300 3075 3323 3541

    21 DelimitedSplit8K 1000 1150 1 10 7.23000 7141 7466 7790

    22 DelimitedSplit_TELCO 1000 1150 1 10 8.41600 7141 7466 7790

    23 DelimitedSplit8K 1000 1 10 20 0.01000 10 14 20

    24 DelimitedSplit_TELCO 1000 1 10 20 0.01000 10 14 20

    25 DelimitedSplit8K 1000 2 10 20 0.03600 21 30 41

    26 DelimitedSplit_TELCO 1000 2 10 20 0.05300 21 30 41

    27 DelimitedSplit8K 1000 4 10 20 0.10000 47 62 82

    28 DelimitedSplit_TELCO 1000 4 10 20 0.08000 47 62 82

    29 DelimitedSplit8K 1000 8 10 20 0.12000 93 126 155

    30 DelimitedSplit_TELCO 1000 8 10 20 0.12600 93 126 155

    31 DelimitedSplit8K 1000 16 10 20 0.20000 219 254 292

    32 DelimitedSplit_TELCO 1000 16 10 20 0.23600 219 254 292

    33 DelimitedSplit8K 1000 32 10 20 0.36000 458 511 559

    34 DelimitedSplit_TELCO 1000 32 10 20 0.39300 458 511 559

    35 DelimitedSplit8K 1000 64 10 20 0.71000 931 1021 1099

    36 DelimitedSplit_TELCO 1000 64 10 20 0.76000 931 1021 1099

    37 DelimitedSplit8K 1000 128 10 20 1.45300 1943 2047 2158

    38 DelimitedSplit_TELCO 1000 128 10 20 1.48300 1943 2047 2158

    39 DelimitedSplit8K 1000 256 10 20 2.92000 3946 4095 4238

    40 DelimitedSplit_TELCO 1000 256 10 20 2.98300 3946 4095 4238

    41 DelimitedSplit8K 1000 480 10 20 5.37000 7437 7676 7900

    42 DelimitedSplit_TELCO 1000 480 10 20 5.66300 7437 7676 7900

    43 DelimitedSplit8K 1000 1 20 30 0.05000 20 25 30

    44 DelimitedSplit_TELCO 1000 1 20 30 0.01000 20 25 30

    45 DelimitedSplit8K 1000 2 20 30 0.02300 41 51 61

    46 DelimitedSplit_TELCO 1000 2 20 30 0.05600 41 51 61

    47 DelimitedSplit8K 1000 4 20 30 0.07000 84 102 120

    48 DelimitedSplit_TELCO 1000 4 20 30 0.10300 84 102 120

    49 DelimitedSplit8K 1000 8 20 30 0.13600 179 206 232

    50 DelimitedSplit_TELCO 1000 8 20 30 0.16300 179 206 232

    51 DelimitedSplit8K 1000 16 20 30 0.26600 372 414 449

    52 DelimitedSplit_TELCO 1000 16 20 30 0.32600 372 414 449

    53 DelimitedSplit8K 1000 32 20 30 0.52600 768 830 896

    54 DelimitedSplit_TELCO 1000 32 20 30 0.57000 768 830 896

    55 DelimitedSplit8K 1000 64 20 30 1.04300 1593 1663 1733

    56 DelimitedSplit_TELCO 1000 64 20 30 1.09600 1593 1663 1733

    57 DelimitedSplit8K 1000 128 20 30 2.09000 3182 3328 3441

    58 DelimitedSplit_TELCO 1000 128 20 30 2.20000 3182 3328 3441

    59 DelimitedSplit8K 1000 256 20 30 4.24000 6497 6654 6803

    60 DelimitedSplit_TELCO 1000 256 20 30 4.30300 6497 6654 6803

    61 DelimitedSplit8K 1000 290 20 30 4.75300 7331 7538 7696

    62 DelimitedSplit_TELCO 1000 290 20 30 4.88600 7331 7538 7696

    63 DelimitedSplit8K 1000 1 30 40 0.01300 30 34 40

    64 DelimitedSplit_TELCO 1000 1 30 40 0.01000 30 34 40

    65 DelimitedSplit8K 1000 2 30 40 0.02600 61 71 81

    66 DelimitedSplit_TELCO 1000 2 30 40 0.03000 61 71 81

    67 DelimitedSplit8K 1000 4 30 40 0.09300 126 143 161

    68 DelimitedSplit_TELCO 1000 4 30 40 0.12300 126 143 161

    69 DelimitedSplit8K 1000 8 30 40 0.18300 263 287 315

    70 DelimitedSplit_TELCO 1000 8 30 40 0.23300 263 287 315

    71 DelimitedSplit8K 1000 16 30 40 0.35300 529 575 619

    72 DelimitedSplit_TELCO 1000 16 30 40 0.38300 529 575 619

    73 DelimitedSplit8K 1000 32 30 40 0.69300 1098 1150 1212

    74 DelimitedSplit_TELCO 1000 32 30 40 0.73300 1098 1150 1212

    75 DelimitedSplit8K 1000 64 30 40 1.38300 2219 2301 2391

    76 DelimitedSplit_TELCO 1000 64 30 40 1.43300 2219 2301 2391

    77 DelimitedSplit8K 1000 128 30 40 2.87300 4496 4605 4736

    78 DelimitedSplit_TELCO 1000 128 30 40 2.83300 4496 4605 4736

    79 DelimitedSplit8K 1000 210 30 40 4.57600 7412 7561 7710

    80 DelimitedSplit_TELCO 1000 210 30 40 4.67000 7412 7561 7710

    81 DelimitedSplit8K 1000 1 40 50 0.01600 40 44 50

    82 DelimitedSplit_TELCO 1000 1 40 50 0.01000 40 44 50

    83 DelimitedSplit8K 1000 2 40 50 0.04300 81 91 101

    84 DelimitedSplit_TELCO 1000 2 40 50 0.03000 81 91 101

    85 DelimitedSplit8K 1000 4 40 50 0.14000 167 182 201

    86 DelimitedSplit_TELCO 1000 4 40 50 0.14300 167 182 201

    87 DelimitedSplit8K 1000 8 40 50 0.22000 342 366 395

    88 DelimitedSplit_TELCO 1000 8 40 50 0.25300 342 366 395

    89 DelimitedSplit8K 1000 16 40 50 0.43600 693 735 779

    90 DelimitedSplit_TELCO 1000 16 40 50 0.49000 693 735 779

    91 DelimitedSplit8K 1000 32 40 50 0.86000 1399 1470 1530

    92 DelimitedSplit_TELCO 1000 32 40 50 0.91300 1399 1470 1530

    93 DelimitedSplit8K 1000 64 40 50 1.71600 2852 2942 3022

    94 DelimitedSplit_TELCO 1000 64 40 50 1.81300 2852 2942 3022

    95 DelimitedSplit8K 1000 128 40 50 3.56600 5767 5888 6012

    96 DelimitedSplit_TELCO 1000 128 40 50 3.59000 5767 5888 6012

    97 DelimitedSplit8K 1000 165 40 50 4.44600 7465 7590 7717

    98 DelimitedSplit_TELCO 1000 165 40 50 4.55000 7465 7590 7717

     

  • ah well, I have seen that before where you think removing the extra select will improve performance but it doesn't. Anyway I am working on a look ma no ddl version as well. This is brilliant stuff I always knew had to be a better way than select substring(string,charindex(delimiter,string), substring(string(charindex(delimiter,string,(charindex(delimiter,string))) etc. Thanks for checking I do not have a sandbox at work where is ok to freeproccache.

  • SQL Kiwi (12/30/2012)


    Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

    More generalized matching is exactly what I always wanted, and would hate to see a query hint specific for subexpression reuse. CTE's expanding as they do now is logical as on each site, the "inline view" defined in the CTE can be used differently and thus must be optimised seperately as well. A more generalised optimisation as in the paper will exploit the similarities still found afer that process.

    As for the paper, I only quickly scanned it to see what they were up to, and hope their table signatures are based on what is there after expanding contributing code to a query/batch. You want CTE's and non inline views expanded to have a wide as possible base for finding matches. Thus the optimisation should be 100% seperate from the SQL code and no query hint should be wished for!

  • The performance of this is amazing, as is the thought behind the code. These kind of posts really make you think and learn a whole new set of features in SQL

    Thanks Jeff.

    A couple of questions (sorry if it has been addressed - I didn't see it mentioned in the thread):

    I would not expect a record back if the string being sent in is NULL, a single space or all spaces. In my case, I have records where the string being split is often blank and I would want no records coming back from the CROSS APPLY. I could filter them out using in the WHERE clause. I fixed this by adding LTRIM and RTRIM to the source string in the function.

    Also, some of my "not so good" data also had multiple spaces in between the values causing numerous records returned that were blank.

    For example:

    dbo.DelimitedSplit8k('123 467 789', ' ') -- There are two spaces between 467 and 789

    ItemNumber Item

    1 123

    2 467

    3 <-- Empty String

    4 789

    Thanks Again,

    Anton

Viewing 15 posts - 466 through 480 (of 981 total)

You must be logged in to reply to this topic. Login to reply