Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • dwain.c (7/4/2012)


    +1 to Chris for trying!

    I guess now I'll need to check whether I'm using the right version of DelimitedSplit8K!

    Cheers mate. It had to be done, and it was fun playing with this toy. Its performance in my little "real world" test harness is indistinguishable from the most recent DS8K.

    The plans are sufficiently different to hint that "absolute zero" has been reached - TSQL won't do the job any faster.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Jeff,

    I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

    Can you point me to the page in the discussion where it is or repost it?

    Thanks.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/6/2012)


    Jeff,

    I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

    Can you point me to the page in the discussion where it is or repost it?

    Thanks.

    The alternatives from the discussion have been made part of the test set. You can find the test set in de article under the header resources (all the way to the bottom).

    There are several zip files there, the one you seek is called Build all objects and run all tests.zip. The file in this zip containing all contributions is called Updated Build all objects and run all tests.sql

  • peter-757102 (7/6/2012)


    dwain.c (7/6/2012)


    Jeff,

    I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

    Can you point me to the page in the discussion where it is or repost it?

    Thanks.

    The alternatives coming from the discussion have been made part of the test set.

    You can find the test set in de article under the header resources (all the way to the bottom).

    There are several zip files there, the one you seek is called Build all objects and run all tests.zip

    The file in this zip containing all contributions is called Updated Build all objects and run all tests.sql

    Cool! Thanks Peter.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • ChrisM@Work (7/5/2012)


    The plans are sufficiently different to hint that "absolute zero" has been reached - TSQL won't do the job any faster.

    I wish I could take the credit for that but it was the "team" effort over many years (see the revision history in the code if you want to know more) culminated with the help of a couple of sharp folks in this very discussion. My hat is off to the community at large.

    And thank you for trying once more. I'm not sure that "absolute zero" has actually been reached and your idea still has strong and logical merit. Someday, someone will look at that idea and find a way to implement it with great speed.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • peter-757102 (7/6/2012)


    dwain.c (7/6/2012)


    Jeff,

    I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

    Can you point me to the page in the discussion where it is or repost it?

    Thanks.

    The alternatives from the discussion have been made part of the test set. You can find the test set in de article under the header resources (all the way to the bottom).

    There are several zip files there, the one you seek is called Build all objects and run all tests.zip. The file in this zip containing all contributions is called Updated Build all objects and run all tests.sql

    Thank you for the "cover", Peter.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    Here are my functions

    CREATE function xmlify(@str varchar(max),@d varchar(1))

    returns xml

    begin

    return '<r>'+replace(@str,@d,'</r><r>')+'</r>'

    end

    CREATE FUNCTION xmlSplit

    (

    @STR VARCHAR(max),

    @d VARCHAR(1)

    )

    RETURNS table

    as

    return(

    SELECT Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (select xml=dbo.xmlify(@str,@d)) cte

    cross apply cte.xml.nodes('r') x(i)

    )

    ANd here is some test code I used to build an extremely large string and how long it took to split it up.

    declare @STR nvarchar(max),@start datetime,@res varchar(8000);

    select @STR=stuff((select top (1000*1000) concat(';',newid()) from sys.all_columns a,sys.all_columns b for xml path('')),1,1,'')

    select datalength(@str)

    select @start=getdate()

    select @res=Item from xmlSplit(@str,';')

    select datediff(ms,@start,getdate())

    This created a 70mb string

    Here is the timing from my machine: 11.636 sec

    Not bad at all!

  • mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    ...

    Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

    But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

    This does just that:

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/23/2012)


    mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    ...

    Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

    But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

    This does just that:

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

    Just curious, but why this ?

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Mark-101232 (8/23/2012)


    ChrisM@Work (8/23/2012)


    mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    ...

    Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

    But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

    This does just that:

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

    Just curious, but why this ?

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    Try with and without, Mark. A single row is streamed, as mburbea (8/22/2012) points out. Two rows causes complete conversion of the expression to xml.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (8/23/2012)


    Mark-101232 (8/23/2012)


    ChrisM@Work (8/23/2012)


    mburbea (8/22/2012)


    While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

    The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

    ...

    Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

    But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

    This does just that:

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

    Just curious, but why this ?

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    Try with and without, Mark. A single row is streamed, as mburbea (8/22/2012) points out. Two rows causes complete conversion of the expression to xml.

    Amazing!

    PRINT '========== Multi Row - Not Streamed ==========================================='

    DECLARE @STR VARCHAR(8000)

    DECLARE @ItemID VARCHAR(8000)

    DECLARE @Item VARCHAR(8000)

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = REPLICATE(@str,32)

    SET STATISTICS TIME ON

    SELECT

    @ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    @Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    SET STATISTICS TIME OFF;

    GO

    PRINT '========== Single Row - Streamed ==========================================='

    DECLARE @STR VARCHAR(8000)

    DECLARE @ItemID VARCHAR(8000)

    DECLARE @Item VARCHAR(8000)

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = REPLICATE(@str,32)

    SET STATISTICS TIME ON

    SELECT

    @ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    @Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    --****UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    SET STATISTICS TIME OFF;

    GO

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Great thinking pattern !

    Marvelous results !

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

    My test code I'm using DateTime2 for the greater accuracy

    DECLARE @STR VARCHAR(8000)

    DECLARE @ItemID VARCHAR(8000)

    DECLARE @Item VARCHAR(8000)

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

    SET @STR = REPLICATE(@str,32)

    DECLARE @start DATETIME2

    DECLARE @end DATETIME2

    DECLARE @start2 datetime2

    DECLARE @end2 datetime2

    SELECT @start=SYSDATETIME()

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    SELECT @end= SYSDATETIME()

    SELECT @start2=SYSDATETIME()

    SELECT

    *

    FROM

    dbo.DelimitedSplit8K(@str,';')

    SELECT @end2= SYSDATETIME()

    SELECT

    'xml',

    DATEDIFF(ms,@start,@end)

    UNION ALL

    SELECT

    'delimitedsplit8k',

    DATEDIFF(ms,@start2,@end2)

    Results after 100 runs (in ms)

    delimitedsplit8k 98

    xml 115

    Looks like a tie to me.

    Ran another 100 samples and it comes in as

    delimitedsplit8k 83

    xml 152

    not as much of a tie... but 152 ms is pretty handy. What I find interesting is that the Delimitedsplit8k seems to be more consistent. Or weirder yet it keeps getting faster 🙂

    Ran yet another 100 samples and it comes in as

    delimitedsplit8k 69

    xml 155

    and again

    delimitedsplit8k 36

    xml 159

    So that's 400 samples I've done...

    delimitedsplit8k 71 ms

    xml 145 ms

    I did this without clearing any part of the plan or data cache after I had run both the delimitedsplit8k udf and the xml method.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • Realized I skipped the xml function tricky thing.

    First run 100 times

    runtype durationms

    delimitedsplit8k 89

    xml 99

    xmlfunction 165

    Second run 100 times

    runtype durationms

    delimitedsplit8k 29

    xml 151

    xmlfunction 252

    CREATE TABLE #Results(

    runtype VARCHAR(50),

    durationinms INT)

    TRUNCATE TABLE #results

    DECLARE @count INT=0

    WHILE @count < 100

    BEGIN

    DECLARE @STR VARCHAR(8000)

    DECLARE @ItemID VARCHAR(8000)

    DECLARE @Item VARCHAR(8000)

    SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z'

    SET @STR = REPLICATE(@str,32)

    DECLARE @start DATETIME2

    DECLARE @end DATETIME2

    DECLARE @start2 datetime2

    DECLARE @end2 datetime2

    DECLARE @start3 datetime2

    DECLARE @end3 datetime2

    SELECT @start=SYSDATETIME()

    SELECT

    ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

    Item = x.i.value('text()[1]', 'varchar(8000)')

    FROM (

    SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

    UNION ALL SELECT NULL -- eliminated by CROSS APPLY

    ) cte (xmlstring)

    CROSS APPLY cte.xmlstring.nodes('r') x(i)

    SELECT @end= SYSDATETIME()

    SELECT @start2=SYSDATETIME()

    SELECT

    *

    FROM

    dbo.DelimitedSplit8K(@str,';')

    SELECT @end2= SYSDATETIME()

    SELECT @start3=SYSDATETIME()

    select * from xmlSplit(@str,';')

    SELECT @end3= SYSDATETIME()

    INSERT #results

    SELECT

    'xml',

    DATEDIFF(ms,@start,@end)

    UNION ALL

    SELECT

    'delimitedsplit8k',

    DATEDIFF(ms,@start2,@end2)

    UNION ALL

    SELECT

    'xmlfunction',

    DATEDIFF(ms,@start3,@end3)

    SET @count+=1

    END

    go

    SELECT

    runtype,

    AVG(durationinms)

    FROM #results

    GROUP BY

    runtype



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • mtassin (8/23/2012)


    I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

    ....

    Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").

    It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 15 posts - 346 through 360 (of 981 total)

You must be logged in to reply to this topic. Login to reply