Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Post reply

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 346 through 360 (of 981 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

dwain.c (7/4/2012)

+1 to Chris for trying!

I guess now I'll need to check whether I'm using the right version of DelimitedSplit8K!

Cheers mate. It had to be done, and it was fun playing with this toy. Its performance in my little "real world" test harness is indistinguishable from the most recent DS8K.

The plans are sufficiently different to hint that "absolute zero" has been reached - TSQL won't do the job any faster.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Dwain Camps SSC Guru Points: 86913 More actions · Answer 2

Jeff,

I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

Can you point me to the page in the discussion where it is or repost it?

Thanks.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

peter-757102 SSCertifiable Points: 6877 More actions · Answer 3

dwain.c (7/6/2012)

Jeff,

I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

Can you point me to the page in the discussion where it is or repost it?

Thanks.

The alternatives from the discussion have been made part of the test set. You can find the test set in de article under the header resources (all the way to the bottom).

There are several zip files there, the one you seek is called Build all objects and run all tests.zip. The file in this zip containing all contributions is called Updated Build all objects and run all tests.sql

Dwain Camps SSC Guru Points: 86913 More actions · Answer 4

peter-757102 (7/6/2012)

dwain.c (7/6/2012)

Jeff,

I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

Can you point me to the page in the discussion where it is or repost it?

Thanks.

The alternatives coming from the discussion have been made part of the test set.

You can find the test set in de article under the header resources (all the way to the bottom).

There are several zip files there, the one you seek is called Build all objects and run all tests.zip

The file in this zip containing all contributions is called Updated Build all objects and run all tests.sql

Cool! Thanks Peter.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 5

ChrisM@Work (7/5/2012)

The plans are sufficiently different to hint that "absolute zero" has been reached - TSQL won't do the job any faster.

I wish I could take the credit for that but it was the "team" effort over many years (see the revision history in the code if you want to know more) culminated with the help of a couple of sharp folks in this very discussion. My hat is off to the community at large.

And thank you for trying once more. I'm not sure that "absolute zero" has actually been reached and your idea still has strong and logical merit. Someday, someone will look at that idea and find a way to implement it with great speed.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 6

peter-757102 (7/6/2012)

dwain.c (7/6/2012)

Jeff,

I just scanned all 35 pages of this discussion thread seeking the revised string splitter but unfortunately was unable to find it posted by you.

Can you point me to the page in the discussion where it is or repost it?

Thanks.

The alternatives from the discussion have been made part of the test set. You can find the test set in de article under the header resources (all the way to the bottom).

There are several zip files there, the one you seek is called Build all objects and run all tests.zip. The file in this zip containing all contributions is called Updated Build all objects and run all tests.sql

Thank you for the "cover", Peter.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

mburbea Right there with Babe Points: 781 More actions · Answer 7

While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

Here are my functions



CREATE function xmlify(@str varchar(max),@d varchar(1))

returns xml

begin

	return '<r>'+replace(@str,@d,'</r><r>')+'</r>'

end

CREATE FUNCTION xmlSplit

(

      @STR VARCHAR(max),

      @d VARCHAR(1)

)

RETURNS table

as

return(

SELECT  Item  = x.i.value('text()[1]', 'varchar(8000)')

           FROM (select xml=dbo.xmlify(@str,@d)) cte

           cross apply cte.xml.nodes('r') x(i)

)

ANd here is some test code I used to build an extremely large string and how long it took to split it up.



declare @STR nvarchar(max),@start datetime,@res varchar(8000);

select @STR=stuff((select top (1000*1000) concat(';',newid()) from sys.all_columns a,sys.all_columns b for xml path('')),1,1,'')

select datalength(@str)

select @start=getdate()

select @res=Item from xmlSplit(@str,';')

select datediff(ms,@start,getdate())

This created a 70mb string

Here is the timing from my machine: 11.636 sec

Not bad at all!

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

mburbea (8/22/2012)

While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

...

Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

This does just that:

SELECT



	ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

	Item = x.i.value('text()[1]', 'varchar(8000)')

FROM (

	SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

	UNION ALL SELECT NULL -- eliminated by CROSS APPLY

) cte (xmlstring)

CROSS APPLY cte.xmlstring.nodes('r') x(i)

I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Mark Cowne One Orange Chip Points: 26865 More actions · Answer 9

ChrisM@Work (8/23/2012)

mburbea (8/22/2012)

While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

...

Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

This does just that:

SELECT
ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), Item = x.i.value('text()[1]', 'varchar(8000)') FROM ( SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML) UNION ALL SELECT NULL -- eliminated by CROSS APPLY ) cte (xmlstring)
CROSS APPLY cte.xmlstring.nodes('r') x(i)

I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

Just curious, but why this ?

UNION ALL SELECT NULL -- eliminated by CROSS APPLY

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 10

Mark-101232 (8/23/2012)

ChrisM@Work (8/23/2012)

mburbea (8/22/2012)

While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

...

Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

This does just that:

SELECT
ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), Item = x.i.value('text()[1]', 'varchar(8000)') FROM ( SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML) UNION ALL SELECT NULL -- eliminated by CROSS APPLY ) cte (xmlstring)
CROSS APPLY cte.xmlstring.nodes('r') x(i)

I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

Just curious, but why this ?

UNION ALL SELECT NULL -- eliminated by CROSS APPLY

Try with and without, Mark. A single row is streamed, as mburbea (8/22/2012) points out. Two rows causes complete conversion of the expression to xml.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Mark Cowne One Orange Chip Points: 26865 More actions · Answer 11

ChrisM@Work (8/23/2012)

Mark-101232 (8/23/2012)

ChrisM@Work (8/23/2012)

mburbea (8/22/2012)

While this article is a bit old I did want to comment on a trick that I've discovered to build an itvf for a xml splitter. This will perform better than the tally based splitter for most strings.

The basic trick is to create an intermediate UDF. This will force sql server to calculate the xml result first, instead of the usual bit where if you try to itvf it starts streaming and thus recalculating the result once it has to start chunking. (Last I heard it was every 64 bytes,yikes).

...

Brilliant thinking - and it doesn't half work, around 75-125 x faster than without the udf (when tested on single strings of different sizes).

But what if you want to force the datatype conversion without the limitations of a udf, and gain ItemID to boot?

This does just that:

SELECT
ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), Item = x.i.value('text()[1]', 'varchar(8000)') FROM ( SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML) UNION ALL SELECT NULL -- eliminated by CROSS APPLY ) cte (xmlstring)
CROSS APPLY cte.xmlstring.nodes('r') x(i)

I wonder how well this works against a large number of rows, as opposed to a large number of characters in a string? Might find time later.

Just curious, but why this ?

UNION ALL SELECT NULL -- eliminated by CROSS APPLY

Try with and without, Mark. A single row is streamed, as mburbea (8/22/2012) points out. Two rows causes complete conversion of the expression to xml.

Amazing!

PRINT '========== Multi Row - Not Streamed ==========================================='



DECLARE @STR VARCHAR(8000)

DECLARE @ItemID VARCHAR(8000)

DECLARE @Item VARCHAR(8000)

SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

SET @STR = REPLICATE(@str,32)

SET STATISTICS TIME ON

SELECT

	@ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

	@Item = x.i.value('text()[1]', 'varchar(8000)')

FROM (

	SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

	UNION ALL SELECT NULL -- eliminated by CROSS APPLY

) cte (xmlstring)

CROSS APPLY cte.xmlstring.nodes('r') x(i)

SET STATISTICS TIME OFF;

GO

PRINT '========== Single Row - Streamed ==========================================='

DECLARE @STR VARCHAR(8000)

DECLARE @ItemID VARCHAR(8000)

DECLARE @Item VARCHAR(8000)

SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

SET @STR = REPLICATE(@str,32)

SET STATISTICS TIME ON

SELECT

	@ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

	@Item = x.i.value('text()[1]', 'varchar(8000)')

FROM (

	SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

	--****UNION ALL SELECT NULL -- eliminated by CROSS APPLY

) cte (xmlstring)

CROSS APPLY cte.xmlstring.nodes('r') x(i)

SET STATISTICS TIME OFF;

GO

____________________________________________________

Deja View - The strange feeling that somewhere, sometime you've optimised this query before

How to get the best help on a forum

http://www.sqlservercentral.com/articles/Best+Practices/61537

Johan Bijnens SSC Guru Points: 134873 More actions · Answer 12

Great thinking pattern !

Marvelous results !

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me

mtassin SSC-Insane Points: 23099 More actions · Answer 13

I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

My test code I'm using DateTime2 for the greater accuracy



DECLARE @STR VARCHAR(8000)

DECLARE @ItemID VARCHAR(8000)

DECLARE @Item VARCHAR(8000)

SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;'

SET @STR = REPLICATE(@str,32)

DECLARE @start DATETIME2

DECLARE @end DATETIME2

DECLARE @start2 datetime2

DECLARE @end2 datetime2

SELECT @start=SYSDATETIME()

SELECT

	ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

	Item = x.i.value('text()[1]', 'varchar(8000)')

FROM (

	SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

	UNION ALL SELECT NULL -- eliminated by CROSS APPLY

) cte (xmlstring)

CROSS APPLY cte.xmlstring.nodes('r') x(i)

SELECT @end= SYSDATETIME()

SELECT @start2=SYSDATETIME()

SELECT

*

FROM

dbo.DelimitedSplit8K(@str,';')

SELECT @end2= SYSDATETIME()

SELECT

'xml',

DATEDIFF(ms,@start,@end)

UNION ALL

SELECT

'delimitedsplit8k',

DATEDIFF(ms,@start2,@end2)

Results after 100 runs (in ms)

delimitedsplit8k 98

xml 115

Looks like a tie to me.

Ran another 100 samples and it comes in as

delimitedsplit8k 83

xml 152

not as much of a tie... but 152 ms is pretty handy. What I find interesting is that the Delimitedsplit8k seems to be more consistent. Or weirder yet it keeps getting faster 🙂

Ran yet another 100 samples and it comes in as

delimitedsplit8k 69

xml 155

and again

delimitedsplit8k 36

xml 159

So that's 400 samples I've done...

delimitedsplit8k 71 ms

xml 145 ms

I did this without clearing any part of the plan or data cache after I had run both the delimitedsplit8k udf and the xml method.

--Mark Tassin
MCITP - SQL Server DBA
Proud member of the Anti-RBAR alliance.
For help with Performance click this link[/url]
For tips on how to post your problems[/url]

mtassin SSC-Insane Points: 23099 More actions · Answer 14

Realized I skipped the xml function tricky thing.

First run 100 times

runtype durationms

delimitedsplit8k 89

xml 99

xmlfunction 165

Second run 100 times

runtype durationms

delimitedsplit8k 29

xml 151

xmlfunction 252



CREATE TABLE #Results(

runtype VARCHAR(50),

durationinms INT)

TRUNCATE TABLE #results

DECLARE @count INT=0

WHILE @count < 100

BEGIN

DECLARE @STR VARCHAR(8000)

DECLARE @ItemID VARCHAR(8000)

DECLARE @Item VARCHAR(8000)

SET @STR = 'a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z'

SET @STR = REPLICATE(@str,32)

DECLARE @start DATETIME2

DECLARE @end DATETIME2

DECLARE @start2 datetime2

DECLARE @end2 datetime2

DECLARE @start3 datetime2

DECLARE @end3 datetime2

SELECT @start=SYSDATETIME()

SELECT

	ItemID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),

	Item = x.i.value('text()[1]', 'varchar(8000)')

FROM (

	SELECT Stringy = CAST('<r>' + REPLACE(@str,';','</r><r>')+'</r>' AS XML)

	UNION ALL SELECT NULL -- eliminated by CROSS APPLY

) cte (xmlstring)

CROSS APPLY cte.xmlstring.nodes('r') x(i)

SELECT @end= SYSDATETIME()

SELECT @start2=SYSDATETIME()

SELECT

*

FROM

dbo.DelimitedSplit8K(@str,';')

SELECT @end2= SYSDATETIME()

SELECT @start3=SYSDATETIME()

select * from xmlSplit(@str,';')

SELECT @end3= SYSDATETIME()

INSERT #results

SELECT

'xml',

DATEDIFF(ms,@start,@end)

UNION ALL

SELECT

'delimitedsplit8k',

DATEDIFF(ms,@start2,@end2)

UNION ALL

SELECT

'xmlfunction',

DATEDIFF(ms,@start3,@end3)

SET @count+=1

END

go

SELECT

runtype,

AVG(durationinms)

FROM #results

GROUP BY

runtype

--Mark Tassin
MCITP - SQL Server DBA
Proud member of the Anti-RBAR alliance.
For help with Performance click this link[/url]
For tips on how to post your problems[/url]

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 15

mtassin (8/23/2012)

I'm not seeing the XML trouncing the Delimittedsplit8k tvf though.

....

Nor me. I've plugged it into the splitter test harness from the article attachent and on some permutations of the csv file it's dog slow. In fact, the dog appears to have missing limbs ("Take me out for a drag").

It does have a couple of plus points though - as the OP pointed out, it appears to be pretty good at super-long strings, and if you want to split a string without using a function i.e. inlined with the rest of the query, then it's really compact.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden