Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Jeff Moden (5/2/2011)


    mtassin (5/2/2011)


    Today I get to learn something. I figured the issue with the concatenation was that, since it was part of a WHERE clause that was re-applied as the tally table progressed through the set, the concatenation was being redone for each value of N in the tally table.

    Ah... let me make a correction here. I was talking about pre-concatenation within a function and you're talking about doing it outside the function. You are correct... pre-concatenation does definitely speed things up for the reasons you stated and allows things like the XML method to run fairly quickly and in a linear fashion. I see no reason why it wouldn't also do so for the code you offered.

    To be absolutely honest, I didn't try pre-concatenation outside of the function. It's likely going to be very fast... maybe even faster than this new function because the new function does an ISNULL/NULLIF on every element whereas your fine example does not. But, can you imagine the outcry if I settled for pre-concatenation outside of the function? :Whistling:

    What outcry? You did a little bit of work that would also apply to a table of values you had to split and would let you do it fast and set based. All you did at that point would be to make your data conform to the same set of similarly appearing values. 🙂

    To me that's not any different than making a developer specify RPC for calling a stored proc and specifying the collection of parameter values. You're preparing the data before you work on it. That's a good thing if performance benefits 🙂
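
    Something like the following is all I mean by pre-concatenating outside of the function. This is only a rough sketch: dbo.SomeTable and its columns are made-up names, and I'm assuming the DelimitedSplit8K signature from the article.

        -- Sketch only: dbo.SomeTable(SomeID, CsvCol) is a made-up example table.
        -- Appending the delimiter to the string BEFORE the function call means the
        -- splitter never has to special-case the final element.
        SELECT  st.SomeID,
                split.ItemNumber,
                split.Item
        FROM    dbo.SomeTable st
        CROSS APPLY dbo.DelimitedSplit8K(st.CsvCol + ',', ',') split;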

    Oh, and if you're really dealing with varchar(8000) strings, then it's time for you to write us the varchar(max) tally table split function 😛



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • The Irony.codeplex.com LALR(1) parser allows the definition of any BNF-based grammar. Provided the grammar specifies the literals for the double-quote and CRLF characters, the parser should handle the double quotes, including tricky cases like a CRLF inside the double quotes. I pointed you to the Comma Separated Value spec, http://www.ietf.org/rfc/rfc4180.txt, because the IETF defines CSV in BNF in their specification. It provides a good template to start from when configuring the grammar.

  • Jeff Moden (5/1/2011)


    WayneS (5/1/2011)


    Jeff, I had to rate this article 5 stars, because that's as high as the ratings allow. It deserves a 15!

    BTW, I'm aghast --- you've actually used a SQL CLR now! 😀 It would be nice if we could get Paul to contribute it, and to have it added to the resources of this article.

    Again, excellent job. Extremely well done, and extremely thorough. Thank you so much for this article, and for the new DelimitedSplit8K function.

    I couldn't have done it without the help of you, Ron, and Paul. All 3 of you did some awesome work with testing and verification of the code. You guys ROCK!

    So far as the CLR goes... yeah, I know I stuck my toe into the other world, but I had to know if the new function even came close. Paul was very cool in his understanding that I don't even know how to spell "C" and he made it real easy for me to install. 🙂

    Jeff, I have to say that it was a pleasure to work with you on this article.

    BTW, since the new code was not compared against a physical, permanent tally table, I thought I would mention that while testing this code out for you, I decided to see how the new splitter with a permanent tally table (as opposed to the virtual cteTally in the article) would compare with all of the others. It's slower than the cteTally splitter, but faster than all of the others.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Like I said before, when Jeff speaks - I listen.

    Excellent article, Jeff! If this article were a 'T-SQL how-to' in Barnes & Noble, I would buy it.

  • WayneS (5/2/2011)


    BTW, since the new code was not compared against a physical, permanent tally table, I thought I would mention that while testing this code out for you, I decided to see how the new splitter with a permanent tally table (as opposed to the virtual cteTally in the article) would compare with all of the others. It's slower than the cteTally splitter, but faster than all of the others.

    See, here's something I still don't get... I'm sure somebody can point me to another article that explains it, but how does a CTE tally outperform a table-based tally? Wouldn't all the calculations, and the CTE's lack of a tight clustered index, mean that the table-based tally should come out ahead? The memory footprint of the CTE tally vs. the table tally is about the same, and once the table tally is in cache that's not an issue, so I don't get it.
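
    For anyone following along, the cteTally I'm talking about is roughly the cascading-CTE pattern below. I'm paraphrasing it from memory, so check the article for the exact code.

        -- Rough paraphrase of the article's cteTally: rows are generated from
        -- constants and ROW_NUMBER(), so no table or index is ever read.
        WITH E1(N) AS (
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
            UNION ALL
            SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
        ),                                                --        10 rows
        E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),    --   10*10 = 100 rows
        E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),    -- 100*100 = 10,000 rows
        cteTally(N) AS (
            SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM E4
        )
        SELECT N FROM cteTally;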



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • mtassin (5/2/2011)


    WayneS (5/2/2011)


    BTW, since the new code was not compared against a physical, permanent tally table, I thought I would mention that while testing this code out for you, I decided to see how the new splitter with a permanent tally table (as opposed to the virtual cteTally in the article) would compare with all of the others. It's slower than the cteTally splitter, but faster than all of the others.

    See, here's something I still don't get... I'm sure somebody can point me to another article that explains it, but how does a CTE tally outperform a table-based tally? Wouldn't all the calculations, and the CTE's lack of a tight clustered index, mean that the table-based tally should come out ahead? The memory footprint of the CTE tally vs. the table tally is about the same, and once the table tally is in cache that's not an issue, so I don't get it.

    Mark, the only possible explanation that I can give is the elimination of the disk IO. When I ran this test, I fully expected the physical tally table to edge out the cteTally. I was pretty surprised by the results, and ran it multiple times on several VMs to verify it.
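
    One quick way to see what I mean (my own sanity check, not the full test harness from the article) is to compare what STATISTICS IO reports for each approach:

        SET STATISTICS IO ON;

        -- Physical tally table (assumed here to be dbo.Tally with a clustered
        -- index on N): STATISTICS IO reports logical reads against the table.
        SELECT t.N
        FROM   dbo.Tally t
        WHERE  t.N <= 8000;

        -- Run the cteTally query the same way: it's built purely from constants
        -- and ROW_NUMBER(), so STATISTICS IO reports no table reads at all.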

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (5/2/2011)


    Mark, the only possible explanation that I can give is the elimination of the disk IO. When I ran this test, I fully expected the physical tally table to edge out the cteTally. I was pretty surprised by the results, and ran it multiple times on several VMs to verify it.

    Curious: were you following a typical test process and clearing the cache for a fairer test on each run?

    If so, is there a change in performance when the cache isn't cleared? I realize that for perfect benchmarking we need to clear the cache, etc. But in the real world, I expect that somewhere in my cache there is an in-memory table with 10,000 rows numbered 1 to 10,000 in it. 🙂 Too much of my code uses a table named dbo.Tally for it not to stay in cache pretty much all the time.

    Just trying to figure out what's going on more than anything else; if I get some free time over my lunch break I'll have to dig into it a bit more.
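
    For what it's worth, the "typical test process" I have in mind is just the usual cold-cache prep between runs (on a test box only, never on production):

        CHECKPOINT;              -- flush dirty pages so the clean buffers can be dropped
        DBCC DROPCLEANBUFFERS;   -- empty the buffer pool, forcing physical reads on the next run
        DBCC FREEPROCCACHE;      -- clear cached plans, forcing recompiles on the next run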



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link
    For tips on how to post your problems

  • Thanks for the article.

    While not the focus of the article, I wanted to thank you for the Nibbler and Inch Worm sections. I don't know if you came up with those terms, but they are perfect imagery. I didn't even have to read the sections to know exactly what you were talking about. Very nice.

    Congrats on solving the problem.

  • Jeff,

    Thanks for writing both articles. I still don't understand why you use CharIndex() when you could use a cursor to iterate through the delimiter positions; I think your repeated use of CharIndex() was the unnecessary perf-killer. Please enlighten me.

    I was tempted to suggest joining the delimiter-position table to itself to produce pairs of delimiter positions, but it seems like a less efficient method.

    ====

    Addendum:

    Upon some reflection (after posting the above), I would think that minimizing all function calls would optimize the process. To that end, I think the fastest process would be to use CharIndex() in a loop to determine the delimiter positions, inserting those into a temp table (maybe adding the zero value along the way). Then use a cursor to iterate the delimiter positions, invoking SubString() for each element.
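
    Here is roughly the position-gathering loop I have in mind. It's only a sketch (the names are mine, and I haven't benchmarked it against the article's function):

        DECLARE @pString    VARCHAR(8000) = 'abc,de,fghi',
                @pDelimiter CHAR(1)       = ',',
                @Pos        INT           = 0;

        DECLARE @DelimPos TABLE (Pos INT NOT NULL);

        INSERT INTO @DelimPos (Pos) VALUES (0);   -- the "zero value" mentioned above

        WHILE 1 = 1
        BEGIN
            SET @Pos = CHARINDEX(@pDelimiter, @pString, @Pos + 1);
            IF @Pos = 0 BREAK;
            INSERT INTO @DelimPos (Pos) VALUES (@Pos);
        END;

        -- A cursor (or a self-join pairing consecutive positions) would then walk
        -- @DelimPos and call SubString() once per element.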

    ====

    Mark Hutchinson

  • Great stuff. I too would like to see the CLR code.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on Google's mail service

  • TheSQLGuru (5/2/2011)


    Great stuff. I too would like to see the CLR code.

    +1

    The queue starts to overflow...

    -- Gianluca Sartori

  • Good article. I come from an application environment where I had more flexibility to use outside coding. I'm happy to see we can do interesting things while still in the T-SQL domain.

  • mtassin (5/2/2011)


    WayneS (5/2/2011)


    Mark, the only possible explanation that I can give is the elimination of the disk IO. When I ran this test, I fully expected the physical tally table to edge out the cteTally. I was pretty surprised by the results, and ran it multiple times on several VMs to verify it.

    Curious: were you following a typical test process and clearing the cache for a fairer test on each run?

    If so, is there a change in performance when the cache isn't cleared? I realize that for perfect benchmarking we need to clear the cache, etc. But in the real world, I expect that somewhere in my cache there is an in-memory table with 10,000 rows numbered 1 to 10,000 in it. 🙂 Too much of my code uses a table named dbo.Tally for it not to stay in cache pretty much all the time.

    Just trying to figure out what's going on more than anything else; if I get some free time over my lunch break I'll have to dig into it a bit more.

    Mark,

    Actually, I didn't clear the cache between runs. My thinking was pretty much along the same lines as yours. Plus, with the test (Jeff included it in the references section of the article), it would pretty much be in cache after the first row was processed anyway.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Sorry folks. I had to take a break and actually work for a living. 😛 I'll be back tonight.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff,

    Brilliant piece of work! I did something a bit similar but cheated by always appending the last delimiter to the string in the function call. I always felt that it was a hack but I was under some pressure to get the project done. Yours is a much cleaner and way more versatile approach.

    The article is so well written that I was coming up with the ISNULL/NULLIF right along with you - a testament to the logical progression of the thing.

    The switch to always getting N+1 and starting with zero was also brilliant.
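
    Just to check my own understanding of that trick, here is a little toy illustration for the last element. It's my own paraphrase, not the article's exact code:

        DECLARE @pString    VARCHAR(8000) = 'abc,de,fghi',
                @pDelimiter CHAR(1)       = ',',
                @Start      INT           = 8;   -- position just past the last delimiter

        -- CHARINDEX returns 0 when no further delimiter is found; NULLIF turns
        -- that 0 into NULL, and ISNULL substitutes a length that runs past the
        -- end of the string, so SUBSTRING still returns the whole final element.
        SELECT SUBSTRING(@pString,
                         @Start,
                         ISNULL(NULLIF(CHARINDEX(@pDelimiter, @pString, @Start), 0), 8000) - @Start
               ) AS LastItem;   -- returns 'fghi'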

    By the way, I always referred to the 'Nibbler' approach as a 'Shuffler' approach. I think I like 'Nibbler' a bit better.

    Todd Fifield

