Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Post reply

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 466 through 480 (of 981 total)

You must be logged in to reply to this topic. Login to reply

John Hardin Grasshopper Points: 20 More actions · Answer 1

Lynn Pettis (12/30/2012)

John Hardin (12/30/2012)

L' Eomot Inversé (12/30/2012)

Jeff Moden (12/28/2012)

If you look at the execution plan, it doesn't actually use the "same list". CTEs have some wonderful advantages but they also have a hidden fault. Instead of using the same internal result set from a CTE when things like self joins are used, the entire CTE is re-executed.

Now, if only one could force the optimiser to spool the CTE and reuse it, CTEs would be far more useful.

Don't become so blinded by CTEs that you forget about temporary tables or table-type variables; another problem with CTEs is they are not indexed.

On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

Page 27 passed on how many posts per page? I ask as I display 50 posts per page. It would help if you posted the url of your post so we could go directly to that post.

D'oh! I didn't consider non-default posts-per-page settings. Sorry. Here's the direct link: http://qa.sqlservercentral.com/Forums/FindPost1129283.aspx

The second code block is what I am referring to.

Paul White SSC Guru Points: 150457 More actions · Answer 2

L' Eomot Inversé (12/30/2012)

Why can't the optimiser look and see if spooling would enhance performance (only in the cases where it make no semantic difference, of course) and use it when it does? Or if that's too difficult a task for the optimiser (it shouldn't be, but they haven't done it so I can imagine them claiming it is) why can't we have a query hint that tells them to do it? I hope Paul or someone will jump in and explain all this - maybe tell me I've got it all wrong, but that's OK too, I like learning.

You're right that the optimizer doesn't consider CTE materialization today (they are strictly in-line view definitions, expanded once per reference) and there is no hint for materialization either.

There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 3

Tony.l (12/29/2012)

Hi

unbelievable - so good. I have been doing these puter things for 30 years now and i am still amazed how little i know or probably at just how clever other people are. Really well explained and totally useful.

keep up the good work.

APL. My initials not the language.

Thanks for the great feedback, Tony. I feel the same way. I've been working with computers in one form or another since 1968 and I'm still learning something new about them everyday.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 4

John Hardin (12/30/2012)

On page 27 I posted a suggested version of this that avoided CHARINDEX() and used an indexed table-type variable and it showed performance gains over the pure-CTE version, but I don't think anyone has seen fit to comment on that suggested version... 🙁

You'll have to forgive me a bit on that. If you look at even the most recent pages of this thread, there are still people saying "Try this" or "How about this"? Even you said...

How about this to get rid of the CHARINDEX() string operation:

{snip}

...how does that affect performance?

Two days later (lots of water had flowed under the bridge by then and we're not notified by edits), you added...

edited to add: Actually performance goes down a bit because the subquery in making cteStartEnd is not indexed. This version does exhibit performance gains in my very modest testing:

...and followed that up with...

(Note: I'm doing my version as a SP so the above exact code hasn't actually been tested by me...)

I just didn't (and still don't) have the time to test everyone's suggestion. That's why I spent so much time on building a test harness for everyone to use.

If you think your method has merit in the area of performance, please test it and post the test results. Who knows? You might be on to something.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

TomThomson SSC Guru Points: 104773 More actions · Answer 5

SQL Kiwi (12/30/2012)

There is a 2006 active Connect suggestion by Adam Machanic for a hint. Though some duplicates of that item have been closed as Won't Fix (= perhaps for the time being) my overall impression from public Microsoft comments is that something along these lines is being actively considered for a future release. We will have to wait and see.

Well, I've added my vote to that. It will probably make no difference, but that isn't an excuse for not doing so whem I think it's right.

Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to qa.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

Tom

Paul White SSC Guru Points: 150457 More actions · Answer 6

L' Eomot Inversé (12/31/2012)

For some reason the link doesn't work. But looking in the obvious place indicated by your link (http://research.microsoft.com/en-us/um/people/jrzhou/publications.htm) finds it and the link from there, although it appears to be identical to your link, works fine. I have no idea what's happening there.

Edit: yes, I see what was happening: your link is relative, not absolute, so it tries to connect to qa.sqlservercentral.com/Forums/research.microsoft.com/en-us/um/people/jrzhou/pub/cse.pdf instead of to the real target. I suspect I've done that sometimes too - probably every time I've typed the link instead of copyying and pasting from the address field of my browser.

Thanks Tom, I have corrected the link in the original post.

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Dwain Camps SSC Guru Points: 86913 More actions · Answer 7

Nice update to this article Jeff!

I especially liked your mysteriously labelled black line (????). Nice trick with the 0 based Tally table too - will need to remember that one.

I have already replaced the old DelimitedSplit8K in my sandbox and now I must seek out other places where it may have been used and do the same.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

MarbryHardin Old Hand Points: 333 More actions · Answer 8

MarbryHardin

Old Hand

Points: 333

January 2, 2013 at 8:02 am

#1572895

Interesting stuff, thank you.

Table type parameters for stored procedures have eliminated much of my need for doing this within SQL, but it's always nice to have options.

klini SSC-Addicted Points: 458 More actions · Answer 9

klini

SSC-Addicted

Points: 458

January 3, 2013 at 7:26 am

#1573319

Jeff or Paul:

I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

Paul, would you mind sharing that with us? I would love to see its guts.

Kurt

Steven Willis SSCrazy Eights Points: 9893 More actions · Answer 10

Steven Willis

SSCrazy Eights

Points: 9893

January 3, 2013 at 9:42 am

#1573395

klini (1/3/2013)

I didn't see the source code for the CLR function. I also didn't read all 473 posts before mine, so maybe there's a link to it somewhere..

It's in one of the zip files at the bottom of the original article: Tally OH! An Improved SQL 8K “CSV Splitter” Function[/url]

telcogod SSC Enthusiast Points: 143 More actions · Answer 11

hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

DROP FUNCTION [dbo].[DelimitedSplit8K]

go

CREATE FUNCTION [dbo].[DelimitedSplit8K]

(@pString VARCHAR(8000), @pDelimiter CHAR(1))

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover NVARCHAR(4000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

, cteStart(N1,L1) AS (

SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

UNION ALL

SELECT t.N+1

, ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

FROM cteTally t

WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

)

--select * from cteStart

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

Item = SUBSTRING(@pString, l.N1, l.L1)

FROM cteStart l

Steven Willis SSCrazy Eights Points: 9893 More actions · Answer 12

telcogod (1/4/2013)

hi sorry i haven't tested for performance or sure if this has been done but removed the ctelen

IF OBJECT_ID('dbo.DelimitedSplit8K') IS NOT NULL

DROP FUNCTION [dbo].[DelimitedSplit8K]

go

CREATE FUNCTION [dbo].[DelimitedSplit8K]

(@pString VARCHAR(8000), @pDelimiter CHAR(1))

RETURNS TABLE WITH SCHEMABINDING AS

RETURN

--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...

-- enough to cover NVARCHAR(4000)

WITH E1(N) AS (

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1

), --10E+1 or 10 rows

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front

-- for both a performance gain and prevention of accidental "overruns"

SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

)

, cteStart(N1,L1) AS (

SELECT 1,CHARINDEX(@pDelimiter,@pString,1) -1

UNION ALL

SELECT t.N+1

, ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,t.N+1),0)-t.N,8001) -1

FROM cteTally t

WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter

)

--select * from cteStart

SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),

Item = SUBSTRING(@pString, l.N1, l.L1)

FROM cteStart l

Pretty good performance compared to DelimitedSplit8K! However, when running the tests using Jeff Moden's test harness the function failed whenever there was only one element and was still not quite as fast. Very close in performance though, so the change you made didn't really make much difference.



RowNum	SplitterName	NumberOfRows	NumberOfElements	MinElementLength	MaxElementLength	Duration	MinLength	AvgLength	MaxLength

1	DelimitedSplit8K	1000	1	1	10	0.12300	1	5	10

2	DelimitedSplit_TELCO	1000	1	1	10	0.07600	1	5	10

3	DelimitedSplit8K	1000	2	1	10	0.15300	3	11	21

4	DelimitedSplit_TELCO	1000	2	1	10	0.18600	3	11	21

5	DelimitedSplit8K	1000	4	1	10	0.26000	9	24	43

6	DelimitedSplit_TELCO	1000	4	1	10	0.15000	9	24	43

7	DelimitedSplit8K	1000	8	1	10	0.33000	21	50	73

8	DelimitedSplit_TELCO	1000	8	1	10	0.13000	21	50	73

9	DelimitedSplit8K	1000	16	1	10	0.34600	64	102	140

10	DelimitedSplit_TELCO	1000	16	1	10	0.46600	64	102	140

11	DelimitedSplit8K	1000	32	1	10	0.33300	158	206	268

12	DelimitedSplit_TELCO	1000	32	1	10	0.41600	158	206	268

13	DelimitedSplit8K	1000	64	1	10	0.53600	349	415	511

14	DelimitedSplit_TELCO	1000	64	1	10	0.48600	349	415	511

15	DelimitedSplit8K	1000	128	1	10	0.87600	701	830	920

16	DelimitedSplit_TELCO	1000	128	1	10	0.96600	701	830	920

17	DelimitedSplit8K	1000	256	1	10	1.59600	1530	1663	1835

18	DelimitedSplit_TELCO	1000	256	1	10	1.96000	1530	1663	1835

19	DelimitedSplit8K	1000	512	1	10	3.09300	3075	3323	3541

20	DelimitedSplit_TELCO	1000	512	1	10	3.32300	3075	3323	3541

21	DelimitedSplit8K	1000	1150	1	10	7.23000	7141	7466	7790

22	DelimitedSplit_TELCO	1000	1150	1	10	8.41600	7141	7466	7790

23	DelimitedSplit8K	1000	1	10	20	0.01000	10	14	20

24	DelimitedSplit_TELCO	1000	1	10	20	0.01000	10	14	20

25	DelimitedSplit8K	1000	2	10	20	0.03600	21	30	41

26	DelimitedSplit_TELCO	1000	2	10	20	0.05300	21	30	41

27	DelimitedSplit8K	1000	4	10	20	0.10000	47	62	82

28	DelimitedSplit_TELCO	1000	4	10	20	0.08000	47	62	82

29	DelimitedSplit8K	1000	8	10	20	0.12000	93	126	155

30	DelimitedSplit_TELCO	1000	8	10	20	0.12600	93	126	155

31	DelimitedSplit8K	1000	16	10	20	0.20000	219	254	292

32	DelimitedSplit_TELCO	1000	16	10	20	0.23600	219	254	292

33	DelimitedSplit8K	1000	32	10	20	0.36000	458	511	559

34	DelimitedSplit_TELCO	1000	32	10	20	0.39300	458	511	559

35	DelimitedSplit8K	1000	64	10	20	0.71000	931	1021	1099

36	DelimitedSplit_TELCO	1000	64	10	20	0.76000	931	1021	1099

37	DelimitedSplit8K	1000	128	10	20	1.45300	1943	2047	2158

38	DelimitedSplit_TELCO	1000	128	10	20	1.48300	1943	2047	2158

39	DelimitedSplit8K	1000	256	10	20	2.92000	3946	4095	4238

40	DelimitedSplit_TELCO	1000	256	10	20	2.98300	3946	4095	4238

41	DelimitedSplit8K	1000	480	10	20	5.37000	7437	7676	7900

42	DelimitedSplit_TELCO	1000	480	10	20	5.66300	7437	7676	7900

43	DelimitedSplit8K	1000	1	20	30	0.05000	20	25	30

44	DelimitedSplit_TELCO	1000	1	20	30	0.01000	20	25	30

45	DelimitedSplit8K	1000	2	20	30	0.02300	41	51	61

46	DelimitedSplit_TELCO	1000	2	20	30	0.05600	41	51	61

47	DelimitedSplit8K	1000	4	20	30	0.07000	84	102	120

48	DelimitedSplit_TELCO	1000	4	20	30	0.10300	84	102	120

49	DelimitedSplit8K	1000	8	20	30	0.13600	179	206	232

50	DelimitedSplit_TELCO	1000	8	20	30	0.16300	179	206	232

51	DelimitedSplit8K	1000	16	20	30	0.26600	372	414	449

52	DelimitedSplit_TELCO	1000	16	20	30	0.32600	372	414	449

53	DelimitedSplit8K	1000	32	20	30	0.52600	768	830	896

54	DelimitedSplit_TELCO	1000	32	20	30	0.57000	768	830	896

55	DelimitedSplit8K	1000	64	20	30	1.04300	1593	1663	1733

56	DelimitedSplit_TELCO	1000	64	20	30	1.09600	1593	1663	1733

57	DelimitedSplit8K	1000	128	20	30	2.09000	3182	3328	3441

58	DelimitedSplit_TELCO	1000	128	20	30	2.20000	3182	3328	3441

59	DelimitedSplit8K	1000	256	20	30	4.24000	6497	6654	6803

60	DelimitedSplit_TELCO	1000	256	20	30	4.30300	6497	6654	6803

61	DelimitedSplit8K	1000	290	20	30	4.75300	7331	7538	7696

62	DelimitedSplit_TELCO	1000	290	20	30	4.88600	7331	7538	7696

63	DelimitedSplit8K	1000	1	30	40	0.01300	30	34	40

64	DelimitedSplit_TELCO	1000	1	30	40	0.01000	30	34	40

65	DelimitedSplit8K	1000	2	30	40	0.02600	61	71	81

66	DelimitedSplit_TELCO	1000	2	30	40	0.03000	61	71	81

67	DelimitedSplit8K	1000	4	30	40	0.09300	126	143	161

68	DelimitedSplit_TELCO	1000	4	30	40	0.12300	126	143	161

69	DelimitedSplit8K	1000	8	30	40	0.18300	263	287	315

70	DelimitedSplit_TELCO	1000	8	30	40	0.23300	263	287	315

71	DelimitedSplit8K	1000	16	30	40	0.35300	529	575	619

72	DelimitedSplit_TELCO	1000	16	30	40	0.38300	529	575	619

73	DelimitedSplit8K	1000	32	30	40	0.69300	1098	1150	1212

74	DelimitedSplit_TELCO	1000	32	30	40	0.73300	1098	1150	1212

75	DelimitedSplit8K	1000	64	30	40	1.38300	2219	2301	2391

76	DelimitedSplit_TELCO	1000	64	30	40	1.43300	2219	2301	2391

77	DelimitedSplit8K	1000	128	30	40	2.87300	4496	4605	4736

78	DelimitedSplit_TELCO	1000	128	30	40	2.83300	4496	4605	4736

79	DelimitedSplit8K	1000	210	30	40	4.57600	7412	7561	7710

80	DelimitedSplit_TELCO	1000	210	30	40	4.67000	7412	7561	7710

81	DelimitedSplit8K	1000	1	40	50	0.01600	40	44	50

82	DelimitedSplit_TELCO	1000	1	40	50	0.01000	40	44	50

83	DelimitedSplit8K	1000	2	40	50	0.04300	81	91	101

84	DelimitedSplit_TELCO	1000	2	40	50	0.03000	81	91	101

85	DelimitedSplit8K	1000	4	40	50	0.14000	167	182	201

86	DelimitedSplit_TELCO	1000	4	40	50	0.14300	167	182	201

87	DelimitedSplit8K	1000	8	40	50	0.22000	342	366	395

88	DelimitedSplit_TELCO	1000	8	40	50	0.25300	342	366	395

89	DelimitedSplit8K	1000	16	40	50	0.43600	693	735	779

90	DelimitedSplit_TELCO	1000	16	40	50	0.49000	693	735	779

91	DelimitedSplit8K	1000	32	40	50	0.86000	1399	1470	1530

92	DelimitedSplit_TELCO	1000	32	40	50	0.91300	1399	1470	1530

93	DelimitedSplit8K	1000	64	40	50	1.71600	2852	2942	3022

94	DelimitedSplit_TELCO	1000	64	40	50	1.81300	2852	2942	3022

95	DelimitedSplit8K	1000	128	40	50	3.56600	5767	5888	6012

96	DelimitedSplit_TELCO	1000	128	40	50	3.59000	5767	5888	6012

97	DelimitedSplit8K	1000	165	40	50	4.44600	7465	7590	7717

98	DelimitedSplit_TELCO	1000	165	40	50	4.55000	7465	7590	7717

telcogod SSC Enthusiast Points: 143 More actions · Answer 13

ah well, I have seen that before where you think removing the extra select will improve performance but it doesn't. Anyway I am working on a look ma no ddl version as well. This is brilliant stuff I always knew had to be a better way than select substring(string,charindex(delimiter,string), substring(string(charindex(delimiter,string,(charindex(delimiter,string))) etc. Thanks for checking I do not have a sandbox at work where is ok to freeproccache.

peter-757102 SSCertifiable Points: 6877 More actions · Answer 14

SQL Kiwi (12/30/2012)

Related: more generalized subtree matching and reuse has been trialled in the product, see this Microsoft Research paper (PDF).

More generalized matching is exactly what I always wanted, and would hate to see a query hint specific for subexpression reuse. CTE's expanding as they do now is logical as on each site, the "inline view" defined in the CTE can be used differently and thus must be optimised seperately as well. A more generalised optimisation as in the paper will exploit the similarities still found afer that process.

As for the paper, I only quickly scanned it to see what they were up to, and hope their table signatures are based on what is there after expanding contributing code to a query/batch. You want CTE's and non inline views expanded to have a wide as possible base for finding matches. Thus the optimisation should be 100% seperate from the SQL code and no query hint should be wished for!

aochss SSCommitted Points: 1677 More actions · Answer 15

The performance of this is amazing, as is the thought behind the code. These kind of posts really make you think and learn a whole new set of features in SQL

Thanks Jeff.

A couple of questions (sorry if it has been addressed - I didn't see it mentioned in the thread):

I would not expect a record back if the string being sent in is NULL, a single space or all spaces. In my case, I have records where the string being split is often blank and I would want no records coming back from the CROSS APPLY. I could filter them out using in the WHERE clause. I fixed this by adding LTRIM and RTRIM to the source string in the function.

Also, some of my "not so good" data also had multiple spaces in between the values causing numerous records returned that were blank.

For example:

dbo.DelimitedSplit8k('123 467 789', ' ') -- There are two spaces between 467 and 789

ItemNumber Item

1 123

2 467

3 <-- Empty String

4 789

Thanks Again,

Anton