String tokenizing / splitting

• Interesting point about the varying times with smaller sets of data. I'll do some testing and see what I can find.

    /* Anything is possible but is it worth it? */

  • Hey guys,

have a look here --> udf_ListToTable

    maybe you could run it within your performance tests,

    Norbert

PS: For the opposite direction, take this --> udf_TableToList

  • That function looks pretty similar to the one from the original article, which I called "fn_Split". It does fine, too, and it will work on older versions of SQL Server.

For joining, there are a couple of basic ways to do it. The tough part comes when you want to pass the table name involved as a variable: either you do some SQL-Server-2008-only work with a newly declared table type, or you build the statement with string concatenation and EXEC, so be careful with that one regarding SQL injection... and some of the XML functionality might open a back door, too.
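If you do go the concatenation route, a minimal sketch of the usual safeguards looks like this (the table name MyTokenTable and the token column are hypothetical examples, not from the article):

    -- Dynamic SQL with a table name passed as a variable.
    -- QUOTENAME plus an existence check keeps injected strings out of the statement.
    DECLARE @TableName sysname = N'MyTokenTable';   -- hypothetical table
    DECLARE @sql nvarchar(max);

    IF OBJECT_ID(@TableName, 'U') IS NOT NULL       -- only proceed for a real user table
    BEGIN
        SET @sql = N'SELECT token FROM ' + QUOTENAME(@TableName) + N';';
        EXEC sp_executesql @sql;                    -- other values can be passed as real parameters
    END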

The basic premise for joining can use a CTE (as I wrote in my very first blog post), a COALESCE(), or perhaps a CURSOR (which will be a little slow). Here is a quick and dirty example using a table variable, a 2008-only INSERT statement to fill that test table variable, and then a COALESCE() to join all the rows in that table:

    DECLARE @table TABLE (token VARCHAR(max));
    INSERT INTO @table VALUES ('xyz'), ('pdq'), ('abc'), ('123');  -- row constructors are 2008+

    DECLARE @joined VARCHAR(max);

    -- Each row appends itself to @joined; COALESCE supplies '' for the first
    -- row so the result doesn't start with a stray comma.
    SELECT @joined = COALESCE(@joined + ',', '') + token FROM @table;

    SELECT @joined;  -- e.g. xyz,pdq,abc,123
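One caveat: SQL Server doesn't guarantee the row order of that variable-assignment SELECT. If token order matters, a FOR XML PATH('') concatenation honors its ORDER BY (continuing with the variables above):

    -- ORDER BY inside the FOR XML subquery is respected; STUFF strips the leading comma.
    SELECT @joined = STUFF(
        (SELECT ',' + token FROM @table ORDER BY token FOR XML PATH('')),
        1, 1, '');
    -- Note: tokens containing &, < or > get entity-encoded here; the
    -- TYPE directive with .value() avoids that if it matters.

    SELECT @joined;  -- 123,abc,pdq,xyz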

• I ran the three types again, with SET STATISTICS TIME ON, and...

    ...8788 tokens:

    XML: CPU time = 203722 ms

    Tally: CPU time = 47 ms

    Original: CPU time = 686 ms
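(For anyone reproducing this, the measurement pattern is just the following; dbo.fn_Split and the sample input are placeholders:)

    SET STATISTICS TIME ON;

    DECLARE @s VARCHAR(max) = 'a,b,c';           -- placeholder input
    SELECT COUNT(*) FROM dbo.fn_Split(@s, ',');  -- splitter under test

    SET STATISTICS TIME OFF;
    -- CPU and elapsed times appear on the Messages tab.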

• Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why XML was a better fit in my tests. There probably isn't a one-size-fits-all answer.

    Cheap toilet paper is a false economy, beware!

  • Craig Sunderland (3/7/2011)


Surely other factors like CPU, memory, and disk configuration could affect these timings, which is why XML was a better fit in my tests. There probably isn't a one-size-fits-all answer.

Yes, it would skew the tests from one machine to another, but he's been testing on the same machine and instance. If the results vary widely, that shows the approach doesn't scale well. However, the three ideas tested each have their own disadvantages: one requires CLR procs to be enabled, another needs a tally table, and the third only works well with smaller delimited lists. If you're sure you'll never have a list of hundreds or thousands of items, XML will work without any other object. But if you need that flexibility, it's not a big performance hit to use the tally function for smaller sets too. Testing, and revisiting those tests from time to time, will tell you whether you're still using the best function for your current hardware and data distribution.
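For readers following along, the XML approach under discussion generally follows this pattern (a sketch; the exact function tested in this thread may differ):

    DECLARE @list VARCHAR(max) = 'xyz,pdq,abc,123';
    DECLARE @xml XML = CAST('<t>' + REPLACE(@list, ',', '</t><t>') + '</t>' AS XML);

    SELECT n.x.value('.', 'VARCHAR(max)') AS token
    FROM @xml.nodes('/t') AS n(x);
    -- Note: input containing XML-special characters (&, <, >) must be
    -- entity-encoded first, or the CAST will fail.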

    /* Anything is possible but is it worth it? */

• My theory on the XML problem was that it was actually a CLR call-out under the covers. Tally seems to work well: it shows an up-front execution cost the others don't have, but it seems to perform better overall (see the tally sketch below).

    I'm going to tweak each idea a bit and run them in more demanding, concurrent scenarios to see which ones fare better at scale with parallel use.
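A minimal sketch of that tally-based split pattern, assuming a physical dbo.Tally table of sequential integers starting at 1 and a VARCHAR(8000) input:

    DECLARE @list VARCHAR(8000) = ',' + 'xyz,pdq,abc,123' + ',';  -- pad both ends with the delimiter

    SELECT SUBSTRING(@list, t.N + 1, CHARINDEX(',', @list, t.N + 1) - t.N - 1) AS token
    FROM dbo.Tally AS t
    WHERE t.N < LEN(@list)                  -- stop before the trailing delimiter
      AND SUBSTRING(@list, t.N, 1) = ','    -- every delimiter marks the start of a token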

  • I had a little more fun... again using 17,000+ tokens.

I put the CTE version inside the original function: it carves the input into 200-character chunks using the original method (sort of), assuming a single-character delimiter, then uses the CTE to further split each chunk into tokens. It still has a few bugs, but it did improve the overall results (hopefully not because of the bugs):

    Original + CTE Hybrid: CPU time = 468 ms, elapsed time = 473 ms.

    Original: CPU time = 686 ms, elapsed time = 734 ms.

    Tally: CPU time = 172 ms, elapsed time = 157 ms.

    CREATE FUNCTION [dbo].[fn_Split_Hybrid]
    (
        @InputString VARCHAR(max),
        @Delimiter   CHAR(1)
    )
    RETURNS @Values TABLE (
        VALUE VARCHAR(max)
        --,IPS VARCHAR(MAX)
        --,DCI BIGINT
    )
    AS
    BEGIN
        -- Carve the input into 200-character chunks, then let the recursive CTE
        -- split each chunk on the delimiter.
        -- Known bugs (the "few bugs" mentioned above): a token spanning a chunk
        -- boundary is emitted as two tokens; a chunk with no delimiter (or an
        -- empty tail) errors out on the negative LEFT()/SUBSTRING() length; and
        -- the CTE's default MAXRECURSION of 100 caps the tokens per chunk.
        DECLARE @DelimiterLen tinyint = 200;  -- actually the chunk size, in characters
        DECLARE @DelimiterCharIndex bigint = LEN(@InputString);  -- remaining length
            -- (was initialized to @DelimiterLen, which fails on inputs shorter than one chunk)
        DECLARE @newval varchar(200);

        WHILE (@DelimiterCharIndex >= @DelimiterLen)
        BEGIN
            SET @newval      = LEFT(@InputString, @DelimiterLen);
            SET @InputString = SUBSTRING(@InputString, @DelimiterLen + 1, LEN(@InputString) - @DelimiterLen);
            SET @DelimiterCharIndex = LEN(@InputString);
            --INSERT INTO @Values VALUES (@newval)--, @InputString, @DelimiterCharIndex)

            WITH split (i, token, remainder) AS
            (
                SELECT 1
                     , LEFT(@newval, CHARINDEX(@Delimiter, @newval) - 1)
                     , LTRIM(RIGHT(@newval, LEN(@newval) - CHARINDEX(@Delimiter, @newval)))
                UNION ALL
                SELECT i + 1
                     , CASE WHEN CHARINDEX(@Delimiter, remainder) > 0
                            THEN LEFT(remainder, CHARINDEX(@Delimiter, remainder) - 1)
                            ELSE remainder
                       END
                     , LTRIM(RIGHT(remainder, LEN(remainder) - CHARINDEX(@Delimiter, remainder)))
                FROM split
                WHERE CHARINDEX(@Delimiter, remainder) >= 0
                  AND token != remainder
            )
            INSERT INTO @Values
            SELECT token FROM split;
        END

        -- Split the final, short chunk the same way.
        SET @newval = @InputString;

        WITH split (i, token, remainder) AS
        (
            SELECT 1
                 , LEFT(@newval, CHARINDEX(@Delimiter, @newval) - 1)
                 , LTRIM(RIGHT(@newval, LEN(@newval) - CHARINDEX(@Delimiter, @newval)))
            UNION ALL
            SELECT i + 1
                 , CASE WHEN CHARINDEX(@Delimiter, remainder) > 0
                        THEN LEFT(remainder, CHARINDEX(@Delimiter, remainder) - 1)
                        ELSE remainder
                   END
                 , LTRIM(RIGHT(remainder, LEN(remainder) - CHARINDEX(@Delimiter, remainder)))
            FROM split
            WHERE CHARINDEX(@Delimiter, remainder) >= 0
              AND token != remainder
        )
        INSERT INTO @Values
        SELECT token FROM split;

        RETURN;
    END
    GO
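A quick smoke test for the hybrid (sample values are arbitrary):

    SELECT VALUE
    FROM dbo.fn_Split_Hybrid('xyz,pdq,abc,123', ',');
    -- Returns four rows: xyz, pdq, abc, 123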

• As promised, I did some testing (sorry it took this long). This was run from my laptop in a Win2K3 Server VM. The results showed a pretty consistent run time of ~40 ms. Running the script in a loop 20 times, the max was 50 ms and the min 30 ms, with an average of 40.45 ms and a median of 41 ms. I will note that, yes, there is a penalty for the first load of the tally table, which I saw the first time I ran it in tempdb. However, if it's used often it will probably stay in memory, which it does on our system.

    But to be fair, I ran DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS between the tests below and did see the occasional doubling of time you experienced, but it wasn't consistent.

    Other changes made to your test script:

1) The UDF takes nvarchar, so I changed char to nchar to remove the implicit conversion

    2) Used datetime2

    3) DATEDIFF for time difference calculation

    4) Added GO 20 to repeat the script 20 times

    DECLARE @i int = 26, @x nvarchar(max) = N'', @d nchar(1) = N' ', @j int;
    DECLARE @t TABLE (id tinyint PRIMARY KEY, tokens int, which nvarchar(32), start datetime2, finish datetime2);

    -- Note: this is going to take a while, so if you want to run this more than once, store this data somewhere...
    SET @j = @i * @i;
    WHILE @j > 0
    BEGIN
        WHILE @i > 0
        BEGIN
            SET @x = @x + @d + NCHAR(91 - @i);  -- append the letters A..Z, one token at a time
            SET @i = @i - 1;
        END
        SET @j = @j - 1;
        SET @i = 26;
    END

    DECLARE @C int;
    UPDATE @t SET tokens = @C, finish = SYSDATETIME() WHERE id = 1;

    INSERT INTO @t (id, which, start) VALUES (2, 'udf_StrList2Table', GETDATE());
    SELECT @C = COUNT(*) FROM ..[udf_StrList2Table] (@x, @d);
    UPDATE @t SET tokens = @C, finish = SYSDATETIME() WHERE id = 2;

    SELECT *, DATEDIFF(millisecond, start, finish) AS runtime FROM @t;
    GO 20

    /* Anything is possible but is it worth it? */

• Jeff Moden posted a very nice article about CSV parsing. His code was written for a max input string of 8K, so if you need more, stay tuned for an upcoming article from him. I tested his code against mine and it completely blows mine away. Not to mention his CTE tally table did much better than relying on a physical Tally table.

Tally Oh!

    /* Anything is possible but is it worth it? */

  • Gatekeeper (5/2/2011)


    Not to mention his CTE tally table...

Thank you for the wonderful kudo, but I absolutely cannot take credit for the cteTally. The original concept was first published by Itzik Ben-Gan, who used a "base 2" generator. The "base 10" generator I included in my article was the result of a lot of good people's efforts here on SSC (Lynn Pettis started the ball rolling in this area), where we tested many different "bases". I chose to stick with the "base 10" generator because it appeared to have a minor speed advantage over the "base 2" generator, took fewer cross joins to ramp up to the larger numbers and, as a result, took a bit less code.

    Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.
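For reference, the "base 10" generator follows this pattern (a sketch of the idea; see the article for the exact version):

    WITH E1(N) AS (  -- 10 rows, built from constants: zero reads
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),
    E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),  -- 10^2 rows
    E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),  -- 10^4 rows
    cteTally(N) AS (
        SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
    SELECT N FROM cteTally WHERE N <= 8000;  -- numbers are generated, not read from disk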

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (2/25/2012)


    Still, I have to give the credit to Itzik for the wonderful and very useful idea especially since it generates virtually no reads. The use of a physical Tally Table will still beat it by a bit in some applications but you simply have to love the "no reads" aspect of Itzik's fine idea.

Yes, the "no reads" aspect is very hard to pass up when thinking about tally tables. Thanks to Itzik.

    /* Anything is possible but is it worth it? */
