Performance issue with tally solution

Question

Post reply

Performance issue with tally solution

Viewing 15 posts - 406 through 420 (of 522 total)

You must be logged in to reply to this topic. Login to reply

peter-757102 SSCertifiable Points: 6877 More actions · Answer 1

Jeff Moden (5/9/2009)

Peter,

Sorry man. I tested against 1000 lines of 800 comma separated random numbers each in a VARCHAR(8000) environment... your code did the split into a table at a minute and 50 seconds. Many of the other solutions come in at less than 10 seconds for the same thing.

Heh, I was still running against the solutions listed in the initial few pages, you know all those that build an intermediate table due to the non-inline table valued functions. I bet it was significantly faster on those on your machine too...if not that is would worry me. As I did some further tweaking, I kept hitting a wall where SQL Sever refuses to act smart.

Now I have to look at your latest finding and understand how that impacts things. If all it takes is some implicit sorting to speed up code then I am all for it :). I did take a quick look at the later SQL solutions, but converting those to be exactly how I would want a generic splitter will take some doing and could possibly kill the optimizer when used in larger queries.

Thanks for your testing tho 🙂

Note:

I found that my otherwise fast tally function does not work well with your cross join queries.

It slows it down from 2 seconds to at least 3 minutes at which point I aborted it!

So much more testing to do 😉

Suffice to say, this topic is going to live forever until I find the best solution for all cases 🙂

UMG Developer SSChampion Points: 13482 More actions · Answer 2

I know I am coming in late, but boy has it been a read so far and I can't wait to see the final outcome.

In the particular problem I am trying to solve I have 4,558,077 comment pieces, that have to be combined so that I have the full comment in one field. It takes about 2 minutes and 12 seconds using FOR XML (with .VALUE) to create the table with 1,173,035 comments all assembled.

The comments are as short as 43 characters, as long as 19,629 characters, with an average of 502 characters. (They have as little as 1 line, up to 434 lines, with an average of 11 lines.)

Then I have to take all of that and burst it into individual lines for each comment. (The lines may have been broken apart in different pieces in the source.)

Using the last Tally solution from Jeff Moden it took 15 minutes and 1 second.

The resulting 13,519,756 lines are as short as 0 characters, as long as 733 characters, with an average of 42 characters.

One interesting thing I found is that if I have a clustered index on the comment table (dbo.JBMTest in the Jeff's example) that SQL Server decides to uni-task, and took 41 minutes 19 seconds to finish. By dropping the PK it used parallelism, and was almost 3 times faster.

I'm interested in trying other T-SQL solutions, but I'm not sure which I should try or where the best/latest versions are. (I'm not ready to go down the path of getting CLR turned on.)

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 3

I'm confused... why would you put over 4 million comments together only to break them apart again?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

UMG Developer SSChampion Points: 13482 More actions · Answer 4

Jeff Moden (5/10/2009)

I'm confused... why would you put over 4 million comments together only to break them apart again?

Because in the source data an individual line may be broken across multiple parts. I figured rather than split the parts, and then try to figure out what line fragments have to be put together it would be easier to make one huge string out of all the parts for a given comment and then break that into the individual lines. (i.e. Part 1 of Comment 1 may contain lines 1 and 2, and the first portion of line 3, while part 2 may contain the rest of line 3 as well as line 4.)

Florian Reischl SSC-Dedicated Points: 37299 More actions · Answer 5

Florian Reischl

SSC-Dedicated

Points: 37299

May 10, 2009 at 3:11 pm

#991837

Hi UMG

Maybe I have an idea. Just for ensure. Your data are text data? No binary.

Greets

Flo

The more I learn, the more I know what I do not know
Blog: Things about Software Architecture, .NET development and T-SQL

How to Post Data/Code to get the best Help How to Post Performance Problems

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 6

UMG Developer (5/10/2009)

Jeff Moden (5/10/2009)

I'm confused... why would you put over 4 million comments together only to break them apart again?

Because in the source data an individual line may be broken across multiple parts. I figured rather than split the parts, and then try to figure out what line fragments have to be put together it would be easier to make one huge string out of all the parts for a given comment and then break that into the individual lines. (i.e. Part 1 of Comment 1 may contain lines 1 and 2, and the first portion of line 3, while part 2 may contain the rest of line 3 as well as line 4.)

Heh... step 1 for me would be to pork chop the data provider. What a terrible thing to do to the data.

That notwithstanding, what is the unique character being used to signify the parts of a comment? And, how long are the individual lines? Last but not least, are you importing this data from a file?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Florian Reischl SSC-Dedicated Points: 37299 More actions · Answer 7

Hi UMG

For now just a simple approach.

Since your source data are not enclosed with a leading/trailing delimiter I took my UNION ALL Tally version to do the initial split. This includes the first and the last item. It is a little bit slower than the traditional tally version but it doesn't require the delimiters at start and end of the text.

Rows which end with the specified delimiter (I took a comma for now) will cause a empty row as last item. After that move the fragments up to the previous result row. Delete the moved rows. Delete the empty rows at the end of each row-split.

[font="Courier New"]---==================================

-- Some source data

DECLARE @Source TABLE (Id INT NOT NULL IDENTITY, Txt VARCHAR(100))

INSERT INTO @Source

SELECT 'aaa,bb'

UNION ALL SELECT 'b,ccc,'

UNION ALL SELECT 'ddd'

---==================================

-- Destination table

DECLARE @Result TABLE (Id INT NOT NULL IDENTITY, SourceId INT, Txt VARCHAR(100), IsJoin BIT)

---==================================

-- Delimiter

DECLARE @Delimiter VARCHAR(20)

DECLARE @DelimiterLen INT

SELECT @Delimiter = ','

SELECT @DelimiterLen = LEN(@Delimiter)

---==================================

-- Split items without a required leading/trailing delimiter

--

-- IMPORATNT:

-- If the source row was determied correct (with the delimiter) this will return an empty row

INSERT INTO @Result (

SourceId,

Txt

)

SELECT

s.Id,

l.Item

FROM @Source s

CROSS APPLY

(

SELECT

SUBSTRING(s.Txt, 1, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, 1) - 1, -1), LEN(s.Txt))) Item,

1 Sorting

UNION ALL

SELECT TOP 100 PERCENT

SUBSTRING(s.Txt, t.N + @DelimiterLen, ISNULL(NULLIF(CHARINDEX(@Delimiter, s.Txt, t.N + 1) - t.N - 1, -t.N - 1), LEN(s.Txt) - t.N)) Item,

2 Sorting

FROM Tally t

WHERE t.N <= LEN(s.Txt)

AND SUBSTRING(s.Txt, t.N, @DelimiterLen) = @Delimiter

ORDER BY Sorting, t.N

) l

---==================================

-- Move fragments from next row up to the previous

; WITH

last_item (Id) AS

(

SELECT MAX(Id) FROM @Result GROUP BY SourceId

)

UPDATE r1 SET

Txt = r1.Txt + r2.Txt,

IsJoin = 1

FROM last_item li

JOIN @Result r1 ON li.Id = r1.Id

JOIN @Result r2 ON r1.Id + 1 = r2.Id

WHERE

r1.Txt != ''

---==================================

-- Delete the rows which have been joined up to the previous

DELETE r1

FROM @Result r1

JOIN @Result r2 ON r1.Id = r2.Id + 1

WHERE r2.IsJoin = 1

---==================================

-- Delete the empty last rows which resulted by the initial

-- tally split

; WITH

last_item AS ( SELECT MAX(Id) Id FROM @Result GROUP BY SourceId )

DELETE r

FROM last_item li

JOIN @Result r ON li.Id = r.Id

WHERE r.Txt = ''

---==================================

-- Result

SELECT * FROM @Source

SELECT * FROM @Result

[/font]

Greets

Flo

The more I learn, the more I know what I do not know
Blog: Things about Software Architecture, .NET development and T-SQL

How to Post Data/Code to get the best Help How to Post Performance Problems

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 8

I was kind of hoping the data was being imported from a file and that each comment, despite the number of whole and partial lines it may occupy, would have a unique delimiter. Then you could just bulk insert the whole bloody thing using the comment delimiter as the end of line delimiter and simply replace any CrLf's with empty strings.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Florian Reischl SSC-Dedicated Points: 37299 More actions · Answer 9

Hi Jeff

I saw your question and you are correct. If the data are imported by a file this would be the best approach.

But I already started with my solution and didn't want to hit "Ctrl+A" and "Del" 😛

Greets

Flo

The more I learn, the more I know what I do not know
Blog: Things about Software Architecture, .NET development and T-SQL

How to Post Data/Code to get the best Help How to Post Performance Problems

UMG Developer SSChampion Points: 13482 More actions · Answer 10

Flo,

Yes the data is all text. I'll have to take a look at your proposed solution, but it would need some work to handle all the possible crap data cases. If you use this for the source, you loose a hole set of the Ds:

INSERT INTO @Source

SELECT 'aaa,bb'

UNION ALL SELECT 'b,ccc,ddd'

UNION ALL SELECT 'ddd'

UNION ALL SELECT 'ddd,e'

UNION ALL SELECT 'eee'

UNION ALL SELECT ',fff'

This is why I thought it would just be easier to roll the parts all up and then split it. (But maybe the resulting really long strings make the splitting go slower?)

Jeff,

I've never heard the phrase to "pork chop" someone what does it mean? There isn't much I can do about where it comes from. It isn't even complete data because of a "bug" in their software that stores chunks longer than 4000 characters sometimes, but they ETL it out of the source system into Oracle into a VARCHAR2 field which has a limit of 4000 characters. This is only a problem for about 150 of the 4+ million rows so it isn't a huge deal. (And it isn't like we can do anything about it, they are thinking about how they might solve the problem...)

Each row has a comment ID and a part number in addition to the comment part so that they can be put back together in the correct order.

The delimiter between lines is a: |

The source parts are between 18 and 4000 characters, with an average of 220 characters. (This could change at any time as new data is added.)

The lines (once split) are between 0 and 733 characters, with an average of 42 characters. (Again this could change as new data is added.)

I get the data from an Oracle 10g DB. (I could do some processing on Oracle, but I use SQL Server more and that is where the final analysis needs to be anyhow.)

There are no CR/LFs in the data. (At least at this point.)

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 11

Florian Reischl (5/10/2009)

Hi Jeff

I saw your question and you are correct. If the data are imported by a file this would be the best approach.

But I already started with my solution and didn't want to hit "Ctrl+A" and "Del" 😛

Greets

Flo

LOL! Man, do I ever understand that! 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 12

UMG Developer (5/10/2009)

I've never heard the phrase to "pork chop" someone what does it mean?

There's a pretty long story behind it, but it basically means taking the person responsible for your dilema out for a nice pork chop dinner... where you will tie that person to a chair and feed that person pork chops... at point blank range with a hefty sling shot. Most folks on this forum have come to know the phrase and I constantly forget not to use it around someone new to the forum. My apologies.

There isn't much I can do about where it comes from. It isn't even complete data because of a "bug" in their software that stores chunks longer than 4000 characters sometimes, but they ETL it out of the source system into Oracle into a VARCHAR2 field which has a limit of 4000 characters. This is only a problem for about 150 of the 4+ million rows so it isn't a huge deal. (And it isn't like we can do anything about it, they are thinking about how they might solve the problem...)

Understood.

Each row has a comment ID and a part number in addition to the comment part so that they can be put back together in the correct order.

The delimiter between lines is a: |

The source parts are between 18 and 4000 characters, with an average of 220 characters. (This could change at any time as new data is added.)

The lines (once split) are between 0 and 733 characters, with an average of 42 characters. (Again this could change as new data is added.)

I get the data from an Oracle 10g DB. (I could do some processing on Oracle, but I use SQL Server more and that is where the final analysis needs to be anyhow.)

There are no CR/LFs in the data. (At least at this point.)

You say you get the data from an Oracle 10 DB... do you mean directly or is it in a text file? Even if it's not in a text file, see if you can get them to give it to you that way because, from everything you described, what I said about using bulk insert will work very quickly and almost by magic.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Phil Factor SSC-Insane Points: 20224 More actions · Answer 13

Jeff Moden

Any word on this, Phil? Copying code from the code windows makes some really ugly stuff with extra line breaks, no indentation, and all those nasty things I posted on the test thread for the code windows.

I spoke to Mel at RedGate last week. He hasn't given me a date yet, but I've given him the full force of the urgency with which you want it done. Sorry I've been a bit quiet on this thread but I've been v busy- but I am trying to follow things. It is getting very interesting.

Best wishes,
Phil Factor

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 14

Phil Factor (5/10/2009)

Jeff Moden

Any word on this, Phil? Copying code from the code windows makes some really ugly stuff with extra line breaks, no indentation, and all those nasty things I posted on the test thread for the code windows.

I spoke to Mel at RedGate last week. He hasn't given me a date yet, but I've given him the full force of the urgency with which you want it done. Sorry I've been a bit quiet on this thread but I've been v busy- but I am trying to follow things. It is getting very interesting.

Heh... Like Mel will listen to me. It's not for me... it's for all the others that may want to copy code from this good forum. To wit...

http://www.thecounter.com/stats/2009/March/browser.php

I do appreciate the feedback, though. Thanks, Phil.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

UMG Developer SSChampion Points: 13482 More actions · Answer 15

Jeff Moden (5/10/2009)You say you get the data from an Oracle 10 DB... do you mean directly or is it in a text file? Even if it's not in a text file, see if you can get them to give it to you that way because, from everything you described, what I said about using bulk insert will work very quickly and almost by magic.

I get it directly from the Oracle sever via linked server setup in SQL 2k5. I can certainly dump it to a text file if that makes it easier or faster. But frankly I'm pretty happy with the ~2 minutes to roll the parts together and the ~15 minutes to split it out into lines. On the other hand if I can make it even faster then all the better.

I'm curios about the bulk insert method, so you're saying it can take something like this:



ID Part Comment

-- ---- ---------

1  1    Line 1|Line 2 part 1

1  2    Line 2 part 2

1  3    Line 2 part 3|Line 3|

1  4    Line 4|

2  1    Line 1|

2  2    Line 2|

and give me something like this:



ID Line Comment

-- ---- ---------

1  1    Line 1

1  2    Line 2 part 1Line 2 part 2Line 2 part 3

1  3    Line 3

1  4    Line 4

2  1    Line 1

2  2    Line 2

with very little work?