Can I parallelise Foreach tasks?

  • Part of our ETL package loads a number of large data files into one table. The files all have the same format, but the file names change each run, and are given in a list loaded in a previous step. So the obvious way to do this is in a data flow in a Foreach file loop, which works fine.

    Except that it's a serial process, and it's taking longer as the size and number of files grow. Our server has fast disks and two 4-core hyperthreaded Xeon processors, so it can run 16 threads at once - but while this load is running, one thread runs at 100% and all the others sit doing nothing. SSIS is great at multi-threading when a data flow has multiple paths, but not in a Foreach loop.

    Is there a practical way to split the data flow within the Foreach into 4 or 8 threads, or to fire off a data flow task asynchronously? I've thought of using ExecProc(CodePath) to fire off multiple completely independent DTExec.exe tasks, but that seems rather top-heavy.

    The other question is whether there is an easy way to re-synchronise after all the files are loaded, before starting a step which uses the loaded data.

  • The answer is yes, though there is a lot to it. One of the bigger issues is breaking up the list into chunks if there are no natural breaks.

    I'll be able to write more later.

    CEWII

  • Thanks. I'm about to read Part 2 (which was highlighted in today's email from SQLServerCentral).

    I solved my problem after thinking about it more, and being inspired by Part 1 of your article. I built a system that saved the list of files in a table of ([Filename], [InUse]), which was then read by 8 identical copies of my data flow. Each copy starts with a script component which acts as the data source. It gets a file name to read in, does some processing on some fields, and sends the result to the output buffer. I had one copy of that data flow already; the new part is that each now runs a While loop that gets files to process using EXEC GetFILELIST, which runs this SQL to get the next name that's not already in use:

    WITH Nextfile AS
    (
        SELECT TOP 1 [Filename] AS Next
        FROM [dbo].[FILELIST]
        WHERE [InUse] = 0
    )
    UPDATE [dbo].[FILELIST]
    SET [InUse] = 1
    OUTPUT INSERTED.[Filename]
    FROM [dbo].[FILELIST] fl WITH (READCOMMITTEDLOCK, READPAST)
    INNER JOIN Nextfile ON fl.[Filename] = Next;
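
    The FILELIST table behind this is nothing fancy - something along these lines (a sketch: the [Filename] and [InUse] columns are the ones described above, but the exact types and sizes here are assumptions):

    -- Sketch of the shared control table (column types/sizes are assumptions)
    CREATE TABLE [dbo].[FILELIST]
    (
        [Filename] nvarchar(260) NOT NULL PRIMARY KEY, -- full path of one input file
        [InUse]    bit           NOT NULL DEFAULT 0    -- 0 = not yet claimed, 1 = claimed by a thread
    );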

    At first I was getting lots of deadlocks, so I put the whole thing in a Try/Catch block to retry on deadlock. I then added the WITH (READCOMMITTEDLOCK, READPAST) hint, which makes the query skip rows that already have locks on them, so it reads the next row that is NOT already being read/updated by another thread.

    This works great, as each thread processes as many files as it can until they're all done. Now I'm going to read your article Part 2 and see if there's an even better way to do it!

    My full SQL GetFILELIST procedure, including retry and logging (my ToLOG proc), has grown a bit; if anyone's interested, here's the full thing:

    -- Create procedure GetFILELIST - gets the next file from the list (or nothing), and handles deadlocks
    CREATE PROCEDURE [dbo].[GetFILELIST]
    ( @ClientId nchar(2) )
    AS
    -- Get the next file name from the FILELIST table
    -- Retry on error 1205 based on http://msdn.microsoft.com/en-us/library/ms179296.aspx
    DECLARE @retry INT = 5; -- Try up to 5 times (1 + 4 retries)
    DECLARE @msg varchar(100);

    WHILE (@retry > 0) -- Keep trying to update the table if this task is selected as the deadlock victim
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;

            -- Claim the next unclaimed file name and return it to the caller
            WITH Nextfile AS
            (
                SELECT TOP 1 [Filename] AS Next
                FROM [dbo].[FILELIST]
                WHERE [InUse] = 0
            )
            UPDATE [dbo].[FILELIST]
            SET [InUse] = 1
            OUTPUT INSERTED.[Filename]
            FROM [dbo].[FILELIST] fl WITH (READCOMMITTEDLOCK, READPAST)
            INNER JOIN Nextfile ON fl.[Filename] = Next;

            SET @retry = 0; -- All done
            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            -- Error! If it was a deadlock, try again - otherwise die now
            IF (ERROR_NUMBER() = 1205)
            BEGIN
                SET @retry -= 1;
                IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;
                IF @retry > 0
                BEGIN
                    SET @msg = 'Deadlock! Retries remaining: ' + CONVERT(VARCHAR(6), @retry);
                    EXEC [dbo].[ToLOG] @ClientId, 0, 'GetFILELIST', @msg;
                    WAITFOR DELAY '00:00:01'; -- Wait before trying again
                END
                ELSE
                    EXEC [dbo].[ToLOG] @ClientId, 6, 'GetFILELIST', 'Deadlock retry limit reached';
            END
            ELSE
            BEGIN
                SET @retry = -1;
                IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;
                SET @msg = 'Error ' + CONVERT(VARCHAR(6), ERROR_NUMBER()) + ' ' + ERROR_MESSAGE();
                EXEC [dbo].[ToLOG] @ClientId, 6, 'GetFILELIST', @msg;
            END
        END CATCH;
    END; -- End WHILE loop
    GO

    All that just to get a file name for my script component to use - but it does it reliably, and I learnt a lot doing it!
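
    In case it helps anyone, each copy of the data flow then just calls the procedure from its script component's loop and stops when nothing comes back - along these lines (a sketch; the client id value is only an example):

    -- Claim the next file for this thread; an empty result set means no files are left
    EXEC [dbo].[GetFILELIST] @ClientId = N'01';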

  • I generally use control structures to manage concurrency where I can; I only lock them long enough to get the work to do and to mark that I have it. I try to lock the pages in the table for as short a time as possible, for maximum concurrency.

    And I treat the control structures as separate from my app structures. I get the work from the control structure and perform it on the app structures, then I go back to the control structure and record that I completed the work.

    CEWII

  • If by Control and App you mean control flow and data flow, I more or less do that. The file list is made and loaded into the FILELIST table by a script in the control flow, then the components in the data flow do the SELECT+UPDATE transaction to each get a unique next file name - and the query should indeed be fast.

    I'm now getting 50-60% CPU occupancy on our server, which means all 8 cores are busy, sometimes with more than one thread. Before, I was getting 6-7% (just one core busy). And in practical terms, having each thread load as many files as it can works nicely too. The files are of varying sizes, so one thread might load just one big file and another several small ones, but the total number of rows each thread loads is about the same. Total elapsed time is cut to roughly a third of what it was - not the full 8x, because the other steps in the process are unaffected.

  • I was unclear then. By app structures I mean the things you are actually taking actions on; there can and probably will be locking/blocking on those as each thread runs. You reference the app structures by records in the control structure. I use the control structure to manage concurrency: in my example you have a table with an ID in it along with some datetime columns - a really simplified view, but it accurately portrays what I'm talking about. The control table does NOTHING but dispense work to do, keep track of what has completed, and make sure that a failed process gets retried. All of these events are very discrete. You get a unit of work by locking it in the control table, which provides for restartability of a failed id. The locking of the record is effectively a semaphore lock, meaning the table/record is only locked briefly to record the fact that it is taken, and all processes honor this scheme.

    An example would be where the id is one of a list of customer ids to run these processes against - a job that set-based processing isn't a good fit for. The id value is the link between the control structure and the app structures; a rough sketch of the idea is below.
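
    A minimal sketch of that kind of control table and its claim/complete steps (the table and column names here are just illustrative, not from any real system):

    -- Illustrative control table: one row per unit of work (e.g. a customer id)
    CREATE TABLE [dbo].[WorkControl]
    (
        [CustomerId]  int      NOT NULL PRIMARY KEY,
        [ClaimedAt]   datetime NULL,  -- set when a thread takes the work
        [CompletedAt] datetime NULL   -- set when the thread reports success
    );
    GO

    -- Claim one unit of work: lock the row only long enough to mark it taken,
    -- skipping rows other threads have already claimed (the "semaphore" lock)
    DECLARE @Claimed TABLE ([CustomerId] int);

    UPDATE TOP (1) wc
    SET [ClaimedAt] = GETDATE()
    OUTPUT INSERTED.[CustomerId] INTO @Claimed
    FROM [dbo].[WorkControl] wc WITH (READPAST, ROWLOCK, UPDLOCK)
    WHERE [ClaimedAt] IS NULL;

    -- ... do the real work against the app structures here ...

    -- Then go back to the control table and record that the work is complete;
    -- rows claimed but never completed can be reset and retried later
    UPDATE wc
    SET [CompletedAt] = GETDATE()
    FROM [dbo].[WorkControl] wc
    INNER JOIN @Claimed c ON c.[CustomerId] = wc.[CustomerId];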

    Is that any clearer?

    Also, I'm glad to see that your utilization of your H/W has gone up so significantly; how have the run-times been affected?

    CEWII
