How to filter an extremely large table using a result set as a filter?

  • Hi.

    Please help....

    Scenario:

    SSIS gets a table filtered by date from Server A. The result is written in a local database.

    Now I want to use this result set to filter an extremely large table (100 million rows) on Server B.

    I tried to pass the result set (approx. 20,000 rows) as XML to the SELECT statement of the OLE DB data flow source as a parameter.

    The problem is that SSIS doesn't know the XML data type, so the user variable used is a string. And a string in SSIS can hold only 4,000 characters. :crying:

    Is there a better way (or a best practice) to filter an extremely large table using a result set? :blush:

    Thanks for your help.

  • If both servers are SQL Server, I'd be tempted to use a linked server to write one query joining the two tables/queries.
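
    Something like this, as a rough sketch (assuming a linked server named ServerB; the database, table, and column names are illustrative, not from your setup):

        -- Hedged sketch: ServerB is an assumed linked server name;
        -- database, table, and column names are illustrative.
        SELECT big.*
        FROM ServerB.TargetDb.dbo.BigTable AS big
        INNER JOIN dbo.FilteredResultSet AS f
            ON f.FilterKey = big.FilterKey;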

    There are XML components in SSIS specifically for dealing with XML, if you were to pursue that; however, I haven't used them personally and so don't know much about them.

    There are typically two ways to apply the filtering; each has its own advantages, so it's a matter of what suits you best.

    Potentially, given that you have a small (I'm assuming) result set and a very large set, I would start with a Lookup component:

    Create an OLE DB (or SQL Server, if it's local to the SQL Server instance) data source for your large set, and then direct the flow from your source to a Lookup component.

    Then have your Lookup component cache the small set and change the Error Output settings so that either non-matches or matches (whichever you are after) are sent to your data flow. Finally, direct that flow to the destination of your choice.
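
    For the cached set, the Lookup's reference query can stay narrow, along these lines (a sketch only; the table and column names are my assumptions):

        -- Reference query for the Lookup cache: pull only the join key.
        -- Table and column names are illustrative.
        SELECT DISTINCT FilterKey
        FROM dbo.FilteredResultSet;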

    The alternative is to create a source for each of your sets, sort them, direct them to a Merge Join component, and then join the two flows on the fields in question.

    Next, direct the Merge Join output to a Conditional Split, test one of the key fields for NULLs, and then direct the appropriate Conditional Split output to the destination of your choice.

    If you end up trying more than one method, some feedback on what worked better for you would be interesting to us (or maybe just me).

    Hopefully I'm barking up the right tree...

    HTH,

    Sam

  • Hi,

    well, a linked server is not possible because of security restrictions. :crying:

    sam.dahl (12/6/2011)


    Potentially, given that you have a small (I'm assuming) result set and a very large set, I would start with a Lookup component:

    Create an OLE DB (or SQL Server, if it's local to the SQL Server instance) data source for your large set, and then direct the flow from your source to a Lookup component.

    Then have your Lookup component cache the small set and change the Error Output settings so that either non-matches or matches (whichever you are after) are sent to your data flow. Finally, direct that flow to the destination of your choice.

    I guess this would be an option. The problem I see is that the large set is unfiltered and holds more than 1 billion rows. Does SQL Server use the Lookup component to filter the data in the database, or is it filtered in SSIS?

    The second option, with the merge, would probably mean transporting the data into memory and creating temp tables (with 1 billion rows). I guess our server couldn't handle that. :unsure:

    Thanks,

    Regards,

    Thorsten

  • You might try a partially cached lookup to the big table.

    Set up a data source to Server A to get the 20,000 rows. Run the output into a lookup task to the big table on Server B. Set the cache mode to partial cache (or no cache if the lookup keys in your input set are unique). Each row entering the lookup will now hit Server B based on the key value to find a matching row.

    The table on Server B needs to be indexed correctly for this to work. Assuming your 20,000 input rows all have unique lookup values you will be doing 20,000 individual lookups against the table on Server B.
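
    In partial cache (or no cache) mode, each incoming row effectively fires a parameterized query against Server B, roughly like this (names are illustrative; the ? is bound to the current row's key):

        -- Approximation of the per-row query the Lookup issues in
        -- partial/no-cache mode; dbo.BigTable and FilterKey are
        -- illustrative names. An index on FilterKey is what makes
        -- 20,000 of these lookups cheap.
        SELECT FilterKey
        FROM dbo.BigTable
        WHERE FilterKey = ?;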

    Another option might be to have a staging table created somewhere on Server B. You can write result set A to that table, then in another data flow task join it to the big table to do the filtering. Whoever is administering Server B will probably appreciate you putting as little load as possible on it, so they might go for this option. Or not.
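
    Something along these lines on Server B (a hedged sketch; all object names are illustrative):

        -- 1) One-off staging table for the 20,000 keys.
        CREATE TABLE dbo.FilterKeysStaging
        (
            FilterKey INT NOT NULL PRIMARY KEY
        );

        -- 2) An SSIS data flow loads result set A into the staging table.

        -- 3) Let Server B do the filtering with a plain join.
        SELECT big.*
        FROM dbo.BigTable AS big
        INNER JOIN dbo.FilterKeysStaging AS k
            ON k.FilterKey = big.FilterKey;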

    Good luck. 😀

  • MrAkki (12/7/2011)


    The problem I see is that the large set is unfiltered and holds more than 1 billion rows. Does SQL Server use the Lookup component to filter the data in the database, or is it filtered in SSIS?

    The second option, with the merge, would probably mean transporting the data into memory and creating temp tables (with 1 billion rows). I guess our server couldn't handle that. :unsure:

    SSIS operates on a pipeline basis (in memory on the execution server). Before execution, metadata is validated, memory buffers are created, and then any necessary caches are populated (not sure if that's the exact order, but near enough). Then the data sources start drawing records into the buffers. Typically a buffer stores about 10k records at a time.

    So, what you would have in the Lookup scenario (assuming your small set is 20k rows, it shouldn't be too much trouble to fit in memory) is buffer after buffer of records coming in from the source and passing through the Lookup; each record is checked against the cached set to see if it exists. In this way you identify whether each record of the large set exists in the small set.

    In the case of the Merge Join, the component is partially blocking and requires the sources to be sorted. What happens is that the component holds onto the incoming buffers and joins the records together as it can. Due to the size and sorted nature of the buffers, you'll find that the small set's first buffer is held onto until about half of the larger set has passed through before all of its records have been joined. Hence the partial blocking.

    If I understand Eric's suggestion properly then his method allows you to take only the 20,000 records into the data flow and look them up against the much larger set.

    Here's a good description of partial caching: http://blogs.msdn.com/b/mattm/archive/2008/10/18/lookup-cache-modes.aspx

    It depends on which data set is your reference set and what you want to do with the results.

  • Hi,

    thank you all for the information provided.

    I tried to use a Lookup, but the Lookup returns only 1 row! :crazy:

    I don't get it: why hasn't Microsoft provided a data flow task like the Lookup that returns more than one row of the result set?! 🙁

    Isn't there another way? XML is not working, and I cannot put a staging table on the production server, nor can I add a stored procedure or any other code.

    I just need to filter the very large table with 20,000 different values.

    Is there any way to pass a comma-separated string to a SELECT statement in the OLE DB Source (without the 4,000-character restriction)?
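
    Something like the following is what I have in mind on the SQL side, assuming the list could be passed as NVARCHAR(MAX) (all names here are just illustration):

        -- Sketch: split a comma-separated list server-side via XML, then join.
        -- @KeyList and all object names are illustrative; the real list
        -- would come from the SSIS variable.
        DECLARE @KeyList NVARCHAR(MAX) = N'101,205,317';

        DECLARE @x XML =
            CAST(N'<k>' + REPLACE(@KeyList, N',', N'</k><k>') + N'</k>' AS XML);

        SELECT big.*
        FROM dbo.BigTable AS big
        INNER JOIN
        (
            SELECT n.k.value('.', 'INT') AS FilterKey
            FROM @x.nodes('/k') AS n(k)
        ) AS ids
            ON ids.FilterKey = big.FilterKey;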

    Thanks,

    Regards,

    Thorsten

  • sam.dahl (12/7/2011)


    In the case of the Merge Join, the component is partially blocking and requires the sources to be sorted. What happens is that the component holds onto the incoming buffers and joins the records together as it can. Due to the size and sorted nature of the buffers, you'll find that the small set's first buffer is held onto until about half of the larger set has passed through before all of its records have been joined. Hence the partial blocking.

    Merge Join is not working for me. The large table has a size of approx. 900 GB, and I do not want SSIS to load that into memory and sort it. (I guess SSIS was never meant to use tables that large... 😉 )

    Regards,

    Thorsten

  • As I said before, the merge join doesn't hold the entire table in memory unless you are using a fully blocking component (such as a Sort or Aggregation). It continuously reads and writes. SSIS is perfectly capable of this type of work.

    And if you are only getting one row out of the Lookup component, then either you only have one match in the data, or it is not set up correctly.

    See if this helps: http://msdn.microsoft.com/en-us/library/cc952929.aspx

  • sam.dahl (12/15/2011)


    As I said before, the merge join doesn't hold the entire table in memory unless you are using a fully blocking component (such as a Sort or Aggregation). It continuously reads and writes. SSIS is perfectly capable of this type of work.

    And if you are only getting one row out of the Lookup component, then either you only have one match in the data, or it is not set up correctly.

    See if this helps: http://msdn.microsoft.com/en-us/library/cc952929.aspx

    Hi Sam,

    Sorry, but I guess I don't understand it correctly.

    The Lookup returns only one value even though there are multiple matching values in the reference table.

    "If there are multiple matches in the reference table, the Lookup transformation returns only the first match returned by the lookup query. If multiple matches are found, the Lookup transformation generates an error or warning only when the transformation has been configured to load all the reference dataset into the cache. In this case, the Lookup transformation generates a warning when the transformation detects multiple matches as the transformation fills the cache. " http://msdn.microsoft.com/en-us/library/ms141821.aspx

    When I try to use a Merge Join, I need to sort the data to join it. I don't know how to "not sort" the data in a join. Can you help me here? :unsure:

    Thanks,

    Kind Regards,

    Thorsten

  • MrAkki (12/28/2011)


    The lookup returns only one value even there are multiple values in the reference table.

    Correct, a lookup join will only return one value from the reference set for each value that you look up.

    In other words it doesn't do cross-joins. Is that what you were trying to do?

    I thought the problem you had was that you only got one match in total.

    Can you provide a small set of sample data that illustrates what you are trying to do exactly?

    When I try to use a Merge Join, I need to sort the data to join it. I don't know how to "not sort" the data in a join. Can you help me here?

    The problem you have here is that a merge join requires sorted inputs, but you don't want to sort the large table in memory.

    That leaves you two options (technically there are more, but let's stay out of the occult).

    1. Ensure the data is sorted in the database before you read it (if you are trying to join the data by a clustered primary key, for example, then your table would already be sorted). You will still need an ORDER BY clause in your query; the index will just make it faster.

    2. Or else just do the sort in your query and let the database figure it out (may be slower, but if you don't have an index you probably can't or don't want to add one).

    Either way you still need to tell your merge join that the data is sorted, details here: http://www.ssistalk.com/2009/09/17/ssis-avoiding-the-sort-components/
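
    For either option the source query looks the same, something like this (illustrative names; the ORDER BY column must be the join key):

        -- Source query for one Merge Join input: the database does the sort,
        -- SSIS only streams the already-ordered rows. Names are illustrative.
        SELECT FilterKey, SomeColumnYouNeed
        FROM dbo.BigTable
        ORDER BY FilterKey;

    After that, you mark the source output as sorted (IsSorted = True, plus SortKeyPosition on the key column), as the article above walks through.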
