Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Excellent article! Ready to use to mentor development team!

    Great elegant use of Conditional Split in Incremental loads

    Have you compared the performance of this method with the more traditional checksum method? The checksum column is created in the destination table and compared with a run-time checksum calculated from the input data. A match is ignored; a mismatch results in an update of the entire row. What do you think are the advantages of this method vs. checksum?

    Thanks

  • Excellent article.

    Congratulations

  • Hi,

    I have tried this out. I still observed that truncate/insert is much faster than the incremental load using the MD5 process. For 800,000 records, truncate/insert took 33 seconds to load while the incremental process took 60 seconds. Please comment on this if I am missing anything while following the incremental process posted.

    Thanks

    Vinay

  • The Truncate/insert method will only work if your source data also keeps history. If not, you're stuck with incremental load.

    Furthermore, at the beginning truncate/insert will surely work faster than the incremental load, but what if you have 300,000 records a day? After two years, the incremental load will be faster, as it will only have to process those 300,000 records. Your insert statement will have to insert, in the worst case, over 200 million records.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • aruzhans (6/9/2010)


    Excellent article! Ready to use to mentor development team!

    Great elegant use of Conditional Split in Incremental loads

    Have you compared the performance of this method with the more traditional checksum method? The checksum column is created in the destination table and compared with a run-time checksum calculated from the input data. A match is ignored; a mismatch results in an update of the entire row. What do you think are the advantages of this method vs. checksum?

    Thanks

    The advantage of using MD5 or SHA-1 over a traditional checksum is simply the lower chance of hash collisions, which is going to be small either way if you are using the hash plus the ID column to identify your records.
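    A quick, hedged illustration (not from the article, and strictly a missed change rather than a random collision): under a case-insensitive default collation, CHECKSUM returns the same value for both strings below, while HASHBYTES does not.

    SELECT CHECKSUM('hello')          AS checksum_lower,
           CHECKSUM('HELLO')          AS checksum_upper,   -- same value under a case-insensitive collation
           HASHBYTES('MD5', 'hello')  AS md5_lower,
           HASHBYTES('MD5', 'HELLO')  AS md5_upper;        -- different values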

  • da-zero (6/10/2010)


    The Truncate/insert method will only work if your source data also keeps history. If not, you're stuck with incremental load.

    Furthermore, at the beginning truncate/insert will surely work faster than the incremental load, but what if you have 300,000 records a day? After two years, the incremental load will be faster, as it will only have to process those 300,000 records. Your insert statement will have to insert, in the worst case, over 200 million records.

    Agreed on these points.

    Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination. An incremental ETL process will be slower in some cases, but your data integrity and availability are much, much higher. This article is about making the incremental ETL as fast as it can be by checking only a hash for changes instead of checking every column.
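    For what it's worth, the same idea can be written in plain T-SQL outside of SSIS (the table, key, and hash column names below are invented for illustration): only rows whose stored hash differs from the incoming one get updated at all.

    UPDATE d
    SET    d.ColB    = s.ColB,
           d.ColC    = s.ColC,
           d.RowHash = s.RowHash
    FROM   dbo.DestinationTable AS d
    JOIN   stg.SourceTable      AS s ON s.BusinessKey = d.BusinessKey
    WHERE  d.RowHash <> s.RowHash;   -- assumes RowHash is not nullable; unchanged rows are skipped entirely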

  • UMG, I've used the Binary_CheckSum function as well and normally keep it as a calculated column that forms part of the Business key Index as an include.

    However, there is a downside to using Binary_Checksum, in that if you are dealing with decimals that change by a factor of 10, e.g. 1001.00 becomes 100.10 or 10.01 (or the reverse), then the BCS doesn't recognise these as changes and the checksum remains the same.

    Not an ideal situation when you are checking financials.
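    If you want to verify this on your own instance, a query along these lines will show whether the checksums really do collide (the literals are just the ones from the example above; results may vary by SQL Server version):

    SELECT BINARY_CHECKSUM(CAST(1001.00 AS decimal(10, 2))) AS bcs_1001_00,
           BINARY_CHECKSUM(CAST(100.10  AS decimal(10, 2))) AS bcs_100_10,
           BINARY_CHECKSUM(CAST(10.01   AS decimal(10, 2))) AS bcs_10_01;
    -- if these come back identical, BINARY_CHECKSUM will not flag the change,
    -- whereas an MD5 or SHA-1 hash of the character representation will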

    MD5 or SHA-1 are much better; however, my concern is about the reusability of the code, as you have to cut and paste it into every data flow. A better option would be a custom SSIS task that allows you to select all the columns you want to be hashed and attaches the hashed column to the output stream.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Brett Flippin (6/10/2010)

    ...

    Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.

    ...

    If you actually build a second table, work on that, then drop the older table and sp_rename the newer (or use synonyms if you're not still on SQL Server 2000), then you can indeed fail gracefully and have full availability during almost the entire update.
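    For anyone who hasn't seen the pattern, the swap described above boils down to something like this (object names are invented, and the new table is assumed to be fully loaded and indexed before the swap):

    BEGIN TRAN;
        EXEC sp_rename 'dbo.FactSales', 'FactSales_Old';
        EXEC sp_rename 'dbo.FactSales_New', 'FactSales';
    COMMIT;
    -- readers only ever see the old table or the new one, never a half-loaded table
    DROP TABLE dbo.FactSales_Old;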

  • Nadrek (6/14/2010)


    Brett Flippin (6/10/2010)

    ...

    Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.

    ...

    If you actually build a second table, work on that, then drop the older table and sp_rename the newer (or use synonyms if you're not still on SQL Server 2000), then you can indeed fail gracefully and have full availability during almost the entire update.

    That could work, but having database object or table structure manipulation be part of your ETL process wouldn't be optimal in my book. Would you really want to rebuild indexes or recreate statistics after each ETL run? What about if you're partitioning?

    I could see that working in a very very limited set of circumstances but I don't know that even then I'd recommend it.

  • Jason Lees-299789 (6/14/2010)


    ...a better option would be a custom SSIS task that allows you to select all the columns you want to be hashed and attaches the hashed column to the output stream.

    Check out the Multiple Hash component: SSIS Multiple Hash Component.

    Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
    Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
    Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.

  • Todd McDermid (6/14/2010)


    Jason Lees-299789 (6/14/2010)


    ...a better option would be a custom SSIS task that allows you to select all the columns you want to be hashed and attaches the hashed column to the output stream.

    Check out the Multiple Hash component: SSIS Multiple Hash Component.

    Thanks for the pointer Todd, I'll check it out.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Hey, great article. Just two quick comments: there is an SSISMultipleHash task on CodePlex that is very nice for this (I'm not the developer, but I use it); you can find it here: http://ssismhash.codeplex.com/.

    Then, on the idea of doing the hash in the source instead of in the SSIS flow: I think that is a really good idea, and it can make it so that you don't have to store the hash in the database table. There are some considerations to look at to make sure that the hash algorithm is always the same (http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx).

    Also, if you store the hash in the database, then you might want to add a trigger to the table that sets the hash to NULL whenever the record is changed without the hash being recalculated, i.e. a change made outside of your SSIS package. If the stored hash doesn't get updated, the SSIS package won't recognize that the two records are out of sync. Hope that makes sense!
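    A rough sketch of such a trigger, using invented table and column names (adapt the key and hash columns to your own schema), might look like this:

    CREATE TRIGGER trg_DimCustomer_InvalidateHash
    ON dbo.DimCustomer
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- only act when the hash column itself was not part of the update,
        -- i.e. the change came from somewhere other than the SSIS package
        IF NOT UPDATE(RowHash)
        BEGIN
            UPDATE d
            SET    d.RowHash = NULL
            FROM   dbo.DimCustomer AS d
            JOIN   inserted        AS i ON i.CustomerKey = d.CustomerKey;
        END;
    END;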

  • Hello Brett,

    Thank you for the wonderful article.

    I was trying to install your example but for some reason it's not working for me. Is there any possibility that you could share the full code in your script component with us?

    Thanks

  • It's really a nice article, but I think it can fail in the following scenario.

    1) There are more than two columns.

    2) At least two records are there.

    3) The NULLs alternate between columns, so the value held in one column of one record appears in the adjacent column of the other record.

    Ex.

    Col-A Col-B Col-C Col-D

    2 1 111 NULL

    3 1 NULL 111

    Both records will generate the same hash code.

    Regards,
    Pravasis

  • pravasis (7/14/2010)


    It's really a nice article, but I think it can fail in the following scenario.

    1) There are more than two columns.

    2) At least two records are there.

    3) The NULLs alternate between columns, so the value held in one column of one record appears in the adjacent column of the other record.

    Ex.

    Col-A Col-B Col-C Col-D

    2 1 111 NULL

    3 1 NULL 111

    Both records will generate the same hash code.

    You can prevent this by using coalesce or isnull for the columns that could contain NULL values.
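    For example, if the hash input is built up in the source query, something like this keeps the NULL from collapsing into its neighbour (the column names are illustrative only); the delimiter also guards against similar boundary problems between non-NULL values:

    SELECT HASHBYTES('MD5',
               ISNULL(CAST(ColB AS varchar(20)), 'NULL') + '|' +
               ISNULL(CAST(ColC AS varchar(20)), 'NULL') + '|' +
               ISNULL(CAST(ColD AS varchar(20)), 'NULL')) AS RowHash
    FROM dbo.SourceTable;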

