Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Excellent article! Ready to use to mentor development team!

    Great elegant use of Conditional Split in Incremental loads

    Have you compared the performance of this method with the more traditional checksum method? The checksum column is created in the destination table and compared with a run-time checksum calculated from the input data. A match is ignored; a mismatch results in an update of the entire row. What do you think are the advantages of this method vs. checksum?

    Thanks

  • Excellent article.

    Congratulations

  • Hi,

    I have tried this out. I still observed that truncate/insert is much faster than the incremental load using the MD5 process. For 800,000 records, truncate/insert took 33 seconds to load while the incremental process took 60 seconds. Please comment on this if I am missing anything while following the incremental process posted.

    Thanks

    Vinay

  • The Truncate/insert method will only work if your source data also keeps history. If not, you're stuck with incremental load.

    Furthermore, at the beginning truncate/insert will surely work faster than the incremental load, but what if you have 300,000 records a day? After two years, the incremental load will be faster, as it will only have to process those 300,000 records. Your insert statement will have to insert, in the worst case, over 200 million records.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • aruzhans (6/9/2010)


    Excellent article! Ready to use to mentor development team!

    Great elegant use of Conditional Split in Incremental loads

    Have you compared the performance of this method with the more traditional checksum method? The checksum column is created in the destination table and compared with a run-time checksum calculated from the input data. A match is ignored; a mismatch results in an update of the entire row. What do you think are the advantages of this method vs. checksum?

    Thanks

    The advantage of using MD5 or SHA-1 over a traditional checksum is simply the lower chance of hash collisions, which is going to be small either way if you are using the hash plus the ID column to identify your records.
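    A quick, hedged illustration (not from the article, and strictly a missed change rather than a random collision): under a case-insensitive default collation, CHECKSUM returns the same value for both strings below, while HASHBYTES does not.

    SELECT CHECKSUM('hello')          AS checksum_lower,
           CHECKSUM('HELLO')          AS checksum_upper,   -- same value under a case-insensitive collation
           HASHBYTES('MD5', 'hello')  AS md5_lower,
           HASHBYTES('MD5', 'HELLO')  AS md5_upper;        -- different values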

  • da-zero (6/10/2010)


    The Truncate/insert method will only work if your source data also keeps history. If not, you're stuck with incremental load.

    Furthermore, at the beginning truncate/insert will surely work faster than the incremental load, but what if you have 300,000 records a day? After two years, the incremental load will be faster, as it will only have to process those 300,000 records. Your insert statement will have to insert, in the worst case, over 200 million records.

    Agreed on these points.

    Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination. An incremental ETL process will be slower in some cases, but your data integrity and availability are much, much higher. This article is about making the incremental ETL as fast as it can be by checking only a hash for changes instead of checking every column.
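    For what it's worth, the same idea can be written in plain T-SQL outside of SSIS (the table, key, and hash column names below are invented for illustration): only rows whose stored hash differs from the incoming one get updated at all.

    UPDATE d
    SET    d.ColB    = s.ColB,
           d.ColC    = s.ColC,
           d.RowHash = s.RowHash
    FROM   dbo.DestinationTable AS d
    JOIN   stg.SourceTable      AS s ON s.BusinessKey = d.BusinessKey
    WHERE  d.RowHash <> s.RowHash;   -- assumes RowHash is not nullable; unchanged rows are skipped entirely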

  • UMG, I've used the Binary_CheckSum function as well and normally keep it as a calculated column that forms part of the Business key Index as an include.

    However, there is a downside to using Binary_Checksum, in that if you are dealing with decimals that change by a factor of 10, e.g. 1001.00 becomes 100.10 or 10.01 (or the reverse), then the BCS doesn't recognise these as changes and the checksum remains the same.

    Not an ideal situation when you are checking financials.
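    If you want to verify this on your own instance, a query along these lines will show whether the checksums really do collide (the literals are just the ones from the example above; results may vary by SQL Server version):

    SELECT BINARY_CHECKSUM(CAST(1001.00 AS decimal(10, 2))) AS bcs_1001_00,
           BINARY_CHECKSUM(CAST(100.10  AS decimal(10, 2))) AS bcs_100_10,
           BINARY_CHECKSUM(CAST(10.01   AS decimal(10, 2))) AS bcs_10_01;
    -- if these come back identical, BINARY_CHECKSUM will not flag the change,
    -- whereas an MD5 or SHA-1 hash of the character representation will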

    MD5 or SHA-1 are much better; however, my concern is about the reusability of the code, as you have to cut and paste it into every data flow. A better option would be a custom SSIS task that allows you to select all the columns you want to be hashed and attaches the hashed column to the output stream.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Brett Flippin (6/10/2010)

    ...

    Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.

    ...

    If you actually build a second table, work on that, then drop the older table and sp_rename the newer (or use synonyms if you're not still on SQL Server 2000), then you can indeed fail gracefully and have full availability during almost the entire update.
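    For anyone who hasn't seen the pattern, the swap described above boils down to something like this (object names are invented, and the new table is assumed to be fully loaded and indexed before the swap):

    BEGIN TRAN;
        EXEC sp_rename 'dbo.FactSales', 'FactSales_Old';
        EXEC sp_rename 'dbo.FactSales_New', 'FactSales';
    COMMIT;
    -- readers only ever see the old table or the new one, never a half-loaded table
    DROP TABLE dbo.FactSales_Old;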

  • Nadrek (6/14/2010)


    Brett Flippin (6/10/2010)

    ...

    Also with a traditional truncate/load ETL process you lose the ability of the ETL process to fail gracefully and still retain your data in the destination.

    ...

    If you actually build a second table, work on that, then drop the older table and sp_rename the newer (or use synonyms if you're not still on SQL Server 2000), then you can indeed fail gracefully and have full availability during almost the entire update.

    That could work, but having database object or table structure manipulation be part of your ETL process wouldn't be optimal in my book. Would you really want to rebuild indexes or recreate statistics after each ETL run? What about if you're partitioning?

    I could see that working in a very very limited set of circumstances but I don't know that even then I'd recommend it.

  • Jason Lees-299789 (6/14/2010)


    ...a better option would be a custom SSIS task that allows you to select all the columns you want to be hashed and attaches the hashed column to the output stream.

    Check out the Multiple Hash component: SSIS Multiple Hash Component.

    Todd McDermid - SQL Server MVP, MCSD.Net, MCTS (SQL08BI)
    Find great SSIS extensions - most are open source - at SSIS Community Tasks and Components
    Are you doing Data Warehouse ETL with SSIS? Don't use the slow SCD Wizard, use the Kimball Method SCD component instead.

  • Todd McDermid (6/14/2010)


    Jason Lees-299789 (6/14/2010)


    ...a better option would be a custom SSIS task that allows you to select all the columns you want to be hashed and attaches the hashed column to the output stream.

    Check out the Multiple Hash component: SSIS Multiple Hash Component.

    Thanks for the pointer Todd, I'll check it out.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Hey, great article. Just two quick comments: there is an SSISMultipleHash task on CodePlex that is very nice for this (I'm not the developer, but I use it); you can find it here: http://ssismhash.codeplex.com/.

    Then, on the idea of doing the hash in the source instead of in the SSIS flow: I think that is a really good idea, and it can make it so that you don't have to store the hash in the database table. There are some considerations to look at to make sure that the hash algorithm is always the same (http://weblogs.sqlteam.com/mladenp/archive/2009/04/28/Comparing-SQL-Server-HASHBYTES-function-and-.Net-hashing.aspx).

    Also, if you store the hash in the database, then you might want to add a trigger to the table that sets the hash to NULL whenever the record is changed without the hash being recalculated, i.e. a change made outside of your SSIS package. If the stored hash doesn't get updated, the SSIS package won't recognize that the two records are out of sync. Hope that makes sense!
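    A rough sketch of such a trigger, using invented table and column names (adapt the key and hash columns to your own schema), might look like this:

    CREATE TRIGGER trg_DimCustomer_InvalidateHash
    ON dbo.DimCustomer
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- only act when the hash column itself was not part of the update,
        -- i.e. the change came from somewhere other than the SSIS package
        IF NOT UPDATE(RowHash)
        BEGIN
            UPDATE d
            SET    d.RowHash = NULL
            FROM   dbo.DimCustomer AS d
            JOIN   inserted        AS i ON i.CustomerKey = d.CustomerKey;
        END;
    END;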

  • Hello Brett,

    Thank you for the wonderful article.

    I was trying to install your example but for some reason it's not working for me. Is there any possibility that you could share the full code in your script component with us?

    Thanks

  • It's really a nice article, but I think it can fail in the following scenario.

    1) There are more than two columns.

    2) At least two records are there.

    3) The NULLs alternate between columns, so the value held in one column of one record appears in the adjacent column of the other record.

    Ex.

    Col-A Col-B Col-C Col-D

    2 1 111 NULL

    3 1 NULL 111

    Both records will generate the same hash code.

    Regards,
    Pravasis

  • pravasis (7/14/2010)


    It's really a nice article, but I think it can fail in the following scenario.

    1) There are more than two columns.

    2) At least two records are there.

    3) The NULLs alternate between columns, so the value held in one column of one record appears in the adjacent column of the other record.

    Ex.

    Col-A Col-B Col-C Col-D

    2 1 111 NULL

    3 1 NULL 111

    Both records will generate the same hash code.

    You can prevent this by using coalesce or isnull for the columns that could contain NULL values.
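    For example, if the hash input is built up in the source query, something like this keeps the NULL from collapsing into its neighbour (the column names are illustrative only); the delimiter also guards against similar boundary problems between non-NULL values:

    SELECT HASHBYTES('MD5',
               ISNULL(CAST(ColB AS varchar(20)), 'NULL') + '|' +
               ISNULL(CAST(ColC AS varchar(20)), 'NULL') + '|' +
               ISNULL(CAST(ColD AS varchar(20)), 'NULL')) AS RowHash
    FROM dbo.SourceTable;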

