Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes

  • Thanks for the quick response.

    Regards,
    Pravasis

  • This is an excellent article.

    However, I am facing a peculiar problem: sometimes the generated MD5 hash value is NOT the same for the same data. This happens intermittently - say it works fine for 5 days and then not the next day. Also, I have noticed it happens only when the data volume is high (5M rows).

    For around the first 50K records it generated the same hashValue as the previous day (say day 0), but for the rest of the records it generated different hashValues, so it shows the rest of the 5M records as delta. The next day it rectifies itself and creates the same hashValue we had on day 0.

    It looks like:

    (columnA) (columnB) (columnC) columnHash

    A,B,C, hashValue1 ----> day 0

    A,B,C, hashValue2 ----> day 1 (record inserted)

    (I forced all insert even if the hashvalue same)

    A,B,C, hashValue1 ----> day 2

    (I have 16+ columns - a combination of varchar, int, and datetime. Some of the columns contain NULL values, and the hashValue column is nvarchar(35).)

    I was unable to replicate this from one environment to another. I loaded those 3 days' data into TEST and got the desired result (~800 delta records generated).

    This got me thinking: is this an environment-related issue? Is it related to some sort of buffer overflow (note that around the first ~50K records generated the correct hash value)?

    I have loaded more than 500 files ranging from 1,000 records to 5M records, and the problem has happened only 3 times, each time with a 5M-record file. However, I have loaded the 5M file around 20 times without any issue.

    Any clue, or any suggestion on how to debug this, would be highly appreciated.


    [font="Comic Sans MS"]--
    Sabya[/font]

  • I've noticed that sometimes, in certain character fields, this can be a problem. The best solution I've found, and what I'm currently using in my environment, is the SSIS Multiple Hash data flow task. It seems to be a little more reliable when dealing with these fields.

    The only change you'll have to make is to switch the HashValue columns to the varbinary data type, sized for whichever hash you choose to use. You'll also need to convert the new binary hash value to text before the conditional split comparison, as the conditional split cannot compare binary values.
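
    As a minimal sketch of that conversion (the function name and usage are illustrative, not part of the article's code), a small helper inside the script component's ScriptMain class can turn the varbinary hash bytes into a hex string the conditional split can compare:

    ' Requires Imports System.Text at the top of the script.
    ' Sketch: turn a binary hash (Byte array) into a hex string so the
    ' conditional split can compare it as text.
    Public Shared Function HashToHexString(ByVal hash As Byte()) As String
        Dim sb As New StringBuilder(hash.Length * 2)
        For Each b As Byte In hash
            sb.Append(b.ToString("X2")) ' two hex digits per byte
        Next
        Return sb.ToString()
    End Function

    ' Example usage in ProcessInputRow (hypothetical column names):
    ' Row.HashText = HashToHexString(Row.HashValue)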


  • Brett,

    Thanks for the quick reply.

    The problem is, I don't have control over the environment I am working in, and I am not allowed to use any custom components. So I am stuck with the VB.NET script component.

    Do you have any idea why this might have happened? Anything related to the memory buffer?

    Can you provide an alternate code snippet that might help?


    [font="Comic Sans MS"]--
    Sabya[/font]

  • I will do a little more testing on it during this long holiday weekend and see what I might be able to come up with.

    One thing you could try is using a different type of hash, such as SHA instead of MD5. I provided an example earlier in the thread, and there is a quick sketch below as well.

    I don't think it's a buffer issue, though it could be if you are experiencing problems at the same threshold every time.
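
    As that sketch (a hypothetical function name, not tested code): the CreateHash function from the script with SHA-256 swapped in for MD5. SHA256Managed lives in System.Security.Cryptography, which the script already imports; note the Base64 output grows to 44 characters, so the hash column would need to be widened.

    ' Sketch only: same pattern as CreateHash, but using SHA-256 instead of MD5.
    Public Shared Function CreateHashSha256(ByVal data As String) As String
        Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
        Using sha As SHA256 = New SHA256Managed()
            Dim hashedData As Byte() = sha.ComputeHash(dataToHash)
            Return Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
        End Using
    End Function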

  • Thanks!

    I am going to do that too!

    In the meantime, I am removing 4 columns from the pipeline which will always be fixed strings, to see if that helps.

    Let me thank you once more for this article. I saw it through the RSS feed and bookmarked it a while ago. It came in really handy when we needed this 5M vs. 5M data compare where any of these 15-16 column values can change. It made the end-to-end process run in less than 30 minutes, while it used to take more than 16 hours!


    [font="Comic Sans MS"]--
    Sabya[/font]

  • Nice article.

    Consider adding the checksum column as an INCLUDE column on the PK index of the table (a sketch of the DDL follows the list). Your mileage may vary because:

    1) The lookup transformation component only needs to read the PK index, not the actual data rows. [good for this process.]

    2) The PK index is larger because of the checksum values. [a minus--hurts storage and any access using the PK.]

    3) The PK index entry must be updated with a new checksum value whenever the data gets updated. Since the checksum value is always the same size, this will not fragment the index pages unless you use the "update trigger sets checksum column to null" idea. [a minus for any process that updates the data.]
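
    As that sketch, with hypothetical table and column names, and assuming the PK is the clustered index (so the practical equivalent is a narrow covering nonclustered index on the key that INCLUDEs the hash):

    -- Sketch only: hypothetical names. The lookup can then be satisfied from
    -- this narrow index instead of touching the data rows.
    CREATE NONCLUSTERED INDEX IX_DimCustomer_Key_Hash
        ON dbo.DimCustomer (CustomerKey)
        INCLUDE (HashValue);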

    David Lathrop
    DBA
    WA Dept of Health

  • I wonder if you can do an incremental load from MySQL to SQL Server 2008. I used an ADO.NET source connecting to MySQL and I could not make the connection to the Script component. I read another article, Anatomy of an Incremental Load, and I cannot use the lookup transformation editor; it produced a "cannot link data because of data type mismatch" error when I tried to match the MySQL key to the SQL Server 2008 table. How can I convert a 4-byte unsigned integer from MySQL to a bigint key in MS SQL?

    thank you!

  • Nice article, wish you could write SSIS for our team

  • I actually resolved the problem. Not sure about performance. I'll add my comments later when I get results. Thank you all.

  • Awesome article. However, I have a different scenario: what if the source and target tables are really large, close to 1 billion records? I am not sure it's realistic to go through each record using a script component to calculate the HASH value. Also, in the lookup component, a full cache on even two columns of the target table would pretty much consume most of the memory, since it contains over 1 billion records.

    Any comments on large tables?

    Thanks,
    Hui

  • huishi.ca (3/19/2012)


    Awesome article. However, I have a different scenario: what if the source and target tables are really large, close to 1 billion records? I am not sure it's realistic to go through each record using a script component to calculate the HASH value. Also, in the lookup component, a full cache on even two columns of the target table would pretty much consume most of the memory, since it contains over 1 billion records.

    Any comments on large tables?

    Thanks,
    Hui

    For extremely large tables it might make more sense to implement change capturing in the source system to limit the number of records you have to work with in the package (a sketch is at the end of this post). If that's not an option, this method will only really work with an SSIS box that has a ton of memory. You can also modify the caching in the lookup to take pressure off the memory.

    Also, I'd highly recommend using SSIS Multiple Hash instead of this method; it's much easier to incorporate into your ETL infrastructure than a custom script.
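
    As the change-capture sketch mentioned above (hypothetical database and table names, and assuming the source is a SQL Server edition that supports change data capture):

    -- Sketch only: enable CDC so the package can read just the changed rows
    -- instead of hashing and comparing the full billion-row table.
    USE SourceDb;
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'SourceTable',
        @role_name     = NULL;

    -- Changed rows are then read through the generated function, e.g.
    -- cdc.fn_cdc_get_all_changes_dbo_SourceTable(@from_lsn, @to_lsn, N'all').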

  • sabyasm (9/2/2010)


    [font="Comic Sans MS"]

    Brett,

    Thanks for the quick reply.

    Problem is - I don't have the control on the environment I am working and I am not allowed to use any custom component. So I am stuck with VB.NET script component.

    Do you have any idea why this might have happened? Anything related to memory buffer?

    Can you provide any alternate code snippet that might help.

    [/font]

    I am having the same issue with this script. In each buffer, rows with the same values get a different MD5 value from the previous buffer.

    I've requested that the Multiple Hash component be installed, but in the meantime it would be great to get this script to work correctly.

  • For testing purposes I created a spreadsheet with 3 columns - Column A, B, and C, each containing the values A, B, and C respectively - in ~60,000 rows, and used this as my data source.

    Using the script below:

    ' Microsoft SQL Server Integration Services Script Component
    ' Write scripts using Microsoft Visual Basic 2008.
    ' ScriptMain is the entry point class of the script.

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
    Imports Microsoft.SqlServer.Dts.Pipeline
    Imports System.Text
    Imports System.Security.Cryptography

    <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
    <CLSCompliant(False)> _
    Public Class ScriptMain
        Inherits UserComponent

        Private InputBuffer As PipelineBuffer

        Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As PipelineBuffer)
            InputBuffer = Buffer
            MyBase.ProcessInput(InputID, Buffer)
        End Sub

        Public Overrides Sub PreExecute()
            MyBase.PreExecute()
            ' Add your code here for preprocessing or remove if not needed
        End Sub

        Public Overrides Sub PostExecute()
            MyBase.PostExecute()
            ' Add your code here for postprocessing or remove if not needed
            ' You can set read/write variables here, for example:
            ' Me.Variables.MyIntVar = 100
        End Sub

        Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
            Dim counter As Integer = 0
            Dim values As New StringBuilder

            For counter = 0 To InputBuffer.ColumnCount - 1
                Dim value As Object
                value = InputBuffer.Item(counter)
                values.Append(value)
            Next

            Row.Concat = values.ToString()
            Row.HashValue = CreateHash(values.ToString())
        End Sub

        Public Shared Function CreateHash(ByVal data As String) As String
            Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
            Dim md5 As MD5 = New MD5CryptoServiceProvider()
            Dim hashedData As Byte() = md5.ComputeHash(dataToHash)
            RNGCryptoServiceProvider.Create().GetBytes(dataToHash)
            Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
            Return s
        End Function

    End Class

    I get the following results:

    Rows            Column A  Column B  Column C  HashValue                  Concat
    1 onwards       A         B         C         cWd0tTqSuYUlu6F5oVnMcQ==   ABC
    11167 onwards   A         B         C         EP2rhBgzqDEc0qmEjWlQBA==   ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC
    24194 onwards   A         B         C         dLHcMhE2RJZ6ew7Jd8oezQ==   ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC
    37221 onwards   A         B         C         93k+mMDI1x5OoZ0cjtz7Hw==   ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC
    50248 onwards   A         B         C         RHGVCjMX9usJb160IvP6RQ==   ABC93k+mMDI1x5OoZ0cjtz7Hw==ABCdLHcMhE2RJZ6ew7Jd8oezQ==ABCEP2rhBgzqDEc0qmEjWlQBA==ABCcWd0tTqSuYUlu6F5oVnMcQ==ABC
    58881 onwards   A         B         C         cWd0tTqSuYUlu6F5oVnMcQ==   ABC

    Looking at the Concat results, the previous buffer's HashValue is getting included in the InputSource for some reason. When I add the line Row.ColumnCount = InputBuffer.ColumnCount, the value is 5 for each row, even though I've only selected 3 columns as part of the InputSource.
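
    That count of 5 is consistent with the 3 input columns plus the 2 output columns (HashValue and Concat) sharing the same buffer, which would explain how a value from the previous buffer leaks into the concatenation. As a minimal sketch of one way around it (hypothetical input column names ColumnA, ColumnB, ColumnC), build the string from the Row properties instead of walking the whole PipelineBuffer:

    ' Sketch only: reference the selected input columns explicitly so the
    ' loop never picks up the output columns that live in the same buffer.
    Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
        Dim values As New StringBuilder

        ' Hypothetical column names; the generated _IsNull properties guard against NULLs.
        If Not Row.ColumnA_IsNull Then values.Append(Row.ColumnA)
        If Not Row.ColumnB_IsNull Then values.Append(Row.ColumnB)
        If Not Row.ColumnC_IsNull Then values.Append(Row.ColumnC)

        Row.Concat = values.ToString()
        Row.HashValue = CreateHash(values.ToString())
    End Sub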

  • I have implemented the code suggested, and I am running into some trouble: only every 3rd or 4th time I run the package do I get the same hash for columns that have not changed. Shouldn't I get the same hash every time I run the package if the columns have not changed?

    These tables are outside of any OLTP environment, so there is no change in the source data.

    I appreciate any help.
