HASHBYTES can help quickly load a Data Warehouse

  • Mike C (4/14/2010)

    I also prepend the binary length of the string to each column, which provides additional differentiation and a useful method for dealing with NULLs. For instance:

    "a " + "b" ==> 0x00000002 + "a " + "|" + 0x00000001 + "b"

    When you have a NULL you can deal with it by passing in the binary equivalent of -1 as the length, or you can concatenate a length of 0 yet append one or more data bytes (0x00000000 + 0x00), or come up with some other scenario that cannot be duplicated by any other combination of length + data for any other data types.
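
    As a minimal sketch of that length-prefix idea (reusing the Customer columns from the FOR XML example below; the varbinary(200) sizes are just illustrative assumptions):

    select c.CustomerId
        ,md5 = hashbytes('md5',
            -- 4-byte length prefix per column; NULL becomes 0xFFFFFFFF (-1),
            -- which no real length + data combination can reproduce
              cast(coalesce(datalength(c.FirstName), -1) as binary(4))
            + coalesce(cast(c.FirstName as varbinary(200)), 0x) + 0x7C   -- '|'
            + cast(coalesce(datalength(c.MiddleName), -1) as binary(4))
            + coalesce(cast(c.MiddleName as varbinary(200)), 0x) + 0x7C
            + cast(coalesce(datalength(c.LastName), -1) as binary(4))
            + coalesce(cast(c.LastName as varbinary(200)), 0x))
    from Customer c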

    Alternatively, you can compensate for nulls by cheating with FOR XML:

    select
         CustomerId
        ,FirstName
        ,MiddleName
        ,LastName
        ,md5 = hashbytes('md5',
            (select c.CustomerId, c.FirstName, c.MiddleName, c.LastName for xml raw))
    from Customer c

    /edit: yaddayadda performance yadda binary base64 yaddayadda character length yadda something about unicode yadda YMMV

  • Nice article. Thanks for putting it together for us to read. I will have to test the concepts presented here.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Mike C (4/14/2010)


    So yes, you are correct, once you've generated 1,208,925,819,614,629,174,706,176 160-bit hashes in a single table you're fairly certain to have a collision and it's probably time to move up into a larger hash size. And we didn't even touch on the topic of 512 bit hashes yet.
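
    (For reference, that figure is the 160-bit birthday bound: sqrt(2^160) = 2^80 = 1,208,925,819,614,629,174,706,176, the point at which a random collision becomes roughly even odds rather than a remote possibility.)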

    But keep in mind, the hash is not checked against the entire table, it is only checked against the natural key. The chances of a collision within a natural key, if implemented correctly, should be zero.

  • In my book any likelihood = a certainty (at some point down the line). Do you have anything in place to catch collisions should they occur, or are you just hoping they won't?

    Hope isn't a strategy 🙂

    Hope isn't a strategy but realistic risk management is perfectly valid. The odds are so low that it sometimes isn't worth additional effort and time in a daily (or more frequent) DW load to check every row. What's your alternative to finding the 5% of changes in a 500M row source extract in a reasonable amount of time? A complete compare of dozens of columns takes many hours longer than calculating and comparing one hash column.

  • Sean Terry (4/14/2010)


    Mike C (4/14/2010)

    I also prepend the binary length of the string to each column, which provides additional differentiation and a useful method for dealing with NULLs. For instance:

    "a " + "b" ==> 0x00000002 + "a " + "|" + 0x00000001 + "b"

    When you have a NULL you can deal with it by passing in the binary equivalent of -1 as the length, or you can concatenate a length of 0 yet append one or more data bytes (0x00000000 + 0x00), or come up with some other scenario that cannot be duplicated by any other combination of length + data for any other data types.

    Alternatively, you can compensate for nulls by cheating with FOR XML:

    select
         CustomerId
        ,FirstName
        ,MiddleName
        ,LastName
        ,md5 = hashbytes('md5',
            (select c.CustomerId, c.FirstName, c.MiddleName, c.LastName for xml raw))
    from Customer c

    /edit: yaddayadda performance yadda binary base64 yaddayadda character length yadda something about unicode yadda YMMV

    I do something similar using binary base64 + FOR XML + recursive CTEs to hash LOB data in 8,000-byte chunks. Keep in mind that HashBytes has an 8,000-byte input limit, and you can potentially exceed that with XML data. The scary part is that HashBytes won't even raise an error if you hit the 8,000-byte limit -- it silently truncates the data to 8,000 bytes, so you have to be careful with it.
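
    As a rough sketch of the chunking idea, without the recursive CTE (CustomerNotes and its varbinary(max) Notes column are made-up names): hash each 8,000-byte slice separately, then hash the concatenated slice hashes, so no single HASHBYTES call ever sees more than 8,000 bytes.

    select n.CustomerId
        ,lob_hash = hashbytes('md5',
              hashbytes('md5', substring(n.Notes, 1,     8000))   -- slice 1
            + hashbytes('md5', substring(n.Notes, 8001,  8000))   -- slice 2
            + hashbytes('md5', substring(n.Notes, 16001, 8000)))  -- slice 3
    from CustomerNotes n   -- extend the pattern (or use the recursive CTE) for longer values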

  • If you're planning to compare all of the columns anyway then you don't need to generate a hash code.

    I don't agree with this. An index is only useful for retrieving results using its leading column. The idea is to use the hash value as its leading column so the cardinality of the index is very high and matching is very quick. This is all about cardinality!!!! The index also includes the actual values being hashed so that a match is 100%. This may not be efficient if you have very large text fields, as the index will be large and take up a lot of disk space.
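
    For what it's worth, the index shape being described would look something like this (table and column names are purely illustrative):

    create nonclustered index IX_DimCustomer_RowHash
        on dbo.DimCustomer (RowHash)                    -- hash first: very high cardinality
        include (FirstName, MiddleName, LastName);      -- hashed values carried along to confirm an exact match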

  • I am getting this error:

    Msg 156, Level 15, State 1, Line 1
    Incorrect syntax near the keyword 'as'.
    Msg 156, Level 15, State 1, Line 3
    Incorrect syntax near the keyword 'as'.

    while executing these lines:

    merge fact_stage as target
    using (select fact_nk, dim_nk, fact_data, hashbytes('MD5', convert(varchar(max), fact_data))
           from fact_source) as source (fact_nk, dim_nk, fact_data, chksum)
    on target.fact_nk = source.fact_nk
    when matched and target.chksum != source.chksum
        then update
            set target.chksum = source.chksum
               ,target.fact_data = source.fact_data
               ,load_flag = 'Y'
    when not matched then insert
        (fact_nk, dim_nk, fact_data, load_flag, chksum)
        values (source.fact_nk, source.dim_nk, source.fact_data, 'Y', source.chksum);

  • bohra_anand (4/15/2010)


    I am getting this error:

    Msg 156, Level 15, State 1, Line 1
    Incorrect syntax near the keyword 'as'.
    Msg 156, Level 15, State 1, Line 3
    Incorrect syntax near the keyword 'as'.

    Are you doing this on SQL Server 2008?

  • tarquin (4/15/2010)


    If you're planning to compare all of the columns anyway then you don't need to generate a hash code.

    I don't agree with this. An index is only useful for retrieving results using its leading column. The idea is to use the hash value as its leading column so the cardinality of the index is very high and matching is very quick. This is all about cardinality!!!! The index also includes the actual values being hashed so that a match is 100%. This may not be efficient if you have very large text fields, as the index will be large and take up a lot of disk space.

    Assuming you have a primary key defined, your cardinality for locating the row is handled already, would you agree? At that point you're simply checking for inequality in non-key columns. If you believe that the hash function can't be trusted to detect changes then what exactly have you saved? You've basically introduced the overhead of generating and storing a hash into your process on top of comparing all columns for inequality. So what is the advantage in this scenario where you don't trust the hash function?

  • Correct - the hash is not the lookup.

    To make this work correctly, you create an index on the natural key and include the hash column. This way, you lookup on the natural key, then compare the stored hash against the incoming hash.
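
    As a sketch, with made-up dimension and column names, that index looks something like:

    create nonclustered index IX_DimCustomer_NK
        on dbo.DimCustomer (CustomerNK)     -- seek on the natural key...
        include (RowHash);                  -- ...and carry the stored hash for the comparison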

    The hash should not be a key in an index. Ever. Data changes - as a result, the hash would change and so would its location in the index. That's not a good idea in general.

    Phil

  • tarquin (4/15/2010)


    The idea is to use the hash value as its leading column so the cardinality of the index is very high and matching is very quick. This is all about cardinality!!!! The index also includes the actual values being hashed so that a match is 100%.

    No, that's actually backwards. You index the natural key from the source and use it to look up the stored hash value; that lets you determine whether the row for that key has changed or does not yet exist. If you look up on the hash instead, you can only tell whether an identical row already exists -- when the hash is not found, you can't tell the difference between an update to an existing row and a brand-new insert.

    Consider the following chain of events:

    1. stage record has key 1, hash 'def'

    2. existing key 1 hash 'abc'

    3. Lookup 'def', not found

    4. Insert key 1, primary key violation

    versus:

    1. stage record has key 1, hash 'def'

    2. existing key 1 hash 'abc'

    3. lookup key 1, stored hash is not equal to incoming hash

    4. update key 1, success

    The alternative to flow 1, step 4 is to upsert key 1, but that is often less efficient than sending all updates down an update path in your ETL and all inserts down another.
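
    In T-SQL terms, flow 2 might look roughly like this (a sketch only; FactTarget, FactStage, and the column names are made up for illustration):

    update d
        set d.RowHash = s.RowHash
           ,d.FactData = s.FactData
    from dbo.FactTarget d
    join dbo.FactStage s on s.FactNK = d.FactNK
    where d.RowHash <> s.RowHash;       -- key found, hashes differ: update path

    insert dbo.FactTarget (FactNK, FactData, RowHash)
    select s.FactNK, s.FactData, s.RowHash
    from dbo.FactStage s
    where not exists (select 1 from dbo.FactTarget d where d.FactNK = s.FactNK);   -- key not found: insert path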

  • bohra_anand (4/15/2010)


    I am getting this error:

    Msg 156, Level 15, State 1, Line 1
    Incorrect syntax near the keyword 'as'.

    You may be using an older version of SQL Server. Sorry, I forgot to mention in the article that the MERGE command requires SQL Server 2008 (any edition, Express and up). The HASHBYTES function works in 2008 and 2005, and I think even 2000 has an MD5 routine, but it may be called something else.
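
    A quick way to check which version and edition you are on (not from the article):

    select serverproperty('ProductVersion') as ProductVersion
        ,serverproperty('Edition') as Edition;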

  • Quoting "The Microsoft Data Warehouse Toolkit" by Ross and Kimball pg 245: "...This is a common technique for high performance dimension change management. Compute the hashes and store them in the dimension table. Then compute the hashes on the incoming rowset and compare to stored values. The comparison on a single, relatively short string column is far more efficient than pair wise comparison on dozens of seperate columns."

    So this technique can be used to do difference checking for dimensions and also attempt to speed up updates onto fact records in specific cases although I do admit this is stretching the above mentioned quote a bit.

    Fact data that has been written to a datawarehouse typicall does not change i.e. is not updated, it is flagged as non-current, or deleted or just left alone and reloaded data has a more recent load date. Updates are very very slow on fact tables that are very large. In our case when we reload a fact record we flag the existing record as non-current by updating a bit field to false - this hides the "old" record in our views used to expose the data (Kimball recommends DWH data is always exposed via views to abstract the underlying structure). So the challenge we face is to quickly find previously loaded records and update the bit flag and then load the new record. In this scenario its better to use the hash or checkum over hash value as the leading column of an index to find these records rather than match on multiple incoming logical keys. The does depend on the nature of the data and in some cases the logical key has the same cardinality as the hash value (when you load a small volume per day and effective date ends up being unique). You would need to check and see if you can significalty improve the cardinality of the index using a hash value.
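
    A very rough sketch of that reload pattern (FactSales, FactStage, and all column names are invented for illustration; the hash-led index is assumed to exist as described):

    update f
        set f.IsCurrent = 0                                 -- hide the previously loaded record
    from dbo.FactSales f
    join dbo.FactStage s on s.RowHash = f.RowHash           -- seek via the hash-led index
                        and s.LogicalKey = f.LogicalKey     -- confirm the logical key
    where f.IsCurrent = 1;

    insert dbo.FactSales (LogicalKey, FactData, RowHash, IsCurrent, LoadDate)
    select s.LogicalKey, s.FactData, s.RowHash, 1, getdate()
    from dbo.FactStage s;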

  • tarquin,

    I'm not quite clear on how you know the hash of a record in a fact table ahead of time such that it helps with your updates.

  • tarquin (4/16/2010)


    Quoting "The Microsoft Data Warehouse Toolkit" by Ross and Kimball pg 245: "...This is a common technique for high performance dimension change management. Compute the hashes and store them in the dimension table. Then compute the hashes on the incoming rowset and compare to stored values. The comparison on a single, relatively short string column is far more efficient than pair wise comparison on dozens of seperate columns."

    So this technique can be used to do difference checking for dimensions and also attempt to speed up updates onto fact records in specific cases although I do admit this is stretching the above mentioned quote a bit.

    Fact data that has been written to a datawarehouse typicall does not change i.e. is not updated, it is flagged as non-current, or deleted or just left alone and reloaded data has a more recent load date. Updates are very very slow on fact tables that are very large. In our case when we reload a fact record we flag the existing record as non-current by updating a bit field to false - this hides the "old" record in our views used to expose the data (Kimball recommends DWH data is always exposed via views to abstract the underlying structure). So the challenge we face is to quickly find previously loaded records and update the bit flag and then load the new record. In this scenario its better to use the hash or checkum over hash value as the leading column of an index to find these records rather than match on multiple incoming logical keys. The does depend on the nature of the data and in some cases the logical key has the same cardinality as the hash value (when you load a small volume per day and effective date ends up being unique). You would need to check and see if you can significalty improve the cardinality of the index using a hash value.

    Quoting "Mike C": The only way Kimball's generalized suggestions will work is if you compute TWO hashes per row. That might be why he indicated you'd be computing "hashes" in your quote above.

    Consider the process. You have an inbound row. The first step is to locate the matching row (based on the natural key/business key) in the target dimension table. You can generate a hash of the natural key/business key column(s) for this purpose, and in that one instance it makes sense to create an index with that hash column as the leading column. Apart from that it makes no sense at all: it provides no better cardinality than a unique constraint on the natural key/business key columns. Very rarely have I encountered natural keys in dimension tables that were "dozens of columns".

    Now consider the second step, which is to determine whether the non-key columns of an existing row have changed. For this you need to calculate a hash on the non-key columns and compare. Once you've located your data in the table with the first hash, you have to compare the non-key column hashes to actually detect changes. Cardinality is not an issue at that point; you're comparing 1 column of 1 row to 1 column of 1 row.
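
    As a sketch of the two-hash approach, reusing the Customer columns from earlier in the thread (illustrative only, not code from the article):

    select c.CustomerId
        ,nk_hash = hashbytes('md5', cast(c.CustomerId as varbinary(4)))     -- step 1: locate the row
        ,row_hash = hashbytes('md5',
            (select c.FirstName, c.MiddleName, c.LastName for xml raw))     -- step 2: detect changes
    from Customer c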

    I take exception to Kimball's point about the efficiency of comparing one (binary) string value, etc., for this reason: your natural key for a dimension might be an integer, a varchar(10), or even some combination of fields like (int, int, char(4)). An MD5 hash is 16 bytes long. An int is 4 bytes long. It's doubtful you'll achieve better performance comparing one 16-byte binary column as opposed to comparing one (or two or three or four) int columns, char(4) columns, etc.

    Kimball writes some interesting books, from an academic perspective. If you really want to see how and if it works in the real world, implement it yourself and give it a go.

