HASHBYTES can help quickly load a Data Warehouse

  • Mike C (4/14/2010)

    I also prepend the binary length of the string to each column, which provides additional differentiation and a useful method for dealing with NULLs. For instance:

    "a " + "b" ==> 0x00000002 + "a " + "|" + 0x00000001 + "b"

    When you have a NULL you can deal with it by passing in the binary equivalent of -1 as the length, or you can concatenate a length of 0 yet append one or more data bytes (0x00000000 + 0x00), or come up with some other scenario that cannot be duplicated by any other combination of length + data for any other data types.
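
    As a minimal sketch of that length-prefix idea (reusing the Customer columns from the FOR XML example below; the varbinary(200) sizes are just illustrative assumptions):

    select c.CustomerId
        ,md5 = hashbytes('md5',
            -- 4-byte length prefix per column; NULL becomes 0xFFFFFFFF (-1),
            -- which no real length + data combination can reproduce
              cast(coalesce(datalength(c.FirstName), -1) as binary(4))
            + coalesce(cast(c.FirstName as varbinary(200)), 0x) + 0x7C   -- '|'
            + cast(coalesce(datalength(c.MiddleName), -1) as binary(4))
            + coalesce(cast(c.MiddleName as varbinary(200)), 0x) + 0x7C
            + cast(coalesce(datalength(c.LastName), -1) as binary(4))
            + coalesce(cast(c.LastName as varbinary(200)), 0x))
    from Customer c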

    Alternatively, you can compensate for nulls by cheating with FOR XML:

    select
         CustomerId
        ,FirstName
        ,MiddleName
        ,LastName
        ,md5 = hashbytes('md5',
            (select c.CustomerId, c.FirstName, c.MiddleName, c.LastName for xml raw))
    from Customer c

    /edit: yaddayadda performance yadda binary base64 yaddayadda character length yadda something about unicode yadda YMMV

  • Nice article. Thanks for putting it together for us to read. I will have to test the concepts presented here.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Mike C (4/14/2010)


    So yes, you are correct, once you've generated 1,208,925,819,614,629,174,706,176 160-bit hashes in a single table you're fairly certain to have a collision and it's probably time to move up into a larger hash size. And we didn't even touch on the topic of 512 bit hashes yet.
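
    (For reference, that figure is the 160-bit birthday bound: sqrt(2^160) = 2^80 = 1,208,925,819,614,629,174,706,176, the point at which a random collision becomes roughly even odds rather than a remote possibility.)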

    But keep in mind, the hash is not checked against the entire table, it is only checked against the natural key. The chances of a collision within a natural key, if implemented correctly, should be zero.

  • In my book any likelihood = a certainty (at some point down the line). Do you have anything in place to catch collisions should they occur, or are you just hoping they won't?

    Hope isn't a strategy 🙂

    Hope isn't a strategy but realistic risk management is perfectly valid. The odds are so low that it sometimes isn't worth additional effort and time in a daily (or more frequent) DW load to check every row. What's your alternative to finding the 5% of changes in a 500M row source extract in a reasonable amount of time? A complete compare of dozens of columns takes many hours longer than calculating and comparing one hash column.

  • Sean Terry (4/14/2010)


    Mike C (4/14/2010)

    I also prepend the binary length of the string to each column, which provides additional differentiation and a useful method for dealing with NULLs. For instance:

    "a " + "b" ==> 0x00000002 + "a " + "|" + 0x00000001 + "b"

    When you have a NULL you can deal with it by passing in the binary equivalent of -1 as the length, or you can concatenate a length of 0 yet append one or more data bytes (0x00000000 + 0x00), or come up with some other scenario that cannot be duplicated by any other combination of length + data for any other data types.

    Alternatively, you can compensate for nulls by cheating with FOR XML:

    select
         CustomerId
        ,FirstName
        ,MiddleName
        ,LastName
        ,md5 = hashbytes('md5',
            (select c.CustomerId, c.FirstName, c.MiddleName, c.LastName for xml raw))
    from Customer c

    /edit: yaddayadda performance yadda binary base64 yaddayadda character length yadda something about unicode yadda YMMV

    I do something similar using binary base64 + FOR XML + recursive CTEs to hash LOB data in 8,000-byte chunks. Keep in mind that HashBytes has an 8,000-byte input limit, and you can potentially exceed that with XML data. The scary part is that HashBytes won't even raise an error if you hit the 8,000-byte limit -- it silently truncates the data to 8,000 bytes, so you have to be careful with it.
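
    As a rough sketch of the chunking idea, without the recursive CTE (CustomerNotes and its varbinary(max) Notes column are made-up names): hash each 8,000-byte slice separately, then hash the concatenated slice hashes, so no single HASHBYTES call ever sees more than 8,000 bytes.

    select n.CustomerId
        ,lob_hash = hashbytes('md5',
              hashbytes('md5', substring(n.Notes, 1,     8000))   -- slice 1
            + hashbytes('md5', substring(n.Notes, 8001,  8000))   -- slice 2
            + hashbytes('md5', substring(n.Notes, 16001, 8000)))  -- slice 3
    from CustomerNotes n   -- extend the pattern (or use the recursive CTE) for longer values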

  • If you're planning to compare all of the columns anyway then you don't need to generate a hash code.

    I don't agree with this. An index is only useful for retrieving results using its leading column. The idea is to use the hash value as its leading column so the cardinality of the index is very high and matching is very quick. This is all about cardinality!!!! The index also includes the actual values being hashed so that a match is 100%. This may not be efficient if you have very large text fields, as the index will be large and take up a lot of disk space.
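
    For what it's worth, the index shape being described would look something like this (table and column names are purely illustrative):

    create nonclustered index IX_DimCustomer_RowHash
        on dbo.DimCustomer (RowHash)                    -- hash first: very high cardinality
        include (FirstName, MiddleName, LastName);      -- hashed values carried along to confirm an exact match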

  • I am getting this error:

    Msg 156, Level 15, State 1, Line 1
    Incorrect syntax near the keyword 'as'.
    Msg 156, Level 15, State 1, Line 3
    Incorrect syntax near the keyword 'as'.

    while executing these lines:

    merge fact_stage as target
    using (select fact_nk, dim_nk, fact_data, hashbytes('MD5', convert(varchar(max), fact_data))
           from fact_source) as source (fact_nk, dim_nk, fact_data, chksum)
    on target.fact_nk = source.fact_nk
    when matched and target.chksum != source.chksum
        then update
            set target.chksum = source.chksum
               ,target.fact_data = source.fact_data
               ,load_flag = 'Y'
    when not matched then insert
        (fact_nk, dim_nk, fact_data, load_flag, chksum)
        values (source.fact_nk, source.dim_nk, source.fact_data, 'Y', source.chksum);

  • bohra_anand (4/15/2010)


    I am getting this error:

    Msg 156, Level 15, State 1, Line 1
    Incorrect syntax near the keyword 'as'.
    Msg 156, Level 15, State 1, Line 3
    Incorrect syntax near the keyword 'as'.

    Are you doing this on SQL Server 2008?

  • tarquin (4/15/2010)


    If you're planning to compare all of the columns anyway then you don't need to generate a hash code.

    I don't agree with this. An index is only useful for retrieving results using its leading column. The idea is to use the hash value as its leading column so the cardinality of the index is very high and matching is very quick. This is all about cardinality!!!! The index also includes the actual values being hashed so that a match is 100%. This may not be efficient if you have very large text fields, as the index will be large and take up a lot of disk space.

    Assuming you have a primary key defined, your cardinality for locating the row is handled already, would you agree? At that point you're simply checking for inequality in non-key columns. If you believe that the hash function can't be trusted to detect changes then what exactly have you saved? You've basically introduced the overhead of generating and storing a hash into your process on top of comparing all columns for inequality. So what is the advantage in this scenario where you don't trust the hash function?

  • Correct - the hash is not the lookup.

    To make this work correctly, you create an index on the natural key and include the hash column. This way, you lookup on the natural key, then compare the stored hash against the incoming hash.
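
    As a sketch, with made-up dimension and column names, that index looks something like:

    create nonclustered index IX_DimCustomer_NK
        on dbo.DimCustomer (CustomerNK)     -- seek on the natural key...
        include (RowHash);                  -- ...and carry the stored hash for the comparison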

    The hash should not be a key in an index. Ever. Data changes - as a result, the hash would change and so would its location in the index. That's not a good idea in general.

    Phil

  • tarquin (4/15/2010)


    The idea is to use the hash value as its leading column so the cardinality of the index is very high and matching is very quick. This is all about cardinality!!!! The index also includes the actual values being hashed so that a match is 100%.

    No, that's actually backwards. You index the natural key from the source and use it to look up the stored hash value; that lets you determine whether the row for that key has changed or does not yet exist. If you look up on the hash instead, you can only tell whether an identical row already exists -- when the hash is not found, you can't tell the difference between an update to an existing row and a brand-new insert.

    Consider the following chain of events:

    1. stage record has key 1, hash 'def'

    2. existing key 1 hash 'abc'

    3. Lookup 'def', not found

    4. Insert key 1, primary key violation

    versus:

    1. stage record has key 1, hash 'def'

    2. existing key 1 hash 'abc'

    3. lookup key 1, stored hash is not equal to incoming hash

    4. update key 1, success

    The alternative to flow 1, step 4 is to upsert key 1, but that is often less efficient than sending all updates down an update path in your ETL and all inserts down another.
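
    In T-SQL terms, flow 2 might look roughly like this (a sketch only; FactTarget, FactStage, and the column names are made up for illustration):

    update d
        set d.RowHash = s.RowHash
           ,d.FactData = s.FactData
    from dbo.FactTarget d
    join dbo.FactStage s on s.FactNK = d.FactNK
    where d.RowHash <> s.RowHash;       -- key found, hashes differ: update path

    insert dbo.FactTarget (FactNK, FactData, RowHash)
    select s.FactNK, s.FactData, s.RowHash
    from dbo.FactStage s
    where not exists (select 1 from dbo.FactTarget d where d.FactNK = s.FactNK);   -- key not found: insert path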

  • bohra_anand (4/15/2010)


    I am getting this error:

    Msg 156, Level 15, State 1, Line 1
    Incorrect syntax near the keyword 'as'.

    You may be using an older version of SQL Server. Sorry, I forgot to mention in the article that the MERGE command requires SQL Server 2008 (any edition, Express and up). The HASHBYTES function works in 2008 and 2005, and I think even 2000 has an MD5 routine, but it may be called something else.
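
    A quick way to check which version and edition you are on (not from the article):

    select serverproperty('ProductVersion') as ProductVersion
        ,serverproperty('Edition') as Edition;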

  • Quoting "The Microsoft Data Warehouse Toolkit" by Ross and Kimball pg 245: "...This is a common technique for high performance dimension change management. Compute the hashes and store them in the dimension table. Then compute the hashes on the incoming rowset and compare to stored values. The comparison on a single, relatively short string column is far more efficient than pair wise comparison on dozens of seperate columns."

    So this technique can be used to do difference checking for dimensions and also attempt to speed up updates onto fact records in specific cases although I do admit this is stretching the above mentioned quote a bit.

    Fact data that has been written to a datawarehouse typicall does not change i.e. is not updated, it is flagged as non-current, or deleted or just left alone and reloaded data has a more recent load date. Updates are very very slow on fact tables that are very large. In our case when we reload a fact record we flag the existing record as non-current by updating a bit field to false - this hides the "old" record in our views used to expose the data (Kimball recommends DWH data is always exposed via views to abstract the underlying structure). So the challenge we face is to quickly find previously loaded records and update the bit flag and then load the new record. In this scenario its better to use the hash or checkum over hash value as the leading column of an index to find these records rather than match on multiple incoming logical keys. The does depend on the nature of the data and in some cases the logical key has the same cardinality as the hash value (when you load a small volume per day and effective date ends up being unique). You would need to check and see if you can significalty improve the cardinality of the index using a hash value.
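
    A very rough sketch of that reload pattern (FactSales, FactStage, and all column names are invented for illustration; the hash-led index is assumed to exist as described):

    update f
        set f.IsCurrent = 0                                 -- hide the previously loaded record
    from dbo.FactSales f
    join dbo.FactStage s on s.RowHash = f.RowHash           -- seek via the hash-led index
                        and s.LogicalKey = f.LogicalKey     -- confirm the logical key
    where f.IsCurrent = 1;

    insert dbo.FactSales (LogicalKey, FactData, RowHash, IsCurrent, LoadDate)
    select s.LogicalKey, s.FactData, s.RowHash, 1, getdate()
    from dbo.FactStage s;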

  • tarquin,

    I'm not quite clear on how you know the hash of a record in a fact table ahead of time such that it helps with your updates.

  • tarquin (4/16/2010)


    Quoting "The Microsoft Data Warehouse Toolkit" by Ross and Kimball pg 245: "...This is a common technique for high performance dimension change management. Compute the hashes and store them in the dimension table. Then compute the hashes on the incoming rowset and compare to stored values. The comparison on a single, relatively short string column is far more efficient than pair wise comparison on dozens of seperate columns."

    So this technique can be used to do difference checking for dimensions and also attempt to speed up updates onto fact records in specific cases although I do admit this is stretching the above mentioned quote a bit.

    Fact data that has been written to a datawarehouse typicall does not change i.e. is not updated, it is flagged as non-current, or deleted or just left alone and reloaded data has a more recent load date. Updates are very very slow on fact tables that are very large. In our case when we reload a fact record we flag the existing record as non-current by updating a bit field to false - this hides the "old" record in our views used to expose the data (Kimball recommends DWH data is always exposed via views to abstract the underlying structure). So the challenge we face is to quickly find previously loaded records and update the bit flag and then load the new record. In this scenario its better to use the hash or checkum over hash value as the leading column of an index to find these records rather than match on multiple incoming logical keys. The does depend on the nature of the data and in some cases the logical key has the same cardinality as the hash value (when you load a small volume per day and effective date ends up being unique). You would need to check and see if you can significalty improve the cardinality of the index using a hash value.

    Quoting "Mike C": The only way Kimball's generalized suggestions will work is if you compute TWO hashes per row. That might be why he indicated you'd be computing "hashes" in your quote above.

    Consider the process. You have an inbound row. The first step is to locate the matching row (based on the natural key/business key) in the target dimension table. You can generate a hash of the natural key/business key column(s) for this purpose, and in that one instance it makes sense to create an index with that hash column as the leading column. Apart from that it makes no sense at all: it provides no better cardinality than a unique constraint on the natural key/business key columns. Very rarely have I encountered natural keys in dimension tables that were "dozens of columns".

    Now consider the second step, which is to determine whether the non-key columns of an existing row have changed. For this you need to calculate a hash on the non-key columns and compare. Once you've located your data in the table with the first hash, you have to compare the non-key column hashes to actually detect changes. Cardinality is not an issue at that point; you're comparing 1 column of 1 row to 1 column of 1 row.
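
    As a sketch of the two-hash approach, reusing the Customer columns from earlier in the thread (illustrative only, not code from the article):

    select c.CustomerId
        ,nk_hash = hashbytes('md5', cast(c.CustomerId as varbinary(4)))     -- step 1: locate the row
        ,row_hash = hashbytes('md5',
            (select c.FirstName, c.MiddleName, c.LastName for xml raw))     -- step 2: detect changes
    from Customer c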

    I take exception to Kimball's point about the efficiency of comparing one (binary) string value, etc., for this reason: your natural key for a dimension might be an integer, a varchar(10), or even some combination of fields like (int, int, char(4)). An MD5 hash is 16 bytes long. An int is 4 bytes long. It's doubtful you'll achieve better performance comparing one 16-byte binary column as opposed to comparing one (or two or three or four) int columns, char(4) columns, etc.

    Kimball writes some interesting books, from an academic perspective. If you really want to see how and if it works in the real world, implement it yourself and give it a go.

