HASHBYTES can help quickly load a Data Warehouse

  • tarquin (4/16/2010)


    Compute the hashes and store them in the dimension table

    Fact data that has been written to a data warehouse typically does not change, i.e. it is not updated; it is flagged as non-current, deleted, or simply left alone, and reloaded data carries a more recent load date.

    In the first quoted point, the major innovation pointed out in my article is to use the super-narrow cross-reference table to store the keys and hash only. There are multiple performance benefits to using this instead of the much wider target tables, and the only downside is maintaining another table. You can index it as you please if you don't like my recommendation, but it seems from your arguments that you're grinding a theoretical axe instead of actually using either technique.
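    A minimal sketch of the kind of narrow cross-reference table described above (all object and column names here are hypothetical, not taken from the article):

    CREATE TABLE dbo.DimCustomer_Xref
    (
        NaturalKey   int        NOT NULL,  -- business key from the source system
        SurrogateKey int        NOT NULL,  -- surrogate key of the row in the wide dimension table
        RowHash      binary(16) NOT NULL,  -- e.g. HASHBYTES('MD5', ...) over the non-key columns
        CONSTRAINT PK_DimCustomer_Xref PRIMARY KEY CLUSTERED (NaturalKey)
    );

    Incoming rows join to this table on the natural key and compare RowHash, so the much wider dimension table is only touched when a row actually has to be inserted or updated.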

    On the second point, you're in an unusual situation if this is true where you work. It is usually the first academic DW principle to be tossed out by the business analysts, either on their own or under the influence of DBAs who howl about read performance. I've yet to encounter a working DW that doesn't update over the top of old records in at least half of its primary tables.

  • Mike C (4/16/2010)


    I take exception to Kimball's point about the efficiency of comparing one (binary) string value, etc., for this reason: your natural key for a dimension might be an integer, a varchar(10), or even some combination of fields like (int, int, char(4)). An MD5 hash is 16 bytes long. An int is 4 bytes long. It's doubtful you'll achieve better performance comparing one 16-byte binary column as opposed to comparing one (or two, or three, or four) int or char(4) columns.

    Maybe Ralph's got some of the new AMD Bulldozer 128 bit CPUs.
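    For illustration, the two comparison styles being weighed here look roughly like this (table and column names are hypothetical):

    -- Join on the natural key columns directly (one or more narrow columns)
    SELECT t.*
    FROM dbo.StageRows AS s
    JOIN dbo.TargetRows AS t
        ON t.RegionId = s.RegionId
       AND t.AccountNo = s.AccountNo;

    -- Join on a single precomputed 16-byte hash of those same columns
    SELECT t.*
    FROM dbo.StageRows AS s
    JOIN dbo.TargetRows AS t
        ON t.KeyHash = s.KeyHash;  -- binary(16), e.g. HASHBYTES('MD5', ...)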

  • Mike C (4/14/2010)


    I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.

    I agree pretty strongly with that (or it could be an MD5 implementation that doesn't actually do MD5; I've come across one or two of those). A quick top-of-the-head calculation suggests the probability of a chance collision at 2^40 (random) rows (which is more than most data warehouses see in a three-year period) is about 2^-26. Anyway, the function of the hash in DW loading should not be to detect sameness with 100% accuracy; it should be to reduce (by several orders of magnitude) the number of rows that must be checked field by field, so a small number of collisions is acceptable for this application.

    edit: typos

    Tom
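    For reference, the standard birthday-bound arithmetic behind estimates like this looks as follows (a sketch assuming a 128-bit hash over uniformly random rows; the exact figure quoted above may rest on different assumptions):

    \[ P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{128}} \approx \frac{(2^{40})^2}{2^{129}} = 2^{-49} \]

    Either way, the chance of an accidental collision over a few years of loads is vanishingly small, which is the point being made.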

  • Mike C (4/14/2010)


    tarquin (4/14/2010)


    Wrapping HASHBYTES with CHECKSUM is not bad as long as you include all columns in your join as indicated and create the index including all columns required for the join. If you only use the result of the checksum in your join, then yes, it is bad!

    If you're planning to compare all of the columns anyway then you don't need to generate a hash code.

    Got infinite CPU resource on your servers, have you?

    Tom

  • Tom.Thomson (4/17/2010)


    Mike C (4/14/2010)


    tarquin (4/14/2010)


    Wrapping HASHBYTES with CHECKSUM is not bad as long as you include all columns in your join as indicated and create the index including all columns required for the join. If you only use the result of the checksum in your join, then yes, it is bad!

    If you're planning to compare all of the columns anyway then you don't need to generate a hash code.

    Got infinite CPU resource on your servers, have you?

    Fortunately for us generating hashes is infinitely fast so we can completely discount that overhead.

  • Tom.Thomson (4/17/2010)


    Mike C (4/14/2010)


    I wouldn't expect many collisions with MD5 unless you're processing 2^64 or more rows. If you're getting collisions with MD5 or SHA I would immediately suspect the concatenation function you're using.

    I agree pretty strongly with that (or it could be an MD5 implementation that doesn't actually do MD5; I've come across one or two of those). A quick top-of-the-head calculation suggests the probability of a chance collision at 2^40 (random) rows (which is more than most data warehouses see in a three-year period) is about 2^-26. Anyway, the function of the hash in DW loading should not be to detect sameness with 100% accuracy; it should be to reduce (by several orders of magnitude) the number of rows that must be checked field by field, so a small number of collisions is acceptable for this application.

    edit: typos

    Not sure how you calculated the probability of collision at 2^40 rows off the top of your head... You can engineer an MD5 collision in 2^39 operations or less, but it's a pretty hard calculation to do without showing your work. In fact, I'll post a very simple MD5 hash collision example shortly. This is why I recommend SHA-1 or better (finding a collision for a 160-bit hash takes on the order of 2^80 operations per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the work needed to force an SHA-1 collision down to about 2^69, which is still more than the roughly 2^64 birthday bound for MD5. It's really hard to prove a negative, and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.

  • Mike C (4/17/2010)


    This is why I recommend SHA-1 or better (finding a collision for a 160-bit hash takes on the order of 2^80 operations per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the work needed to force an SHA-1 collision down to about 2^69, which is still more than the roughly 2^64 birthday bound for MD5. It's really hard to prove a negative, and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.

    Sure, if you want secure validation MD5 is a bit old and weak. But for the DW upload application you aren't after secure validation, you are after saving computation. So the questions are: what fraction of the cost of a full row comparison is the hash algorithm, and what is the expected proportion of cases where the hash will deliver a negative result (i.e. will eliminate the need for a full row comparison)? If the cost fraction is P and the elimination proportion is Q, using the hash multiplies the cost by (1 + P - Q), so as long as P is smaller than Q you are winning. Certainly SHA-1 will have a bigger Q than MD5, but it will also have a bigger P, and I am pretty sure that the increase in P is much greater than the increase in Q, which would mean that SHA-1 is less efficient for this particular application.

    Tom
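    A small worked instance of that cost model (the numbers are purely illustrative assumptions, not measurements):

    \[ \text{cost with hash} = \bigl(P + (1 - Q)\bigr)\,C = (1 + P - Q)\,C, \qquad P = 0.2,\ Q = 0.9 \;\Rightarrow\; 0.3\,C \]

    where C is the cost of a full row comparison; hashing pays off whenever its per-row cost fraction P is smaller than the fraction Q of full comparisons it eliminates.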

  • Tom.Thomson (4/17/2010)


    Mike C (4/17/2010)


    This is why I recommend SHA-1 or better (finding a collision for a 160-bit hash takes on the order of 2^80 operations per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the work needed to force an SHA-1 collision down to about 2^69, which is still more than the roughly 2^64 birthday bound for MD5. It's really hard to prove a negative, and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.

    Sure, if you want secure validation MD5 is a bit old and weak. But for the DW upload application you aren't after secure validation, you are after saving computation. So the questions are: what fraction of the cost of a full row comparison is the hash algorithm, and what is the expected proportion of cases where the hash will deliver a negative result (i.e. will eliminate the need for a full row comparison)? If the cost fraction is P and the elimination proportion is Q, using the hash multiplies the cost by (1 + P - Q), so as long as P is smaller than Q you are winning. Certainly SHA-1 will have a bigger Q than MD5, but it will also have a bigger P, and I am pretty sure that the increase in P is much greater than the increase in Q, which would mean that SHA-1 is less efficient for this particular application.

    I just posted a blog post with a contest to give folks a chance to prove a negative. The blog entry shows a very simple MD5 hash collision: two relatively short binary strings that differ in only a few bytes. I'm also holding a contest (discussed on the blog as well) -- the first person to submit an SHA-1 hash collision wins $100 and a book: http://bit.ly/cRKByv

    On your assertion above, I agree completely, and I discuss this on the blog. Taking the example of the "slowly" changing dimension, the question is how many inbound rows are going to be inserted or updated, and how many are going to be ignored. You also need to break "updates" into two separate types: updates where the hash values are different and updates where the hash values are the same.

    The only inbound rows that don't require a full all-columns comparison are new inserts (the row doesn't exist in the target table) and changed rows (the row exists in the target table, but the hash is different). As you point out, for the compromise position to add any value whatsoever you have to have a much, much higher percentage of inserts and updates (where the hash is different) than of no-change/ignore rows and updates where the hash is the same. If the percentage of inserts and updates (different hash) is not significantly higher than the number of rows to ignore or update (same hash), then you not only gain no value, you lose a whole lot of efficiency by adding the overhead of generating hashes for no useful reason.

    The vast majority of ETL solutions I've encountered tend to discard a lot more duplicate rows than they update or insert (after the initial table load, that is). Perhaps someone can provide some counter-examples they've encountered where the opposite is true?
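    A hedged sketch of that row classification using a staged hash comparison (object and column names are hypothetical, not the article's code):

    -- Classify inbound rows: new (no match on the business key),
    -- changed (key matches, hash differs), or ignorable (key and hash both match).
    SELECT s.BusinessKey,
           CASE
               WHEN t.BusinessKey IS NULL  THEN 'INSERT'
               WHEN t.RowHash <> s.RowHash THEN 'UPDATE'
               ELSE 'IGNORE'  -- a full column comparison is only needed here, if at all
           END AS RowAction
    FROM dbo.StageCustomer AS s
    LEFT JOIN dbo.DimCustomer AS t
        ON t.BusinessKey = s.BusinessKey;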

  • Tom.Thomson (4/17/2010)


    Mike C (4/17/2010)


    This is why I recommend SHA-1 or better (finding a collision for a 160-bit hash takes on the order of 2^80 operations per the birthday paradox). Even the current crop of theoretical attacks on SHA-1 have only pushed the work needed to force an SHA-1 collision down to about 2^69, which is still more than the roughly 2^64 birthday bound for MD5. It's really hard to prove a negative, and in most cases you're not going to save much by using a hash in a compromise solution to eliminate a few rows from full-blown multi-column comparisons. I'm putting together a blog entry to give folks who want to use hashes in this way a chance to prove their case.

    Sure, if you want secure validation MD5 is a bit old and weak. But for the DW upload application you aren't after secure validation, you are after saving computation. So the questions are: what fraction of the cost of a full row comparison is the hash algorithm, and what is the expected proportion of cases where the hash will deliver a negative result (i.e. will eliminate the need for a full row comparison)? If the cost fraction is P and the elimination proportion is Q, using the hash multiplies the cost by (1 + P - Q), so as long as P is smaller than Q you are winning. Certainly SHA-1 will have a bigger Q than MD5, but it will also have a bigger P, and I am pretty sure that the increase in P is much greater than the increase in Q, which would mean that SHA-1 is less efficient for this particular application.

    -- Two different 128-byte inputs that produce the same MD5 hash.
    DECLARE @a varbinary(8000),
            @b varbinary(8000),
            @hA binary(16),
            @hB binary(16);

    SELECT @a = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,
           @b = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab50712467eab4004583eb8fb7f8955ad340609f4b30283e4888325f1415a085125e8f7cdc99fd91dbd7280373c5bd8823e3156348f5bae6dacd436c919c6dd53e23487da03fd02396306d248cda0e99f33420f577ee8ce54b67080280d1ec69821bcb6a8839396f965ab6ff72a70;

    SELECT @hA = HASHBYTES('MD5', @a),
           @hB = HASHBYTES('MD5', @b);

    SELECT CASE WHEN @a = @b
                THEN '@a Equals @b'
                ELSE '@a Is Not Equal To @b'
           END AS AB_Equal,
           CASE WHEN @hA = @hB
                THEN '@hA Equals @hB'
                ELSE '@hA Is Not Equal To @hB'
           END AS Hash_Equal;

    Now do this with SHA1, submit it to my blog http://sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx and we'll see if you can't win $100 and a book.

  • Mike C (4/17/2010)


    SELECT @a = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,

    I surrender; your business users are infinitely more badass than mine if they're analyzing these kinds of statistics in your data warehouse.

  • magarity kerns (4/17/2010)


    Mike C (4/17/2010)


    SELECT @a = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,

    I surrender; your business users are infinitely more badass than mine if they're analyzing these kinds of statistics in your data warehouse.

    LOL. This could actually represent a concatenation of several columns converted to binary format. There are examples of PDF files that generate collisions with MD5. Surely your business users use PDF files? 🙂

  • magarity kerns (4/17/2010)


    Mike C (4/17/2010)


    SELECT @a = 0xd131dd02c5e6eec4693d9a0698aff95c2fcab58712467eab4004583eb8fb7f8955ad340609f4b30283e488832571415a085125e8f7cdc99fd91dbdf280373c5bd8823e3156348f5bae6dacd436c919c6dd53e2b487da03fd02396306d248cda0e99f33420f577ee8ce54b67080a80d1ec69821bcb6a8839396f9652b6ff72a70,

    I surrender; your business users are infinitely more badass than mine if they're analyzing these kinds of statistics in your data warehouse.

    Here's a version of the exact same thing that even a business user might be able to analyze 🙂 I wouldn't expect a business user to necessarily understand the concepts of data type conversions and hash functions, however. For that they'll have to read Wikipedia like everyone else.

    CREATE TABLE #Temp
    (
        ID1 TINYINT, CODE1 CHAR(1), FEE1 SMALLMONEY, FEE2 SMALLMONEY, FEE3 SMALLMONEY,
        FEE4 SMALLMONEY, INTEREST1 SMALLMONEY, ID2 SMALLINT, CODE2 CHAR(1), CHARGE1 SMALLMONEY,
        ID3 TINYINT, PRICE1 SMALLMONEY, CODE3 CHAR(1), ID4 TINYINT, ID5 TINYINT,
        CHARGE2 SMALLMONEY, CHARGE3 SMALLMONEY, ID6 TINYINT, STATE CHAR(2), TAX1 SMALLMONEY,
        TAX2 SMALLMONEY, TAX3 SMALLMONEY, TAX4 SMALLMONEY, SURCHARGE1 SMALLMONEY, SURCHARGE2 SMALLMONEY,
        SURCHARGE3 SMALLMONEY, SURCHARGE4 SMALLMONEY, SURCHARGE5 SMALLMONEY, SURCHARGE6 SMALLMONEY, ID7 TINYINT,
        CODE4 CHAR(1), CODE5 CHAR(1), ID8 TINYINT, REFUND1 SMALLMONEY, REFUND2 SMALLMONEY,
        ID9 TINYINT, FEE5 SMALLMONEY, FEE6 SMALLMONEY, FEE7 SMALLMONEY, FEE8 SMALLMONEY,
        FEE9 SMALLMONEY, ID10 TINYINT, CODE6 CHAR(1), ID11 TINYINT, FEE10 SMALLMONEY
    );

    INSERT INTO #Temp VALUES
    (209, 1, -58702.0826, -28911.7891, -171084.3729, -11139.8966, -124943.9162, 32427, '@', 7289.2088,
     251, 213970.6797, 4, 6, 9, -18959.5005, -46081.5579, 113, 'AZ', 13953.5848,
     -13750.8449, -65236.2254, -214386.3717, -66255.3039, 144628.5147, -136854.4044, 91914.8998, -58170.5036, -201575.5267, 2,
     9, 'c', 6, -76698.0704, -37544.2622, 15, 146793.4926, 142124.2496, -147553.5162, -174261.9466,
     -146777.201, 249, 'e', 43, 187846.9232);

    INSERT INTO #Temp VALUES
    (209, 1, -58702.0826, -28911.7891, -171084.3729, -11139.8966, -125782.777, 32427, '@', 7289.2088,
     251, 213970.6797, 4, 6, 9, -18959.5005, -46081.5579, 241, 'AZ', 13953.5848,
     -13750.8449, -65236.2382, -214386.3717, -66255.3039, 144628.5147, -136854.4044, 91914.8998, -58170.5164, -201575.5267, 2,
     9, 'c', 6, -76698.0704, -37544.2622, 15, 146793.4926, 142124.2496, 67194.8486, -174261.9466,
     -146777.201, 249, 'e', 171, 187846.9232);

    SELECT *
    FROM #Temp;

    SELECT HASHBYTES('MD5',
        CAST(ID1 AS BINARY(1)) + CAST(CODE1 AS BINARY(1)) + CAST(FEE1 AS VARBINARY(4)) +
        CAST(FEE2 AS VARBINARY(4)) + CAST(FEE3 AS VARBINARY(4)) + CAST(FEE4 AS VARBINARY(4)) +
        CAST(INTEREST1 AS VARBINARY(4)) + CAST(ID2 AS VARBINARY(2)) + CAST(CODE2 AS BINARY(1)) +
        CAST(CHARGE1 AS VARBINARY(4)) + CAST(ID3 AS BINARY(1)) + CAST(PRICE1 AS VARBINARY(4)) +
        CAST(CODE3 AS BINARY(1)) + CAST(ID4 AS BINARY(1)) + CAST(ID5 AS BINARY(1)) +
        CAST(CHARGE2 AS VARBINARY(4)) + CAST(CHARGE3 AS VARBINARY(4)) + CAST(ID6 AS BINARY(1)) +
        CAST(STATE AS VARBINARY(2)) + CAST(TAX1 AS VARBINARY(4)) + CAST(TAX2 AS VARBINARY(4)) +
        CAST(TAX3 AS VARBINARY(4)) + CAST(TAX4 AS VARBINARY(4)) + CAST(SURCHARGE1 AS VARBINARY(4)) +
        CAST(SURCHARGE2 AS VARBINARY(4)) + CAST(SURCHARGE3 AS VARBINARY(4)) + CAST(SURCHARGE4 AS VARBINARY(4)) +
        CAST(SURCHARGE5 AS VARBINARY(4)) + CAST(SURCHARGE6 AS VARBINARY(4)) + CAST(ID7 AS BINARY(1)) +
        CAST(CODE4 AS BINARY(1)) + CAST(CODE5 AS BINARY(1)) + CAST(ID8 AS BINARY(1)) +
        CAST(REFUND1 AS VARBINARY(4)) + CAST(REFUND2 AS VARBINARY(4)) + CAST(ID9 AS BINARY(1)) +
        CAST(FEE5 AS VARBINARY(4)) + CAST(FEE6 AS VARBINARY(4)) + CAST(FEE7 AS VARBINARY(4)) +
        CAST(FEE8 AS VARBINARY(4)) + CAST(FEE9 AS VARBINARY(4)) + CAST(ID10 AS BINARY(1)) +
        CAST(CODE6 AS BINARY(1)) + CAST(ID11 AS BINARY(1)) + CAST(FEE10 AS VARBINARY(4))
    )
    FROM #Temp;

    DROP TABLE #Temp;

  • Mike C (4/17/2010)


    The vast majority of ETL solutions I've encountered tend to discard a lot more duplicate rows than they update or insert (after the initial table load, that is). Perhaps someone can provide some counter-examples they've encountered where the opposite is true?

    It's certainly true that if the system has those characteristics (many more discards than inserts/updates) then using a hash function is a waste of resources, so hashing should be avoided in that case. And with a good design (so that everything in the data source has a meaningful timestamp or version ID) there's no need to compare more than that (plus the primary key), and that's probably a lot cheaper than a hash. So now you have me wondering whether all this stuff I have read about using hashing to optimise DW updates is just hot air.

    Tom
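    A sketch of that timestamp/version approach (hypothetical names; it assumes the source reliably bumps LastModified on every change):

    -- Only brand-new rows or rows whose source timestamp is newer than the
    -- warehouse copy need further processing; no hash, no column-by-column compare.
    SELECT s.*
    FROM dbo.StageOrders AS s
    LEFT JOIN dbo.FactOrders AS f
        ON f.OrderKey = s.OrderKey
    WHERE f.OrderKey IS NULL                -- new row
       OR s.LastModified > f.LastModified;  -- changed row per source versioning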

  • Tom.Thomson (4/17/2010)


    Mike C (4/17/2010)


    The vast majority of ETL solutions I've encountered tend to discard a lot more duplicate rows than they update or insert (after the initial table load, that is). Perhaps someone can provide some counter-examples they've encountered where the opposite is true?

    It's certainly true that if the system has those characteristics (many more discards than inserts/updates) then using a hash function is a waste of resources, so hashing should be avoided in that case. And with a good design (so that everything in the data source has a meaningful timestamp or version ID) there's no need to compare more than that (plus the primary key), and that's probably a lot cheaper than a hash. So now you have me wondering whether all this stuff I have read about using hashing to optimise DW updates is just hot air.

    And I would argue that no, it's not hot air, if you believe in the collision-free properties of your hash function. If you have a truly collision-free hash function then there's no need to do column-by-column comparisons at all. If you eliminate the column-by-column comparisons completely, you'll save considerable processing, as you pointed out. Whether you use a timestamp, version ID, or hash function, the key to better efficiency is to eliminate the column-by-column comparisons completely from your process.

    A timestamp or version ID is still not going to tell you whether the row being imported via ETL from a flat file is different from the corresponding row in the target table--unless, of course, your source systems have *global* timestamps or version IDs that align with the target system's. If you can put a global versioning system in place on every source system you pull data from, it would increase the efficiency of your ETL considerably and remove the need for many of these types of change detection solutions. That's usually not feasible though...
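    A sketch of the "trust the hash" variant being argued for here (hypothetical names; it accepts the residual collision risk discussed above instead of comparing columns):

    -- No column-by-column comparison at all: the stored hash alone decides.
    UPDATE t
    SET    t.Col1    = s.Col1,
           t.Col2    = s.Col2,
           t.RowHash = s.RowHash
    FROM dbo.DimCustomer AS t
    JOIN dbo.StageCustomer AS s
        ON s.BusinessKey = t.BusinessKey
    WHERE t.RowHash <> s.RowHash;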

    Here I was believing in Kimball! To prove or disprove my guru's theory I went and looked at a client database that has been in operation for about a year. I found a table with just under 20 million records; the logical key is a date and two ints (EffectiveDate, FundId and DataTypeId). The date is the most selective, FundId a bit less so, and DataTypeId basically useless. I have created a checksum of the three keys, roughly as follows: CHECKSUM(HASHBYTES(date as string), FundId, DataTypeId), which is stored along with the values at load time. The checksum calculation has almost no overhead at load.

    The cardinality values shown in the statistics details are:

    1) Checksum = 0.00005228
    2) EffectiveDate = 0.0001738
    3) FundId = 0.00057142
    4) DataTypeId = 1

    So my hash index is more selective than any of the logical keys alone. This represents a moderate amount of data loaded on a daily basis; more data would show a bigger difference.

    Am I being dumb here - is this a wasted exercise?
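    For what it's worth, the pattern described in this post might look roughly like the following (a persisted computed column is one way to store the value at load time; the table name and the exact hash expression are guesses at the intent, not the poster's code):

    ALTER TABLE dbo.FactFundValue ADD KeyHash AS
        CHECKSUM(HASHBYTES('MD5', CONVERT(varchar(8), EffectiveDate, 112)),
                 FundId, DataTypeId) PERSISTED;

    -- Index the narrow hash, keeping the real keys available for the join predicate.
    CREATE NONCLUSTERED INDEX IX_FactFundValue_KeyHash
        ON dbo.FactFundValue (KeyHash)
        INCLUDE (EffectiveDate, FundId, DataTypeId);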

