Storing a hash of a row

  • Hi Folks,

    First of all, happy new year to you all!

    TL;DR - I need to store a hash value calculated in a staging table into a permanent table as a binary(32) field.  Would it be best to store this value on the record itself (at the front of the table), or would I be better off storing it in a lookup table?

    Now the very long-winded post explaining what I am doing......

    I am after a bit of advice.  I am currently working on a data store project.  The purpose of the data store is to take all the records from our source systems (typically Sybase ASE) and load them into a Type 2 table, with the records kept as close as possible to the source systems, plus a bit of metadata to identify inserts/updates, start/end dates, and a hash of the record.

    To populate the table, I stage the data in a staging table (using BCP out/in - much faster when getting data from ASE), add a hash column to the staging table (as a computed HASHBYTES column), and create a unique index on what would be the source primary key.  When this data is loaded into the destination table (comparing the primary key fields to determine whether there is a new record, and the hash to determine whether there is an update), I take that hash value from staging and save it as a permanent binary(32) column on the END of the table.
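
    For illustration, a minimal sketch of that staging pattern.  The table and column names (stg_Customer, CustomerId, FirstName, LastName) are hypothetical stand-ins for the real source columns:

    -- Hypothetical staging table; the real one mirrors the source columns.
    CREATE TABLE dbo.stg_Customer
    (
        CustomerId int          NOT NULL,   -- source primary key
        FirstName  nvarchar(50) NULL,
        LastName   nvarchar(50) NULL,
        -- Computed hash of the record.  SHA2_256 returns 32 bytes, which is
        -- why the permanent column is binary(32).  CONCAT treats NULLs as
        -- empty strings, and the '|' separator keeps adjacent values from
        -- colliding.
        RowHash AS CAST(HASHBYTES('SHA2_256',
                        CONCAT(CustomerId, '|', FirstName, '|', LastName))
                   AS binary(32)) PERSISTED
    );

    -- Unique index on what would be the source primary key.
    CREATE UNIQUE INDEX UX_stg_Customer ON dbo.stg_Customer (CustomerId);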

    Unfortunately, saving this data at the end of the table was a BIG mistake on my behalf.  Because we want to maintain the data structure as it appears in the source, whenever a new field is added I need to drop and re-create the table.  This mistake recently became very apparent when having to deal with a 200-million-row table, which took a very long time to rebuild (:crying: - I can at least admit when I screw something up).

    As we are still in a development phase, I want to change how this hash value is stored.  While I understand that column order is not important from a SQL Server point of view (I could simply add new fields to the end of the table and leave the hash value hanging there in the middle), I would ideally keep the layout as close to the source as possible.  The way I see it, I have two options:

    1 - Modify the table and bring the hash field up the front with the other meta data
    2 - Create a lookup table that stores the hash value and a record Identifier

    Option 1 would require the least amount of development, but would take the longest time to do the updates.  A rough sketch of what option 2 might look like is below.
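
    To make option 2 concrete, a minimal sketch, assuming the permanent table has some stable record identifier to key on (all names hypothetical):

    -- Hypothetical lookup table for option 2: one hash row per record,
    -- keyed by whatever uniquely identifies a row in the permanent table.
    CREATE TABLE dbo.CustomerRowHash
    (
        RecordId int        NOT NULL PRIMARY KEY,  -- identifier of the permanent row
        RowHash  binary(32) NOT NULL               -- SHA2_256 hash from staging
    );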

    Currently, I am developing on SQL Server 2016, but will be migrating to 2017 very soon.

    I have spoken to a few developers locally and they have given me different answers, so I'd like to put this question to the Global Council 😉 .  Does anyone have an opinion on which would be the better solution?

    Cheers

  • What if you left the table as it is and then added columns to the end as needed?  Not sure why you want to keep the table as close to the source as possible, but you could always create a view to do that - a sketch of the idea is below
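
    For example (hypothetical names), the base table could keep the hash and metadata wherever they physically sit, with a view presenting just the source columns in source order:

    -- Hypothetical view exposing only the source columns, in source order,
    -- while dbo.Customer also carries RowHash and the other metadata columns.
    CREATE VIEW dbo.Customer_SourceLayout
    AS
    SELECT CustomerId,
           FirstName,
           LastName
    FROM dbo.Customer;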

  • I did something similar.  I made a persisted computed column in the staging table that held the hash value.  That way the CPU cost of the hash calculation was hidden by the I/O overhead of loading the stage table, and the column was populated as the pages were written, so I didn't have to do an expensive update afterwards that caused a lot of page splits.  Then I would just copy the hash over to a regular bigint column in the prod table from the stage table during inserts/updates.

    In the future, when new columns were added, I only had to change the calculation in the stage table.  Since the stage table was regularly truncated, adding the column was not time-consuming or lock-inducing.

    Then the next update for each record would carry the updated hash to the prod table.
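
    A rough sketch of that copy step, reusing the hypothetical names from the staging sketch above.  It glosses over the OP's Type 2 handling, which would end-date the old row rather than update it in place:

    -- Compare staging to prod on the source key; use the precomputed hash to
    -- detect changes and carry it across on insert/update.
    MERGE dbo.Customer AS tgt
    USING dbo.stg_Customer AS src
        ON tgt.CustomerId = src.CustomerId
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, FirstName, LastName, RowHash)
        VALUES (src.CustomerId, src.FirstName, src.LastName, src.RowHash)
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET tgt.FirstName = src.FirstName,
                   tgt.LastName  = src.LastName,
                   tgt.RowHash   = src.RowHash;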

  • Mike01 - Monday, January 7, 2019 7:31 AM

    What if you left the table as it is and then added columns to the end as needed?  Not sure why you want to keep the table as close to the source as possible, but you could always create a view to do that

    OK, to be honest, it's a bit of OCD on both my boss's behalf and mine (considering that we are still in development).  I feel more comfortable separating the metadata from the actual data.  If the system were already in production, I would just leave it as is.

    doug 88182 - Monday, January 7, 2019 9:01 AM

    I did something similar.  I made a persisted computed column in the staging table that held the hash value.  That way the CPU cost of the hash calculation was hidden by the I/O overhead of loading the stage table, and the column was populated as the pages were written, so I didn't have to do an expensive update afterwards that caused a lot of page splits.  Then I would just copy the hash over to a regular bigint column in the prod table from the stage table during inserts/updates.

    In the future, when new columns were added, I only had to change the calculation in the stage table.  Since the stage table was regularly truncated, adding the column was not time-consuming or lock-inducing.

    Then the next update for each record would carry the updated hash to the prod table.

    This is the exact reason I have chosen to do it this way.  The hash value calculated in the staging table represents the record, and the record never changes once it is stored in the permanent table, so I felt it was much more efficient to store the hash as an actual value rather than calculating it every time.  In the grand scheme of things, the data storage overhead is a small percentage.

  • @mike01

    I expect the OP wants to always have the hash value in the same place so that queries are consistent - you can pretty much lift and shift from source to DWH as the table structures are the same.  This would mean that the hash should be either:
    At the end of the table (forces columns to be inserted if the structure changes, which requires a rebuild that can be expensive)
    As the first field in the table
    Immediately after the primary key fields

    I would suggest that immediately after the PK probably makes the most sense.

    I know you have said that you want to keep the data structures the same as the source systems, but have you looked at the Data Vault model?  This splits the data into a hub table (primary key fields) and satellite (SAT) tables (attribute fields).  You can create multiple SAT tables for a source table, and then you only have to apply the Type 2 updates to the volatile fields.  You may see significant performance and storage wins if you follow this model.  A rough sketch of the split is below.
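
    For illustration, a minimal hub/satellite split with hypothetical names.  Real Data Vault designs usually also carry a record source column, and commonly hash the business key to form the hub key:

    -- Hub: just the business key plus load metadata.
    CREATE TABLE dbo.HubCustomer
    (
        HubCustomerKey binary(32) NOT NULL PRIMARY KEY,  -- hash of the business key
        CustomerId     int        NOT NULL,              -- source business key
        LoadDate       datetime2  NOT NULL
    );

    -- Satellite: descriptive attributes, versioned per load (Type 2 style).
    CREATE TABLE dbo.SatCustomer
    (
        HubCustomerKey binary(32)   NOT NULL,
        LoadDate       datetime2    NOT NULL,
        FirstName      nvarchar(50) NULL,
        LastName       nvarchar(50) NULL,
        RowHash        binary(32)   NOT NULL,  -- hash of the attribute payload
        CONSTRAINT PK_SatCustomer PRIMARY KEY (HubCustomerKey, LoadDate)
    );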

  • aaron.reese - Wednesday, January 9, 2019 7:22 AM

    @mike01

    I expect the OP wants to always have the hash value in the same place so that queries are consistent - you can pretty much lift and shift from source to DWH as the table structures are the same.  This would mean that the hash should be either:
    At the end of the table (forces columns to be inserted if the structure changes, which requires a rebuild that can be expensive)
    As the first field in the table
    Immediately after the primary key fields

    I would suggest that immediately after the PK probably makes the most sense.

    I know you have said that you want to keep the data structures the same as the source systems, but have you looked at the Data Vault model?  This splits the data into a hub table (primary key fields) and satellite (SAT) tables (attribute fields).  You can create multiple SAT tables for a source table, and then you only have to apply the Type 2 updates to the volatile fields.  You may see significant performance and storage wins if you follow this model.

    Thanks for the reply.  Looks like I have a bit of reading to do, because any performance gains would be worth it.  I hadn't considered such a design.

    Cheers
