Avoiding duplicates in incremental update to Warehouse

  • We have a legacy database whose data needs to be included in our yet-to-be-built SQL 2005 data warehouse. Some of the tables don’t have any natural candidates for a primary key. Of course, we also need to add other data to the mix.


    Suppose we load the empty warehouse initially. On subsequent loads we don’t want to pick up records that haven’t changed since the first load (“duplicates”), but we also don’t want to delete the contents of the entire warehouse because of the load time. Any ideas/best practices for handling “incremental updates” to a warehouse would be appreciated.


    TIA,


    Bill

  • Generally I've steered clear of incremental updates entirely, as I've never had enough confidence that I can completely control the source data automatically.

    I like the idea and had toyed with a binary column added to each fact table to flag it (said flag to be updated by a trigger etc.) and then using that column for the update. The drawback is that in my business, if the flag for some reason isn't set back after the incremental update, then next time we're double counting sales and mis-reporting to shareholders... next minute someone mentions SOX and all heck would break loose.
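
    Roughly what I had in mind, as a sketch - the table, column and trigger names are all made up:

        -- Sketch only: dbo.SalesFact, SaleID and NeedsExport are made-up names.
        CREATE TABLE dbo.SalesFact (
            SaleID      int   NOT NULL PRIMARY KEY,
            Amount      money NOT NULL,
            NeedsExport bit   NOT NULL DEFAULT (1)  -- new rows start flagged
        );
        GO
        -- Re-flag any row that gets updated (inserts are covered by the default).
        CREATE TRIGGER trg_SalesFact_Flag ON dbo.SalesFact
        AFTER UPDATE
        AS
            SET NOCOUNT ON;
            IF UPDATE(NeedsExport) RETURN;  -- ignore the ETL's own flag reset
            UPDATE f SET NeedsExport = 1
            FROM dbo.SalesFact AS f
            JOIN inserted AS i ON i.SaleID = f.SaleID;
        GO
        -- Incremental load: take the flagged rows...
        SELECT SaleID, Amount FROM dbo.SalesFact WHERE NeedsExport = 1;
        -- ...then reset the flag. If this step ever fails you double count.
        UPDATE dbo.SalesFact SET NeedsExport = 0 WHERE NeedsExport = 1;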

    In your circumstances though it sounds like partitions might be the best bet, unless your base dimensions change regularly? Have one cube but two partitions: one that you can refresh daily and one that only needs to be processed once?
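
    On the relational side that could be as simple as two views over the fact table for the partitions to be bound to - the cut-off date and names here are only for illustration:

        -- Sketch: assumes dbo.SalesFact has a SaleDate column. One cube
        -- partition is bound to each view; only the "current" one gets
        -- reprocessed daily, the historic one is processed once.
        CREATE VIEW dbo.vSalesFact_Historic AS
            SELECT * FROM dbo.SalesFact WHERE SaleDate < '20060101';
        GO
        CREATE VIEW dbo.vSalesFact_Current AS
            SELECT * FROM dbo.SalesFact WHERE SaleDate >= '20060101';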

    Mike

  • Some questions / comments.

    Is there any flag in the legacy system to let you know when rows have changed?  This could even be a "last updated" column against the row.  This would at least let you know what has changed since the last time you did a data dump.
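
    If there is, the extract becomes simple - something along these lines, where LastUpdated and the one-row watermark table are hypothetical:

        -- Sketch: pull only rows changed since the last extract.
        DECLARE @since datetime, @now datetime;
        SET @now = GETDATE();
        SELECT @since = LastExtractTime FROM dbo.ETL_Watermark;  -- one-row table

        SELECT *
        FROM Legacy.dbo.SomeTable             -- made-up legacy table name
        WHERE LastUpdated > @since
          AND LastUpdated <= @now;            -- upper bound avoids a race

        -- Only advance the watermark once the load has succeeded.
        UPDATE dbo.ETL_Watermark SET LastExtractTime = @now;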

    Having no pkey makes things trickier, I suppose. You could consider the combination of all fields in the row to be one very large composite primary key and act accordingly. But in that case any change to the row would change the assumed primary key, so the changed row wouldn't be a duplicate anyway...
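
    If you went that route, hashing every column into one value is a cheap way to do the comparison - a sketch with made-up table and column names (mind NULLs, delimiters and the 8000-byte input limit of HASHBYTES in 2005):

        -- Sketch: treat the whole row as the key by hashing all its columns.
        WITH src AS (
            SELECT s.ColA, s.ColB, s.ColC,
                   HASHBYTES('MD5',
                       ISNULL(s.ColA, '') + '|' +
                       ISNULL(CONVERT(varchar(30), s.ColB), '') + '|' +
                       ISNULL(s.ColC, '')) AS RowHash
            FROM Staging.dbo.FactStuff AS s
        )
        INSERT INTO Warehouse.dbo.FactStuff (ColA, ColB, ColC, RowHash)
        SELECT ColA, ColB, ColC, RowHash
        FROM src
        WHERE NOT EXISTS (SELECT 1
                          FROM Warehouse.dbo.FactStuff AS w
                          WHERE w.RowHash = src.RowHash);
        -- An index on RowHash keeps the NOT EXISTS check cheap. Note that a
        -- changed row hashes to a new value, so it arrives as a new row
        -- rather than an update - exactly the caveat above.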

    Is the legacy system in MS SQL so you could add triggers to catch inserts separately from deletes?
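
    If it is, a pair of triggers logging into side tables would let the warehouse pick the changes up - again a sketch with made-up names:

        -- Sketch: dbo.SomeTable is the legacy table; the log tables and
        -- trigger names are made up.
        CREATE TABLE dbo.SomeTable_InsertLog
            (LoggedAt datetime NOT NULL DEFAULT GETDATE(), ColA varchar(50), ColB int);
        CREATE TABLE dbo.SomeTable_DeleteLog
            (LoggedAt datetime NOT NULL DEFAULT GETDATE(), ColA varchar(50), ColB int);
        GO
        CREATE TRIGGER trg_SomeTable_Ins ON dbo.SomeTable AFTER INSERT AS
            INSERT dbo.SomeTable_InsertLog (ColA, ColB)
            SELECT ColA, ColB FROM inserted;
        GO
        CREATE TRIGGER trg_SomeTable_Del ON dbo.SomeTable AFTER DELETE AS
            INSERT dbo.SomeTable_DeleteLog (ColA, ColB)
            SELECT ColA, ColB FROM deleted;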

    Mike's suggestion of having two partitions - one that is wiped clean and refreshed daily vs one that, once loaded, stays the same - may suit your scenario if you can be sure that, after a set period of time, a large portion of your data will not change.

    Good luck - let us know how you progress.

    Cheers, Ian 
