A faster way to prepare dimensional databases

  • Hi Magarity,

    Absolutely the way to go once the process is stable and populated on an incremental basis.

    Tell me - I gather from the discussions on the subject that there is only a 1 in 2^40 chance of a duplicate MD5 checksum (as opposed to the SQL Server CHECKSUM function). Have you ever experienced problems with checksums?

    Thanks,

    Adam

  • For Dbowlin:

    There is a good MD5 checksum approach here:

    http://www.tek-tips.com/viewthread.cfm?qid=1268144&page=1
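
    For anyone who doesn't want to follow the link, a minimal sketch of that style of MD5 change detection in T-SQL, using HASHBYTES (the dbo.Customer_Staging table and its columns are purely illustrative):

    -- A minimal sketch, not the linked article's exact code: compute an MD5 hash per row
    -- so changed rows can be detected on the next load. ISNULL and the '|' delimiter
    -- guard against NULLs and against distinct columns running together.
    SELECT
        CustomerID,
        HASHBYTES('MD5',
            ISNULL(CustomerName, '') + '|' +
            ISNULL(City, '') + '|' +
            ISNULL(CONVERT(varchar(20), CreditLimit), '')
        ) AS RowHash
    FROM dbo.Customer_Staging;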

  • Hi,

    Thanks for the article and example.

    What I don't understand - and this may be because you don't have complete control of the ETL, or because you're restricted by the design - is why you truncate the dimensions. Could you not just relate them to a business key and do the appropriate SCD 1 / SCD 2 transformations?
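
    As a rough illustration of the SCD 2 pattern being suggested, a minimal sketch against hypothetical DimCustomer and Customer_Staging tables (the names and the tracked column are illustrative only):

    -- Expire the current dimension row when a tracked attribute has changed...
    UPDATE d
    SET    d.IsCurrent = 0,
           d.ValidTo   = GETDATE()
    FROM   dbo.DimCustomer AS d
    JOIN   dbo.Customer_Staging AS s
             ON s.CustomerBK = d.CustomerBK
    WHERE  d.IsCurrent = 1
      AND  ISNULL(d.City, '') <> ISNULL(s.City, '');

    -- ...then insert a new current row for changed and brand-new business keys.
    INSERT INTO dbo.DimCustomer (CustomerBK, CustomerName, City, ValidFrom, ValidTo, IsCurrent)
    SELECT s.CustomerBK, s.CustomerName, s.City, GETDATE(), NULL, 1
    FROM   dbo.Customer_Staging AS s
    LEFT JOIN dbo.DimCustomer AS d
             ON d.CustomerBK = s.CustomerBK
            AND d.IsCurrent = 1
    WHERE  d.CustomerBK IS NULL;

    An SCD 1 change would simply be an UPDATE in place rather than the expire-and-insert pair.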

    Even with the Fact table, you could just insert for new records and update for existing records.

    Regards,

    Amrit

  • Hi Amrit,

    This is explained in my reply to Magnus, above - the design is in a considerable state of flux, and we wanted a complete teardown on every process run, to guarantee coherence and to avoid having "old" data cluttering up the DW.

    Sorry that this wasn't clear in the article.

    Otherwise yes - standard techniques could (and probably will) be used.

    Regards,

    Adam

  • Ok, great!

    Sorry if I misunderstood.

  • No, I've never had a problem with using md5. I'm inspired to write my own article on how to use md5 in an update/insert strategy for loading data warehouses if I can get Steve the admin to accept it.

  • magarity kerns (2/25/2010)


    No, I've never had a problem with using md5. I'm inspired to write my own article on how to use md5 in an update/insert strategy for loading data warehouses if I can get Steve the admin to accept it.

    By all means, please write away. We currently truncate and replace, but would be extremely interested in a more incremental strategy. For example, my understanding is that the BIDS Slowly Changing Dimension transformation is a performance hound for large tables.
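
    In the meantime, a minimal sketch of the kind of hash-compare insert/update load being discussed, assuming the dimension and staging tables both carry a HASHBYTES('MD5', ...) value per row (the DimProduct / Product_Staging names and columns are illustrative):

    MERGE dbo.DimProduct AS tgt
    USING dbo.Product_Staging AS src
          ON tgt.ProductBK = src.ProductBK            -- business key
    WHEN MATCHED AND tgt.RowHash <> src.RowHash       -- only rows whose MD5 has changed
        THEN UPDATE SET tgt.ProductName = src.ProductName,
                        tgt.Category    = src.Category,
                        tgt.RowHash     = src.RowHash
    WHEN NOT MATCHED BY TARGET                        -- brand-new business keys
        THEN INSERT (ProductBK, ProductName, Category, RowHash)
             VALUES (src.ProductBK, src.ProductName, src.Category, src.RowHash);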

  • We don't need no stinkin' changing dimension widgets... Took me a while with the article-submitting editor, but I got my treatise on using MD5 submitted. We'll see how fast it gets rejected.

  • Fabulous! I look forward to reading it!

    Adam

  • Thanks for the article.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • It's a good article.

    However, did you experience any problems with datatypes being incorrectly assigned when using the SELECT ... INTO method? I notice you mention that nulls are handled in the consolidation layer, so that may mitigate the issue.
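
    On the datatype point, a minimal sketch of one way to stop SELECT ... INTO from guessing column types, by casting each expression explicitly (table and column names are illustrative):

    -- SELECT ... INTO derives each new column's type from its expression,
    -- so explicit CASTs pin down the types of the table being created.
    SELECT
        CAST(c.CustomerID   AS int)            AS CustomerID,
        CAST(c.CustomerName AS nvarchar(100))  AS CustomerName,
        CAST(c.CreditLimit  AS decimal(18, 2)) AS CreditLimit
    INTO dbo.DimCustomer_Load
    FROM dbo.Customer_Source AS c;

    Wrapping an expression in ISNULL(...) also makes the created column NOT NULL, which ties in with nulls being handled in the consolidation layer.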

    I'm also not sure how this can be adapted to an incremental DW unless you are not concerned about maintaining history: by dropping the fact table you lose the ability to maintain the historic fact record, and so you defeat one of the objectives of an incremental load, which is to allow point-in-time analysis - something that can be quite critical for DWs in the finance industry.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • INSERT INTO <table> WITH (TABLOCK)

    will be minimally logged in the same way that SELECT ... INTO is by default.
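
    A minimal sketch of the two forms side by side (table names are illustrative; both forms are only minimally logged when the database is in the SIMPLE or BULK_LOGGED recovery model):

    -- SELECT ... INTO creates the table and is minimally logged by default.
    SELECT *
    INTO   dbo.FactSales_New
    FROM   dbo.FactSales_Staging;

    -- INSERT ... WITH (TABLOCK) loads an existing table and can be minimally
    -- logged in the same way, because the table lock allows the bulk-load path.
    INSERT INTO dbo.FactSales WITH (TABLOCK)
    SELECT *
    FROM   dbo.FactSales_Staging;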

  • For example, my understanding is that the BIDS Slowly Changing Dimension transformation is a performance hound for large tables.

    FAO tskelly:

    If you're finding the standard SCD component in SSIS slow, have a look at Todd McDermid's Kimball SCD component.

    I've switched to using his Kimball SCD for a number of reasons. First of all, it's faster. As mentioned in the details, it doesn't destroy the dataflow when you need to make changes. Also, it provides a wealth of auditing information. Well worth a look!

    You can find it here: http://kimballscd.codeplex.com/

    _____________________________________________________________________________
    MCITP: Business Intelligence Developer (2005)

  • Thank you Adam.

    At first it is almost necessary to have control of the whole "ETL" process. Missing data or bad quality is a major problem in many cases. You can seldom avoid outer joins. I use ISNULL(expression, 'MISSING') and have a default dimension member MISSING in most of the dimensions. Your approach is also useful in "non-OLAP" situations, such as with the new PowerPivot feature.

    We do have dimension tables and one or more fact tables to import; PowerPivot then creates the "CUBE" by itself.
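
    A minimal sketch of that MISSING-member pattern, assuming the dimension holds a predefined row for unknown values (the -1 surrogate key and the table names are just an illustrative convention):

    -- Outer-join the staged fact rows to the dimension; anything that fails the
    -- lookup is pointed at the default MISSING member instead of being dropped.
    SELECT
        ISNULL(d.CustomerKey, -1) AS CustomerKey,    -- -1 = surrogate key of the MISSING member
        s.SalesAmount,
        s.SalesDate
    FROM dbo.Sales_Staging AS s
    LEFT JOIN dbo.DimCustomer AS d
           ON d.CustomerBK = ISNULL(s.CustomerBK, 'MISSING');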

    Gosta M

  • dave-dj (12/17/2010)


    For example, my understanding is that the BIDS Slowly Changing Dimension transformation is a performance hound for large tables.

    FAO tskelly:

    If you're finding the standard SCD component in SSIS slow, have a look at Todd McDermid's Kimball SCD component.

    I've switched to using his Kimball SCD for a number of reasons. First of all, it's faster. As mentioned in the details, it doesn't destroy the dataflow when you need to make changes. Also, it provides a wealth of auditing information. Well worth a look!

    You can find it here: http://kimballscd.codeplex.com/

    We use Todd's component as well; it has a few performance issues with large datasets. We did some alpha testing of v1.6 of the component, and that seems to have fixed some of the performance issues - unfortunately we're still waiting for the final release.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices
