How to Populate the Data Warehouse

  • OK, so we have designed a new data warehouse and populated it with the initial T-SQL scripts executed in a DTS package. I have then created the cubes and processed them. We are using Analysis Server 2000.

    My ultimate question is: once the above has been done, how do you populate the data warehouse from the production OLTP database on an incremental basis? In other words, how do you copy only the newly created or updated data from the OLTP database to the data warehouse?

    I know that you could execute the initial scripts once a night to do a full refresh, but what about when the OLTP database is large, say many gigabytes in size?

    What is the best practice in this situation?

    What does everyone do with their data warehouses?


    Kindest Regards,

  • Too many variables and dependencies to give a single answer.

    Examples of factors that influence design & architecture:

    - Type of source system - is it a legacy mainframe, or SQL based?

    - If legacy, can your legacy programmers generate an extract of what changed, or can they only dump the current state?

    - If SQL based, does each record have a reliable last insert/update timestamp?

    - What type of warehouse is it - Kimball style, with dimension & fact tables and slowly changing dimensions?


  • The source system is SQL Server 2000.

    I haven't got a timestamp column on any of the tables yet, but I can add one if I need to.

    There are 2 fact tables and several dimension tables.

    We are using Analysis Server to produce the cubes.


    Kindest Regards,

  • You will need some sort of ETL process that runs at a regular interval. The method that I use is something like the following. For each table, write SQL something like:

    --update changed records
    update dt
    set dt.Name = st.Name,
        dt.Attribute = st.Attribute
    from DWDB..dimTable dt
    inner join OLTPDB..srcTable st on dt.ID = st.ID and (dt.Name <> st.Name or dt.Attribute <> st.Attribute)

    --insert new records
    insert DWDB..dimTable (ID, Name, Attribute)
    select st.ID, st.Name, st.Attribute
    from OLTPDB..srcTable st
    left outer join DWDB..dimTable dt on st.ID = dt.ID
    where dt.ID is null

    Using this style, the initial population can use the same code, and the DW would be completely repopulated if the tables in DWDB were truncated.

  • Within our system we restore the databases on a daily basis and merge data using an ETL process to grab the data we do not have yet. This is for our fact tables. For our dimensions, we have 90% Type 1 changes, plus a handful of columns we care about historically, which we have marked as triggers (not SQL triggers, just the term I am using) to perform Type 2 changes.
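
    As a rough illustration only (the table and column names below are made up, not Cliff's actual schema), a Type 1 change simply overwrites the attribute in place, while a Type 2 change expires the current row and inserts a new one:

    --Type 1: overwrite the attribute, no history kept
    update d
    set d.City = s.City
    from DWDB..dimCustomer d
    inner join OLTPDB..Customer s on d.CustomerID = s.CustomerID
    where d.IsCurrent = 1 and d.City <> s.City

    --Type 2: expire the current row when a tracked ("trigger") column changes
    update d
    set d.IsCurrent = 0, d.EndDate = getdate()
    from DWDB..dimCustomer d
    inner join OLTPDB..Customer s on d.CustomerID = s.CustomerID
    where d.IsCurrent = 1 and d.CustomerType <> s.CustomerType

    --then insert a new current row carrying the changed value (this also picks up brand new customers)
    insert DWDB..dimCustomer (CustomerID, City, CustomerType, StartDate, EndDate, IsCurrent)
    select s.CustomerID, s.City, s.CustomerType, getdate(), null, 1
    from OLTPDB..Customer s
    left outer join DWDB..dimCustomer d on s.CustomerID = d.CustomerID and d.IsCurrent = 1
    where d.CustomerID is null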

    I would highly recommend reading some material from Ralph Kimball. Specifically, The Data Warehouse ETL Toolkit is a good reference book for developing a robust ETL system.

    Cliff

  • Ok thanks folks.


    Kindest Regards,

  • We populate the stage of our data warehouse with 3 different methods:

    1. Rebuild - we clear the stage every time and download the entire table (all records) from the source system.

    2. Incremental - we download only the delta from the source system.

    3. Snapshot - we download the entire table, but we don't clear the stage first, i.e. we keep the history in the data warehouse.

    For incremental loading we normally use the creation datetime and updated datetime columns on the source system. If these are not available, we build triggers in the source system (for insert, update, delete). Because triggers slow the source system down (OLTP requires fast response times), we try to avoid this if we can. On DB2-based source systems we have journaling, so we can use that if we need to.
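
    A minimal sketch of this kind of delta extract (the table names and the etlControl watermark table below are hypothetical, not from Vincent's post):

    --read the high-water mark left by the previous load
    declare @LastLoad datetime
    select @LastLoad = LastLoadTime from DWDB..etlControl where TableName = 'Orders'

    --pull only rows created or updated since then into the stage
    insert DWDB..stgOrders (OrderID, CustomerID, Amount, CreatedDate, UpdatedDate)
    select o.OrderID, o.CustomerID, o.Amount, o.CreatedDate, o.UpdatedDate
    from OLTPDB..Orders o
    where o.CreatedDate > @LastLoad or o.UpdatedDate > @LastLoad

    --move the high-water mark forward for the next run
    update DWDB..etlControl set LastLoadTime = getdate() where TableName = 'Orders'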

    When a record is deleted from the source system, we don't normally delete it from the data warehouse, so we don't need to track deletions. But if we do need to delete from the DW, the first thing we do is look for a soft delete in the source system. By soft delete I mean a column which marks that the record is no longer used. If a soft delete is not available, then we normally compare the PK column (the natural key) to identify deletions on the source system. (If more on this is required I'll be happy to explain.)
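
    A sketch of the PK comparison approach, again with made-up names - warehouse rows whose natural key no longer exists in the source are flagged rather than physically deleted:

    --flag warehouse rows whose natural key has disappeared from the source
    update d
    set d.IsDeleted = 1
    from DWDB..dimProduct d
    left outer join OLTPDB..Product s on d.ProductCode = s.ProductCode
    where s.ProductCode is null and d.IsDeleted = 0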

    One caution when using a timestamp column for upserts into the DW stage: sometimes the OLTP database is updated via the back door, i.e. not from the application menus but by a direct SQL command against the database. In this case it is possible that the timestamp is not updated.

    If a time/date stamp column is not available on the source system, and for some reason triggers cannot be built, we normally use a CRC (cyclic redundancy check) method. In short this means we compute a hash across every column of each row in the source table, then compare that hash with the one computed over the copy of the table in the stage. This way we can identify the differences. I have a script for implementing CRC on the SQL Server platform; if anybody is interested I'll be happy to share it.
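
    The sketch below is just an illustration of the general idea (not Vincent's script), using SQL Server's BINARY_CHECKSUM over a list of columns; the names are hypothetical, and note that checksum functions can collide, so this is a pragmatic shortcut rather than a true CRC:

    --rows whose checksum differs between source and stage have changed
    select s.ProductCode
    from OLTPDB..Product s
    inner join DWDB..stgProduct st on s.ProductCode = st.ProductCode
    where binary_checksum(s.ProductName, s.Price, s.Category)
       <> binary_checksum(st.ProductName, st.Price, st.Category)

    --rows in the source but not yet in the stage are new
    select s.ProductCode
    from OLTPDB..Product s
    left outer join DWDB..stgProduct st on s.ProductCode = st.ProductCode
    where st.ProductCode is null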

    If all of the methods described above have been explored but are not possible (because the source table allows duplicate rows, for example, i.e. there is no PK or natural key, and because the timestamp columns are not 100% reliable), and downloading the entire table every time is also not possible (perhaps because of the sheer size of the source table), then we have 2 last options: a) range download and b) using a timestamp from another table.

    The range download works as follows: we identify a column on the source system which changes over time. Usually this is either one of the columns in the composite PK (such as a transaction ID or transaction number) or one of the date columns (transaction date, planned date, etc.). We then go back significantly in time and download that period. For example, suppose we have an order date column on the table. This column is not 100% reliable for incremental download because in this source system it is possible for a record to be created with a date in the past. What do we do? We know that the order date is normally only backdated by a week or two, and we are 100% sure that it will not be backdated more than 1 month. In this case we download 1 month's worth of data every time (where order_date >= today - 1 month). We then do an upsert operation in the data warehouse (upsert = update when the row exists, insert when it does not). We can use the same technique if the ID column is incremental over time but not 100% reliable.
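
    A sketch of this rolling one-month window, with hypothetical names - everything in the window is upserted, which is safe because rows already loaded are simply updated again:

    --update rows from the last month that already exist in the warehouse
    update f
    set f.Quantity = s.Quantity, f.Amount = s.Amount
    from DWDB..factOrders f
    inner join OLTPDB..Orders s on f.OrderID = s.OrderID
    where s.OrderDate >= dateadd(month, -1, getdate())

    --insert rows from the last month that are not there yet
    insert DWDB..factOrders (OrderID, OrderDate, Quantity, Amount)
    select s.OrderID, s.OrderDate, s.Quantity, s.Amount
    from OLTPDB..Orders s
    left outer join DWDB..factOrders f on s.OrderID = f.OrderID
    where s.OrderDate >= dateadd(month, -1, getdate()) and f.OrderID is null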

    Using a timestamp from another table works like this: suppose we are downloading header and detail tables from a source system (a hierarchy of 2 tables with a good FK-PK relationship). The detail table has date stamp columns, both for creation and for last update. Unfortunately the header table has no date stamp columns. In this case we can use the date stamp columns of the detail table to download the header table incrementally.
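
    A sketch of driving the header extract off the detail table's timestamps (hypothetical names, reusing the same hypothetical etlControl watermark table as the earlier sketch; assumes a clean detail-to-header FK as Vincent describes):

    declare @LastLoad datetime
    select @LastLoad = LastLoadTime from DWDB..etlControl where TableName = 'OrderHeader'

    --pull only headers whose detail rows were created or updated since the last load
    insert DWDB..stgOrderHeader (OrderID, CustomerID, OrderDate)
    select distinct h.OrderID, h.CustomerID, h.OrderDate
    from OLTPDB..OrderHeader h
    inner join OLTPDB..OrderDetail d on d.OrderID = h.OrderID
    where d.CreatedDate > @LastLoad or d.UpdatedDate > @LastLoad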

    Vincent

  • Read Kimball's The Data Warehouse ETL Toolkit. It is a very good reference book.
