SSIS 2008 R2 Table Update

  • Hello Fellas,

    I have a little question on what would be the best solution for the following scenario I am facing:

    I have an ETL job that needs to run every hour to move data from the application database to the data warehouse so reports can be run against this app data. I have a package that moves all the tables I need from the app database to a staging database. From there I have another package that loads the data warehouse from the staging tables. The last task loads a fact table that contains about 10 million records, and it is taking about 45 minutes.

    We are using a very simple process that loads the data into a temp table; once it is fully populated, it drops the fact (production) table and renames the temp table to the fact table name. The obvious problem with this is that the new data does not reach the production fact table for 45 minutes, because we are basically loading all 10 million records regardless of whether they need updating or not.
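    For reference, the current swap is roughly the pattern below (the table and column names are made up, so treat it only as a sketch of what we do today):

        -- Rebuild the full fact set in a work table, then swap it in.
        SELECT  SalesKey, TransactionDate, Amount
        INTO    dbo.FactSales_New
        FROM    stg.FactSales;

        -- Drop the live fact table and rename the work table into its place.
        DROP TABLE dbo.FactSales;
        EXEC sp_rename 'dbo.FactSales_New', 'FactSales';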

    Now I want to implement a better way of handling this task by updating only those records that actually need to be updated. I looked into the MERGE solution, but as far as I understand it, this first writes all the records to a "staging" table, then compares it against the production table and acts accordingly, deleting or updating the records that need it. This process would take about the same amount of time, since it still writes every record to a temp table and then compares it against the production one.

    Am I correct on this, or does the MERGE task work differently? What other solutions do you recommend for attacking such a process? This runs every hour on the hour, and we need the fastest way possible to get the data from the app database to the data warehouse so reporting can be run against it.

    Any ideas/help is much appreciated.

    Thanks!

    R.

  • RenzoSQL (5/30/2012)

    I'm not 100% sure how the MERGE task in SSIS works underneath the covers, but I would suggest that you use the T-SQL MERGE statement to merge your fact table with the staging table.
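    Something along these lines, purely as a sketch (the fact, staging, and column names below are invented, so adjust them to your schema):

        MERGE dbo.FactSales AS tgt                        -- production fact table (hypothetical name)
        USING stg.FactSales AS src                        -- staging table (hypothetical name)
            ON tgt.SalesKey = src.SalesKey                -- key that identifies a fact row
        WHEN MATCHED AND tgt.Amount <> src.Amount         -- update only rows whose values actually changed
            THEN UPDATE SET tgt.Amount          = src.Amount,
                            tgt.TransactionDate = src.TransactionDate
        WHEN NOT MATCHED BY TARGET                        -- brand new rows
            THEN INSERT (SalesKey, TransactionDate, Amount)
                 VALUES (src.SalesKey, src.TransactionDate, src.Amount)
        WHEN NOT MATCHED BY SOURCE                        -- only safe if staging holds the complete fact set
            THEN DELETE;

    Note that the WHEN NOT MATCHED BY SOURCE clause only makes sense while your staging table still contains the full fact set; drop it if you switch to staging only the recent rows.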

    You should also try to limit the amount of data you are extracting from the source. If you have typical facts, meaning transactions that do not change after they are captured, then it should be relatively easy to extract only the most recent transactions/facts based on the date/time. Even if there is some overlap, it should be better than extracting 10 million records.
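    For example, the source query could pull only rows newer than what is already in the fact table. The table and column names here are just placeholders:

        -- Find the newest transaction already loaded, then pull only later rows,
        -- minus a small overlap window to be safe about late-arriving data.
        DECLARE @LastLoaded DATETIME;
        SELECT  @LastLoaded = MAX(TransactionDate) FROM dbo.FactSales;

        SELECT  SalesKey, TransactionDate, Amount
        FROM    dbo.AppTransactions               -- source/app table (hypothetical name)
        WHERE   TransactionDate >= DATEADD(MINUTE, -15, ISNULL(@LastLoaded, '19000101'));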

    If limiting the records from the source is not an option, my recommendation would be to ensure that your staging table is a heap. Transferring 10 million records to a heap should not take that long (yes, it depends), and building adequate indexes on the staging table after the transfer will speed things up.
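    Roughly like this (again, the names are only illustrative):

        -- Stage into a heap: no clustered index and no nonclustered indexes,
        -- so the bulk insert from the app database is as fast as possible.
        CREATE TABLE stg.FactSales
        (
            SalesKey        INT           NOT NULL,
            TransactionDate DATETIME      NOT NULL,
            Amount          DECIMAL(18,2) NOT NULL
        );

        -- ... SSIS data flow / bulk load into stg.FactSales happens here ...

        -- Build the index only after the load, so it helps the compare/MERGE
        -- step without slowing down the insert itself.
        CREATE CLUSTERED INDEX IX_stg_FactSales_SalesKey
            ON stg.FactSales (SalesKey);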

    Hope this helps...

  • Renzo,

    Is there a way at the source to identify which rows are new or have been updated in the legacy application? I would look for a column with timestamps, like creation or update dates; the extraction process can then SELECT data created or updated in the last 60 minutes. This approach will certainly reduce the volume of records your package is handling.
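    Something like this, assuming the legacy table really does have creation/modification date columns (the names here are invented):

        -- Pull only rows created or changed in roughly the last hour.
        SELECT  SalesKey, TransactionDate, Amount
        FROM    dbo.AppTransactions                      -- legacy/app table (hypothetical name)
        WHERE   CreatedDate  >= DATEADD(MINUTE, -60, GETDATE())
             OR ModifiedDate >= DATEADD(MINUTE, -60, GETDATE());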

    Cheers,

    Hope this helps,
    Rock from VbCity

  • Can you set up some sort of replication scenario (log shipping, mirroring, etc.) and eliminate the staging table step, then load your data warehouse from the target tables?
