Loading changed (delta) data into data warehouse

  • I need to load only changed (delta) data from 3000 databases into a staging environment that will then be loaded into a data warehouse. The SQL Server versions include 2000, 2005, 2008 R2 and 2012. There are no ETL Flag columns and no TimeStamp columns that allow me to query just changed data. My understanding is that they cannot be added. I know this would be easiest.

    I know that in 2008 and above I can use change_tracking, although it doesn't seem very intuitive.

    All of the databases (across all versions) are replicated, however, they are replicated using merge replication, not transactional replication. I know that replication uses a distribution database that keeps track of the changes. Is there a relatively easy way to query the distribution database and load only the changed data into the staging environment?

    Also, if anybody knows what the best way to do it for each of the SQL Server versions mentioned above, I would appreciate the advice.

  • carlecot88ssn (1/18/2013)


    The fabric upper will provide good protection and soft touching feeling for its wearer. While the thickened Onitsuka Tiger Tokidoki Mex Lo collar and the additional small piece of leather of the Onitsuka Mexico.:hehe:

    Reported

  • carlecot88ssn (1/18/2013)


    The fabric upper will provide good protection and soft touching feeling for its wearer. While the thickened Onitsuka Tiger Tokidoki Mex Lo collar and the additional small piece of leather of the Onitsuka Mexico.:hehe:

    what the hell is this !!

    Edited : I see it's reported ..

    ~ demonfox
    ___________________________________________________________________
    Wondering what I would do next , when I am done with this one :ermm:

  • scottsnyder55 (1/17/2013)


    I need to load only changed (delta) data from 3000 databases into a staging environment that will then be loaded into a data warehouse. The SQL Server versions include 2000, 2005, 2008 R2 and 2012. There are no ETL Flag columns and no TimeStamp columns that allow me to query just changed data. My understanding is that they cannot be added. I know this would be easiest.

    I know that in 2008 and above I can use change_tracking, although it doesn't seem very intuitive.

    All of the databases (across all versions) are replicated, however, they are replicated using merge replication, not transactional replication. I know that replication uses a distribution database that keeps track of the changes. Is there a relatively easy way to query the distribution database and load only the changed data into the staging environment?

    Also, if anybody knows what the best way to do it for each of the SQL Server versions mentioned above, I would appreciate the advice.

    Not sure, but through ETL lookup it might be acheived for each tables; where in comparing all the source table records with destination table records ..(In informatica this concept is Dynamic lookup, you may check if the equivalent in SSIS is there)

    ~ demonfox
    ___________________________________________________________________
    Wondering what I would do next , when I am done with this one :ermm:

  • scottsnyder55 (1/17/2013)


    I need to load only changed (delta) data from 3000 databases into a staging environment that will then be loaded into a data warehouse. The SQL Server versions include 2000, 2005, 2008 R2 and 2012. There are no ETL Flag columns and no TimeStamp columns that allow me to query just changed data. My understanding is that they cannot be added. I know this would be easiest.

    I know that in 2008 and above I can use change_tracking, although it doesn't seem very intuitive.

    All of the databases (across all versions) are replicated, however, they are replicated using merge replication, not transactional replication. I know that replication uses a distribution database that keeps track of the changes. Is there a relatively easy way to query the distribution database and load only the changed data into the staging environment?

    Also, if anybody knows what the best way to do it for each of the SQL Server versions mentioned above, I would appreciate the advice.

    Back on topic.

    What is the version and compatability level of the data warehouse (SQL 2008 or above and the compat level of the staging area 100 or above)

    If so you could look at using the merge statement and embed it with checksums if the checksum is different load it. (Caveat, merge is for 2008 and compat level 100 or above, but unsure if this just applies to the destination or the source as well).

    If your lucky enough to have Enterprise edition, you could enable CDC for your 2008 or above databases.

    Would just depend how much work it would be on how many 2000 or 2005 databases you need to import.

  • SSCarpal Tunnel thanks for replying. I have already implemented a data warehouse ETL process to load 100s of GB of Postgres data into SQL Server data warehouse using SQL Server Merge and Postgres offset limit commands from audit tables. Refresh time - 15 minutes.

    And I am implementing ETL process for 100s of GB of SQL Server 2005 databases with LastUpdated(timestamp) column and for delete trigger audit table.

    However, another division now wants me to design and implement a ETL process to load their 3000 SQL and several hundred Oracle databases into DW without changing table structure. And they want it real-time (ha). I know for the SQL 2008 and above and the Oracle I can use internal Change Data Capture techniques, but I was wondering what my options were for earlier versions of SQL Server.

    After some research I am come to conclusion that at minimum I need Timestamp column and for delete trigger audit table. They wanted me to read transaction logs or Distribution database, but I don't think it's a viable option.

    If there is anybody who has successfully implement a large scale data warehouse ETL process using logs or distribution database, I would like to hear. However, I believe have to go back and tell them the additional column, trigger and audit tables.

  • You should use CDC on the versions that have it.

    On the older versions you could subscribe (pull) to their publications, modify the replication MS

    insert and update procs in the subscriber database to to upserts/merge.

    At any given point in time you want to get the modified data, you:

    1. shut down the distriibution agent

    2. get the modified data

    3. truncate the tables

    4. start up the distriibution agent

    We used to do something similar to this on a previous job using Sybase replication.

    The only tricky part is the applying post snapshot script wit hte customised MS insert and update procs.

  • Thanks arnie. I figured it was theoretically possible to use replication to insert only new data, but was concerned we could run into lots of unforseen challenges. I have already begun detailing the CDC specifications for Oracle and SQL 2008 and above. After sending a detailed email this morning describing the expected obstacles and costs associated, the SQL 2000 database requirment has been completely dropped (over 2000 databases). And it has been agreed to migrate the 2005 databases to 2008.

    The data warehouse will be in SQL Server 2012 and it looks like the improvements they have made for data warehousing have been substantial. If anybody has experience implementing 2012 BI I would be very interested.

    I appreciate everyone's help and will hopefully come back with real life experience with CDC, and other aspects of ETL and BI implementation using SQL Server 2012.

  • Change tracking on SqlServer 2008 will help u on this. Check it.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply