Use Change Data Capture for incremental load

  • I've been reading that a good way to improve the performance of your ETL is to design it to pull only the records that have changed since the last load, as opposed to pulling everything and checking whether a change/add has occurred.

    This makes total sense: why pull 200,000 records when only 100 have changed? So I've been reading up on CDC. I'm not sure how I would set this up in my ETL. For example, my ETL that pulls my products starts with an Execute SQL Task that runs an inner-join query giving me all the product info from my source. CDC seems to look at individual tables for changes.

    I thought about using views, but again CDC only works on tables. Is this possible without loading the entire source data into a staging table or memory?

    Thanks

  • CDC needs to be enabled on the source database and on the individual tables. You then have to query some functions (created by CDC) to get the changed data.

    If you have SSIS 2012 or above, there are dedicated SSIS components that can help you with the ETL flows.
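    As a rough sketch of those steps (the dbo.Products table, its columns and the capture instance name are placeholders for illustration, not something from this thread):

    -- Enable CDC on the database, then on the table (assumed here: dbo.Products with a primary key)
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema        = N'dbo',
        @source_name          = N'Products',
        @role_name            = NULL,
        @supports_net_changes = 1;   -- requires a PK or unique index on the table

    -- Later, query the change function CDC generated for the capture instance 'dbo_Products'
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Products'),
            @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all');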

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks Koen

    Yes, I've got this far with the database and SSIS packages. CDC seems good for replicating changes on a table-by-table basis, as you would for incremental loading and an ODS. The SSIS controls can pick up the inserts, updates and deletes straight from each monitored source table.

    What I'm not sure about is how you would go about picking up the changed records when your source isn't a single table but a SQL query with joins. For example, if I was monitoring products to populate my product dimension and all the product attributes were within the products table, then fine, I could point CDC at the products table. But I populate my product dimension using a query that joins the products table with the size table and the color table. Is this still possible?

    Thanks

  • My first thought is using an indexed view and putting CDC on that, but I'm not sure that is possible. I'm asking around to see if this is feasible.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • You can't seem to hit views. My only thought is to somehow build the SQL on top of the system tables that CDC creates to track the changes, but I'm not sure that will work either.

  • How I typically do this is to keep a copy of the three source tables in a staging area. That is, I read from the source tables using CDC and update the tables in the staging area. Then I run the query with the joins against those staging tables.
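    A rough sketch of that pattern, purely for illustration (the stg.Products / stg.Sizes / stg.Colors copies, their columns and the capture instance name are assumptions, not actual code from this thread): the net-changes function keeps the staging copy current, and the dimension query then runs against staging.

    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Products'),
            @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    -- Apply the net changes from the source Products table to its staging copy
    MERGE stg.Products AS tgt
    USING cdc.fn_cdc_get_net_changes_dbo_Products(@from_lsn, @to_lsn, N'all') AS src
        ON tgt.ProductID = src.ProductID
    WHEN MATCHED AND src.__$operation = 1 THEN            -- source row was deleted
        DELETE
    WHEN MATCHED AND src.__$operation IN (2, 4) THEN      -- source row was updated (or re-inserted)
        UPDATE SET tgt.ProductName = src.ProductName,
                   tgt.SizeID      = src.SizeID,
                   tgt.ColorID     = src.ColorID
    WHEN NOT MATCHED BY TARGET AND src.__$operation IN (2, 4) THEN
        INSERT (ProductID, ProductName, SizeID, ColorID)
        VALUES (src.ProductID, src.ProductName, src.SizeID, src.ColorID);

    -- The dimension load then joins the (now current) staging copies, not the source system
    SELECT p.ProductID, p.ProductName, s.SizeName, c.ColorName
    FROM stg.Products AS p
    JOIN stg.Sizes    AS s ON s.SizeID  = p.SizeID
    JOIN stg.Colors   AS c ON c.ColorID = p.ColorID;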

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

  • ps_vbdev (1/13/2015)


    OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

    The first time, yes, but after that you can use CDC to keep your staging tables up to date.
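    A sketch of what a subsequent incremental run could look like, assuming a made-up control table etl.CdcState that remembers the last LSN processed per capture instance:

    DECLARE @last_lsn binary(10), @from_lsn binary(10), @to_lsn binary(10);

    -- etl.CdcState(CaptureInstance, LastLsn) is an invented bookkeeping table for this sketch
    SELECT @last_lsn = LastLsn
    FROM etl.CdcState
    WHERE CaptureInstance = 'dbo_Products';

    SET @from_lsn = sys.fn_cdc_increment_lsn(@last_lsn);   -- start just after the previous run
    SET @to_lsn   = sys.fn_cdc_get_max_lsn();

    -- Feed only this window of changes into the staging MERGE shown earlier
    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_Products(@from_lsn, @to_lsn, N'all');

    -- Remember where this run stopped, so the next run carries on from here
    UPDATE etl.CdcState
    SET LastLsn = @to_lsn
    WHERE CaptureInstance = 'dbo_Products';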

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (1/13/2015)


    ps_vbdev (1/13/2015)


    OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

    The first time yes

    When you say 'the first time', is this every time the ETL runs to update the DW, whether it be nightly, hourly, every 5 minutes, etc.?

  • ps_vbdev (1/13/2015)


    Koen Verbeeck (1/13/2015)


    ps_vbdev (1/13/2015)


    OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

    The first time yes

    When you say 'the first time', is this every time the ETL runs to update the DW, whether it be nightly, hourly, every 5 minutes, etc.?

    I mean the actual very first time - the initial load - when your staging area and DW are empty.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • OK, I understand. I think I'm getting my head round this after a bit of confusion. CDC will only provide a performance improvement at the extract stage; once the changed rows have been extracted, you will still have to do a full lookup on the staging table to see which rows have changed for Type 1 & 2 SCDs. Or is there a way of only looking at the changed rows for SCD items?

  • ps_vbdev (1/14/2015)


    OK, I understand. I think I'm getting my head round this after a bit of confusion. CDC will only provide a performance improvement at the extract stage; once the changed rows have been extracted, you will still have to do a full lookup on the staging table to see which rows have changed for Type 1 & 2 SCDs. Or is there a way of only looking at the changed rows for SCD items?

    You can use a checksum column on the source and existing data to compare the two. A difference in checksum means that the row (or at least the columns for which the checksum has been calculated) has changed. If you want to track the actual individual column changes, though, then yes, you'll have to compare column by column. You could use a T-SQL MERGE statement to do that; it performs much better than the SCD component that ships with SSIS.

    Going back to the checksum idea, you might want to 'roll your own' by using a hash such as MD5 or SHA1, because you'll get fewer hash collisions (false positives) with those algorithms than you will with the built-in checksum component.
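    To make that concrete, here is a minimal Type 1 sketch along those lines (dbo.DimProduct, stg.ProductDimSource and the column list are invented for illustration): a row hash is stored on the dimension and compared in the MERGE instead of comparing every column.

    WITH src AS
    (
        SELECT ProductID, ProductName, SizeName, ColorName,
               -- CONCAT treats NULLs as empty strings; the '|' separator keeps 'ab'+'c' and 'a'+'bc' apart
               HASHBYTES('SHA2_256',
                         CONCAT(ProductName, '|', SizeName, '|', ColorName)) AS RowHash
        FROM stg.ProductDimSource          -- e.g. the output of the join on the staging tables
    )
    MERGE dbo.DimProduct AS tgt
    USING src
        ON tgt.ProductID = src.ProductID   -- match on the business key
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET tgt.ProductName = src.ProductName,
                   tgt.SizeName    = src.SizeName,
                   tgt.ColorName   = src.ColorName,
                   tgt.RowHash     = src.RowHash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ProductID, ProductName, SizeName, ColorName, RowHash)
        VALUES (src.ProductID, src.ProductName, src.SizeName, src.ColorName, src.RowHash);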

    Regards

    Lempster.

  • Lempster (1/14/2015)


    Going back to the checksum idea, you might want to 'roll your own' by using a HASH such as MD5 or SHA1 because you'll get fewer hash collisions (false positives) with those algorithms than you will with the built-in checksum component.

    Hash functions are the way to go.

    I use HASHBYTES with the SHA2_256 algorithm.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Some extra information to add to what Koen posted...

    MD5 is horribly slow. Don't use it.

    CHECKSUM is a simple Exclusive OR of bytes and collisions are much more common than many would think.

    Even with hashbytes, I would take the time to guarantee uniqueness by using the result of hashbytes along with whatever the PK is.
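    As a hypothetical sketch reusing the made-up names from the earlier example: the join on the PK identifies the row, and the hash is only used to decide whether that row changed, so a hash collision can never pair up two different products.

    -- Identify the row by its PK; the hash only answers "did anything change?", never "which row is this?"
    SELECT s.ProductID
    FROM stg.ProductDimSource AS s
    JOIN dbo.DimProduct       AS d
        ON d.ProductID = s.ProductID
    WHERE d.RowHash <> HASHBYTES('SHA2_256',
                                 CONCAT(s.ProductName, '|', s.SizeName, '|', s.ColorName));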

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (1/14/2015)


    Even with hashbytes, I would take the time to guarantee uniqueness by using the result of hashbytes along with whatever the PK is.

    Would this be including the PK in the HASHBYTES function (I was thinking of this), or as part of the conditional split?

