Use Change Data Capture for incremental load

  • I've been reading that a good way to improve the performance of your ETL is to design it to pull only the records that have changed since the last load, as opposed to pulling everything and checking whether a change/add has occurred.

    This makes total sense: why pull 200,000 records when only 100 have changed? So I've been reading up on CDC. I'm not sure how I would set this up in my ETL. For example, my ETL that pulls my products starts with an Execute SQL Task that runs an inner-join query giving me all the product info from my source. CDC seems to look at individual tables for changes.

    I thought about using views, but again CDC only works on tables. Is this possible without loading the entire source data into a staging table or memory?

    Thanks

  • CDC needs to be enabled on the source database and on the individual tables. You then have to query some functions (created by CDC) to get the changed data.

    If you have SSIS 2012 or above, there are dedicated SSIS components that can help you with the ETL flows.
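    As a rough sketch of those steps (the dbo.Products table, its columns and the capture instance name are placeholders for illustration, not something from this thread):

    -- Enable CDC on the database, then on the table (assumed here: dbo.Products with a primary key)
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema        = N'dbo',
        @source_name          = N'Products',
        @role_name            = NULL,
        @supports_net_changes = 1;   -- requires a PK or unique index on the table

    -- Later, query the change function CDC generated for the capture instance 'dbo_Products'
    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Products'),
            @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_Products(@from_lsn, @to_lsn, N'all');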

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks Koen

    Yes, I've got this far with the database and SSIS packages. CDC seems good for replicating changes on a table-by-table basis, as you would for incremental loading and an ODS. The SSIS controls can pick up the inserts, updates and deletes straight from each monitored source table.

    What I'm not sure about is how you would go about picking up the changed records when your source isn't a single table but a SQL query with joins. For example, if I was monitoring products to populate my product dimension and all the product attributes were within the products table, then fine, I could point CDC at the products table. But I populate my product dimension using a query that joins the products table with the size table and the color table. Is this still possible?

    Thanks

  • My first thought is using an indexed view and putting CDC on that, but I'm not sure that is possible. I'm asking around to see if this is feasible.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • You can't seem to hit views. My only thought is to somehow build the SQL on top of the system tables that CDC creates to track the changes, but I'm not sure that will work either.

  • How I typically do this is to keep a copy of the three source tables in a staging area. That is, I read from the source tables using CDC and update the tables in the staging area. Then I run the query with the joins against those staging tables.
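    A rough sketch of that pattern, purely for illustration (the stg.Products / stg.Sizes / stg.Colors copies, their columns and the capture instance name are assumptions, not actual code from this thread): the net-changes function keeps the staging copy current, and the dimension query then runs against staging.

    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Products'),
            @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    -- Apply the net changes from the source Products table to its staging copy
    MERGE stg.Products AS tgt
    USING cdc.fn_cdc_get_net_changes_dbo_Products(@from_lsn, @to_lsn, N'all') AS src
        ON tgt.ProductID = src.ProductID
    WHEN MATCHED AND src.__$operation = 1 THEN            -- source row was deleted
        DELETE
    WHEN MATCHED AND src.__$operation IN (2, 4) THEN      -- source row was updated (or re-inserted)
        UPDATE SET tgt.ProductName = src.ProductName,
                   tgt.SizeID      = src.SizeID,
                   tgt.ColorID     = src.ColorID
    WHEN NOT MATCHED BY TARGET AND src.__$operation IN (2, 4) THEN
        INSERT (ProductID, ProductName, SizeID, ColorID)
        VALUES (src.ProductID, src.ProductName, src.SizeID, src.ColorID);

    -- The dimension load then joins the (now current) staging copies, not the source system
    SELECT p.ProductID, p.ProductName, s.SizeName, c.ColorName
    FROM stg.Products AS p
    JOIN stg.Sizes    AS s ON s.SizeID  = p.SizeID
    JOIN stg.Colors   AS c ON c.ColorID = p.ColorID;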

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

  • ps_vbdev (1/13/2015)


    OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

    The first time, yes, but after that you can use CDC to keep your staging tables up to date.
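    A sketch of what a subsequent incremental run could look like, assuming a made-up control table etl.CdcState that remembers the last LSN processed per capture instance:

    DECLARE @last_lsn binary(10), @from_lsn binary(10), @to_lsn binary(10);

    -- etl.CdcState(CaptureInstance, LastLsn) is an invented bookkeeping table for this sketch
    SELECT @last_lsn = LastLsn
    FROM etl.CdcState
    WHERE CaptureInstance = 'dbo_Products';

    SET @from_lsn = sys.fn_cdc_increment_lsn(@last_lsn);   -- start just after the previous run
    SET @to_lsn   = sys.fn_cdc_get_max_lsn();

    -- Feed only this window of changes into the staging MERGE shown earlier
    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_Products(@from_lsn, @to_lsn, N'all');

    -- Remember where this run stopped, so the next run carries on from here
    UPDATE etl.CdcState
    SET LastLsn = @to_lsn
    WHERE CaptureInstance = 'dbo_Products';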

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (1/13/2015)


    ps_vbdev (1/13/2015)


    OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

    The first time yes

    When you say 'the first time', is this every time the ETL runs to update the DW, whether it be nightly, hourly, every 5 minutes, etc.?

  • ps_vbdev (1/13/2015)


    Koen Verbeeck (1/13/2015)


    ps_vbdev (1/13/2015)


    OK, so you're doing the comparison in your staging area, but you would still have to do a full extract from the source DB (bulk insert?) to get the source tables in your staging area up to date, yes?

    The first time yes

    When you say 'the first time', is this every time the ETL runs to update the DW, whether it be nightly, hourly, every 5 minutes, etc.?

    I mean the actual very first time - the initial load - when your staging area and DW are empty.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • OK, I understand. I think I'm getting my head round this after a bit of confusion. CDC will only provide a performance improvement at the extract stage; once the changed rows have been extracted, you will still have to do a full lookup on the staging table to see which rows have changed for Type 1 & 2 SCDs. Or is there a way of only looking at the changed rows for SCD items?

  • ps_vbdev (1/14/2015)


    OK, I understand. I think I'm getting my head round this after a bit of confusion. CDC will only provide a performance improvement at the extract stage; once the changed rows have been extracted, you will still have to do a full lookup on the staging table to see which rows have changed for Type 1 & 2 SCDs. Or is there a way of only looking at the changed rows for SCD items?

    You can use a checksum column on the source and existing data to compare the two. A difference in checksum means that the row (or at least the columns for which the checksum has been calculated) has changed. If you want to track the actual individual column changes, though, then yes, you'll have to compare column by column. You could use a T-SQL MERGE statement to do that; it performs much better than the SCD component that ships with SSIS.

    Going back to the checksum idea, you might want to 'roll your own' by using a hash such as MD5 or SHA1, because you'll get fewer hash collisions (false positives) with those algorithms than you will with the built-in checksum component.
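    To make that concrete, here is a minimal Type 1 sketch along those lines (dbo.DimProduct, stg.ProductDimSource and the column list are invented for illustration): a row hash is stored on the dimension and compared in the MERGE instead of comparing every column.

    WITH src AS
    (
        SELECT ProductID, ProductName, SizeName, ColorName,
               -- CONCAT treats NULLs as empty strings; the '|' separator keeps 'ab'+'c' and 'a'+'bc' apart
               HASHBYTES('SHA2_256',
                         CONCAT(ProductName, '|', SizeName, '|', ColorName)) AS RowHash
        FROM stg.ProductDimSource          -- e.g. the output of the join on the staging tables
    )
    MERGE dbo.DimProduct AS tgt
    USING src
        ON tgt.ProductID = src.ProductID   -- match on the business key
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
        UPDATE SET tgt.ProductName = src.ProductName,
                   tgt.SizeName    = src.SizeName,
                   tgt.ColorName   = src.ColorName,
                   tgt.RowHash     = src.RowHash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ProductID, ProductName, SizeName, ColorName, RowHash)
        VALUES (src.ProductID, src.ProductName, src.SizeName, src.ColorName, src.RowHash);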

    Regards

    Lempster.

  • Lempster (1/14/2015)


    Going back to the checksum idea, you might want to 'roll your own' by using a HASH such as MD5 or SHA1 because you'll get fewer hash collisions (false positives) with those algorithms than you will with the built-in checksum component.

    Hash functions are the way to go.

    I use HASHBYTES with the SHA2_256 algorithm.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Some extra information to add to what Koen posted...

    MD5 is horribly slow. Don't use it.

    CHECKSUM is a simple Exclusive OR of bytes and collisions are much more common than many would think.

    Even with hashbytes, I would take the time to guarantee uniqueness by using the result of hashbytes along with whatever the PK is.
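    As a hypothetical sketch reusing the made-up names from the earlier example: the join on the PK identifies the row, and the hash is only used to decide whether that row changed, so a hash collision can never pair up two different products.

    -- Identify the row by its PK; the hash only answers "did anything change?", never "which row is this?"
    SELECT s.ProductID
    FROM stg.ProductDimSource AS s
    JOIN dbo.DimProduct       AS d
        ON d.ProductID = s.ProductID
    WHERE d.RowHash <> HASHBYTES('SHA2_256',
                                 CONCAT(s.ProductName, '|', s.SizeName, '|', s.ColorName));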

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (1/14/2015)


    Even with hashbytes, I would take the time to guarantee uniqueness by using the result of hashbytes along with whatever the PK is.

    Would this be including the PK in the HASHBYTES function (I was thinking of this), or as part of the conditional split?

