Anatomy of an Incremental Load

  • Ken,

    Perhaps you could come up with something?

    Typically we see complaints that everything written is about loading a TB or some other large-scale operation, but 80+% of the world doesn't do that. They work in smaller environments, so you want to give them good solutions that work well.

    I'd venture to guess that 1% of the community here has a table with 50M rows.

  • From my experience, there are really 4 ways to handle this situation.

    1. When you have a straightforward extract (SELECT * FROM SimpleJOINs) with no conforming steps between extraction and insertion into the data warehouse, the solution presented here is perfect. It's basically drop-in-place: make a couple of fairly simple/routine changes (insert junior ETL developer here), and you're ready.

    2. When your extract is a reasonable size but involves complex date logic in the source system (such as a Customer extract where 20 different things in 20 different tables can change, each with its own date), and particularly if you have fairly complex conforming steps, we've found that extracting the data, hashing it on the SSIS machine, comparing that hash with a hash of the corresponding destination rows, and eliminating the matching hashes works very well (a rough T-SQL sketch follows this list). We wrote our own custom hash routine for this, but it was a piece of cake.

    3. When you have huge data sets on the source system side, if you can limit the results - even if you can only cut them to a quarter using date logic - you have to do that. For example, for one of our largest tables we take only the last 13 months of data, since users of the source system can adjust any row in that window, but we don't need the 22 years of data before it. Then you can apply solution 1 or 2 as applicable.

    4. Create a custom system on the source side for handling this: timestamps, triggers to maintain timestamps or hashes, etc. I've personally avoided this, as we haven't had to use such a system, and I'm loath to risk affecting the production system with the ETL process. This is of course probably exactly what you're wanting to start a discussion about...
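
    As a rough illustration of approaches 2 and 3 (not Rick's actual code - his team wrote a custom hash routine in SSIS rather than using HASHBYTES), here is a minimal T-SQL sketch; the table and column names are invented for the example:

        -- Hedged sketch: keep only the extracted rows that are new, or whose hash
        -- differs from the hash of the corresponding destination row.
        -- (HASHBYTES input is limited to 8000 bytes in SQL Server 2005.)
        SELECT  src.CustomerID, src.FirstName, src.LastName, src.Email
        FROM    dbo.stgCustomerExtract AS src          -- rows landed by the extract
        LEFT JOIN dbo.DimCustomer AS dst
               ON dst.CustomerID = src.CustomerID
        WHERE   dst.CustomerID IS NULL                 -- new row
           OR   HASHBYTES('MD5',
                    ISNULL(src.FirstName, N'') + N'|' +
                    ISNULL(src.LastName,  N'') + N'|' +
                    ISNULL(src.Email,     N''))
             <> HASHBYTES('MD5',
                    ISNULL(dst.FirstName, N'') + N'|' +
                    ISNULL(dst.LastName,  N'') + N'|' +
                    ISNULL(dst.Email,     N''));       -- changed row

        -- For approach 3, the extract query itself would also carry a date filter,
        -- e.g. WHERE ModifiedDate >= DATEADD(month, -13, GETDATE())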


    Rick Todd

  • Hi Steve,

    [1% of community]

    Fair enough. Although I think few would complain if the solutions offered for consideration here _were_ scalable. At the very least - in my opinion - the scalability/performance limitations of whatever ETL-related solutions are offered in a forum such as this should routinely be mentioned in articles. If there is one thing about ETL we've learned over the last few decades, it is that the data environment grows quickly!

    [perhaps you could come up with something]

    No thanks, I just like to criticize (just kidding!!)

    In terms of suggestions, I think Rick Todd has provided a really excellent summary. A couple of things perhaps worth further review are:

    1. A discussion of the upcoming SQL 2008 INTERSECT and EXCEPT clauses (see the sketch after this list)...

    2. Some "real world" examples in SSIS or SQL (or both) for the more scalable approaches Rick outlines.

    Best,

    Ken Ambrose

  • Hi Ken,

    You are correct. And I hope I made it clear that this article was a proof of concept and nowhere near production-ready code. The intent is to show folks an approach to using SSIS 2005 to load data incrementally. Aside from my update-detection flaw (which I will fix, I promise), I think the article accomplishes this goal.

    A solution to loading large tables incrementally is available in SQL Server 2008 CTP 6: Change Data Capture. There is an article about CDC and SSIS in CTP 5 here.

    In a nutshell, CDC allows you to only touch the rows that have been added, changed, or deleted on the source. This relieves the requirement to do correlated joins with the destination. It also means you do not have to load unchanged rows - ever.
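
    As a rough sketch of what that looks like in the 2008 CTP (the exact syntax may change before release; MyDB, dbo.Contact, and the capture instance name are assumptions for the example):

        -- Enable CDC on the database and on a source table (illustrative names).
        USE MyDB;
        EXEC sys.sp_cdc_enable_db;

        EXEC sys.sp_cdc_enable_table
             @source_schema = N'dbo',
             @source_name   = N'Contact',
             @role_name     = NULL;      -- no gating role in this sketch

        -- Later, the ETL process reads only the rows changed in an LSN range.
        DECLARE @from_lsn binary(10), @to_lsn binary(10);
        SET @from_lsn = sys.fn_cdc_get_min_lsn(N'dbo_Contact');
        SET @to_lsn   = sys.fn_cdc_get_max_lsn();

        SELECT *
        FROM   cdc.fn_cdc_get_all_changes_dbo_Contact(@from_lsn, @to_lsn, N'all');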

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • UPDATE:

    I looked over the article today (finally) and found an error in the way I built the original destination table. It was something that I did not cover explicitly in the article but it did affect the design. I describe the proper way to build the destination database and table in an update that Steve has agreed to post soon. I also included the code in the scripts link below.

    In the meantime, I have posted the updated article document here. The updated scripts, which now include the Transact-SQL I used to create the destination database and table and to populate the destination table, are posted here. I have also updated the SSIS project files; they can be downloaded here.

    I cannot adequately express my disappointment in myself for allowing work of such poor quality to be published. I apologize to Steve, SQL Server Central, and everyone who read the original article and worked through the original examples only to get inaccurate results. I would like to thank those who pointed this out in this forum thread. I hope everyone who experienced issues with this will re-run the examples.

    Sincerely,

    Andy Leonard

    12 May 2008

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Andy, thanks for promising that you'll find a solution for the updated (changed) rows.

    I tested it and it worked. This time, instead of 18,000 records being reported as changed, it reported 3,994 rows, which is the actual number of records that changed.

  • You are most welcome. I am sorry I made the errors in the first place, and then took so long to correct them. It's good to hear you were able to replicate the expected results!

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • I must strongly object!

    I have seen poor quality in some of my projects in the past.

    Your article doesn't fall into that category, even if it might not be perfect (but that type of project has yet to be invented ;-)).

    Again thanks for your contribution!

    Don't be discouraged by a failure. It can be a positive experience. Failure is, in a sense, the highway to success, inasmuch as every discovery of what is false leads us to seek earnestly after what is true, and every fresh experience points out some form of error which we shall afterwards carefully avoid.

    John Keats (1795 - 1821)

    Best Regards,

    Chris Büttner

  • Thanks Christian!

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Andy,

    In your article, writing about changed rows, you say: "To do a set-based update we first stage all the data we need in a stage table. When we’re done with our Data Flow task, we execute a single T-SQL UPDATE statement joining the contents of this table to the Destination table." Can you explain this in more detail, or show me step-by-step instructions for how to update the changed rows back into the destination table? I have only just started with SSIS and am walking through the tutorial now, and I have no clue how to do this in the OLE DB Destination editor. Thanks.

  • Hi Hzhu,

    With one exception, you cannot do set-based operations in an OLE DB Destination Adapter (the exception is fast loads). You are stuck with row-based operations most of the time. Between the row-based operations in a data flow, inserts are way faster than updates. So we use an OLE DB Destination Adapter to insert the data - data that will be updated later - into a staging table.

    After the data flow completes, we execute a set-based UPDATE statement joining the stage table to the destination table (a rough sketch follows below). Keep working through the example and you should see what I mean.
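
    For illustration, a minimal sketch of that final set-based UPDATE (the table and column names are placeholders, not the exact names from the article):

        -- Hedged sketch: after the data flow has inserted the changed rows into a
        -- stage table, one set-based UPDATE applies them to the destination.
        UPDATE dest
        SET    dest.FirstName = stage.FirstName,
               dest.LastName  = stage.LastName,
               dest.Email     = stage.Email
        FROM   dbo.Contact AS dest
        INNER JOIN dbo.stgChangedRows AS stage
                ON stage.ContactID = dest.ContactID;

        -- The stage table is typically truncated at the start of each load.
        TRUNCATE TABLE dbo.stgChangedRows;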

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Hi Andy,

    Thanks for replying so quickly. From my understanding of your reply, I just write (insert) the detected changed rows to a stage table (in your example, the "Changed Row" OLE DB Destination Adapter). After the Data Flow task is done - meaning I know which rows changed - I run an UPDATE SQL statement to update the changed rows in the destination table by joining the stage table to the destination table. Is that what you mean? Thanks.

  • Yep - that's it. And I cover that in the example.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Hi Andy,

    I'm trying to add a case for deleted rows to the data flow task in your example. I searched your previous replies on this issue; you said to use a right join to get all the data from the destination and then do pretty much the same thing as for the new rows. But how can I do that right join (to find the deleted rows) in the Lookup transformation? At the same time, I still need the left join to find the new insertions. I think it needs to work like a full join (all rows from the source and all rows from the destination) for this case. Is this doable in SSIS 2005, and can I do it practically in one Lookup transformation, or should I use another Lookup transformation to deal with the deleted rows? Thanks.

  • Hi Andy,

    This continues my previous post above. I followed your article and got the same results (new and updated rows) as you showed. To try the deleted-rows case, since the Lookup transformation pulls all rows from the destination (with "ignore failure"), I didn't change anything there. In the Conditional Split, I added a third output called deleted_rows with the condition "IsNull(ContactID)", which is the opposite of the new-rows output condition "IsNull(Dest_ContactID)". Last, I dragged on another OLE DB Destination named deleted_rows connected to a stage table. When I tested the finished design, it didn't produce the right result: the deleted rows didn't go to that stage table. I think the problem is still in the Lookup transformation (the right-join issue) and maybe the condition in the Conditional Split. I couldn't figure it out by myself. Could you give me some hints? Thanks.
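
    One possible workaround (offered here only as a hedged sketch, not as the article's method) is to detect the deleted rows outside the data flow with a set-based anti-join, since the Lookup only sees rows arriving from the source:

        -- Hedged sketch: find destination rows whose keys no longer exist in the
        -- source extract. dbo.Contact (destination) and dbo.stgSourceKeys (a stage
        -- table holding the extracted source keys) are illustrative names.
        SELECT d.ContactID
        FROM   dbo.Contact AS d
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.stgSourceKeys AS s
                           WHERE  s.ContactID = d.ContactID);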
