delta processing: alternative to cursor

  • Hi,

For our data warehouse we bulk insert files into 'temp tables' that are a 1:1 representation of the OLTP system. The next step is delta processing between those tables and the ODS. This is done in three steps (within one stored procedure): 1) updates, 2) inserts, 3) deletes (which are actually updates).

Step 1 is currently coded as a cursor. I would like to rewrite it as a statement without cursors. Any clues? Thanks in advance.

Simplified example of the current update step:

(There are, for instance, 7 'key columns' and 150 'data columns'.) Is it possible to simplify the declaration / variable usage in the cursor by using some kind of record structure?

    declare @key1 char(10), @key2 char(10), @key3 char(10), @data1 char(10), @data2 char(10), @data3 char(10)
    -- these types are only an example of course 😉
    declare @processing_date datetime    -- set elsewhere in the real procedure
    declare @commitcount integer
    set @commitcount = 0

    declare [delta-inserts] cursor fast_forward for
    select t.*
    from temptable t, odstable o
    where
        t.key1 = o.key1 and
        t.key2 = o.key2 and
        t.key3 = o.key3 and (
            (isnull(t.data1,0) <> isnull(o.data1,0)) or
            (isnull(t.data2,0) <> isnull(o.data2,0)) or
            (isnull(t.data3,0) <> isnull(o.data3,0)) )

    open [delta-inserts]
    fetch next from [delta-inserts] into @key1, @key2, @key3, @data1, @data2, @data3
    begin transaction insertings
    while @@fetch_status = 0
    begin
        update odstable
        set data1 = @data1,
            data2 = @data2,
            data3 = @data3,
            modify_date = @processing_date,
            modify_flag = 'U'
        where
            key1 = @key1 and
            key2 = @key2 and
            key3 = @key3

        set @commitcount = @commitcount + 1
        if @commitcount = 10000
        begin
            set @commitcount = 0
            commit transaction insertings
            begin transaction insertings    -- reopen the batch transaction
        end

        fetch next from [delta-inserts] into @key1, @key2, @key3, @data1, @data2, @data3
    end
    close [delta-inserts]
    deallocate [delta-inserts]
    commit transaction insertings

    ---

    Thanks a lot!

  • gee I love killing cursors.

    Suggestion: use the modern JOIN syntax for readability.

    
    
    UPDATE o
    SET
    o.data1 = t.data1
    , o.data2 = t.data2
    , o.data3 = t.data3
    , o.modify_date = @processing_date
    , o.modify_flag = 'U'
    FROM odstable o
    INNER JOIN temptable t
    ON t.key1 = o.key1
    AND t.key2 = o.key2
    AND t.key3 = o.key3
    WHERE
    (
    ISNULL(o.data1, 0) <> ISNULL(t.data1, 0)
    OR ISNULL(o.data2, 0) <> ISNULL(t.data2, 0)
    OR ISNULL(o.data3, 0) <> ISNULL(t.data3, 0)
    )


  • The WHERE clause can be rewritten as:

    ...

    WHERE CHECKSUM(o.data1,o.Data2,o.Data3)<>CHECKSUM(t.data1,t.Data2,t.Data3)

    The CHECKSUM function computes a hash (CRC-like) value over the listed columns.
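
    For reference, here is how that WHERE clause might slot into jpipes' UPDATE. A sketch only: the column lists have to be written out explicitly, and since CHECKSUM is a hash, two different sets of values can occasionally collide, so verify that it catches the changes you care about and that its NULL handling suits your data.

    UPDATE o
    SET
    o.data1 = t.data1
    , o.data2 = t.data2
    , o.data3 = t.data3
    , o.modify_date = @processing_date
    , o.modify_flag = 'U'
    FROM odstable o
    INNER JOIN temptable t
    ON t.key1 = o.key1
    AND t.key2 = o.key2
    AND t.key3 = o.key3
    WHERE CHECKSUM(o.data1, o.data2, o.data3) <> CHECKSUM(t.data1, t.data2, t.data3)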

  • I've been doing exactly the same over the last couple of days - trying to replace a cursor based update process with a set based process.

    So far I am not impressed - the set based process is no quicker than the cursor. In fact it is probably going to be slower as there are a number of other things that the cursor based processing is doing that I haven't yet put into the set based processing. Any ideas?

    Input file: 42 columns. 8 columns require date validation, 3 columns require currency conversion. 280,000 rows (but can be over 3 million). One additional column is calculated from one of the input columns.

    Master table: 9 million rows (the same process is used on tables of 25 million rows).

    Cursor method:

    1. Load file into temp table

    2. Declare Cursor

    3. Loop for each row, doing:

    - date validation for 8 columns

    - currency conversion for 3 columns

    - count PK in master table - if exists update else insert

    4. Fetch next

    5. Tidy up

    Set based process:

    1. Load file into Temp table

    2. Create indexes (I found that this improves performance)

    3. Update master

    set master.col_values = temp.col_values

    where master.PrimaryKey = temp.PrimaryKey

    4. Insert master

    set master.col_values = temp.col_values

    where not exists .... (a sketch of this insert pattern is shown below)
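
    A minimal sketch of step 4 with placeholder table and column names (step 3 follows the same UPDATE ... FROM pattern jpipes posted above):

    INSERT INTO master_table (PrimaryKey, col1, col2)
    SELECT t.PrimaryKey, t.col1, t.col2
    FROM temp_table t
    WHERE NOT EXISTS (
        SELECT 1
        FROM master_table m
        WHERE m.PrimaryKey = t.PrimaryKey
    )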

    For the date validations and the currency conversions in the set-based method, I am using user-defined functions.

    From what I have read on this site, I thought that set-based processing was supposed to be much better (i.e. quicker) than cursor-based processing. If this is true, what am I doing wrong?

    Suggestions welcome.

    Jeremy

  • quote:


    The WHERE clause can be rewritten as:

    ...

    WHERE CHECKSUM(o.data1,o.Data2,o.Data3)<>CHECKSUM(t.data1,t.Data2,t.Data3)

    The CHECKSUM function computes a hash (CRC-like) value over the listed columns.


    Nice. Very nice.

  • Jeremy,

    There may be some things in your SQL that we could look at that might speed things up. But aside from that, speed is not the only problem with cursors; they can also have serious memory overhead that a set-based solution usually doesn't. That said, there ARE times when a cursor will run faster, and there ARE times when a set-based solution will run faster. A lot of it is about predicting the future as well (I know it sounds corny). If you know ahead of time that the procedure you just implemented with a cursor will be run every minute on a production system with thousands of users, you're heading for trouble. But if the process is a one-time run, or something run off-hours or on small data sets, cursors generally won't present a problem. That said, I have been bitten numerous times by cursors implemented in procedures that weren't intended to be run all the time, but "it just turned out that they do"... If you want to post some code, we'll take a look...

  • Jay,

    This is the code for the set based update statement:

    declare @row int, @increment int, @start datetime, @update_count int

    set @start = getdate()

    set @row = 1

    set @increment = 5000

    set @update_count = 0

    set rowcount 0

    while @row <= 5000 begin

    update [claims_2002] with (tablockx)

    set mdl_yr = left(col001,4), vin_cd = col002, sls_cntr = col003, veh_line = col004, plant_cd = col005, maint_dt = dbo.fn_AwsDateCheck(col006), prodn_dt = dbo.fn_AwsDateCheck(col007), wrty_sta = dbo.fn_AwsDateCheck(col008), orig_wrt = dbo.fn_AwsDateCheck(col009), wcc_cd = col011, tot_cost = dbo.fn_AwsClaimLocalCurrencyCost(col026,col012)

    ,lbr_cost = dbo.fn_AwsClaimLocalCurrencyCost(col026,col013), mtrl_cos = dbo.fn_AwsClaimLocalCurrencyCost(col026,col014), prt_num0 = col015, prt_num_ = col016, prt_num1 = col017, cpsc_6_c = col018, clm_cnt_ = col019, veh_cnt_ = col020

    ,sls_regi = col021, load_dt = dbo.fn_AwsDateCheck(col022), fcc_auth = col023, rpr_dt = dbo.fn_AwsDateCheck(col024), cust_con = col025, rpr_cntr = col026, milge = col027, mkt_deri = col028, adj_amt = col029, adj_rsn_ = col030

    ,cond_cd = col031, fcc_bin_ = col032, fcc_milg = col033, fcc_regi = col034, fcc_rpt_ = col035, fcc_term = col036, lbr_hrs = col037, prt_mrku = col038, prt_typ_ = col039, rpr_dlr_ = col040

    ,rpr_dlr0 = col041, doc_num = col042, funds_ty = dbo.fn_AwsClaimCurrency(col026)

    from tempdb..clm02_new

    where [claims_2002].clm_key = tempdb..clm02_new.col010

    and tempdb..clm02_new.rec_id between @row and (@row + @increment)

    set @update_count = @update_count + @@rowcount

    set @row = @row + @increment + 1

    end

    print @update_count

    print datediff(second,@start,getdate())

    print @update_count/datediff(second,@start,getdate())

    Functions:

    /* Converts US$ cost into local currency cost for AWS claims */

    CREATE FUNCTION [dbo].[fn_AwsClaimLocalCurrencyCost] (@country_repaired varchar(25), @usd_cost decimal (25,2) )

    RETURNS decimal (25,2) AS

    BEGIN

    return ( select @usd_cost * funds_fa

    from claims_fundrep_new

    where rpr_cntr = @country_repaired

    )

    END

    /* Checks the validity of date fields from AWS downloads */

    CREATE FUNCTION [dbo].[fn_AwsDateCheck] (@date smalldatetime)

    RETURNS varchar(25) AS

    BEGIN

    declare @output_date varchar(25)

    if @date > '1950-01-01' and isDate(@date) = 1 set @output_date = @date

    else set @output_date = null

    return @output_date

    END

    /* Returns local currency for AWS claims */

    CREATE FUNCTION [dbo].[fn_AwsClaimCurrency] (@country_repaired varchar(25) )

    RETURNS varchar(25) AS

    BEGIN

    declare @output_Field varchar(25)

    if @country_repaired = 'USA' set @output_field = 'USD'

    else select @output_Field = funds_ty

    from claims_fundrep_new

    where rpr_cntr = @country_repaired

    return (@output_field)

    END

    I'm still working on this, so cosmetically it may not be at its best (such as the column names in the temporary tables).

    This is a 'weekly' update job although it can run at any time.

    Any ideas for improvements?

    Jeremy

  • Jeremy,

    If performance really is too slow, drop the user-defined functions and code everything directly in the SQL statement.

    If you want, you can do a test drive, updating only the function-derived fields in the temp table. This will give you an idea of where the biggest delay is.
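
    For example, a sketch only, reusing the table and column names from Jeremy's post and showing just three of the columns (it assumes col006 arrives as a date type, as the UDF's smalldatetime parameter suggests):

    UPDATE cl
    SET maint_dt = CASE WHEN c.col006 > '1950-01-01' THEN c.col006 ELSE NULL END  -- was dbo.fn_AwsDateCheck(col006)
    , tot_cost = c.col012 * f.funds_fa                                            -- was dbo.fn_AwsClaimLocalCurrencyCost(col026, col012)
    , funds_ty = CASE WHEN c.col026 = 'USA' THEN 'USD' ELSE f.funds_ty END        -- was dbo.fn_AwsClaimCurrency(col026)
    FROM [claims_2002] cl
    INNER JOIN tempdb..clm02_new c
    ON cl.clm_key = c.col010
    LEFT JOIN claims_fundrep_new f   -- LEFT JOIN mirrors the NULL the scalar UDF returns when no currency row matches
    ON f.rpr_cntr = c.col026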

  • I do the same thing in a data warehouse environment, but with a few tweaks.

    I load my data files into a "working table" that has the same fields as the data file. I then check for changes, new records, and deleted records in this manner: first I get the changed accounts by loading a temp table with the primary key of the records in the production table that have changed, along with the changed values. I use the method that jpipes pointed out (isnull(tbl1.fld1,0) != isnull(tbl2.fld1,0)), but I like SeekQuel's checksum idea.

    Anyway, I then do an "update from", joining the temp table to the production table on the primary key.

    Same process for new and deleted records, except the query to find them is a little different: I use a "not in" construct in the where clause. Then I insert the new records and update the deleted records, setting a "flag" on the deleted records to indicate they are no longer in the source business system.
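
    Here is a minimal sketch of that delete-flagging step, with placeholder table, column, and flag names:

    -- Flag production rows that have disappeared from the latest extract.
    -- NOT IN assumes working_table.primary_key is never NULL; use NOT EXISTS otherwise.
    UPDATE p
    SET p.modify_flag = 'D'
    FROM production_table p
    WHERE p.primary_key NOT IN (
        SELECT w.primary_key
        FROM working_table w
    )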

    This runs well for us, about 20 minutes. It is a 30 million record table checked against a data file of about 1 million records. The server has 2 GB RAM and dual 750 MHz Xeons. We have had similar processes using a cursor and they take hours.

    Not sure why you need the loop in the code you posted, smaller batches with commits??? You could do your currency conversions and date validations when you load the temp table. I would be concerned about the UDFs in the update; I'm not sure how that affects performance.

    Hope this helps, would be glad to post an example.

    Thanks,

    epurg

  • @jpipes: Thanks a lot! This is exactly what we are looking for. A first test:

    Two tables containing 1.2 million rows (temp and ODS). After changing 47,000 rows in the temp table, I ran the update statement (server: Pentium III 500 MHz, 384 MB, an old one to test on). It completed in 27 seconds! Comparisons against the cursor statement will also be made.

    @SeekQuel: Very nice (the checksum). Does this neat little "trick" offer an important performance gain? (Anyway, it is shorter and more readable, as we have tables with 280 columns.)

    @Jeremy Kemp: In our situation the set-based solution seems very fast. Remember: the set-based statement only works on the rows that result from the join:

    update 1..5% of the data

    insert 1..5% of the data

    instead of sequentially digging through the whole 100%

  • hbkdba, Great news! Give us an update when you do a performance comparison on the cursor...

    Jeremy,

    I have a feeling those UDFs might be the culprit, like NPeeters mentioned. Any chance to see an execution plan?

  • Is the following statement a valid example of the CHECKSUM approach? I've read about the * on MSDN (which does not mention the ability to use table qualifiers such as t. and o.) but I'm unable to test it because I have a couple of days off right now.

    UPDATE o

    SET

    o.data1 = t.data1

    , o.data2 = t.data2

    , o.data3 = t.data3

    , o.modify_date = @processing_date

    , o.modify_flag = 'U'

    FROM odstable o

    INNER JOIN temptable t

    ON t.key1 = o.key1

    AND t.key2 = o.key2

    AND t.key3 = o.key3

    WHERE

    (

    checksum(o.*) <> checksum(t.*)

    )

    Thanks !

    Benchmark tests, cursor vs. set-based, are planned for Thursday next week (after a short vacation 🙂)

  • Just to chime in on the UDF issue... I think they are amazingly cool, but should be used in select statements with great caution.

    We were working on an application here and the other developer, who comes from a VB/CGI/web-development world, coded a lot of the logic into UDFs, then put the UDFs in the select and where clauses... You have to keep in mind that if the UDF does a select off a large table to return its result, and you then put it into a select or, especially, a where clause, it has to execute once for every row that the statement is going to return. With caching, it might not generate TOO much I/O, but if your result set is large and the UDF adds a tiny bit of time to each row, you can drag an application to its knees waiting for it to return.

    In our case, the developer's first attempt took 10 to 15 minutes or more to return a few hundred rows out of a table with around one million (this was for a web application). I later recoded it without the UDFs and got the entire thing to run in under three seconds. To be fair to UDFs, that wasn't the only problem the stored procedure had. This is an extreme case, but I think it makes the point.

    Matthew Galbraith

    ps: I do use UDFs, but usually only in cases where I can't do it any other way, or as a shortcut when I'm writing one-off ad hoc queries; I eschew them almost completely in where clauses, though I do sometimes use them as 'smart' views in joins, and every once in a while I will put a simple one in the select clause.

  • Some timings:

    No UDFs (Basic) - 1036 rows/second

    Basic + Currency Conversion UDFs - 770 rows/sec

    Basic + Date UDFs - 550 rows/sec

    Basic + Currency UDFs + Date UDFs - 480 rows/sec

    Basic + Date validation as case statement - 710 rows/sec

    Basic + Date validation as case + table join for currency conversion - 880 rows/sec

    Yes, the UDFs were killing performance, as several of you suggested. I am surprised by how much they affected it, especially the date UDFs, which are very simple statements.

    I've also checked my cursor and confess to not using UDFs in there for date validation - I was using a much simpler IF statement, so the comparison between cursor processing and set processing is perhaps not fair.

    As speed is important, I will recode my update processing to use set processing.

    I'm interested in how people use UDFs but I will start another thread for that.

    Thanks for all your help.

    Jeremy

  • quote:


    Is the following statement a valid example of the CHECKSUM approach? I've read about the * on MSDN (which does not mention the ability to use table qualifiers such as t. and o.) but I'm unable to test it because I have a couple of days off right now.

    UPDATE o

    SET

    o.data1 = t.data1

    , o.data2 = t.data2

    , o.data3 = t.data3

    , o.modify_date = @processing_date

    , o.modify_flag = 'U'

    FROM odstable o

    INNER JOIN temptable t

    ON t.key1 = o.key1

    AND t.key2 = o.key2

    AND t.key3 = o.key3

    WHERE

    (

    checksum(o.*) <> checksum(t.*)

    )


    That would be a No. * will not work in this case.

    And I do have to say, SeekQuel, that is a unique approach. I'll have to use it myself sometime.
