Deleting large batches of rows - optimum batch size?

  • In another forum post, a poster was deleting large numbers of rows from a table in batches of 50,000.

    In the bad old days ('80s - '90s), I used to have to delete rows in batches of 500, then 1000, then 5000, due to the size of the transaction rollback segments (yes - Oracle).

    I always found that increasing the number of deleted rows in a single statement/transaction improved overall process speed - up to some magic point, at which some overhead in the system began slowing the deletes down, so that deleting a single batch of 10,000 rows took more than twice as much time as deleting two batches of 5,000 rows each.

    Anyone have some good rule-of-thumb numbers (or even better, some actual statistics and/or explanations) as to how many records should be deleted in a single transaction/statement for optimum speed? 50,000 - 100,000 - 1,000,000 or unlimited? Are there significant differences between SQL Server 2008, 2012, and 2014?

  • It's going to vary based on the number of indexes, foreign keys, adjacency of rows, etc. The old method of trying 10K, 50K, then 100K is probably still the best approach. Even a rule of thumb of 50K can be a huge problem depending on all those things.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • When we program a loop to bulk delete rows we adjust the Batch Size according to how long the iteration takes. We increase batch size by 20% if the iteration execution time is below a LOW threshold and reduce it by 50% if it is above a HIGH threshold. We also put a WAITFOR for a couple of seconds to allow other processes to run.

    I've never thought to actually log the values, but I suppose I might find that they settle at a sweet-spot 🙂
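
    A minimal sketch of that kind of self-adjusting loop - the table name, date filter, thresholds and starting size are all made up for illustration:

        DECLARE @BatchSize   int = 10000,      -- starting point
                @LowMs       int = 2000,       -- grow the batch if faster than this
                @HighMs      int = 10000,      -- shrink the batch if slower than this
                @RowsDeleted int = 1,
                @Start       datetime2,
                @ElapsedMs   int;

        WHILE @RowsDeleted > 0
        BEGIN
            SET @Start = SYSDATETIME();

            DELETE TOP (@BatchSize)
            FROM dbo.SomeArchive                               -- hypothetical table
            WHERE ArchivedDate < DATEADD(DAY, -90, GETDATE()); -- hypothetical filter

            SET @RowsDeleted = @@ROWCOUNT;
            SET @ElapsedMs   = DATEDIFF(MILLISECOND, @Start, SYSDATETIME());

            -- adjust the batch size based on how long the iteration took
            IF @ElapsedMs < @LowMs  SET @BatchSize = @BatchSize * 120 / 100;  -- +20%
            IF @ElapsedMs > @HighMs SET @BatchSize = @BatchSize / 2;          -- -50%
            IF @BatchSize < 500     SET @BatchSize = 500;                     -- sanity floor

            WAITFOR DELAY '00:00:02';   -- give other processes a chance to run
        END;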

  • A quick thought, further to Jeff's fine answer: observing the wait stats during the operation usually gives a good indication of whether to alter the batch size or even change methods altogether. Alternatives such as partitioning, or a temp table and TRUNCATE, can under the right circumstances be many times faster.

    😎

  • Another trick to help the deletes along is to ensure that the deletes are done in the same order as the clustered index.
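
    For example, if the clustered index is on an ever-increasing LogID and the purge removes the oldest rows, something along these lines walks the clustered index in order (table and column names are hypothetical):

        DECLARE @MaxLogID bigint;

        SELECT @MaxLogID = MAX(LogID)
        FROM   dbo.SomeArchive
        WHERE  ArchivedDate < DATEADD(DAY, -90, GETDATE());

        WHILE 1 = 1
        BEGIN
            ;WITH victims AS
            (
                SELECT TOP (10000) LogID
                FROM   dbo.SomeArchive
                WHERE  LogID <= @MaxLogID
                ORDER BY LogID              -- same order as the clustered index
            )
            DELETE FROM victims;            -- deletes the underlying rows

            IF @@ROWCOUNT = 0 BREAK;

            WAITFOR DELAY '00:00:02';
        END;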

    --Jeff Moden

  • If deleting, say, 90% of the table, then it's worth considering "copying" the 10% to be retained into a new temporary table, and dropping the old table / renaming the temporary table. Some downtime, and hassle with indexes / FKeys etc., but a lot faster than deleting 90% of the data in a loop 🙂
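
    A bare-bones sketch of that approach, with placeholder names and a made-up retention filter; indexes, constraints, permissions etc. on the new table still have to be scripted separately:

        -- Copy the ~10% to be retained into a new table.
        SELECT *
        INTO   dbo.BigTable_New
        FROM   dbo.BigTable
        WHERE  CreatedDate >= DATEADD(YEAR, -1, GETDATE());

        -- Re-create indexes, constraints, triggers and permissions on dbo.BigTable_New here.

        BEGIN TRANSACTION;
            EXEC sp_rename 'dbo.BigTable',     'BigTable_Old';
            EXEC sp_rename 'dbo.BigTable_New', 'BigTable';
        COMMIT;

        -- Once everything checks out:
        -- DROP TABLE dbo.BigTable_Old;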

  • Kristen-173977 (10/18/2015)


    If deleting, say, 90% of the table then worth considering "copying" the 10% to be retained into a new temporary table, and dropping old table / renaming temporary table. Some downtime, and hassle with indexes / FKeys etc. but a lot faster than deleting 90% of the data in a loop 🙂

    Amen to that!

    --Jeff Moden

  • Thanks for all the replies. A couple of observations/comments.

    Absolutely agree with the WAITFOR breaks. When doing a lot of transactions, I always insert these to avoid getting to the point where the server is backlogged on physical disk activity as this can seriously impact other users/processes.

    Regarding creating a second table, that is rarely an option, as I am usually working on 24/7 active systems, so the "pruning" needs to occur while the table is in use by other processes. Also, the percent of rows being deleted is usually less than 20% and rarely anywhere near 50%.

    I was curious as to others' experiences with deleting large batches of rows. In the '80s, with Oracle/DB2/Informix/Sybase, I found that anything much over 500 rows at a time could impact system performance.

    Recently I have been using 5000 rows as a reasonable tradeoff between speed and system resources, but when I saw the poster doing 50,000 rows in a single batch I wondered if I was being way too conservative and thought I would see if anyone had better metrics/suggestions.

  • Kim Crosser (10/19/2015)


    Recently I have been using 5000 rows as a reasonable tradeoff between speed and system resources, but when I saw the poster doing 50,000 rows in a single batch I wondered if I was being way too conservative and thought I would see if anyone had better metrics/suggestions.

    Our default starting point is 10,000 deletions per iteration.

  • Jeff Moden (10/18/2015)


    Kristen-173977 (10/18/2015)


    If deleting, say, 90% of the table then worth considering "copying" the 10% to be retained into a new temporary table, and dropping old table / renaming temporary table. Some downtime, and hassle with indexes / FKeys etc. but a lot faster than deleting 90% of the data in a loop 🙂

    Amen to that!

    The following is simplistic, as there is more involved depending on the table definition, including indexes, foreign key constraints, etc. I have a procedure that creates DB training cuts consisting of a subset of a full database.

    If deleting 90% of a table: SELECT * INTO to create a temporary table of the data to retain, build the appropriate indexes on the temporary table, truncate the original table, then ALTER TABLE SWITCH to move the data back to the original table (sketched below). If you have a timestamp column in the original table, it will retain its original values throughout this process.

    Again, there really is more to this than what I wrote above, and it requires plenty of research and testing to get it right.
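
    Something roughly along these lines, with hypothetical names; step 2 is deliberately hand-waved because the staging table's indexes and constraints must match the original for the SWITCH to be allowed:

        -- 1. Copy the ~10% to keep into a staging table on the same filegroup.
        SELECT *
        INTO   dbo.BigTable_Keep
        FROM   dbo.BigTable
        WHERE  CreatedDate >= DATEADD(YEAR, -1, GETDATE());

        -- 2. Build the same indexes/constraints on dbo.BigTable_Keep as on dbo.BigTable.

        -- 3. Empty the original (minimally logged) and switch the kept rows back in.
        TRUNCATE TABLE dbo.BigTable;

        ALTER TABLE dbo.BigTable_Keep
            SWITCH TO dbo.BigTable;         -- metadata-only move of the retained rows

        DROP TABLE dbo.BigTable_Keep;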

  • What % of rows are you deleting and how often? If > 25%, then you probably want to go ahead and just drop all indexes and then rebuild afterward. Another thing to keep in mind is that the transaction log is the worst bottleneck when batch deleting or updating, and whatever best practices there are for optimizing the log will benefit you here.

    But really, for any database that routinely performs physical deletes as part of its normal operation, whether periodic batch deletes in a DW or single deletes in an OLTP database, it's indicative of poor ETL processing or poor data modeling. I mean, why insert something and then delete it?
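
    As one example of being kind to the log, a batched delete can pause for log housekeeping every so many iterations (all names and numbers here are illustrative; the BACKUP LOG comment applies to FULL recovery, the CHECKPOINT to SIMPLE):

        DECLARE @i int = 0;

        WHILE 1 = 1
        BEGIN
            DELETE TOP (10000)
            FROM dbo.SomeArchive
            WHERE ArchivedDate < DATEADD(DAY, -90, GETDATE());

            IF @@ROWCOUNT = 0 BREAK;

            SET @i += 1;
            IF @i % 10 = 0
            BEGIN
                -- FULL recovery:  BACKUP LOG MyDb TO DISK = N'MyDb_log.trn';
                CHECKPOINT;      -- SIMPLE recovery: allows log space to be reused
            END;
        END;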

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (10/19/2015)


    What % of rows are you deleting and how often? If > 25%, then you probably want to go ahead and just drop all indexes and then rebuild afterward. Another thing to keep in mind is that the transaction log is the worst bottleneck when batch deleting or updating, and whatever best practices there are for optimizing the log will benefit you here.

    Drop or disable? If you drop the indexes, you also need the code to rebuild them. If you disable the nonclustered indexes, you only need to rebuild them when done, and you don't need the T-SQL to recreate them.
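
    Something like this, with made-up index names - the clustered index has to stay enabled, since disabling it makes the table inaccessible:

        -- Disable the nonclustered indexes: the definitions stay in metadata,
        -- but the index data is de-allocated.
        ALTER INDEX IX_BigTable_Foo ON dbo.BigTable DISABLE;
        ALTER INDEX IX_BigTable_Bar ON dbo.BigTable DISABLE;

        -- ... run the batched deletes here ...

        -- Rebuild brings them back from the stored definitions; no separate
        -- CREATE INDEX scripts need to be maintained.
        ALTER INDEX ALL ON dbo.BigTable REBUILD;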

  • Lynn Pettis (10/19/2015)


    Eric M Russell (10/19/2015)


    What % of rows are you deleting and how often? If > 25%, then you probably want to go ahead and just drop all indexes and then rebuild afterward. Another thing to keep in mind is that the transaction log is the worst bottleneck when batch deleting or updating, and whatever best practices there are for optimizing the log will benefit you here.

    Drop or disable? If you drop the indexes, you also need the code to rebuild them. If you disable the nonclustered indexes, you only need to rebuild them when done, and you don't need the T-SQL to recreate them.

    Yes, disable would probably be preferred, because it ensures that the indexes are re-created using the same options as before.

    Also, for what it's worth, disable performs a de-allocation of the index data, just like a drop, except the metadata is still there in a disabled state.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Lynn Pettis (10/19/2015)


    If deleting 90% of a table, SELECT * INTO to create a temporary table of data to retain, build the appropriate indexes on the temporary table, truncate the original table, ALTER TABLE SWITCH to move the data back to the original table.

    I'm curious how this compares with New Table and Rename?

    With SWITCH, downtime is from the start of the SELECT * INTO to the end of the SWITCH.

    With rename, downtime is from the start of the SELECT * INTO to the end of the RENAME.

    Both cases have all the same issues (presumably) with drop/create indexes, constraints and FKeys. Does anything else "object" to having, for example, a table DROPped and then re-appear after a rename? I don't think VIEWs need refreshing, do they?

  • Eric M Russell (10/19/2015)


    it's indicative of poor ETL processing or poor data modeling. I mean, why insert something and then delete it?

    Pah!

    We have Archive tables on almost every table in our system. We delete data after a period of time (it varies per table). The data is almost never actually viewed - only when investigating what went wrong with a user's edit of a record, or investigating possible fraud / misconduct.

    I can't envisage adopting a more sophisticated strategy, e.g. partitioning, for all these (hundreds of) tables. Many of them get one edit a year if they are lucky; others are much more active, but for many of them "it depends" on whether users create a lot of edits/archive records or not.

    When we create a new archive table we add it to the "purge" routine (with a configurable retention period). The delete routine is "same for all" so it takes only a few seconds to add a new archive table to the purge.
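
    A rough sketch of what such a config-driven purge might look like - the config table, column names and batch size here are invented for illustration:

        DECLARE @TableName     sysname,
                @RetentionDays int,
                @RowsDeleted   int,
                @sql           nvarchar(max);

        DECLARE purge_cursor CURSOR LOCAL FAST_FORWARD FOR
            SELECT TableName, RetentionDays
            FROM   dbo.ArchivePurgeConfig;          -- hypothetical config table

        OPEN purge_cursor;
        FETCH NEXT FROM purge_cursor INTO @TableName, @RetentionDays;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            SET @sql = N'DELETE TOP (10000) FROM dbo.' + QUOTENAME(@TableName)
                     + N' WHERE ArchivedDate < DATEADD(DAY, -@Days, GETDATE());'
                     + N' SET @Rows = @@ROWCOUNT;';

            SET @RowsDeleted = 1;
            WHILE @RowsDeleted > 0                  -- batch until this table is purged
            BEGIN
                EXEC sp_executesql @sql,
                     N'@Days int, @Rows int OUTPUT',
                     @Days = @RetentionDays, @Rows = @RowsDeleted OUTPUT;

                IF @RowsDeleted > 0
                    WAITFOR DELAY '00:00:02';       -- let other work through
            END;

            FETCH NEXT FROM purge_cursor INTO @TableName, @RetentionDays;
        END;

        CLOSE purge_cursor;
        DEALLOCATE purge_cursor;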
