delete millions of rows

  • The thing I've realized from this, which should have been obvious, is that when you loop through batches using TOP, you're always scanning the table from the beginning and going back over rows that you have already examined. If you have a huge table and a comparatively low TOP setting, you end up doing hundreds of scans.

    With the range technique, you walk through the table once. No wonder it's so much faster.

    :w00t:
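
    Just to illustrate (this is only a rough sketch, not the actual code from earlier in the thread): assuming the clustered index is on an ever-increasing integer key - a hypothetical Id column on dbo.MyTable here - the range walk might look something like this, with the batch size, delay, and <delete criteria> all placeholders to tune.

    DECLARE @MinId     BIGINT,
            @MaxId     BIGINT,
            @BatchSize BIGINT = 100000;

    -- Grab the key range once; each pass below just moves the window forward.
    SELECT @MinId = MIN(Id), @MaxId = MAX(Id) FROM dbo.MyTable;

    WHILE @MinId <= @MaxId
    BEGIN
        -- Only the current slice of the clustered index is touched, so rows
        -- already walked past are never scanned again.
        DELETE FROM dbo.MyTable
        WHERE Id >= @MinId
          AND Id <  @MinId + @BatchSize
          AND <delete criteria>;

        SET @MinId = @MinId + @BatchSize;

        WAITFOR DELAY '00:00:05';  -- optional breather between batches
    END;

    Each DELETE seeks straight to its slice of the key range instead of rescanning from the top of the table, which is where the speed difference comes from.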

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • It all comes back to following the clustered index.

    😛

    _____________
    Code for TallyGenerator

  • Jeff Moden (9/13/2016)


    ChrisM@Work (9/13/2016)


    Eric M Russell (9/12/2016)


    4 million rows is actually not a lot, but it depends on the speed of your storage system and whether the table will need to remain online and usable while the deletion is in progress. Essentially, you want to delete the rows in batches to reduce the size of active transaction logging. The following example deletes from a table in 100,000-row batches and waits for 1 minute between each delete operation. This wait window gives the storage system additional time to keep up and prevents extended periods of blocking by allowing other user processes an opportunity to start and complete. When there are no more rows to be deleted (the last delete operation affected no rows), it breaks out of the loop. The number of rows per batch and the wait time between batches can be adjusted as needed.

    WHILE 1 = 1
    BEGIN;
        DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
        IF @@ROWCOUNT = 0 BREAK;
        WAITFOR DELAY '00:01:00';
    END;

    Also, it helps a lot if the delete criteria are indexed.

    Batch deletes like this can be crippled by physical reads. You've probably set the batch size based on performance against cached data: the first one or two batches might well have the data to be deleted already in cache and will be suitably fast. After that, the deletes can be dragged to the ground by physical reads - the rate-limiting step of the process isn't how fast rows can be deleted, but how fast they can be read from disk. Cache priming can make a very significant difference to batch delete times: uncommitted reads of the data to be deleted (covering all indexes, not just the CI), blackholed into variables, take next to no time. We have one observation of a batched delete which would have taken six weeks being cut down to a mere few hours using this method.

    You can also modify that code to skip the most expensive iteration of them all, and that's the last iteration, where nothing is found but a full scan still has to be done looking for it. If you're trying to delete in batches of 100K rows (for example), then any delete that affects fewer than 100K rows (the batch size) is actually the last iteration.

    For the given code above, the change to skip that last iteration is pretty simple...

    WHILE 1 = 1
    BEGIN;
        DELETE TOP(100000) FROM MyTable WHERE <delete criteria>;
        IF @@ROWCOUNT < 100000 BREAK;
        WAITFOR DELAY '00:01:00';
    END;

    It won't help with the problem that you've identified, though.

    Simple stuff like this is soooo often missed. Thanks for the reminder, Jeff.
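
    For anyone who hasn't tried cache priming, here's a rough sketch of what I mean - it reuses the MyTable / <delete criteria> placeholders from Eric's code above, and the blackhole variables and column names are made up, so adjust to suit:

    DECLARE @BitBucket1 INT,        -- hypothetical columns/types; pick columns
            @BitBucket2 DATETIME;   -- from the table you're actually deleting from

    WHILE 1 = 1
    BEGIN;
        -- Prime the cache: a dirty read of the next batch, blackholed into
        -- variables so no rows are returned and next to no time is spent.
        SELECT TOP (100000)
               @BitBucket1 = SomeIndexedColumn,
               @BitBucket2 = SomeOtherColumn
        FROM   dbo.MyTable WITH (READUNCOMMITTED)
        WHERE  <delete criteria>;

        -- The delete now hits mostly cached pages instead of stalling on physical reads.
        DELETE TOP(100000) FROM dbo.MyTable WHERE <delete criteria>;

        IF @@ROWCOUNT < 100000 BREAK;

        WAITFOR DELAY '00:01:00';
    END;

    In practice you'd want to warm every index the delete has to maintain, not just the clustered index - one way is to repeat a priming read per nonclustered index with an index hint - since a single read like this will generally only touch the CI.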

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • This is how I delete many millions of rows, a bite at a time. It seems to work very well; however, you will have to see what 'number' your server can handle and what else is running. During the day I usually take down 200,000 - 400,000 at a time, depending on the table. At night, when the mice are out to play, the number goes up to 500,000 - 1,000,000. It is very fast.

    Now, this might not work for people with little tempdb/log disk space to spare. Be aware that tempdb and log sizes change when the number goes up.

    So when you're picking your number for your server, be aware of disk space (tempdb) and what is running.

    I started with 100,000 and went up.

    DECLARE @RowsDeleted INTEGER
    SET @RowsDeleted = 1

    WHILE (@RowsDeleted > 0)
    BEGIN
        BEGIN TRANSACTION
        DELETE TOP (500000) FROM DATABASE.dbo.TABLENAME
        WHERE AcctNbr NOT IN (SELECT xxxx FROM database.dbo.tablename with (nolock))
        SET @RowsDeleted = @@ROWCOUNT
        COMMIT TRANSACTION
    END

    Note: I didn't read the previous post. Either way, this works well for us. The sub-select is pulling from a clustered index, if that helps. I will probably add the wait command to this after reading everything. The first batch or two do seem to run 'slower', but after that it deletes at least ten times better than a simple delete (which I would never do on a large scale - scar tissue, you know 🙂 ). Have a great day.

  • Withdrawn. I'm being critical. Is it Friday yet?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/22/2016)


    Withdrawn. I'm being critical. Is it Friday yet?

    YES!! It is Friday!! Unfortunately, it's not Friday here, yet. :Whistling:

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

