Divide big delete into several small ones

  • I adapted Ludo's script http://qa.sqlservercentral.com/scripts/contributions/464.asp

    It works against a half-million-row table, but I do not understand exactly how.

    Would appreciate an explanation. Most of the code is clear except for exactly how the "FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1" makes the loop move to the next 1000 records in the table.

    Assuming 10,000 records and only 100 to be deleted (say, in the last 1,000), I can't work out why the loop isn't continually processing the first 1,000 records.

    -- Delete all records in tblStatistics older than 2 years
    
    declare @DeleteStatYear smallint
    declare @loop int, @tel int


    SET @DeleteStatYear = DATEPART(YY, GETDATE())-2

    SELECT @loop = count(*) / 1000 +1 from tblStatistics
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tblStatistics FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1
    WHERE tblstatistics.StatYear <= @DeleteStatYear
    commit transaction
    set @tel = @tel + 1
    end
    GO


  • Sorry for the slow response, StefanJ. I am sure you have figured this out already. Since the delete occurs, those original 1000 entries are no longer in the table. The SELECT TOP 1000 will pull from whatever is left.

    I prefer using some type of loop based on key criteria, to make it clearer - but this seems to work.

    Guarddata-

  • With the given delete statement, that may be true, but it certainly isn't clear.

    To my eye (and, I assume, the original poster's), the following should happen:

    Since there is no join specified between the derived table (T1) and the main table (tblStatistics), the initial delete should wipe out anything that matches the where clause (meaning this works because it deletes everything at once, first time out)
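
    In other words - if I'm reading it right - because T1 is never joined back to the target table, the statement ends up behaving the same as a plain delete. A sketch of the equivalence, reusing the variable from the original script:

    DECLARE @DeleteStatYear smallint
    SET @DeleteStatYear = DATEPART(YY, GETDATE()) - 2

    -- The derived table T1 is never referenced again, so this...
    DELETE tblStatistics FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1
    WHERE tblStatistics.StatYear <= @DeleteStatYear

    -- ...should behave the same as this (as long as the table has rows at all):
    -- DELETE tblStatistics WHERE StatYear <= @DeleteStatYear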

    As written, if the deletion was restricted to the TOP 1000 rows of the database, then we would first delete all records in the top 1000 rows that matched the where clause. Second time through, we would still presumably have any records that were in the top 1000 but did not match the where clause, filling out the top 1000 with records that have not been processed as yet. Third time through, leftovers from the first two runs, then newbies; and so on. We may never see all the records, since some records are pulled up more than once, and the loop may end before all the records can be covered. If we avoided terminating the loop after it *should* have processed all the records, then we may create an infinite loop, as we could reach a point where none of the top 1000 records can be deleted.

    If there was a join between tblStatistics and T1, and the where clause was applied to the SQL creating T1, then it would be clear what was happening.

    StefanJ: Could this be working only in the sense that all the records that should be deleted are deleted, and not in the sense that they are deleted in chunks of 1000?

    If I am right, then (going back to your sample scenario) the delete statement is executed 11 times against the table, each time deleting all records that match the where clause (though, of course, it only actually deletes anything - the 100 records in question - the first time). The SELECT TOP 1000 * is, indeed, always bringing back the same records - however, since we're ignoring that derived table, that's irrelevant.

    Running this in Query Analyzer and paying attention to the number of records processed during each run could be informative. Odds are 100 rows are deleted first time through, and none after that. Of course, with the numbers StefanJ specified, the same results occur with the changes I've suggested, but handling of more than 1000 records would go much better.

    If I'm right.

    Could someone else comment on the above - I'd been waiting for an explanation myself, hence my response to the (apparently, at least) inadequate response from guarddata.

    RD Francis

  • Ahhh - you are, of course, correct, rdfozz. This delete does delete them all in the first execution. In my rush, I was treating this the same as an inner join that repeats the reference to tblStatistics.

    StefanJ - My apologies for the misleading answer.

    Guarddata-

  • The conclusion from the replies seems to be that the example should not work; however, I have tested it and it does. BUT - and this was the reason for my original question - WHY DOES IT WORK? Could someone explain the logic of the subquery, because my brain can't work out what it is doing.

    To test it, I created a table containing the following data:

    SELECT STATYEAR, COUNT(*) from dbo.tblStatistics GROUP BY StatYear

    Result:

    StatYear  Rows
    2002      186350
    2003      104041
    2004      290391
    2005      10000
    2006      10000
    2007      20000
    2008      20000

    I then ran:

    declare @loop int, @tel int

    SELECT @loop = count(*) / 1000 +1 from tblStatistics
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tblStatistics FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1
    WHERE tblStatistics.StatYear = 2004
    commit transaction
    set @tel = @tel + 1
    end

    Then:

    SELECT STATYEAR, COUNT(*) from dbo.tblStatistics GROUP BY StatYear

    Result:

    StatYear  Rows
    2002      186350
    2003      104041
    2005      10000
    2006      10000
    2007      20000
    2008      20000

    Also, what are the benefits of Ludo's script over something simpler, like the following:

    declare @loop int, @tel int

    SELECT @loop = count(*) / 1000 +1 from tblStatistics WHERE StatYear = 2005
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tblStatistics WHERE tblStatistics.StatYear = 2005
    commit transaction
    set @tel = @tel + 1
    end

    And one final question:

    Is the BEGIN TRANSACTION and COMMIT TRANSACTION in the above essential? I thought I read that each Transact-SQL command would autocommit.

  • The way to tell if the script actually worked is to run it in Query Analyzer. You should see that the results returned indicate 290391 rows affected once, then 640 indications of 0 rows affected...which means they were all deleted by the first statement. You are exactly right - there is no benefit over the simpler statement.

    As rdfozz remarked, without a proper join to the subselect, it is basically ignored and the operation is done in one big gulp.

    When we are deleting in such a global manner, we usually loop like this:

    SET ROWCOUNT 1000          -- limit each DELETE to 1000 rows

    DECLARE @loop int
    SELECT @loop = 1           -- prime the loop variable so we enter the loop

    WHILE @loop > 0
    BEGIN
    BEGIN TRAN
    DELETE FROM tblStatistics WHERE StatYear = 2004
    SET @loop = @@ROWCOUNT     -- drops to 0 once nothing is left to delete
    COMMIT TRAN
    END

    SET ROWCOUNT 0             -- reset so later statements are not limited

    This is a lot easier and works well except where the WHERE clause causes table scans.

    Guarddata-

  • I believe that a query batch is treated as a single transaction.

    For instance:

    SELECT Fred FROM tblStatistics
    
    DELETE tblstatistics
    GO

    should leave your table intact (assuming you don't have a column named 'Fred'), while

    SELECT Fred FROM tblStatistics
    
    GO

    DELETE tblstatistics
    GO

    would delete everything in it.

    If my previous assumptions are correct, the following would do what you wanted originally. Replace the <id> column with a unique ID from your table:

    
    
    -- Delete all records in tblStatistics older than 2 years
    declare @DeleteStatYear smallint
    declare @loop int, @tel int


    SET @DeleteStatYear = DATEPART(YY, GETDATE())-2

    SELECT @loop = count(*) / 1000 +1 from tblStatistics
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tStat
    FROM tblStatistics tStat
    INNER JOIN (SELECT TOP 1000 *
                FROM tblStatistics
                WHERE StatYear <= @DeleteStatYear) AS T1
        ON (tStat.<id> = T1.<id>)

    commit transaction
    set @tel = @tel + 1
    end
    GO

    You are creating a derived table containing the top 1000 records that match your where clause, identifying the matching records in the real table by a unique ID, then deleting those 1000 matching records. The next time through, the deleted records are gone, and you get the next 1000 records.

    Note that, in the example you gave, where there are only 100 records to delete, this will never actually help you. All the records to be deleted will be caught in the first go-round. This is more useful when there are tens of thousands of records to delete.

    Hope this answers your question - if not, let us know.


    R David Francis

  • I tried

    
    
    DECLARE @Loop int
    SET ROWCOUNT 1000
    SELECT @LOOP = 1
    WHILE @LOOP > 0
    BEGIN
    BEGIN TRAN
    DELETE FROM tblStatistics WHERE STATYEAR <= 2004
    SET @Loop = @@ROWCOUNT
    COMMIT TRAN
    END

    And it seems to work perfectly and more importantly I can understand exactly what it is doing.

    Bearing in mind that I am a newcomer to SQL Server, can I ask some supplementary questions to make sure I understand it completely?

    "Except where clause causes table scans" - By this do you mean OK as long as there is an index?

    Is the BEGIN TRAN and COMMIT TRAN necessary or will each DELETE FROM be a separate transaction anyway?

    Is this code as efficient as rdfozz's post?

    The main reason for using this code, rather than a simple delete, is so as not to impact passthru query performance against this table if the delete is done at the same time. What is the best way of measuring whether this is really necessary? And should I be doing something similar with a very large insert?

  • Yes - if there is an index that matches the WHERE clause - everyone is happy.
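
    For example, an index along these lines (the name here is just an example) would cover the StatYear filter used throughout this thread:

    -- Example only - a nonclustered index so the batched DELETE can seek on StatYear
    CREATE INDEX IX_tblStatistics_StatYear ON tblStatistics (StatYear)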

    The BEGIN and COMMIT are necessary. Without them the process runs in a single transaction and there is no benefit.

    As efficient? Because of the simpler query and no sub-select, I would expect it to be more efficient. However, these are the types of things you end up testing to make sure your hunches are correct.

    Hope this helps.

    Guarddata-

  • quote:


    Bearing in mind that I am a newcomer to SQL Server, can I ask some supplementary questions to make sure I understand it completely?

    Is the BEGIN TRAN and COMMIT TRAN necessary or will each DELETE FROM be a separate transaction anyway?


    As I stated above, statements are run as a batch unless broken up by GO statements. Since you can't use a GO in the middle of a loop, you would want to commit each delete individually.

    Without the BEGIN ... COMMIT, doing the deletes separately would have no effect; the table would remain locked until the query batch is complete.

    quote:


    Is this code as efficient as rdfozz's post?

    The main reason for using this code, rather than a simple delete, is so as not to impact passthru query performance against this table if the delete is done at the same time. What is the best way of measuring whether this is really necessary? And should I be doing something similar with a very large insert?


    I shall make no claims as to the efficiency of the code I posted - I was just trying to make what you had posted originally work.

    The code you posted accomplishes your ultimate goal (delete the records), but not your intermediate goal (deleting in small batches, then committing, to minimize the amount of consecutive time the table is locked).

    There is nothing in your DELETE clause that would cause only a limited group of pre-2004 records to be deleted. In the first iteration, the loop will zap all the pre-2004 records, and will set @LOOP to the number of records deleted; in the second iteration, nothing is left to be deleted, so @LOOP is set to zero, and the loop stops. This would be more efficient if the loop were taken out and you just issued the delete statement.

    As to the remainder of your question: is it necessary to break up your DELETE into several smaller ones? This depends heavily on the details of your situation. How often do people run queries against this table? How long does the DELETE actually take, on a weekly/daily/hourly/whatever basis? Is that time likely to change dramatically? Is there a chunk of "dead time" during which the database sees little use, when this activity could be safely scheduled?
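
    If you want a rough number for how long the delete actually takes, a quick sketch like this (adjust the WHERE clause to your real criteria) reports the elapsed time:

    DECLARE @start datetime
    SET @start = GETDATE()

    DELETE FROM tblStatistics WHERE StatYear <= 2004

    -- elapsed time for the delete, in milliseconds
    SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms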

    There's an easy way to see if you need to break things up, but it doesn't work well in most environments - do the deletion, and see if people complain....

    Otherwise, answer the questions above, and put the answers together to determine what you can and should do.

    R David Francis

  • Almost forgot - INSERTs most certainly can lock up the table when done in bulk, as can UPDATEs. Whether this is a problem for you or not depends on your situation, as I described above.

    To give you a little sample of the issues to consider, here's where I needed to do this: we receive data exports from a partner on a weekly basis. We can wind up needing to perform tens or hundreds of thousands of inserts or updates, which is a problem in an environment where we have dozens to hundreds of queries against this same table every minute.

    My solution was to store the IDs of the records to insert or update in a table variable. I grab the low and high ID values, use an increment of 10000, and set a counter to (low) + 10000. I insert or update all the records with IDs less than my counter, drop those same IDs from my table variable, increment the counter, and do it again until I pass the high ID. I'd post my actual code, but it's a bit too big, as I actually have nine different scenarios to account for, and updates to do to other tables at the end; a stripped-down sketch of the general pattern follows.
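
    The general shape, with placeholder table and column names rather than my real ones, is something like this:

    -- Placeholder names throughout: @PendingIDs, SourceTable, TargetTable, ID, SomeColumn
    DECLARE @PendingIDs TABLE (ID int PRIMARY KEY)
    DECLARE @low int, @high int, @counter int

    -- IDs of the rows that still need to be processed
    INSERT INTO @PendingIDs (ID)
    SELECT ID FROM SourceTable

    SELECT @low = MIN(ID), @high = MAX(ID) FROM @PendingIDs
    SET @counter = @low + 10000

    WHILE @counter <= @high + 10000
    BEGIN
        BEGIN TRAN

        -- Process (here: copy) the rows whose IDs fall below the counter
        INSERT INTO TargetTable (ID, SomeColumn)
        SELECT s.ID, s.SomeColumn
        FROM SourceTable s
        INNER JOIN @PendingIDs p ON p.ID = s.ID
        WHERE p.ID < @counter

        -- Drop the IDs just handled and move the window forward
        DELETE FROM @PendingIDs WHERE ID < @counter
        SET @counter = @counter + 10000

        COMMIT TRAN
    END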

    Hope this helps some.

    R David Francis

  • Thanks Guarddata and Rdfozz. I have learnt a lot from your answers.

    Just one final clarification.

    In the book I am using, it suggests that autocommit will treat each SQL statement as a transaction. BOL seems to confirm this (see below). I understood from this that the BEGIN TRAN and COMMIT TRAN were superfluous in the above code. What am I missing?

    In BOL

    quote:


    Autocommit mode is the default transaction management mode of Microsoft® SQL Server™. Every Transact-SQL statement is committed or rolled back when it completes. If a statement completes successfully, it is committed; if it encounters any error, it is rolled back. A SQL Server connection operates in autocommit mode whenever this default mode has not been overridden by either explicit or implicit transactions. Autocommit mode is also the default mode for ADO, OLE DB, ODBC, and DB-Library.

    A SQL Server connection operates in autocommit mode until a BEGIN TRANSACTION statement starts an explicit transaction, or implicit transaction mode is set on. When the explicit transaction is committed or rolled back, or when implicit transaction mode is turned off, SQL Server returns to autocommit mode.


  • You are correct about the Auto-Commit...but that applies to the entire procedure. In order to break it down we use the BEGIN and COMMIT.
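
    If you want to see it for yourself, @@TRANCOUNT shows whether an explicit transaction is currently open (0 means you are in autocommit mode):

    SELECT @@TRANCOUNT   -- 0: autocommit, no explicit transaction open
    BEGIN TRAN
    SELECT @@TRANCOUNT   -- 1: inside an explicit transaction
    COMMIT TRAN
    SELECT @@TRANCOUNT   -- back to 0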

    Guarddata-

  • So what this means is that all code contained in a procedure is a single statement and therefore treated as a single transaction.

    I guess the safest thing to do - till I am less confused - is to insert BEGIN TRAN / COMMIT TRAN even if not required.

    Is there anywhere which explains transactions better than the text above from BOL?
