Divide big delete into several small ones

  • I adapted Ludo's script http://qa.sqlservercentral.com/scripts/contributions/464.asp

    It works against a half-million-row table, but I do not understand exactly how.

    Would appreciate an explanation. Most of the code is clear except for exactly how the "FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1" makes the loop move to the next 1000 records in the table.

    Assuming 10,000 records and only 100 to be deleted (say, in the last 1,000), I can't work out why the loop isn't continually processing the first 1,000 records.

    -- Delete all records in tblStatistics older than 2 years
    
    declare @DeleteStatYear smallint
    declare @loop int, @tel int


    SET @DeleteStatYear = DATEPART(YY, GETDATE())-2

    SELECT @loop = count(*) / 1000 +1 from tblStatistics
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tblStatistics FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1
    WHERE tblstatistics.StatYear <= @DeleteStatYear
    commit transaction
    set @tel = @tel + 1
    end
    GO


  • Sorry for the slow response, StefanJ. I am sure you have figured this out already. Since the delete occurs, those original 1000 entries are no longer in the table. The SELECT TOP 1000 will pull from whatever is left.

    I prefer using some type of loop based on key criteria, to make it clearer - but this seems to work.

    Guarddata-

  • With the given delete statement, that may be true, but it certainly isn't clear.

    To my eye (and, I assume, the original poster's), the following should happen:

    Since there is no join specified between the derived table (T1) and the main table (tblStatistics), the initial delete should wipe out anything that matches the where clause (meaning this works because it deletes everything at once, first time out)
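
    In other words - if I'm reading it right - because T1 is never joined back to the target table, the statement ends up behaving the same as a plain delete. A sketch of the equivalence, reusing the variable from the original script:

    DECLARE @DeleteStatYear smallint
    SET @DeleteStatYear = DATEPART(YY, GETDATE()) - 2

    -- The derived table T1 is never referenced again, so this...
    DELETE tblStatistics FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1
    WHERE tblStatistics.StatYear <= @DeleteStatYear

    -- ...should behave the same as this (as long as the table has rows at all):
    -- DELETE tblStatistics WHERE StatYear <= @DeleteStatYear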

    As written, if the deletion was restricted to the TOP 1000 rows of the database, then we would first delete all records in the top 1000 rows that matched the where clause. Second time through, we would still presumably have any records that were in the top 1000 but did not match the where clause, filling out the top 1000 with records that have not been processed as yet. Third time through, leftovers from the first two runs, then newbies; and so on. We may never see all the records, since some records are pulled up more than once, and the loop may end before all the records can be covered. If we avoided terminating the loop after it *should* have processed all the records, then we may create an infinite loop, as we could reach a point where none of the top 1000 records can be deleted.

    If there was a join between tblStatistics and T1, and the where clause was applied to the SQL creating T1, then it would be clear what was happening.

    StefanJ: Could this be working only in the sense that all the records that should be deleted are deleted, and not in the sense that they are deleted in chunks of 1000?

    If I am right, then (going back to your sample scenario) the delete statement is executed 11 times against the table, each time deleting all records that match the where clause (though, of course, it only actually deletes anything - the 100 records in question - the first time). The SELECT TOP 1000 * is, indeed, always bringing back the same records - however, since we're ignoring that derived table, that's irrelevant.

    Running this in Query Analyzer and paying attention to the number of records processed during each run could be informative. Odds are 100 rows are deleted first time through, and none after that. Of course, with the numbers StefanJ specified, the same results occur with the changes I've suggested, but handling of more than 1000 records would go much better.

    If I'm right.

    Could someone else comment on the above - I'd been waiting for an explanation myself, hence my response to the (apparently, at least) inadequate response from guarddata.

    RD Francis

  • Ahhh - you are, of course, correct, rdfozz. This delete does delete them all in the first execution. In my rush, I was treating this the same as an inner join that repeats the reference to tblStatistics.

    StefanJ - My apologies for the misleading answer.

    Guarddata-

  • The conclusion from the replies seems to be that the example should not work; however, I have tested it and it does. BUT - and this was the reason for my original question - WHY DOES IT WORK? Could someone explain the logic of the subquery, because my brain can't work out what it is doing.

    To test it, I created a table containing the following data:

    SELECT STATYEAR, COUNT(*) from dbo.tblStatistics GROUP BY StatYear

    Result:

    StatYear  Rows
    2002      186350
    2003      104041
    2004      290391
    2005      10000
    2006      10000
    2007      20000
    2008      20000

    I then ran:

    declare @loop int, @tel int

    SELECT @loop = count(*) / 1000 +1 from tblStatistics
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tblStatistics FROM (SELECT TOP 1000 * FROM tblStatistics) AS T1
    WHERE tblStatistics.StatYear = 2004
    commit transaction
    set @tel = @tel + 1
    end

    Then:

    SELECT STATYEAR, COUNT(*) from dbo.tblStatistics GROUP BY StatYear

    Result:

    StatYear  Rows
    2002      186350
    2003      104041
    2005      10000
    2006      10000
    2007      20000
    2008      20000

    Also, what are the benefits of Ludo's script over something simpler, like the following:

    declare @loop int, @tel int

    SELECT @loop = count(*) / 1000 +1 from tblStatistics WHERE StatYear = 2005
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tblStatistics WHERE tblStatistics.StatYear = 2005
    commit transaction
    set @tel = @tel + 1
    end

    And one final question:

    Is the BEGIN TRANSACTION and COMMIT TRANSACTION in the above essential? I thought I read that each Transact-SQL command would autocommit.

  • The way to tell if the script actually worked is to run it in Query Analyzer. You should see that the results returned indicate 290391 rows affected once, then 640 indications of 0 rows affected...which means they were all deleted by the first statement. You are exactly right - there is no benefit over the simpler statement.

    As rdfozz remarked, without a proper join to the subselect, it is basically ignored and the operation is done in one big gulp.

    When we are deleting in such a global manner, we usually loop like this:

    SET ROWCOUNT 1000          -- limit each DELETE to 1000 rows

    DECLARE @loop int
    SELECT @loop = 1           -- prime the loop variable so we enter the loop

    WHILE @loop > 0
    BEGIN
    BEGIN TRAN
    DELETE FROM tblStatistics WHERE StatYear = 2004
    SET @loop = @@ROWCOUNT     -- drops to 0 once nothing is left to delete
    COMMIT TRAN
    END

    SET ROWCOUNT 0             -- reset so later statements are not limited

    This is a lot easier and works well except where the WHERE clause causes table scans.

    Guarddata-

  • I believe that a query batch is treated as a single transaction.

    For instance:

    SELECT Fred FROM tblStatistics
    
    DELETE tblstatistics
    GO

    should leave your table intact (assuming you don't have a column named 'Fred'), while

    SELECT Fred FROM tblStatistics
    
    GO

    DELETE tblstatistics
    GO

    would delete everything in it.

    If my previous assumptions are correct, the following would do what you wanted originally. Replace the <id> column with a unique ID from your table:

    
    
    -- Delete all records in tblStatistics older than 2 years
    declare @DeleteStatYear smallint
    declare @loop int, @tel int


    SET @DeleteStatYear = DATEPART(YY, GETDATE())-2

    SELECT @loop = count(*) / 1000 +1 from tblStatistics
    set @tel = 0
    while @tel < @loop
    begin
    begin transaction
    DELETE tStat
    FROM tblStatistics tStat
    INNER JOIN (SELECT TOP 1000 *
                FROM tblStatistics
                WHERE StatYear <= @DeleteStatYear) AS T1
        ON (tStat.<id> = T1.<id>)

    commit transaction
    set @tel = @tel + 1
    end
    GO

    You are creating a derived table containing the top 1000 records that match your where clause, identifying the matching records in the real table by a unique ID, then deleting those 1000 matching records. The next time through, the deleted records are gone, and you get the next 1000 records.

    Note that, in the example you gave, where there are only 100 records to delete, this will never actually help you. All the records to be deleted will be caught in the first go-round. This is more useful when there are tens of thousands of records to delete.

    Hope this answers your question - if not, let us know.


    R David Francis

  • I tried

    
    
    DECLARE @Loop int
    SET ROWCOUNT 1000
    SELECT @LOOP = 1
    WHILE @LOOP > 0
    BEGIN
    BEGIN TRAN
    DELETE FROM tblStatistics WHERE STATYEAR <= 2004
    SET @Loop = @@ROWCOUNT
    COMMIT TRAN
    END

    And it seems to work perfectly and more importantly I can understand exactly what it is doing.

    Bearing in mind that I am a newcomer to SQL Server, can I ask some supplementary questions to make sure I understand it completely?

    "Except where clause causes table scans" - By this do you mean OK as long as there is an index?

    Is the BEGIN TRAN and COMMIT TRAN necessary or will each DELETE FROM be a separate transaction anyway?

    Is this code as efficient as rdfozz's post?

    The main reason for using this code, rather than a simple delete, is so as not to impact passthru query performance against this table if the delete is done at the same time. What is the best way of measuring whether this is really necessary? And should I be doing something similar with a very large insert?

  • Yes - if there is an index that matches the WHERE clause - everyone is happy.
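
    For example, an index along these lines (the name here is just an example) would cover the StatYear filter used throughout this thread:

    -- Example only - a nonclustered index so the batched DELETE can seek on StatYear
    CREATE INDEX IX_tblStatistics_StatYear ON tblStatistics (StatYear)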

    The BEGIN and COMMIT are necessary. Without them the process runs in a single transaction and there is no benefit.

    As efficient? Because of the simpler query and no sub-select, I would expect it to be more efficient. However, these are the types of things you end up testing to make sure your hunches are correct.

    Hope this helps.

    Guarddata-

  • quote:


    Bearing in mind that I am a newcomer to SQL Server, can I ask some supplementary questions to make sure I understand it completely?

    Is the BEGIN TRAN and COMMIT TRAN necessary or will each DELETE FROM be a separate transaction anyway?


    As I stated above, statements are run as a batch unless broken up by GO statements. Since you can't use a GO in the middle of a loop, you would want to commit each delete individually.

    Without the BEGIN ... COMMIT, doing the deletes separately would have no effect; the table would remain locked until the query batch is complete.

    quote:


    Is this code as efficient as rdfozz's post?

    The main reason for using this code, rather than a simple delete, is so as not to impact passthru query performance against this table if the delete is done at the same time. What is the best way of measuring whether this is really necessary? And should I be doing something similar with a very large insert?


    I shall make no claims as to the efficiency of the code I posted - I was just trying to make what you had posted originally work.

    The code you posted accomplishes your ultimate goal (delete the records), but not your intermediate goal (deleting in small batches, then committing, to minimize the amount of consecutive time the table is locked).

    There is nothing in your DELETE clause that would cause only a limited group of pre-2004 records to be deleted. In the first iteration, the loop will zap all the pre-2004 records, and will set @LOOP to the number of records deleted; in the second iteration, nothing is left to be deleted, so @LOOP is set to zero, and the loop stops. This would be more efficient if the loop were taken out and you just issued the delete statement.

    As to the remainder of your question: is it necessary to break up your DELETE into several smaller ones? This depends heavily on the details of your situation. How often do people run queries against this table? How long does the DELETE actually take, on a weekly/daily/hourly/whatever basis? Is that time likely to change dramatically? Is there a chunk of "dead time" during which the database sees little use, when this activity could be safely scheduled?
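
    If you want a rough number for how long the delete actually takes, a quick sketch like this (adjust the WHERE clause to your real criteria) reports the elapsed time:

    DECLARE @start datetime
    SET @start = GETDATE()

    DELETE FROM tblStatistics WHERE StatYear <= 2004

    -- elapsed time for the delete, in milliseconds
    SELECT DATEDIFF(ms, @start, GETDATE()) AS elapsed_ms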

    There's an easy way to see if you need to break things up, but it doesn't work well in most environments - do the deletion, and see if people complain....

    Otherwise, answer the questions above, and put the answers together to determine what you can and should do.

    R David Francis

  • Almost forgot - INSERTs most certainly can lock up the table when done in bulk, as can UPDATEs. Whether this is a problem for you or not depends on your situation, as I described above.

    To give you a little sample of the issues to consider, here's where I needed to do this: we receive data exports from a partner on a weekly basis. We can wind up needing to perform tens or hundreds of thousands of inserts or updates, which is a problem in an environment where we have dozens to hundreds of queries against this same table every minute.

    My solution was to store the IDs of the records to insert or update in a table variable. I grab the low and high ID values, use an increment of 10000, and set a counter to (low) + 10000. I insert or update all the records with IDs less than my counter, drop those same IDs from my table variable, increment the counter, and do it again until I pass the high ID. I'd post my actual code, but it's a bit too big, as I actually have nine different scenarios to account for, and updates to do to other tables at the end; a stripped-down sketch of the general pattern follows.
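
    The general shape, with placeholder table and column names rather than my real ones, is something like this:

    -- Placeholder names throughout: @PendingIDs, SourceTable, TargetTable, ID, SomeColumn
    DECLARE @PendingIDs TABLE (ID int PRIMARY KEY)
    DECLARE @low int, @high int, @counter int

    -- IDs of the rows that still need to be processed
    INSERT INTO @PendingIDs (ID)
    SELECT ID FROM SourceTable

    SELECT @low = MIN(ID), @high = MAX(ID) FROM @PendingIDs
    SET @counter = @low + 10000

    WHILE @counter <= @high + 10000
    BEGIN
        BEGIN TRAN

        -- Process (here: copy) the rows whose IDs fall below the counter
        INSERT INTO TargetTable (ID, SomeColumn)
        SELECT s.ID, s.SomeColumn
        FROM SourceTable s
        INNER JOIN @PendingIDs p ON p.ID = s.ID
        WHERE p.ID < @counter

        -- Drop the IDs just handled and move the window forward
        DELETE FROM @PendingIDs WHERE ID < @counter
        SET @counter = @counter + 10000

        COMMIT TRAN
    END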

    Hope this helps some.

    R David Francis

  • Thanks Guarddata and Rdfozz. I have learnt a lot from your answers.

    Just one final clarification.

    In the book I am using, it suggests that autocommit will treat each SQL statement as a transaction. BOL seems to confirm this (see below). I understood from this that the BEGIN TRAN and COMMIT TRAN were superfluous in the above code. What am I missing?

    In BOL

    quote:


    Autocommit mode is the default transaction management mode of Microsoft® SQL Server™. Every Transact-SQL statement is committed or rolled back when it completes. If a statement completes successfully, it is committed; if it encounters any error, it is rolled back. A SQL Server connection operates in autocommit mode whenever this default mode has not been overridden by either explicit or implicit transactions. Autocommit mode is also the default mode for ADO, OLE DB, ODBC, and DB-Library.

    A SQL Server connection operates in autocommit mode until a BEGIN TRANSACTION statement starts an explicit transaction, or implicit transaction mode is set on. When the explicit transaction is committed or rolled back, or when implicit transaction mode is turned off, SQL Server returns to autocommit mode.


  • You are correct about the Auto-Commit...but that applies to the entire procedure. In order to break it down we use the BEGIN and COMMIT.
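
    If you want to see it for yourself, @@TRANCOUNT shows whether an explicit transaction is currently open (0 means you are in autocommit mode):

    SELECT @@TRANCOUNT   -- 0: autocommit, no explicit transaction open
    BEGIN TRAN
    SELECT @@TRANCOUNT   -- 1: inside an explicit transaction
    COMMIT TRAN
    SELECT @@TRANCOUNT   -- back to 0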

    Guarddata-

  • So what this means is that all code contained in a procedure is a single statement and therefore treated as a single transaction.

    I guess the safest thing to do - till I am less confused - is to insert BEGIN TRAN / COMMIT TRAN even if not required.

    Is there anywhere which explains transactions better than the text above from BOL?
