UPDATE SET = NULL running over an hour

  • The following two statements are taking 1:13 hours to run. How can I speed it up?

    UPDATE dbo.CLAIMS SET admission_date = NULL WHERE admission_date IS NOT NULL;

    GO

    UPDATE dbo.CLAIMS SET discharge_date = NULL WHERE discharge_date IS NOT NULL;

    GO

    The CLAIMS table has 33,290,917 records.

    Any help is much appreciated.

    -Doug

  • Doug (12/2/2008)


    The following two statements are taking 1:13 hours to run. How can I speed it up?

    UPDATE dbo.CLAIMS SET admission_date = NULL WHERE admission_date IS NOT NULL;

    GO

    UPDATE dbo.CLAIMS SET discharge_date = NULL WHERE discharge_date IS NOT NULL;

    GO

    The CLAIMS table has 33,290,917 records.

    Any help is much appreciated.

    -Doug

    Break it into smaller batches instead of two huge updates.

    declare @BatchSize int;
    set @BatchSize = 10000;

    while exists(select 1 from dbo.Claims where admission_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL
        WHERE
            admission_date IS NOT NULL;

        -- I'd put a transaction log backup here
    end

    while exists(select 1 from dbo.Claims where discharge_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            discharge_date = NULL
        WHERE
            discharge_date IS NOT NULL;

        -- I'd put a transaction log backup here
    end

    The above is a SQL Server 2005 solution. If you are actually using SQL Server 2000, there will need to be some changes.

  • To add to what Lynn stated.... there are two reasons you need to break up such a large update... the first is the LOG file. Even with the SIMPLE recovery mode, the LOG file will grow to hold everything that happens in a single Update so it can be rolled back if interrupted.

    The second reason is that every system has a "tipping point". On most systems, such a simple Update will easily handle a million rows in a minute or less. For two million rows, just double the time to two minutes or less. It may even do 3 or 4 million rows in the expected 3 or 4 minutes. But then the system loses its mind... it may take an hour to do just 5 million rows and it may take 24 or more hours to do 32 million rows.

    The "tipping point" in each system varies depending on available memory, IO, LOF file size, etc, etc, and I know of no one who has a simple formula to predict that point. One query may easily do an Update on 4 million rows in 4 minutes, while another may not make it past 2 million rows. You can certainly test for the "tipping point" for each query, but it's usually easier to have a template to divide the work into million row updates that continue until done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
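
    A minimal sketch of such a template (not code from this thread; the batch size and the use of @@ROWCOUNT to end the loop are just one way to do it, and it assumes the same dbo.CLAIMS table and columns):

    declare @BatchSize int;
    declare @Rows int;
    set @BatchSize = 1000000;   -- tune this to stay under your system's "tipping point"
    set @Rows = 1;

    while @Rows > 0
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL
        WHERE
            admission_date IS NOT NULL;

        set @Rows = @@ROWCOUNT;  -- the loop ends once no rows are left to update

        -- log maintenance here: a t-log backup in FULL recovery, or a CHECKPOINT in SIMPLE
    end

    The same loop would then be repeated for discharge_date.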

  • P.S. You may want to stop that single Update you're running right now. It may take several hours to roll back, or it might roll back more quickly. But even such a simple update is going to take a huge amount of time on 32 million rows because of the "tipping point". It's worth stopping it and then restarting the Update in smaller, million row chunks.

    --Jeff Moden



  • One more thing... you're updating the same columns that you have in your criteria. Sometimes, and it's again unpredictable, that will cause a recompile for every row updated. You should probably capture the PK column info in a separate working table for rows that meet your criteria and then use that information as a join to do the update (a sketch of that approach follows this post). If you include an IDENTITY column in the working table, that'll also give you something to control the million row cycles with.

    --Jeff Moden


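
    A rough sketch of that working-table approach, assuming a hypothetical single-column integer primary key named claim_id on dbo.CLAIMS (substitute the real PK column):

    -- Capture the keys of the rows that need updating; the IDENTITY column
    -- numbers them so the loop can walk fixed-size ranges.
    CREATE TABLE #Work
    (
        RowNum   int IDENTITY(1,1) PRIMARY KEY,
        claim_id int NOT NULL   -- assumed PK column of dbo.CLAIMS
    );

    INSERT INTO #Work (claim_id)
    SELECT claim_id
      FROM dbo.CLAIMS
     WHERE admission_date IS NOT NULL;

    DECLARE @BatchSize int, @From int, @MaxRow int;
    SET @BatchSize = 1000000;
    SET @From = 1;
    SELECT @MaxRow = MAX(RowNum) FROM #Work;

    WHILE @From <= @MaxRow
    BEGIN
        -- The update joins on the captured keys, so it never filters on the
        -- column it is changing (the issue described above).
        UPDATE c
           SET admission_date = NULL
          FROM dbo.CLAIMS c
          JOIN #Work w ON w.claim_id = c.claim_id
         WHERE w.RowNum BETWEEN @From AND @From + @BatchSize - 1;

        SET @From = @From + @BatchSize;
    END

    The same pattern would then be repeated for discharge_date.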

  • Thanks Lynn for your input, but it's not helping me much. I ran the script you posted and the 1st statement had 213 batches and took 49 minutes. I let it keep running until it reached 59 minutes and went ahead and killed it. No one else is using the server so most all its resources were available. Perhaps I should go smaller than 10000.

    -Doug

  • Lynn, Jeff mentioned the log file and you had a comment about backing it up between statements; however, I am using simple recovery, and SQL Server 2005 doesn't let me run the BACKUP LOG statement against a database using the simple recovery model (it requires BACKUP DATABASE instead), so I did not do that in the script you provided.

    If the log file is the reason it still runs a long time, what can I do about that?

    I will try Jeff's method of creating a pk tmp table.

    -Doug

  • You should take transaction log backups while this is running to avoid log autogrows which could slow you down even more.


    * Noel

  • "You should take transaction log backups while this is running to avoid log autogrows which could slow you down even more. " * Noel

    This doesn't seem possible with database in simple recovery mode.

  • If your database is using the simple recovery model, t-log backups aren't needed. I put that comment in for two reasons: one, you didn't mention the recovery model the database is using; and two, I didn't know the name of the database and didn't feel like writing out the code needed for the t-log backup.

    Even though you are using the simple recovery model, you still need to batch the updates (see the sketch below).
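
    A minimal sketch of the batched loop under the simple recovery model, keeping the thread's table and column names, with a CHECKPOINT where a full-recovery database would take a log backup:

    declare @BatchSize int;
    set @BatchSize = 10000;

    while exists(select 1 from dbo.Claims where admission_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL
        WHERE
            admission_date IS NOT NULL;

        CHECKPOINT;  -- simple recovery: committed log space becomes reusable at a checkpoint
    end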

  • Doug (12/3/2008)


    Thanks Lynn for your input, but it's not helping me much. I ran the script you posted and the 1st statement had 213 batches and took 49 minutes. I let it keep running until it reached 59 minutes and went ahead and killed it. No one else is using the server so most all its resources were available. Perhaps I should go smaller than 10000.

    -Doug

    Not quite sure what you are telling me here. Were there 213 batches to process and after 49 minutes it still hadn't finished, or had it done 213 batches in 49 minutes? If the latter, it had already processed 21,300,000 records, which means you would have been close to finishing. Even though you killed it, you should know that all that would have rolled back when you killed it was the last batch of 10000 records you were updating. All previous batches would have been completed and written to disk.

  • [font="Arial Black"]IGNORE THIS POST, but I still want people to see it.[/font]

    My original code:

    declare @BatchSize int;
    set @BatchSize = 10000;

    while exists(select 1 from dbo.Claims where admission_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL
        WHERE
            admission_date IS NOT NULL;

        -- I'd put a transaction log backup here
    end

    while exists(select 1 from dbo.Claims where discharge_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            discharge_date = NULL
        WHERE
            discharge_date IS NOT NULL;

        -- I'd put a transaction log backup here
    end

    As I look at this code (and yours), something jumps out at me. We are setting admission_date and discharge_date to null in dbo.Claims where they aren't null, which implies there are already records in the table where these fields are null. So, let's just do it in one pass:

    declare @BatchSize int;
    set @BatchSize = 10000;

    while exists(select 1 from dbo.Claims where admission_date is not null)
    begin
        UPDATE TOP (@BatchSize) dbo.CLAIMS SET
            admission_date = NULL,
            discharge_date = NULL;
    end

    Either way, when this code finishes, all records in the table should have admission_date and discharge_date equal to null.

    The new code eliminates the test for null, and will use a table/clustered index scan. Hopefully you have a clustered index declared on the table.

    Doing it this way, you may actually be able to increase your batch size to 100,000 or even, as Jeff suggested, 1,000,000.

  • Never mind my post above, it won't work. I am having a brain dead day here at work and it is seeping over to here. You'd still need to test for not nulls.

  • Lynn Pettis (12/3/2008)


    Never mind my post above, it won't work. I am having a brain dead day here at work and it is seeping over to here. You'd still need to test for not nulls.

    Actually, I don't think you're brain dead at all. You're onto something: the end result of this query will be that all rows have NULLs in the admission_date and discharge_date columns.

    Given that, I would look at doing the batching, but do it by forcing a "walk the clustered index" scenario, so that you're (a) doing just one pass through the table, and (b) touching each data page only one time.

    You can always just tell it to ignore any rows where BOTH columns are already null (if there are a lot of those). If there wouldn't be a lot of those rows, then just batch them up by clustered index key and skip checking for null at all; that way you get around the issue Jeff was mentioning (a sketch of that key-range batching follows this post).

    Doug - what's the clustered index scenario? What's the column name/data type involved?

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
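
    A rough sketch of that key-range batching, assuming a hypothetical integer clustered key named claim_id on dbo.CLAIMS (the real column name and data type would come from Doug's answer):

    -- Walk the clustered index in fixed-size key ranges, updating every row in
    -- each range and never testing the date columns for NULL.
    DECLARE @KeyFrom int, @KeyMax int, @RangeSize int;
    SET @RangeSize = 1000000;
    SELECT @KeyFrom = MIN(claim_id), @KeyMax = MAX(claim_id) FROM dbo.CLAIMS;

    WHILE @KeyFrom <= @KeyMax
    BEGIN
        UPDATE dbo.CLAIMS
           SET admission_date = NULL,
               discharge_date = NULL
         WHERE claim_id >= @KeyFrom
           AND claim_id <  @KeyFrom + @RangeSize;

        SET @KeyFrom = @KeyFrom + @RangeSize;

        CHECKPOINT;  -- simple recovery: keep the log from growing between ranges
    END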

  • Lynn Pettis (12/3/2008)


    Doug (12/3/2008)


    Thanks Lynn for your input, but it's not helping me much. I ran the script you posted and the 1st statement had 213 batches and took 49 minutes. I let it keep running until it reached 59 minutes and went ahead and killed it. No one else is using the server so most all its resources were available. Perhaps I should go smaller than 10000.

    -Doug

    Not quite sure what you are telling me here. Were there 213 batches to process, and after 49 minutes it still hadn't finished or in 49 minutes it had done 213 batches? If the later, it had already processed 21,300,000 records, which means you would have been close to finishing. Even though you killed it, you should know that all that would have rolled back when you killed it was the last batch of 10000 records you were updating. All previous batches will have been completed and written to disk.

    213 batches completed the first statement in your script for the entire table:

    UPDATE TOP (@BatchSize) dbo.CLAIMS SET
        admission_date = NULL
    WHERE
        admission_date IS NOT NULL;

    You are right that it processed all 21,300,000 records that had NOT NULL admission date (all in 49 mins). It started the 2nd statement for the discharge_date but I stopped it after 10 mins since it was taking too long. If I had let it continue, it most likely would have been another 49 mins since it had to update the same number of records for the NOT NULL discharge dates. I added a BEGIN TRANSACTION in front of your script so I could ROLLBACK afterwards, which I did.
