A Case FOR Cursors...

  • Andy DBA (6/1/2015)


    Thank you for injecting some humor into my Monday morning. "sql God"? "SQL-Peackocks" [sic]? Really? Because I'm more concerned about database performance than I am about the competence of my successor? If you're letting someone maintain your code who's befuddled by those darn parens and confusing ANDs and ORs, then you deserve the consequences of their efforts.

    OF COURSE changing the position of a paren can radically change what the statement does! That's what they are for. Open the hood of your car and swap two spark plug wires around. OMG your engine may still start, but its performance has been UTTERLY DESTROYED. What a bad design! Now use that as an argument that horses are better because they don't have those confusing spark plug wires for an incompetent mechanic to mess up.

    If you have your befuddled developer rewrite that complicated and tricky set-based code as a row-by-row "solution", I'll bet the results will be more buggy and less maintainable than before, and the code will still have those confusing parens, ANDs and ORs that need to be in just the right place. I almost forgot: it will also probably gobble up more than a few extra nanoseconds, too. But who cares if your end users have to stare at that spinning hula hoop a little longer, or if your system now gets all these mysterious deadlocks. It's not like it's a high-speed trading system or anything.

    I think you grossly misunderstood my point. I was not saying that all set based operations should be replaced by cursor-loops. My point could be distilled to "Just because you can, doesn't mean you should".

    There are certain things that are simply more easily understood by most people (even SQL gods) than overly complex SQL statements containing complex business logic. I happen to work in the real world, and yes, I've seen the other comments about re-evaluating your hiring practices, making sure you don't have a "befuddled" successor, etc. Great. Wish I lived in that world where everyone was as perfect as you seem to think you are (on second thought, no I don't).

    I'd rather have a project that's easily maintainable and less prone to a mistake when "the new guy" changes a business rule, even if it costs a few more nanoseconds, than one that is only maintainable by the highest-paid top SQL-jockeys (to throw in another term for your Monday amusement).

    There are simply some things that fall into the category I mentioned and to refuse to acknowledge that is arrogant. Those of us down here in the trenches of the real world recognize that, and some of us actually know what we're doing. Sorry you disagree.

    Jim

  • ray 64276 (6/1/2015)


    Cursors can have higher performance in many cases when dealing with large updates. There are a couple of reasons for this:

    - The deleted and inserted tables used by large triggers may have no indexes or poor indexes. That alone can make all the difference in the world when updating large chunks of data.

    - The locking when updating huge chunks of data in a single set operation in a production system can be prohibitive. Updating them one by one has little impact.

    A decent pattern I've seen mentioned here at SSC is to break up large updates into smaller batches, and I've had good success with this (more typically with big deletes). If you can handle many rows during each loop iteration, you can spread the loop cost across those many rows and avoid some of the row-by-row penalties.
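    For illustration, a minimal sketch of that batching pattern might look like the following (the table name, predicate, and batch size are made up, not from any real system):

    -- Hypothetical purge: delete in small chunks so each transaction stays
    -- short and locks are released between iterations.
    DECLARE @BatchSize int = 10000,
            @RowsDeleted int = 1;

    WHILE @RowsDeleted > 0
    BEGIN
        DELETE TOP (@BatchSize)
        FROM dbo.SomeBigTable              -- made-up table name
        WHERE CreatedDate < '20140101';    -- made-up purge predicate

        SET @RowsDeleted = @@ROWCOUNT;     -- 0 when nothing is left to delete
    END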

  • Emph. mine.

    Sioban Krzywicki (6/1/2015)


    Is maintaining a complex SQL query more difficult than maintaining a cursor? Sometimes, but in the end, performance is king. The code isn't there to make your life easy, it is there to make the system usable and the end-user's life easy. Make it easy to maintain, sure, but it can't be at the cost of end-user happiness.

    Amen. It always comes down to performance. I have never heard someone complain that an application was too fast. Cursors, loops, and rCTEs all perform fine when we're dealing with 50 rows of data, but I live in a world where we're dealing with millions, even tens of millions, of rows, and performance is key.

    Until someone, anyone, can post an example where a cursor or other iterative method was faster than a well-written set-based solution, I will remain convinced that the set-based solution is the way to go. Even the author of the article could not provide an example of where a cursor was the way to go, only that they exist in an out-of-the-box SQL Server install.

    I have found, too, that set-based code is clearly the easiest code to maintain. Cursor code is kludgy and counter-intuitive. For example, you don't need to deallocate and close a tally table.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Not a very helpful article. A really helpful article would have gone something like...

    Here are some places where you might consider a cursor...

    Here are places where you just plain need a cursor... (like doing [process] to a list of databases that must be done serially)

    Here are places/instances where you do NOT want a cursor.

  • pietlinden (6/1/2015)


    Not a very helpful article. A really helpful article would have gone something like...

    Here are some places where you might consider a cursor...

    Here are places where you just plain need a cursor... (like doing [process] to a list of databases that must be done serially)

    FWIW you can do this without a cursor or loop. You do it by building a large dynamic sql statement (and you can do that using set-based logic) that concatenates the operations on each database together. Then execute the dynsql. Of course you're limited by the maximum size of NVARCHAR(max), but 2**30-1 is a pretty big number.

    Not claiming that this is better, just different and cursorless. (A rough sketch of what I mean follows at the end of this post.)

    Here are places/instances where you do NOT want a cursor.

    Gerald Britton, Pluralsight courses
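    A rough sketch of the cursorless dynamic-SQL approach described above (the DBCC command is only an illustrative placeholder; any per-database operation could be concatenated the same way):

    -- Build one big command string from sys.databases (set-based), then run it once.
    DECLARE @sql nvarchar(max) = N'';

    SELECT @sql = @sql
        + N'USE ' + QUOTENAME(name) + N'; '
        + N'DBCC CHECKDB(' + QUOTENAME(name) + N') WITH NO_INFOMSGS; '  -- placeholder operation
    FROM sys.databases
    WHERE database_id > 4                  -- skip the system databases
      AND state_desc = N'ONLINE';

    EXEC (@sql);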

  • g.britton (6/1/2015)


    pietlinden (6/1/2015)


    Not a very helpful article. A really helpful article would have gone something like...

    Here are some places where you might consider a cursor...

    Here are places where you just plain need a cursor... (like doing [process] to a list of databases that must be done serially)

    FWIW you can do this without a cursor or loop. You do it by building a large dynamic sql statement (and you can do that using set-based logic) that concatenates the operations on each database together. Then execute the dynsql. Of course you're limited by the maximum size of NVARCHAR(max), but 2**30-1 is a pretty big number.

    Not claiming that this is better, just different and cursorless.

    Here are places/instances where you do NOT want a cursor.

    Much of the time I consider Dynamic SQL to be worse than cursors or at least just as bad in different ways.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • g.britton (6/1/2015)


    pietlinden (6/1/2015)


    Not a very helpful article. A really helpful article would have gone something like...

    Here are some places where you might consider a cursor...

    Here are places where you just plain need a cursor... (like doing [process] to a list of databases that must be done serially)

    FWIW you can do this without a cursor or loop. You do it by building a large dynamic sql statement (and you can do that using set-based logic) that concatenates the operations on each database together. Then execute the dynsql. Of course you're limited by the maximum size of NVARCHAR(max), but 2**30-1 is a pretty big number.

    Not claiming that this is better, just different and cursorless.

    Here are places/instances where you do NOT want a cursor.

    It's surprisingly easy to hit varchar(max) when trying to generate dynamic sql for large operations.

  • ZZartin (6/1/2015)


    g.britton (6/1/2015)


    pietlinden (6/1/2015)


    Not a very helpful article. A really helpful article would have gone something like...

    Here are some places where you might consider a cursor...

    Here are places where you just plain need a cursor... (like doing [process] to a list of databases that must be done serially)

    FWIW you can do this without a cursor or loop. You do it by building a large dynamic sql statement (and you can do that using set-based logic) that concatenates the operations on each database together. Then execute the dynsql. Of course you're limited by the maximum size of NVARCHAR(max), but 2**30-1 is a pretty big number.

    Not claiming that this is better, just different and cursorless.

    Here are places/instances where you do NOT want a cursor.

    It's surprisingly easy to hit varchar(max) when trying to generate dynamic sql for large operations.

    Really??? Varchar(max) is 2GB worth of characters. That is over 2 billion characters. The bible has about 3.5 million characters. In other words, you can fit more than 600 copies of the entire King James bible in English in a varchar(max). You must be doing some unbelievably large operations.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • g.britton (6/1/2015)


    Actually, I challenge that claim. Please post some actual results.

    Alan.B (6/1/2015)


    Until someone, anyone, can post an example where a cursor or other iterative method was faster than a well-written set-based solution I will remain convinced that the set-based solution is the way to go.

    Just for kicks, I thought I'd respond to these as a thought exercise.

    First, create a sample table with a bunch of random data:

    set nocount on

    create table dbo.foobar (foo int)

    declare @x int = 1
    while @x <= 100000 begin
        insert into dbo.foobar (foo) values (round(rand()*1000000,0))
        set @x = @x + 1
    end

    Now, let's create a little activity on the system. This simulates imaginary users who are querying this ugly table on a regular basis and it will keep running until you stop it:

    declare @now time, @now_string char(12), @before time, @delay int, @dummy int

    while 1=1 begin
        select @dummy = count(*) from dbo.foobar where foo = (round(rand()*1000000,0))
        set @now = GetDate(); set @now_string = cast(@now AS char(12)); set @delay = datediff(ms, @before, @now)-2000
        RAISERROR ('It is now %s, query took approximately %i ms', 0, 1, @now_string, @delay) WITH NOWAIT
        set @before = @now
        waitfor delay '00:00:02'
    end

    I get this started on my test server and see a bunch of delays in the 15-30 ms range.

    Now, I open a second window, and do a big set-based update of all rows:

    update dbo.foobar set foo = foo - 1

    The update itself is almost instant; but when I look back at the first window, I see a blip, a delay of 1314 ms. Our users noticed a little hiccup there!

    Now I do the same thing with a cursor:

    declare @dummy int

    declare curfoo cursor local forward_only
        for select foo from dbo.foobar
        for update of foo

    open curfoo; fetch next from curfoo into @dummy
    while @@fetch_status = 0 begin
        update dbo.foobar set foo = foo - 1 where current of curfoo
        fetch next from curfoo into @dummy
    end
    close curfoo; deallocate curfoo

    This is way slower, it takes, like, 30 seconds to run; however, over in my original window, the delays go up only slightly: they're about 25 ms to 40 ms.

    I would conclude from this what has been intuitively suggested: although a cursor is way less efficient at doing a given operation, it might have lower impact on others. I guess whether you would call this "faster" is debatable, though... faster for others on the system, not for the cursor operation. And I imagine someone here might be able to suggest a better set-based solution, or point out that I should be using a different isolation level...? (Seriously! I mean, this could be a learning experience for me.)

  • There are some operations where you don't want to do straight set based because it will lock an important table for too long. My usual compromise is a while loop processing 3000 rows at a time to avoid SQL server escalating a lock to a table lock. On large tables sometimes that's too slow so I might process 100,000 at a time with a wait for delay embedded in the while loop. That minimizes disruptions from table locks.
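    A sketch of that while-loop-with-delay pattern might look like the following (table, column, and batch size are hypothetical; the right numbers depend on the workload):

    DECLARE @BatchSize int = 3000,   -- small enough to stay under lock escalation
            @Rows int = 1;

    WHILE @Rows > 0
    BEGIN
        UPDATE TOP (@BatchSize) dbo.BigTable   -- made-up table/column names
        SET    IsProcessed = 1
        WHERE  IsProcessed = 0;

        SET @Rows = @@ROWCOUNT;

        WAITFOR DELAY '00:00:01';    -- give other sessions room between batches
    END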

  • Of course, after all my work, I realize the response to my crude example might be:

    update dbo.foobar with (rowlock) set foo = foo - 1

  • Alan.B (6/1/2015)


    Amen. It always comes down to performance.

    Alan, totally agree with you that "it always comes down to performance". I threw together a quick and dirty example to illustrate a case where an iterative method is faster.

    First create the working dataset:

    CREATE TABLE dbo.TestData ( Id BIGINT IDENTITY(1,1) PRIMARY KEY, TestCol BIT )

    INSERT dbo.TestData ( TestCol ) SELECT 0

    WHILE ( SELECT COUNT(1) FROM dbo.TestData ) <= 500000000
        INSERT dbo.TestData (TestCol) SELECT TestCol FROM dbo.TestData

    CREATE TABLE dbo.TestData_History ( StartTime DATETIME2, EndTime DATETIME2, RowsUpdated INT )

    (Originally I had aspirations to wait for 500 million records... but I halted this insert script after about 68 million.)

    Now let's perform one massive update and track the execution time:

    DECLARE @newVal BIT = 1, @startTime DATETIME2
    SET @startTime = GETDATE()

    UPDATE TestData SET TestCol = @newVal

    INSERT dbo.TestData_History ( StartTime, EndTime, RowsUpdated )
    SELECT @startTime, GETDATE(), @@ROWCOUNT

    Locally, this 68-million-record update took 153 seconds (over 2 minutes).

    Now let's perform a series of smaller updates, each limited to 10,000 records:

    DECLARE @newVal BIT = 0, @startTime DATETIME2, @min INT, @max INT, @maxID INT

    SELECT @startTime = GETDATE(), @min = 1, @max = 10001, @maxID = MAX(ID) FROM dbo.TestData

    WHILE (@min < @maxID)
    BEGIN
        UPDATE dbo.TestData SET TestCol = @newVal WHERE Id >= @min AND Id < @max

        INSERT dbo.TestData_History ( StartTime, EndTime, RowsUpdated )
        SELECT @startTime, GETDATE(), @@ROWCOUNT

        SELECT @min = @max
              ,@max = @max + 10000
              ,@startTime = GETDATE()
    END

    The entire process took 113 seconds (LESS than 2 minutes).

    But there are more important things to note. How do you define "performance"? Working with very large OLTP databases, it is even more important to me that (1) end users are not impacted by blocking, and (2) downstream systems, including transactional replication, are not affected. Performing the single large update resulted in my TestData table blocking all other update operations for over 2 minutes. But even more scary is that this single update caused my transaction log to swell over 21 GB, which is going to cause massive issues for the replication architecture downstream. Basically: (1) I'll be lucky if I have enough disk to cover my 21-GB swell, (2) I'll be lucky if I have enough disk to cover the swell that the Distribution database is about to be subjected to, (3) I'll be lucky if replication doesn't fall so far behind that I need to reinitialize my subscribers (and thus bring down my read-only copies), (4) I'll be lucky if my customers are not impacted by my Publisher database, which is completely blocked, AND my Subscriber databases, which are out of sync, and (5) I'll be lucky if I have a job tomorrow!!!

    And this was a simple update statement. Can you imagine if my update involved some sort of GROUP BY, which would then throw tempdb into the mix? 🙂

    On the other hand, updating in small batches meant that each update of 10,000 records finished in 10-20 milliseconds. This allows concurrent transactions to execute in between the smaller updates. My transaction log (after shrinking it back to the original 2 GB once the massive update finished) did not have to grow a single time. The log reader job for transactional replication is happy because it works better with smaller transactions. And maybe I can take a vacation this summer, because I'm still employed!

    In my opinion, "blanket statements" suck. They are never 100% applicable to everything. Cursors are not THE DEVIL. They are improperly used in many cases, but sometimes (especially in very-large OLTP datasets) they are the best way to go. I wish I only had to worry about 10-million record tables........ unfortunately many of our tables are around the 500-million mark, and I find more and more proper (and necessary) uses of cursors as the size continues to grow.

    Hope you all enjoy this example... shoot me a message if you have any suggestions.

  • wbrianwhite 33048 (6/1/2015)


    There are some operations where you don't want to do straight set based because it will lock an important table for too long. My usual compromise is a while loop processing 3000 rows at a time to avoid SQL server escalating a lock to a table lock. On large tables sometimes that's too slow so I might process 100,000 at a time with a wait for delay embedded in the while loop. That minimizes disruptions from table locks.

    I'd argue that a pure WHILE loop is still the wrong way to go. However you could use a WHILE loop to put your UPDATEs into batches to minimize blocking, or is that what you mean?

    Gerald Britton, Pluralsight courses

  • [snip]

    This is way slower, it takes, like, 30 seconds to run; however, over in my original window, the delays go up only slightly: they're about 25 ms to 40 ms.

    I would conclude from this what has been intuitively suggested: although a cursor is way less efficient at doing a given operation, it might have lower impact on others. I guess whether you would call this "faster" is debatable, though... faster for others on the system, not for the cursor operation. And I imagine someone here might be able to suggest a better set-based solution, or point out that I should be using a different isolation level...? (Seriously! I mean, this could be a learning experience for me.)

    I think you've proved the main point. Using a CURSOR is usually the wrong way to go. Now, if it's decreasing the impact on other users, that's where snapshot isolation can help. (A minimal sketch of the relevant settings follows below.)

    Gerald Britton, Pluralsight courses
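    For anyone curious, the knobs involved look roughly like this (the database name is hypothetical, and whether RCSI or full snapshot isolation is appropriate, plus the tempdb version-store overhead, is a separate discussion):

    -- Database-level options (run once, against a hypothetical database):
    ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;   -- readers use row versions by default
    ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;  -- sessions may opt in to SNAPSHOT

    -- A reader session that opts in no longer blocks behind the big update:
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    SELECT COUNT(*) FROM dbo.foobar WHERE foo = 42;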

  • Brian J. Parker (6/1/2015)


    g.britton (6/1/2015)


    Actually, I challenge that claim. Please post some actual results.

    Alan.B (6/1/2015)


    Until someone, anyone, can post an example where a cursor or other iterative method was faster than a well-written set-based solution I will remain convinced that the set-based solution is the way to go.

    [snip]

    I would conclude from this what has been intuitively suggested: although a cursor is way less efficient at doing a given operation, it might have lower impact on others. I guess whether you would call this "faster" is debatable, though... faster for others on the system, not for the cursor operation. And I imagine someone here might be able to suggest a better set-based solution, or point out that I should be using a different isolation level...? (Seriously! I mean, this could be a learning experience for me.)

    Thanks for throwing this together, Brian. I was going to put together a better way to do this, but I'm under the gun at the moment and wanted to look a little deeper at your update code. I will get back to you about that later tonight unless someone beats me to it.

    In the meantime, however, to emphasize why I like set-based solutions vs. cursors or loops, let's look at two approaches for creating the sample data that you created. I did 100K rows, then 1M rows...

    SET NOCOUNT ON

    /***** 100K Row Test *****/
    IF OBJECT_ID('dbo.foobar') IS NOT NULL DROP TABLE dbo.foobar
    GO
    CREATE TABLE dbo.foobar (foo int);

    PRINT 'Creating 100K rows of Sample Data using a loop:';
    DECLARE @starttime datetime = getdate();
    declare @x int = 1
    while @x <= 100000
    begin
        insert into dbo.foobar (foo) values (round(rand()*1000000,0))
        set @x = @x + 1
    end
    PRINT DATEDIFF(MS,@starttime, getdate());
    GO

    TRUNCATE TABLE dbo.foobar;
    PRINT 'Creating 100K rows of Sample Data using a set-based method:';
    DECLARE @starttime datetime = getdate();
    WITH
    L1 AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(N))
    INSERT INTO dbo.foobar
    SELECT TOP 100000 ABS(CHECKSUM(newid())%1000000)+1 --rand() will not work here for me so I'm creating a random number a different way.
    FROM L1 a, L1 b, L1 c, L1 d, L1 e, L1 f;
    PRINT DATEDIFF(MS,@starttime, getdate());
    GO

    /***** 1M Row Test *****/
    TRUNCATE TABLE dbo.foobar;
    IF OBJECT_ID('dbo.foobar') IS NOT NULL DROP TABLE dbo.foobar
    GO
    CREATE TABLE dbo.foobar (foo int);

    PRINT 'Creating 1M rows of Sample Data using a loop:';
    DECLARE @starttime datetime = getdate();
    declare @x int = 1
    while @x <= 1000000
    begin
        insert into dbo.foobar (foo) values (round(rand()*1000000,0))
        set @x = @x + 1
    end
    PRINT DATEDIFF(MS,@starttime, getdate());
    GO

    TRUNCATE TABLE dbo.foobar;
    PRINT 'Creating 1M rows of Sample Data using a set-based method:';
    DECLARE @starttime datetime = getdate();
    WITH
    L1 AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(N))
    INSERT INTO dbo.foobar
    SELECT TOP 1000000 ABS(CHECKSUM(newid())%1000000)+1
    FROM L1 a, L1 b, L1 c, L1 d, L1 e, L1 f;
    PRINT DATEDIFF(MS,@starttime, getdate());
    GO

    And the results:

    Creating 100K rows of Sample Data using a loop:              950 ms
    Creating 100K rows of Sample Data using a set-based method:  106 ms
    Creating 1M rows of Sample Data using a loop:               7416 ms
    Creating 1M rows of Sample Data using a set-based method:   1083 ms

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001
