Eliminating Cursors

  • Mike C wrote:

    "Very, very, very rarely is a cursor absolutely necessary.  There is almost always a way to conquer T-SQL tasks using T-SQL, sans cursors.  Over-reliance on cursors is a fallback position for many people who come from backgrounds in flat-file processing, and want to use SQL Server as a flat-file parsing tool."

    OK Mike and all you other cursor exterminators, do you have ideas on how to use a set based approach when a data set has to be distributed over 2 or more "categories" that overlap? 

    That question sounds pretty vague, so the following real-world example should make the problem clearer.  I couldn't figure out how to solve it without a RBAR (Row By Agonizing Row) method (thanks for the "Modenism", Jeff.):

    Suppose you have a table of payables that need to be referenced by records in a table of payments.  The payment table associates amounts from fund sources to the payable records.  The business rule states that payments are made against fund source A until it is exhausted and then fund source B is used.

    When a row or rows is inserted in the payable table, the insert trigger creates payment records based on that rule.

    Here are the simplified table definitions:

    CREATE TABLE Payable (
        PayableID INT IDENTITY,
        Amount MONEY
    )

    CREATE TABLE Payment (
        PaymentID INT IDENTITY,
        PayableID INT,
        AmountPaid MONEY,
        FundSourceID CHAR(1)
    )

    Here's the insert trigger's pseudocode using a RBAR solution:

    Create cursor on inserted payable records
    While inserted payable records remain
        If payable amount <= fund source A balance
            Insert payment record with payable's ID, fund source A ID, and payable amount
        Else if fund source A balance > 0
            Insert payment record with payable's ID, fund source A ID, and fund source A's balance
            Insert payment record with payable's ID, fund source B ID, and payable amount - fund source A's balance
        Else
            Insert payment record with payable's ID, fund source B ID, and payable amount
        End If
        Get next payable record
    End Loop

    In case this is not clear, here's some sample "data" to illustrate the business need:

    Inserted Payable Records - These need payments from fund sources.

    ID Amount

    1 $10

    2 $20

    3 $30

    Fund Sources - (Pay from Fund A first)

    ID FundAllocation

    A  $15

    B  $1000

    Payment Records - These are the payments inserted by the trigger

    ID  PayableID  AmountPaid  FundSourceID
    1   1          $10         A
    2   2          $5          A
    3   2          $15         B
    4   3          $30         B

    You can see that PayableID 2 had to be split across sources A and B.  This is the part that confounds me when trying to create a set-based solution.

    Obviously there's error-checking code elsewhere to prevent overdrawing fund source B, etc.  What I'm interested in is a set-based solution for a "rollover" situation like this.

  • It's a variation of the "knapsack" problem.  This is a very old problem, and if you've ever spent time trying to squeeze one more thing into a suitcase for a vacation or business trip you've faced this problem.  The knapsack problem is concerned with the most profitable and efficient way to fill a knapsack with items of differing values and/or proportions.  This is a very common problem when trying to figure out the most efficient use of space in cargo containers, but a variation applies here.  In this instance I'm filling the knapsack with pennies from two jars; we can assume all the pennies are the same proportions and value.  After I assign the pennies from the jars to their respective knapsacks, I recombine them into larger results [dollars and cents].

    Like I said, cursors are used a lot more than they need to be, and they're overused mostly by people who can't wrap their heads around set-based logic.  Not to blame them, though; we normally don't (consciously) think in set-based mathematical formulae, and that's reflected in most of the popular programming languages.  We tend to think in a step-by-step procedural fashion, and set-based logic takes a while to get used to.

    So anyway here you go, a set-based solution to your problem from Mike C the cursor exterminator:

    CREATE TABLE #Payable (
        PayableID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
        Amount MONEY
    )

    INSERT INTO #Payable (Amount) VALUES (10.00)
    INSERT INTO #Payable (Amount) VALUES (20.00)
    INSERT INTO #Payable (Amount) VALUES (30.00)

    CREATE TABLE #Funds (
        FundID CHAR(1),
        Amount NUMERIC(10, 2)
    )

    INSERT INTO #Funds (FundID, Amount) VALUES ('A', 15.00)
    INSERT INTO #Funds (FundID, Amount) VALUES ('B', 1000.00)

    CREATE TABLE #Payment (
        PaymentID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
        PayableID INT,
        AmountPaid NUMERIC(10, 2),
        FundSourceID CHAR(1)
    )

    -- Build a 10,000-row numbers table: 10,000 pennies = $100.00, which
    -- covers the cumulative payable total in this sample
    SELECT TOP 10000 IDENTITY(INT, 1, 1) AS Num
    INTO #Numbers
    FROM dbo.syscolumns a
    CROSS JOIN dbo.syscolumns b
    CROSS JOIN dbo.syscolumns c

    -- One row per penny, expressed in dollars: 0.01, 0.02, ..., 100.00
    CREATE TABLE #Dollars (Num NUMERIC(10, 2) NOT NULL PRIMARY KEY)

    INSERT INTO #Dollars (Num)
    SELECT CAST(n.Num AS NUMERIC(10, 2)) / 100.0
    FROM #Numbers n

    -- Explode payables and funds into one row per penny, numbering each
    -- penny by its running (cumulative) position.  Joining the two
    -- expansions on position pairs every payable penny with the fund
    -- penny that covers it; the GROUP BY recombines the pennies into
    -- one payment row per (payable, fund) pair.
    INSERT INTO #Payment (PayableID, FundSourceID, AmountPaid)
    SELECT a.PayableID, a.FundID, COUNT(*) / 100.0
    FROM
    (
        SELECT PayableID, FundID
        FROM
        (
            SELECT p.PayableID,
                   (SELECT COALESCE(SUM(Amount), 0)
                      FROM #Payable p1
                     WHERE p1.PayableID < p.PayableID) + #Dollars.Num AS Num
            FROM #Dollars
            JOIN #Payable p ON (#Dollars.Num BETWEEN 0.01 AND p.Amount)
        ) Payable_Items
        JOIN
        (
            SELECT f.FundID,
                   (SELECT COALESCE(SUM(Amount), 0)
                      FROM #Funds f1
                     WHERE f1.FundID < f.FundID) + #Dollars.Num AS Num
            FROM #Dollars
            JOIN #Funds f ON (#Dollars.Num BETWEEN 0.01 AND f.Amount)
        ) Fund_Items
        ON Fund_Items.Num = Payable_Items.Num
    ) a
    GROUP BY a.PayableID, a.FundID

    SELECT *
    FROM #Payment

    -- Draw down each fund by the total paid from it (COALESCE guards
    -- funds from which nothing was paid)
    UPDATE #Funds
    SET Amount = Amount -
        COALESCE((SELECT SUM(p.AmountPaid)
                    FROM #Payment p
                   WHERE p.FundSourceID = #Funds.FundID), 0)

    SELECT *
    FROM #Funds

    DROP TABLE #Dollars
    DROP TABLE #Numbers
    DROP TABLE #Funds
    DROP TABLE #Payment
    DROP TABLE #Payable

     

  • Wow!  Look Mom, no cursors! 

    I'll have to expand my set-based thinking to include the imaginary set of "all" possible values.  (The #Dollars table in your example.)

    Thanks Mike!

     

    No problem.  The most useful invention ever created for converting procedural code to set-based code (IMHO) is the inner join to a basic numbers table.  The #Dollars table is just a decimal numbers table with two digits of precision, representing pennies; come to think of it, a better name for it probably would have been "#Pennies".  It's pretty amazing what you can do with a numbers table once you start to get the hang of it.
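    To make the trick concrete, here's a tiny, self-contained sketch (the #Item table and its columns are invented for the demo): one join to a numbers table expands each row into one row per unit, with no loop in sight.

```sql
-- Build a small numbers table the same way as in the solution above
SELECT TOP 100 IDENTITY(INT, 1, 1) AS Num
INTO #Nums
FROM dbo.syscolumns

-- Hypothetical demo data
CREATE TABLE #Item (ItemID INT, Qty INT)
INSERT INTO #Item VALUES (1, 3)
INSERT INTO #Item VALUES (2, 2)

-- One output row per unit: ItemID 1 appears 3 times, ItemID 2 twice.
-- This is the same join that turned payables into pennies above.
SELECT i.ItemID, n.Num AS UnitNo
FROM #Item i
JOIN #Nums n ON n.Num BETWEEN 1 AND i.Qty
ORDER BY i.ItemID, n.Num

DROP TABLE #Item
DROP TABLE #Nums
```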

    "So SQL Server has always had cursors because of its origins.  But set-based queries are orders of magnitude more efficient.  I once re-wrote an entire order processing system originally written using cursors to use set-based queries.  Prior to the re-write it would take approximately 1.5 - 2 hours to process 10,000 orders.  Post re-write the same process took less than 2 minutes."

    I've had similar experiences with replacing cursors.  By and large, the code that I have changed from cursor to set-based has been 100's to 1000's of times faster.  The few exceptions where the cursor was faster turned out to be issues with the way in which the tables were indexed.  Once we resolved those issues, the cursor-based and set-based solutions both improved, but the set-based solutions ended up outperforming the cursors by a wide margin.

  • "It's a variation of the "knapsack" problem. "

    Darn... I was too slow.  But I'll post my code anyway, as another variation of the same concept.

    if object_id('Payable') is not null drop table Payable

    create table Payable (
        ID int identity,
        Amount int
    )

    insert Payable values (10)
    insert Payable values (20)
    insert Payable values (30)

    if object_id('Source') is not null drop table Source

    create table Source (
        ID char(1),
        Amount int
    )

    insert Source values ('A', 15)
    insert Source values ('B', 35)
    insert Source values ('C', 100)

    -- Number each unit of each payable and each unit of each source by
    -- its running position (prior cumulative total + 1 through the
    -- cumulative total including the current row), join the two
    -- expansions on position, and collapse back to (payable, source)
    -- amounts.  dbo.Number(NumberID) is a standard numbers table.
    select p.ID as PayableID, s.ID as SourceID,
           (max(p.NumberID) - min(p.NumberID)) + 1 as Amount
    from
    (
        select p1.ID, pn.*
        from
        (
            select p2.ID,
                   isnull((select sum(Amount)
                             from Payable p3
                            where p3.ID < p2.ID), 0) + 1 as minAmount,
                   isnull((select sum(Amount)
                             from Payable p3
                            where p3.ID < p2.ID), 0) + p2.Amount as maxAmount
            from Payable p2
        ) p1
        join dbo.Number pn on pn.NumberID between p1.minAmount and p1.maxAmount
    ) as p
    join
    (
        select s1.ID, sn.*
        from
        (
            select s2.ID,
                   isnull((select sum(Amount)
                             from Source s3
                            where s3.ID < s2.ID), 0) + 1 as minAmount,
                   isnull((select sum(Amount)
                             from Source s3
                            where s3.ID < s2.ID), 0) + s2.Amount as maxAmount
            from Source s2
        ) s1
        join dbo.Number sn on sn.NumberID between s1.minAmount and s1.maxAmount
    ) as s on p.NumberID = s.NumberID
    group by p.ID, s.ID

    Signature is NULL

    My recollection is that Sybase added cursors in the early '90s to keep from being run out of business by Oracle.  In the mid-'80s I remember listening to a Sybase dude talk about how horrible cursors were and that no one using Sybase should ever need to use them.  After getting laughed at for too long, Sybase changed its tune.

    Now if you guys think that those who use cursors are nuts/stupid/incompetent (non-relational heavens forbid), go right ahead. 

    All it is is a tool.  Those who categorically refuse to even consider using it are hopeless!!!  You can join the guy who thinks using temp tables is a mortal sin!

     

     

     

  • "All it is is a tool.  Those who categorically refuse to even consider using it are hopeless!!!  You can join the guy who thinks using temp tables is a mortal sin!"

    If 99.99999% of your problems are 'nails', the tool you should probably be using is a hammer...  not a hacksaw.

    I would not "categorically refuse to even consider using (cursors)".  I would simply say that they should be a last resort.  I can't think of any good reason why you would want to use a RBAR solution when a set-based solution is available.  In more than 20 years of coding for a variety of databases I have yet to see a problem that demands a cursor.  Occasionally (very, very, very rarely) you need to cut that nail with a hacksaw.  It's good that cursors exist.

     

     

    Ron Cicotte
    ron.cicotte@gmail.com
    Data Transformations, LLC

  • It's important to remember to balance the cost of the time to run a routine versus the cost of the time to develop the routine.

    If there's something you only expect to ever need to do once or twice, and if a cursor solution provides adequate speed for your needs, don't spend half a day figuring out a set-based solution (unless your job is such that learning the new techniques involved is viewed as a benefit in and of itself).

    Also, there's the issue of maintainability:  Will the next person to touch the code be able to follow the SQL you've put together?  I've been explicitly asked to keep my solutions very simple, so as to avoid confusing others I've worked with.  Even if the solution isn't necessarily optimal (as long as it's not too bad).

    Of course, if you're going to use cursors, make them LOCAL STATIC; otherwise, performance can be truly horrible.  (I've seen performance increases of precisely the sort mentioned here for cursor-to-set-based conversions achieved simply by adding LOCAL STATIC to a cursor declaration.)
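    For anyone unfamiliar with the syntax, here's a minimal sketch of the declaration I mean (the dbo.Orders table and its column are invented for the example).  LOCAL scopes the cursor to the current batch or procedure, and STATIC materializes the result set once in tempdb so the loop reads a snapshot instead of going back to the base tables:

```sql
DECLARE @OrderID INT

-- Hypothetical cursor over an invented dbo.Orders table;
-- the point is the LOCAL STATIC options on the declaration
DECLARE order_cur CURSOR LOCAL STATIC FOR
    SELECT OrderID
    FROM dbo.Orders

OPEN order_cur
FETCH NEXT FROM order_cur INTO @OrderID

WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row work goes here
    FETCH NEXT FROM order_cur INTO @OrderID
END

CLOSE order_cur
DEALLOCATE order_cur
```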

    And yes, I think I did repeat myself, if you're reading through the whole thread.  Sorry.


    R David Francis

  • There is always more than one way to solve a problem.  While my personal bias is toward set based solutions for the many reasons posted here, RD Francis and others have pointed out that there are other things besides performance to consider.  

    The bottom line is know your toolsets and be careful how you use them.  Remember that databases tend to grow and what works today and is "not too bad" may create a horrible bottleneck a few months down the road. 

    Since I tend to avoid cursors I cannot vouch for the contention that adding LOCAL STATIC to a cursor declaration will produce the kind of performance increases that a set-based solution would.  I find it conceptually unconvincing, though, since it still requires RBAR processing, whereas a set-based solution changes all the values at once.  I would need to see a demonstration of a non-trivial example to be convinced.

    Ron Cicotte
    ron.cicotte@gmail.com
    Data Transformations, LLC

  • Ron says:

    "Since I tend to avoid cursors I cannot vouch for the contention that adding LOCAL STATIC to a cursor declaration will produce the kind of performance increases that a set-based solution would."

    Allow me to clarify:  in the cases I'm referring to, it's entirely possible that a set-based solution would have been an even more significant improvement over the LOCAL STATIC cursor; I'm just saying that there was a significant jump in performance.

    Also note that I can't think of any cases where I'm using a cursor to update values in an existing field; I'm pretty much always selecting data off into another table for future use.

    For me, the big revelation with regular cursors happened when I had two processes running in parallel, one working on several thousand records in a very large table, the other (as it turned out) working on *four* records.  Both operations were still going after several hours, and some WITH (NOLOCK) queries looking at the results showed that both were moving through the data at a proportional rate, and would have finished at roughly the same time!


    R David Francis

    Nice code... just a bit of food for thought, though... correlated sub-queries with triangular joins (a<b) can be as bad as or worse than cursors on large tables.  Not advocating cursors at all... just want everyone to be aware that correlated sub-queries are not the panacea they seem to be.  They are a serious form of "RBAR" (Row By Agonizing Row), and the work they do can grow with the number of rows n by (n-1)²/2 or even (n-1)².  For example... this seemingly harmless running total code appears to fly (if you want to call 1.25 seconds "flying") on 1,000 test records.  Put it in production with as little as 10,000 records and it takes not 10, but 100 times longer at 121.2 seconds, and the duration grows with the square of the number of records... even a cursor will beat that!

     SELECT t1.TransactionID,
            t1.Amount,
            (SELECT SUM(Amount)
               FROM TestTable
              WHERE TransactionID <= t1.TransactionID) AS 'Running Total'
       FROM TestTable t1

    Correlated subqueries are inherently not set-based because, as in the code above, the sub-query is executed once for each row in the outer query.  The same thing can make user-defined functions a nasty source of hidden RBAR.
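    To put numbers on the growth: with a <= triangular join the subquery touches n(n+1)/2 rows in total, which is 500,500 rows for 1,000 records but 50,005,000 rows for 10,000, matching the 100-fold jump above.  For what it's worth, SQL Server 2012 and later can produce the same running total in a single pass with a windowed aggregate (not an option when this thread was written):

```sql
-- Single-pass running total using a windowed SUM (SQL Server 2012+);
-- ROWS UNBOUNDED PRECEDING accumulates all rows up to the current one
SELECT t1.TransactionID,
       t1.Amount,
       SUM(t1.Amount) OVER (ORDER BY t1.TransactionID
                            ROWS UNBOUNDED PRECEDING) AS [Running Total]
FROM TestTable t1
```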

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • That's why I said "99.99999%" instead of 100%, in recognition that 0.00001% of the time you might find it necessary to pound your nails into the board with that hacksaw.

  • Sometimes you can improve performance even more by using a properly indexed temp table to hold the intermediate results of the two queries instead of one large query with several correlated subqueries.  The efficiency depends, in large part, on how well the query optimizer does its job.

    As an example, I used a very similar (but more complex) query to create "product pull-lists" that match up AdventureWorks inventory on SQL Server 2005 (1,069 rows in the Production.ProductInventory table) to sales order details (121,317 rows in the Sales.SalesOrderDetail table).  The query is a lot more complex since there are many different products that can be ordered, each must match up with the proper order detail rows, there are several orders to fill, and the products are pulled from several different bins.  Here are the business rules I used to design this query:

    • The query should report the SalesOrderID, ProductID, LocationID, Shelf, Bin, Quantity (in bin), Quantity (on order), Quantity (to pull from Bin), and a PartialFill flag.  The PartialFillFlag indicates 'Y' if Quantity to pull from Bin is less than Quantity on order, 'N' otherwise.
    • In some cases the number of ordered items might be more than will fit in one bin.  In that case the pull-list will instruct the employee to grab product from multiple bins.
    • Any partial-fills from a bin should be reported on the list.
    • Any substitution work will be handled by a separate business process, and shouldn’t be allowed on this list.
    • No zero-fills will be reported back on the list.

    I've implemented it as one query, with no explicit intermediate temp tables.  The query I wrote takes about 6 seconds on my workstation.  I haven't done a cursor-based version, but one could be created based on the business rules above for comparison.
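    The staging pattern described above can be sketched generically (every object name here is invented; the real pull-list query would stage its own intermediates): compute each expensive intermediate result once into a temp table, index it, and join against that instead of repeating a correlated subquery for every outer row.

```sql
-- Hypothetical demo data
CREATE TABLE #Detail (DetailID INT PRIMARY KEY, Qty INT)
INSERT INTO #Detail VALUES (1, 10)
INSERT INTO #Detail VALUES (2, 20)
INSERT INTO #Detail VALUES (3, 30)

-- Stage the running offsets once...
SELECT d.DetailID,
       d.Qty,
       (SELECT COALESCE(SUM(d2.Qty), 0)
          FROM #Detail d2
         WHERE d2.DetailID < d.DetailID) AS StartOffset
INTO #Staged
FROM #Detail d

-- ...and index the staged copy so later joins can seek instead of scan
CREATE UNIQUE CLUSTERED INDEX IX_Staged ON #Staged (DetailID)

-- Subsequent queries join #Staged rather than recomputing the
-- correlated subquery against the base table each time
SELECT *
FROM #Staged

DROP TABLE #Staged
DROP TABLE #Detail
```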
