Solving the Running Total and Ordinal Rank Problems (Rewritten)

  • tonyrogerson-498793 (9/2/2010)


    Jeff - there is absolutely no need to take that tone with Hugo.

    Hugo is correctly, I have also pointed out the foolishness of relying on this method, a method that is not supported by Microsoft.

    The "product bugs" you mentioned did get support from Microsoft and were fixed.

    Hugo's post was excessively strong, I think. Yours certainly contains at least one error. As has been pointed out often enough, MS published a knowledge-base article on this technique stating conditions under which it does work and conditions under which it doesn't. The article doesn't say it's not supported, it says it only works under certain conditions, and that the resulting order of concatenation (when it's used with a string concatenation operator) is not defined except under certain conditions. I think the article got the conditions wrong, and for myself have more restrictive rules (and Jeff too has stated more restrictive rules); but plenty of MS documentation, including bits of BoL, is wrong so that's nothing to worry about.

    I seriously hope anybody contenplating using this method will consider the wealth and longevity of mine, Hugo's and some of the other SQL MVP's considerable product and industrial experience and use a supported and safe method instead.

    The supported and safe method being a cursor with a while loop, which doesn't work when there is an elapsed time criterion on the process?

    Anyway, have you never heard of using a high performance unguaranteed technique along with a sanity check that can cause you to fall back to a low performance guaranteed technique when the high performance one hits a problem? It's an extremely common technique in HPC, and is based on the sort of advanced error containment and recovery technique that is essential for HRC. It's something which Jeff advocates, as do many others. But you (unlike Hugo) are an "anti-quirky-update" purist just as irrational as the anti-null relational purists and the mathematically illiterate idiots who reject all logics except good old-fashioned two-valued logic and insist that the excluded middle is an essential component of sanity, and your language and sentiments are often throughly intemperate and grossly unprofessional to boot.

    Tom

  • Can you wait a few minutes? I just ran out of popcorn and soda...


    N 56°04'39.16"
    E 12°55'05.25"

  • Hugo,

    I really want to break this down to a simple point which I hope everyone can agree on:

    "The single requirement for Quirky Update is that rows get updated in clustered index order."

    All the debate boils down to whether that can be guaranteed or not.

    The code I posted a while back addresses that:

    DECLARE @PrevAccountID INTEGER;

    DECLARE @AccountRunningTotal MONEY;

    DECLARE @AccountRunningCount INTEGER;

    DECLARE @Sequence INTEGER = 0;

    WITH SafeTable

    AS (

    SELECT Sequence = ROW_NUMBER() OVER (ORDER BY TD.AccountID ASC, TD.Date ASC, TD.TransactionDetailID ASC),

    TD.AccountID,

    TD.AccountRunningCount,

    TD.AccountRunningTotal,

    TD.Amount,

    TD.Date,

    TD.NCID,

    TD.TransactionDetailID

    FROM dbo.TransactionDetail TD

    )

    UPDATE SafeTable

    SET @AccountRunningTotal = AccountRunningTotal =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningTotal+Amount

    ELSE Amount

    END,

    @AccountRunningCount = AccountRunningCount =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningCount + 1

    ELSE 1

    END,

    @Sequence =

    CASE

    WHEN Sequence = @Sequence + 1 THEN Sequence -- updating in sequence

    ELSE 1/0 -- quirky update is broken!

    END,

    @PrevAccountID = AccountID;

    Notice the complete absence of hints of any kind.

    Paul

    @peso: :laugh:

  • Tom.Thomson (9/3/2010)


    As has been pointed out often enough, MS published a knowledge-base article on this technique stating conditions under which it does work and conditions under which it doesn't.

    Are you refering to http://support.microsoft.com/kb/287515?

    If so, then do consider that this article is not about the quirky update technique, but about the (related but dissimilar) aggregate concatenation technique. The article guarantees nothing for the quirky update.

    I'd also argue that it guarantees nothing about aggregate concatenation, as it includes this quote: "The correct behavior for an aggregate concatenation query is undefined"). Unfortunately, it then muddles the water by providing a workaround "to achieve the expected results from an aggregate concatenation query" without telling us what exactly the expected results from a query with undefined behaviour are.

    But let's not divert. This topic is about quirky updates, not about aggregate concatenation.

    The supported and safe method being a cursor with a while loop, which doesn't work when there is an elapsed time criterion on the process?

    Please go back to page 4 of this topic. One of my posts on that page includes a very fast algorithm that completely avoids undocumented features. And further down, Jeff even posted an optimized version of that code.

    Jeff's optimized version of my code takes 42 seconds on my system. The last 22 seconds are for copying back the restults from a temporary table to the permanent table. If you compute running totals for reporting (which is their mosst common use), you can omit that step and need only 20 seconds. If you do need to permanently store the running totals, you could do the computations directly on the columns in the permanent table and again omit the final step. (The actual time would be more than 20 seconds though, as you need to create an extra index)

    Jeff's quirky update (figure 19) plus verification takes 45 seconds on my system - 6 for the update, and 39 for the verification. That is longer than the optimized version of my algorithm, and even twice as long in the common case of reporting only.

    Anyway, have you never heard of using a high performance unguaranteed technique along with a sanity check that can cause you to fall back to a low performance guaranteed technique when the high performance one hits a problem?

    But why would you use a high performance unguaranteed technique along with a sanity check, if the sanity check alone takes more time than the low performance guaranteed technique?


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Paul White NZ (9/3/2010)


    Hugo,

    I really want to break this down to a simple point which I hope everyone can agree on:

    "The single requirement for Quirky Update is that rows get updated in clustered index order."

    All the debate boils down to whether that can be guaranteed or not.

    Exactly! 🙂

    And I think it is impossible to guarantee this, at until Microsoft gives us an optimizer hint that forces it to not only use a particular index, but to use it in this particular way.

    The code I posted a while back addresses that:

    DECLARE @PrevAccountID INTEGER;

    DECLARE @AccountRunningTotal MONEY;

    DECLARE @AccountRunningCount INTEGER;

    DECLARE @Sequence INTEGER = 0;

    WITH SafeTable

    AS (

    SELECT Sequence = ROW_NUMBER() OVER (ORDER BY TD.AccountID ASC, TD.Date ASC, TD.TransactionDetailID ASC),

    TD.AccountID,

    TD.AccountRunningCount,

    TD.AccountRunningTotal,

    TD.Amount,

    TD.Date,

    TD.NCID,

    TD.TransactionDetailID

    FROM dbo.TransactionDetail TD

    )

    UPDATE SafeTable

    SET @AccountRunningTotal = AccountRunningTotal =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningTotal+Amount

    ELSE Amount

    END,

    @AccountRunningCount = AccountRunningCount =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningCount + 1

    ELSE 1

    END,

    @Sequence =

    CASE

    WHEN Sequence = @Sequence + 1 THEN Sequence -- updating in sequence

    ELSE 1/0 -- quirky update is broken!

    END,

    @PrevAccountID = AccountID;

    Notice the complete absence of hints of any kind.

    Wow!

    That is a pretty nifty piece of work. My compliments. It runs in 8 seconds on my computer, slightly slower than Jeff's version, but since it already includes the verification, it actaully saves time when compared to the checked version of Jeff's code you'd need to use when the results have to be correct.

    (Also, note that you can omit the last three columns in the "SafeTable" CTE, as they are not referenced in the update - it won't change the time taken, but it does save some lines of code to maintain).

    I see no way how this code could not work as expected - with "expected" meaning to imply it either computes correct running totals, or it returns an error. And I currently do not see a way to force the error, but that does not imply that there is none!

    So what it now boils down to, is the choice between a method that takes 20 seconds to compute the running totals with 100% reliability, or a method that takes 8 seconds to either compute the running totals or page an operator. If the schedule you have is so tight that you need to shave off those 12 seconds, you can - but be aware that at the same time you introduce the risk of a several-hour downtime! (Or, you can enclose this code in a TRY CATCH block and include the 20 second alternative in the CATCH block, but that is a lot of extra code to write, test, and maintain. And you have to make certain that you don't acciddentally suppress other errors).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo,

    Excellent, thank you. I do agree (as I have mentioned more than once before on this thread) that the SQLCLR or set-iteration alternatives are a better choice for production systems in the majority of cases. I'm not the proponent of Quirky Update that some people have assumed.

    That said, I will defend Jeff's right to publish the details of Quirky Update and guide people on its proper implementation, to allow them to make an informed choice. If he hadn't written it, someone else would have done, and the chances are extremely high that they would not have made such a good job of it.

    Even so, I do think Jeff erred (with the best of intentions) in rewriting this article. In attempting to 'head off' criticism, the simplicity was lost. Retiring the INDEX hint 'rule' was an error. For anyone interested, my strong preference is for INDEX(1) over INDEX(0) for performance reasons: QO may choose to introduce an explicit sort with INDEX(0) since the TABLOCK hint allows for an IAM-ordered scan. This sort is responsible for the drop in performance he saw, motivating him to retire the INDEX hint.

    Of course the real issue stands: SQL Server should provide an efficient native mechanism for simple running totals. The Segment and Sequence Project iterators are ideal for this (Sequence Project is an order-aware Compute Scalar with the ability to maintain state between rows).

    Until that happens, all methods have weaknesses. SQLCLR is efficient but not acceptable to everyone. I am a fan of your Set Iteration method - but even that performs less well than a cursor if only one group (to iterate over) exists. Nevertheless, it is a very fine solution.

    Paul

  • Hugo Kornelis (9/3/2010)


    ...at until Microsoft gives us an optimizer hint that forces it to not only use a particular index, but to use it in this particular way.

    Interestingly, there may be hope for this.

    In SQL Server 2008, the QO can require the Storage Engine to ensure that rows are processed by a Clustered Index Insert iterator in sorted order. This occurs in plans that make use of 2008's ability to perform minimally-logged INSERT...SELECTs (with or without TF610). The query plan attribute to look for on the Insert is [DMLRequestSort].

    This is quite a subtle point, so I want to emphasise it: even with an explicit Sort iterator immediately before the Insert iterator, SQL Server does not guarantee that the (sorted) rows will actually be inserted in that order. The [DMLRequestSort] attribute is required to ensure that the internal implementation does respect the required order.

    There's a detailed illustration of this on Brad Schultz's blog: http://bradsruminations.blogspot.com/2010/05/truth-about-cursors-part-3.html (my contribution is in the comments section).

  • Hugo Kornelis (9/3/2010)


    Jeff Moden (9/2/2010)


    No one has yet observed out-of-sequence updates if they've met ALL the rules. That's been since the earliest versions of Sybase and all versions, hot fixes, CU's, and SP's of SQL Server EVER including the current versions. And, yes, the rules needed to be changed to include all the rules because things changed over time including my understanding of a very valuable technique.

    Some modifications of the condition list are indeed due to product changes. But not all.

    Agreed... and that's exactly what I said above. "including my understanding of a very valueable technique.".

    I usually value your opinion but you have to stop hammering at me personally. The technique works, I've added code to check that it works (and Paul White has recently done a much better job), I've identified that Cursors are good and reliable secondary should the code ever fail, and I've even acknowledged that under the correct conditions your fine code is a good secondary as well.

    I've also stated that if people don't trust the "Quirky Update" that they should use one of those secondary methods as the primary to begin with.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • tonyrogerson-498793 (9/2/2010)


    I wasn't going to respond because I knew I'd get grief for taking the opposite view to you Paul and Jeff and no doubt a few others that are proponents of this un-supported method.

    Readers need only go back over the old threads to see the reality and context of your remark, labelling somebody unprofessional for pointing out the utter stupidity in using this method is quite frankly pathetic and only serves to highlight the weak position you have that forces you into personal insults but I'll leave it there.

    Readers - please do check previous posts by myself, Hugo and others before playing with your employers most valued asset - their data, in the real world there are other fast but more importantly supported and stable approaches to this - be safe, be professional and use them!

    If people want to email me directly on this subject with specific update problems they are trying to address then feel free: tonyrogerson@torver.net.

    Many thanks,

    Tony Rogerson, SQL Server MVP.

    Heh... you are SUCH a loveable guy, Tony! Tell you what... since you think this method is "utterly stupid", how about if you demonstrate how a properly working "Quirky Update" that follows all the rules will break on it's own. I know... that's probably a "pathetic" request or something like that, huh? 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hugo... you're recent comments are quite lengthy and deserve a bit of study before further reply. I'll try to do that after work tonight.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Paul White NZ (9/3/2010)


    Hugo,

    Excellent, thank you. I do agree (as I have mentioned more than once before on this thread) that the SQLCLR or set-iteration alternatives are a better choice for production systems in the majority of cases. I'm not the proponent of Quirky Update that some people have assumed.

    That said, I will defend Jeff's right to publish the details of Quirky Update and guide people on its proper implementation, to allow them to make an informed choice. If he hadn't written it, someone else would have done, and the chances are extremely high that they would not have made such a good job of it.

    Even so, I do think Jeff erred (with the best of intentions) in rewriting this article. In attempting to 'head off' criticism, the simplicity was lost. Retiring the INDEX hint 'rule' was an error. For anyone interested, my strong preference is for INDEX(1) over INDEX(0) for performance reasons: QO may choose to introduce an explicit sort with INDEX(0) since the TABLOCK hint allows for an IAM-ordered scan. This sort is responsible for the drop in performance he saw, motivating him to retire the INDEX hint.

    Of course the real issue stands: SQL Server should provide an efficient native mechanism for simple running totals. The Segment and Sequence Project iterators are ideal for this (Sequence Project is an order-aware Compute Scalar with the ability to maintain state between rows).

    Until that happens, all methods have weaknesses. SQLCLR is efficient but not acceptable to everyone. I am a fan of your Set Iteration method - but even that performs less well than a cursor if only one group (to iterate over) exists. Nevertheless, it is a very fine solution.

    Paul

    Thanks for the awesome support on this. Lot's of people do use the "Quirky Update" and they do it without following the rules such as the MAXDOP and TABLOCKX rules even on this site. I'd rather people not use the technique if they're not going to do it correctly.

    The reason I originally used the INDEX(0) hint is because of what is stated in BOL...

    If a clustered index exists, INDEX(0) forces a clustered index scan and INDEX(1) forces a clustered index scan or seek.

    If a SEEK occurs, you could get an out of order update and that's why I didn't recommend INDEX(1). INDEX(0) forces the clustered index scan that's needed to guarantee the order of the update.

    Shifting gears... the code you provided in the "other" post is an absolutely awesome method of double checking the update order. Would you mind if I added the method to the article?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (9/3/2010)


    If a SEEK occurs, you could get an out of order update and that's why I didn't recommend INDEX(1). INDEX(0) forces the clustered index scan that's needed to guarantee the order of the update.

    Seeks are always ordered 😉 It's the scan that might not be. Honest!

    Shifting gears... the code you provided in the "other" post is an absolutely awesome method of double checking the update order. Would you mind if I added the method to the article?

    Ha, so polite! Do I really have to answer that?

  • Paul... If you change index IXC_Transaction_AccountID_Date_TransactionDetailID for DATE column to be descending, your CTE update fails.

    This behaviour is covered both here (and a correct implementation of "Ordered CTE Update") http://www.simple-talk.com/sql/performance/writing-efficient-sql-set-based-speed-phreakery/[/url]

    and here http://weblogs.sqlteam.com/mladenp/archive/2009/07/28/SQL-Server-2005-Fast-Running-Totals.aspx

    A correct implementation of "Ordered CTE Update" would look something like this (no clustered index).

    SELECT *

    INTO #Temp

    FROM dbo.TransactionDetail

    GO

    DECLARE @PrevAccountID INTEGER;

    DECLARE @AccountRunningTotal MONEY;

    DECLARE @AccountRunningCount INTEGER;

    DECLARE @Sequence INTEGER = 0

    ;WITH SafeTable

    AS (

    SELECT TOP(2147483647)

    ROW_NUMBER() OVER (ORDER BY AccountID ASC, Date ASC, TransactionDetailID ASC) AS Sequence,

    AccountID,

    AccountRunningCount,

    AccountRunningTotal,

    Amount,

    Date,

    NCID,

    TransactionDetailID

    FROM #Temp

    ORDER BY AccountID,

    Date,

    TransactionDetailID

    )

    UPDATE SafeTable

    SET @AccountRunningTotal = AccountRunningTotal =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningTotal+Amount

    ELSE Amount

    END,

    @AccountRunningCount = AccountRunningCount =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningCount + 1

    ELSE 1

    END,

    @Sequence =

    CASE

    WHEN Sequence = @Sequence + 1 THEN Sequence -- updating in sequence

    ELSE 1/0 -- quirky update is broken!

    END,

    @PrevAccountID = AccountID;

    -- Clean up

    DROP TABLE #Temp


    N 56°04'39.16"
    E 12°55'05.25"

  • I think that implementing running totals high performance style brings you to techniques that cannot just be copy/pasted by anyone. You have to understanding of how and why it works and when it doesn't. The level of in-depth knowledge brewing up here in these last posts makes me feel quite humble :).

    On another note, I still want to find some time to build an incrementally updating 'running total' by means of triggers. I done such a thing for multiple running totals (over several entities) in the past using more accessible SQL. It works well for the (relatively) modest volume of data it had to work on (and still is). I like to see how combining this new information would speed such logic up.

    And I might have another view then others here about the need for running totals in the database. First of all, you nearly always need it for speed purposes, but sometimes also for data integrity checks. Either way you need to guarantee the running total is always correct, which means guarding and updating it by means of triggers.

  • SwePeso (9/3/2010)


    Paul... If you change index IXC_Transaction_AccountID_Date_TransactionDetailID for DATE column to be descending, your CTE update fails.

    Unless I am misunderstanding you, I think you have missed the point 😛 🙂

    The single requirement for Quirky Update is that rows get updated in clustered index order.

    If you change the date column to be descending, you get running totals according to the changed specification - exactly as expected. The point of my modification is to throw an error if rows don't get updated in that order.

    You approach the problem in a slightly different way - copying the source table to a heap first and relying on the ranking function to provide the required update ordering. It's broadly similar to a true Quirky Update, but obviously much less performant where an in-place update is possible. My 'safety' modification works just as well with the intermediate materialization idea of course.

    BTW if you're going to use intermediate materialization like that, you should really specify:

    TOP (9223372036854775807) since the parameter is BIGINT, and tables can, and do, have more than 2147483647 rows.

    Paul

Viewing 15 posts - 166 through 180 (of 307 total)

You must be logged in to reply to this topic. Login to reply