Running total (cummulative Sum)

  • Hi ,

    Have a table that looks like this:

    id count

    1 100

    2 50

    3 10

    I want to add a new column called cumulative_sum, so the table would look like this:

    id count cumulative_sum

    1 100 100

    2 50 150

    3 10 160

    Is there any SQL statement that can do this easily? What's the best way to accomplish this?

    _______________________________________________________________

    Need help? Help us help you.

  • --drop table #data

    create table #data

    (

    id int identity(1,1),

    num numeric(18,0),

    cnum numeric(18,0)

    )

    insert into #data (num, cnum)

    select 100,0

    union

    select 50,0

    union

    select 10,0

    union

    select 65,0

    declare @cnt int

    set @cnt = 1

    while (@cnt <= (select COUNT(*) from #data))

    begin

    update #data

    set cnum = (select SUM(num) from #data

    where id <= @cnt)

    where [id] = @cnt

    set @cnt = @cnt +1

    end

    select * from #data

    HTH :-):-)
    taybre

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • tommey152 (7/6/2011)


    Hi ,

    .....

    Is there any SQL statement that can do this easily? What's the best way to accomplish this?

    Easiest is the recursive CTE:

    DROP TABLE #SourceData

    CREATE TABLE #SourceData (id INT IDENTITY(1,1), [count] INT, cumulative_sum INT)

    INSERT INTO #SourceData (count) VALUES (100),(50),(10)

    ;WITH Accumulator AS (

    SELECT id,

    [count],

    cumulative_sum = [count]

    FROM #SourceData

    WHERE id = 1

    UNION ALL

    SELECT s.id,

    s.[count],

    a.cumulative_sum+s.[count]

    FROM #SourceData s

    INNER JOIN Accumulator a

    ON a.id + 1 = s.id

    ) SELECT id,

    [count],

    cumulative_sum

    FROM Accumulator OPTION (MAXRECURSION 0)

    This method can calculate values for a million row in around ten seconds providing there's a unique clustered index on the column which holds the sequential values

    Best is probably the "Running Totals Update"[/url] - it's fastest anyway.

    Worst are those methods using a triangular join - where the value for say row 10 is calculated by summing the values for rows 1 to 10, then the value for row 11 is calculated by summing the values for rows 1 to 11...

    Cheers

    ChrisM

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Chris,

    Do you know of a way to use the rCTE to do a running total without the ID's being perfectly sequential?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (7/6/2011)


    Chris,

    Do you know of a way to use the rCTE to do a running total without the ID's being perfectly sequential?

    There's a way using Cross Apply by taking the Min of the greater IDs as your next connection, but it's data volume and index dependent on if you're better off slapping a Row_Number() on it at that point.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Craig Farrell (7/6/2011)


    Jeff Moden (7/6/2011)


    Chris,

    Do you know of a way to use the rCTE to do a running total without the ID's being perfectly sequential?

    There's a way using Cross Apply by taking the Min of the greater IDs as your next connection, but it's data volume and index dependent on if you're better off slapping a Row_Number() on it at that point.

    You'd think so, but it's not straightforward:

    DROP TABLE #SourceData

    CREATE TABLE #SourceData (id INT IDENTITY(1,1), [count] INT, cumulative_sum INT)

    INSERT INTO #SourceData (count) VALUES (100),(50),(10)

    ;WITH Accumulator AS (

    SELECT id,

    [count],

    cumulative_sum = [count]

    FROM #SourceData

    WHERE id = 1

    UNION ALL

    SELECT s.id,

    s.[count],

    a.cumulative_sum+s.[count]

    FROM Accumulator a

    CROSS APPLY (SELECT NextID = MIN(id) FROM #SourceData WHERE id > a.id) n

    INNER JOIN #SourceData s ON s.id = n.NextID

    ) SELECT id,

    [count],

    cumulative_sum

    FROM Accumulator OPTION (MAXRECURSION 0)

    Msg 467, Level 16, State 1, Line 5

    GROUP BY, HAVING, or aggregate functions are not allowed in the recursive part of a recursive common table expression 'Accumulator'.

    Jeff, thanks for asking this. I've been working on a rCTE running totals article for a few months (when not wrestling with SSIS) but it's too bland to offer as it stands. Cracking this particular nut could make it more interesting and worthwhile.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (7/7/2011)


    You'd think so, but it's not straightforward...

    MIN and MAX may not be allowed in the recursive part, but MIN and MAX are no more than the first row (TOP 1) ordered in some particular way. Now, TOP isn't allowed in the recursive part either, but nobody told the query optimizer that, so we can use ROW_NUMBER = 1 to achieve a TOP operator in the plan.

    DROP TABLE #SourceData;

    GO

    CREATE TABLE #SourceData

    (

    id INTEGER IDENTITY(1,1) PRIMARY KEY,

    [count] INTEGER NOT NULL,

    cumulative_sum INTEGER NULL,

    );

    GO

    INSERT #SourceData

    (count)

    VALUES

    (100),

    (50),

    (10);

    WITH Accumulator AS

    (

    SELECT

    sd.id,

    sd.[count],

    cumulative_sum = sd.[count]

    FROM #SourceData AS sd

    WHERE

    sd.id = 1

    UNION ALL

    SELECT

    iTVF.id,

    iTVF.[count],

    a.cumulative_sum + iTVF.[count]

    FROM Accumulator AS a

    CROSS APPLY

    (

    SELECT

    r.id,

    r.[count]

    FROM

    (

    SELECT

    sd.id,

    sd.[count],

    rn = ROW_NUMBER() OVER (

    ORDER BY sd.id)

    FROM #SourceData AS sd

    WHERE

    sd.id > a.id

    ) AS r

    WHERE

    r.rn = 1

    ) AS iTVF

    )

    SELECT

    a.id,

    a.[count],

    a.cumulative_sum

    FROM Accumulator AS a

    OPTION (MAXRECURSION 0);

  • Nice one, Paul. The triangular join put me off writing and testing it, but in practice the performance is stunningly good - half a million rows return in 12s on this box.

    INSERT INTO #SourceData ([count]) SELECT a.[count] FROM #SourceData a, #SourceData b, #SourceData c, #SourceData d

    INSERT INTO #SourceData ([count]) SELECT a.[count] FROM #SourceData a, #SourceData b, #SourceData c

    go

    ;WITH Accumulator AS

    (

    SELECT

    sd.id,

    sd.[count],

    cumulative_sum = sd.[count]

    FROM #SourceData AS sd

    WHERE

    sd.id = 1

    UNION ALL

    SELECT

    s.id,

    s.[count],

    a.cumulative_sum+s.[count]

    FROM Accumulator AS a

    CROSS APPLY

    (

    SELECT

    NextID = r.id

    FROM

    (

    SELECT

    sd.id,

    rn = ROW_NUMBER() OVER (ORDER BY sd.id)

    FROM #SourceData AS sd

    WHERE

    sd.id > a.id

    ) AS r

    WHERE

    r.rn = 1

    ) AS n

    INNER JOIN #SourceData s ON

    s.id = n.NextID

    )

    SELECT

    a.id,

    a.[count],

    a.cumulative_sum

    FROM Accumulator AS a

    OPTION (MAXRECURSION 0);

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    There's no need for a join at all - a point I realised after editing the TOP trick into your code. I have updated the code and query plan to show what I'm getting at here. Eight seconds on my machine. The only real issue I have is that recursive query plans cannot use parallelism directly. Shame.

    Paul

  • The new version runs faster but not by much.

    So the optimizer converts what looks like a triangular join (the entire CROSS APPLY part) into a TOP query?

    Interestingly, you can put SUM OVER () into the iTVF without an error message, and so long as you don't reference the product in the outer query, the optimizer will still recognise the iTVF as a TOP query.

    SELECT

    sd.id,

    sd.[count],

    SUMcount = SUM([count]) OVER (PARTITION BY 1),

    rn = ROW_NUMBER() OVER (ORDER BY sd.id)

    FROM #SourceData AS sd

    WHERE

    sd.id > a.id

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • SQLkiwi (7/7/2011)


    Hi Chris,

    There's no need for a join at all - a point I realised after editing the TOP trick into your code. I have updated the code and query plan to show what I'm getting at here. Eight seconds on my machine. The only real issue I have is that recursive query plans cannot use parallelism directly. Shame.

    Paul

    This looks similar to your DISTINCT trick...

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (7/7/2011)


    So the optimizer converts what looks like a triangular join (the entire CROSS APPLY part) into a TOP query?

    There's a bit more going on than that, but essentially yes.

    Interestingly, you can put SUM OVER () into the iTVF without an error message, and so long as you don't reference the product in the outer query, the optimizer will still recognise the iTVF as a TOP query.

    You can do that, but why would you want to? By the way, a windowing function in the recursive part of a CTE has a weird semantic (quite counter-intuitive, in fact) but works the way it does because they are intended for use in hierarchy queries, where apparently the implemented semantic is the desired one.

  • SQLkiwi (7/7/2011)


    ... By the way, a windowing function in the recursive part of a CTE has a weird semantic (quite counter-intuitive, in fact) but works the way it does because they are intended for use in hierarchy queries, where apparently the implemented semantic is the desired one.

    I heard that too but the full text of the article hasn't reached this end of the solar system yet!

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (7/7/2011)


    SQLkiwi (7/7/2011)


    Hi Chris,

    There's no need for a join at all - a point I realised after editing the TOP trick into your code. I have updated the code and query plan to show what I'm getting at here. Eight seconds on my machine. The only real issue I have is that recursive query plans cannot use parallelism directly. Shame.

    Paul

    This looks similar to your DISTINCT trick...

    Hey Chris, the DISTINCT trick you are mentioning is the one Paul coded as an altenative to DISTINCT operation ? I have been searching for that piece of code but in vain. Can you point me to that piece?

    Paul, if u still on this thread, can you point me to your altenative to DISTINCT function?

  • ColdCoffee (7/7/2011)


    Paul, if u still on this thread, can you point me to your altenative to DISTINCT function?

    Sure.

    http://qa.sqlservercentral.com/Forums/FindPost1013407.aspx

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply