% of total count

  • I want to add a 4th column (% of Total Count), for example:

    State Name    Count1    Total Charge Off Amount    % of Total Count
    Alabama            5                      10000                 50%
    Arizona            2                      40000                 20%
    Arkansas           3                       5000                 30%

    SELECT
         A.[State Name]
        ,COUNT(A.[State Name]) AS [Count1]
        ,SUM(A.ChargeOffAmount) AS [Total Charge Off Amount]
    FROM
    (
        SELECT DISTINCT
             L.LoanID
            ,H.TransactionDate
            ,CASE WHEN H.TransactionAmt > 0 THEN H.TransactionAmt ELSE L.PrincipalBal END AS [ChargeOffAmount]
            ,ST.Name AS [State Name]
        FROM Loan AS L
        JOIN History AS H ON L.LoanID = H.LoanID
        LEFT JOIN CreditScoreDetail AS CSD ON L.LoanID = CSD.LoanID AND L.AssmRecCounter = CSD.AssmRecCounter
        LEFT JOIN Property AS PRO ON L.LoanID = PRO.LoanID
        LEFT JOIN State AS ST ON PRO.State = ST.Code
        WHERE CONVERT(varchar(20), H.TransactionDate, 101) >= @StartDate
          AND CONVERT(varchar(20), H.TransactionDate, 101) <= @EndDate
    ) AS A
    LEFT OUTER JOIN Loan AS L ON L.loanID = A.LoanID
    GROUP BY A.[State Name]

    Can somebody suggest the code for the 4th column?

  • DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        [% of Total Count] = ((Count1*1.)/CountTotal)*100
    FROM @table t1
    CROSS APPLY (SELECT CountTotal = SUM(Count1) FROM @table) t2;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Also SUM with OVER()

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.[Count1]) OVER () AS [Sum]
    FROM @table t1

  • Both Alan.B and Gazareth started with the summarized data. Here is how you would do it starting with the raw data.

    DECLARE @table TABLE (stateName varchar(20), Amount int);

    INSERT @table
    VALUES
        ('Alabama',200),
        ('Alabama',200),
        ('Alabama',200),
        ('Alabama',200),
        ('Alabama',200),
        ('Arizona',2000),
        ('Arizona',2000),
        ('Arkansas',2000),
        ('Arkansas',2000),
        ('Arkansas',1000);

    SELECT t.stateName, COUNT(*), SUM(t.Amount), 100.0 * COUNT(*) / SUM(COUNT(*)) OVER()
    FROM @table t
    GROUP BY t.stateName

    Note that the denominator is a windowed aggregate of a regular aggregate.
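
    Grafted onto the original query from the first post, the same pattern would look something like the sketch below (column and table names are copied from that post; this hasn't been run against the real schema, so treat it as an illustration of the pattern rather than a drop-in fix):

    SELECT
         A.[State Name]
        ,COUNT(A.[State Name]) AS [Count1]
        ,SUM(A.ChargeOffAmount) AS [Total Charge Off Amount]
        -- the new column: a windowed aggregate of the regular aggregate
        ,100.0 * COUNT(A.[State Name]) / SUM(COUNT(A.[State Name])) OVER () AS [% of Total Count]
    FROM
    (
        SELECT DISTINCT
             L.LoanID
            ,H.TransactionDate
            ,CASE WHEN H.TransactionAmt > 0 THEN H.TransactionAmt ELSE L.PrincipalBal END AS [ChargeOffAmount]
            ,ST.Name AS [State Name]
        FROM Loan AS L
        JOIN History AS H ON L.LoanID = H.LoanID
        LEFT JOIN CreditScoreDetail AS CSD ON L.LoanID = CSD.LoanID AND L.AssmRecCounter = CSD.AssmRecCounter
        LEFT JOIN Property AS PRO ON L.LoanID = PRO.LoanID
        LEFT JOIN State AS ST ON PRO.State = ST.Code
        WHERE CONVERT(varchar(20), H.TransactionDate, 101) >= @StartDate
          AND CONVERT(varchar(20), H.TransactionDate, 101) <= @EndDate
    ) AS A
    LEFT OUTER JOIN Loan AS L ON L.loanID = A.LoanID
    GROUP BY A.[State Name]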

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • It also appears that you may have a logic error in your code:

    WHERE
        CONVERT(varchar(20), H.TransactionDate, 101) >= @StartDate
        AND CONVERT(varchar(20), H.TransactionDate, 101) <= @EndDate

    If the H.TransactionDate field is a datetime, and contains a time portion, this appears to leave off everything for the end date.

    As an example, the @EndDate is 9/30/2016, which is midnight.

    If there are records that are created AFTER midnight, then this will not include them.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Michael L John (9/15/2016)


    It also appears that you may have a logic error in your code:

    WHERE
        CONVERT(varchar(20), H.TransactionDate, 101) >= @StartDate
        AND CONVERT(varchar(20), H.TransactionDate, 101) <= @EndDate

    If the H.TransactionDate field is a datetime, and contains a time portion, this appears to leave off everything for the end date.

    As an example, the @EndDate is 9/30/2016, which is midnight.

    If there are records that are created AFTER midnight, then this will not include them.

    The 101 format does not include the time portion. That being said, you would be better off casting these to DATE rather than to VARCHAR. SQL 2012 can still use an index seek when casting DATETIME data to DATE, but it cannot when converting it to VARCHAR.
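
    As a sketch of that suggestion, the date filter from the original query would become something like the following (assuming @StartDate and @EndDate are DATE parameters or date literals):

    WHERE CAST(H.TransactionDate AS date) >= @StartDate
      AND CAST(H.TransactionDate AS date) <= @EndDate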

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • If it's after midnight on 09/30, then shouldn't the date be 10/01?

  • That's right, it does not have any time portion, so everything for 09/30 should be listed.

  • SQLPain (9/15/2016)


    That's right, it does not have any time portion, so everything for 09/30 should be listed.

    Good, because this is a frequent error that I see.

    2016-09-30 15:39:50.197 would be an example of a time after midnight. If your data contained times, then this row would not appear in your query.
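
    For reference, a pattern that stays correct even if times do creep into the column later is a half-open range. This is only a sketch, and it assumes @EndDate is the last calendar day that should be included:

    WHERE H.TransactionDate >= @StartDate
      AND H.TransactionDate <  DATEADD(DAY, 1, @EndDate)

    This also leaves the column itself untouched, so an index on TransactionDate can still be used for a seek.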

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Gazareth (9/15/2016)


    Also SUM with OVER()

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.[Count1]) OVER () AS [Sum]
    FROM @table t1

    The problem with SUM OVER() in this case (and often with Window Aggregate Functions in general) is that it requires work tables to be written to tempdb, which makes the query slower. Consider the following two queries:

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SET STATISTICS IO ON;

    PRINT 'Using CROSS APPLY and SUM';
    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        [% of Total Count] = (Count1*100.)/CountTotal
    FROM @table t1
    CROSS APPLY (SELECT CountTotal = SUM(Count1) FROM @table) t2;

    PRINT CHAR(13)+CHAR(10)+'Using SUM OVER()';
    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.Count1) OVER () AS [Sum]
    FROM @table t1;

    SET STATISTICS IO OFF;

    and here are the IO results:

    Using CROSS APPLY and SUM

    Table '#ABCCA541'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Using SUM OVER()

    Table 'Worktable'. Scan count 3, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#ABCCA541'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Using SUM OVER() required 6X the number of reads. This is why I generally avoid Window Aggregate Functions. This is discussed in this article:

    Nasty Fast PERCENT_RANK

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (9/15/2016)


    Gazareth (9/15/2016)


    Also SUM with OVER()

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.[Count1]) OVER () AS [Sum]
    FROM @table t1

    The problem with SUM OVER() in this case (and often with Window Aggregate Functions in general) is that it requires work tables to be written to tempdb, which makes the query slower. Consider the following two queries:

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SET STATISTICS IO ON;

    PRINT 'Using CROSS APPLY and SUM';
    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        [% of Total Count] = (Count1*100.)/CountTotal
    FROM @table t1
    CROSS APPLY (SELECT CountTotal = SUM(Count1) FROM @table) t2;

    PRINT CHAR(13)+CHAR(10)+'Using SUM OVER()';
    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.Count1) OVER () AS [Sum]
    FROM @table t1;

    SET STATISTICS IO OFF;

    and here are the IO results:

    Using CROSS APPLY and SUM

    Table '#ABCCA541'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Using SUM OVER()

    Table 'Worktable'. Scan count 3, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#ABCCA541'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Using SUM OVER() required 6X the number of reads. This is why I generally avoid Window Aggregate Functions. This is discussed in this article:

    Nasty Fast PERCENT_RANK

    Although I absolutely agree that a large number of reads is usually an indication that something has gone haywire and could stand some real improvement, it's not always the case. "It Depends". In the past, I've seen Paul White solve a running total using a Triangular Join, but the CPU and Duration were incredibly low, rivaling even the Quirky Update. And, Tuesday night, I gave a demo of solving a problem using an initial single-column GROUP BY to pre-aggregate a 7 million row / 8GB table and then used 4 SUM() OVERs to very quickly solve for 4 other sums used to calculate percentages related to the first, along with 4 DENSE_RANKs. There were "only" ~31 thousand reads from the original table and ~482,000 reads from the work table. The SUM() OVERs also replaced the need for 4 other GROUP BY queries that would have required a join to the original GROUP BY to solve the problem. In the cCTE, that would have required the pre-aggregation to execute 4 times, and doing it with a Temp Table already took a bit more than 7 seconds, not to mention the fact that 4 additional queries were starting to make the code a fair bit more complex.

    The code with the initial pre-agg GROUP BY, followed by a cCTE that used SUM() OVER 4 times after that (all in the same query, BTW), solved the problem in 4.5 seconds of duration (INCLUDING display time of ~57,000 rows) and just over 9 seconds of CPU time, and it made the code incredibly simple to read and maintain.

    The reason I came back on this was that I didn't want anyone to develop the habit of avoiding things like SUM() OVER (or any query) based solely on the number of READs sometimes produced. You have to look at the "big 4" (CPU, Reads, Writes, and Duration, the last being the most important to the people using the code) as a complete package compared to how other methods would perform.
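
    For anyone who wants to see the shape of that pattern, here is a minimal sketch reduced to this thread's toy data: pre-aggregate once with GROUP BY, then let SUM() OVER() work on the small aggregated set. The real demo used a 7-million-row table and several such window aggregates; the table and column names below are just the toy ones from earlier posts.

    DECLARE @table TABLE (stateName varchar(20), Amount int);
    INSERT @table VALUES
        ('Alabama',200),('Alabama',200),('Alabama',200),
        ('Arizona',2000),('Arizona',2000),
        ('Arkansas',2000),('Arkansas',1000);

    WITH PreAgg AS
    (   -- aggregate once at the detail level
        SELECT stateName,
               Count1      = COUNT(*),
               TotalAmount = SUM(Amount)
        FROM @table
        GROUP BY stateName
    )
    SELECT stateName,
           Count1,
           TotalAmount,
           -- the window aggregates now only see one row per state
           [% of Total Count]  = 100.0 * Count1      / SUM(Count1)      OVER (),
           [% of Total Amount] = 100.0 * TotalAmount / SUM(TotalAmount) OVER ()
    FROM PreAgg;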

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (9/16/2016)


    Alan.B (9/15/2016)


    Gazareth (9/15/2016)


    Also SUM with OVER()

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.[Count1]) OVER () AS [Sum]
    FROM @table t1

    The problem with SUM OVER() in this case (and often with Window Aggregate Functions in general) is that it requires work tables to be written to tempdb, which makes the query slower. Consider the following two queries:

    DECLARE @table TABLE (stateName varchar(20), Count1 int, TotalChargeOffAmount int);

    INSERT @table
    VALUES
        ('Alabama',5,1000),
        ('Arizona',2,4000),
        ('Arkansas',3,5000);

    SET STATISTICS IO ON;

    PRINT 'Using CROSS APPLY and SUM';
    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        [% of Total Count] = (Count1*100.)/CountTotal
    FROM @table t1
    CROSS APPLY (SELECT CountTotal = SUM(Count1) FROM @table) t2;

    PRINT CHAR(13)+CHAR(10)+'Using SUM OVER()';
    SELECT
        stateName,
        Count1,
        TotalChargeOffAmount,
        100.0*Count1 / SUM(t1.Count1) OVER () AS [Sum]
    FROM @table t1;

    SET STATISTICS IO OFF;

    and here are the IO results:

    Using CROSS APPLY and SUM

    Table '#ABCCA541'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Using SUM OVER()

    Table 'Worktable'. Scan count 3, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#ABCCA541'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Using SUM OVER() required 6X the number of reads. This is why I generally avoid Window Aggregate Functions. This is discussed in this article:

    Nasty Fast PERCENT_RANK

    Although I absolutely agree that a large number of reads is usually an indication that something has gone haywire and could stand some real improvement, it's not always the case. "It Depends". In the past, I've seen Paul White solve a running total using a Triangular Join, but the CPU and Duration were incredibly low, rivaling even the Quirky Update. And, Tuesday night, I gave a demo of solving a problem using an initial single-column GROUP BY to pre-aggregate a 7 million row / 8GB table and then used 4 SUM() OVERs to very quickly solve for 4 other sums used to calculate percentages related to the first, along with 4 DENSE_RANKs. There were "only" ~31 thousand reads from the original table and ~482,000 reads from the work table. The SUM() OVERs also replaced the need for 4 other GROUP BY queries that would have required a join to the original GROUP BY to solve the problem. In the cCTE, that would have required the pre-aggregation to execute 4 times, and doing it with a Temp Table already took a bit more than 7 seconds, not to mention the fact that 4 additional queries were starting to make the code a fair bit more complex.

    The code with the initial pre-agg GROUP BY, followed by a cCTE that used SUM() OVER 4 times after that (all in the same query, BTW), solved the problem in 4.5 seconds of duration (INCLUDING display time of ~57,000 rows) and just over 9 seconds of CPU time, and it made the code incredibly simple to read and maintain.

    The reason I came back on this was that I didn't want anyone to develop the habit of avoiding things like SUM() OVER (or any query) based solely on the number of READs sometimes produced. You have to look at the "big 4" (CPU, Reads, Writes, and Duration, the last being the most important to the people using the code) as a complete package compared to how other methods would perform.

    Interesting stuff, both! The actual execution plans are weighted in the other direction, with the CROSS APPLY at 2x the cost of the SUM() OVER().

    Presumably due to 2 table scans with the APPLY versus 1 with SUM() OVER().

  • What you have to remember is that all of the costs are estimates. The only thing that ultimately matters is the actual performance.

    Here's my favorite example on performance. Run the following code with the Actual execution plan turned on. The first query will give a "% of Batch" = 0 and the second query will = 100. Then read the Messages tab and find out that nearly the exact opposite is true.

    RAISERROR('Recursive CTE',0,0) WITH NOWAIT;
    SET STATISTICS TIME,IO ON;
    WITH cte AS
    (
        SELECT N = 1
        UNION ALL
        SELECT N = N + 1
        FROM cte
        WHERE N < 1000000
    )
    SELECT *
    INTO #MyHead1
    FROM cte
    OPTION (MAXRECURSION 0)
    ;
    SET STATISTICS TIME,IO OFF;
    PRINT REPLICATE('-',119)
    ;

    RAISERROR('Pseudo-Cursor',0,0) WITH NOWAIT;
    SET STATISTICS TIME,IO ON;
    SELECT TOP 1000000
        N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    INTO #MyHead2
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2
    ;
    SET STATISTICS TIME,IO OFF;
    PRINT REPLICATE('-',119);
    GO

    DROP TABLE #MyHead1,#MyHead2
    GO

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
