Mass Percentile(like) calculation

  • Hi All,

    I have a large dataset containing distributions of a variety of measures. I need to extract what I'd call smoothed percentiles - find the appropriate record for the percentile in question, then average N values either side of it. I also need to get these results for multiple percentile points at once, eg 0.3, 0.5 , 0.7. I'm not a whiz with SQL so please tolerate ugly code in my attempt so far!

    A table can be created using

    CREATE TABLE PercTest

    (Trial smallint, Measure char(12), Value float)

    INSERT INTO PercTest

    SELECT '1', 'Investments', '100.5' UNION ALL

    SELECT '2', 'Investments', '104.8' UNION ALL

    SELECT '3', 'Investments', '92.5' UNION ALL

    SELECT '4', 'Investments', '107.1' UNION ALL

    SELECT '5', 'Investments', '100.4' UNION ALL

    SELECT '6', 'Investments', '93.5' UNION ALL

    SELECT '7', 'Investments', '110.4' UNION ALL

    SELECT '8', 'Investments', '101.3' UNION ALL

    SELECT '9', 'Investments', '109.9' UNION ALL

    SELECT '10', 'Investments', '98.1' UNION ALL

    SELECT '1', 'Premium', '20.6' UNION ALL

    SELECT '2', 'Premium', '43.3' UNION ALL

    SELECT '3', 'Premium', '62.4' UNION ALL

    SELECT '4', 'Premium', '10.5' UNION ALL

    SELECT '5', 'Premium', '13.6' UNION ALL

    SELECT '6', 'Premium', '53.8' UNION ALL

    SELECT '7', 'Premium', '70.2' UNION ALL

    SELECT '8', 'Premium', '39.5' UNION ALL

    SELECT '9', 'Premium', '12.7' UNION ALL

    SELECT '10', 'Premium', '18.5'

    For these inputs, with a window size of one trial either side, what I'd hope for as an output is:

    Measure | Value| Percentile

    Investments | 97.33| 0.3

    Premium | 14.93| 0.3

    Investments |100.73| 0.5

    Premium | 26.2 | 0.5

    Investments |103.05| 0.7

    Premium | 41.4 | 0.7

    So far I have a fairly painstaking way of doing this by creating an intermediate table which adds the percentiles like so:

    SELECT

    Trial,

    Measure,

    Value,

    CONVERT(float,

    (RANK()OVER(

    PARTITION BY Measure

    ORDER BY Value ASC)))

    /(MAX(Trial) OVER(PARTITION BY Measure)) AS Percentile

    INTO PercTestIntermediate

    FROM PercTest

    And then I have this little bit of code which extracts values for all measures but only the one given percentile.

    DECLARE @target float

    DECLARE @window_size float

    DECLARE @window_offset float

    --Change the desired percentile here

    SET @target = 0.7

    --Change the size of the averaging window - note this is the number of trials

    --which will be considered from each side. This is float to force floating

    --point division

    SET @window_size = 1.0

    SELECT @window_offset=@window_size/MAX(Trial) FROM [ICMDataMartDev].[HISCOX\LawrencA].[PercTest]

    SELECT Measure, Avg(Value) AS Value, @target AS Percentile

    FROM [ICMDataMartDev].[HISCOX\LawrencA].[PercTestIntermediate]

    WHERE Percentile between @target-@window_offset and @target+@window_offset

    GROUP BY Measure

    I'd appreciate input on the value of the intermediate table (I don't know if that could be used as one of these CTEs I've seen mentioned) but the thing that's giving me the biggest headache at the moment is trying to get multiple percentile 'windows' calculated in one query. I can run my kludgy one with different values and UNION ALL the results, but that strikes me as A Bad Way To Do Things.

    What Would SQLServerCentral Folks Do?

  • Try the following out. It uses Jeff Moden's DelimitedSplit8K function which I didn't include. I highly recommend that you read the (absolutely awesome!) article on it (http://qa.sqlservercentral.com/articles/Tally+Table/72993/). My method uses Excel's method of calculating the percentile, which is just one of many. I selected that one because it can easily be verified. I think that this code doesn't work properly for the 0 and 100 percentiles but those are just the min and max of the data.

    DECLARE @Percentiles AS VARCHAR(100) = '0.3, 0.5, 0.7'; --Desired percentiles

    WITH Data

    AS ( SELECT Measure ,

    Value ,

    ROW_NUMBER() OVER ( PARTITION BY Measure ORDER BY Value ) AS rn

    FROM dbo.PercTest AS pt

    ),

    Percentiles

    AS ( SELECT CAST(Item AS DECIMAL(9, 5)) AS Percentile

    FROM dbo.DelimitedSplit8k(@Percentiles,',') AS DS8K --Search SQLServerCentral for Jeff Moden's DelimitedSplit8K for code

    ),

    Calc

    AS ( SELECT Data.Measure ,

    Percentiles.Percentile ,

    CAST(( COUNT(*) - 1 ) * Percentiles.Percentile AS INT) --Uses Excel's version of calculating percentile

    + 1 AS k ,

    ( COUNT(*) - 1 ) * Percentiles.Percentile

    - CAST(( COUNT(*) - 1 ) * Percentiles.Percentile AS INT) AS d ,

    COUNT(*) AS totalCount

    FROM Data

    CROSS JOIN Percentiles

    GROUP BY Data.Measure ,

    Percentiles.Percentile

    )

    SELECT Data.Measure ,

    Calc.Percentile ,

    SUM(CASE WHEN Data.rn = Calc.k THEN ( 1 - Calc.d ) * Data.Value

    WHEN Data.rn = Calc.k + 1 THEN Calc.d * Data.Value

    ELSE NULL

    END) AS Value

    FROM Data

    INNER JOIN Calc ON Data.Measure = Calc.Measure

    AND Data.rn IN ( Calc.k, Calc.k + 1 )

    GROUP BY Data.Measure ,

    Calc.Percentile

    ORDER BY Data.Measure ,

    Calc.Percentile

    Edit: Error in the SQL code

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply