HISTORIC SNAPSHOTS OF DATA - COUNT (*) - GROUP BY

  • Hi

    Having recently set up a Data Warehouse at work, we have been asked to come up with a solution to the following. I was wondering if anyone counld advise on the best way to achieve this.

    We need to be able to count how many patients were in the hospital at any given time in the past.

    Each patient is given a an admissions and discharge date. if a patient does not have a discharge date then they are considered to still be in the hospital.

    I need to be able to enter a start and end date for any given period in the past (upto and including today) and have a count of patients who were admitted and discharged in that period, by dates within that period.

    Below is some sample data followed by expected results. Thanks in advance for any advice.

    SAMPLE DATA

    drop table PS_TestForOnline

    CREATE TABLE PS_TestForOnline

    (

    rowkey int,

    crn int,

    admdate date ,

    disdate date,

    )

    INSERT INTO PS_TestForOnline

    VALUES('1','11111','2011/01/01','2011/01/07' );

    INSERT INTO PS_TestForOnline

    VALUES('2','22222','2011/01/02','2011/01/07');

    INSERT INTO PS_TestForOnline

    VALUES('3','33333','2011/01/03','2011/01/07');

    INSERT INTO PS_TestForOnline

    VALUES('4','44444','2011/01/04','2011/01/07');

    INSERT INTO PS_TestForOnline

    VALUES('5','55555','2011/01/05','2011/01/07' );

    INSERT INTO PS_TestForOnline

    VALUES('6','66666','2011/01/06','2011/01/07' );

    INSERT INTO PS_TestForOnline

    VALUES('7','77777','2011/01/07','2011/01/07');

    INSERT INTO PS_TestForOnline

    VALUES('8','88888','2011/01/08', (NULL));

    select * from PS_TestForOnline

    EXPECTED RESULTS for period 1st to 8th Jan

    CREATE TABLE PS_TestForOnline_Answer

    (

    DATE DATE ,

    COUNT_OF_PATIENTS INT,

    );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/01','1' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/02','2' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/03','3' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/04','4' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/05','5' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/06','6' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/07','7' );

    INSERT INTO PS_TestForOnline_Answer

    VALUES('2011/01/08','1' );

    select * from PS_TestForOnline_Answer

  • [EDIT: Just found this is wrong and gives wrong count on last day; left here as it may give some inspiration. Am working on as output is currently annoying me :)]

    This'll get you somewhere near there.

    You can either use a loop (instead of params table) to iterate from first date to last date.

    Alternatively one of them fancy recursive CTEs might get you there

    DROP TABLE params;

    CREATE TABLE params (StartDate date, EndDate date);

    INSERT INTO params VALUES ('2011-01-01', '2011-01-01');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-02');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-03');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-04');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-05');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-06');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-07');

    INSERT INTO params VALUES ('2011-01-01', '2011-01-08');

    WITH dateparams AS ( SELECT StartDate, EndDate FROM params )

    SELECT COALESCE(t2.EndDate, t1.admdate) AS DateCounted, COUNT(*)

    FROM PS_TestForOnline AS t1

    LEFT JOIN dateparams AS t2

    ON t1.admdate BETWEEN t2.StartDate AND t2.EndDate

    OR t1.disdate BETWEEN t2.StartDate AND t2.EndDate

    GROUP BY t2.StartDate, COALESCE(t2.EndDate, t1.admdate)

    ORDER BY COALESCE(t2.EndDate, t1.admdate)

    ---------------------------------------------
    If data can screw you, it will; never assume the data are correct.

  • A conditional SUM() based on a calendar table should do it.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Sorry, the query has stumped me...

    However taking a step back from the problem makes me wonder whether it can be resolved with SQL, maybe a snapshot table is required, that can be updated daily.

    Even if we want 1st to the 8th, any query will only show patients from that time period, unless it goes back to day one to keep a tally.

    In short.

    Even on the 1st Jan 2011, there could already be 40 patients from previous days.

    Therefore, if on 1st Jan a patient is admitted that takes the tally to 41, not necessarily 1 as shown in the sample data (which is where I went flying off, trying to write SQL earlier :blush: ).

    If a nightly job were run to add admissions and discharged each day, a snapshot table could be filled with these values and easily referenced.

    Dunno, stumped...

    ---------------------------------------------
    If data can screw you, it will; never assume the data are correct.

  • I have the following script which works.

    Although when running against the live data in the warehouse it take a long, long time. I am running it against a table of about 800,000 admissions.

    I does give me correct answer. i have indexed the date fields but it still takes a good 2 mins to do the last 2 months of activity.

    Any Ideas on how to enhance performace??

    declare @startDate datetime, @endDate datetime;

    set @startDate = '20101230';

    set @endDate = '20110112';

    with dates as -- this part is just generating a calendar for period

    (

    select @startDate as [date]

    union all

    select dateadd(dd,1,[date]) from dates where [date] <= @endDate

    )

    select

    [date],

    p.count_of_patients

    from

    dates d

    cross apply

    (

    select

    count(*) as count_of_patients

    from

    PS_TestForOnline

    where

    admdate <= d.date

    and (disdate >= d.date or disdate is null)

    ) p

  • Rewriting as a join (using your logic, which I couldn't get my head around yesterday - trying to get back up to speed with SQL) gives:

    With (550000 source rows)

    DECLARE @startDate datetime, @endDate datetime;

    SET @startDate = '20101230';

    SET @endDate = '20110112';

    WITH dates AS -- this part is just generating a calendar for period

    (

    SELECT @startDate as [date]

    UNION ALL

    SELECT DATEADD(dd, 1, [date]) FROM dates WHERE [date] <= @endDate

    )

    SELECT [date], COUNT(PS_TestForOnline.rowkey)

    FROM PS_TestForOnline

    RIGHT JOIN dates

    ON admdate <= dates.[date]

    AND (disdate >= dates.[date] OR disdate IS NULL)

    GROUP BY [date]

    But it's no faster than yours and consumes more resource.

    Looking at statistics the JOIN gives.

    Table 'Worktable'. Scan count 4, logical reads 3094315, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'PS_TestForOnline'. Scan count 2, logical reads 3322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 9922 ms, elapsed time = 5089 ms

    CROSS APPLY gives

    Table 'PS_TestForOnline'. Scan count 15, logical reads 24915, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 2636 ms, elapsed time = 2634 ms.

    Conclusion CROSS APPLY quicker and more efficient, time taken JOIN takes ~5s APPLY ~2.5s

    Looking at covering indexes gives:

    JOIN

    CREATE NONCLUSTERED INDEX idx_admdatedisdate

    ON [dbo].[PS_TestForOnline] ([admdate],[disdate])

    INCLUDE ([rowkey])

    Table 'PS_TestForOnline'. Scan count 15, logical reads 24870, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 3744 ms, elapsed time = 3780 ms.

    CROSS APPLY gives

    CREATE NONCLUSTERED INDEX idx_admdatedisdate2

    ON [dbo].[PS_TestForOnline] ([admdate],[disdate])

    Table 'PS_TestForOnline'. Scan count 15, logical reads 20775, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1794 ms, elapsed time = 1782 ms.

    Conclusion CROSS APPLY quicker and more efficient, time taken JOIN takes ~4s APPLY ~2s

    Interested to see where this ends up.

    ---------------------------------------------
    If data can screw you, it will; never assume the data are correct.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply