HISTORIC SNAPSHOTS OF DATA - COUNT (*) - GROUP BY

Question

Post reply

HISTORIC SNAPSHOTS OF DATA - COUNT (*) - GROUP BY

LoosinMaMind

Hall of Fame

Points: 3366
More actions
March 27, 2011 at 10:09 am

#389148

Hi

Having recently set up a Data Warehouse at work, we have been asked to come up with a solution to the following. I was wondering if anyone counld advise on the best way to achieve this.

We need to be able to count how many patients were in the hospital at any given time in the past.

Each patient is given a an admissions and discharge date. if a patient does not have a discharge date then they are considered to still be in the hospital.

I need to be able to enter a start and end date for any given period in the past (upto and including today) and have a count of patients who were admitted and discharged in that period, by dates within that period.

Below is some sample data followed by expected results. Thanks in advance for any advice.

SAMPLE DATA

drop table PS_TestForOnline

CREATE TABLE PS_TestForOnline

(

rowkey int,

crn int,

admdate date ,

disdate date,

)

INSERT INTO PS_TestForOnline

VALUES('1','11111','2011/01/01','2011/01/07' );

INSERT INTO PS_TestForOnline

VALUES('2','22222','2011/01/02','2011/01/07');

INSERT INTO PS_TestForOnline

VALUES('3','33333','2011/01/03','2011/01/07');

INSERT INTO PS_TestForOnline

VALUES('4','44444','2011/01/04','2011/01/07');

INSERT INTO PS_TestForOnline

VALUES('5','55555','2011/01/05','2011/01/07' );

INSERT INTO PS_TestForOnline

VALUES('6','66666','2011/01/06','2011/01/07' );

INSERT INTO PS_TestForOnline

VALUES('7','77777','2011/01/07','2011/01/07');

INSERT INTO PS_TestForOnline

VALUES('8','88888','2011/01/08', (NULL));

select * from PS_TestForOnline

EXPECTED RESULTS for period 1st to 8th Jan

CREATE TABLE PS_TestForOnline_Answer

(

DATE DATE ,

COUNT_OF_PATIENTS INT,

);

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/01','1' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/02','2' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/03','3' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/04','4' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/05','5' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/06','6' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/07','7' );

INSERT INTO PS_TestForOnline_Answer

VALUES('2011/01/08','1' );

select * from PS_TestForOnline_Answer

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply

KevinGurney SSCommitted Points: 1550 More actions · Answer 1

[EDIT: Just found this is wrong and gives wrong count on last day; left here as it may give some inspiration. Am working on as output is currently annoying me :)]

This'll get you somewhere near there.

You can either use a loop (instead of params table) to iterate from first date to last date.

Alternatively one of them fancy recursive CTEs might get you there



DROP TABLE params;

CREATE TABLE params (StartDate date, EndDate date);

INSERT INTO params VALUES ('2011-01-01', '2011-01-01');

INSERT INTO params VALUES ('2011-01-01', '2011-01-02');

INSERT INTO params VALUES ('2011-01-01', '2011-01-03');

INSERT INTO params VALUES ('2011-01-01', '2011-01-04');

INSERT INTO params VALUES ('2011-01-01', '2011-01-05');

INSERT INTO params VALUES ('2011-01-01', '2011-01-06');

INSERT INTO params VALUES ('2011-01-01', '2011-01-07');

INSERT INTO params VALUES ('2011-01-01', '2011-01-08');

WITH dateparams AS ( SELECT StartDate, EndDate FROM params )

SELECT COALESCE(t2.EndDate, t1.admdate) AS DateCounted, COUNT(*)

  FROM PS_TestForOnline AS t1

    LEFT JOIN dateparams AS t2

      ON t1.admdate BETWEEN t2.StartDate AND t2.EndDate

	OR t1.disdate BETWEEN t2.StartDate AND t2.EndDate

  GROUP BY t2.StartDate, COALESCE(t2.EndDate, t1.admdate)

  ORDER BY COALESCE(t2.EndDate, t1.admdate)

---------------------------------------------
If data can screw you, it will; never assume the data are correct.

LutzM SSC Guru Points: 107049 More actions · Answer 2

LutzM

SSC Guru

Points: 107049

March 27, 2011 at 1:39 pm

#1304105

A conditional SUM() based on a calendar table should do it.

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

KevinGurney SSCommitted Points: 1550 More actions · Answer 3

Sorry, the query has stumped me...

However taking a step back from the problem makes me wonder whether it can be resolved with SQL, maybe a snapshot table is required, that can be updated daily.

Even if we want 1st to the 8th, any query will only show patients from that time period, unless it goes back to day one to keep a tally.

In short.

Even on the 1st Jan 2011, there could already be 40 patients from previous days.

Therefore, if on 1st Jan a patient is admitted that takes the tally to 41, not necessarily 1 as shown in the sample data (which is where I went flying off, trying to write SQL earlier :blush: ).

If a nightly job were run to add admissions and discharged each day, a snapshot table could be filled with these values and easily referenced.

Dunno, stumped...

---------------------------------------------
If data can screw you, it will; never assume the data are correct.

LoosinMaMind Hall of Fame Points: 3366 More actions · Answer 4

I have the following script which works.

Although when running against the live data in the warehouse it take a long, long time. I am running it against a table of about 800,000 admissions.

I does give me correct answer. i have indexed the date fields but it still takes a good 2 mins to do the last 2 months of activity.

Any Ideas on how to enhance performace??

declare @startDate datetime, @endDate datetime;

set @startDate = '20101230';

set @endDate = '20110112';

with dates as -- this part is just generating a calendar for period

(

select @startDate as [date]

union all

select dateadd(dd,1,[date]) from dates where [date] <= @endDate

)

select

[date],

p.count_of_patients

from

dates d

cross apply

(

select

count(*) as count_of_patients

from

PS_TestForOnline

where

admdate <= d.date

and (disdate >= d.date or disdate is null)

) p

KevinGurney SSCommitted Points: 1550 More actions · Answer 5

Rewriting as a join (using your logic, which I couldn't get my head around yesterday - trying to get back up to speed with SQL) gives:

With (550000 source rows)



DECLARE @startDate datetime, @endDate datetime;

SET @startDate = '20101230';

SET @endDate = '20110112';

WITH dates AS -- this part is just generating a calendar for period

(

  SELECT @startDate as [date]

    UNION ALL

    SELECT DATEADD(dd, 1, [date]) FROM dates WHERE [date] <= @endDate

)

SELECT [date], COUNT(PS_TestForOnline.rowkey)

  FROM PS_TestForOnline

    RIGHT JOIN dates

      ON admdate <= dates.[date]

        AND (disdate >= dates.[date] OR disdate IS NULL)

  GROUP BY [date]

But it's no faster than yours and consumes more resource.

Looking at statistics the JOIN gives.

Table 'Worktable'. Scan count 4, logical reads 3094315, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'PS_TestForOnline'. Scan count 2, logical reads 3322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 9922 ms, elapsed time = 5089 ms

CROSS APPLY gives

Table 'PS_TestForOnline'. Scan count 15, logical reads 24915, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 2636 ms, elapsed time = 2634 ms.

Conclusion CROSS APPLY quicker and more efficient, time taken JOIN takes ~5s APPLY ~2.5s

Looking at covering indexes gives:

JOIN

CREATE NONCLUSTERED INDEX idx_admdatedisdate

ON [dbo].[PS_TestForOnline] ([admdate],[disdate])

INCLUDE ([rowkey])

Table 'PS_TestForOnline'. Scan count 15, logical reads 24870, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 3744 ms, elapsed time = 3780 ms.

CROSS APPLY gives

CREATE NONCLUSTERED INDEX idx_admdatedisdate2

ON [dbo].[PS_TestForOnline] ([admdate],[disdate])

Table 'PS_TestForOnline'. Scan count 15, logical reads 20775, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 2, logical reads 91, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 1794 ms, elapsed time = 1782 ms.

Conclusion CROSS APPLY quicker and more efficient, time taken JOIN takes ~4s APPLY ~2s

Interested to see where this ends up.

---------------------------------------------
If data can screw you, it will; never assume the data are correct.