Compressing out consecutive runs of data quickly

Question

Post reply

Compressing out consecutive runs of data quickly

Ferguson

SSCertifiable

Points: 5837
More actions
April 30, 2011 at 11:47 am

#249689

This is part of a long and much more complicated process, but I have tried to simplify it for posting. Note that the tables involved are expected to have hundreds of millions of rows with new (and older) data being added all the time. The goal is to compress out adjacent periods that may be created as data is added. Periods can have gaps, but when data is added that create adjacencies, we want to merge them togeter (but only within "like" records, here represented by the value of field A).

I have a technique that does it (a distillation of what I do in practice which is actually more complex), but it is iterative. What I am looking for is some approach that can do this without iteration, as one big set operation. To demonstrate that, this code has to loop at least twice, but in practice that could be a LOT of loops, with table scans each time.

Note that the data addition is messy also -- it can split an interval. But it is what it is, and I have that working fine. The main goal of this is to represent sparse and repetitive data; the existing technique (one row per day) generates billions of rows on an old DB2 system, and as we migrate it, we want to redesign to be more efficient in representing the sparse data.

use somedatabase go if object_id('runs') is not null drop table runs go create table runs(A int, FromDate date, ToDate date) Insert into runs values (1, '1/1/2011','1/1/2011') Insert into runs values (1, '1/2/2011','1/3/2011') Insert into runs values (1, '1/5/2011','1/6/2011') Insert into runs values (1, '1/7/2011','1/7/2011') Insert into runs values (2, '1/1/2011','1/3/2011') Insert into runs values (2, '1/4/2011','1/5/2011') select * from runs order by a, FromDate declare @rows int set @rows=-1 while @rows<>0 begin if object_id('tempdb..#c') is not null drop table #c create table #c (A int, FromDate date, ToDate date, NextFromDate date, NextToDate date, Ranking int) Insert into #c (A, FromDate, ToDate, NextFromDate, NextToDate, Ranking) select r1.A, r1.FromDate, r1.ToDate, r2.FromDate as NextFromdate, r2.ToDate as NextTodate, Row_number() over (partition by r1.A order by r1.FromDate asc) from runs r1 inner join runs r2 on r1.A=r2.A and Dateadd(day,1,r1.ToDate)=r2.FromDate Update r set r.ToDate=#c.NextToDate from runs r inner join #c on r.A=#c.A and r.FromDate=#c.FromDate where #c.Ranking=1 Delete r from runs r inner join #c on r.A=#c.A and r.FromDate=#c.NextFromDate where #c.Ranking=1 set @rows=@@rowcount end select * from runs order by a, FromDate

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply

Ron McCullough SSC Guru Points: 63877 More actions · Answer 1

Hopefully this article by Jeff Moden, will give you a starting point for what you need to do.

http://qa.sqlservercentral.com/articles/T-SQL/71550/

If everything seems to be going well, you have obviously overlooked something.

Ron

Please help us, help you -before posting a question please read[/url]
Before posting a performance problem please read[/url]

Ferguson SSCertifiable Points: 5837 More actions · Answer 2

I will think through it. It is a very clever approach, the relationship of the date and row_number. My problem is that the original dates are already partially compressed out to ranges, so there is a need for considering both adjacent rows (i.e. whose start dates are consecutive) but rows representing an interval where the end of one is adjacent to the beginning of another.

But an interesting approach I had not considered or stumbled across. Thank you.