Compressing out consecutive runs of data quickly

  • This is part of a long and much more complicated process, but I have tried to simplify it for posting. Note that the tables involved are expected to have hundreds of millions of rows with new (and older) data being added all the time. The goal is to compress out adjacent periods that may be created as data is added. Periods can have gaps, but when data is added that create adjacencies, we want to merge them togeter (but only within "like" records, here represented by the value of field A).

    I have a technique that does it (a distillation of what I do in practice which is actually more complex), but it is iterative. What I am looking for is some approach that can do this without iteration, as one big set operation. To demonstrate that, this code has to loop at least twice, but in practice that could be a LOT of loops, with table scans each time.

    Note that the data addition is messy also -- it can split an interval. But it is what it is, and I have that working fine. The main goal of this is to represent sparse and repetitive data; the existing technique (one row per day) generates billions of rows on an old DB2 system, and as we migrate it, we want to redesign to be more efficient in representing the sparse data.

    use somedatabase

    go

    if object_id('runs') is not null drop table runs

    go

    create table runs(A int, FromDate date, ToDate date)

    Insert into runs values (1, '1/1/2011','1/1/2011')

    Insert into runs values (1, '1/2/2011','1/3/2011')

    Insert into runs values (1, '1/5/2011','1/6/2011')

    Insert into runs values (1, '1/7/2011','1/7/2011')

    Insert into runs values (2, '1/1/2011','1/3/2011')

    Insert into runs values (2, '1/4/2011','1/5/2011')

    select * from runs order by a, FromDate

    declare @rows int

    set @rows=-1

    while @rows<>0

    begin

    if object_id('tempdb..#c') is not null drop table #c

    create table #c (A int, FromDate date, ToDate date, NextFromDate date, NextToDate date, Ranking int)

    Insert into #c (A, FromDate, ToDate, NextFromDate, NextToDate, Ranking)

    select r1.A, r1.FromDate, r1.ToDate, r2.FromDate as NextFromdate, r2.ToDate as NextTodate, Row_number() over (partition by r1.A order by r1.FromDate asc)

    from runs r1

    inner join runs r2 on r1.A=r2.A and Dateadd(day,1,r1.ToDate)=r2.FromDate

    Update r

    set r.ToDate=#c.NextToDate

    from runs r

    inner join #c on r.A=#c.A and r.FromDate=#c.FromDate

    where #c.Ranking=1

    Delete r

    from runs r

    inner join #c on r.A=#c.A and r.FromDate=#c.NextFromDate

    where #c.Ranking=1

    set @rows=@@rowcount

    end

    select * from runs order by a, FromDate

  • Hopefully this article by Jeff Moden, will give you a starting point for what you need to do.

    http://qa.sqlservercentral.com/articles/T-SQL/71550/

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • I will think through it. It is a very clever approach, the relationship of the date and row_number. My problem is that the original dates are already partially compressed out to ranges, so there is a need for considering both adjacent rows (i.e. whose start dates are consecutive) but rows representing an interval where the end of one is adjacent to the beginning of another.

    But an interesting approach I had not considered or stumbled across. Thank you.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply