Help needed filling in values based on previous row

  • Background: I have code that calculates daily cumulative point balance per account (from a detailed transaction table) and puts it into a temp table (#PointBalTotal). That table has the account ID and the balance per transaction date (there may be gaps, as there are not transactions every day). I have a another temp table (#DATE_DIM) that has a record per account per calendar day.

    #PointBalTotal has these columns:

    - TransactionDate (datetime)

    - AccountID (int)

    - PointsEarnedTotal (numeric)

    #DATE_DIM has these columns:

    - AccountID (int)

    - Date (datetime)

    In my code, I create #DailyBalance by using an outer join to combine #PointBalTotal and #DATE_DIM in order to have a record per account, per day. Obviously, this creates null values in the PointsEarnedTotal column where #DATE_DIM has a date that #PointBalTotal does not. My challenge is in populating the null values with PointsEarnedTotal from the previous date.

    For example:

    RowNum AccountID date PointsEarnedTotal

    1 55349 2008-08-11 1

    2 55349 2008-08-12 10

    3 55349 2008-08-13 19

    4 55349 2008-08-14 NULL

    5 55349 2008-08-15 NULL

    6 55349 2008-08-16 31

    7 55349 2008-08-17 NULL

    Should be:

    RowNum AccountID date PointsEarnedTotal

    1 55349 2008-08-11 1

    2 55349 2008-08-12 10

    3 55349 2008-08-13 19

    4 55349 2008-08-14 19

    5 55349 2008-08-15 19

    6 55349 2008-08-16 31

    7 55349 2008-08-17 31

    I thought I could use a CTE to get the nulls filled in, but I'm not able to get it to work. It fills in the first instance of the NULL, but not the subsequent ones (For instance, RowNum 4 gets the 19 filled in for PointsEarnedTotal, but RowNum 5 does not; it remains null)

    I am attaching the sample data for #PointBalTotal, #DATE_DIM and the code I am using to create #DailyBalance. My stab at the CTE is included as well.

    I would appreciate any guidance/assistance. (Note: I am not allowed to use while loops or cursors, so I am looking for a set-based solution)

    Thank you,

    Marianne

  • If I understand what you are after correctly, then you just need to join from that table to itself like this:

    SELECT a.TransactionDate, a.AccountID, a.PointsEarnedTotal, b.PointsEarnedTotal as 'YesterdayPointsEarnedTotal'

    FROM #PointBalTotal a

    JOIN #PointBalTotal b ON a.AccountID = b.AccountID

    AND b.TransactionDate = DATEADD(d, -1, a.TransactionDate)

    Watch out with the DATEADD though. If the time's are different you'll need to compare DAY, MONTH, and YEAR each rather than just checking equality.

  • Unfortunately, that won't work. #PointBalTotal has gaps in the dates (it shows the balance only for dates that actually had transactions), which is why I join to #DATE_DIM - to bring in the missing dates. In the example I posted, #PointBalTotal had a balance on 8/13/2008 and 8/16/2008. I need to be able to see that the balance on 8/14 and 8/15 was the same as the balance on 8/13 (the balance is cumulative, and there was no activity on 8/14 and 8/15, so the balance stayed the same).

    Thanks,

    marianne

  • Not allowed to use loops or cursors? I like that =). Add this after #DailyBalance is generated.

    [font="Courier New"]ALTER TABLE #DailyBalance

    ALTER COLUMN RowNum INT NOT NULL

    ALTER TABLE #DailyBalance

    ALTER COLUMN AccountID INT NOT NULL

    ALTER TABLE #DailyBalance

    ADD CONSTRAINT PK_RowNum PRIMARY KEY CLUSTERED (AccountID, RowNum)

    UPDATE #DailyBalance

    SET PointsEarnedTotal = ISNULL(PointsEarnedTotal,(    SELECT TOP 1 PointsEarnedTotal

                                                       FROM #DailyBalance

                                                       WHERE RowNum < D.RowNum AND AccountID = D.AccountID

                                                           AND PointsEarnedTotal IS NOT NULL ORDER BY RowNum DESC))

    FROM #DailyBalance D WITH (INDEX(PK_RowNum))

    SELECT * FROM #DailyBalance[/font]

    It looks like a correlated subquery, but unless I screwed it up, it performs about the same as setting variables. Method based off of: http://qa.sqlservercentral.com/articles/Advanced+Querying/61716/

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • Seth,

    Thank you!! That appears to work perfectly.

    -Marianne

  • My solution is much like Seth's except it didn't require altering your base tables to have rowIDs. You can get the job done just by looking for the last transaction date with any points. I am including it here because the execution plan is interesting. Apparently the optimizer resolves the "SELECT TOP 1" subquery for all instances in one sort operation, producing a small worktable that is joined in with a nested loop. The whole thing ran in about 16ms with the volumes you supplied... faster than I would have expected.

    --------------------- Gaps Problem

    select row_number() over (order by d.accountID,d.date) as rowID,d.*, isnull(p.pointsEarnedTotal

    ,(select top 1 pointsEarnedTotal

    from #pointBalTotal

    where accountID = d.accountID

    and transactionDate < d.date

    and pointsEarnedTotal is not null

    order by transactionDate desc)) as pointsEarnedTotal

    from #date_dim d

    left join #pointBalTotal p on p.accountID = d.accountID and p.transactionDate = d.date

    order by d.accountID,d.date

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Thanks for the additional option - I'll give it a try. It looks like it would allow me to eliminate one of my temp tables (#DailyBalance), which would be nice.

    Thanks,

    Marianne

  • You're very welcome.

    If you have time, please let us know if performance is comparable. I'm sure you can speed it up by making accountID/date and accountID/transaction date the primary keys when you create your two temporary tables.

    Good luck 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Marianne,

    Actually, I'd be very interested in the performance of the queries. Could you tell us how many rows ended up in the table and how many of them had NULLs that needed to be updated... I'd like to do some testing and I'll post the results. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • While my subquery seems to fit all that criteria, I'm not sure that it's creating a triangular join in this case. The subquery method is blazingly fast, it was one of a dozen different things I tried when I was evaluating your running totals method last week, and it was the only one that actually came close to equaling it. It may be that the triangular join is created, but because all the fields are covered by the clustered index, it's quick enough that it's not a big factor.

    My original tests on this in a comparison: http://qa.sqlservercentral.com/Forums/Topic582878-338-1.aspx#bm583453

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • I'll try to spend some time comparing the two methods today or tomorrow. The data I posted yesterday was just a small sample, but since I'm working in dev, even when I run it against all the data, I won't have much there. This is just a small (but critical) part of a larger stored proc I am developing.

    What kind of information do you want? Just the execution plan?

    -Marianne

  • Time and IO stats would be nice too, please 🙂

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Probably:

    CPU

    Reads

    Writes

    Duration

    Number of total rows you are working with

    Number of rows being replaced by the function. (Number of NULLS)

    Nuber of rows in the final output.

    Execution plan

    Seth Phelabaum


    Consistency is only a virtue if you're not a screwup. 😉

    Links: How to Post Sample Data[/url] :: Running Totals[/url] :: Tally Table[/url] :: Cross Tabs/Pivots[/url] :: String Concatenation[/url]

  • ... and prevailing wind direction and velocity. Most important. 😉

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Well, most importantly - in Dallas, TX, the wind is blowing NNW at 10 mph. It's cloudy and 62 degrees. 😀

    I hope I was able to make an equitable comparison. The data that gets into #PointBalTotal and #DATE_DIM gets filtered earlier in the procedure (due to various business rules), so I removed the filter in order to get a larger volume of data with which to work. I have not included the sample data for this larger volume. If necessary, I can provide some additional data, but probably not all of it (see the record counts below).

    I created those tables in tempdb and added primary keys.

    ALTER TABLE tempdb.dbo.DATE_DIM

    ADD CONSTRAINT PK_DateDim PRIMARY KEY CLUSTERED (AccountID, Date)

    ALTER TABLE tempdb.dbo.PointBalTotal

    ADD CONSTRAINT PK_PointBal PRIMARY KEY CLUSTERED (AccountID, TransactionDate)

    tempdb.dbo.PointBalTotal has 90,302 records (37983 distinct AccountID);

    tempdb.dbo.DATE_DIM has 2,889,040 records (37520 distinct AccountID; dates from 8/1/2008 - 10/16/2008).

    Once joined, there are 2,850,672 records with NULL PointsEarnedTotal values to be updated.

    I have attached the output from my queries (MethodOne.txt and MethodTwo.txt).

    Method One - #DailyBalance is created using row_number(), then the null values are updated. I have to do a separate update for null values on the first date at the beginning.

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SET STATISTICS PROFILE ON

    --create #DailyBalance and fill in missing dates using the date dimension table

    SELECT row_number() OVER (PARTITION BY dd.AccountID ORDER BY Date) as RowNum,

    dd.AccountID, Date, PointsEarnedTotal

    INTO #DailyBalance

    FROM tempdb.dbo.PointBalTotal pbt

    RIGHT JOIN tempdb.dbo.DATE_DIM dd on Date = Transactiondate

    AND pbt.AccountID = dd.AccountID

    --if the initial record in the date range has null points, update it to 0

    UPDATE #DailyBalance

    SET PointsEarnedTotal = 0

    WHERE RowNum = 1 AND PointsEarnedTotal is null

    ALTER TABLE #DailyBalance

    ALTER COLUMN RowNum INT NOT NULL

    ALTER TABLE #DailyBalance

    ALTER COLUMN AccountID INT NOT NULL

    ALTER TABLE #DailyBalance

    ADD CONSTRAINT PK_RowNum PRIMARY KEY CLUSTERED (AccountID, RowNum)

    UPDATE #DailyBalance

    SET PointsEarnedTotal

    = ISNULL(PointsEarnedTotal,(SELECT TOP 1 PointsEarnedTotal

    FROM #DailyBalance

    WHERE RowNum < D.RowNum AND AccountID = D.AccountID

    AND PointsEarnedTotal IS NOT NULL ORDER BY RowNum DESC))

    FROM #DailyBalance D WITH (INDEX(PK_RowNum))

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    SET STATISTICS PROFILE OFF

    select 'DailyBalance Count', count(1) from #DailyBalance

    Method Two - #DailyBalance is created without null values (except where the value was null on the first date - I update those at the end)

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SET STATISTICS PROFILE ON

    select --row_number() over (order by d.accountID,d.date) as rowID,

    d.*, isnull(p.pointsEarnedTotal

    ,(select top 1 pointsEarnedTotal

    from tempdb.dbo.pointBalTotal

    where accountID = d.accountID

    and transactionDate < d.date

    and pointsEarnedTotal is not null

    order by transactionDate desc)) as pointsEarnedTotal

    INTO #DailyBalance

    from tempdb.dbo.date_dim d

    left join tempdb.dbo.pointBalTotal p on p.accountID = d.accountID and p.transactionDate = d.date

    order by d.accountID,d.date

    UPDATE #DailyBalance

    SET PointsEarnedTotal = 0

    WHERE PointsEarnedTotal is null

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    SET STATISTICS PROFILE OFF

    select 'DailyBalance Count', count(1) from #DailyBalance

    I hope this give enough information. If not, please let me know.

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply