Populate new record data based on previous month record data

  • Hi SQL gurus,

    I have a fact table that looks like the one below. Currently it has two quarters of data loaded: June (period 06) and September (period 09). For some accounts, data was captured in period 06 but not in period 09. Please see the example below:

    [Original Fact table]

    Account | Organization | Period | Year | Amount
    -------------------------------------------------
    CMP1000 | PM1          | 06     | 2011 | 100
    CMP1001 | PM1          | 06     | 2011 | 250
    CMP1000 | PM1          | 09     | 2011 | 400
    CMP1002 | PM1          | 09     | 2011 | 500

    If there is no data in period 09 but there is data in period 06 for a particular account, I would like to insert a new record so that period 09 takes on the period 06 data. Please see my desired output below:

    [Derived Fact table]

    Account | Organization | Period | Year | Amount
    -------------------------------------------------
    CMP1000 | PM1          | 06     | 2011 | 100
    CMP1001 | PM1          | 06     | 2011 | 250
    CMP1000 | PM1          | 09     | 2011 | 400
    CMP1002 | PM1          | 09     | 2011 | 500
    CMP1001 | PM1          | 09     | 2011 | 250

    I would appreciate it if any gurus out there could assist me with this problem. Thanks!

  • Hi,

    This is possible using the SQL MERGE statement. You can do something like this:

    --Test Data
    declare @SampleTable as table
    (
         Account      varchar(10)
        ,Organization varchar(20)
        ,Period       varchar(2)
        ,[Year]       int
        ,Amount       money
    )

    insert into @SampleTable
        (Account, Organization, Period, [Year], Amount)
    select 'CMP1000','PM1','06',2011,100 union
    select 'CMP1001','PM1','06',2011,250 union
    select 'CMP1000','PM1','09',2011,400 union
    select 'CMP1002','PM1','09',2011,500
    -----------------------------------------------------
    ;with cte as
    (
        select Account, Organization, Period, [Year], Amount
        from @SampleTable
        where Period = '06'
        --<required_condition>
    )
    merge @SampleTable as T
    using (
        select Account, Organization, Period, [Year], Amount
        from cte
    ) as S on (T.Account = S.Account and T.Period = '09')
    when not matched then
        -- Insert the missing period 09 row, carrying the period 06 amount forward.
        insert (Account, Organization, Period, [Year], Amount)
        values (S.Account, S.Organization, '09', S.[Year], S.Amount);

    select * from @SampleTable order by Period asc

    Though I have used static values for the periods, you can pass them in as parameters. Hope this is what you want. 🙂

    ** MERGE is only available in SQL Server 2008 and later. If you are on an earlier version (2000 or 2005), this will not work; you would have to use a join-based insert instead, along the lines of the sketch below.
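
    For reference, a minimal pre-2008 sketch against the same @SampleTable variable, still assuming the hard-coded periods 06 and 09 from above: an INSERT ... SELECT that skips accounts which already have a period 09 row.

    insert into @SampleTable (Account, Organization, Period, [Year], Amount)
    select p6.Account, p6.Organization, '09', p6.[Year], p6.Amount
    from @SampleTable as p6
    where p6.Period = '06'
      and not exists (
            select 1
            from @SampleTable as p9
            where p9.Account = p6.Account
              and p9.Period  = '09'
          )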

    --------
    Manjuke
    http://www.manjuke.com

  • Hi manjuke,

    Thanks for the reply, but your solution does not seem flexible enough, as it only compares periods 06 and 09. My actual fact table contains periods 01 to 12. Ideally, the script would compare each period to the previous period (period - 1) and then populate a new record.

    Thanks.

  • Let's make sure I have this right. You have an "OriginalFact" table and you want to populate data into a "DerivedFact" table, using the "prior" month's data if "this" month's data is missing, without affecting the data in the "OriginalFact" table. Is that right?

    Does the data wrap years, meaning that the "prior" month to January is December?

  • Hi fahey.jonathan,

    Let's make sure I have this right. You have an "OriginalFact" table and you want to populate data into a "DerivedFact" table, using the "prior" month's data if "this" month's data is missing, without affecting the data in the "OriginalFact" table. Is that right?

    Yes

    Does the data wrap years, meaning that the "prior" month to January is December?

    Yes. For example, the prior month of January 2011 is December 2010.

  • CREATE -- DROP
    TABLE OriginalFact
    (
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL,
        [Year]       INTEGER     NOT NULL,
        Period       SMALLINT    NOT NULL,
        Amount       INTEGER     NOT NULL
    )

    CREATE -- DROP
    TABLE DerivedFact
    (
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL,
        [Year]       INTEGER     NOT NULL,
        Period       SMALLINT    NOT NULL,
        Amount       INTEGER     NOT NULL,
        TypeCode     CHAR(1)     NOT NULL CHECK (TypeCode IN ('C', 'P'))
    )

    -- Sample data for the OriginalFact table.
    INSERT INTO OriginalFact
        (Account, Organization, [Year], Period, Amount)
    SELECT Account, Organization, [Year], Period, Amount
    FROM (VALUES
        ('CMP1000', 'PM1', 2011, 11, 100),
        ('CMP1001', 'PM1', 2011, 11, 101),
        ('CMP1002', 'PM1', 2011, 11, 102),
        ('CMP1003', 'PM1', 2011, 11, 103),
        ('CMP1001', 'PM1', 2011, 12, 111),
        ('CMP1002', 'PM1', 2011, 12, 112),
        ('CMP1003', 'PM1', 2011, 12, 113),
        ('CMP1002', 'PM1', 2012, 01, 122),
        ('CMP1003', 'PM1', 2012, 01, 123)
    ) x (Account, Organization, [Year], Period, Amount)

    -- Set the variables for the year and period desired.
    -- Run this section for each month.
    DECLARE @Year INTEGER = 2011
    DECLARE @Period SMALLINT = 11

    INSERT INTO DerivedFact
        (Account, Organization, [Year], Period, Amount, TypeCode)
    SELECT ISNULL(c.Account, p.Account) AS Account,
           ISNULL(c.Organization, p.Organization) AS Organization,
           @Year AS [Year],
           @Period AS Period,
           ISNULL(c.Amount, p.Amount) AS Amount,
           CASE WHEN c.Account IS NULL THEN 'P' ELSE 'C' END -- Either "C"urrent value or "P"rior value.
    FROM (-- Get records for the Current month from the OriginalFact table.
          SELECT Account, Organization, [Year], Period, Amount
          FROM OriginalFact
          WHERE [Year] = @Year
            AND Period = @Period
         ) c -- Current
    FULL JOIN (-- Get records for the Prior month.
          SELECT Account, Organization, [Year], Period, Amount
          --FROM OriginalFact -- Use OriginalFact if values missing last month should NOT be included this month. See Notes **
          FROM DerivedFact -- Use DerivedFact if values missing last month SHOULD be included this month.
          WHERE [Year] = CASE WHEN @Period = 1 THEN @Year - 1 ELSE @Year END
            AND Period = CASE WHEN @Period = 1 THEN 12 ELSE @Period - 1 END
         ) p -- Prior
      ON p.Account = c.Account
     AND p.Organization = c.Organization

    SELECT *
    FROM DerivedFact
    ORDER BY Account, Organization, [Year], Period

    /* Notes
    ** CMP1000 was missing for both 2011/12 and 2012/01. If the values from 2011/11 should be
       rolled forward into 2011/12 and again into 2012/01, then use DerivedFact because it will
       contain records for CMP1000 - CMP1003. If CMP1000 should not be included in 2012/01 because
       it was not in 2011/12 or in 2012/01, then use OriginalFact because it will contain only
       CMP1001 - CMP1003 for the prior month.
    */

  • Hi fahey.jonathan,

    Based on your sample data here:

    ('CMP1000', 'PM1', 2011, 11, 100),
    ('CMP1001', 'PM1', 2011, 11, 101),
    ('CMP1002', 'PM1', 2011, 11, 102),
    ('CMP1003', 'PM1', 2011, 11, 103),
    ('CMP1001', 'PM1', 2011, 12, 111),
    ('CMP1002', 'PM1', 2011, 12, 112),
    ('CMP1003', 'PM1', 2011, 12, 113),
    ('CMP1002', 'PM1', 2012, 01, 122),
    ('CMP1003', 'PM1', 2012, 01, 123)

    The desired output should be:

    ('CMP1000', 'PM1', 2011, 11, 100), --Row A
    ('CMP1001', 'PM1', 2011, 11, 101), --Row B
    ('CMP1002', 'PM1', 2011, 11, 102),
    ('CMP1003', 'PM1', 2011, 11, 103),
    ('CMP1001', 'PM1', 2011, 12, 111),
    ('CMP1002', 'PM1', 2011, 12, 112),
    ('CMP1003', 'PM1', 2011, 12, 113),
    ('CMP1000', 'PM1', 2011, 12, 100), --Copy from Row A
    ('CMP1002', 'PM1', 2012, 01, 122),
    ('CMP1003', 'PM1', 2012, 01, 123),
    ('CMP1000', 'PM1', 2012, 01, 100), --Copy from Row A
    ('CMP1001', 'PM1', 2012, 01, 101)  --Copy from Row B

    Originally, the DerivedFact table did not exist in the database. I am only translating the original fact table into a derived view, which is what I am now calling the DerivedFact table. Sorry for the confusion.

    Thanks.

  • Two questions:

    1 - Are you saying that DerivedFact is a view, not a table?

    2 - Why is CMP1001 copied from two months prior (2011/11) rather than one month prior (2011/12)? You have the 2011/11 row labeled as Row B; I expected the 2011/12 version to be labeled as Row B. I thought the goal was to bring the prior month forward if it existed, which for 2012/01 would be 2011/12.

  • Hi,

    Apologies for my mistake. The output of the view should look like this:

    ('CMP1000', 'PM1', 2011, 11, 100),
    ('CMP1001', 'PM1', 2011, 11, 101),
    ('CMP1002', 'PM1', 2011, 11, 102),
    ('CMP1003', 'PM1', 2011, 11, 103),
    ('CMP1001', 'PM1', 2011, 12, 111),
    ('CMP1002', 'PM1', 2011, 12, 112),
    ('CMP1003', 'PM1', 2011, 12, 113),
    ('CMP1000', 'PM1', 2011, 12, 100),
    ('CMP1002', 'PM1', 2012, 01, 122),
    ('CMP1003', 'PM1', 2012, 01, 123),
    ('CMP1000', 'PM1', 2012, 01, 100),
    ('CMP1001', 'PM1', 2012, 01, 111)

    The goal was to always bring the prior month forward if it existed. Sorry for my mistake.

    DerivedFact is a view, not a table, because DerivedFact did not originally exist as a table in the database.

    Thanks.

  • I'm going to suggest a different strategy because of the complexity of the requirement. You need to bring forward each row from the prior month, meaning that if a row is missing for three months, you either need a recursive call to roll it forward one month at a time, or a cross join to get the last available prior record. Both of those are going to be fairly expensive, especially with a large data set. If you will be using this data with some frequency, running those queries every time will probably be taxing on the server.

    I recommend that you add a column to your fact table that indicates whether the record is a base fact or a derived fact. Once a month, when you add data, insert the base facts marked with a B ("Base") or F ("Fact"), etc. Then insert the derived facts from the prior month where applicable, marked with a D ("Derived"), etc. This way, when using the fact table, no calculations need to be done at query time, and your derived facts are indexed along with the base facts, so queries should be fast. A rough sketch of that monthly load follows below.

    If you need a version of the table that contains only base facts, create a view for it by selecting only records with a B (or F, etc.).
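
    A minimal sketch of that monthly load, under stated assumptions: the combined fact table is hypothetically named FactWithType (it is not part of the original posts), it carries a FactType flag of 'B' for base facts and 'D' for derived facts, and @Year/@Period identify the month being loaded.

    -- Assumed table: FactWithType (Account, Organization, [Year], Period, Amount, FactType)
    DECLARE @Year INTEGER = 2012
    DECLARE @Period SMALLINT = 1

    -- 1. Load this month's base facts from the original fact data.
    INSERT INTO FactWithType (Account, Organization, [Year], Period, Amount, FactType)
    SELECT Account, Organization, [Year], Period, Amount, 'B'
    FROM OriginalFact
    WHERE [Year] = @Year AND Period = @Period

    -- 2. Carry forward last month's rows (base or derived) for accounts with no row this month.
    INSERT INTO FactWithType (Account, Organization, [Year], Period, Amount, FactType)
    SELECT p.Account, p.Organization, @Year, @Period, p.Amount, 'D'
    FROM FactWithType AS p
    WHERE p.[Year] = CASE WHEN @Period = 1 THEN @Year - 1 ELSE @Year END
      AND p.Period = CASE WHEN @Period = 1 THEN 12 ELSE @Period - 1 END
      AND NOT EXISTS (SELECT 1
                      FROM FactWithType AS c
                      WHERE c.Account = p.Account
                        AND c.Organization = p.Organization
                        AND c.[Year] = @Year
                        AND c.Period = @Period)

    -- 3. Optional view exposing only the base facts.
    -- CREATE VIEW BaseFactOnly AS
    --     SELECT Account, Organization, [Year], Period, Amount
    --     FROM FactWithType
    --     WHERE FactType = 'B'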

  • How many rows are in the base fact table? What will the growth rate be? What is the maximum number of periods that will be stored in the fact table?

  • Hi JonFox,

    Based on the two periods I have loaded into the fact table, I estimate that it has around 150,000+ records. The database size growth is quite large every month.

    Thanks.

  • How wedded are you to the schema of your base fact table? Is this a work in progress that you can make some changes to, or is it already in production? The reason I ask is because I think there are some changes you might want to consider. When designing a dimensional schema, data such as the period and year would normally be a reference to a date dimension's surrogate key, rather than being stored directly in the fact table. This would apply to account/organization info as well. I would definitely recommend reading one of the Kimball Group books such as The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling in order to really get a handle on dimensional modeling techniques.
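
    Purely for illustration, a dimensional layout along those lines might look roughly like the sketch below; the DimDate, DimAccount, and FactAmount names and columns are made-up placeholders, not anything from this thread's schema.

    -- Hypothetical star-schema sketch: the fact row references surrogate keys
    -- instead of storing year/period and account details directly.
    CREATE TABLE DimDate
    (
        DateKey     INT      NOT NULL PRIMARY KEY, -- surrogate key, e.g. 20111101
        PeriodStart DATE     NOT NULL,
        [Year]      INT      NOT NULL,
        Period      SMALLINT NOT NULL
    )

    CREATE TABLE DimAccount
    (
        AccountKey   INT         NOT NULL PRIMARY KEY, -- surrogate key
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL
    )

    CREATE TABLE FactAmount
    (
        DateKey    INT   NOT NULL REFERENCES DimDate (DateKey),
        AccountKey INT   NOT NULL REFERENCES DimAccount (AccountKey),
        Amount     MONEY NOT NULL
    )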

    While playing around with the sample data, I ended up adding a new DATE column to the OriginalFact table, to store the year/period in a way that is easier to index and compare against. Again, I would really recommend that this field actually be an attribute on a Date dimension, but this is a start. Since we have to sort and filter by year/period in order to get the most recent Amount for each Account/Date, this gives us a good value that can be indexed.

    Here's a potential solution using CROSS APPLY:

    -- If the temp table with sample data already exists, drop it
    IF OBJECT_ID('tempdb..#OriginalFact') IS NOT NULL DROP TABLE #OriginalFact;

    -- Create temp table to hold sample data
    CREATE TABLE #OriginalFact
    (
        Account      VARCHAR(10) NOT NULL,
        Organization VARCHAR(10) NOT NULL,
        [Year]       INTEGER     NOT NULL,
        Period       SMALLINT    NOT NULL,
        Amount       INTEGER     NOT NULL,
        PeriodStart  DATE
    )

    -- In order for the technique below to be anywhere close to efficient, we must have a good index in place!
    CREATE INDEX Account_Organization_PeriodStart ON #OriginalFact (Account, Organization, PeriodStart DESC) INCLUDE (Amount)

    -- Load sample data
    INSERT INTO #OriginalFact
        (Account, Organization, [Year], Period, Amount)
    SELECT Account, Organization, [Year], Period, Amount
    FROM (VALUES
        ('CMP1000', 'PM1', 2011, 11, 100),
        ('CMP1001', 'PM1', 2011, 11, 101),
        ('CMP1002', 'PM1', 2011, 11, 102),
        ('CMP1003', 'PM1', 2011, 11, 103),
        ('CMP1001', 'PM1', 2011, 12, 111),
        ('CMP1002', 'PM1', 2011, 12, 112),
        ('CMP1003', 'PM1', 2011, 12, 113),
        ('CMP1002', 'PM1', 2012, 01, 122),
        ('CMP1003', 'PM1', 2012, 01, 123)
    ) x (Account, Organization, [Year], Period, Amount)

    -- Set the value for the PeriodStart date
    UPDATE #OriginalFact SET PeriodStart = DATEADD(month, (([Year] - 1900) * 12) + Period - 1, 0)

    -- Common Table Expression using a CROSS JOIN to generate a set containing all
    -- Accounts/Organizations for each Year/Period. Without doing this, we would
    -- wind up with "holes" in the end result.
    ;WITH PeriodsAccounts_CTE
    AS
    (
        SELECT [Year], Period, PeriodStart, Account, Organization
        FROM
        ( -- I think doing the distincts before the cross join is more efficient
          -- than doing it after; not 100% sure on this, bears testing with a
          -- larger data set.
            SELECT DISTINCT [Year], Period, PeriodStart
            FROM #OriginalFact
        ) AS D CROSS JOIN
        (
            SELECT DISTINCT Account, Organization
            FROM #OriginalFact
        ) AS A
    )
    -- Main SELECT statement
    SELECT PA.[Year], PA.Period, PA.Account, PA.Organization, F.Amount,
           CASE WHEN PA.PeriodStart = F.PeriodStart THEN 'Original'
                ELSE 'Derived'
           END AS FactType -- Flag to indicate if the row was original or derived, in case that's helpful
    FROM PeriodsAccounts_CTE AS PA CROSS APPLY
    ( -- For each row in PA, get all fact rows with the same account/org
      -- that occurred on or before the current row's date.
      -- Sort by date descending and grab the TOP 1 to get most recent.
        SELECT TOP 1 Amount, PeriodStart
        FROM #OriginalFact AS O
        WHERE O.Account = PA.Account
          AND O.Organization = PA.Organization
          AND O.PeriodStart <= PA.PeriodStart
        ORDER BY O.PeriodStart DESC
    ) AS F

    Note the index; if this isn't built, performance of the CROSS APPLY will be absolutely abysmal. Even with the index in place, I suspect that this approach won't scale very well at the volumes you are working with. Still, it's worth a shot.

    I do suspect, however, that (as suggested by fahey.jonathan) you will ultimately need to pre-calculate these values as part of your ETL process. That's the good thing about warehoused data: lengthy, complex calculations can be run overnight and persisted for fast retrieval during the day. In this type of environment, implementing costly views is not likely to be in your best interest.

    In any case, I'm interested in hearing how this performs with your larger data set.

  • Hi JonFox,

    I tried executing your SQL against my fact table, which contains 123,591 records.

    The performance is actually quite good; it took around 2 seconds to complete. However, the results do not seem quite right, as only 72,854 rows are returned, and I was expecting more than 123,591 rows.

    I will update you with my findings later.

    Thanks.

  • JonFox (1/21/2012)

    * Double post - pls ignore this reply.
