What kind of Fact table is this

  • Certainly agree on not updating large, fast changing transactional fact tables.

    Regarding the OP's question: I think the grain of the fact table is one order (which can contain many items), since that's the lifecycle you want to track.

    I believe that making queries simple is one of the (many) reasons for a DWH, otherwise you could just use the source system (ignoring concurrency issues of course), or use a data vault model. In a true accumulating snapshot scenario, you can derive all of the measures from one single row. If it's modeled as a transactional table, you would have to do a number of self-joins (x times the number of events you want to track), which is not always possible in every front-end tool. There's a small sketch of the difference after the side remarks below.

    Two side remarks:

    * most of the time I model the incoming data as a transactional fact table, and then the accumulating snapshot as a 2nd level fact. If storage permits of course 😀

    * I'd also argue that accumulating snapshots are typically for lifecycles that span a couple of hours or days. In that case updates come in slowly and are not really an issue.
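    To make the self-join point concrete, here's a minimal sketch. All table and column names are made up for illustration only (not the OP's actual model): with the accumulating snapshot, a duration comes straight from one row, while the transactional grain needs one self-join per extra milestone.

    -- Accumulating snapshot: one row per order, one date column per milestone.
    SELECT OrderNumber,
           DATEDIFF(DAY, OrderDate, ShipDate) AS DaysToShip
    FROM dbo.FactOrderFulfillment;            -- hypothetical table

    -- Transactional grain: one row per event, so each extra milestone needs a self-join.
    SELECT o.OrderNumber,
           DATEDIFF(DAY, o.EventDate, s.EventDate) AS DaysToShip
    FROM dbo.FactOrderEvents AS o             -- hypothetical table
    INNER JOIN dbo.FactOrderEvents AS s
        ON s.OrderNumber = o.OrderNumber
       AND s.EventType = 'Shipped'
    WHERE o.EventType = 'Ordered';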

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (9/13/2016)

    Yes, I agree that simplicity is one of the key tenets of data warehousing. The simplicity comes from the design of a good bus matrix that adequately answers the questions the business asks and carries that simplicity all the way through to the queries. There's also keeping the ETL simple, yet effective. I always have half an eye on OLAP as well (those are the sort of jobs I usually take, though, again, back to personal experience). I see what you're saying about transactional tables and self-joins to build a history, and I have seen some truly awful ways this has been implemented (why not truncate every day and reload everything with a ROW_NUMBER?????!!! :crazy:), but there are elegant ways of dealing with it (I would go with OLAP, but you could also encapsulate this logic for the data as it comes in, or in views, or indeed, as you mention, in a 2nd-level fact).

    I hope that what the OP takes from all of this is that even though there are whole books dedicated to this subject, in the real world you sometimes have to make concessions. Plus everyone has their own opinion on these things based on their own dealings over the years - I think the responses to this post prove that adequately :laugh:


    I'm on LinkedIn

  • This post has been a valuable read for me. I'm finally getting into a position where I can spend most of my time on data warehousing and build up the kind of experience that informs the decisions you've all been describing.

    Thanks very much guys 🙂

  • As a side note: I found the following book (aside from the Kimball books) to be an immensely good reference:

    Star Schema - The Complete Reference

    I did a review a while back:

    Star Schema - The Complete Reference Review

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Yep

    I read that review back then and ordered a copy. Working through it just now. 😀

  • Ralph Kimball has said everything there is to say, past and future, about accumulating snapshot fact tables. All the wisdom is enshrined in his website and books.

    http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/accumulating-snapshot-fact-table/

    http://www.kimballgroup.com/2014/06/design-tip-167-complementary-fact-table-types/

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • What's the benefit of using integers in that context versus just using dates? For example, are there performance gains? Isn't the date data type fewer bytes than an int, for one thing?

    I would never use dates for the primary key, if only because it doesn't allow for other possibilities. In my principal warehouse, key 1 is '1/1/2006'. There is a date field and a string field for the date. The string field is what the user sees, because for key 0 the date is "1/1/1970" and the string field is "Unknown." Finally, there is a -1 key with a date of "12/31/1969" and a string field of "N/A". Using dates as the key would remove that flexibility. Also, because the keys only have to cover days, I can use a smallint. I can understand someone using an int, although that allows far more days than will ever be needed. Having done my original data warehouses for a subsidiary of RadioShack, I don't worry about the numbers running out in 90 years or so. Some have cautioned against the use of 0 as the Unknown value, and admittedly it has caused me the occasional problem on non-date dimensions, but for date dimensions it's fine, and it's one byte smaller than the date field. (There's a rough sketch of this kind of date dimension at the end of this post.)

    Using human-readable formats (e.g. 20160913) for dates and times is a much better approach. These dates can very easily be converted and are much easier to read when one is debugging an issue or performing maintenance on the DW.

    No user ever reads these, or at least they shouldn't. At an interview once I was asked which I preferred. They did not say what they did. I said I used the integers. It allowed for easy date range calculations and used a little less space. If I needed to see the dates because I was working in the back end, it was easy enough to make the join. I phrased it as "who can't make the join." I ended up being hired, and they told me that was exactly the answer they wanted to hear. I can see the one advantage of the other approach, but in the final analysis it's not the better system.
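    To picture the setup described above, here's a rough sketch of such a smallint-keyed date dimension (the DDL and names are illustrative only, not the actual schema):

    CREATE TABLE dbo.DimDate
    (
        DateKey    smallint    NOT NULL PRIMARY KEY,  -- surrogate key counting days
        DateValue  date        NOT NULL,              -- the actual date
        DateLabel  varchar(20) NOT NULL               -- what the user sees
    );

    INSERT INTO dbo.DimDate (DateKey, DateValue, DateLabel)
    VALUES (-1, '1969-12-31', 'N/A'),         -- not applicable
           ( 0, '1970-01-01', 'Unknown'),     -- unknown / missing date
           ( 1, '2006-01-01', '1/1/2006');    -- first real date; keys increment per day from here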

  • RonKyle (9/13/2016)

    If you use a date data type, you can use 0001-01-01 as the "dummy" date for N/A, unknown, et cetera. You can give proper names in other attributes.

    I find using dates easier for debugging (quite lazy on my part :-)). Sure, I can write the join. But I can also avoid it 😉

    In short:

    * space is about the same, unless you use smallint, in which case you're better off

    * there is almost no performance impact, at least none that I know of

    * date arithmetic is easy in both systems. With dates you just use DATEDIFF.

    Conclusion: it's a tabs vs spaces argument 😉
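    For completeness, a quick illustration of the date-typed approach (a minimal sketch; the fact table and column names are made up):

    -- A date-typed key lets you do the arithmetic directly, with 0001-01-01 as the dummy member.
    SELECT f.[OrderID]
         , DATEDIFF(DAY, f.[OrderDateKey], f.[ShipDateKey]) AS [DaysToShip]
    FROM [DW].[dbo].[FactOrders] AS f          -- hypothetical fact table
    WHERE f.[ShipDateKey] <> '0001-01-01';     -- exclude the "unknown" dummy date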

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck (9/13/2016)


    The type of fact table that you want is called an "accumulating snapshot".

    It contains one row for each lifecycle (e.g. a customer that creates a specific order in your case). Each time an event happens, a column containing the date for that event is updated.

    @PB_BI: it's certainly normal to model a fact table that can be updated. Here it offers significant advantages because it's very easy to calculate a duration between two defined events. If you would model the fact table as a normal transaction table, the queries would become too complex.

    I mean, it has its uses, but I wouldn't say the transactional route is so complex that it shouldn't be considered. I chose the normal transactional route for my lifecycles (or consumer paths, as we call them here) because SQL handles vertical data better than horizontal data. When you get into transactions that run into the billions, having a wide table has an impact. Having to update a single record for every event that comes in has an impact.

    Looking at the queries, it's not exactly complex to query the full lifecycle vertically. You select all the events and order by CustomerID and Date/Time. That's your lifecycle in vertical form.

    SELECT
        [CustomerID]
      , [Time]
      , [EventID]
    FROM [DW].[dbo].[FactEvents]
    ORDER BY
        [CustomerID]
      , [Time];
    If you partition events across multiple transactional fact tables, then you do the same and UNION them.

    SELECT
        [CustomerID]
      , [Time]
      , [EventID]
    FROM [DW].[dbo].[FactEventReg]

    UNION ALL

    SELECT
        [CustomerID]
      , [Time]
      , [EventID]
    FROM [DW].[dbo].[FactEventDist]

    UNION ALL

    SELECT
        [CustomerID]
      , [Time]
      , [EventID]
    FROM [DW].[dbo].[FactEventConv]

    ORDER BY
        [CustomerID]
      , [Time];

    At the base level, this is pretty easy querying. Of course, you don't want to query big cuts of data like this without additional filters, but this is the base level of querying measurable events vertically.

    If you need to do time gap analysis like calculating time between events, then you can easily factor in:

    ROW_NUMBER() OVER(PARTITION BY [a].[CustomerID]
                      ORDER BY [a].[Time] DESC) AS [EventNumber]

    And JOIN the dataset back on top of each other with:

    INNER JOIN [DataWithEventNumber] a ON a.[EventNumber] - 1 = b.[EventNumber] AND ...

    Nothing really too complex there. You're selecting a dataset, formulating that lifecycle based on time and customer, then calculating the position of each event. Once you have that, you simply join the set back on top of itself, offset by one step, for a gap analysis. A fuller sketch of the idea is below.
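    Pulling those fragments together, a self-contained version of the gap analysis might look something like this (same hypothetical FactEvents table as above; the CTE and alias names are just for illustration, and the numbering is ascending here):

    ;WITH NumberedEvents AS
    (
        SELECT
            [CustomerID]
          , [Time]
          , [EventID]
          , ROW_NUMBER() OVER(PARTITION BY [CustomerID]
                              ORDER BY [Time]) AS [EventNumber]
        FROM [DW].[dbo].[FactEvents]
    )
    SELECT
        [curr].[CustomerID]
      , [prev].[EventID] AS [FromEvent]
      , [curr].[EventID] AS [ToEvent]
      , DATEDIFF(MINUTE, [prev].[Time], [curr].[Time]) AS [MinutesBetween]
    FROM NumberedEvents AS [curr]
    INNER JOIN NumberedEvents AS [prev]
        ON [prev].[CustomerID] = [curr].[CustomerID]
       AND [prev].[EventNumber] + 1 = [curr].[EventNumber];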

    That's just one simplistic approach, and it lets you completely reconfigure the path in many different ways to support many questions, if that makes sense for the business. If you have a set number of events that never changes, then an accumulating snapshot also makes perfect sense. But I wouldn't rule out either option on the grounds that the query complexity is too high.

  • The way I interpret what the OP is asking is that the customer can visit a page multiple times before an order is placed. An accumulating snapshot fact table needs to capture a more deterministic workflow than that. Quite simply, you could maintain the different events as separate fact tables. From there it's a simple matter of comparing aggregates in a ratio-like fashion.

    Between dates d1 and d2, Entity A had x visitors and y purchases, giving a ratio of y/x. You do this for all entities (storefronts?). You don't exactly need to link the path for every user to get some sort of higher-level conversion measure. A rough sketch of that aggregate approach follows below.
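    For instance, something along these lines (the fact tables and columns are hypothetical, just to show the shape of the query):

    DECLARE @d1 date = '2016-09-01', @d2 date = '2016-09-14';

    -- Visits and purchases per entity between @d1 and @d2, and the conversion ratio y/x.
    SELECT
        v.[EntityID]
      , COUNT(*) AS [Visits]
      , ISNULL(p.[Purchases], 0) AS [Purchases]
      , ISNULL(p.[Purchases], 0) * 1.0 / COUNT(*) AS [ConversionRatio]
    FROM [DW].[dbo].[FactPageVisits] AS v       -- hypothetical visit fact
    LEFT JOIN
    (
        SELECT [EntityID], COUNT(*) AS [Purchases]
        FROM [DW].[dbo].[FactPurchases]         -- hypothetical purchase fact
        WHERE [PurchaseDate] >= @d1 AND [PurchaseDate] < @d2
        GROUP BY [EntityID]
    ) AS p
        ON p.[EntityID] = v.[EntityID]
    WHERE v.[VisitDate] >= @d1 AND v.[VisitDate] < @d2
    GROUP BY v.[EntityID], p.[Purchases];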

