Date Ranges - Simple, but seemingly impossible

  • Thank you all so much for kindly taking the time to come up with your solutions. I'm truly grateful as I thought there was no possible solution for my university project. It seems there's so much that I can learn and you guys have helped me understand this and allowed me to learn so much!

    I promise I will take the time to go over each and every solution you guys posted here 😀

    Joe Celko, please don't hate me. I wasn't being lazy. I'm new to this forum and I'm trying my best. The data for the timeframe is just an extra visual presentation of the data that is in the SQL query that creates the table. No need to transcribe anything. I just decided to put it in an Excel file with colors and lines for better understanding.
    As for the image, I don't know how to enlarge it, but if you right-click it and open it in another tab it displays at 100% 🙂
    I'm just trying to optimize a fact table on date ranges to make the next steps possible. The Stack and Product will have surrogate keys when I apply this transformation to the real database. I was just trying to figure out this transformation and asked for help with quick example data, as this transformation is what's blocking me from advancing.
    Anyway, apologies for anything I did wrong, and thank you for your input too. I will be sure to apply those constraints you pointed out. It wasn't my intention to disrespect anyone by asking for help the best way I could.

  • Vegeta7 - Tuesday, November 7, 2017 2:34 AM

    Thank you all so much for kindly taking the time to come up with your solutions. I'm truly grateful as I thought there was no possible solution for my university project. It seems there's so much that I can learn and you guys have helped me understand this and allowed me to learn so much!

    I promise I will take the time to go over each and every solution you guys posted here 😀

    Joe Celko, please don't hate me. I wasn't being lazy, I just don't know all this stuff 🙂 I'm just trying to optimize a fact table on date ranges to make the next steps possible. The Stack and Product will have surrogate keys when I apply this transformation to the real database. I was just trying to figure out this transformation and asked for help as it is what's blocking me from advancing. Anyway, apologies for anything I did wrong and thank you for your input too.

    Being jumped on by me in a SQL forum is a rite of passage. I have a horrible reputation to maintain, after all 🙂

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Based on what's been shown so far, I believe the best clustering key for this table would be:
    ( ProductCode, ValidFrom, ValidTo )
    rather than just ValidFrom and ValidTo, assuming you (almost) always use ProductCode in the join to this table.

    As to having a "primary key", it's a theoretical requirement, but not necessary in real life.  For performance, by far the most important factor is getting the best clustered index, whether you officially designate it a PK or not.  In fact, for complex tables or in a rapid-development environment, you don't want to officially declare a PK too early, as they are much harder to change than just a non-constraint clustered index.  Later on, after the indexing requirements stabilize, you can create an official PK if you prefer.
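
    A minimal sketch of what that could look like, assuming a permanent version of the ProductDiscounts table from the sample data later in this thread (the index and constraint names are just illustrative, and the PK columns are only a guess at what's unique):

    CREATE CLUSTERED INDEX CIX_ProductDiscounts
        ON dbo.ProductDiscounts (ProductCode, ValidFrom, ValidTo);

    -- Later, once the indexing requirements have stabilized, a PK can be declared
    -- as a separate nonclustered constraint if you want one:
    ALTER TABLE dbo.ProductDiscounts
        ADD CONSTRAINT PK_ProductDiscounts
        PRIMARY KEY NONCLUSTERED (Stack, ProductCode, ValidFrom);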

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • ScottPletcher - Tuesday, November 7, 2017 2:02 PM

    Based on what's been shown so far, I believe the best clustering key for this table would be:
    ( ProductCode, ValidFrom, ValidTo )
    rather than just ValidFrom and ValidTo, assuming you (almost) always use ProductCode in the join to this table.

    As to having a "primary key", it's a theoretical requirement, but not necessary in real life.  For performance, by far the most important factor is getting the best clustered index, whether you officially designate it a PK or not.  In fact, for complex tables or in a rapid-development environment, you don't want to officially declare a PK too early, as they are much harder to change than just a non-constraint clustered index.  Later on, after the indexing requirements stabilize, you can create an official PK if you prefer.

    Choose either ValidFrom or ValidTo: only one of the two dates will be employed by a seek operation; the other will almost always be a residual predicate.


    Low-hanging fruit picker and defender of the moggies

    For better assistance in answering your questions, please read this.

    Understanding and using APPLY, (I) and (II) - Paul White

    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden

  • ChrisM@home - Tuesday, November 7, 2017 2:07 PM

    ScottPletcher - Tuesday, November 7, 2017 2:02 PM

    Based on what's been shown so far, I believe the best clustering key for this table would be:
    ( ProductCode, ValidFrom, ValidTo )
    rather than just ValidFrom and ValidTo, assuming you (almost) always use ProductCode in the join to this table.

    As to having a "primary key", it's a theoretical requirement, but not necessary in real life.  For performance, by far the most important factor is getting the best clustered index, whether you officially designate it a PK or not.  In fact, for complex tables or in a rapid-development environment, you don't want to officially declare a PK too early, as they are much harder to change than just a non-constraint clustered index.  Later on, after the indexing requirements stabilize, you can create an official PK if you prefer.

    Choose either ValidFrom or ValidTo: only one of the two dates will be employed by a seek operation; the other will almost always be a residual predicate.

    Yes, but depending on the specific key values, SQL may still be able to "short-circuit" rows if both date columns are present in the key, at least theoretically.  For example, if you specify ValidFrom <= '20170930' AND ValidTo <= '20171231' (among whatever other conditions), once SQL hits row '20170930', '20180101', it can stop searching, even if there are a million more rows with a ValidFrom of '20170930', because SQL "knows" that all future ValidTo dates will be beyond the range specified.
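
    Just to picture it, a query shaped like that example might look like this (table name and literal values assumed from the sample data earlier in the thread):

    SELECT Stack, ProductCode, Discount1
    FROM dbo.ProductDiscounts
    WHERE ProductCode = 'PROD40'       -- seek on the leading key column
      AND ValidFrom <= '20170930'      -- second key column bounds the seek range
      AND ValidTo   <= '20171231';     -- evaluated inside that range; this is where the short-circuit described above could help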

    SQL DBA,SQL Server MVP(07, 08, 09) "Money can't buy you happiness." Maybe so, but it can make your unhappiness a LOT more comfortable!

  • jcelko212 32090 - Tuesday, November 7, 2017 1:00 PM

    Vegeta7 - Tuesday, November 7, 2017 2:34 AM

    Thank you all so much for kindly taking the time to come up with your solutions. I'm truly grateful as I thought there was no possible solution for my university project. It seems there's so much that I can learn and you guys have helped me understand this and allowed me to learn so much!

    I promise I will take the time to go over each and every solution you guys posted here 😀

    Joe Celko, please don't hate me. I wasn't being lazy, I just don't know all this stuff 🙂 I'm just trying to optimize a fact table on date ranges to make the next steps possible. The Stack and Product will have surrogate keys when I apply this transformation to the real database. I was just trying to figure out this transformation and asked for help as it is what's blocking me from advancing. Anyway, apologies for anything I did wrong and thank you for your input too.

    Being jumped on by me in a SQL forum is a rite of passage. I have a horrible reputation to maintain, after all 🙂

    Heh... yeah... and it's about as effective as genital mutilation rites.  Without code, it really makes little sense.  You still haven't posted any code to solve the problem the way you suggested.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Vegeta7 - Sunday, November 5, 2017 6:52 PM

    ...the end result will be to have a discount fact table, where I will have all the discounts correctly segmented by period..

    ...So if a sale were to occur on the 2009-02-01, I would know that I would have to apply those 3 discounts on the ListPrice to get that FinalPrice...

    Please make sure you're clear on the difference between facts and dimensions.  Dimensions describe the possible options for an event, even before the event takes place.  The fact contains the realization or occurrence of the event referencing the dimension options that were selected for the event.  My preferred example is a menu at a restaurant.  The menu is the dimension showing all of the possibilities for a customer to order.  Everything on the menu is available (and in the dimension), but there is no fact record until an order is actually placed, and the fact record only references the menu item that was ordered (dimension record).

    In your example, discount should be a dimension table, referenced by the sales fact table.  The sales event on 2009-02-01 stored in the fact would have a reference to the discount dimension, describing what discount the sale used.  The fact doesn't care when the discount started or expired, it only references which discount record was used.  The discount dimension table has all of the details of the discount (start/end/amount/level/etc...)

    It is critical to understand the differences between facts and dimensions to ensure they are used properly.
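
    As a rough sketch of that separation (made-up names and a trimmed-down column list, not your real schema):

    CREATE TABLE dbo.DimDiscount (
        SK_Discount  INT IDENTITY(1,1) PRIMARY KEY,
        DiscountName VARCHAR(50),
        ValidFrom    DATETIME,
        ValidTo      DATETIME,
        DiscountPct  DECIMAL(4,3)
    );

    CREATE TABLE dbo.FactSales (
        SK_Customer     INT,
        SK_Product      INT,
        SK_OrderDate    INT,
        SK_Discount     INT,            -- which "menu item" (discount) the sale actually used
        QuantityOrdered INT,
        SalesAmount     DECIMAL(18,2)
    );

    The fact row only points at the discount that was used; the validity dates and the percentage live on the dimension.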

    Wes
    (A solid design is always preferable to a creative workaround)

  • whenriksen - Monday, November 20, 2017 2:51 PM

    Vegeta7 - Sunday, November 5, 2017 6:52 PM

    ...the end result will be to have a discount fact table, where I will have all the discounts correctly segmented by period..

    ...So if a sale were to occur on the 2009-02-01, I would know that I would have to apply those 3 discounts on the ListPrice to get that FinalPrice...

    Please make sure you're clear on the difference between facts and dimensions.  Dimensions describe the possible options for an event, even before the event takes place.  The fact contains the realization or occurrence of the event referencing the dimension options that were selected for the event.  My preferred example is a menu at a restaurant.  The menu is the dimension showing all of the possibilities for a customer to order.  Everything on the menu is available (and in the dimension), but there is no fact record until an order is actually placed, and the fact record only references the menu item that was ordered (dimension record).

    In your example, discount should be a dimension table, referenced by the sales fact table.  The sales event on 2009-02-01 stored in the fact would have a reference to the discount dimension, describing what discount the sale used.  The fact doesn't care when the discount started or expired, it only references which discount record was used.  The discount dimension table has all of the details of the discount (start/end/amount/level/etc...)

    It is critical to understand the differences between facts and dimensions to ensure they are used properly.

    Thanks for the reply 🙂

    I'm constructing a Fact Constellation Schema (I believe this is a galaxy schema, but one where the fact tables can be joined; I still need to research this further).

    The idea is to construct 2 independent fact tables that are interconnected by dimensions, but there are 2 measures that should be equal to both fact tables (but will only be needed for certain validations).

    Without trying to generate further confusion, this is the final outcome:

    Fact Table: Discounts (Accumulated Snapshot)
    Will include the following surrogate keys: SK_Stack, SK_Customer, SK_Product, SK_Currency, SK_Valid_From, SK_Valid_To, [other SKs]
    Measures: ListPrice, Discount1, Discount2, FinalPrice

    This fact table stands on its own and will be used for many analyses around the discounts that are being applied.

    Fact Table: Sales (Transactional Fact Table)
    Surrogate Keys: SK_Customer, SK_Product, SK_Currency, SK_OrderDate, [other SKs]
    Measures: SalesAmount, QuantityOrdered, ListPrice, FinalPrice

    This is another independent fact table that can also be used for many analyses around sales.

    Both these fact tables can be joined through common dimensions such as Customer, Product, Date, Currency, etc., and also directly through ListPrice and FinalPrice for a direct comparison between sales and the discounts applied.

    The goal is to allow analysis of each fact table on its own and also across the two fact tables (combining sales with the discounts applied), so that the business can understand how to maximize profits for each customer by looking at historical data (did sales increase because of discounts? were discounts applied? how many? at what percentages? etc.).
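
    To make it more concrete, this is roughly the kind of cross-fact query I have in mind (working names only, and assuming the date surrogate keys are ordered YYYYMMDD integers so BETWEEN works on them):

    SELECT  s.SK_Customer, s.SK_Product,
            SUM(s.SalesAmount) AS SalesAmount,
            MAX(d.Discount1)   AS Discount1,
            MAX(d.Discount2)   AS Discount2
    FROM dbo.FactSales     AS s
    JOIN dbo.FactDiscounts AS d
      ON  d.SK_Customer = s.SK_Customer
      AND d.SK_Product  = s.SK_Product
      AND d.SK_Currency = s.SK_Currency
      AND s.SK_OrderDate BETWEEN d.SK_Valid_From AND d.SK_Valid_To   -- the sale falls inside the discount's validity window
    GROUP BY s.SK_Customer, s.SK_Product;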

  • Vegeta7 - Tuesday, November 21, 2017 5:36 AM

    Fact Table: Discounts (Accumulated Snapshot)
    Will include the following surrogate keys: SK_Stack, SK_Customer, SK_Product, SK_Currency, SK_Valid_From, SK_Valid_To, [other SKs]
    Measures: ListPrice, Discount1, Discount2, FinalPrice

    Fact Table: Sales (Transactional Fact Table)
    Surrogate Keys: SK_Customer, SK_Product, SK_Currency, SK_OrderDate, [other SKs]
    Measures: SalesAmount, QuantityOrdered, ListPrice, FinalPrice

    I don't know your business model, but to me it looks like you have two fact tables to describe a single event, as evidenced by ListPrice & FinalPrice being listed in both places.  That should be a big warning flag.  Logically, a Discount (the event captured in FactDiscount) does not have a ListPrice, but a sale does.
    I would instead expect to see a column in FactSales for DiscountDimension.SK_Discount so I can identify which discount was applied to the sale, and the calculated discount amount (ListPrice * .10 As DiscountAmount).
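
    Something along these lines, purely as an illustration (all object and column names invented):

    SELECT  f.SK_Product,
            f.ListPrice,
            d.DiscountPct,
            f.ListPrice * d.DiscountPct               AS DiscountAmount,
            f.ListPrice - f.ListPrice * d.DiscountPct AS FinalPrice
    FROM dbo.FactSales   AS f
    JOIN dbo.DimDiscount AS d
      ON d.SK_Discount = f.SK_Discount;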

    Wes
    (A solid design is always preferable to a creative workaround)

  • whenriksen - Tuesday, November 21, 2017 7:05 AM

    I don't know your business model, but to me it looks like you have two fact tables to describe a single event, as evidenced by ListPrice & FinalPrice being listed in both places.  That should be a big warning flag.  Logically, a Discount (the event captured in FactDiscount) does not have a ListPrice, but a sale does.
    I would instead expect to see a column in FactSales for DiscountDimension.SK_Discount so I can identify which discount was applied to the sale, and the calculated discount amount (ListPrice * .10 As DiscountAmount).

    I understand your idea, and you may very well be right 🙂 But the discounts are prepared in advance for the whole year, and the business needs to be able to make comparisons using the discounts alone (e.g. for this customer and this product, which discounts, and how many of them, apply in these periods compared to the year before) so they can adjust the discounts according to that analysis. That's why I'm not sure I can make the discounts a dimension used only by the Sales fact table: there are many discounts that will never have a sale, but still need to be used in the reports for analysis.

    I really understand your point and I will have to dig more into this before I can advance with the final architecture.

  • Vegeta7 - Tuesday, November 21, 2017 8:29 AM

    Since there are many discounts that will never have a sale, but still need to be used in the reports for analysis.

    This statement is exactly why you should consider a discount dimension.  A fact table documents what has happened.  If a discount is never used, you'll never see a sale for it.  If you have a discount dimension, you'll be able to put the discount "name" attribute up against the total sales count/amount and see how much each discount was used, including if it was not used at all.
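
    Roughly, again with placeholder names, a usage report that still shows never-used discounts could look like this:

    SELECT  d.DiscountName,
            COUNT(f.SK_Discount)            AS SalesUsingDiscount,   -- 0 when the discount was never used
            SUM(COALESCE(f.SalesAmount, 0)) AS SalesAmount
    FROM dbo.DimDiscount AS d
    LEFT JOIN dbo.FactSales AS f
           ON f.SK_Discount = d.SK_Discount
    GROUP BY d.DiscountName;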

    On a separate note, the schema design is star or snowflake.  It is describing how the dimensions are related to a specific fact.  A star has the dimension referenced by the fact directly, a snowflake can have intermediate dimensions between a dimension and the fact. 
    A "constellation" is a collection of related stars (facts that share dimensions) and a "galaxy" is a collection of constellations that are not related (all facts for the whole company).  Technically, a constellation and a galaxy can use the snowflake schema design as well, as they can be intermixed within the same solution, but that is not very common.

    Wes
    (A solid design is always preferable to a creative workaround)

  • whenriksen - Tuesday, November 21, 2017 8:56 AM

    Vegeta7 - Tuesday, November 21, 2017 8:29 AM

    Since there are many discounts that will never have a sale, but still need to be used in the reports for analysis.

    This statement is exactly why you should consider a discount dimension.  A fact table documents what has happened.  If a discount is never used, you'll never see a sale for it.  If you have a discount dimension, you'll be able to put the discount "name" attribute up against the total sales count/amount and see how much each discount was used, including if it was not used at all.

    On a separate note, the schema design is star or snowflake.  It is describing how the dimensions are related to a specific fact.  A star has the dimension referenced by the fact directly, a snowflake can have intermediate dimensions between a dimension and the fact. 
    A "constellation" is a collection of related stars (facts that share dimensions) and a "galaxy" is a collection of constellations that are not related (all facts for the whole company).  Technically, a constellation and a galaxy can use the snowflake schema design as well, as they can be intermixed within the same solution, but that is not very common.

    Thank you for your help. This feedback is really interesting and makes a lot of sense to me, but there are still a few things I see that might generate some debate.

    This Discount table, as we discussed, will have 4 non-additive measures:
    ListPrice
    Discount1
    Discount2
    FinalPrice

    Is it wise to include non-additive measures in a dimension?

    Also, this table will include a lot of SKs to other tables such as Customer, Product, etc., but I guess that would just make it a snowflake.

    This table would only include Discount1 and Discount2 by default, but I need to include the ListPrice and the FinalPrice to make sure that the FinalPrice matches the sales amount in the Sales fact table, because there's quite complex logic behind the discounts and how they are applied, and sometimes the prices might not match due to some mistake on the sales side or in how I try to replicate the discount logic.

  • Vegeta7 - Wednesday, November 22, 2017 3:50 AM

    This Discount table, as we discussed, will have 4 non-additive measures:
    ListPrice
    Discount1
    Discount2
    FinalPrice

    What is the ListPrice/FinalPrice of the discount prior to the sale?  If they don't exist yet, then these measures are part of the sale.

    Is it wise to include non-additive measures in a dimension?


    Nothing in a dimension is additive, and there are no measures in a dimension.  Dimensions are used for slicing and filtering.  Don't confuse the model with the underlying table: a model can have both dimensions and measures built from the same table.  For example, a distinct count of customers does not need a separate fact table.  You can perform the distinct count based on the dimension table, but it is still considered a measure in the model.

    Also, this table will include a lot of SKs to other tables such as Customer, Product, etc., but I guess that would just make it a snowflake.


    Does the discount have a customer and product prior to a purchase being made?  If yes, establishing that purchase contract would be an event on its own.  In that case, you'd likely have the fact as described.  However, you'd also have a dimension for the purchase contract (research degenerate dimensions) and your fact sales would relate to the contract dimension and would not relate to the customer or product.  Instead, the relationship from the sales fact to the customer or product would go through the contract dimension and contract fact.

    This table would only include Discount1 and Discount2 by default, but I need to include the ListPrice and the FinalPrice to make sure that the FinalPrice matches the sales amount in the Sales Fact Table, because there's quite a complex logic behind the discounts and how they are applied...


    Having the measure calculated in two different locations is another big warning flag.  Part of building your model is ensuring your calculations are correct.  However, your sale has already happened.  The final price is known and should not be calculated again, even if it is wrong.  (If your sales reps are able to manually override the total price, that is an additional event you should be capturing). 

    I would also recommend you research the grain of a fact table.  You shouldn't mix grains within a fact table, and you should also avoid creating redundant grains.  Your revenues should be the aggregate of your sales, which should be the aggregate of your line items, which should be quantity * price.  The rollup is handled by the model when you slice by a different dimension value.  This ensures you are always using the same calculation no matter what you slice by.  If you recalculate the values at different grains, you WILL run into data quality problems.  You will also confuse your users by having different measure names for the same thing.  Your verification of the calculation is not a separate measure; that is something you can do behind the scenes.
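
    As a sketch of what "one calculation, rolled up by the model" means, with an assumed line-item fact table and an assumed UnitPrice column:

    -- Line-item grain: the only place the arithmetic happens
    SELECT  SK_Product,
            SK_OrderDate,
            SUM(QuantityOrdered * UnitPrice) AS SalesAmount
    FROM dbo.FactSalesLineItem               -- assumed line-item fact table
    GROUP BY SK_Product, SK_OrderDate;

    Sliced by customer, month, or discount instead, the same SUM is reused; nothing is recalculated at a different grain.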

    ...and sometimes the prices might not match due to some mistake on the sales side or in how I try to replicate the discount logic.

    DO NOT "correct" the data from your source.  It is very common for a model to surface data quality issues in an application.  Surface the problem in your model and let the source fix the problem.  (Cleaning the data is not the same as correcting it).  Once the source discovers you can manipulate the data, they will always let you handle it, and then you are on the hook for fixing all their other problems and explaining the values and why you didn't tell them earlier.  If a source says they can't fix it for some reason or another, try to hold strong that you can't fix the numbers either.  When you pre-filter data from a model, you allow company problems to be hidden from leadership.  Who do you think will take the heat for that?  Bottom line, your model should use the data in the application.

    Wes
    (A solid design is always preferable to a creative workaround)

  • whenriksen - Wednesday, November 22, 2017 7:33 AM

    DO NOT "correct" the data from your source.  It is very common for a model to surface data quality issues in an application.  Surface the problem in your model and let the source fix the problem.  (Cleaning the data is not the same as correcting it).  Once the source discovers you can manipulate the data, they will always let you handle it, and then you are on the hook for fixing all their other problems and explaining the values and why you didn't tell them earlier.  If a source says they can't fix it for some reason or another, try to hold strong that you can't fix the numbers either.  When you pre-filter data from a model, you allow company problems to be hidden from leadership.  Who do you think will take the heat for that?  Bottom line, your model should use the data in the application.

    Amen to that!  "Correcting" data from another source also exposes the company to greater liability, especially if your corrections turn out to be incorrect. It could also invalidate any "audit" certifications the company may have, and if they lose those, there will be hell to pay. Of course, the company won't take the blame... they'll use you as the scapegoat.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Hi,
    I'm sorry, I only have SQL Server 2000 now...


    -- Source data: each row is a discount with an inclusive [ValidFrom, ValidTo] range
    CREATE TABLE #ProductDiscounts (
     Stack VARCHAR(255),
     ProductCode VARCHAR(255),
     ValidFrom DATETIME,
     ValidTo DATETIME,
     Discount1 DECIMAL(4,3))

    -- Work table: the distinct periods the timeline gets cut into
    -- (Ini = period start, Fin = exclusive period end)
    CREATE TABLE #Periodos (
     Ind INTEGER IDENTITY(1,1),
     Ini DATETIME,
     Fin DATETIME)

    INSERT INTO #ProductDiscounts (Stack, ProductCode, ValidFrom, ValidTo, Discount1) VALUES ('A','PROD40','20000101','20401231',0.1)
    INSERT INTO #ProductDiscounts (Stack, ProductCode, ValidFrom, ValidTo, Discount1) VALUES ('B','PROD40','20090201','20110131',0.2)
    INSERT INTO #ProductDiscounts (Stack, ProductCode, ValidFrom, ValidTo, Discount1) VALUES ('C','PROD40','20110201','20411231',0.15)
    INSERT INTO #ProductDiscounts (Stack, ProductCode, ValidFrom, ValidTo, Discount1) VALUES ('D','PROD40','20090201','20401231',0.3)

    -- Every boundary date becomes the start of a candidate period:
    -- each ValidFrom, plus the day after each ValidTo
    INSERT INTO #Periodos (Ini)
    SELECT x.Data
    FROM (
      SELECT ValidFrom AS Data
      FROM #ProductDiscounts
      UNION
      SELECT DATEADD(day, 1, ValidTo) AS Data
      FROM #ProductDiscounts
     ) AS x
    GROUP BY x.Data
    ORDER BY x.Data   -- make sure Ind follows date order; the UPDATE below relies on it

    -- Each period ends where the next boundary starts
    UPDATE #Periodos
    SET Fin = x.F
    FROM #Periodos
    INNER JOIN (
      SELECT Ori.Ind AS I, MIN(Ant.Ini) AS F
      FROM #Periodos AS Ori
      INNER JOIN #Periodos AS Ant
       ON Ori.Ind < Ant.Ind
      GROUP BY Ori.Ind
     ) AS x
     ON Ind = x.I

    -- The last boundary has nothing after it, so it isn't a period
    DELETE #Periodos
    FROM #Periodos
    INNER JOIN (SELECT MAX(Ind) AS M FROM #Periodos) AS x ON Ind = x.M

    -- One row per discount per period it overlaps;
    -- convert the exclusive Fin back to an inclusive ValidTo
    SELECT
     Stack,
     ProductCode,
     Discount1,
     Ini AS ValidFrom,
     DATEADD(day, -1, Fin) AS ValidTo
    FROM #ProductDiscounts
    INNER JOIN #Periodos
     ON NOT (Fin <= ValidFrom OR Ini > ValidTo)
    ORDER BY Ind

    -- SELECT * FROM #Periodos
    -- SELECT * FROM #ProductDiscounts

    DROP TABLE #ProductDiscounts
    DROP TABLE #Periodos
    MAC
