Building ProductDim from normalized tables.

  • even tho it will double the rows to load

    We can provide general guidelines, but you're the one with the data, so you have to make some judgements. You don't indicate what doubling of the rows means in terms of numbers. I have a dimension where I load every possible combination even though only about 25% are actually in use. But the totat dimension size is about 10,000 rows. If it were 10x that, I would probably load only the combinations used, and alter the ETL to load new combinations automatically.

  • RonKyle (8/25/2016)


    even tho it will double the rows to load

    We can provide general guidelines, but you're the one with the data, so you have to make some judgements. You don't indicate what doubling of the rows means in terms of numbers. I have a dimension where I load every possible combination even though only about 25% are actually in use. But the totat dimension size is about 10,000 rows. If it were 10x that, I would probably load only the combinations used, and alter the ETL to load new combinations automatically.

    Well, reporting only on what is used is the same as using independent dimensions again.

    Regardless, you can take the approach of reporting on non-sales in the fact. This reduces the dimension size of DimProduct, but increasing the record count of the fact. However, you can split the fact to offload the processing on reporting. FactSales and FactNonSales.

    Taking that approach means you can go from one big DimProduct table to DimProduct, DimSize and DimColor because the FactNonSales now answers the question of everything not sold for the day as long as the source system can answer those questions, which it should because it has every possible combination to begin with.

  • Well, reporting only on what is used is the same as using independent dimensions again.

    No, it's not. Chances are independent dimensions when the items would be an attribute of product would be a design error. This is an ETL choice. But as you point out that it's important to know what hasn't moved, then why are you asking if you should load the unsized products even if it will double the size? It sounds like you know you have to load them. If in the dimension I described I had to report on the 75% unused combinations, there would have been no choice but to load them.

    I don't see how including the unsized doubles the data in any case. I would see it something like

    Shirt XS

    Shirt S

    Shirt M

    Shirt L

    Shirt XL

    Shirt US

    In this case, the unsized adds one more row to the 5 already there. I'm sure you're real life situation is not a simplistic, but the idea likely pertains.

  • RonKyle (8/25/2016)


    Well, reporting only on what is used is the same as using independent dimensions again.

    No, it's not. Chances are independent dimensions when the items would be an attribute of product would be a design error. This is an ETL choice. But as you point out that it's important to know what hasn't moved, then why are you asking if you should load the unsized products even if it will double the size? It sounds like you know you have to load them. If in the dimension I described I had to report on the 75% unused combinations, there would have been no choice but to load them.

    I don't see how including the unsized doubles the data in any case. I would see it something like

    Shirt XS

    Shirt S

    Shirt M

    Shirt L

    Shirt XL

    Shirt US

    In this case, the unsized adds one more row to the 5 already there. I'm sure you're real life situation is not a simplistic, but the idea likely pertains.

    Sorry, I misunderstood. You were referring to adding unsized to only products that have unsized via the ETL and not just adding it to all products. I thought you were saying that you could just build the dimension based on what's used in the fact, i.e.: only include attributes that have sold in the dimension.

    That would be the smart bet, but if you do have product colors, it wouldn't just be 1 new record for the product, if you have 5 colors, it would be 5 new records at max for the product. That's when you want to ensure you are only adding what has changed per the ETL (as Ron mentioned). Thus, if you have 10,000 products with 10 colors, then you're talking at max, 100,000 new records added to the DimProduct table over time (the amount that unsized could increase if all products and colors have it over time). Not exactly double per his comments, but a lot more records if unsized happens for all 10,000 products and 10 color combinations.

  • I think overweighting the size portion is not taking into consideration the many benefits of the denormalised model when it comes to reporting, and can gravitate ones thinking away from the positives. Warehouses will/should be big. 100,000 rows or 200,000 rows for that matter is no real concern. Remember the purpose of a warehouse is to retain well organized history. A dimensional table that will probably be in the MBs of storage space will pale in comparison to the GBs of size of a fact table.

    ----------------------------------------------------

  • MMartin1 (8/25/2016)


    I think overweighting the size portion is not taking into consideration the many benefits of the denormalised model when it comes to reporting, and can gravitate ones thinking away from the positives. Warehouses will/should be big. 100,000 rows or 200,000 rows for that matter is no real concern. Remember the purpose of a warehouse is to retain well organized history. A dimensional table that will probably be in the MBs of storage space will pale in comparison to the GBs of size of a fact table.

    It can really depend on the use case. You shouldn't assume that big means you should not consider the impact of large dimensions for reporting. Using my warehouse in example, I have facts that have half a billion if not a billion in records. Reports can span a good portion of those records due to the nature of the questions that are being asked of the warehouse. When you add large denormalized dimensions into the mix, you have to consider the perfomance implications.

    If you're just supporting canned reporting where the users are looking at slices of data in weekly, monthly or quarterly rollups, then the performance implications is not as wide. Yet, you toss in 1,000 users hitting the same tables for the same report, things may change.

  • xsevensinzx (8/25/2016)


    MMartin1 (8/25/2016)


    I think overweighting the size portion is not taking into consideration the many benefits of the denormalised model when it comes to reporting, and can gravitate ones thinking away from the positives. Warehouses will/should be big. 100,000 rows or 200,000 rows for that matter is no real concern. Remember the purpose of a warehouse is to retain well organized history. A dimensional table that will probably be in the MBs of storage space will pale in comparison to the GBs of size of a fact table.

    It can really depend on the use case. You shouldn't assume that big means you should not consider the impact of large dimensions for reporting. Using my warehouse in example, I have facts that have half a billion if not a billion in records. Reports can span a good portion of those records due to the nature of the questions that are being asked of the warehouse. When you add large denormalized dimensions into the mix, you have to consider the perfomance implications.

    If you're just supporting canned reporting where the users are looking at slices of data in weekly, monthly or quarterly rollups, then the performance implications is not as wide. Yet, you toss in 1,000 users hitting the same tables for the same report, things may change.

    I dont know where I assumed that size can never be an issue. I did not give advise to forgo indexing or even compression. And dont forget having all attributes to a dimension in the dimension means you can do analysis with that dimension only. You especially dont want to break up the dimension if the attributes form a natural or some user defined hierarchy. Why I would not want not have a month dimension , year dimension etc...

    If you have a subset of attributes that change a lot, as I mention prior you can have the minidimension or even an outtrigger dimension if ,in your case, color where to appear in other dimensions as well. http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/outrigger-dimension/

    ----------------------------------------------------

Viewing 7 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply