help with dupicated values in dimension and facts

  • I have a situation in DWH where one of the dimension is not loading properly which is product dimension.

    The product dimension data comes from four oltp databases.so data which is coming

    from one of the oltp databases is creating a new record instead of doing a update for the already existing record.

    So for a same product id there are duplicates.

    the dimension is related to 6 different fact tables.

    Now i need the remove the duplicated dimension values in the dimension table

    and update the correct productid in the facttable

  • It sounds like you've got yourself a hairy problem.

    A few questions I'd be asking to start off with are:

    - Is this a new DWH, or an existing one? Why has this become a problem now and not before?

    - Are there appropriate primary keys set on the dimensions and facts (I really hope so)? Are they composite keys?

    - How are the key lookups done? (could be a casing issue or similar logic)

    - What is your ETL tool?

    - Presumably you want to fix the cause of the problem and not just the data, correct?

    You can fix the data by picking the dimension members you want to remove, joining them to the fact table(s), updating the fact tables based on that set to use the desired surrogate keys, then removing the dimension members. That's if it's a simple DWH, if SCDs are involved it could be more complicated. It sounds like the simple scenario based on your description but I must stress don't take any action if you don't fully understand your data and ETL.

  • This one is old DWH.

    there is a primary key on the dimension table and composite keys in the fact table.

    i am using SSIS

    i am able to fix the package but the thing is i need to correct the old data

  • Curious, may I ask what was causing the duplicates?

    Anyway, like I said before, deleting the duplicates should be a case of first updating the fact table to reflect the desired surrogate key values, then removing the duplicates themselves.

    Is it one product with multiple duplicates or multiple products with one or multiple duplicates?

    If you want to make all the duplicates the same value then it's fairly simple, if each duplicate needs to be replaced by a different value you may have to do it case by case.

  • for a single product instead of updating a single field in a row its creating a new row for the same product.But here we are not keeping any history.

    For the same product its creating a duplicates.

  • So is this an SSAS processing question?

    If so, you probably need to perform the steps above to clean up the tables, then run a ProcessUpdate on the dimension and a ProcessData and ProcessIndex on the fact partitions that were affected.

    Point being, it may be important to determine which fact records you are changing, so you know which partitions this will affect.

    One side note on the ProcessUpdate, this will not actually remove the members from the attribute store, it will actually just move the removed ones to the deleted member store, you will have to run a ProcessFull to actually remove the members from the database completely.

  • You probably want to run something like this against each of the fact tables:

    UPDATE F

    SET

    F.<Product_SKey> = '<Required_Surrogate_Key_Value>'

    FROM <F_FactTableName> F

    INNER JOIN

    (SELECT <Product_SKey>

    FROM <D_ProductDimension>

    WHERE <Criteria that defines affected product>) U

    ON F.<Product_SKey> = U.<Product_SKey>

    And then run something like this against the dimension:

    DELETE

    FROM <D_ProductDimension>

    WHERE ( <Criteria that defines affected product> )

    AND NOT <Product_SKey> = '<Required_Surrogate_Key_Value>'

    There are of course many ways it could be done.

    This code has not been tested, use it at your own risk. Always test first 😉

  • I have attached the sameple records of the fact and dimension table.

    So instead of updating pdesc for pid=1 in the dimension it created a new pid=4

    same is the case for pid=2 it created pid=5 .

    As it created new records in dimension so it is updated in the fact table too.

    So now i have to update the pid(1,2) in the fact table to some value and delete the pid (1,2) in the dimension table..

    I hope this should be helpful

  • During the ETL process of the particular dimension you could utilise the MERGE function to alleviate these kinds of issues...

    The advice here http://qa.sqlservercentral.com/Forums/Topic735049-23-1.aspx may be of assistance...

    Also http://qa.sqlservercentral.com/scripts/Common+Table+Expression+(CTE)/62599/

    ________________________________________________________________________________

    Can I ask you a rhetorical question...?
    ________________________________________________________________________________

  • Sachin4all (6/30/2011)


    for a single product instead of updating a single field in a row its creating a new row for the same product.But here we are not keeping any history.

    For the same product its creating a duplicates.

    This is a DIMension table, isn't it?

    Let me guess... natural key is the same on all rows for the same product but surrogate keys are different, is that correct?

    If that's the case please be sure system is not tracking changes by treating this DIM table as Slowly Changing Dimension of the Type 2.

    SCD Type 2 dimensions appear to have duplicates on the natural key but that's not the case, each "duplicate" row describes the particular item at a specific period of time.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • ya its a dimension table

    but for a single product there should be only one id.

    no history at all...

  • Sachin4all (7/1/2011)


    ya its a dimension table

    but for a single product there should be only one id.

    no history at all...

    Okay. If that's the case deleting duplicates is a well documented matter, check here: http://support.microsoft.com/kb/139444 just to mention Microsoft KB take on the matter.

    I would test the process... and how it impacts the application after it's done in a production sized test database.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply