Fact table with surrogate keys on SCD2 dimensions

I'd like to make a fact table with a surrogate key in place of a foreign key. My fact table has about 1 TB of data, and my SCD2 dimension has about 1MM rows, since they're customers. Here are the possible ways I could do this.

    1. Lookup thingy

    2. Merge join thingy

3. T-SQL MERGE statement

4. Load the fact table naked and then update it with a left join on the foreign key and date

5. A script lookup, per this tip: http://www.codeproject.com/Tips/1060568/SSIS-SCD-Script-Lookup

If SSDT is so sexy, doesn't it already have an out-of-the-box solution for this so I don't have to do 4 or 5? If I can do 5, what would I need SSDT for? Is a fact table with a surrogate key on an SCD2 dimension so foreign (ha ha, SQL pun)? No one on the web is talking about this.

Clearly, if this becomes a daily thing, we're going to be doing incremental loads and partitioning in the future.

Your question isn't very clear... surrogate keys are table attributes (automatically assigned, meaningless numbers that uniquely identify rows), while foreign keys are table constraints. These are two very different concepts.

    Assuming that you have a type-2 dimension (tracking history) and want to create a fact table that references the dimension member at the time of the fact transaction: This is nothing new, fancy or unusual. You need a simple lookup to the type-2 dimension that will return the valid dimension record at the time of the fact transaction, and use that surrogate key in the population of the fact table. Whether you decide to do that with the built-in SSIS tasks or not is really up to you.
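For example, a point-in-time lookup is just a range join on the business key and the dimension's effective dates. A minimal sketch, assuming the dimension has CustomerSK, CustomerID, EffectiveFrom, and EffectiveTo columns and the staged facts carry a TransactionDate (all names here are assumptions):

    SELECT f.*, d.[CustomerSK]
    FROM [staging].[FactSales] f
    INNER JOIN [dbo].[DimCustomer] d
        ON d.[CustomerID] = f.[CustomerID]          -- business key
       AND f.[TransactionDate] >= d.[EffectiveFrom] -- row valid at transaction time
       AND f.[TransactionDate] <  d.[EffectiveTo];  -- open-ended rows often use '9999-12-31'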

    Are you looking to recreate the fact table every day? Can you provide a sample of what your dimension/fact tables look like right now, as well as your desired end state?

I think you got it. I need to relate a dimension to a fact table by replacing a foreign key with the corresponding SCD2 surrogate key. You mentioned a lookup thingy. Yes, that is certainly a possibility, but those things only work with 25k rows in the dimension total, and I've got almost a million. My issue isn't the shape of the table or determining the business rule with which to apply the surrogate key. I want to know how people with enormous numbers of rows handle that volume. Is there a best practice?

If in doubt, use T-SQL to join the facts to the dimension table.

I tend to create the facts in a working table (a persistent table in a database, not a #temp table) and then update them with the surrogate keys from all the dimensions. I then load the data into the DW tables.

    Jez

  • ffitzpatrick47 (4/13/2016)


I think you got it. I need to relate a dimension to a fact table by replacing a foreign key with the corresponding SCD2 surrogate key. You mentioned a lookup thingy. Yes, that is certainly a possibility, but those things only work with 25k rows in the dimension total, and I've got almost a million. My issue isn't the shape of the table or determining the business rule with which to apply the surrogate key. I want to know how people with enormous numbers of rows handle that volume. Is there a best practice?

    What approach are you taking now that's slow? Or are you assuming it's slow?

If you have incremental loads, then you're talking about updating a smaller number of records each day. This can be done with SQL and an UPDATE statement joining the dimension table once said table is created/updated.

As Jez mentioned, this can be done with a temp table or a physical staging table, typically in a separate staging database, with the data updated in batches (code below). Once the incremental load of data is ready, it can be inserted/merged into the primary data.

Updating and inserting/merging could take a few different approaches depending on your data, model, indexes, and so forth. For example, if you're clustering on time and just putting the incremental data on top of the stack, it's going to reduce fragmentation and be pretty easy in a typical DW scenario. If you also have a nonclustered index on a customer ID that is alphanumeric and 40 characters in length, not so much.
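To illustrate that first case: a fact table clustered on the transaction date means daily incremental rows land at the logical end of the index instead of splitting pages. A sketch, with assumed table and column names:

    -- Assumed names; clustering on the date column keeps daily
    -- incremental inserts at the end of the index (low fragmentation).
    CREATE CLUSTERED INDEX [CIX_FactSales_SaleDate]
        ON [primary].[FactSales] ([SaleDate]);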

      1 - Create a staging table or staging database with a table that includes fields for your fact and the added surrogate key fields that you will populate from your dimensions.

      2 - Load your staging fact table with the incremental load for the day. Note, your fields to populate with surrogate keys are still NULL at this point.

3 - Run an update statement that joins your dimensions onto your staging fact data in batches to populate those NULL fields with surrogate keys from your dimensions.

      4 - Insert/Merge your staging fact table into your primary fact table.

(Example: loop the following T-SQL statement until no rows remain to update; a full loop sketch follows the statement.)

    UPDATE TOP (1000) s                -- update via the alias, not the base table name
    SET s.[ID] = c.[ID]
    FROM [staging].[FactSales] s
    INNER JOIN [primary].[DimCustomer] c ON c.[Customer] = s.[Customer]
    WHERE s.[ID] IS NULL;              -- only rows still missing their surrogate key
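A minimal batching loop around that statement might look like the sketch below. The WHILE wrapper is one assumed way to drive the batches, and the added date-range predicate (SaleDate, StartDate, and EndDate are assumed column names) is what picks the SCD2 row that was valid at transaction time:

    -- Repeat the batched update until no fact rows are missing a surrogate key.
    WHILE 1 = 1
    BEGIN
        UPDATE TOP (1000) s
        SET s.[ID] = c.[ID]
        FROM [staging].[FactSales] s
        INNER JOIN [primary].[DimCustomer] c
            ON c.[Customer] = s.[Customer]
           AND s.[SaleDate] >= c.[StartDate]   -- assumed SCD2 effective-date columns
           AND s.[SaleDate] <  c.[EndDate]
        WHERE s.[ID] IS NULL;

        IF @@ROWCOUNT = 0 BREAK;               -- stop once every row is keyed
    END;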

  • ffitzpatrick47 (4/13/2016)


I think you got it. I need to relate a dimension to a fact table by replacing a foreign key with the corresponding SCD2 surrogate key. You mentioned a lookup thingy. Yes, that is certainly a possibility, but those things only work with 25k rows in the dimension total, and I've got almost a million. My issue isn't the shape of the table or determining the business rule with which to apply the surrogate key. I want to know how people with enormous numbers of rows handle that volume. Is there a best practice?

To answer your question: my preference is to do this with T-SQL. I use SSIS to control the workflow, but I stage my data and then transform and load with T-SQL (contained within stored procedures if it will be reused). Others may prefer to do lookups in SSIS and use an Execute SQL task for updates, but in my experience it works best to use T-SQL to deal with large updates and/or lookups.

I do lookups/merge joins against 1MM+ datasets all day long. With an SCD2, you're not going to be looking up against the whole table anyway. You are always going to be looking at the 'active' customer rows only, ignoring historical ones. So while you may have 1MM rows in your SCD2, your lookups will not have to take that many into account.

    A lookup transform will work, but you'll want to use the 'Use results of an SQL query' option to limit the lookup to only active rows. If this ends up being an expensive process, then consider using a Merge Join. But still, your source for the Merge Join will limit the rows to 'active' customers.
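That lookup query can be as small as the sketch below; IsCurrent is an assumed flag column (some designs key off an open-ended EndDate instead):

    -- Feed the lookup cache only the active SCD2 rows.
    SELECT [CustomerSK], [CustomerID]
    FROM [dbo].[DimCustomer]
    WHERE [IsCurrent] = 1;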

    Lastly, you can certainly do the JOIN in the source component when you are retrieving your fact data (assuming your staged fact data is in the same DB as your customer dimension).
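A sketch of that source query, with assumed table and column names:

    -- Staged facts joined straight to the active dimension rows in the source query.
    SELECT s.*, c.[CustomerSK]
    FROM [staging].[FactSales] s
    INNER JOIN [dbo].[DimCustomer] c
        ON c.[CustomerID] = s.[CustomerID]
       AND c.[IsCurrent] = 1;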

I tend to steer clear of writing large ETL operations using stored procedures to do the work. Set-based code is always the way to go when working in SQL Server, but the all-or-nothing nature of working with sets/batches of data makes your error handling more difficult. SSIS lets you load only the good rows, while re-routing error rows elsewhere for handling. There are pros/cons to both approaches, but my preference for large ETL operations is SSIS data flows over stored procedures.

    John Rowan

======================================================
Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

This is exactly what I was looking for. The training kit specifically warns against a lookup with more than 25k records, but it's wrong, and the lookup is usable if it can handle 1MM records at reasonable speed.
