On how to design dimensions and discover correlations between their attributes

  • First up, let me say that where I use the term "correlation" I don't think I'm using it strictly in the sense that a statistician might, so apologies up front if I cause confusion.

    Say I have a big ugly flat file that I have to turn into a star schema. Assume there's only going to be one fact table for the minute. Assume also that it's easy enough to identify the measures that are going to go into the fact table. You're then left with a bunch of columns that are going to inform the attributes of at least some of the final dimension tables.

    For example, say we have two columns that are candidates for being attributes. Call them column_a and column_b. Column_a has 50 distinct values and column_b has 20 distinct values. If we did a group by and count of those two columns together, a spectrum of outcomes is possible. The columns could be highly correlated: the most highly correlated outcome would be that 50 rows are returned from the group by of both columns. This means that column_b can be explained entirely by column_a, and it makes a rather compelling case for them both to be in the same dimension. At the other end of the spectrum, you might end up with 1000 rows (the Cartesian product of the two distinct counts). On the face of it, this is a fair indication that the two columns are relatively unrelated and don't belong in the same dimension.
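
    A minimal sketch of that group-by test, assuming the flat file has been loaded into a staging table (here called stg_flat_file; all names are hypothetical):

        -- Count the distinct (column_a, column_b) combinations.
        SELECT COUNT(*) AS combo_count
        FROM (
            SELECT column_a, column_b
            FROM stg_flat_file
            GROUP BY column_a, column_b
        ) AS combos;

        -- For comparison: each column's distinct count on its own.
        SELECT COUNT(DISTINCT column_a) AS a_count,
               COUNT(DISTINCT column_b) AS b_count
        FROM stg_flat_file;

    If combo_count equals a_count (50 here), column_b is functionally dependent on column_a; if it approaches a_count * b_count (1000 here), the columns are essentially independent.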

    But so far this is an oversimplified example, because I have not addressed the counts of records. Say my big ugly flat file contains 10 million rows. I have 1000 distinct combinations of column_a and column_b, but on closer examination it turns out that 9,999,000 of the rows use just 100 distinct combinations, and the remaining 1000 records are scattered across the last 900. It turns out that the columns ARE highly correlated... most of the time. To deal with this, you could take the view that the 1000 records that don't fall into the boxes most of the others neatly fall into are "outliers" or bad data entry or whatever, and group them all up into an "other" category. That way you'd reduce the size and complexity of your dimension table from 1000 rows to 101 for almost no loss in explanatory power.
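
    One possible way to build that collapsed list of dimension members, again against the hypothetical stg_flat_file, using 1,000 rows as an arbitrary cut-off for "common enough to keep" (the right threshold depends on the data, and this assumes character-typed columns):

        SELECT dim_a, dim_b, SUM(combo_rows) AS total_rows
        FROM (
            SELECT
                CASE WHEN COUNT(*) >= 1000 THEN column_a ELSE 'Other' END AS dim_a,
                CASE WHEN COUNT(*) >= 1000 THEN column_b ELSE 'Other' END AS dim_b,
                COUNT(*) AS combo_rows
            FROM stg_flat_file
            GROUP BY column_a, column_b
        ) AS bucketed
        GROUP BY dim_a, dim_b;

    With the numbers above, and provided each of the 100 common combinations carries at least 1,000 rows, this returns 101 rows: the 100 common combinations plus a single ('Other', 'Other') member absorbing the stragglers.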

    But things are still oversimplified because we've only compared column_a to column_b. What if there were, say, 100 other columns that could potentially contribute to the final design of the dimensions? That's already 4,950 pairwise comparisons before you even consider combinations of three or more columns, so the number quickly becomes mind-boggling. There must surely be an automated mathematical way of examining these columns to see what the level and nature of the correlations between them is, so that you come out with a result in the end that maximises the efficiency of the dimensions. In other words, how can you tell that you've got the maximum explanatory power for the minimum number of dim tables and the minimum number of rows in those dim tables?
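
    A statistician would probably reach for something like Cramér's V or mutual information here, but a crude, SQL-friendly proxy for one pair is the ratio of each column's distinct count to the distinct count of the pair. A sketch against the hypothetical stg_flat_file (automating it across all 4,950 pairs would take dynamic SQL or an external script):

        SELECT
            CAST(a_count AS float) / combo_count AS b_explained_by_a,
            CAST(b_count AS float) / combo_count AS a_explained_by_b
        FROM (
            SELECT
                COUNT(DISTINCT column_a) AS a_count,
                COUNT(DISTINCT column_b) AS b_count,
                (SELECT COUNT(*)
                 FROM (SELECT column_a, column_b
                       FROM stg_flat_file
                       GROUP BY column_a, column_b) AS c) AS combo_count
            FROM stg_flat_file
        ) AS stats;

    A b_explained_by_a of 1.0 means every value of column_a maps to exactly one value of column_b; the closer it falls to 1/b_count, the nearer the pair is to a full Cartesian product.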

    A friend of mine was asked to develop a star schema and was faced with just this dilemma, so he grouped things up into dimensions as best he could based on his gut feel. He didn't have the luxury of knowing the nature of the underlying business at the time and overall did a sterling job given the constraints placed on him. I had a look at it some years after it had been designed and I asked why a given column had been included in a particular dimension. He just shrugged his shoulders. We took it out and the dimension went from around 100,000 rows down to less than 5,000. And yes, performance improved. But without the luxury of being able to examine every correlation in detail, it seems there will always be a risk of this.

    What do you do when faced with this? If database design is both art and science, this is one place where it seems that art is used, but science should be.

    Cheers

    GPO.

    ...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important.... Bertrand Russell

  • GPO (7/14/2010)


    ... didn't have the luxury of knowing the nature of the underlying business

    That's the problem: you can't do data modeling without knowing the business, including how data is used and accessed.

    In this particular case we are talking about a datamart where the central FACT table appears not to be a problem; the problem appears to be the DIM tables.

    Each defined DIM table should describe a single attribute. As an example, I can see a DIM_STATE table describing "States" if there is a STATE_CD column in the FACT table. Such a DIM_STATE table would include the normalized State code, whatever internal code the company uses if a different one exists, the name of the State, and even the "State flower" if the business uses this information to retrieve data.
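
    A sketch of what such a table might look like (column names and types are illustrative only):

        CREATE TABLE DIM_STATE (
            state_key    int         NOT NULL PRIMARY KEY, -- surrogate key
            state_cd     char(2)     NOT NULL,             -- normalized state code
            internal_cd  varchar(10) NULL,                 -- company code, if different
            state_name   varchar(50) NOT NULL,
            state_flower varchar(50) NULL                  -- only if the business queries by it
        );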

    Each time you think about adding a column to a DIM table, ask yourself: would adding this column mean I'll need more rows in my DIM table to properly describe the attribute? If the answer is yes, forget about the idea.
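
    One quick way to run that test, assuming a dimension currently built from col_1 and col_2 and a candidate column col_3 (all names hypothetical):

        SELECT
            (SELECT COUNT(*)
             FROM (SELECT col_1, col_2
                   FROM stg_flat_file
                   GROUP BY col_1, col_2) AS t1) AS rows_without,
            (SELECT COUNT(*)
             FROM (SELECT col_1, col_2, col_3
                   FROM stg_flat_file
                   GROUP BY col_1, col_2, col_3) AS t2) AS rows_with;

    If rows_with is materially larger than rows_without, the candidate column multiplies the grain of the dimension and probably belongs somewhere else.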

    As a rule of thumb, you are going to end up with a DIM table for each single attribute you want to describe in the FACT table.

    Bottom line is that Data Modeling requires knowledge of the business as well as how data is used and accessed.

    Hope this helps.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • This is a hopeless task unless the data is never going to change. Otherwise the results of your static analysis could prove invalid the moment the data first gets updated. Effective database design must always be based on an understanding of the intended meaning of the data and the business requirements that drive it. There are no short cuts.
