Computed column as a primary key?

  • Hi,

    What do you think about creating a unique computed column as a primary key in a table?

    Here is the whole story:

    I have a table which has 15 fields and 10 of those fields should be unique together.

    If I created a unique index on 10 columns (some of them are varchar(50)) it will take too much storage and I'm sure this is not a good practice.

    So I created a persisted computed column, which holds a md5 hash of those fields that should be unique together, and created a unique index only on that field, which is a much better solution I guess.

    id AS (CONVERT([binary](16),hashbytes('MD5',fieldA + fieldF + str(fieldH) + ...),(0))) PERSISTED

    Then I thought that, this field is already unique so I don't need another field as a primary key.

    The table is kind a log table, rows updated very rarely, but loaded ~100.000 new rows everyday.

    Is there any problem you see on this approach? Any problem using computed column as primary key? As I tested it works well, performance looks normal.

    Thank you.

  • A hash makes a good key for the situation you have described, even more so if you can search on it (bearing in mind the other columns won't be indexed). It probably won't make a good choice for a clustered index though. Make sure it's nonclustered unless you think that index fragmentation won't be a problem for you. The HashBytes function also has a limitation that it will only hash the first 8000 characters.

    The hash function you are proposing to use isn't likely to be reliable unless the values are of fixed length. You need to delimit the columns because otherwise 'A'+'BC' will return the same hash as 'AB'+'C' , so put a delimiter in there like: 'A'+'|'+'BC'. Since this is supposed to be a key there shouldn't be any nullable columns but if there are are then you also need to handle nulls in the concatenation.

  • There's a slim but real chance of hash collision, even if the data is different. What will you do if that comes up? Just hope it doesn't?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • True, but with MD5 the chance of an accidental collision is less than the chance of a duplicate GUID and most people don't worry too much about that. It's probably also less than the risk of your server being destroyed in a lightning strike. If you are really paranoid you could use two different hash algorithms and make the combination of them a key.

  • Thank you very much for your explanations.

    As you said, it is not a clustered index, I use it just for making rows unique without creating a big multi-column unique index. I don't use it for searching either.

    Using a delimiter is also a good practice, thank you for the idea.

    The only problem I can find, SSMS cannot drop/recreate that table if primary key is a computed column and IGNORE_DUP_KEY is set to true when I change schema definition.

    So I need to create another table by using SQL statements with necessary changes and transfer the data.

    Thanks again.

  • I created that computed column as primary key with IGNORE_DUP_KEY=true, so it won't lead an error if hash collide. And of course it is a very (very!) little chance to happen. And it is not that important to lose one row in 5-10 million records, in my situation.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply