Setting up the Primary Key

  • Either a GUID or an identity field will do for a primary key. Through the years, I have found that it is really a matter of choice. To keep it simple, you can use the following as a guide to help you decide how you will implement your database tables:

    Use identity when:

      Your PK will be the clustered index

      You will not be using replication

      Sequence is/may be important. Keep in mind that you may still end up with gaps

      You will be querying your database by using the PK as opposed to some other data like a date range, customer name, company, etc.

    Otherwise use a GUID. The important thing about using GUIDs for PKs is to never, ever, use a clustered index on them.

    Hope this helps.

    Michael

  • I respect all of your views about keys, indexes and the like. It's great to hear different views on the same topics and let my mind try to wrap around some of the higher thinkers about SQL Server.

    However, if we go back to robblot and his original question, I suspect that his app won't be as complex as what some of you are thinking. Now, I'm no certified DBA, I'm more of a hybrid business guy and developer. I think robblot has several spreadsheets that are created every month and he's looking for a simple solution and not an overtly complex app that is going to overtax a server(s) when pulling data.

    1. My first thought is: is there a performance hit for using a GUID over an identity key? a 37 digit number compared to just an integer, it seems the integer wins. (I can definitely be wrong here)

    2. How many records at a time will this information be bringing back? dozens, hundreds, thousands, millions, billions? Since he's using spreadsheets, I reckon he doesn't have a terabyte of data to work with and just wants a way to import some data into tables and either use Excel or SSRS to input some dates to return the data.

    My vote is for the identity keyed integer (or bigint if necessary) if my scenario is what robblot is thinking. I've built a ton of successful web apps (I'm sure you guys have, too) that SSIS a spreadsheet(s) into a SQL Database and SSRS a report out of that database.

  • 1. The performance hit for GUIDs is the size, and the fact that if this is clustered, you can create lots of page splits. I tend to agree with you. For most tables, use an int identity.

    2. I heard in SQL2K5 that above 400 inserts/sec, a GUID was better (nonclustered). For reads, I'd go back to it's simpler to use int and have less space taken up. However, for this data size, likely doesn't matter

  • SQLWannabe (6/15/2010)


    I respect all of your views about keys, indexes and the like. It's great to hear different views on the same topics and let my mind try to wrap around some of the higher thinkers about SQL Server.

    However, if we go back to robblot and his original question, I suspect that his app won't be as complex as what some of you are thinking. Now, I'm no certified DBA, I'm more of a hybrid business guy and developer. I think robblot has several spreadsheets that are created every month and he's looking for a simple solution and not an overtly complex app that is going to overtax a server(s) when pulling data.

    1. My first thought is: is there a performance hit for using a GUID over an identity key? a 37 digit number compared to just an integer, it seems the integer wins. (I can definitely be wrong here)

    2. How many records at a time will this information be bringing back? dozens, hundreds, thousands, millions, billions? Since he's using spreadsheets, I reckon he doesn't have a terabyte of data to work with and just wants a way to import some data into tables and either use Excel or SSRS to input some dates to return the data.

    My vote is for the identity keyed integer (or bigint if necessary) if my scenario is what robblot is thinking. I've built a ton of successful web apps (I'm sure you guys have, too) that SSIS a spreadsheet(s) into a SQL Database and SSRS a report out of that database.

    1. Yes, there would be a hit for several reasons, one being the size and another being page fragmentation. Keep in mind SQL is heavily optimized to deal with integer data and it does GREAT!

    I'm going to go out on a limb, GUID's should NEVER be used in a clustered index. It can be a secondary index. It is useful to some web applications to keep their data that way instead of as integers, but in the database side I regularly use integer (with ot without identity) to uniquely identify a record. The GUID would be just another attribute of the record and I would probably never use it in an way other than to return it to the front-end.

    I walked through this whole thread and in my mind it was pretty clear, business keys are attributes of the record they may be indexed and even uniquely indexed, but I won't use them for DRI.

    I'll give this example, your username is UserABCD.

    Where the business key (username) is the key all user activity is tied to UserABCD. What happens when you want to change your username? you have to hit ALL related records in all tables, and you probably have to mess with either the DRI or add a new record for DRI to match before you delete the old record.

    Where the business key is an attribute, username is stored as an attribute of a record that has a UserId value of 54321, all of your user activity is tied to UserId 54321. What happens when you want to change your username? You change it and all records come with it.

    Given this example I almost ALWAYS assign an integer surrogate key. There are exceptions because like everything else, it depends.

    CEWII

  • I can see a GUID clustered if it's built on NewSequentialID() only.

  • I agree Steve. However, there are times when the GUID is needed before the record is created, so NewSequentialId() wouldn't work. Also, it only works on SQL2005 or later.

  • Steve Jones - Editor (6/15/2010)


    I can see a GUID clustered if it's built on NewSequentialID() only.

    I *partially* agree, it doesn't cause index fragmentation but it is still a HUGE value storage wise.. And in most of the cases I've seen to justify even having a GUID in the table are tied to the front-end needing to create a unique value locally, so a sequential GUID wouldn't help.. That is not to say that the cases I've dealt with are all inclusive.. Still, I'm going with No, don't do it without thinking it through 5-6 times and then even be really skeptical..

    CEWII

  • Steve Jones - Editor (6/15/2010)


    I can see a GUID clustered if it's built on NewSequentialID() only.

    You could still get page splits because NEWSEQUENTIALID may be generated out of sequence after a reboot - perhaps not a major problem though.

  • I'd like to hear some feedback from robblot (the original poster) and see A) what is he actually trying to do? and B) what did he decide to go with GUID or identity?

  • Since all the "big guns" are posting on this thread, wonder if I could get some thoughts on a design I am contemplating. We will have multiple sites, multiple databases, using the same schema. Through some process of replication (TBD) all data to be collected in one reporting source. There are no true natural keys. The identity columns used as surrogate keys in the previous design (and existing data) are a problem since they overlap by site.

    I am looking at a re-design with GUIDs as non-clustered primary keys, but a date/time column which will use the store time of the record as a clustered key for the tables. The GUIDs help in this way:

    - aid for the programmers because they can create them and don't have to reach into the data layer to get identity values back for multi-table inserts

    - multi-site support without worrying about setting up identity seeds and replication commands

    The clustered key of the store time of the record will probably have to use a uniquifier, but I'm thinking that it will save on page splits and I hate to implement a heap table. The clustered key would only be used for the production table design (for inserts), not carried over to any reporting database schema.

    Any thoughts? What am I not thinking of?

  • MelS-512196 (6/16/2010)


    Since all the "big guns" are posting on this thread, wonder if I could get some thoughts on a design I am contemplating. We will have multiple sites, multiple databases, using the same schema. Through some process of replication (TBD) all data to be collected in one reporting source. There are no true natural keys. The identity columns used as surrogate keys in the previous design (and existing data) are a problem since they overlap by site.

    I am looking at a re-design with GUIDs as non-clustered primary keys, but a date/time column which will use the store time of the record as a clustered key for the tables. The GUIDs help in this way:

    - aid for the programmers because they can create them and don't have to reach into the data layer to get identity values back for multi-table inserts

    - multi-site support without worrying about setting up identity seeds and replication commands

    The clustered key of the store time of the record will probably have to use a uniquifier, but I'm thinking that it will save on page splits and I hate to implement a heap table. The clustered key would only be used for the production table design (for inserts), not carried over to any reporting database schema.

    Any thoughts? What am I not thinking of?

    You're better off starting a new thread for that question so that you get all the community's help rather than a few big guns...

    you can always post a link to the new thread here if it makes you feel better.

  • I agree with the Ninja's suggestion about starting a new thread.

    I will however relate one thing that I have seen done and worked pretty well. There was a table that contained database server names and associated id values, you could always calculate what the local servers value was by querying this table and using @@SERVERNAME. In the tables that were shared the key contained the id value of the record AND the server id, so when that table was replicated it was unique to that server.

    What this meant was that one and only one server would have server id 2 and all records that originated at that server and were shared would have a server id of 2. This meant that the local server could maintain its own identity values without worrying about collisions because the records could ALWAYS be identified as unique.

    We set the server id as a smallint as opposed to int to save space and we NEVER expected to have more than 2-300 servers.

    CEWII

Viewing 12 posts - 16 through 26 (of 26 total)

You must be logged in to reply to this topic. Login to reply