difference between primary key and IDENTITY

  • Can somebody give me an explanation (or point me to a link which explains) of the difference between IDENTITY and PRIMARY KEYs?  I am fairly new to SQL Server, and have set my id columns as PRIMARY KEY and IDENTITY (1,1) YES.  I am hoping to implement transactional replication eventually and am not sure this is the best option. 

    Thanks,

    Brett

     

  • A PK is a unique identifier for a row of data. An identity column will auto fill based on some increment each new row. Identities are not necessarily unique or contiguous. You can make the identity column your PK, but it doesnot have to be.

  • ok, so a Primary Key could be something like 'bmw' or 'ford' in a 'mfr_name' column in a table called 'vehicle_mfr' as long as the manufacturer was only in the table once.  I suppose my question is then, why wouldn't there be a 'mfr_id' int column, starting at 1 and auto-incrementing, which was set as the primary key.  I suppose then for Primary Keys, a default formula rather than a default value should be specified.  Is this on the right track?  Also, would there be any reason to have an IDENTITY column in a table that has a separate Primary Key?

    Many thanks.

  • Ok a primary key(pk) should be INVARIABLE with time. Sometimes it happens that what you thought it could be a good pk candidate is not anymore and you may ended redesigning a lot but if you use a surrogate key (something to identify rows on your table ) you are almost sure to have something stable.

    I know there is a lot of debate between to use surrogates or not to use them but that's another topic

    About the IDENTITY column where there is a pk is you said that the pk constraint is ENFORCED then there is no apparent need for it but some times it helps to accomplish certain business needs(queries) with ease

     


    * Noel

  • Yes, a PK shouldn't be volatile. A PK is one of the fundamental concept of relational databases. It is used to uniquely identify a row in a table. The IDENTITY property you can define on a numeric column offers simply a mechanism for providing a unique number. I would look at BOL for explanation of both. And if you have some money to spent, maybe buy a book on designing relational databases. For beginners I think Database Design for Mere Mortals (Second Edition) von Michael J. Hernandez, Addison-Wesley; ISBN: 0-201-75284-0 is quite worth reading because it explains things in a non academic fashion. Another option is to ask questions here. It's a real time saver.

    HTH

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • ok, so a Primary Key could be something like 'bmw' or 'ford' in a 'mfr_name' column in a table called 'vehicle_mfr' as long as the manufacturer was only in the table once.  I suppose my question is then, why wouldn't there be a 'mfr_id' int column, starting at 1 and auto-incrementing, which was set as the primary key.  I suppose then for Primary Keys, a default formula rather than a default value should be specified.  Is this on the right track? 

    There is no law that state that you must have a numeric primary key. However, numbers are easy to sort and search, therefore offer performance advantages over strings. If you have a natural PK, of course you can use this. The concept of IDENTITY sometimes seem to lead to rather religious discussions. Difficult to judge who is right and who is wrong. I guess everyone has to find out for himself which way to go.

    Also, would there be any reason to have an IDENTITY column in a table that has a separate Primary Key?

    No, I would say.

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Some systems I have "inherited" had multiple Colums primary keys (4-5) and I was task to speed some of the code! One of the shortest solutions to get it going was to ADD an identity column and a index on it !!!

    You see there is not a necesity on the model but in practice sometimes it helps.

    This is why some times you have a perfect 3NF design and you DENORMALIZE to get speed!!

     


    * Noel

  • Brian Kelley has a nice article on normalization http://qa.sqlservercentral.com/columnists/bkelley/normalization.asp

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • Thanks guys, this will certainly help me! 

    I think I agree with Frank on the numeric PK performance enhancements.  I think it just seems most obvious to visualize the rows in a numeric order, and then sort them in that way.  I have seen tables before that do have other unique values, but numeric seems easiest to me. 

    noeld, if I am ever in your situation I will certainly remember your solution!

    Frank, thanks for the article on normalization.

  • I think I agree with Frank on the numeric PK performance enhancements.  I think it just seems most obvious to visualize the rows in a numeric order, and then sort them in that way.  I have seen tables before that do have other unique values, but numeric seems easiest to me. 

    Purists and hardliners might now argue, you've successfully turned a RDBMS into some kind of something else. I know I will get crucified for quoting Joe Celko again, but here it goes. Take it purely as some entertaining evening fun from me.

    >> I'm trying to change an existing field [sic] to an

    IDENTITY field [sic], but I'm getting syntax errors. It's got to be

    simple, but somewhere I'm missing something. <<

    What you are missing is the basic concepts of the relational model.

    Columns are not fields; rows are not records; tables are not files. An

    IDENTITY property cannot be a key by definition. A key is a subset of

    attributes that uniquely define an entity in your data model.

    The IDENTITY column is a holdover from the early programming language

    which were very close to the hardware. For example, the fields in a

    COBOL or FORTRAN program were assumed to be physically located in main

    storage in the order they were declared in the program.

    The early SQLs were based on existing file systems. The data was kept

    in physically contiguous disk pages, in physically contiguous rows, made

    up of physically contiguous columns. In short, just like a deck of

    punch cards or a magnetic tape.

    But physically contiguous storage is only one way of building a

    relational database and it is not always the best one. But aside from

    that, the whole idea of a relational database is that user is not

    supposed to know how things are stored at all, much less write code that

    depends on the particular physical representation in a particular

    release of a particular product.

    One of the biggest errors is the IDENTITY column (actually property) in

    the Sybase family (SQL Server and Sybase). People actually program with

    this "feature" and even use it as the primary key for the table! Now,

    let's go into painful details as to why this thing is bad.

    The practical considerations are that IDENTITY is proprietary and

    non-portable, so you know that you will have maintenance problems when

    you change releases or port your system to other products.

    But let's look at the logical problems. First try to create a table

    with two columns and try to make them both IDENTITY. If you cannot

    declare more than one column to be of a certain datatype, then that

    thing is not a datatype at all, by definition. It is a property which

    belongs to the PHYSICAL table, not the data in the table.

    Next, create a table with one column and make it an IDENTITY. Now try

    to insert, update and delete different numbers from it. If you cannot

    insert, update and delete rows from a table, then it is not a table by

    definition.

    Finally create a simple table with one IDENTITY and a few other columns.

    Use a few statements like

    INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');

    INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');

    INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');

    to put a few rows into the table and notice that the IDENTITY

    sequentially numbered them in the order they were presented. If you

    delete a row, the gap in the sequence is not filled in and the sequence

    continues from the highest number that has ever been used in that column

    in that particular table.

    But now use a statement with a query expression in it, like this:

    INSERT INTO Foobar (a, b, c)

    SELECT x, y, z

    FROM Floob;

    Since a query result is a table, and a table is a set which has no

    ordering, what should the IDENTITY numbers be? The entire, whole,

    completed set is presented to Foobar all at once, not a row at a time.

    There are (n!) ways to number (n) rows, so which one do you pick? The

    answer has been to use whatever the physical order of the result set

    happened to be. That non-relational phrase "physical order" again.

    But it is actually worse than that. If the same query is executed

    again, but with new statistics or after an index has been dropped or

    added, the new execution plan could bring the result set back in a

    different physical order. Can you explain from a logical model why the

    same rows in the second query get different IDENTITY numbers? In the

    relational model, they should be treated the same if all the values of

    all the attributes are identical.

    Using IDENTITY as a primary key is a sign that there is no data model,

    only an imitation of a sequential file system. Since this number exists

    only as a result of the state of particular piece of hardware at a

    particular time, how do you verify that an entity has such a number in

    the reality you are modeling?

    To quote from Dr. Codd: "..Database users may cause the system to

    generate or delete a surrogate, but they have no control over its value,

    nor is its value ever displayed to them ..."(Dr. Codd in ACM TODS, pp

    409-410) and Codd, E. (1979), Extending the database relational model to

    capture more meaning. ACM Transactions on Database Systems, 4(4). pp.

    397-434. This means that a surogate ought ot act like an index; created

    by the user, managed by the system and NEVER seen by a user. That means

    never used in queries.

    Codd also wrote the following:

    "There are three difficulties in employing user-controlled keys as

    permanent surrogates for entities.

    (1) The actual values of user-controlled keys are determined by users

    and must therefore be subject to change by them (e.g. if two companies

    merge, the two employee databases might be combined with the result that

    some or all of the serial numbers might be changed.).

    (2) Two relations may have user-controlled keys defined on distinct

    domains (e.g. one uses social security, while the other uses employee

    serial numbers) and yet the entities denoted are the same.

    (3) It may be necessary to carry information about an entity either

    before it has been assigned a user-controlled key value or after it has

    ceased tohave one (e.g. and applicant for a job and a retiree).

    These difficulties have the important consequence that an equi-join on

    common key values may not yield the same result as a join on common

    entities. A solution - proposed in part [4] and more fully in [14] - is

    to introduce entity domains which contain system-assigned surrogates.

    Database users may cause the system to generate or delete a surrogate,

    but they have no control over its value, nor is its value ever displayed

    to them....." (Codd in ACM TODS, pp 409-410).

    References

    Codd, E. (1979), Extending the database relational model to capture more

    meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434

    The most common use that a Newbie makes of IDENTITY is to use it as a

    record number (under the error that a record nubmer is a key!), so that

    he does not have to think about keys, DRI, check digits, proper data

    types, international standards and all that hard stuff.

    While this was meant as an abstract example, I also fear that you have

    not read ISO-11179 because of the silly, redundant, dangerous prefixes

    on your code.

    --CELKO--

    Steve, no diatribe from me!

     

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

  • This type of scenario is also very used in Datawarehousing, so keep that in mind


    * Noel

  • I have to say that I have watched with some dismay the tendency of programmers/developers to automatically make the first column of their tables an indentity column without any real thought as to the data contained in the table. Worse, in my opinion, is the fact thes often get set as the PK ( not always clustered to be fair ) but that the column is never ( or rarely ) used in the queries against the database. 

    [font="Comic Sans MS"]The GrumpyOldDBA[/font]
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • Although I agree with you, this seems like some kind of religious discussion which we won't solve here. Using the IDENTITY property is not bad per se, as long as no natural PK exists. But why being a hardliner on this? There are more severe issuer to watch out for when it comes to bad design.

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/[/url]

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply