Why must tables have PKs?

  • Hi all,

    I find myself in a situation where I have to justify, in a concrete, rubber-meets-the-road manner, the need for SQL Server 2000 database tables to have primary keys.  Not just declared keys, but tables that actually have a key.

    We have a number of tables that are extracts of monetary activity on accounts.  The system of record does not provide a reliable key field that we could import as the key of this table.  As such, we rely on a record count to ensure we have loaded the correct number of transactions.  The table is then used by numerous reporting processes to answer business questions.  The problem I see is that the table has duplicate rows, defying the first rule of relational database theory.

    I know that in other discussions I have seen it suggested that this is a perfect example of when to use a surrogate key.  I think an article on this site suggests that an identity field set as PK is better than nothing.  I need to find a way to PROVE that it is better than nothing.  I am being asked to show that the data described above is used more efficiently by SQL Server in building the reports that are based on it.  The argument that good relational design will ensure data integrity is not good enough: the table is completely replaced every day, and there is no updating, inserting, or deleting of rows.

    I am not an expert in SQL Server internals, but what I have learned so far leads me to believe that the product is engineered to expect "good" relational design.

    Can anyone help me with a technique I can use to provide this demonstration?

     

  • You should always have some sort of unique identifier.  In your situation, I would just use an identity column, separate from the data entered.  If the table is joined in any way, using identity columns for primary and foreign keys is usually best for performance.

    You should also have a clustered index on a table, always.  The primary key can serve as that, although it may not be the best choice.  Without a clustered index, the table is organized as a "heap", which is as bad as it sounds as far as performance goes.  Since this is a table that is recreated every night, I think this is the bigger issue.  Does a table have to have a primary key?  No.  SQL Server does keep an internal row ID to differentiate rows.  You could get by without a primary key, with just a clustered index on another field.  My choice would still be an identity-based primary key.
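    A minimal sketch of what that could look like (the table and column names here are invented for illustration, not taken from your system):

        -- Reporting table rebuilt nightly, with a surrogate key
        CREATE TABLE dbo.PaymentExtract (
            payment_id   int IDENTITY(1,1) NOT NULL,
            account_num  char(16)          NOT NULL,
            payment_amt  money             NOT NULL,
            payment_date smalldatetime     NOT NULL,
            CONSTRAINT PK_PaymentExtract PRIMARY KEY CLUSTERED (payment_id)
        )

    If the reports mostly join on account number, the clustered index could instead go on account_num and the primary key could be declared NONCLUSTERED.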

    So, from a reporting standpoint, a primary key is optimal in join situations, and a clustered index is critical for performance. 

    Just out of curiosity, why is there an objection to placing a key on the table?  And how are the existing duplicates dealt with?

  • I am being told that there is no benefit to having a primary key defined, especially a surrogate key of identity if it is not used in joins. 

    We have another table of account information that does have unique rows (the key is account_num); however, the uniqueness is enforced only by an index that is created after the data is loaded for the day.

    The most common use of the monetary transactions is to match them with accounts by joining on account number.  Obviously this will return all the payments for an account for that day.  Since we join on the account number (a good candidate for an index), the developers (and my manager) claim that the surrogate key created by identity is of no value.  They insist that their experience outweighs any academic discussion about relational theory and all tables being implemented in, at minimum, first normal form.

    What would be the easiest way to demonstrate that performance is bad when a table is a HEAP?

    Thanks for your comments,

    Glen

  • Interesting.  Well, given that the account number is the point of the join, I'd make that the clustered index.  SQL Server then literally orders the records based on account number.  A date field might be a good candidate if queries are often date-driven.  Try running similar queries with and without the clustered index.  If it is a large amount of data, you should see a difference.
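    One way to run that comparison, sketched below with hypothetical table and column names: load the same extract into two copies of the table, leave one as a heap, put a clustered index on the other, and compare the I/O statistics.

        -- Same data loaded into dbo.Payments_Heap (no indexes at all)
        -- and dbo.Payments_Clustered
        CREATE CLUSTERED INDEX IX_Payments_AcctNum
            ON dbo.Payments_Clustered (account_num)

        SET STATISTICS IO ON
        SET STATISTICS TIME ON

        -- Payments for one account: the heap copy must scan every page,
        -- the clustered copy can seek straight to that account
        SELECT account_num, SUM(payment_amt) AS total_paid
        FROM dbo.Payments_Heap
        WHERE account_num = '1234567890123456'
        GROUP BY account_num

        SELECT account_num, SUM(payment_amt) AS total_paid
        FROM dbo.Payments_Clustered
        WHERE account_num = '1234567890123456'
        GROUP BY account_num

    Compare the logical reads and elapsed times reported in the Messages pane; the same pattern shows up in the account-level reporting joins.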

    One of the cases for using a basic identity column for the join key is size -- the int data type is only 4 bytes.  Natural keys, by contrast, are often some sort of text-based field.  The larger the key and the larger the tables joined, the slower the query will tend to be.

    It sounds like arguing the merits of a contextless primary key (such as an identity column) in this case is going to be difficult.  The only way that seems likely to work is to actually demonstrate better performance.  Given the size of what you're querying, there may or may not be a performance gain.  I think the identity method scales better than the others.  In your case, I'd just try a clustered index and see if there's a performance gain there -- don't push the "academic discussion" of relational databases.

    Good luck!

  • I am being told that there is no benefit to having a primary key defined, especially a surrogate key of identity if it is not used in joins.

    I disagree with James insofar as he says you should not push the academic discussion of relational databases!

    Give that person(s) an easy-to-read introduction to relational databases. A PK is one of the fundamental concepts in relational database theory. It's used to uniquely identify a row within a table. Nothing more and nothing less. It's not all about performance.

    Even better is to let them post their opinion to MS newsgroups, and with a little bit of luck get a response from Joe Celko. Should be very funny.

    Just my $0.02.

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • I actually agree with you, Frank.

    What I think, though, is that in this particular situation it will lead to more arguments and frustration.  Who wants to beat their head against a wall?

    Ooo, yeah, that would be a good thread on the MS newsgroup.

  • James,

    yes, I know.

    "Against stupidity the gods themselves struggle in vain."

    Nobody will benefit from a clash. That will not generate a win-win situation.

    ...has anyone noticed that I attended a meeting with marketing guys today?

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • I was guessing you had recently attended an HR class. 

    Robert

  • There is no hard-and-fast rule that states that a table must have a PK. However, it is recommended that they do. Now, what type of PK should the table have? Well, that depends. I like to use natural PKs where the data allows for that. In cases where one is not easily discernible, I use artificial keys to identify records uniquely. With regard to joins for query purposes, we rely on indexes to help us there. If your tables are not going to be related, then the concept of PK and FK does not come into play, and you can still identify records uniquely in tables without PKs by simply using indexes or enforcing table constraints in a declarative manner. In summary: do you need a PK on a table? No. Is it recommended? Yes. Can you uniquely identify records in a table without a PK? Yes. Can you eliminate duplicates in tables without PKs? Yes: use ignore-duplicate indexes, as long as the number of columns does not exceed the limit, which I believe is 16.
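    For the ignore-duplicate option mentioned above, a rough sketch (hypothetical table and column names); note that it silently discards the extra rows, so it only fits when the duplicates really are unwanted:

        -- Unique index that drops duplicate rows at load time with a
        -- warning instead of failing the whole insert
        CREATE UNIQUE INDEX UX_Payments_NoDups
            ON dbo.Payments (account_num, payment_amt, payment_date)
            WITH IGNORE_DUP_KEY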

  • Do you understand that my dilemma is that it is being suggested that the table does and should contain duplicate rows?  I understand that there are several ways to enforce uniqueness in a table.  What I need to do is convince my supervisor and several peers that the rows need to be unique at all.  I believe it to be a fundamental concept of relational databases.  Yet I am being asked to accept that there are several tables in our data warehouse that contain non-unique data.  I have been told that unless I can prove that the database will perform better with uniqueness enforced by a surrogate key column, it is not worth it.

  • Wow.  Performance has nothing to do with uniqueness.  If there really is duplicate data, then they have a data quality issue -- performance doesn't play into it at all.  Now, if they're saying that there is a surrogate key that supplies uniqueness, THAT is the primary key.  It's just a question of semantics.  Either there is duplicate data or there isn't.  If rows aren't unique in some way, you have a potential data quality issue.

    It sounds like there might be a cross-argument going on here, or a non-technical agenda.  A surrogate key is still a key.  It can be a primary key, or it can be a "search" key.  When I initially responded I was looking at this from a physical database implementation viewpoint.  I still stand by saying that simple integer identity columns are best for use as keys, both primary and foreign.  Given your last post, though, it seems like the issue at hand is whether there is duplicate data or not.  If what they're referring to as a surrogate key allows for duplicates, then it isn't really a key at all.  If there are duplicates, what are the other criteria that make them valid data elements (date, product type, salesperson, etc.)?

    If they can't explain why there can be duplicate data (if there is), and they don't understand what the *business* meaning is if there is duplicate data, I really don't know what else to suggest.  If there's really duplicate data, perhaps exploring a specific business case and the implications for data quality might be the way to go.  For example, if two salespeople take credit for the same sale and it is entered twice, you're going to make it look as if there are multiple sales.  That won't reconcile with the accounting data.

    Very strange situation you're in, and I don't know what else to suggest.  There looks to be a communication gap somewhere.  In previous posts you mentioned them talking about "the academic discussion of relational databases".  Try avoiding database language altogether and talk about it in business terms.  If they can't understand the business problems associated with decisions based on suspect data, you've got a real problem.

    Hang in there.  My heart goes out to you -- I wouldn't want to be in your shoes.

  • If you ever need to work with the data, then having a PK will make life much easier, even if it is just an identity column.  Perhaps you can argue that development and maintenance time will be reduced.  An example might be that you have loaded an extra record by mistake.  How would you delete just one copy if it is a duplicate?  One way is to use SET ROWCOUNT, provided you only have one record duplicated.  The PK might also reduce table scans in queries.  If you think a trigger will be required, then a way to identify the records before and after is very helpful.  In that case, an identity column is nice because the user can't update it, so the before and after records can be matched; a PK alone is not enough, because the PK values might be updated.
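    The SET ROWCOUNT trick looks something like this (hypothetical table and column names); with an identity PK you would simply delete by the key value instead:

        -- Remove exactly one copy of an accidentally double-loaded row
        SET ROWCOUNT 1

        DELETE FROM dbo.Payments
        WHERE account_num  = '1234567890123456'
          AND payment_amt  = 50.00
          AND payment_date = '20040315'

        SET ROWCOUNT 0   -- reset so later statements are unaffected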

    Our legacy system has duplicate ledgers that are valid - perhaps the same item is purchased more than once in a given day.  A datetime field with the insert time would have saved me days of work repairing the ledgers when something went wrong.  I can normally detect invalid duplicates by checking whether all the ledgers in a given day are duplicated.

    Those who argue against a PK probably are not responsible for fixing the problems they create with bad design. 

    Randy
    Helpdesk: Perhaps I'm not the only one that does not know what you are doing. 😉

  • This sounds like a typical SQL Server vs. Oracle feature question.

    If I recall correctly, Oracle creates a unique identifier for each row, if one does not already exist.  I'll bet that this issue is a non-issue for people coming from the O-world.  SQL Server doesn't create this unique identifier in the background automatically - at least not yet anyway.

    Uniqueness is mandatory if you would ever want to update or delete only one of those rows.  Otherwise any action that would involve one row would involve all rows whether you wanted it to, or not.

    That's why we all have unique SSNs, unique employee numbers, unique addresses, unique fingerprints, etc.  If we didn't, we would all respond when any one of us was called upon to do something.

    Uniqueness is mandatory.  It's just a matter of how you achieve it, or whether you even realize that you are getting it behind the scenes.

    Regards,

    Randy

  • Yes, I think you're all catching on now.  The discussion is really about uniqueness in tables, not programmatically setting a PK.

    One of the tables in question contains payments customers have made on their credit cards.  It is possible, and happens more often than you would guess, that more than one payment for the same amount is posted to one account on the same day.  The table is an extract from the mainframe system of record.  Unfortunately, the system of record does not provide a unique identifier for each payment.  No inserts or deletes are performed on this table other than the daily bulk insert of the entire extract.  We perform a simple record count comparison to ensure that we loaded all the records sent by the mainframe.  The table does have duplicate rows, has no clustered index, and is used primarily in joins to account-level data to provide MIS reporting related to collections on those accounts.

    There are a couple of other examples of the same situation with different system of record data. 

    The really interesting part for me is that one of the main opponents of this best practice wrote an application that matches payments to promises to pay, credited to collectors for incentive purposes.  In his application he consumes this payment data from the main extract, gives it a "unique identifier" in his database, and then matches the payments with promises.  My suggestion of adding the identity field at the extract level would ensure that all of our reporting handles the payments in a consistent manner.
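    A sketch of how that could be wired into the nightly load, assuming the reporting table carries an identity column and the raw mainframe file lands in a keyless staging table first (table names hypothetical):

        -- Nightly: clear and reload; the surrogate payment_id values are
        -- assigned in this step, so every downstream process sees the same
        -- identifier for the same physical payment row
        TRUNCATE TABLE dbo.PaymentExtract

        INSERT INTO dbo.PaymentExtract (account_num, payment_amt, payment_date)
        SELECT account_num, payment_amt, payment_date
        FROM dbo.PaymentExtract_Staging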

    Thanks a lot for all your comments and sympathy.  I am guessing that it will take an embarrassing discovery of inconsistent reports by our customers to get these people to understand the message.

    Cheers,

    Glen

  • I don't understand this discussion.

    If duplicate data is mandated by the nature of the data described, then you don't need a primary key.

    If you need to be able to distinguish between every piece of data separately, then put in a primary key.

    It's just that the former case is very rare. Usually there is no point in having the exact same data presented twice, physically.

    In your case instead of having:

    • Duplicate Data
    • Duplicate Data
    • Duplicate Data

    I would have:

    • Data, Counter = 3

    PS: Is there no date specified for these credit card transactions? If there is one, then your data wouldn't be duplicated anyway; the dates would be different.
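    A sketch of what that counter-based collapse could look like at load time (table and column names hypothetical):

        -- Store each distinct payment once, with a count of how many
        -- times it occurred in the day's extract
        SELECT account_num, payment_amt, payment_date,
               COUNT(*) AS payment_count
        INTO dbo.PaymentExtract_Collapsed
        FROM dbo.PaymentExtract_Staging
        GROUP BY account_num, payment_amt, payment_date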
