Give Up on Natural Primary Keys

  • RonKyle - Monday, October 1, 2018 7:35 AM

    Programming / development is made much easier by this as well.

    I'm not current on the state of programming.  But back in my programming days, when an identity was necessary and also involved a child table, it was not easy to add the parent, retrieve the generated key, and then insert that key into the child.  The SEQUENCE, which didn't exist in SQL Server when I was programming, helps with that greatly.  But I certainly don't see how it makes things easier.  It may be harder.  But that's not the same as easier.

    In terms of changing values or redacting information, you would only change the non-surrogate values in the main table.  Every other table with an FK would be using the surrogate key, so those values wouldn't change.  No work needs to be done there.

    This wasn't written in reference to the original application, but to programming any redaction work.
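
    For the key-retrieval step described above, a minimal sketch in current T-SQL, assuming hypothetical Customer (parent) and SalesOrder (child) tables with an IDENTITY surrogate:

    DECLARE @CustomerID int;

    -- Insert the parent row; SCOPE_IDENTITY() returns the IDENTITY value
    -- generated by that insert in the current scope.
    INSERT INTO dbo.Customer (TaxNumber, CountryCode, CustomerName)
    VALUES ('1234567', 'US', N'Acme Ltd.');

    SET @CustomerID = SCOPE_IDENTITY();

    -- Use the captured surrogate for the child row.
    INSERT INTO dbo.SalesOrder (CustomerID, OrderDate)
    VALUES (@CustomerID, SYSDATETIME());

    -- With a SEQUENCE object the key can be fetched before either insert runs:
    -- DECLARE @NextCustomerID int = NEXT VALUE FOR dbo.CustomerSeq;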

  • jennym - Monday, October 1, 2018 8:00 AM

    I agree with all of the reasons given for use of surrogate keys.  Two thoughts come up for me around this:

    1.     I’ve found that when surrogate keys are used, duplication of information increases, depending on the person in charge of the database, whether due to hiccups in process, laziness, or something else.  I see administrators assign a new surrogate to an existing entity because apparently that is easier than determining whether the entity already exists.  With natural keys (at least anecdotally) this does not seem to happen as much.

    2.     Is there any risk of some ‘master surrogate’ key being associated with a given entity across most databases (think a cyber SSN), unintentionally creating an even bigger security risk?

    Some thoughts on your comments.

    1 - Unique nonclustered indexes can minimize duplicate data on key fields.
    2 - Surrogate keys tend to be meaningless in terms of data. They identify records, not data. So unless you can figure out that 9361903 = you, it's not something I would concern myself with when doing development.
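
    A minimal sketch of point 1, assuming a hypothetical Customer table: the surrogate is the clustered primary key, and a unique nonclustered index on the natural key columns stops an existing entity from quietly being handed a second surrogate.

    CREATE TABLE dbo.Customer
    (
        CustomerID    int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
        TaxNumber     varchar(20)   NOT NULL,
        CountryCode   char(2)       NOT NULL,
        CustomerName  nvarchar(100) NOT NULL
    );

    -- Unique nonclustered index on the natural key: a second row for the same
    -- TaxNumber/CountryCode pair fails instead of silently getting a new CustomerID.
    CREATE UNIQUE NONCLUSTERED INDEX UQ_Customer_NaturalKey
        ON dbo.Customer (TaxNumber, CountryCode);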

  • Victor Kirkpatrick - Monday, October 1, 2018 7:59 AM

    A single column int or bigint will always perform better than some multi-column natural key with one or two char/varchar columns.

    Performance is not the overriding concern.  I use identity keys if there is no good natural key, or if the natural key grows to three or four columns I'll use an identity key instead.  But then you have to ensure that the columns don't contain what is in reality a duplicate, which I've seen in this type of design.  A unique constraint would solve this, but what impact does that have compared to just making those columns the key to begin with?

    Performance is a major concern when you're dealing with tens of millions or billions of rows in parent / child relationships with multiple levels, and wide tables at that.


    Programming / development is made much easier by this as well.

    I'm not current on the state of programming.  But back in my programming days, when an identity was necessary and also involved a child table, it was not easy to add the parent, retrieve the generated key, and then insert that key into the child.  The SEQUENCE, which didn't exist in SQL Server when I was programming, helps with that greatly.  But I certainly don't see how it makes things easier.  It may be harder.  But that's not the same as easier.

    I was speaking mainly about joining tables, especially across multiple levels.  Having one column to join on at each level is a huge plus in terms of performance and clarity, not to mention the space savings from not having to repeat the upper-level natural key columns in the lower-level tables just to maintain the relationship (see the join sketch below).  Surrogate keys rock.  I'd rather have that one-column relationship and do the key generation and insert downstream (an easy operation, really) when needed, which is rare.  Stored procedures take care of that very easily when adding data normally.
    To each his own... this entire subject is very polarizing for db folks.
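
    A sketch of the multi-level joins described above, assuming hypothetical Customer, SalesOrder, and OrderLine tables keyed on int surrogates:

    -- Each level joins down on one narrow int column; the natural key columns
    -- live only in the top-level table and aren't repeated in the children.
    SELECT c.CustomerName,
           o.OrderDate,
           ol.Quantity
    FROM dbo.Customer   AS c
    JOIN dbo.SalesOrder AS o  ON o.CustomerID    = c.CustomerID
    JOIN dbo.OrderLine  AS ol ON ol.SalesOrderID = o.SalesOrderID
    WHERE c.CustomerID = 42;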

    Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    But there are larger concerns when it comes to supposed natural keys, just one of which I've already posted above, and many more have been posted by others on this thread.  There's also nothing at all that says you must have only one unique key on any given table.  For example, if you actually DO have a viable and proper multi-column natural key, there's no reason why you can't create an AK (Alternate Key) to help ensure that there will be no duplication in the table... well... except for name tables and the like, because all sorts of atrocities occur there, especially in the heat of battle and uncontrolled inserts, which happens a whole lot more than anyone would care to admit, especially if such names are coming from the internet, telephone conversations, third-party data, etc.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden - Monday, October 1, 2018 9:12 AM

    Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    But there are larger concerns when it comes to supposed natural keys, just one of which I've already posted above, and many more have been posted by others on this thread.  There's also nothing at all that says you must have only one unique key on any given table.  For example, if you actually DO have a viable and proper multi-column natural key, there's no reason why you can't create an AK (Alternate Key) to help ensure that there will be no duplication in the table... well... except for name tables and the like, because all sorts of atrocities occur there, especially in the heat of battle and uncontrolled inserts, which happens a whole lot more than anyone would care to admit, especially if such names are coming from the internet, telephone conversations, third-party data, etc.

    I'd actually reverse that: I'd make the surrogate the PK and the natural key the AK.  That way, when joining, I use the PK and reduce any future adjustments of those values.
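
    A sketch of that arrangement, assuming a hypothetical Product table: the surrogate declared as the PRIMARY KEY for joins, with the natural key kept unique as an alternate key via a UNIQUE constraint.

    CREATE TABLE dbo.Product
    (
        ProductID    int IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Product PRIMARY KEY,      -- surrogate used for all joins
        SKU          varchar(30)   NOT NULL
            CONSTRAINT AK_Product_SKU UNIQUE,       -- natural key enforced as the AK
        ProductName  nvarchar(100) NOT NULL
    );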

  • Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    We agree with the priority:  accuracy (or data integrity), then speed (or performance).  Therefore it is not an overriding concern.  Otherwise it would be first.  You are finding disagreement where none exists.

  • RonKyle - Monday, October 1, 2018 10:41 AM

    Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    We agree with the priority:  accuracy (or data integrity), then speed (or performance).  Therefore it is not an overriding concern.  Otherwise it would be first.  You are finding disagreement where none exists.

    I'll have to disagree again, Ron.  Performance is so close to accuracy in priority that they're almost the same.  The trouble is that many don't understand that designing for accuracy should not exclude also designing for performance; the two are not mutually exclusive.  If you don't consider both during design, one or the other (usually performance) gets left out, and then it bites people in the near future.  It doesn't matter how accurate something is if you can't actually get at the data you need and, likewise, it doesn't matter how fast something is if it produces incorrect results.

    --Jeff Moden

  • Jeff Moden - Monday, October 1, 2018 11:07 AM

    RonKyle - Monday, October 1, 2018 10:41 AM

    Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    We agree with the priority:  accuracy (or data integrity), then speed (or performance).  Therefore it is not an overriding concern.  Otherwise it would be first.  You are finding disagreement where none exists.

    I'll have to disagree again, Ron.  Performance is so close to accuracy in priority that they're almost the same.  The trouble is that many don't understand that designing for accuracy should not exclude also designing for performance; the two are not mutually exclusive.  If you don't consider both during design, one or the other (usually performance) gets left out, and then it bites people in the near future.  It doesn't matter how accurate something is if you can't actually get at the data you need and, likewise, it doesn't matter how fast something is if it produces incorrect results.

    Sorry, but you're wrong.  The first duty of a DBA, or data steward if that's the better term, is to the integrity of the data.  That is the overriding concern.  I have had too many bad experiences with developers forgoing things like transactions in the name of performance, only to have to deal with incomplete data issues later.  First, make it right.  Second, make it fast.  When doing the second, don't mess up the first.
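
    On the transactions point, a minimal sketch of wrapping the parent/child insert in a transaction so the pair commits or rolls back as a unit, using the same hypothetical tables as the earlier sketches:

    BEGIN TRY
        BEGIN TRANSACTION;

        DECLARE @CustomerID int;

        INSERT INTO dbo.Customer (TaxNumber, CountryCode, CustomerName)
        VALUES ('1234567', 'US', N'Acme Ltd.');

        SET @CustomerID = SCOPE_IDENTITY();

        INSERT INTO dbo.SalesOrder (CustomerID, OrderDate)
        VALUES (@CustomerID, SYSDATETIME());

        COMMIT TRANSACTION;   -- both rows, or neither
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION;
        THROW;                -- re-raise so the caller sees the failure
    END CATCH;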

  • RonKyle - Monday, October 1, 2018 11:12 AM

    Sorry, but you're wrong.  The first duty of a DBA, or data steward if that's the better term, is to the integrity of the data.  That is the overriding concern.  I have had too many bad experiences with developers forgoing things like transactions in the name of performance, only to have to deal with incomplete data issues later.  First, make it right.  Second, make it fast.  When doing the second, don't mess up the first.

    Sounds good in principle, but I have to tell you as a developer that performance is neck and neck with accuracy in importance.

    After all, if a data process is absolutely perfectly accurate but takes 12 days to produce an answer, that's clearly unacceptable. Especially if the answer is supposed to be available in real time...

    Now, admittedly getting a wrong answer with split-second performance isn't acceptable either. Still, the only acceptable design is one that produces accuracy as rapidly as possible. And that takes some serious design work.

  • And that takes some serious design work.

    That's the only kind I do.

    A further word on speed.  I have taken ETL processes that previously took hours and, by redesigning them, made them much faster.  Recently I took another one that was taking 45 minutes to an hour and got it down to under five minutes.  It's so much faster that the data warehouse is now updated every hour.  So please, no phrases like "it can't take 12 days"; I know that.  But I find that if the design is good, the speed is usually there with the proper indexing.

  • RonKyle - Monday, October 1, 2018 11:12 AM

    Sorry, but you're wrong.  The first duty of a DBA, or data steward if that's the better term, is to the integrity of the data.  That is the overriding concern.  I have had too many bad experiences with developers forgoing things like transactions in the name of performance, only to have to deal with incomplete data issues later.  First, make it right.  Second, make it fast.  When doing the second, don't mess up the first.

    I'm in no way suggesting that performance considerations should ever override accuracy.  I'm suggesting that you don't need to ignore performance for the sake of accuracy and that, quite frequently, the performance considerations will actually improve accuracy.  The example of using a surrogate key instead of a  multi-column natural key (which should be AK'd) will do both.

    --Jeff Moden

  • RonKyle - Monday, October 1, 2018 11:12 AM

    Sorry, but you're wrong.  The first duty of a DBA, or data steward if that's the better term, is to the integrity of the data.  That is the overriding concern.  I have had too many bad experiences with developers forgoing things like transactions in the name of performance, only to have to deal with incomplete data issues later.  First, make it right.  Second, make it fast.  When doing the second, don't mess up the first.

    In a lot of datasets, the data is almost never 100% accurate to begin with.  For example, revenue in various currencies on top of distributions over time.  In other cases, the data hasn't finished its lifecycle when you analyze it.  I could go on and on.

  • I'm suggesting that you don't need to ignore performance for the sake of accuracy and that, quite frequently, the performance considerations will actually improve accuracy. The example of using a surrogate key instead of a multi-column natural key (which should be AK'd) will do both.

    In none of my posts did I say to ignore performance.  You're making a claim that I did not make.  And I find the opposite to be the case: accuracy will improve performance.  I have made the occasional design alteration that created redundant data, but only when it was clear no other way was possible.  I can count on one hand the number of times this has been the case.  Even in the times I thought I would have to compromise the design, I was able to come up with solutions that worked at the speed needed.  The surrogate will NOT do both out of the box.  As a previous poster has noted, and I have observed in real life, duplicate data can get in because the duplicate entity was simply assigned a new key.  A unique constraint of some kind or other is required to prevent this from happening, and that cannot be without its cost.
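
    On the duplicate-entity point, a sketch of a look-up-or-insert pattern, assuming the hypothetical Customer table and unique natural-key index from the earlier sketch; the unique index remains the backstop under concurrency:

    CREATE PROCEDURE dbo.GetOrCreateCustomer
        @TaxNumber    varchar(20),
        @CountryCode  char(2),
        @CustomerName nvarchar(100),
        @CustomerID   int OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Look for the entity by its natural key first...
        SELECT @CustomerID = CustomerID
        FROM dbo.Customer
        WHERE TaxNumber = @TaxNumber
          AND CountryCode = @CountryCode;

        -- ...and only mint a new surrogate if it genuinely isn't there.
        IF @CustomerID IS NULL
        BEGIN
            INSERT INTO dbo.Customer (TaxNumber, CountryCode, CustomerName)
            VALUES (@TaxNumber, @CountryCode, @CustomerName);

            SET @CustomerID = SCOPE_IDENTITY();
        END;
    END;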

  • Jeff Moden - Monday, October 1, 2018 9:12 AM

    Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    But there are larger concerns when it comes to supposed natural keys, just one of which I've already posted above, and many more have been posted by others on this thread.  There's also nothing at all that says you must have only one unique key on any given table.  For example, if you actually DO have a viable and proper multi-column natural key, there's no reason why you can't create an AK (Alternate Key) to help ensure that there will be no duplication in the table... well... except for name tables and the like, because all sorts of atrocities occur there, especially in the heat of battle and uncontrolled inserts, which happens a whole lot more than anyone would care to admit, especially if such names are coming from the internet, telephone conversations, third-party data, etc.

    Part of the logical design of tables is identifying the Candidate Keys, those keys that can be used to uniquely identify a single row of data.  From there you have to decide which will be the Primary Key, making the others Alternate Keys, if you choose to implement one or more.

  • Lynn Pettis - Monday, October 1, 2018 3:11 PM

    Part of the logical design of tables is identifying the Candidate Keys, those keys that can be used to uniquely identify a single row of data.  From there you have to decide which will be the Primary Key, making the others Alternate Keys, if you choose to implement one or more.

    Yep... that's pretty much what I've been saying.  The thing that a whole lot of people leave out is the "what if's" when those supposedly reliable natural keys actually do change.  That throws a huge monkey wrench into "accuracy" and causes another "performance" issue that few even think about: the amount of time it takes to change the natural keys that have now been used in dozens or even hundreds of tables, as well as the regression testing that batch runs and GUIs alike must go through because of the mistake (see the sketch at the end of this post).

    --Jeff Moden
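
    A sketch of the maintenance cost described in that last post, assuming hypothetical tables: with the natural code as the primary key, a change to the code fans out to every referencing row, whereas with a surrogate the same business change touches a single row.

    -- Natural key as the PK: ON UPDATE CASCADE propagates a code change,
    -- but every referencing row (and every index on that column) is rewritten.
    CREATE TABLE dbo.ProductNat
    (
        ProductCode  varchar(30)   NOT NULL CONSTRAINT PK_ProductNat PRIMARY KEY,
        ProductName  nvarchar(100) NOT NULL
    );

    CREATE TABLE dbo.OrderLineNat
    (
        OrderLineID  int IDENTITY(1,1) NOT NULL CONSTRAINT PK_OrderLineNat PRIMARY KEY,
        ProductCode  varchar(30) NOT NULL
            CONSTRAINT FK_OrderLineNat_Product
            REFERENCES dbo.ProductNat (ProductCode) ON UPDATE CASCADE,
        Quantity     int NOT NULL
    );

    -- With a surrogate PK and the code held as an alternate key (as in the
    -- dbo.Product sketch earlier), the same business change is a one-row update
    -- and no foreign keys move:
    -- UPDATE dbo.Product SET SKU = 'NEW-CODE' WHERE ProductID = 42;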
