Give Up on Natural Primary Keys

  • RonKyle - Monday, October 1, 2018 1:17 PM

    I'm suggesting that you don't need to ignore performance for the sake of accuracy and that, quite frequently, the performance considerations will actually improve accuracy. The example of using a surrogate key instead of a multi-column natural key (which should be AK'd) will do both.

    In none of my posts did I say to ignore performance.  You're making a claim that I did not make.  And I find the opposite to be the case.  Accuracy will improve performance.  I have made the occasional design alteration that created redundant data, but only when clear no other way was possible.  I can count on one hand the number of times this has been the case.  Even times I thought I would have to compromise the design I was able to come up with solutions that worked at the speed needed.  The surrogate will NOT do both out of the box.  As a previous poster has noted, and I have observed in real life, duplicate data can get in there because it the duplicate entity was simply assigned a new key.  A unique constraint of some kind or other is required to prevent this from happening.  That cannot be without its cost.

    My point is that you were being pedantic about the word "overriding".  While I agree that "accuracy" is THE overriding factor, it's not the only super important factor with "overriding" tendencies.  The fact that you were so pedantic about it (especially in your disagreement responses to me) made it sound like you didn't actually care about performance, which is a frequent problem with a whole lot of people.

    Go back to the beginning, the proper use of surrogate keys and the related alternate keys usually solves both accuracy and performance problems.  There's just no need to compromise on either during design or implementation.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I find it strange that we still see articles like this discussing "primary" keys. In SQL Server in most (or all?) circumstances there is no real difference between a key defined using a PRIMARY KEY constraint and a key defined using a UNIQUE constraint on non-nullable columns. True, there are some very trivial differences of syntax but why would those tiny differences need to influence key design decisions?

    GDPR effectively gives even more legal weight to the requirement that personal data should always have natural keys. The key needs to be familiar and visible to users and stable over time so that their data can be properly identified in order to satisfy GDPR requirements.

  • sqlvogel - Tuesday, October 2, 2018 12:10 AM

    GDPR effectively gives even more legal weight to the requirement that personal data should always have natural keys. The key needs to be familiar and visible to users and stable over time so that their data can be properly identified in order to satisfy GDPR requirements.

    So what would the natural key be for the following personal data?
    Title, Firstname, Lastname, Address, Telephone Number and Email address.

    Far away is close at hand in the images of elsewhere.
    Anon.

  • This conversation puts me very much in mind of a new piece of software we're using at the office, which is setup with an email address being the primary key. Instant nightmare for 2 reasons. One, the very subject we're talking about here;if a customer invokes the right to be forgotten we can't simply anonymise the data, as their email address is in almost every table; at least if it's client data.

    The other problem is, Email address aren't unique (and I'm sure we all know this)! I know several people (albeit it, I will admit of the older generation), that have shared email accounts. Instantly we can't register them twice on the system as it stands. Secondly, we own multiple services, of which a customer could be registered with both, as they are different and deal with different entities (which is good, it means people can't access things they aren't responsible for (data segragation)), however, we can't register these people for more than one service out of the box, as (like before), it's non-unique.

    I have to agree that I avoid Natural Keys as well. Many of us have probably been told "It's unique, it'll never appear more than once", for it to be a total lie. This GDPR reasoning just gives me/us another reason to argue against it.

    I think one of my favourite natural keys I was told to use was the Passport Number for Horses. I was, very assurably, told "Every Horse has a passport number, and it will never change." What they didn't, however, tell me was that although every horse has a passport, operators won't always capture it, or will put "non-unqiue holding values" like "TBC" in the field. Oh, and although a Horse's Passport Number might not change, that doesn't mean it can't be sold, and thus need to be put under a different clients data (with that same unique passport number). >_< So glad that I didn't follow that "advice".

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • sqlvogel - Tuesday, October 2, 2018 12:10 AM

    I find it strange that we still see articles like this discussing "primary" keys. In SQL Server in most (or all?) circumstances there is no real difference between a key defined using a PRIMARY KEY constraint and a key defined using a UNIQUE constraint on non-nullable columns. True, there are some very trivial differences of syntax but why would those tiny differences need to influence key design decisions?

    GDPR effectively gives even more legal weight to the requirement that personal data should always have natural keys. The key needs to be familiar and visible to users and stable over time so that their data can be properly identified in order to satisfy GDPR requirements.

    Agreed.  There's little difference but... this article isn't actually about PK constraints, although those certainly fit into what the article is talking about.  It is, in fact, about what one uses as a unique identifier for rows in tables.  It usually just easier to say "PK", though.

    As for GDPR providing more justification for the use of natural keys as the PK, I have to disagree because I've what I've been through in the past.  I'm an advocate of using surrogate keys as the proverbial PK and the natural keys as an AK and only in one or two tables usually necessary to store the PII.  That way, I can anonymize the PII for an individual just by making data changes to those one or two tables without having to go through dozens or hundreds of tables.  A great example is when people use the TaxID (usually, SSN in the U.S.) as a natural key.  There's hell to play to find all tables that contain such information and then remove it.  It's far better (IMHO) to keep all data anonymous except when it's related to the one or two tables that aren't.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • I would go even further to not just avoid Natural Primary Keys but aswell PKs using <Value> +1 (at least for Scenarios with a lot of Rows to be expected), however that one has nothing to do with GDPR but rather with the fact that CPUs operate faster on shift than addition. Every CPU Cycle counts and everyone who had to face a GUID as PK knows the other end of the nightmare.

  • DinoRS - Tuesday, October 2, 2018 6:32 AM

    I would go even further to not just avoid Natural Primary Keys but aswell PKs using <Value> +1 (at least for Scenarios with a lot of Rows to be expected), however that one has nothing to do with GDPR but rather with the fact that CPUs operate faster on shift than addition. Every CPU Cycle counts and everyone who had to face a GUID as PK knows the other end of the nightmare.

    "CPUs operate faster on shift than addition", would you care to explain that, I'm a bit confused!

  • First we must understand what is the external function of a "primary" key (versus an alternate key or candidate key). Once you understand a primary key's external role in relation to other tables, then you can decide how the key should be composed internally (natural or surrogate) or physically structured (clustered versus non-clustered). In my opinion, and I believe this also conforms to the SQL standard, the designated primary key should be the key that we exposed for foreign key relationships, which is why it's called the "primary" key. A composite key like SSN + DOB + FIRSTNAME is a very poor choice for a primary key, mostly because it's mutable, but also for performance and normalization reasons as well. A surrogate key (ie: a sequentially incremented CustomerID) is a better choice for the primary key. Of course, it's also a good idea to create an alternate key index on the natural key as well, not only for the purpose of data integrity (uniqueness) but also because you'll find that many end user queries will key on natural identifiers (ie: SSN) not the primary key.

    My point is that the primary key isn't just about internal uniqueness, it's about external referential integrity, and that should drive your decision about how to construct it. Natural keys are very important, and there should always be a natural key for internal integrity (uniqueness within the table), but that key need not be the designated primary (facilitating foreign key relationships). If you can't identify a natural key, then there is something wrong with your data or understanding of of the data; go back to the white board or consult with others in the organization.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • RonKyle - Monday, October 1, 2018 10:41 AM

    Heh... performance is second only to accuracy so I'll have to disagree with the notion that performance isn't an overriding concern.

    We agree with the priority:  accuracy (or data integrity), then speed (or performance).  Therefore it is not an overriding concern.  Otherwise it would be first.  You are finding disagreement where none exists.

    Accuracy should be non negotiable. However with performance, you can get "good enough", it would be interesting to ruminate about situations where accuracy is merely "good enough", even with non critical apps I don't think theres really a plus to being inaccurate.

  • DinoRS - Tuesday, October 2, 2018 6:32 AM

    I would go even further to not just avoid Natural Primary Keys but aswell PKs using <Value> +1 (at least for Scenarios with a lot of Rows to be expected), however that one has nothing to do with GDPR but rather with the fact that CPUs operate faster on shift than addition. Every CPU Cycle counts and everyone who had to face a GUID as PK knows the other end of the nightmare.

    I'd argue that, at the register level on the CPU, that it matters whether or not they're BIG ENDIAN or LITTLE ENDIAN but it's been a very long time since I've looked at individual machine language operation clock cycle charts. 

    The other thing is, if you're actually talking about "Shift Left" or "Shift Right" operations, how does that even apply here?  "Shift Left" is multiplication by 2 and "Shift Right" is division by 2, neither of which is appropriate for the subject at hand.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden - Tuesday, October 2, 2018 8:24 AM

    ...talking about "Shift Left" or "Shift Right" operations....

    Don't forget the hokey pokey :hehe:

    Far away is close at hand in the images of elsewhere.
    Anon.

  • sqlvogel - Tuesday, October 2, 2018 12:10 AM

    GDPR effectively gives even more legal weight to the requirement that personal data should always have natural keys. The key needs to be familiar and visible to users and stable over time so that their data can be properly identified in order to satisfy GDPR requirements.

    I would completely disagree here. The data needs to be accurate and identifiable, which a surrogate does. However, data isn't always stable. Ladies get married? PK changes if you're using the name. If you're using a combination of fields, I would argue you're creating additional risk by putting PII in places it doesn't need to be, and violating the data minimisation principle of the GDPR.

  • Jeff Moden - Tuesday, October 2, 2018 8:24 AM

    I'd argue that, at the register level on the CPU, that it matters whether or not they're BIG ENDIAN or LITTLE ENDIAN but it's been a very long time since I've looked at individual machine language operation clock cycle charts. 

    The other thing is, if you're actually talking about "Shift Left" or "Shift Right" operations, how does that even apply here?  "Shift Left" is multiplication by 2 and "Shift Right" is division by 2, neither of which is appropriate for the subject at hand.

    Hi Jeff,
    when you do (need) to consider replacing Natural Keys, "Shift  Left / Right" is quite viable. It might get even more viable when you consider a lot of rows in a very short time.
    Intel might agree with me on this subject, considering the following statement:

    In processors based on Intel Core microarchitecture, shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.

    Ofcourse, this might have changed with newer Architectures again, however I wouldn't expect any other result regarding these operations. You might not consider something like this unless you're heavily bound by CPU Resources or if your deep analysis uncovers these stalls to happen in your systems.

    Another point while being on this is: I never said I'd use uniqueidentifier or int or similar for this, have a look at this in binary:
    01000100
    01000110
    01000101
    01000010
    01000001

    Can you see how this might be much more benefical for Indexing than say a GUID or even NEWSEQUENTIALID() considering add (as CPU Operation to get value + 1) might get to the point of stalling? It might be even less readable than a GUID - which shouldn't be exposed publicly anywhere either within your application - but for extremely INSERT and Index Seek heavy Operations it should be (partially) faster. You can even go ahead and limit the amount of possible key by defining the size of the binary column appropriately.

  • Jeff Moden - Monday, October 1, 2018 6:13 PM

    Lynn Pettis - Monday, October 1, 2018 3:11 PM

    Part of the logical design of tables is identifying the Candidate Keys, those keys that can be used to uniquely identify a single row of data.  From there you have decide which will be the Primary Key making the others Alternate Keys, if you choose to implement one or more.

    Yep... that's pretty much what I've been saying.  The thing that a whole lot of people leave out is the "what if's" when those supposedly reliable natural keys actually do change.  That throws a huge monkey wrench into "accuracy" and causes another "performance" issue that few even think about and that's the amount of time it takes to change the natural keys that have now been used in dozens or even hundreds of tables and well as the regression testing it takes to ensure that batch runs and GUI's alike must go through because of the mistake.

    I learned to not like natural keys when those were used in ISAM databases I had to support years ago.  The problem was when those Primary Keys were no longer unique because of changing business requirements.  The hassle it caused when those changes were made truly sucked.  Had SIDs been used the changes would not have been as difficult or wide spread.

    Still, having a natural key as an Alternate Key makes sense.  Having more than one way to identify a single row of data is a good thing.

  • DinoRS - Tuesday, October 2, 2018 10:30 AM

    Hi Jeff,
    when you do (need) to consider replacing Natural Keys, "Shift  Left / Right" is quite viable. It might get even more viable when you consider a lot of rows in a very short time.
    Intel might agree with me on this subject, considering the following statement:

    In processors based on Intel Core microarchitecture, shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.

    Ofcourse, this might have changed with newer Architectures again, however I wouldn't expect any other result regarding these operations. You might not consider something like this unless you're heavily bound by CPU Resources or if your deep analysis uncovers these stalls to happen in your systems.

    Another point while being on this is: I never said I'd use uniqueidentifier or int or similar for this, have a look at this in binary:
    01000100
    01000110
    01000101
    01000010
    01000001

    Can you see how this might be much more benefical for Indexing than say a GUID or even NEWSEQUENTIALID() considering add (as CPU Operation to get value + 1) might get to the point of stalling? It might be even less readable than a GUID - which shouldn't be exposed publicly anywhere either within your application - but for extremely INSERT and Index Seek heavy Operations it should be (partially) faster. You can even go ahead and limit the amount of possible key by defining the size of the binary column appropriately.

    the values you posted aren't exclusively creatable by only using shifts. Also flag stall I'm hazarding a guess will be very much below statistical noise considering all the other overhead you would encounter when running inserts with SQL Server.

    Still, if you like, can you elaborate on how you expect to produce the above values by only using shifts?

Viewing 15 posts - 31 through 45 (of 108 total)

You must be logged in to reply to this topic. Login to reply