Do You Really Need Surrogate Keys? Part 1 – Concepts and Methods

  • How are these articles reviewed/approved !?

    How is it decided that they are good enough to be posted !?

    Are we really going to have a Part 2 !? If we are, I have to remember to skip it (sorry)

    I like the topic, but the "tests" used to support the conclusion are not relevant !!??? Many other already complained about how the SELECT and UPDATE SQL and not relevant to the point, etc...

    I could go on with my two cents as well, but there is no point - I think we have already wasted enough of everyone's time !?

  • ta.bu.shi.da.yu (10/25/2010)


    What does that even mean?!?

    It means performance responds to increasing workload in an O(n) (linear) or better manner. Some people loosely use buzzwords like "scalable" to simply mean "good".

  • ebay-1087924 (10/25/2010)


    ta.bu.shi.da.yu (10/25/2010)


    1NF takes a table with repeating groups and splits them into several tables. It doesn't take repeating groups from several tables... I fail to see your point here. When I normalize data, I don't look at multiple tables, I look at the one table and normalize that. This leads to multiple tables.

    If you can show me an example of where you find repeating groups on multiple tables, please feel free to correct me!

    BCNF also relates to an individual table - it cares about dependencies within the one table.

    5NF, again, relates to join dependencies in the one table.

    You're not properly understanding normalization. It's not about "taking one table and splitting it into multiples" (though most examples do illustrate the rules in that manner).

    Put simply, normalization is about expressing a fact without modification anomalies. If a PK uniquely identifies a person, and sex is an attribute of the person, then having that sex column in more than one table anywhere in the database is a violation of normal form. Period. Redundancy violates 2NF .. regardless of what table that redundancy is in.

    Normalization analysis operates across all tables in your database. It's not something that can be fully applied on a table-by-table.

    While it's obviously not a good idea to have duplicate relations for the same entity, nowhere in Codd's relational model does it state that you normalize anything other than the relations themselves. I suppose if you consider a database as a gigantic relation then you may have a point, I've never seen this done in practice.

    Random Technical Stuff[/url]

  • ebay-1087924 (10/25/2010)


    ta.bu.shi.da.yu (10/25/2010)


    What does that even mean?!?

    It means performance responds to increasing workload in an O(n) (linear) or better manner. Some people loosely use buzzwords like "scalable" to simply mean "good".

    Fair enough, I see your point.

    Random Technical Stuff[/url]

  • To use huge multi-column "real-world" keys, you have to have a lack of knowledge about database design. Surrogate keys are the only way to go if you want large production databases to perform well, and unlike what the author said about "overhead" using surrogate keys, there is actually a very large savings when using them. Keys do not have to be as large and when constructing the foreign keys down the chain of tables, when using surrogates the keys stay small. Honestly, learn about surrogates and use them.

  • ta.bu.shi.da.yu (10/25/2010)


    nowhere in Codd's relational model does it state that you normalize anything other than the relations themselves. I suppose if you consider a database as a gigantic relation then you may have a point, I've never seen this done in practice.

    The point you fail to understand is that "a relation" is an attribute of the database, not a specific table. The fact here is the relationship between data entities not columns in a specific table. If the entity is expressed elsewhere in another column, you're violating normal form.

    If you're not normalizing across all tables in your database, you're doing it wrong.

  • Robert Mahon-475361 (10/25/2010)


    ebay-1087924 (10/25/2010)


    Consider a people list originally keyed by SSN. Replacing SSN with a surrogate key means that -- unless you add a secondary uniqueness constraint -- you can then have duplicate SSNs in the database.

    A) Does the person have an SSN? It's more common than you thing with foreign workers. Being in the US but a Brit, I've had a terrible time getting things sorted without an SSN, often having to use made up SSN's so they've been able to enter me into the system (they made it up, not me! But they had to enter SOMETHING to get the system to take it. That's broke if you have to make things up to get data in) Maybe they've just not got it at that time, or are unconscious and can't provide it (real world experience here...)

    B) Has their SSN changed (ID Fraud the main reason it'd change, something that's becoming more common).

    C) Can a dupe occur? (It shouldn't, but if the system spits out it already exists, is it the new one being entered that's the problem so you can't enter it at that time, or the old one?) You'd not be able to enter any data until resolved, or, have to put in a fake one 'just to get around it for now'.

    Just use a PK and you'll not worry about that. SSN's an attribute of the entity, not the unique value in real life alas.

    Again, read Quazibubble's posts.

    If by "Just use a PK and you'll not worry", you mean use a surrogate key as the primary key and no UNIQUE constraint on the SSN, then that's the worst advice possible. If you are in a business that expects the SSN to be unique, you have to enforce the uniqueness.

    Let's suppose you and I both build a database for airport services. The business expects the SSN (for US citizens; for non-citizens, their countries' equilvant of the SSN is used) to uniquely identify a passenger, even when in reality it doesn't always. My database uses a constraint to enforce that uniqueness' your database uses a surrogate primary key and considers the SSN as an attribute of the entity.

    If I have to make a booking in my own DB, I might be unlucky - maybe some terrorist has forged papers, using my SSN for his fake ID. When I enter my data, I get an error message, and I can't complete the booking. It will take lots of phone call, time, maybe money -the whole red tape ritual- to get this all sorted out, and I might not even make it before my planned flight. A major nuisance.

    Your DB will make my life lots easier. It will happily spew out a new surrogate key value for me, accept my information, and book my flight. Two days later, at the airport, I am in the security queue, when some automated system correlates the SSN in my booking with previously registered terroristic activites. Before I can say "entity integrity", I'll be in a completely different plane than I intended, for a straight flight to Guantanamo Bay.

    Your mileage may vary, but I think I'll take the major nuisance of my database design over yours.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Victor Kirkpatrick (10/25/2010)


    ...unlike what the author said about "overhead" using surrogate keys....

    Victor

    I'd like to bring the word "overhead" to the context I used it in my article. I compared a table with a simple natural key to another one with a surrogate key. This context is one (and one only) table with a simple (one column) natural key.

    No foreign keys, no extra indexes, no nothing.

    The idea of this article is to show concepts and methods I will use in other studies on surrogate keys. I illustrated this idea using the example of a surrogate key in product table, something we do every day. My point is: we don't use surrogate keys for performance reasons. We use them to get a stable primary key. That's all.

  • Hugo Kornelis (10/25/2010)


    Let's suppose you and I both build a database for airport services. The business expects the SSN (for US citizens; for non-citizens, their countries' equilvant of the SSN is used) to uniquely identify a passenger, even when in reality it doesn't always. My database uses a constraint to enforce that uniqueness' your database uses a surrogate primary key and considers the SSN as an attribute of the entity.

    If I have to make a booking in my own DB, I might be unlucky - maybe some terrorist has forged papers, using my SSN for his fake ID. When I enter my data, I get an error message, and I can't complete the booking. It will take lots of phone call, time, maybe money -the whole red tape ritual- to get this all sorted out, and I might not even make it before my planned flight. A major nuisance.

    Your DB will make my life lots easier. It will happily spew out a new surrogate key value for me, accept my information, and book my flight. Two days later, at the airport, I am in the security queue, when some automated system correlates the SSN in my booking with previously registered terroristic activites. Before I can say "entity integrity", I'll be in a completely different plane than I intended, for a straight flight to Guantanamo Bay.

    Your mileage may vary, but I think I'll take the major nuisance of my database design over yours.

    Hugo, I mostly like your arguments, but there are a few holes in this one.

    The business expects the SSN (for US citizens; for non-citizens, their countries' equilvant of the SSN is used) to uniquely identify a passenger, even when in reality it doesn't always.

    Aren't you kind of contradicting yourself here?

    Also, not all countries have an SSN equivalent (even some relatively advanced countries like New Zealand, where any kind of attempt to introduce a 'universal identification system' is met with huge protests from privacy advocates). I suppose you could use a passport number (being that we're talking about an airport), but that has its own problems: for example, they typically change each time they're renewed. And then there are people like me who are citizens of more than one country and have more than one valid passport.

    And for those that do have an SSN equivalent, the numbers may work completely different, or overlap with those of other countries.

    And if someone chooses a surrogate key for this application, the use of said key doesn't remove the need for appropriate consistency checking in your application, or appropriate unique constraints in your database.

    I don't have a solution to these issues, I just wanted to point out how taking the 'easy' route of using

    SSN as a PK has a number of challenges that aren't immediately obvious. Unfortunately I've now spent so much time reading this thread and the SQL Team thread that I now have to do some bill-paying work. Maybe I'll come back to it.

    What I can say, is that this scenario (including all the security and privacy implications) are far more complicated than your post would suggest.

    Note: these comments are about Hugo's specific example scenario.

  • Robert Frasca (10/25/2010)


    tfriedman71 (10/25/2010)


    A surrogate key is meaningless by itself, it's just a GUID or Identity. You also need the unique key to inforce the natural key/s relationship in the table.

    You're joking right?

    I can bet he is not. Since the table has a candidate key, it must have an UNIQUE constraint to guarantee integrity of that key.

    create table dbo.ExamplePK

    (

    ID_SurrKey int not null primary key,

    NameofSomething varchar(50),

    CodeOfSomething_CandidateKEY bigint UNIque

    )

    Something like that above.

  • My 2c on the general arguments of 'surrogate' vs. 'natural' keys.

    a quick resume:

    I have formal (university) training in database design, and I understand relational and normalisation theory.

    I'm (currently) a small-time freelance consultant/developer. I've worked over the years for and with a variety of organisations of various sizes, from national telco's, to multi-national I.T. company (Digital, R.I.P.), to small businesses from 2-50 users. And with databases of corresponding sizes and complexity.

    These days, I mostly spend my time a) customising 'shrink-wrap' financial software for small businesses and b) maintaining (mostly Access front-ended) bespoke applications written by other people who have washed their hands of their applications.

    Now my point.

    This work involves a lot of writing queries, for back-end reports (Crystal, SQL Server, Access, etc), and for fixing data botched by badly written applications.

    I deal with a handful of different database design styles, some use natural keys, some use surrogates, some a combination of the two.

    99.9% of the time, I prefer dealing with those databases that use surrogate keys, be they 'Identity', 'GUID' or other. I find it much easier to find the links I need between the tables to get the data out that the customer needs. They're just easier to maintain.

    The exceptions are so minor that I wouldn't miss them if they weren't there. The only valid ones I can think of are those described in my next comment.

    In my experience, the only time natural keys can be relied on to be 'constant', is when their definitions are 'cast in stone' by a standards body. E.g. 3-letter country codes and 3-letter currency codes. Other than that, I think the benefits of natural keys are generally out-weighed by benefits of surrogates.

    Next time I find myself in the position where I'm designing a database from scratch, I can't see myself using natural keys. They're great when you're analysing and producing a system, but I have found them more a nuisance years later when trying to work them out.

    Yes, one may be able to argue that these 'natural key' applications are poorly written. Doesn't change the fact that if they were poorly written using surrogate keys, they would be easier for me to maintain today.

    And given that I'm dealing with small companies with small budgets, major re-writes just aren't an option.

    Note: This is just *my* experience. As a whole I think it should be a 'horses for courses' decision - you may be in a different environment, where you need to weight up the pros-and-cons for yourself.

  • Hugo Kornelis (10/25/2010)


    If by "Just use a PK and you'll not worry", you mean use a surrogate key as the primary key and no UNIQUE constraint on the SSN, then that's the worst advice possible. If you are in a business that expects the SSN to be unique, you have to enforce the uniqueness.

    Let's suppose you and I both build a database for airport services. The business expects the SSN (for US citizens; for non-citizens, their countries' equilvant of the SSN is used) to uniquely identify a passenger, even when in reality it doesn't always. My database uses a constraint to enforce that uniqueness' your database uses a surrogate primary key and considers the SSN as an attribute of the entity.

    If I have to make a booking in my own DB, I might be unlucky - maybe some terrorist has forged papers, using my SSN for his fake ID. When I enter my data, I get an error message, and I can't complete the booking. It will take lots of phone call, time, maybe money -the whole red tape ritual- to get this all sorted out, and I might not even make it before my planned flight. A major nuisance.

    Your DB will make my life lots easier. It will happily spew out a new surrogate key value for me, accept my information, and book my flight. Two days later, at the airport, I am in the security queue, when some automated system correlates the SSN in my booking with previously registered terroristic activites. Before I can say "entity integrity", I'll be in a completely different plane than I intended, for a straight flight to Guantanamo Bay.

    Your mileage may vary, but I think I'll take the major nuisance of my database design over yours.

    EXCELLENT examples actually. Would you prefer the terrorist can't book until he comes up with a SSN that hasn't been used before, or let him book, but let it be flagged so the scenario you suggest can happen? Perhaps let the right people know that there's something amiss and so needs closer examination. And do a search to find if that SSN has been used elsewhere, if SSN's+1/-1 are also being used at the same time.

    SSN isn't a unique key. It's not the unique reference to a person. It can change. I'm in a business that deals with SSN's, and I'd love them to be unique, but... they're not. Neither passports, port ID's, driver's licenses. We can provide heads up to the data entry to get them to check to make sure it's been entered correctly, but there's so many reasons why that data can't be used as a key, thus the reason we use the surrogate keys.

    Academic example and real world, your way is technically correct assuming data input is always valid and, as you say, if you're supporting a business that can put that constraint on a SSN (but again, it can change, and keys should never change), though the way you describe my DB is actually how things work in real systems.

    Are we going too far from the surrogate key/natural key discussion though?

  • ebay-1087924 (10/25/2010)


    ta.bu.shi.da.yu (10/25/2010)


    nowhere in Codd's relational model does it state that you normalize anything other than the relations themselves. I suppose if you consider a database as a gigantic relation then you may have a point, I've never seen this done in practice.

    The point you fail to understand is that "a relation" is an attribute of the database, not a specific table. The fact here is the relationship between data entities not columns in a specific table. If the entity is expressed elsewhere in another column, you're violating normal form.

    If you're not normalizing across all tables in your database, you're doing it wrong.

    Sigh. I'm well aware of what a relation is thanks.

    I suspect it may be you who is not really understanding that a relational database is so called because of the mathematical concept of a relation.

    I'm well aware of the foundational aspects of Codd's relational model, if not the finer points, so I personally don't really see any value with corresponding on this point any more!

    Random Technical Stuff[/url]

  • 99.9% of the time, I prefer dealing with those databases that use surrogate keys, be they 'Identity', 'GUID' or other. I find it much easier to find the links I need between the tables to get the data out that the customer needs. They're just easier to maintain.

    i know i'm only quoting a small section of the original post here, but it does raise an interesting point.

    if you look in that well knwn oracle of correct information "wikipedia" and look up normalisation , you will see one key phrase

    "Avoid bias towards any particular pattern of querying"

    i would never recommend using surrogate keys to make your particular style of querying easier.

    for the original article see

    http://en.wikipedia.org/wiki/Database_normalization

    MVDBA

  • Ryan C. Price (10/25/2010)


    Hugo, I mostly like your arguments, but there are a few holes in this one.

    The business expects the SSN (for US citizens; for non-citizens, their countries' equilvant of the SSN is used) to uniquely identify a passenger, even when in reality it doesn't always.

    Aren't you kind of contradicting yourself here?

    I'll try to keep my answer short, for I never intended to debate all the fine issues of this example; I merely chose it to contradict a point made by Robert Mahon. His suggestion was (if I didn't misunderstand him) that, because the SSN is not a perfect identifier, one should use a surrogate key, NOT enforce uniqueness on the SSN, and move on. I think that is a short-sighted short-term solution that will bite you later. Maybe I should have spent more time creating a better example, for I agree that the international character of the airport business pretty much rules out SSNs as identifier there anyway.

    That being said, I think we all know that a majority of Americans think that an SSN is a number that does uniquely identify a US citizen. And many companies set up their systems around that expectation, especially in the many sectors where the US govt requires the business to verify and store the SSN - if it's in their database anyway, and they believe the myth of unique SSNs, why bother finding another way to identify customers?

    When faced with this situations, what are your options?

    1. Step up and convince TPTB that they are in the wrong, that the SSN is not a good way to identify people. Absolutely the best option, but success is not guaranteed. If the Enterprise Architect, who makes twice as much as the DBA and has a fancier job title to boot, says SSNs are good, and the humble DBA says they are not, who will be believed?

    2. Do what Robert suggested: use a generated key, and treat the SSN as a normal attribute that need not be unique. The DB will happily accept duplicate data when entered, regardless if that is caused by a "real" duplicate in the SSNs handed out to citizen, by a forged ID, by a data entry error, or even by double-clicking the "add" button on the "new customer" screen. A few years down the line, your successor will be posting here to ask how to deal with all the duplicates in his data - if the company still has money left to deal with the issue after all the lost lawsuits.

    3. Do what I would do: enforce the uniqueness the company eroneously expects. Since I was unable to convince them not to use the SSN as identifier, they will continue to do so. Regardless of my chosen data model. Maybe they are lucky and they never actually get a duplicate (not at all unlikely - duplicate SSNs are possible, but definitely not common). In that case, the enforced uniqueness won't change a bit. But if they ever do try to enter a duplicate SSN, the application will warn them that they enter data that, in their view of the workld should not exist. They will probably not like this, and I never expect a phone call expressing their gratitude, but at least *I* know that I save them a lot of problems that are much bigger than what they get now.

    And if someone chooses a surrogate key for this application, the use of said key doesn't remove the need for appropriate consistency checking in your application, or appropriate unique constraints in your database.

    I completely agree (with the constraints; I never rely on consistency checking in the app without constraints to back it up); that was exactly the point I was trying to make. I still hope I misunderstood Robert, but his message surely seemed to indicate that he does not want to enforce the uniqueness that the business expects.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Viewing 15 posts - 91 through 105 (of 178 total)

You must be logged in to reply to this topic. Login to reply