Identity column as Primary Key - good or bad?

Question

Post reply

Identity column as Primary Key - good or bad?

Viewing 15 posts - 16 through 30 (of 35 total)

You must be logged in to reply to this topic. Login to reply

jcelko212 32090 SSCrazy Eights Points: 9220 More actions · Answer 1

>> I'm building a web app to maintain some data, which is mostly flat but will benefit from having some static look-up tables. Typically these look-up tables with have at most a dozen rows and are unlikely to change over time. <<

There is a really good chance you can put them into CHECK() constraints to protect your data integrity

>> Consider the two different DDL scripts. The first creates a look-up table CustomerType with an IDENTITY primary key, and the Customer table then has a Foreign Key of the CustomerType's Identity column: <<

This is totally wrong. The identity property is highly proprietary (jibber work with the original Sybase SQL Server?) And accounts physical insertions to one table in one place on one hard disk blah blah blah. What's really stupid is to see people throw in a column called "id" not understanding that it has to be "<something in particular>_id" according to the law of identity. This is the very foundation of all Western logic.

Since a table models a set, its name has to be either a collective noun ("Personnel" and not "Employees"), or if there is no collective noun, then a plural. If the table has a singular name, then it should be one and only one entity in that set. If you want to follow good practices, ANSI and ISO standards and make some kind of sense of things, then your 1st example of customers (I hope you really do have more than one in spite of what you said) it would look more like this. I picked a fixed length character string the size of a credit card number is the identifier. Non-RDBMS people think they can simply sequentially numbers something. But we believe in keys, validation and verification. There's also no such crap as a "<something>_type_id" ; those post fixes are what ISO calls attribute properties and an attribute could have only 1 of them. I'm going to guess that this is a customer type. Because the type is on a nominal scale, it can't be a numeric. I'm also going to assume that you feel no obligation to follow the international postal Union rules about the length of address lines on envelopes, so you can go ahead and use 50 characters instead of the standard 35. That standard number comes from 3 1/2 inch labels used by the postal Union for forwarding mail and every country is supposed to be able to knock an address line down to that length.

CREATE TABLE Customers

(cust_id CHAR(16) NOT NULL PRIMARY KEY,,

customer_name NVARCHAR(50) NOT NULL,

cust_type CHAR(3) NOT NULL

CHECK(cust_type IN (..)),

..

);

Since customer types are probably a relatively small domain (less than 100 codes), it's better to put them in the check clause then to give them their own table. If they were volatile, or a lot of them, then you would consider putting them their own table and using a references clause to get to them.

>> (I wondered if anyone had any strong feelings one way or another about these approaches? <<

I always follow ANSI/ISO standards, the metadata committee standards, and basic data modeling principles.

>> At my last place (a month ago) many devs (we are all full-stack .NET devs) chose not to enforce data integrity in the database, but in the code that surfaced the data. I'm old school and prefer to enforce relational integrity, but I'm wondering whether a relatively meaningless Id IDENTITY PK is nowadays deprecated. <

Quick, give me the address of the person in charge of this project! It's going to fall on its ass in 6 months and I can come in and charge them 4-5 figures the day for consulting! This is how I earn my lifting; patching up stupid stuff like this.

Also, the use of the physical insertion row count called IDENTITY as a primary key and a table has earned me tens of thousands of dollars over the years. Eventually, it doesn't work, and I come in and find the real keys in the table to save the day.. It's really fun they're trying to move from SQL Server to another database, like DB2 or Oracle.

Please post DDL and follow ANSI/ISO standards when asking for help.

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 2

jcelko212 32090 wrote:

I always follow ANSI/ISO standards, the metadata committee standards, and basic data modeling principles.

BWAAAA-HAAAA-HAAAAA!!!! THIS coming from the person that says he likes the MySQL "standard" of using YYYY-00-00 as the notation for whole years and YYYY-MM-00 for whole months. What a line of hooie.

I also disagree with most of the rest of your post but enough people have posted the right stuff on this thread where there's little need for thee and me to argue about it yet again.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

GaryV Hall of Fame Points: 3307 More actions · Answer 3

I generally like identities as primary and foreign keys, just because that's my preference. And sometimes it's easier to maintain, say for when the natural key of a collection is long and/or made up of several columns (e.g. Address1 + Address2 + City + State/Province + Postal + Country for an table that stores unique addresses).

That said, I want to reiterate the point made above that you need to ensure that any natural key is unique so it doesn't get assigned to two ID's. You don't want to allow CustomerTypeID 2 and CustomerTypeID 193 to both have CustomerTypeDesc of "Education Organization". No matter how that came about - because someone inadvertently inserted a duplicate, or because 2 and 193 originally were ID's of unique attributes, but someone decided the descriptions could be the same now.

Second, know your data well enough to know what data type to use for your ID - smallint, int, bigint. At a former job, I was looking at a table that stored contact records. Every time a contact was made, whether call-in, call-out, letter, email, etc it was legally required to be recorded. I saw the maximum identity value on the table and thought, "Wow, that's a big number." I did a little more research, and pointed out to my boss that we were going to run out of integers on the table soon. Sure enough, a few weeks later the application bombed for having too large an ID to fit in the int column. Then it got fixed quickly!

roger.plowman SSChampion Points: 10247 More actions · Answer 4

Let me hop in on the side of surrogate keys as clustered keys (usually INT identities). Unless you have to worry about aggregating independent databases (say, from multiple client companies that run completely independent of each other) using INT identities (or BIGINT for huge tables) has a lot of advantages.

Grant already mentioned the big ones: GDPR, fixed key values that never change, etc. To this I would add that using a sequential value like identities also has the advantage of appending records and that reduces fragmentation. Also, it makes standardized keys much simpler, especially for combined audit tables where you have one audit table for all record adds, etc.

Of course you also need to put unique constraints on natural keys to prevent dups, but except in rare circumstances using surrogate primary keys have almost no downside, IMO.

Oh, and I recommend indexing foreign keys, something SQL Server doesn't do automatically.

nova SSC Journeyman Points: 80 More actions · Answer 5

I think the topic has strayed from the original question so it may be worth making two points about customer identifiers. Firstly, IDENTITY is likely not very convenient as a business identifier for customers because it can't easily be changed. SEQUENCE is a more powerful and more flexible alternative because it won't stop you updating the column values.

Secondly and more importantly, Grant seemed to imply in his answers that using an "artificially" generated key, such as one generated with the IDENTITY feature, is potentially superior or easier from a GDPR point of view. It's not that simple. GDPR is very clear that any unique identifiers for personal data, including technical identifiers generated by machines, are subject to the GDPR rules. What matters is how the identifiers in question are governed and used. The method used to create them is totally irrelevant as far as GDPR is concerned. IDENTITY is not a get-out clause for any of your GDPR responsibilities.

Steve Collins SSCommitted Points: 1686 More actions · Answer 6

Identity() is a piece of automation that solves a set of tricky problems for you. Choosing not to use it is tantamount to saying you can do it better. Well, I don't think you can. If you choose not to avail yourself of something that's cheap, fast, and easy then idk... good luck. Also, the code comparison tool we use (ApexSql Diff) works better with integer identity primary keys on all the tables. At this point I don't even stop to think about alternatives because it's settled. In production it's all the tables all the time.

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

jcelko212 32090 SSCrazy Eights Points: 9220 More actions · Answer 7

>> IDENTITY() is a piece of automation that solves a set of tricky problems for you. <<

Actually, IDENTITY creates problems rather than solve them. Essentially, you've decided to throw away all chance of data validation and verification in your schema. Consider a simple motor pool, which identifies the cars by their VIN, but somehow you've decided that you want to use a count of sequential insertion attempts (not even successes!) on one machine to one disc as an identifier.

Generic id = 42

VIN = '3HGGK5H59JM702792'

While it's ugly looking, I can write a regular expression for the VIN.

I now have to carry the indexes and overhead of both the VIN column and the IDENTITY table property. But one of the goals of the database was to reduce redundancy, not increase it. What if I want to put a clustered index on my motor pool? If I cluster on the VIN, it will put cars from the same manufacturer together in physical storage. If I cluster on the generic_id, my cars are pretty much in the order in which I parked them in the motor pool order. I also have to worry about missing numbers, and gaps. But that's another issue.

Now when I want to go to Carmax, my insurance company, the state DMV or any other outside database, using only IDENTITY, that they are going to understand 42? Essentially you given us a local pointer to a physical record on this one disc. In short, you have no logical model whatsoever.

What if I go to the motor pool? Where do I find 42 on the automobile? I know I can look for the VIN on the frame, the door panel, the windows and probably a few other places. I can verify it.

>> If you choose not to avail yourself of something that's cheap, fast, and easy then ... good luck. <<

I also do not avail myself of women who are "cheap, fast and easy" because it's bad for my health 🙂 .If you choose to code a database with the cheapest, fastest and easiest techniques at the expense of validation, verification, and data integrity, that I would make a fortune cleaning up your database. I've been doing that for 30 years because people do write programs like you.

>> Also, the code comparison tool we use (ApexSql Diff) works better with integer IDENTITY primary keys on all the tables. <<

First of all, IDENTITY cannot be a key. By definition, because it's a table property and not a column. That's pure Dr. Codd's definition. What do you mean by works better? It seems that your definition of better is anything that saves you work, without any regard to what it does to your final project.

>> At this point I don't even stop to think about alternatives because it's settled. In production, it's all the tables all the time. <<

I know you don't think when you're designing a database. That would mean you'd have to educate yourself. You have to study the problem under consideration, learn your clients industry to some degree. That means having to sit down and start reading all of those industry-specific, national and international standards. But you just told us that you don't even bother to look for a real key!

Please post DDL and follow ANSI/ISO standards when asking for help.

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 8

jcelko212 32090 wrote:

>> IDENTITY() is a piece of automation that solves a set of tricky problems for you. <<

Actually, IDENTITY creates problems rather than solve them. Essentially, you've decided to throw away all chance of data validation and verification in your schema. Consider a simple motor pool, which identifies the cars by their VIN, but somehow you've decided that you want to use a count of sequential insertion attempts (not even successes!) on one machine to one disc as an identifier.

Generic id = 42

VIN = '3HGGK5H59JM702792'

While it's ugly looking, I can write a regular expression for the VIN.

I now have to carry the indexes and overhead of both the VIN column and the IDENTITY table property. But one of the goals of the database was to reduce redundancy, not increase it. What if I want to put a clustered index on my motor pool? If I cluster on the VIN, it will put cars from the same manufacturer together in physical storage. If I cluster on the generic_id, my cars are pretty much in the order in which I parked them in the motor pool order. I also have to worry about missing numbers, and gaps. But that's another issue.

Now when I want to go to Carmax, my insurance company, the state DMV or any other outside database, using only IDENTITY, that they are going to understand 42? Essentially you given us a local pointer to a physical record on this one disc. In short, you have no logical model whatsoever.

What if I go to the motor pool? Where do I find 42 on the automobile? I know I can look for the VIN on the frame, the door panel, the windows and probably a few other places. I can verify it.

>> If you choose not to avail yourself of something that's cheap, fast, and easy then ... good luck. <<

I also do not avail myself of women who are "cheap, fast and easy" because it's bad for my health 🙂 .If you choose to code a database with the cheapest, fastest and easiest techniques at the expense of validation, verification, and data integrity, that I would make a fortune cleaning up your database. I've been doing that for 30 years because people do write programs like you.

>> Also, the code comparison tool we use (ApexSql Diff) works better with integer IDENTITY primary keys on all the tables. <<

First of all, IDENTITY cannot be a key. By definition, because it's a table property and not a column. That's pure Dr. Codd's definition. What do you mean by works better? It seems that your definition of better is anything that saves you work, without any regard to what it does to your final project.

>> At this point I don't even stop to think about alternatives because it's settled. In production, it's all the tables all the time. <<

I know you don't think when you're designing a database. That would mean you'd have to educate yourself. You have to study the problem under consideration, learn your clients industry to some degree. That means having to sit down and start reading all of those industry-specific, national and international standards. But you just told us that you don't even bother to look for a real key!

Lordy... here we go again. S2D2.

The only correct answer here is "It Depends".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions

Steve Collins SSCommitted Points: 1686 More actions · Answer 9

Steve Collins

SSCommitted

Points: 1686

December 1, 2019 at 2:57 pm

#3703273

Ok - Joe Celko. Good luck.

Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Grant Fritchey SSC Guru Points: 397721 More actions · Answer 10

nova wrote:

I think the topic has strayed from the original question so it may be worth making two points about customer identifiers. Firstly, IDENTITY is likely not very convenient as a business identifier for customers because it can't easily be changed. SEQUENCE is a more powerful and more flexible alternative because it won't stop you updating the column values.

Secondly and more importantly, Grant seemed to imply in his answers that using an "artificially" generated key, such as one generated with the IDENTITY feature, is potentially superior or easier from a GDPR point of view. It's not that simple. GDPR is very clear that any unique identifiers for personal data, including technical identifiers generated by machines, are subject to the GDPR rules. What matters is how the identifiers in question are governed and used. The method used to create them is totally irrelevant as far as GDPR is concerned. IDENTITY is not a get-out clause for any of your GDPR responsibilities.

Good point.

I would say that using an IDENTITY (or a GUID) does act as a protection mechanism within the GDPR...

ASSUMING!!!!!!!!

... That we're using those values as completely unknown entities for internal coding and management. As soon as the business knows that ID=42 means "Karen", the utility having an identity value is eliminated and you'll be dealing with these values as part and parcel of the GDPR.

The point is, a natural key is going to be problematic within the GDPR, by it's nature. An artificial key (and no, doesn't have to be IDENTITY), is how we divorce data management from information management and gives us the ability to do logical deletes (remove the PII information, but leave the other information in place). You still need to have both in order to support both a logical consistency (natural keys) and a physical management of data divorced from the natural key (artificial keys). Yes, added overhead, added management, added maintenance, but absolutely necessary for modern systems.

----------------------------------------------------The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood... Theodore RooseveltThe Scary DBAAuthor of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd EditionProduct Evangelist for Red Gate Software

roger.plowman SSChampion Points: 10247 More actions · Answer 11

jcelko212 32090 wrote:

>> IDENTITY() is a piece of automation that solves a set of tricky problems for you. <<

Actually, IDENTITY creates problems rather than solve them. Essentially, you've decided to throw away all chance of data validation and verification in your schema. Consider a simple motor pool, which identifies the cars by their VIN, but somehow you've decided that you want to use a count of sequential insertion attempts (not even successes!) on one machine to one disc as an identifier.

Generic id = 42

VIN = '3HGGK5H59JM702792'

While it's ugly looking, I can write a regular expression for the VIN.

I now have to carry the indexes and overhead of both the VIN column and the IDENTITY table property. But one of the goals of the database was to reduce redundancy, not increase it. What if I want to put a clustered index on my motor pool? If I cluster on the VIN, it will put cars from the same manufacturer together in physical storage. If I cluster on the generic_id, my cars are pretty much in the order in which I parked them in the motor pool order. I also have to worry about missing numbers, and gaps. But that's another issue.

Now when I want to go to Carmax, my insurance company, the state DMV or any other outside database, using only IDENTITY, that they are going to understand 42? Essentially you given us a local pointer to a physical record on this one disc. In short, you have no logical model whatsoever.

What if I go to the motor pool? Where do I find 42 on the automobile? I know I can look for the VIN on the frame, the door panel, the windows and probably a few other places. I can verify it.

>> If you choose not to avail yourself of something that's cheap, fast, and easy then ... good luck. <<

I also do not avail myself of women who are "cheap, fast and easy" because it's bad for my health 🙂 .If you choose to code a database with the cheapest, fastest and easiest techniques at the expense of validation, verification, and data integrity, that I would make a fortune cleaning up your database. I've been doing that for 30 years because people do write programs like you.

>> Also, the code comparison tool we use (ApexSql Diff) works better with integer IDENTITY primary keys on all the tables. <<

First of all, IDENTITY cannot be a key. By definition, because it's a table property and not a column. That's pure Dr. Codd's definition. What do you mean by works better? It seems that your definition of better is anything that saves you work, without any regard to what it does to your final project.

>> At this point I don't even stop to think about alternatives because it's settled. In production, it's all the tables all the time. <<

I know you don't think when you're designing a database. That would mean you'd have to educate yourself. You have to study the problem under consideration, learn your clients industry to some degree. That means having to sit down and start reading all of those industry-specific, national and international standards. But you just told us that you don't even bother to look for a real key!

Not sure what you're talking about, Joe. An identity column solves a LOT of problems in terms of fragmentation. Using it as a primary key (especially a clustered key) in no way increases overhead. As far as validation goes, have you heard of alternate keys? That solves duplication issues and a constraint solves the validation issues.

As far as the "table property" (the next value for the identity column) a single Int32 (or even Int64) is so trivial in terms of overhead bringing it up is ludicrous. Your argument about using the identity column as an external identifier is equally ridiculous. Of course you'd use the VIN for interfacing with third parties. The identity column is for internal use as a link between tables.

The same goes for identifying vehicles physically.

Finally, quibbling over the word "identity" and refusing to acknowledge the user is referring to a column with the identity property is just being pedantic. From a mechanical standpoint using an Int32 instead of a CHAR(17) as a foreign key is so obviously advantageous it's not something that can be argued. It's much smaller, when used as a clustered key it prevents fragmentation of the database and all the juggling required to insert new records...it is, in short, almost always superior, except in truly odd situations.

As for "real" keys, are you serious??? Unique index, possibly a constraint for validation (for keys that have some regular pattern or rules) and boom, done. Move on. How is that hard? How does that prevent use of the alternate key for external parties?

roger.plowman SSChampion Points: 10247 More actions · Answer 12

nova wrote:

I think the topic has strayed from the original question so it may be worth making two points about customer identifiers. Firstly, IDENTITY is likely not very convenient as a business identifier for customers because it can't easily be changed. SEQUENCE is a more powerful and more flexible alternative because it won't stop you updating the column values.

Secondly and more importantly, Grant seemed to imply in his answers that using an "artificially" generated key, such as one generated with the IDENTITY feature, is potentially superior or easier from a GDPR point of view. It's not that simple. GDPR is very clear that any unique identifiers for personal data, including technical identifiers generated by machines, are subject to the GDPR rules. What matters is how the identifiers in question are governed and used. The method used to create them is totally irrelevant as far as GDPR is concerned. IDENTITY is not a get-out clause for any of your GDPR responsibilities.

Um, the point of using an identity column is specifically so the record will have a fixed and unchangeable value that identifies it once and forever more. You don't want it changed because it's intended as a "true name". If you need a user-modifiable key for some reason add it as a column with a unique constraint and validation rules, but leave the actual key used to link tables together as immutable. In other words, just add your customer ID as a new alternate key rather than the primary key.

Grant's point is the key alone tells you nothing if it's just a serial number (which is what identity columns are). In detail tables that use an identity column as a foreign key you have no information that can tie back to an individual without having access to the person's name record itself. While not ideal this is about as about as protected as using "Subject S" in place of the person's real name.

Not perfect, but then every system (even the most militant GDPR compliant) always has a weakness.

Thom A SSC Guru Points: 99009 More actions · Answer 13

roger.plowman wrote:

As for "real" keys, are you serious??? Unique index, possibly a constraint for validation (for keys that have some regular pattern or rules) and boom, done. Move on. How is that hard? How does that prevent use of the alternate key for external parties?

It doesn't prevent anything, but the "Celko train" is stuck on the rails it's on, and it will never deviate. It doesn't matter if you put some cows on the line, or if you add an extra track for it to follow, the "Celko Train" will plow on forwards to its original destination; full speed ahead!

Most of us will be happy to argue both sides, where as there are a select few will only debate their opinion.

Thom~

Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
Larnu.uk

Mr. Brian Gale SSC-Insane Points: 24363 More actions · Answer 14

I think it may depend on the purpose. If the table is going to be used to store vehicle information to be used by an external system, then Joe's method may be a bit nicer to own, manage and maintain. If it is used internally only, having the identity column is a quick, easy way to do it. If it could be used both internally and externally, I'd probably still go with the identity column method.

Now, my personal preference on this is to normalize it and use identity columns to keep the referential integrity and then use views or stored procedures (depending on the purpose) to present the data back to the application layer and not include the identity columns when I push things back.

From a real-world experience though, you need to plan the identity column properly. Are you going to have more than 2 million records in the table? If so, probably want that to be a bigint. We hit that snag at one point where a system completely halted because the identity overflowed an INT. We were not using it for a foreign key; it was more of just a row counter. And anyone who has worked with an Identity column knows it is not a reliable row counter - you can and likely will have gaps.

My opinion, an Identity column has its place and purpose but it is easy to overuse it or misuse it. Pick the right tool (datatype, nullability, defaults, constraints, etc) and you will save yourself a lot of headache in the future. My experience, if you are normalizing your tables, INT lookups are usually quick and cheap.

The above is all just my opinion on what you should do.
As with all advice you find on a random internet forum - you shouldn't blindly follow it. Always test on a test server to see if there is negative side effects before making changes to live!
I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

Jeff Moden SSC Guru Points: 1000539 More actions · Answer 15

Mr. Brian Gale wrote:

Are you going to have more than 2 million records in the table? If so, probably want that to be a bigint.

Heh... I know you meant BILLION but thought I'd point it out for any newbies that might read this. 😀

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
"Change is inevitable... change for the better is not".

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)
Intro to Tally Tables and Functions