Primary keys for an OLTP database

  • Carlo Romagnano - Thursday, December 27, 2018 9:28 AM

    You may have differerent results if you insert rows in separated batch and sessions instead of one shot insert.

    Maybe or maybe not (I've done tests both ways).  As with all else in SQL Server, "It Depends" but I'm tickled to see people thinking about this.

    Some of the biggest parts of "It Depends" on are things like how wide the rows are, what the batch sizes are, how often you do index maintenance, whether or not you've assigned a correct FILL FACTOR (LOTS of surprises there), and whether or not you've made the mistake of using REORGANIZE to reduce logical fragmentation.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Business needs and human factors ought to take priority when designing keys for a table. When you take those requirements into account there's no reason to assume that guids and incremental integers are equally valid choices or that either of them are always suitable. There are many factors other than benchmarks that could play a part in key design. How happy would hypothetical customers be with reciting a 32 character order number over the phone to your customer service desk for example?

    If you add into the picture the need for real keys that people can see and use then the results of these tests and their relevance could be completely changed, e.g. will you cluster on the business key (AKA "natural" key if you must call it that) or a surrogate (whether guid or integer)? What patterns of access would use one key versus another?

  • Theoretically, these clustered indexes usually have a hotspot on left or right side of the index tree. That hotspot can reduce speed of inserts on a busy system.

    What exactly is meant by "hotspot"? Jeff Moden mentioned this term also.

  • Jeff Moden - Thursday, December 27, 2018 9:31 AM

    Hi Ron,

    I'll say it again... I'm right there with you on the multiple suck factors that GUIDs bring to a database, especially when they are introduced for no reason (actually worked for a company where they used GUIDs for everything and it was horrible.  I'd love to exchange war stories with you there).  There's no question there.

    But, you did mention that they were slow and so let me ask... do you ever use REORGANIZE to maintain indexes that have a GUID as a leading column?  If so, then, experienced or not, you've fallen into the same trap that most all of us have (and I'm definitely including myself in that) even though we may have literally decades of experience with SQL Server.

    Slow may be a relative term.  A GUID to GUID join is going to perform slower than an integer to integer join all other things being equal.  Maybe a person won't notice the difference in speed for one join.  But on a busy system, this surely takes it's toll.  If a GUID is necessary, use it.  Otherwise avoid it.

  • Ron Hinds - Thursday, December 27, 2018 11:53 AM

    Theoretically, these clustered indexes usually have a hotspot on left or right side of the index tree. That hotspot can reduce speed of inserts on a busy system.

    What exactly is meant by "hotspot"? Jeff Moden mentioned this term also.

    Where the most activity on the table (clustered index) takes place.  If you are using an always increasing key and have many inserts occurring, they will occur at the end of the table (possibly, it depends on the physical implementation).

  • sqlvogel - Thursday, December 27, 2018 11:44 AM

    Business needs and human factors ought to take priority when designing keys for a table. When you take those requirements into account there's no reason to assume that guids and incremental integers are equally valid choices or that either of them are always suitable. There are many factors other than benchmarks that could play a part in key design. How happy would hypothetical customers be with reciting a 32 character order number over the phone to your customer service desk for example?

    If you add into the picture the need for real keys that people can see and use then the results of these tests and their relevance could be completely changed, e.g. will you cluster on the business key (AKA "natural" key if you must call it that) or a surrogate (whether guid or integer)? What patterns of access would use one key versus another?

    Single look ups of orders would probably best served by a narrow nonclustered index than on the clustered index (ymmv).  Clustering on the Order Number may not be the best option for the clustering key, and again ymmv.

  • Hi Eirikur Eiriksson,

    Results on my i5 laptop

    How many CPU/cores was on that laptop do you use SSD drive? My test was done on a virtual box with SSD drives and I did 3 tests on different number of CPUs.

  • Ron Hinds - Thursday, December 27, 2018 11:53 AM

    Theoretically, these clustered indexes usually have a hotspot on left or right side of the index tree. That hotspot can reduce speed of inserts on a busy system.

    What exactly is meant by "hotspot"? Jeff Moden mentioned this term also.

    A "hot spot", in this case, refers to where most of the insertions of new rows will occur.  For a Clustered Index with an ever increasing key, the last logical page of the index will be the place where all new rows are inserted.  Since everyone will be working on the same page, there can be contention where people may have to wait their turn to insert into that single page.  Because it's the only insertion point used for everyone, it becomes a "hot spot".

    For a Clustered Index with a random GUID key, there are no "hot spots" unless the table is just several pages long (they'd all be a hot spot for a bit until the number of pages grew) because the GUID will land on a random page of the entire index.  The more pages in the table for GUIDs, the less chance there is of a "collision".

    This is why a lot of people recommend using a GUID for the Clustered Index key on high insertion rate tables.  The trouble is, they usually blow that wonderful functionality out of the water by doing index maintenance on such a thing incorrectly, which causes massive page splits by removing freespace in the index from the wrong place at the wrong time instead of adding free space.  To summarize (and Mark Tassin hit the nail squarely on the head in the later part of his post), you have to have the correct FILL FACTOR and you must do REBUILDs instead of REORGANIZE on such indexes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thank you Lynn and Jeff!

  • sqlvogel - Thursday, December 27, 2018 11:44 AM

    Business needs and human factors ought to take priority when designing keys for a table. When you take those requirements into account there's no reason to assume that guids and incremental integers are equally valid choices or that either of them are always suitable. There are many factors other than benchmarks that could play a part in key design. How happy would hypothetical customers be with reciting a 32 character order number over the phone to your customer service desk for example?

    If you add into the picture the need for real keys that people can see and use then the results of these tests and their relevance could be completely changed, e.g. will you cluster on the business key (AKA "natural" key if you must call it that) or a surrogate (whether guid or integer)? What patterns of access would use one key versus another?

    Ok... so what would you use for a "Natural" key for an Employee or Customer table?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Evgeny Garaev - Thursday, December 27, 2018 1:03 PM

    Hi Eirikur Eiriksson,

    Results on my i5 laptop

    How many CPU/cores was on that laptop do you use SSD drive? My test was done on a virtual box with SSD drives and I did 3 tests on different number of CPUs.

    The one I ran the test on the last time is a 2 Core, 4 Logical CPUs i5-5200U @ 2.2GHz with a 5K HDD.
    😎

  • Just ran the test on a i7-4600U 4 core @ 2.1GHz, 256Gb SSD, SQL Server 2017 on Ubuntu 18.04.1 LTS
    😎

    T_TXT             DURATION
    ----------------- --------
    IDENTITY 1         2500349
    IDENTITY 2         2534796
    IDENTITY 3         2726870
    NEWSEQUENTIALID 1  3283918
    NEWSEQUENTIALID 2  3371947
    NEWSEQUENTIALID 3  3462895
    NEWID 1            5817904
    NEWID 2            5935902
    NEWID 3            5966122

  • Another run on a more optimized i5-4310U 4 Logical Cores @ 2.00GHz system, 512GB SSD, Microsoft SQL Server 2016 (SP1-GDR) (KB4458842) - 13.0.4224.16 (X64)
    😎


    T_TXT              DURATION
    ------------------ -----------
    IDENTITY 3             1900222
    IDENTITY 1             1900764
    IDENTITY 2             1921514
    NEWSEQUENTIALID 2      2401202
    NEWSEQUENTIALID 1      2404004
    NEWSEQUENTIALID 3      2472017
    NEWID 2                3437846
    NEWID 3                3561873
    NEWID 1                3589680

  • Jeff Moden - Thursday, December 27, 2018 5:42 PM

    Ok... so what would you use for a "Natural" key for an Employee or Customer table?

    Well that's exactly the kind of question that this article ought to be addressing but it's meaningless to give a single answer without more context. All information in a database consists of symbols invented by humans or machines so personally I don't think it helps to refer to data as "natural". A unique identifier that can be used in the business process outside the database is what matters. I don't think most businesses would ask customers to log in to their website using a 32 character guid or a sequential integer.

  • Jeff Moden - Thursday, December 27, 2018 8:50 AM

    While I appreciate the experiences that have made you come to such conclusions (and, to be clear, I used to believe in the same thing), most of that isn't true, although the recommendation for a proper FILL FACTOR and the use of REBUILDs for NEWID usage is absolutely spot on.

    It turns out that the larger (in terms of row count) a NEWID Clustered Index gets, the longer you can go with absolutely ZERO page splits and ZERO fragmentation.  Done properly and maintained correctly (and the current "Best Practices" for index maintenance are absolutely the worst thing to do for index maintenance, especially in this case).  I do have several demonstrable tests that exercise such indexes by adding 1,000 rows per simulated hour for 10 simulate hour per simulated day for 365 simulated days (3.65 Million rows per test and, yeah, it does take a while to run each test) and it demonstrates that you can actually go MONTHs with no page splits (not even good ones) and no fragmentation when it's done properly.  Oddly enough, it turns out the method is a fair bit faster than when using the "Append only" nature of sequential keys because even supposedly "good" page splits write more to the log file than ZERO pages splits do.

    Of course, there are disadvantages to the use of NEWID that we all know but performance of INSERTs, page splits, and fragmentation isn't a part of that (seriously contrary to popular belief).  Well... unless you use REORGANIZE as a part of your index maintenance, which removes free space when you need it the most and, very much like a really bad drug habit, the more you use it the more you need to use it because it actually perpetuates page splits and that perpetuates fragmentation and that perpetuates the need to defragment... over and over and over for the life of the table.

    I'm giving a 2 hour presentation on this and other types of indexes as well as why the current recommendations of supposed "Best Practice" index maintenance is actually causing more harm than good and it's the result of literally hundreds of hours of testing.  The next presentation will be at SQLSaturday in Cleveland on 2 Feb 2019.  If you're in the area, come see it.  I can almost guarantee you've never seen indexes or the harmful effects of current "Best Practice" index maintenance on indexes in the way that I present them.

    Hi Jeff,
    Will the presentation be available online at any point? I'm in a Cleveland, but unfortunately one on a different continent. What you're suggesting sounds intriguing.

    Thanks,
    Steve

Viewing 15 posts - 16 through 30 (of 54 total)

You must be logged in to reply to this topic. Login to reply