Should tables be clustered?

  • We have a number of largish tables which have only unclustered indexes. Is this a sensible design, or do tables without any clustured indexes perform slower when queried.

    cheers

    Ian

  • Sounds facetious, but the answer is that it depends.

    If you post a sample create table script with some typical usage queries and data population statistics we can give you some more useful answers.

     

     

  • In a way that answers my question. I was really wanting to know if there was a hard and fast rule stating that clustered tables will be quicker to query.

    If you say 'it depends' then presumably there is no such hard and fast rule.

  • well, if you can sensibly cluster on the your core search requirements then there is a good chance of a significant benefit.  As I say, it all depends on how you are using it.

  • The advantage of clustered indexes is that the data is physically ordered where as other indexes are pointers.  When used, you should try to place the clustered index on the field used the most in the queries.  I have also heard that they are good for join conditions as well.

    Hope that helps

    Marvin

    Marvin Dillard
    Senior Consultant
    Claraview Inc

  • You have a number of partial answers. I'll try to be more comprehensive.

    First, the allocation behaviour. A table with no clustered index acts as a heap,

    which means that *every* insertion goes at the end. Pages are 8k, allocated in

    extents of 8 pages, or 64 K. When you delete records from the table, pages

    are only ever freed when an entire extent gets emptied. So if you have a table

    from which you delete 99 of 100 records over time, it'll grow infinitely and

    become very sparse. Thus, my default policy is that *every* table is clustered.

    Second, the index behaviour. A record in a non-clustered table is identified by

    a RID, which is 6 bytes. Every index entry has a key value and a RID. After

    finding the key value you want, the page containing that RID must be loaded.

    The index is a tree, and only the leaf pages point to records. Typically you

    should arrange that your key values are of a size, and system memory is such,

    that a least all the non-leaf (internal) nodes of an index are in cache. Thus,

    with a large table accessed by an index, there'll be two disk accesses to fetch a

    single record (but less than 2*N to fetch N records).

    If a record has been updated and is now larger than the space previously

    allocated to it, it might not have been moved and re-indexed. Instead, it might

    have been replaced by a small "forwarding" record which gives the new location.

    This costs an additional disk access, and is another reason to use a clustered

    index - they don't need any forwarding.

    When accessing record by a clustered index, the row data is stored in the leaf

    pages of the index. Only the key values are stored in internal nodes. Typically

    the record will be much bigger than just the key value, so the non-leaf nodes

    are few. In particular, the leaf nodes take as much space as if they were stored

    in data pages in a non-clustered table, and the internal nodes take as much

    space as a normal non-clustered index over the same fields. The system (and

    the key value) should be sized so that at least all but the leaf nodes and the

    outer key-only nodes (the 2nd-outer layer of the tree) can be in cache. This

    takes the same (except for the lack of forwarding) two disk accesses to fetch

    one record.

    In a clustered index which is on a small key (say, a 4-byte IDENTITY field), it's

    likely that *all* the non-leaf nodes fit in cache, so accessing a record takes only

    a single real disk access (as it might with a non-clustered index over a small key).

    So that works well, just as well as if the same data was stored non-clustered.

    Accessing a clustered table using a secondary index costs about the same as if

    you had a non-clustered table, *as long as* the clustering key size is small.

    This is because the first search in the secondary index returns, not a RID, but a

    key value for the clustered index. The clustered index must then be searched

    for the record - but in the case of a small key, that's cheap.

    So here's the crunch. If you cluster over a large key value, not only does it make

    it less likely that the internal nodes of the clustered key aren't in cache (causing

    extra reads), but it does the same to *every* secondary index - because these

    must contain a key value for *both* the secondary index and the clustered one.

    Essentially, using a clustered key value the same size as a RID has about the same

    performance as non-clustered, but using a larger key impacts performance by

    the *square* of the key size ratio.

    A final issue I've observed with some versions of SQL Server is that the query

    optimiser *overestimates* the cost of using a 2ary index on a clustered table,

    and will often choose a much more costly query plan instead. I don't believe

    I've seen this with SQL2KSP4 or anything more recent, so most folk will be ok,

    and the rest should upgrade anyhow.

    My final advice, based on hard experience, is this:

    * Always use a clustered index (for the allocation behaviour)

    * Always use a small (4-16 bytes) clustering key.

    * Add an IDENTITY field if you don't have a suitable key already.

    Clifford Heath, Data Constellation.

  • In my experience, Clifford is spot-on. Good advice.

  • Thanks Clifford - an excellent summary, explains the issues very clearly.

    Ian

  • One issue Clifford skipped was that heap tables cannot be reorganized except by truncating and reloading them.  Tables with clustered indexes can be optimized by using DBCC DBREINDEX or DBCC INDEXDEFRAG on the clustered index.

    The clustered index fields are duplicated in every non-clustered index, so creating a large multi-field clustered index could result in worse performance than using only non-clustered indexes.  Using a small clustered index, such as a single int or datetime field, should usually perform at least as well as a heap.  A bad clustered index design is using a field that varies randomly for each insertion, such as names or a GUID filled by NEWID(), which causes new rows to be inserted in random locations, which causes page splits and extreme fragmentation.  (This can be avoided by regular reindexing using a fill factor large enough to leave room on each page for the insertions.)  On the other hand, using an int identity field or a datetime filled by GETDATE() for the clustered index causes new rows to be inserted at the end of the table.

    The best performance benefit of clustered indexes is seen when you use a field that appears in a lot of range or inequality filter conditions, assuming they are written correctly.  For instance, "WHERE CONVERT(char(7), EnteredOn, 111)='2005/06'" would cause a table scan no matter what the index structure, but "WHERE EnteredOn BETWEEN CAST('2005-06-01' as datetime) AND CAST('2005-06-30' as datetime)" would run very quickly if the table is clustered by EnteredOn.  Identity fields are very convenient for primary keys but are usually used to locate single records, whereas dates are more likely to be used in ranges.

  • Scott and Cliff have really answered well, so listen to them and thanks for the great answers.

    The only other hting I'll add is that the query optimizer works better with clustered indexes on every table. The guys on the SQL Server team expect that tables have clustered indexes when they write their routines. So put a clustered index on, not necessarily on the PK or a FK, but on something.

  • Also, be sure to consider how the data gets populated. I'm lucky in the fact that most of my tables are inserted into daily and I have a char(10) YYYY-MM-DD that's my first column of any of the clustered indexes. That way the hit on insert performance is minimized because all of the inserts happen at the end of the table, thus minimizing page splits.

    You might want to play around a little with fill factor if the tables doesn't see many deletes. Most of the time you can leave it at the default. I have saved some space by setting it at 100% on my insert-only tables and have the clustered index be on my date column.

    Remember, the clustered index actually IS the table, not just a seperate 'table' that references the real data. The columns you specify in the clustered index effect the order in which the data is stored. If the data for the row is changed and one of the columns are in the clustered index, then you may get a performace hit because the row has to be moved.

    It all boils down to how the data is accessed and modified. I try and put clustered indexes on all of my tables, but as everyone else said - it depends.


    Live to Throw
    Throw to Live
    Will Summers

  • This thread contains a very good syntheses regarding Clustering.

    Also keep in mind that for best practise , you should check that your clustering index key is a unique key !

    If it isn't unique, SQLServer will uniquify it by adding a hidden guid kind of thing.

    It enlarges the keypart that will be used in all non clustering indexes !

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Just to add .. MS has released a nice article "Comparing Tables Organized with Clustered Indexes versus Heaps" in the sql2005 best practises.

    you can find it at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/clusivsh.mspx

     

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Don't follow contrived "best practice" if it doesn't suit your business requirements. In the document referenced by Alzdba, under the heading "Test Methodology", they basically admit to cooking the tests because a real-life workload didn't suit.

    I've yet to find a system that has been effictively tuned using a different workload than whats expected in the production system.

    You might find some of the entries on this blog interesting,

    http://blogs.sqlserver.org.au/blogs/Greg_Linwood/

    Disclaimer: I do work with Greg and we're both heavily involved in tuning real-life production systems on a day-to-day basis.

    --------------------
    Colt 45 - the original point and click interface

  • As stated in Greg's blog, many times we just overlook the fact that most of the time, a CIX is not choosen for a special reason, but just to have one because of BP-guidelines.

    Especialy in those cases, it should be considered and evaluated that changing the CIX may have the effect(s) one seeks.

    Document these choices and reasons case per case.

     

    Johan

    Learn to play, play to learn !

    Dont drive faster than your guardian angel can fly ...
    but keeping both feet on the ground wont get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help[/url]

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply