Should tables be clustered?

Question

Post reply

Should tables be clustered?

Ian-386680

Mr or Mrs. 500

Points: 509
More actions
April 11, 2007 at 5:26 am

#70008

We have a number of largish tables which have only unclustered indexes. Is this a sensible design, or do tables without any clustured indexes perform slower when queried.

cheers

Ian

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

RichB SSCrazy Eights Points: 9651 More actions · Answer 1

RichB

SSCrazy Eights

Points: 9651

April 11, 2007 at 5:50 am

#699740

Sounds facetious, but the answer is that it depends.

If you post a sample create table script with some typical usage queries and data population statistics we can give you some more useful answers.

Ian-386680 Mr or Mrs. 500 Points: 509 More actions · Answer 2

Ian-386680

Mr or Mrs. 500

Points: 509

April 11, 2007 at 5:57 am

#699743

In a way that answers my question. I was really wanting to know if there was a hard and fast rule stating that clustered tables will be quicker to query.

If you say 'it depends' then presumably there is no such hard and fast rule.

RichB SSCrazy Eights Points: 9651 More actions · Answer 3

RichB

SSCrazy Eights

Points: 9651

April 11, 2007 at 6:53 am

#699753

well, if you can sensibly cluster on the your core search requirements then there is a good chance of a significant benefit. As I say, it all depends on how you are using it.

Marvin D SSCarpal Tunnel Points: 4058 More actions · Answer 4

The advantage of clustered indexes is that the data is physically ordered where as other indexes are pointers. When used, you should try to place the clustered index on the field used the most in the queries. I have also heard that they are good for join conditions as well.

Hope that helps

Marvin

Marvin Dillard
Senior Consultant
Claraview Inc

Clifford Heath SSC Journeyman Points: 77 More actions · Answer 5

You have a number of partial answers. I'll try to be more comprehensive.

First, the allocation behaviour. A table with no clustered index acts as a heap,

which means that *every* insertion goes at the end. Pages are 8k, allocated in

extents of 8 pages, or 64 K. When you delete records from the table, pages

are only ever freed when an entire extent gets emptied. So if you have a table

from which you delete 99 of 100 records over time, it'll grow infinitely and

become very sparse. Thus, my default policy is that *every* table is clustered.

Second, the index behaviour. A record in a non-clustered table is identified by

a RID, which is 6 bytes. Every index entry has a key value and a RID. After

finding the key value you want, the page containing that RID must be loaded.

The index is a tree, and only the leaf pages point to records. Typically you

should arrange that your key values are of a size, and system memory is such,

that a least all the non-leaf (internal) nodes of an index are in cache. Thus,

with a large table accessed by an index, there'll be two disk accesses to fetch a

single record (but less than 2*N to fetch N records).

If a record has been updated and is now larger than the space previously

allocated to it, it might not have been moved and re-indexed. Instead, it might

have been replaced by a small "forwarding" record which gives the new location.

This costs an additional disk access, and is another reason to use a clustered

index - they don't need any forwarding.

When accessing record by a clustered index, the row data is stored in the leaf

pages of the index. Only the key values are stored in internal nodes. Typically

the record will be much bigger than just the key value, so the non-leaf nodes

are few. In particular, the leaf nodes take as much space as if they were stored

in data pages in a non-clustered table, and the internal nodes take as much

space as a normal non-clustered index over the same fields. The system (and

the key value) should be sized so that at least all but the leaf nodes and the

outer key-only nodes (the 2nd-outer layer of the tree) can be in cache. This

takes the same (except for the lack of forwarding) two disk accesses to fetch

one record.

In a clustered index which is on a small key (say, a 4-byte IDENTITY field), it's

likely that *all* the non-leaf nodes fit in cache, so accessing a record takes only

a single real disk access (as it might with a non-clustered index over a small key).

So that works well, just as well as if the same data was stored non-clustered.

Accessing a clustered table using a secondary index costs about the same as if

you had a non-clustered table, *as long as* the clustering key size is small.

This is because the first search in the secondary index returns, not a RID, but a

key value for the clustered index. The clustered index must then be searched

for the record - but in the case of a small key, that's cheap.

So here's the crunch. If you cluster over a large key value, not only does it make

it less likely that the internal nodes of the clustered key aren't in cache (causing

extra reads), but it does the same to *every* secondary index - because these

must contain a key value for *both* the secondary index and the clustered one.

Essentially, using a clustered key value the same size as a RID has about the same

performance as non-clustered, but using a larger key impacts performance by

the *square* of the key size ratio.

A final issue I've observed with some versions of SQL Server is that the query

optimiser *overestimates* the cost of using a 2ary index on a clustered table,

and will often choose a much more costly query plan instead. I don't believe

I've seen this with SQL2KSP4 or anything more recent, so most folk will be ok,

and the rest should upgrade anyhow.

My final advice, based on hard experience, is this:

* Always use a clustered index (for the allocation behaviour)

* Always use a small (4-16 bytes) clustering key.

* Add an IDENTITY field if you don't have a suitable key already.

Clifford Heath, Data Constellation.

humbleDBA Hall of Fame Points: 3460 More actions · Answer 6

humbleDBA

Hall of Fame

Points: 3460

April 12, 2007 at 4:49 am

#700006

In my experience, Clifford is spot-on. Good advice.

Ian-386680 Mr or Mrs. 500 Points: 509 More actions · Answer 7

Ian-386680

Mr or Mrs. 500

Points: 509

April 12, 2007 at 7:11 am

#700044

Thanks Clifford - an excellent summary, explains the issues very clearly.

Ian

Scott Coleman One Orange Chip Points: 27451 More actions · Answer 8

One issue Clifford skipped was that heap tables cannot be reorganized except by truncating and reloading them. Tables with clustered indexes can be optimized by using DBCC DBREINDEX or DBCC INDEXDEFRAG on the clustered index.

The clustered index fields are duplicated in every non-clustered index, so creating a large multi-field clustered index could result in worse performance than using only non-clustered indexes. Using a small clustered index, such as a single int or datetime field, should usually perform at least as well as a heap. A bad clustered index design is using a field that varies randomly for each insertion, such as names or a GUID filled by NEWID(), which causes new rows to be inserted in random locations, which causes page splits and extreme fragmentation. (This can be avoided by regular reindexing using a fill factor large enough to leave room on each page for the insertions.) On the other hand, using an int identity field or a datetime filled by GETDATE() for the clustered index causes new rows to be inserted at the end of the table.

The best performance benefit of clustered indexes is seen when you use a field that appears in a lot of range or inequality filter conditions, assuming they are written correctly. For instance, "WHERE CONVERT(char(7), EnteredOn, 111)='2005/06'" would cause a table scan no matter what the index structure, but "WHERE EnteredOn BETWEEN CAST('2005-06-01' as datetime) AND CAST('2005-06-30' as datetime)" would run very quickly if the table is clustered by EnteredOn. Identity fields are very convenient for primary keys but are usually used to locate single records, whereas dates are more likely to be used in ranges.

Steve Jones - SSC Editor SSC Guru Points: 728159 More actions · Answer 9

Scott and Cliff have really answered well, so listen to them and thanks for the great answers.

The only other hting I'll add is that the query optimizer works better with clustered indexes on every table. The guys on the SQL Server team expect that tables have clustered indexes when they write their routines. So put a clustered index on, not necessarily on the PK or a FK, but on something.

Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com

Will1922 SSCertifiable Points: 7189 More actions · Answer 10

Also, be sure to consider how the data gets populated. I'm lucky in the fact that most of my tables are inserted into daily and I have a char(10) YYYY-MM-DD that's my first column of any of the clustered indexes. That way the hit on insert performance is minimized because all of the inserts happen at the end of the table, thus minimizing page splits.

You might want to play around a little with fill factor if the tables doesn't see many deletes. Most of the time you can leave it at the default. I have saved some space by setting it at 100% on my insert-only tables and have the clustered index be on my date column.

Remember, the clustered index actually IS the table, not just a seperate 'table' that references the real data. The columns you specify in the clustered index effect the order in which the data is stored. If the data for the row is changed and one of the columns are in the clustered index, then you may get a performace hit because the row has to be moved.

It all boils down to how the data is accessed and modified. I try and put clustered indexes on all of my tables, but as everyone else said - it depends.

Live to Throw
Throw to Live
Will Summers

Johan Bijnens SSC Guru Points: 134873 More actions · Answer 11

This thread contains a very good syntheses regarding Clustering.

Also keep in mind that for best practise , you should check that your clustering index key is a unique key !

If it isn't unique, SQLServer will uniquify it by adding a hidden guid kind of thing.

It enlarges the keypart that will be used in all non clustering indexes !

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me

Johan Bijnens SSC Guru Points: 134873 More actions · Answer 12

Just to add .. MS has released a nice article "Comparing Tables Organized with Clustered Indexes versus Heaps" in the sql2005 best practises.

you can find it at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/clusivsh.mspx

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me

philcart SSC-Forever Points: 47794 More actions · Answer 13

Don't follow contrived "best practice" if it doesn't suit your business requirements. In the document referenced by Alzdba, under the heading "Test Methodology", they basically admit to cooking the tests because a real-life workload didn't suit.

I've yet to find a system that has been effictively tuned using a different workload than whats expected in the production system.

You might find some of the entries on this blog interesting,

http://blogs.sqlserver.org.au/blogs/Greg_Linwood/

Disclaimer: I do work with Greg and we're both heavily involved in tuning real-life production systems on a day-to-day basis.

--------------------
Colt 45 - the original point and click interface

Johan Bijnens SSC Guru Points: 134873 More actions · Answer 14

As stated in Greg's blog, many times we just overlook the fact that most of the time, a CIX is not choosen for a special reason, but just to have one because of BP-guidelines.

Especialy in those cases, it should be considered and evaluated that changing the CIX may have the effect(s) one seeks.

Document these choices and reasons case per case.

Johan

Learn to play, play to learn !

Dont drive faster than your guardian angel can fly ...
but keeping both feet on the ground wont get you anywhere :w00t:

- How to post Performance Problems
- How to post data/code to get the best help[/url]

- How to prevent a sore throat after hours of presenting ppt

press F1 for solution, press shift+F1 for urgent solution 😀

Need a bit of Powershell? How about this

Who am I ? Sometimes this is me but most of the time this is me