Indexing Strategy and Foreign Keys

  • Before I ask my questions, I need to describe the data set. I am working with a copy of a very large database that grows by 50 to 60 thousand rows per day. My copy of it is not on the same network and can't be. The system has five primary tables (People, Address, Business, MagEvent, and Profile). There is no primary key (beyond a simple table ID field, i.e. PeopleID, AddressID, etc.) that is shared between the tables. The only common field that relates the tables to one another is an alphanumeric ProcessID field--varchar(13). In each table, a single ProcessID value can exist dozens of times. This is how I receive the information from the original system and there is no way for me to change it. The system is very large (1.5 billion records and growing).

    First Question: What is the advantage to having a foreign key?

    I created a master ProcessID table (lkuptblProcessID) which uniquely lists every ProcessID value found within the five primary tables, and created a foreign key between each table's ProcessID field and the lkuptblProcessID table. However, I still had to create non-clustered indexes on each primary table's ProcessID field. So why do I need the foreign key? Does it help speed up a query? Given that it is a copy of another system, data integrity is not a concern--people only search this system; they don't add records to it.
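    In rough form, this is the setup I have in place (simplified to one table; apart from lkuptblProcessID and ProcessID, the names are just how I've sketched it here):

        -- Master list of every distinct ProcessID value.
        CREATE TABLE lkuptblProcessID (
            ProcessID varchar(13) NOT NULL,
            CONSTRAINT PK_lkuptblProcessID PRIMARY KEY CLUSTERED (ProcessID)
        );

        -- Repeated for each of the five primary tables; People shown as the example.
        ALTER TABLE People
            ADD CONSTRAINT FK_People_ProcessID
            FOREIGN KEY (ProcessID) REFERENCES lkuptblProcessID (ProcessID);

        -- The foreign key alone does not index People.ProcessID,
        -- hence the separate non-clustered index.
        CREATE NONCLUSTERED INDEX IX_People_ProcessID ON People (ProcessID);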

    Second Question: In a non-clustered index, what is the value of INCLUDE columns, as opposed to simply adding those columns to the index key?

    Third Question: If I know that my users will be searching the system using a certain set of fields, should I put all of those fields in the same non-clustered index, and should each field be added to the index key columns list or the included columns list? Or is it better to create a separate non-clustered index for each field that is searched?

    If any of you have any suggestions on the best indexing strategy/approach, your help would be very much welcome.

    Thanks

  • shane.barney (9/9/2009)


    Before I ask my questions, I need to describe the data set. I am working with a copy of a very large database that grows by 50 to 60 thousand rows per day. My copy of it is not on the same network and can't be. The system has five primary tables (People, Address, Business, MagEvent, and Profile). There is no primary key (beyond a simple table ID field, i.e. PeopleID, AddressID, etc.) that is shared between the tables. The only common field that relates the tables to one another is an alphanumeric ProcessID field--varchar(13). In each table, a single ProcessID value can exist dozens of times. This is how I receive the information from the original system and there is no way for me to change it. The system is very large (1.5 billion records and growing).

    First Question: What is the advantage to having a foreign key?

    I created a master ProcessID table (lkuptblProcessID) which uniquely lists every ProcessID value found within the five primary tables, and created a foreign key between each table's ProcessID field and the lkuptblProcessID table. However, I still had to create non-clustered indexes on each primary table's ProcessID field. So why do I need the foreign key? Does it help speed up a query? Given that it is a copy of another system, data integrity is not a concern--people only search this system; they don't add records to it.

    Indexes are for performance. Foreign key constraints are there to help enforce data integrity. Adding a foreign key won't help your performance, but it will help ensure that data which is supposed to be related actually is, and remains so.

    Second Question: In a non-clustered index, what is the value of INCLUDE columns, as opposed to simply adding those columns to the index key?

    It's a way to create a covering index, an index that satisfies all the needs of the query without having to look up data against a clustered index or a heap, without changing the key structure of the index. So you might have an index with a couple of columns that currently make up the key, and this index works extremely well, but you find that you're in a key lookup situation. You can add the one or two columns as INCLUDE to the index without changing its structure (except at the leaf level), so that performance is not negatively impacted and the index becomes a covering index.
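    As a rough illustration (the column names here are made up, not from your system):

        -- Key columns define the index's sort order and are usable for seeks.
        -- INCLUDE columns are stored only at the leaf level, purely so the
        -- query can be answered from the index without a key lookup.
        CREATE NONCLUSTERED INDEX IX_People_ProcessID_Covering
            ON People (ProcessID)            -- key column: seekable
            INCLUDE (LastName, FirstName);   -- covering columns: leaf level only

        -- This query can now be satisfied entirely from the index:
        SELECT LastName, FirstName
        FROM People
        WHERE ProcessID = 'ABC1234567890';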

    Third Question: If I know that my users will be searching the system using a certain set of fields, should I put all of those fields in the same non-clustered index, and should each field be added to the index key columns list or the included columns list? Or is it better to create a separate non-clustered index for each field that is searched?

    Well, let me ask you a question. What is your clustered index on these tables? If you have a very common access path, meaning these three or four columns are always used to select the data in the table, instead of making a nonclustered index and then attempting to INCLUDE 'x' number of extra columns, make these columns the clustered index and then you eliminate the key lookup scenario.

    However, if you're happy with your clustered index, you're usually better off creating a compound key on the index rather than creating a series of individual indexes.
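    To sketch the difference (again, made-up column names), compare one compound-key index against a series of single-column indexes:

        -- One compound key: supports seeks on LastName alone, and on
        -- LastName + FirstName together, from a single index.
        CREATE NONCLUSTERED INDEX IX_People_Name
            ON People (LastName, FirstName);

        -- Versus separate single-column indexes, which the optimizer can
        -- only combine through less efficient index intersection:
        -- CREATE NONCLUSTERED INDEX IX_People_LastName ON People (LastName);
        -- CREATE NONCLUSTERED INDEX IX_People_FirstName ON People (FirstName);

    Note that column order in the compound key matters: a seek requires a predicate on the leading column.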

    If any of you have any suggestions on the best indexing strategy/approach, your help would be very much welcome.

    Thanks

    ----------------------------------------------------
    The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood...
    Theodore Roosevelt

    The Scary DBA
    Author of: SQL Server 2017 Query Performance Tuning, 5th Edition and SQL Server Execution Plans, 3rd Edition
    Product Evangelist for Red Gate Software

  • Thanks for the information on my first two questions. As for the third, users can search on a combination of nine different fields that spans more than one primary table. For example, a user might want to search by last name, state, and city--leaving a search field blank returns records with data as well as NULL values. The last name is found in the People table, and the state and city are found in the Address table. The join between the two tables is on the ProcessID field. On the front-end application, we require that users supply at least one of three primary search fields (ProcessID, last name, or an address).

    Would an indexed view work better? Or would a view backed by non-clustered indexes work better? Is there a way in which I could narrow down the results by using the primary search field? Meaning, if a user searched on the last name "Smith", the stored procedure would first limit the records returned from the other tables to only those records (joined on ProcessID) that relate to anyone with a last name of Smith?
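    Something along these lines is what I have in mind; a rough sketch only, with placeholder names and lengths:

        CREATE PROCEDURE dbo.SearchByLastName
            @LastName varchar(50)
        AS
        BEGIN
            SET NOCOUNT ON;

            -- Step 1: narrow to the ProcessIDs for the primary search field.
            SELECT DISTINCT p.ProcessID
            INTO #Matches
            FROM People AS p
            WHERE p.LastName = @LastName;

            -- Step 2: join the other tables only against that small subset.
            SELECT p.LastName, a.City, a.State
            FROM #Matches AS m
            JOIN People AS p ON p.ProcessID = m.ProcessID
                            AND p.LastName = @LastName
            JOIN Address AS a ON a.ProcessID = m.ProcessID;
        END;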

    We have a requirement to permit broad searches; I am just trying to ensure that we can do it without each query taking five minutes to complete.
