What's more efficent for search criteria?

  • I'm in the process of designing some new tables. The decision has been made to always create a new row when the data for that table is different than any other row that already exists. Currently the table is about 8 columns wide and contains personal identification information.

    I'm concerned about building the indexes for this type of table. I'll have a search query with 8 different where clauses.

    For example

    Where firstname = @firstName, middlename = @middlename, lastname=@lastname, phonenumber=@phonenumber, zipcode = @zipcode...ect

    First I'd like your thoughts on how to get the best performance in this type of situation. My first thought is to generate a md5 type of hash based on all of the columns, and then search on that.

  • I'm curious why a search query would have so many criteria. Usually a search query will be by a primary key or other unique identifier, or by 1-3 columns - preferabley idexed (firstname, lastname, city), or with a join to another table.

  • Customer requirement is to maintain each and every combination of data exactly as entered. I'm sure there are different approaches to handle this type of requirement, However we've been directed to maintain each unique combination in the manner described above. As a result we need an efficient method of determining whether the newly supplied data matches an existing row, or needs to be created.

  • The table you describe violates normalization rules and will be inefficient just by the nature of the design. It is my opinion that business/user requirments can state WHAT needs to be stored but should never dictate HOW something is stored in the database. That falls on the professional data modeler to correctly implement in a way that provides the business/user with the data they require.

  • the key to indexing and performance is going to be how they are searched in your WHERE clause.

    if all 8 values are absolutely required for every search, an index with all eight columns would be very selective.

    if you know that most of the time they search on just lastname or lastname + firstname, then two different indexes, to cover both of those queries would be good.

    you'll probably want the GUI to build your query with WHERE statements only for the non-empty values;

    you want to avoid trying to create a catch all query;

    check out the pitfalls in this article:

    http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/

    more details and we can offer more than general advice.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • You could create a unique index or constraint that has all eight columns to guarantee that you cannot have duplicates.

    There can be a lot of overhead to such a wide index, so that could be a disadvantage.

    In general, code like this should work well with smaller indexes that only have the more selective non nullable columns. If the columns are nullable, you will have to modify the query to allow for that.

    insert into MyTable

    (

    Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8

    select

    @Col1,@Col2,@Col3,@Col4,@Col5,@Col6,@Col7,@Col8

    where

    not exists (

    select *

    from

    MyTable

    where

    Col1 = @Col1 and

    Col2 = @Col2 and

    Col3 = @Col3 and

    Col4 = @Col4 and

    Col5 = @Col5 and

    Col6 = @Col6 and

    Col7 = @Col7 and

    Col8 = @Col8 )

  • thisted (10/5/2012)


    I'm in the process of designing some new tables. The decision has been made to always create a new row when the data for that table is different than any other row that already exists. Currently the table is about 8 columns wide and contains personal identification information.

    I'm concerned about building the indexes for this type of table. I'll have a search query with 8 different where clauses.

    For example

    Where firstname = @firstName, middlename = @middlename, lastname=@lastname, phonenumber=@phonenumber, zipcode = @zipcode...ect

    First I'd like your thoughts on how to get the best performance in this type of situation. My first thought is to generate a md5 type of hash based on all of the columns, and then search on that.

    This type of table is affectionately known as a "Slowly Changing Dimension" or "SCD" for short. It's a time honored method for auditing all actions taken against the table in the same table instead of having a separate audit or history table (which has it's own set of problems, as well).

    One of the most effective "SCD" types of tables is the "Type 2 Slowly Changing Dimension". Please see the following URL for the different types of "SCDs" available.

    http://en.wikipedia.org/wiki/Slowly_changing_dimension

    One of the purposes of Type 2 SCDs is so that you can actually determine what the status of all of the rows in a table where/are for any given point in time and it's very simple and effective in doing so. Don't let talk of the table being "denormalized" or "ineffecient" deter you because the denormalization is necessary and it's far from being ineffecient if done properly.

    To make a longer story much shorter, Type 2 SCDs are generally maintained by a well written, high performance trigger that end-dates the old row and creates a new row with a new start date and an "open" end-date of either a NULL or a very large date (I use 9999-12-30 which leaves one day of headroom for some critical criteria calculations). On my systems, I usually also add a quick check in the trigger to prevent any direct manipulation of the start and end date columns to make it more tamper proof and to ensure that the dates never overlap from row to row for any given unique data.

    One of the keys to success and performance is how you index for the PK. Whatever you currently use for the PK needs to be modified so that the start-date column is the first column of the PK (and clustered index, BTW) and that what you would normally have for PK column(s) comes after that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply