Reference tables and performance

  • Hi all!

    I'm working with SQL Server 2000 SP3 on a Windows 2000 SP4 Advanced Server box.

    I have a database with several reference tables that may be quite large (one contains hundreds of thousands of rows) and that look like:

    create table ref_T (
            nid int identity primary key, -- numerical id
            val varchar(50) unique -- value; the varchar size is just an example
    )

    1) First question: in terms of performance, wouldn't it be better to use char instead of varchar for the value field, adding the necessary RTRIMs, given that the size really varies (from 10 to 50 characters in some cases)?


    Now, I have an application that takes XML files containing some data and converts them into SQL scripts that insert this data into the database.  Of course, it may need to add some data to the reference tables (an order id, for instance).

    2) Second question: what is, in your opinion, the most efficient way to insert this new data while checking that it is not already there?

    I already have two ideas, but I don't know which one is better, and I am sure there is an even better one.  Let's assume the new value is in @myval:

    (i) Check for existence first:

    if not exists (select 1 from ref_T where val = @myval)
            insert into ref_T (val) values (@myval)

    (ii) Simply insert and handle the error afterwards.  If it turns out to be a duplicate insertion, never mind.  (I don't have the code at hand yet.)
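
    It would probably be something along these lines (untested; 2627 is the unique/primary key constraint violation error, and the exact check may need adjusting):

    declare @err int

    insert into ref_T (val) values (@myval)
    set @err = @@ERROR -- capture immediately, @@ERROR is reset by the next statement

    if @err <> 0 and @err <> 2627 -- anything other than a duplicate is a real failure
    begin
            -- report the genuine error; a duplicate is silently ignored
            raiserror ('insert into ref_T failed (error %d)', 16, 1, @err)
    end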

    I believe solution (ii) is more efficient, since the uniqueness check is performed by the insert anyway (so it is performed twice in solution (i)).

    What do you think?  Any comments and feedback will be greatly appreciated.

    Concluding remark: I am not nitpicking over this performance matter.  There will be 300,000 files a day, and, in pre-production conditions, I already handle about 280,000 files a day.  So shaving a few milliseconds off each file can really add up!

    Thanks,

    Xavier

  • I would avoid using CHAR fields because they ALWAYS use the amount of space allocated, whereas VARCHAR only uses the space it needs.

    Secondly I presume you have a unique key on your description field?

    I have used the "trap in VB" approach and the

    IF EXISTS (SELECT 1 FROM Table WHERE Description = @myDesc)

        RAISERROR etc.

    approach.

    If I know that the use of my database will only ever be via my VB app then I will rely on the VB error trapping.

    If all and sundry are going to be using the database then I build more error trapping into my database layer.

    The big problem I had was with data that had mis-spellings and plurals creating false entries.

    e.g. Female, Femail, Females.

  • Hi!

    Thanks a lot for your input.

    I agree with you on the char vs. varchar point, but I am ready to trade size for speed if needed, and I wonder whether having fixed-size rows in the tables, or fixed-size entries in the index, might improve performance (or hurt it, for that matter).

    Yes, there is a unique key on the val column.

    Your "trap in VB" approach is interesting, but I am not able to use it.  Maybe I need to give more details about the inserting application:

    - An XML file is "transformed" into an SQL script.

    - This SQL script is a single transaction, with calls to sp to insert data into the database.

    - Error checking is performed in the script itself - this way I can either commit or roll it back at the end.  A generic error is returned at the end of the script to the calling application if the transaction has been rolled back.

    - The calling application only deals with this generic error and all the communication troubles you can imagine (connection failure, database down and such).

    Hence, I have to perform error checking in the script itself (actually in the stored procedures).  Furthermore, my goal is to separate "normal" errors - trying to insert the same thing twice - from "real" errors (inconsistencies in the data, for instance).  Only the latter must be propagated up to the app.
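
    To give an idea of the shape, a generated script looks roughly like this (the procedure names and parameters here are simplified placeholders, not the real ones):

    begin transaction

    declare @rc int

    -- one stored procedure call per piece of data in the file
    exec @rc = sp_insert_order @order_val = 'ORD-123'
    if @rc <> 0 goto failed

    exec @rc = sp_insert_part @part_val = 'P-456'
    if @rc <> 0 goto failed

    commit transaction
    return

    failed:
    rollback transaction
    raiserror ('import failed', 16, 1) -- the single generic error the calling app sees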

    Last point: I don't have the misspelling problem with this particular application, since the reference data consists of internal codes (order ids, part numbers and such) that are automatically generated or selected from lists by the user, and in another application at that (which does all the checks for me if any are needed.  Cool!)  But your remark is sensible.

    Thanks,

    Xavier


  • One method that I used in the past (sketched at the end of this post) was:

    • Upload data into an intermediate table (NOT a temp table) with no indices and no checking whatsoever.
    • Create indices on intermediate table (obviously not unique).
    • Insert into the real tables using select distinct from the intermediate table.
    • Drop indices on intermediate table
    • Truncate intermediate table ready for the next run.

    The advantages of this approach are:

    • Uploading of data is faster.
    • Data cleansing happens as a separate process to uploading.
    • Adding new data from the intermediate table is faster.
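
    Roughly, one run looks like this (ref_T_stage and the index name are just examples; the staging table is a permanent table created once up front):

    -- 1. blast the incoming values into the unindexed staging table, no checks
    insert into ref_T_stage (val) values (@myval) -- repeated for every incoming value

    -- 2. once it is fully loaded, index the staging table (non-unique)
    create index ix_stage_val on ref_T_stage (val)

    -- 3. move the distinct values into the reference table in one set-based statement
    insert into ref_T (val)
    select distinct val from ref_T_stage

    -- 4. clean up, ready for the next run
    drop index ref_T_stage.ix_stage_val
    truncate table ref_T_stage
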
  • Hi!

    This looks really interesting!

    Nevertheless, you cannot insert any data that uses these references until the whole algorithm above has completed.  Indeed, the data tables must contain the reference ids, so you have to obtain them first, hence finish the whole algorithm first.  Am I missing something?

    But you gave me an idea: I can certainly find, in the calling app, all the reference data used in the file before starting to insert the data.  This way, I can at least avoid checking the same thing several times.  I'll have a look at that.

    BTW, solution (ii) in my original post (try to insert and handle the error) does not work, because the error is propagated to ADO, which cannot decide whether it is a "normal" error or not.  Maybe I can twist that a bit; I'm having a look.

    Bye,

    Xavier


    My blast-everything-into-a-table-then-index-it method works because data is inserted into my intermediate table at maximum speed, with no overhead for maintaining any indices.

    Creating the indices after the table has been populated is a relatively fast process.

    The insert of data into the final table from the intermediate table is fast because you are doing a single set-based operation rather than several individual inserts.

    Although it sounds long-winded, you are lessening the overhead of doing a load of individual inserts into an indexed table, plus all the error checking on each statement.

  • The bulk load process described by David has advantages, but the one issue I see is that the select distinct step filters duplicates within the load batch without regard to the existing data in the target table.  If the original "duplicate" row is existing data, that step will still fail with a unique constraint error.  So the select distinct statement would still need a where clause ensuring that each row being selected is not already in the target table.  This may (and should) still perform best, due to the performance benefits of working in bulk.
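
    Reusing the staging table names from the sketch above, that filter could look like:

    insert into ref_T (val)
    select distinct s.val
    from ref_T_stage s
    where not exists (select 1 from ref_T r where r.val = s.val)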

    That said, you can avoid the error in solution (ii) being propagated to ADO by placing the INSERT into a separate stored procedure and EXECing it from the existing stored procedure (or by using dynamic SQL EXECed from the SP).  Any errors that occur in the second SP get ignored.  Since this will ignore ALL errors, not just the constraint violation, you may want to do some error checking within the second SP and, if you detect that the error was not due to a constraint violation, return that fact via an Output parameter.
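
    A rough sketch of that shape (ins_ref_T_try is a made-up name, and the duplicate check assumes the unique constraint error 2627):

    -- inner procedure: just attempts the insert and reports what happened
    create procedure ins_ref_T_try
            @myval varchar(50),
            @err int output -- 0 = inserted, 2627 = duplicate, anything else = real error
    as
            insert into ref_T (val) values (@myval)
            set @err = @@ERROR
    go

    -- in the existing procedure:
    declare @err int
    exec ins_ref_T_try @myval, @err output
    if @err <> 0 and @err <> 2627
            raiserror ('ref_T insert failed (error %d)', 16, 1, @err)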


  • Hi all!

    Aaron, thanks for your trick for solution (ii), I will test it first thing tomorrow (hey, I'm back home now).

    David, thanks for the clarification, I think I understand your point now.  And Aaron's comment is also sensible.

    OK, so the point is to do "bulk" inserts and deal with the indices afterwards.  I think I can try to mix these ideas and see what comes out of it!

    I'll keep you posted on where this leads me.

    See you,

    Xavier
