Performance required on > 4 million records

  • Hi all,

    I have a table with more than 4 million records. I need to return a portion of it (say 10,000 records) using a filter on three specific fields.

    The stored procedure that needs these records aggregates them with some joins onto other tables and returns results (of, say, 200 records).

    The problem I am having is the performance of retrieving these results. I have played with indexing quite a lot: adding a clustered key on the three fields required, adding them separately, and yet the best performance seems to come from removing ALL indexes from the table.

    I have also created a function to return the records required, which actually seems to be quicker than joining onto a subquery that returns the same data.

    Any tips/advice, even if it may not be relevant to the problem, would help!

    Colin

    At a selection ratio of 2.5%, the optimum access method is hard to decide, for the optimiser and for a developer. If you were selecting a contiguous range from a clustered index, then the clustered index should probably be used; but if you are selecting rows at random, a table scan might be a reasonable choice.

    If you want to persuade the optimiser to use an index, you can try query hints. You can put

    OPTION (FAST 1)

    at the end of the query; that sometimes prompts the optimiser to choose an index where otherwise it would not, because it is asking for the first row to come back fast. You can force an index to be used directly by putting WITH (INDEX = an_index_name) after the table name (and alias) in the FROM clause, but this is discouraged.
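
    For illustration, a minimal sketch of both hints against a made-up table and index (the names are placeholders, not your actual schema):

    -- Query-level hint: ask the optimiser to return the first row quickly
    SELECT t.Field1, t.Field2, COUNT(*)
    FROM dbo.BigTable AS t
    WHERE t.Field1 = 1 AND t.Field2 = 2 AND t.Field3 = 3
    GROUP BY t.Field1, t.Field2
    OPTION (FAST 1);

    -- Table-level hint: force a specific (hypothetical) index; use only as a last resort
    SELECT t.Field1, t.Field2, COUNT(*)
    FROM dbo.BigTable AS t WITH (INDEX = IX_BigTable_Filter)
    WHERE t.Field1 = 1 AND t.Field2 = 2 AND t.Field3 = 3
    GROUP BY t.Field1, t.Field2;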

    Maybe it's your joins that are taking so long?

    Do your other tables have indexes? Do you have relationships defined between these fields on those tables?

    Using Query Analyzer, turn on "Show Execution Plan" (under the Query menu, or [CTRL]+[K]). Have a look at the resulting execution plan to see what is taking the longest to perform. Maybe it's using full table scans to join on the other tables.
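
    If you would rather capture the plan as text to paste back here, something like this works (the SELECT is just a placeholder for your real query):

    -- Show the estimated plan as text instead of running the query
    SET SHOWPLAN_TEXT ON;
    GO
    SELECT * FROM dbo.BigTable WHERE Field1 = 1;   -- placeholder query
    GO
    SET SHOWPLAN_TEXT OFF;
    GO

    -- Or run the query and see logical/physical reads per table, plus timings
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;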

     


    Julian Kuiters
    juliankuiters.id.au

  • Colin,

    Would you give us the DDL of your tables and the query itself?

    Beside each table, it would be good if you could indicate the number of rows it contains, as well as the indexes it actually has.

    That will make it easier for us to see where we have to look.



    Bye
    Gabor

  • Thanks for the replies,

    I have made some headway using the execution plans of the two main queries. Because I was always filtering the table on two fields, I created a clustered index on them. This seemed to increase performance a great deal.
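
    For reference, it was roughly along these lines (the table and column names here are placeholders rather than the real ones):

    -- Cluster the table on the two columns used in every filter
    CREATE CLUSTERED INDEX IX_BigTable_FilterCols
        ON dbo.BigTable (FilterCol1, FilterCol2);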

    The other thing I did was take the aggregation of these results out of the joins onto other tables (the joins were merely informational, not numeric), and this also increased performance.

    So now what I have is a query with no table scans (only clustered index seeks), but the bulk of the work is now being done by the joins and the aggregation of results.

    I don't think there is much more performance I can get out of the database itself without putting more hardware into the server. It's a dual 1GHz machine with 2GB of RAM; does anyone know anything about the optimal setup for SQL Server in this case?

    Thanks,

    Colin

  • Hello Colin,

    As to the server hardware, it depends on the number of users, the databases and their size, and the activities performed on the server (read/write etc.). Generally, and IMHO, it could be a good idea to consider at least 4GB of RAM... but a lot depends on the hard drives as well. It is recommended and very helpful to have TEMPDB on a physically separate drive (a fast disk without RAID). Also, some levels of RAID can decrease performance considerably when used on the disk arrays that hold your data. We started with RAID 5, but changed it very soon to RAID 1+0 with a significant improvement in performance.

    That said, I'd like to emphasize that performance monitoring is necessary to decide what the bottleneck is under your conditions, and any general advice might be wrong... because each system is different.

  • Colin,

    First, if you are not using the Enterprise Edition, adding memory over 2 Gig is fruitless because the Standard Edition is limited.

    If this is something like a daily, weekly or monthly report (i.e. a batch process), I've been able to process through multiple 67 mega-record tables, aggregates, etc. by simply dividing and conquering.  Too often people try to do too much in a single "SQL Paragraph".  Most of the time, I only need to look at a couple hundred thousand records from each table, so I create dedicated working tables (NOT temp tables; working tables live in the "ScratchPad" database I created) and populate them with only the data I need.  Then, I process through just those records.

    For one particular report I had to create yesterday, I copied just the data I needed for 720k records... it took about 24 seconds.  I populated several additional columns in the working table using multiple SQL Paragraphs, including a fair number of aggregations.  The key was to work it with non-complex joins.  Don't get me wrong... I did like you did with indexes.  The bottom line is I created all the aggregates I needed from over 200 million records (3 x 67 million) and the whole thing took 3 minutes and 43 seconds.

    Remember, divide and conquer...
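
    A rough skeleton of the pattern, with made-up names (your real tables, columns and date range will obviously differ):

    -- 1) Copy only the rows and columns the report needs into a dedicated working table
    --    (a permanent table in the "ScratchPad" database, NOT a #temp table)
    DECLARE @StartDate datetime, @EndDate datetime;
    SELECT @StartDate = '20040701', @EndDate = '20040801';

    SELECT CustomerID, OrderDate, Amount
    INTO   ScratchPad.dbo.Work_Orders
    FROM   dbo.BigSourceTable
    WHERE  OrderDate >= @StartDate AND OrderDate < @EndDate;

    -- 2) Index the working table for the steps that follow
    CREATE CLUSTERED INDEX IX_Work_Orders
        ON ScratchPad.dbo.Work_Orders (CustomerID, OrderDate);

    -- 3) Aggregate in a separate, simple step instead of one huge multi-join SELECT
    SELECT CustomerID, SUM(Amount) AS TotalAmount
    INTO   ScratchPad.dbo.Work_Totals
    FROM   ScratchPad.dbo.Work_Orders
    GROUP  BY CustomerID;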

    A couple of other things... put the log files (LDF) on a different physical disk than the data files (MDF) (Vladan said that as well).  It helps performance A LOT!  Also, make sure that the growth settings for the DB are NOT in percent and not too small.  If they are and have been, then both your database AND your physical hard drive will be horribly fragmented.  You'll need to take the server offline, defrag the drive, and bring the server back online.  Then, you may still need to defrag the database because it could get pretty well jumbled up.
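
    For example (the database and logical file names are made up; check yours with sp_helpfile):

    -- Switch growth from a percentage to a fixed increment
    ALTER DATABASE MyBigDB MODIFY FILE (NAME = MyBigDB_Data, FILEGROWTH = 500MB);
    ALTER DATABASE MyBigDB MODIFY FILE (NAME = MyBigDB_Log,  FILEGROWTH = 250MB);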

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Thanks to everyone for the tips so far,

    I did know about keeping log files on a separate physical drive, and that has been in place for a while. I was chatting to someone about defragmenting the drive; the databases have changed so often that we really should get around to looking at it.

    As for divide and conquer, we actually did produce a table full of aggregate results to begin with, but the queries that are allowed on the table are extremely varied and can be quite complex, so we decided that rather than try to pre-empt the user's query, we would run it from the raw data.

    I did, however, separate out the data from the information (i.e. split some of the joins into individual queries), which did help a great deal; I just put the results back together in the application.

    As it is, the SQL Server is Enterprise Edition SP3 (not SP3a, as we had A LOT of problems running particular stored procedures where the same SQL pasted into Query Analyzer would work fine; in the end I gave the SQL to the application and let it run the query rather than the stored procedure, but that got very annoying very quickly) with dual 1GHz chips and 2GB of RAM.

    On a side note: why is SQL Server so damned rubbish at managing its memory? Most of the time SQL Server is showing in the process information with 1.7GB.

    Can you please provide the code you are using? A lot of the time, moving WHERE conditions into the related ON parts of the joins can speed things along, but seeing your code will help us understand what is happening.

    Also, if you can, note the table names and the number of rows in each.

  • "Why is Sql Server to damned rubbish at managing it's memory? "

    SQL Server will use as much memory as possible to speed up queries by caching data in memory. Every time you read data from a database/table/index, SQL Server will cache most of the data in memory so that the 2nd, 3rd, nth time you run the query it will be faster.

    You can limit the amount of memory that SQL Server uses. In Enterprise Manager, right click your server, choose Properties. Select the Memory tab. You can set SQL Server to use a Fixed or Variable amount of Memory. Either way, you should keep the sliders in the GREEN area, to ensure you don't starve your OS or SQL Server.

    If you use the server for only SQL Server, you should use around 80% or leave 256 - 512MB for your OS. If you have other applications (like IIS) that also run on the server, you will want to use a little less. If SQL Server uses too much memory, other processes will be pushed out to your Paging/Swap file, making the system slower.

    You need to restart SQL Server to make this configuration active.
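
    If you prefer to script it, the same setting can be changed with sp_configure (the 1536 MB figure is only an example; size it to leave headroom for the OS and any other applications):

    -- 'max server memory (MB)' is an advanced option, so expose it first
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- Cap the buffer pool at an example value of 1.5GB
    EXEC sp_configure 'max server memory (MB)', 1536;
    RECONFIGURE;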


    Julian Kuiters
    juliankuiters.id.au

    If some of your data is relatively static, you could update a database once a day/hour/week with your aggregate values, set it to read-only, and pick up your data from there (see the rough sketch below).

    Depends how current the data must be for the users.
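
    Something like this on a schedule, for instance (all of the names here are hypothetical):

    -- Open the reporting database for the refresh
    ALTER DATABASE ReportSnapshot SET READ_WRITE;

    -- Rebuild the aggregate table from the live data
    TRUNCATE TABLE ReportSnapshot.dbo.DailyTotals;

    INSERT INTO ReportSnapshot.dbo.DailyTotals (FilterCol1, FilterCol2, TotalAmount)
    SELECT FilterCol1, FilterCol2, SUM(Amount)
    FROM   LiveDB.dbo.BigTable
    GROUP  BY FilterCol1, FilterCol2;

    -- Set it back to read-only for the reporting users
    ALTER DATABASE ReportSnapshot SET READ_ONLY;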

     

  • Jeff,

    If you are still around, I'd like to see a skeleton of what you accomplished. I'm trying to do the same thing with over 150 million rows. I am limited to SQL Standard Edition, so indexed views are not an option. I've played with indexes and index hints till I'm blue in the face. Any help would be appreciated!

    Thanks!

    Jeff Moden (8/16/2004)


    Colin,

    First, if you are not using the Enterprise Edition, adding memory over 2 Gig is fruitless because the Standard Edition is limited.

    If this is something like a daily, weekly or monthly report (i.e. a batch process), I've been able to process through multiple 67 mega-record tables, aggregates, etc. by simply dividing and conquering. Too often people try to do too much in a single "SQL Paragraph". Most of the time, I only need to look at a couple hundred thousand records from each table, so I create dedicated working tables (NOT temp tables; working tables live in the "ScratchPad" database I created) and populate them with only the data I need. Then, I process through just those records.

    For one particular report I had to create yesterday, I copied just the data I needed for 720k records... it took about 24 seconds. I populated several additional columns in the working table using multiple SQL Paragraphs, including a fair number of aggregations. The key was to work it with non-complex joins. Don't get me wrong... I did like you did with indexes. The bottom line is I created all the aggregates I needed from over 200 million records (3 x 67 million) and the whole thing took 3 minutes and 43 seconds.

    Remember, divide and conquer...

    A couple of other things... put the log files (LDF) on a different physical disk than the data files (MDF) (Vladan said that as well). It helps performance A LOT! Also, make sure that the growth settings for the DB are NOT in percent and not too small. If they are and have been, then both your database AND your physical hard drive will be horribly fragmented. You'll need to take the server offline, defrag the drive, and bring the server back online. Then, you may still need to defrag the database because it could get pretty well jumbled up.


    david hay

  • dhay1999 (10/4/2007)


    Jeff,

    If you are still around, I'd like to see a skeleton of what you accomplished. I'm trying to do the same thing with over 150 million rows. I am limited to SQL Standard Edition, so indexed views are not an option. I've played with indexes and index hints till I'm blue in the face. Any help would be appreciated!

    Thanks!

    Wow... lots of water has run under this bridge in the last 3 years 😀 Trying to remember which project that was...

    But the skeleton you ask for is described in the 2nd and 3rd paragraphs of my other reply... it all boils down to working with only the data you need and working the report in "baby-steps" instead of a bazillion aggregated joins in a single select...

    Also, I appreciate that you've played with indexes until you're "blue in the face"... but you also need to look at the code... are the WHERE clauses actually sargable, so that they can use indexes as INDEX SEEKs, or are they going to force an INDEX SCAN or TABLE SCAN because of the way the WHERE clauses are formed?
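
    For example (hypothetical table and column):

    -- Non-sargable: wrapping the indexed column in a function blocks an index seek and forces a scan
    SELECT COUNT(*) FROM dbo.BigTable WHERE YEAR(SomeDate) = 2007;

    -- Sargable rewrite of the same filter: the bare column can drive an index seek
    SELECT COUNT(*) FROM dbo.BigTable WHERE SomeDate >= '20070101' AND SomeDate < '20080101';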

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

    Thanks. Unfortunately, it's doing an aggregation on the entire table to bring it down into smaller and smaller groups, so there is no WHERE clause. It joins two tables, one with the 150 million records and another with 350k records. Right now it's taking about 5 hours to do one aggregation. I have to do about 6 different aggregations from these tables each month, so this part of the process takes 30+ hours at this point. The further we get into the year, the longer it takes. At the end of the year we will have nearly 225 million rows representing 3 years of data, and that pushes it up to nearly 60 hours from what I am told. I didn't design it; I just inherited it. I've looked at indexing, filegroups, partitioned views, and drive configuration (we're on a SAN), so I'm looking into that for possible performance enhancements.

    Thanks again for the info. Sorry to throw you into the way-back machine!


    david hay

  • Are you reporting on all 3 years at the same time?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions
