Performance tuning

  • Other than moving this to faster IO, partitioning it, or moving the indexes to a different spindle, can anyone see a way to get a performance increase out of the following T-SQL?  There are clustered primary key indexes on the WebLogID column and on urlStemID, as well as a non-clustered index on the logWritten column.  N.B. performance is almost exactly the same (to the second) when using an inner join rather than an IN.  Currently the DB size is 3.5GB, with about 13 million records in the Weblog table, on SQL Server Express (yes, I know I'm going to have to partition soon or upgrade, since I will soon hit the 4GB limit, but this is non-production at the moment).  According to the execution plan, the bulk of the cost is in the clustered index scan of the Weblog table, and of that 85% is the IO.  I suspect I know the answer, but I was just wondering whether anyone has other thoughts on getting this below the 22s it takes to run for long date ranges, i.e. >6 months:

     

    SELECT
        YEAR(logWritten) AS [Year],
        MONTH(logWritten) AS [Month],
        DAY(logWritten) AS [Day of Month],
        COUNT(WebLogID) AS [Visits]
    FROM Weblog
    WHERE urlStemID IN
          (SELECT urlStemID FROM URLStem WHERE StemText LIKE @strSearch)
      AND logWritten BETWEEN @statDate AND @endDate
    GROUP BY YEAR(logWritten), MONTH(logWritten), DAY(logWritten)
    ORDER BY YEAR(logWritten), MONTH(logWritten), DAY(logWritten)
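
    To put hard numbers on that IO cost, the query can be wrapped with the standard statistics switches (plain T-SQL, nothing specific to this schema):

    SET STATISTICS IO ON;    -- per-table logical/physical read counts
    SET STATISTICS TIME ON;  -- CPU and elapsed time
    -- ... run the SELECT above ...
    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;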

     

  • The clustered index must be on the logWritten column.

    And use a JOIN anyway.
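
    A minimal sketch of that index change, assuming the existing clustered primary key is a constraint named PK_Weblog (the real names aren't given in this thread):

    -- PK_Weblog and IX_Weblog_logWritten are assumed names, for illustration only.
    -- Recreate the PK as non-clustered so the clustered index can move to logWritten.
    ALTER TABLE dbo.Weblog DROP CONSTRAINT PK_Weblog;
    -- logWritten is not unique, so SQL Server silently adds a 4-byte uniquifier.
    CREATE CLUSTERED INDEX IX_Weblog_logWritten ON dbo.Weblog (logWritten);
    ALTER TABLE dbo.Weblog ADD CONSTRAINT PK_Weblog PRIMARY KEY NONCLUSTERED (WebLogID);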

    _____________
    Code for TallyGenerator

  • SELECT
        YEAR(logWritten) AS [Year],
        MONTH(logWritten) AS [Month],
        DAY(logWritten) AS [Day of Month],
        COUNT(WebLogID) AS [Visits]
    FROM Weblog wl
    JOIN URLStem us
        ON wl.urlStemID = us.urlStemID
    WHERE StemText LIKE @strSearch
      AND logWritten BETWEEN @statDate AND @endDate
    GROUP BY YEAR(logWritten), MONTH(logWritten), DAY(logWritten)
    ORDER BY YEAR(logWritten), MONTH(logWritten), DAY(logWritten)

     

    That should help a bit, but here are other things you could do.

    Does logWritten have a time portion?

    Is there a wildcard at the start of @strSearch, and is it needed?
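
    On the time-portion question: if logWritten carries times, BETWEEN @statDate AND @endDate stops at midnight on the end date and drops the whole final day. A half-open range avoids that (assuming @endDate is passed with no time part):

    -- With a datetime column, BETWEEN excludes everything after @endDate 00:00:00.
    -- A half-open range keeps the entire final day:
    WHERE logWritten >= @statDate
      AND logWritten <  DATEADD(d, 1, @endDate)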

    www.sql-library.com

  • Nope, that is exactly the same performance (as stated in my original post, I had already tried this with joins), although I did run your version just in case!  Yes, logWritten is a timestamp of type datetime.  Unfortunately, @strSearch is a variable string which changes depending upon what the user is looking for stats on.  The clustered scan on the URLStem table looking for the string has a 0% cost, though; it is the clustered scan of the Weblog table (cost: 71%) and the hash match (right semi join from my code, or inner join from yours; cost: 26%) where the delays are.  Thanks for the reply.

  • Indexes are being rebuilt, with the clustered index on the time column - I'll have performance info in a bit.

  • Still being rebuilt?

    _____________
    Code for TallyGenerator

  • You'll never do better than an INDEX SCAN with any of the queries written so far... not sure there's a better one, either.  I might be able to tweak a couple of things, but I need to know what kind of values can be in @strSearch and StemText...

    Also, please tell me that everything you have labeled as a date is a datetime datatype and everything you have labeled as an ID is an INT or BIGINT.
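
    For what it's worth, that can be confirmed straight from the metadata (a standard catalog view; table name as used in this thread):

    -- Verify the datatypes behind the date and ID columns:
    SELECT COLUMN_NAME, DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'Weblog';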

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • INDEX SCAN will be used every time you query for a range of values from a column that doesn't have a clustered index on it.

    BETWEEN means CLUSTERED INDEX.

    End of story.

    _____________
    Code for TallyGenerator

  • Nah... you can get INDEX SEEKS out of non-clustered indexes... they just won't be CLUSTERED INDEX SEEKS.  And (as you know, ol' friend), I don't think they'll come close to an INDEX SEEK even with the correct index, given the code so far.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • If you just need to show results on a day-by-day basis and do not need the breakdown into Year, Month, and Day columns, then this might shave a bit off, though I'm not sure how much with 13 million rows; I only had 10,000 available to test with.  You'll also note I went back to an inner join, but I moved the predicate on URLStem's StemText into the join, out of the WHERE.  Sometimes this can affect a query's performance on multi-join queries, so it might have some impact here.

    SELECT
        DATEADD(d, DATEDIFF(d, 0, WB.logWritten), 0) AS [Date],
        COUNT(WebLogID) AS [Visits]
    FROM dbo.Weblog WB
    INNER JOIN dbo.URLStem URL
        ON  WB.urlStemID = URL.urlStemID
        AND URL.StemText LIKE @strSearch
    WHERE WB.logWritten BETWEEN @statDate AND @endDate
    GROUP BY DATEADD(d, DATEDIFF(d, 0, WB.logWritten), 0)
    ORDER BY DATEADD(d, DATEDIFF(d, 0, WB.logWritten), 0)
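
    For anyone unfamiliar with the DATEADD/DATEDIFF pairing above: it floors a datetime to midnight, which is what lets the query group on whole days. The date literal below is just an example value:

    -- DATEDIFF counts whole days since day 0 (1900-01-01);
    -- DATEADD adds them back, stripping the time portion.
    SELECT DATEADD(d, DATEDIFF(d, 0, '2007-06-15 14:33:07'), 0);
    -- returns 2007-06-15 00:00:00.000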

  • 1) IN and JOIN run in the same time because the optimizer is most likely rewriting the IN as a JOIN under the covers (show the estimated query plan for both to verify this).

    2) You are actually doing THREE things that make a clustered index on logWritten better for this particular query: BETWEEN, GROUP BY and ORDER BY.  If this is the main action performed against this table, then even paying the overhead of having a uniquifier added onto the 8 bytes of the date column will be worth it (thus having a 12-byte clustering key, which is carried over to all non-clustered indexes as well).  Test and see, since it is a non-production environment.  BTW, be prepared for a HUGE transaction log when you shift the clustered index.  I would manually expand the tlog to 1.5 times the size of the table, then shrink it back down when indexing is complete (sketched after this list).

    3) Did you remember to set the initial size of the database and tlog files?  Did you set the growth rates appropriately?  I would check for OS-level file fragmentation if you didn't.  Also make sure your antivirus is set to ignore .mdf and .ldf files.

    4) You must accept that at some point you are just asking for more data than your spindle/cpu can chew through in a few seconds.  🙂
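
    A sketch of the tlog resize from point 2, with hypothetical database/file names and sizes only indicative of the 3.5GB table mentioned above:

    -- WeblogDB / WeblogDB_log are placeholder names; substitute your own.
    ALTER DATABASE WeblogDB
        MODIFY FILE (NAME = WeblogDB_log, SIZE = 5GB);  -- ~1.5x the table size
    -- ... move the clustered index here ...
    DBCC SHRINKFILE (WeblogDB_log, 500);  -- shrink back to 500MB afterwards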

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service


  • I will get around to checking a few of these ideas later in the week - mainly because the requested clustered index on the date field put the DB over the 4GB limit for SQL Express, and hence it failed - I'll have to move it to a server with a proper license!  Yes, the date ranges have to be fully adjustable and broken down (when searching over a couple of days the performance is OK).  File growth etc. was all sorted from the outset; I looked at OS-level defrag - hardly anything - and the virus checker is ignoring it.  The string search is in the form '%/podcast/%.mp3', or many other combinations depending upon the report.  This side of things has next to 0 cost according to the execution plan (the URL strings are fully normalised, so nothing like the 13m records of the stems).  I think the clustered index on the date range is the answer, but it will have to wait a few days.  Thanks all for the advice.

  • I always use 'with (nolock)' on my queries. It seems to speed things up a bit...
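
    For reference, on the query under discussion the hint would look something like this (a sketch only; see the caveat in the next reply):

    -- NOLOCK reads uncommitted rows, so counts can be slightly off.
    SELECT COUNT(WebLogID) AS [Visits]
    FROM dbo.Weblog WITH (NOLOCK)
    WHERE logWritten BETWEEN @statDate AND @endDate;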



    Cheers,

    Ben Sullins
    bensullins.com
    Beer is my primary key...

  • But you run the risk of getting invalid data due to dirty reads.
