Appropriate sample sizes for UPDATE STATISTICS?

  • Okay, now that I've click-baited the MVPs, knowing full well the answer is "It Depends", I would like to try and get a feel for what everyone uses as a sample size. 😀

    I'm taking a look at this post, and it got me thinking about scaling the sample percentages on a dynamic UPDATE STATISTICS query I'm building that joins against sys.sysindexes to get the rowcount. The goal is to get a query built like:

    [font="Courier New"]UPDATE STATISTICS [schema].[object] [statsname] WITH <sample size>[/font]

    I've also added a time component to the script (eventually this will become a proc): any stats whose last update occurred more than X days ago get updated.

    Is this time component combined with the auto sample size a good way to approach this? sp_updatestats doesn't work for this because I've confirmed it doesn't update the stats date.

    Thoughts on this approach as well as what sampling rules you tend to follow?

    -- Dump current stats info, using sys.sysindexes for the row count
    select quotename(SCHEMA_NAME(schema_id)) + '.' + quotename(object_name(S.object_id)) [TableName],
           quotename(S.NAME) [StatsName],
           STATS_DATE(S.object_id, S.stats_id) as [LastStatsUpdate],
           I.rowcnt [NumRows] -- , S.*
    into #statsdump
    from sys.stats as S
    inner join sys.tables as T on T.object_id = S.object_id
    inner join sys.sysindexes as I on I.id = S.object_id
    where I.indid <= 1 -- 0 = heap, 1 = clustered index: one row per table

    declare @ageindays int
    set @ageindays = 3 -- update anything older than N days

    create table #work_to_do
    (
        rownum int identity(1,1) primary key clustered,
        updatecommand varchar(max)
    )

    -- Build one UPDATE STATISTICS command per stale stat, shrinking the sample size as the table grows
    insert into #work_to_do (updatecommand)
    select 'UPDATE STATISTICS ' + TableName + ' ' + StatsName + ' WITH SAMPLE ' +
           case
               when NumRows < 500000 then '100 PERCENT'
               when NumRows < 1000000 then '50 PERCENT'
               when NumRows < 5000000 then '25 PERCENT'
               when NumRows < 10000000 then '10 PERCENT'
               when NumRows < 50000000 then '2 PERCENT'
               else '3000000 ROWS'
           end +
           ' -- Table Rows: ' + cast(NumRows as varchar(max)) [UpdateCommand]
    from #statsdump
    where laststatsupdate < getdate() - @ageindays
    order by numrows desc

    -- Walk the work list and print each command for review
    DECLARE @CMD varchar(max)

    DECLARE UpdateStatsCursor CURSOR FOR
        SELECT UpdateCommand from #work_to_do order by rownum asc

    OPEN UpdateStatsCursor
    FETCH NEXT FROM UpdateStatsCursor into @CMD

    WHILE @@FETCH_STATUS = 0
    BEGIN
        print(@cmd) -- review only; swap for exec(@cmd) to actually run the updates
        FETCH NEXT FROM UpdateStatsCursor into @CMD
    END

    CLOSE UpdateStatsCursor
    DEALLOCATE UpdateStatsCursor
    go

    drop table #statsdump
    drop table #work_to_do

    Gaby
    ________________________________________________________________
    "In theory, theory and practice are the same. In practice, they are not."
    - Albert Einstein

  • GabyYYZ (11/9/2015)


    Okay, now that I've click-baited the MVPs, knowing full well the answer is "It Depends"

    Yup, it does. Hence you'll get n different opinions in x posts (where n >= x).

    I would like to try and get a feel for what everyone uses as a sample size. 😀

    If I'm bothering to update stats manually, I'm going to use FULLSCAN unless I can't due to operational constraints. If I can't, I'll use the highest % I can get away with.

    it got me thinking about scaling the sample percentages on a dynamic UPDATE STATISTICS query I'm building that joins against sys.sysindexes to get the rowcount.

    Don't use sys.sysindexes. It's deprecated, included only for backward compat with SQL 2000, and will be removed soon (I hope very soon). Also, stats have not been based on the rowmodctr since SQL 2000.

    There's a DMV which returns the column modification count (what stats updates are based on now). I can't remember the name right now, but a browse through BoL will find it. Use that.

    If you want the row count, sys.dm_db_partition_stats or sys.partitions both have it.
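    For example, a minimal sketch of getting per-table row counts from sys.dm_db_partition_stats instead:

    -- Sketch: per-table row counts without sys.sysindexes
    SELECT QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name) AS TableName,
           SUM(ps.row_count) AS NumRows
    FROM sys.tables AS t
    INNER JOIN sys.dm_db_partition_stats AS ps
        ON ps.object_id = t.object_id
    WHERE ps.index_id IN (0, 1)   -- heap or clustered index only
    GROUP BY t.schema_id, t.name;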

    Is this time component mixed with Auto sample size a good way to approach this?

    No. It gives no consideration to the data changes in the table. A table that's static can have year-old stats and be fine. I know of systems where some tables needed stats updates every half an hour, otherwise query performance tanked. Age is NOT a good reflection of how stale stats are.

    The code I've snipped basically repeats what SQL's auto sampling does: the larger the table, the smaller the %. If you're not going to use FULLSCAN, just use auto and let SQL pick the sampling size it wants. No need for you to recode the sampling selection yourself.
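    In command form, the choice is just one of these two (the table and stat names here are only placeholders):

    -- FULLSCAN: read every row, most accurate histogram
    UPDATE STATISTICS dbo.SomeTable SomeStat WITH FULLSCAN;

    -- No sample clause: let SQL Server choose the sample size itself
    UPDATE STATISTICS dbo.SomeTable SomeStat;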

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If I have the time, I prefer to do a full scan. Sampling a small percentage, unless you know your data is evenly distributed, is going to lead to inaccurate statistics and bad plans.

    I don't think I'd do time-based updates of stats either: if the base data hasn't changed, there isn't any reason to update the statistics, because they won't have changed.

    For SQL Server 2008 R2 SP2 and SQL Server 2012 SP1 and later I'd use a query something like this:

    WITH StatisticsInfo
    AS (
        SELECT
            QUOTENAME(SCHEMA_NAME(T.schema_id)) + '.' + QUOTENAME(OBJECT_NAME(S.object_id)) [TableName],
            QUOTENAME(S.name) [StatsName],
            STATS_DATE(S.object_id, S.stats_id) AS [LastStatsUpdate],
            DDSP.rows,
            DDSP.rows_sampled / CONVERT(DECIMAL(38, 4), DDSP.rows) AS SamplingRate,
            DDSP.modification_counter,
            DDSP.modification_counter / CONVERT(DECIMAL(38, 4), DDSP.rows) AS ModificationRate
        FROM
            sys.stats AS S
            INNER JOIN sys.tables AS T
                ON T.object_id = S.object_id
            CROSS APPLY sys.dm_db_stats_properties(S.object_id, S.stats_id) AS DDSP
        WHERE
            DDSP.modification_counter > 0 AND
            DDSP.rows > 1000
    )
    SELECT
        StatisticsInfo.TableName,
        StatisticsInfo.StatsName,
        StatisticsInfo.LastStatsUpdate,
        -- Flag a stat for update when the proportion of modified rows crosses a
        -- threshold that shrinks as the table grows
        CASE WHEN StatisticsInfo.rows < 50000 AND
                  StatisticsInfo.ModificationRate > .19 THEN 1
             WHEN StatisticsInfo.rows < 100000 AND
                  StatisticsInfo.ModificationRate > .14 THEN 1
             WHEN StatisticsInfo.rows < 500000 AND
                  StatisticsInfo.ModificationRate > .09 THEN 1
             WHEN StatisticsInfo.rows < 1000000 AND
                  StatisticsInfo.ModificationRate > .04 THEN 1
             WHEN StatisticsInfo.rows < 5000000 AND
                  StatisticsInfo.ModificationRate > .01 THEN 1
             ELSE 0
        END AS UpdateStats,
        StatisticsInfo.SamplingRate,
        StatisticsInfo.ModificationRate
    FROM
        StatisticsInfo;

    Then I'd only update the stats that meet my criteria (UpdateStats = 1), and, like Gail said, I'd use FULLSCAN or AUTO. The query also returns the last sampling rate, which you could reuse if you aren't having issues with bad plans and you really want to control the sampling rate.
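    As a rough sketch, the command generation could look something like this (collapsed here to a single 10% modification threshold for brevity; substitute the CASE ladder above for the real criteria):

    -- Sketch: generate FULLSCAN commands for stats whose modified-row proportion crosses 10%
    SELECT 'UPDATE STATISTICS '
           + QUOTENAME(SCHEMA_NAME(t.schema_id)) + '.' + QUOTENAME(t.name)
           + ' ' + QUOTENAME(s.name) + ' WITH FULLSCAN;' AS UpdateCommand
    FROM sys.stats AS s
    INNER JOIN sys.tables AS t ON t.object_id = s.object_id
    CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS p
    WHERE p.rows > 1000
      AND p.modification_counter / CONVERT(DECIMAL(38, 4), p.rows) > 0.10;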

    Jack Corbett
    Consultant - Straight Path Solutions
    Check out these links on how to get faster and more accurate answers:
    Forum Etiquette: How to post data/code on a forum to get the best help
    Need an Answer? Actually, No ... You Need a Question

  • Thanks Gail and Jack for your input. As I'm finally getting exposure to a live 2014 instance, I need to unlearn some habits from 2008 (non-R2). I would use Ola's scripts, but there seems to be some administrative overhead with them. On a reindex + stats job run of 5.5 hours, 3 of those hours seem to be dedicated just to updating statistics, but when I look at the run times in CommandLog, they don't total anywhere near 3 hours. Thoughts on why this may be occurring?

    This was the main reason I was interested in re-inventing the wheel and coming up with my own script.
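    For reference, the CommandLog run times I'm totalling come from something like this (a sketch, assuming the standard StartTime / EndTime / CommandType columns in Ola's dbo.CommandLog table):

    -- Sketch: total elapsed time per command type from the logging table
    SELECT cl.CommandType,
           COUNT(*) AS Commands,
           SUM(DATEDIFF(SECOND, cl.StartTime, cl.EndTime)) / 60.0 AS TotalMinutes
    FROM dbo.CommandLog AS cl
    WHERE cl.StartTime >= DATEADD(DAY, -1, GETDATE())   -- last run only, adjust as needed
    GROUP BY cl.CommandType
    ORDER BY TotalMinutes DESC;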

    Gaby
    ________________________________________________________________
    "In theory, theory and practice are the same. In practice, they are not."
    - Albert Einstein
