Max degree of parallelism

  • Hi All,

    I was wondering what value you would advise for max degree of parallelism at the server level on an OLTP server with 24 logical CPUs and 256 GB RAM. It is a pretty busy server, averaging 13k batch requests per second.

    Non-NUMA architecture.

    -Roy

  • On a personal level, I'd probably start at server-wide MAXDOP 8 with a cost threshold setting of ~10-15 and then monitor the heck out of it looking for the pain points. I assume you're moving an existing server/instance onto the new hardware. What did the old machine look like?
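
    (Purely as a sketch of that suggestion, here is roughly how those two server-wide settings could be applied with sp_configure; the 8 and 15 are just the starting points mentioned above, not a recommendation for your particular box.)

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- starting points only; both settings take effect without a restart
    EXEC sp_configure 'max degree of parallelism', 8;
    EXEC sp_configure 'cost threshold for parallelism', 15;
    RECONFIGURE;

    EXEC sp_configure 'show advanced options', 0;
    RECONFIGURE;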


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Roy Ernest (7/26/2011)


    I was wondering what value you would advise for max degree of parallelism at the server level on an OLTP server with 24 logical CPUs and 256 GB RAM. It is a pretty busy server, averaging 13k batch requests per second. Non-NUMA architecture.

    If it is *truly* an OLTP workload - primarily nested loops joins, very few sorts, single-row lookups, optimal indexes - you might choose server DOP 1. This assumes you have enough query traffic to keep 24 workers busy (i.e. RUNNING or RUNNABLE) most of the time. Any odd queries that would benefit from parallelism, during quieter times, could use an explicit MAXDOP > 1 query hint.

    It's tough to give blanket recommendations though, as so much depends on your particular workload. Other people are happy to set server MAXDOP to 2, 4, or even higher, perhaps along with an elevated cost threshold. 35-50 is a popular number, but not necessarily with me 🙂
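
    For illustration, under server DOP 1 an odd reporting-style query could still be allowed to go parallel by carrying the hint explicitly - a made-up example (the table and column names are hypothetical), just to show where the hint goes:

    SELECT c.Region, SUM(o.Amount) AS RegionTotal
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
    GROUP BY c.Region
    OPTION (MAXDOP 4);   -- this query may use parallelism even though the server default is 1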

  • This is on existing hardware. When we moved onto this server, we had two queries that had parallel plans; I gave hints for just those two. Over the past 2 years I started noticing the CXPACKET wait stats going up. The DBAs usually review all objects that are released to this server (we check for missing indexes, race conditions and even object qualifiers). Somehow I missed some SPs that had parallelism, and now I see a higher wait on CXPACKET. I was able to identify all queries that have parallelism. It is just a handful. I can explicitly specify MAXDOP for these, but I am not sure that is the way to go.

    The max CPU usage is around 40% when it is really busy. Usually it stays around 15%.

    -Roy

  • Let me point out something very important... I've never tuned a server with anywhere near that much traffic.

    That being said, here are a few ideas:

    Here's a little script I wrote to find out if the current cost threshold might be too low (causing extra CXPACKET waits). While we're at it, what do the CXPACKET waits look like (http://www.sqlskills.com/BLOGS/PAUL/post/Wait-statistics-or-please-tell-me-where-it-hurts.aspx)?

    SET IMPLICIT_TRANSACTIONS OFF
    GO

    CREATE TABLE #configs
    (
        name nvarchar(35),
        minimum INT,
        maximum INT,
        config_value INT,
        run_value INT
    )
    GO

    EXEC sp_configure 'show advanced options', 1;
    GO
    RECONFIGURE;
    GO

    INSERT INTO #configs (name, minimum, maximum, config_value, run_value)
    EXEC sp_configure 'cost threshold for parallelism'
    GO

    EXEC sp_configure 'show advanced options', 0;
    GO
    RECONFIGURE;
    GO

    SELECT
        optz.counter
        , optz.occurrence
        , CONVERT(DECIMAL(18,2), optz.value) AS AvgValue
        , conf.name
        , conf.config_value
        , conf.run_value
        , Uptime.DaysUptime AS [Days Server UPTIME & Last Stats Reset]
        , CASE WHEN Uptime.DaysUptime < 45 THEN 'You may not have very meaningful stats because of a recent restart' ELSE NULL END AS [Stats Warning]
        , CASE WHEN optz.value > conf.config_value THEN 'Cost threshold for parallelism might be too low' ELSE NULL END AS [Cost Threshold Warning]
        , CASE WHEN conf.run_value <> conf.config_value THEN 'Server needs to be restarted for the setting to take effect' ELSE NULL END AS [Restart Warning]
    FROM
        sys.dm_exec_query_optimizer_info optz
        CROSS JOIN #configs conf
        OUTER APPLY (
            SELECT
                CONVERT(DECIMAL(18, 2), DATEDIFF(hh, create_date, GETDATE()) / 24.0) AS DaysUptime
            FROM
                sys.databases
            WHERE
                name = 'tempdb'
        ) Uptime
    WHERE
        optz.counter = 'final cost'
    GO

    DROP TABLE #configs

    I've also just toyed for about an hour to tweak this script. It finds the plans that require parallelism to run. Using the temp table you can get weighted averages. It goes without saying that this is a really heavy query to run. It takes over 1 minute to process 10K plans on our little prod server (2 Xeon CPUs, 4 GB RAM).

    This version will get you the weighted average estimated cost of the queries with parallelism. You can compare that with the previous one, which is server-wide, to make a really good guess for the "correct" cost threshold to set on the server.

    A final note: as much as I tried, I don't have any plans with any info in DegreeOfParallelism. I'm 99.9% certain my code is correct, but that's not 100%.

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    WITH XMLNAMESPACES
    (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT
        query_plan AS CompleteQueryPlan,
        n.value('(@StatementText)[1]', 'VARCHAR(4000)') AS StatementText,
        n.value('(@StatementOptmLevel)[1]', 'VARCHAR(25)') AS StatementOptimizationLevel,
        n.value('(@StatementOptmEarlyAbortReason)[1]', 'VARCHAR(25)') AS StatementOptmEarlyAbortReason,
        n.value('(@StatementSubTreeCost)[1]', 'DECIMAL(18, 6)') AS StatementSubTreeCost,
        n.value('(QueryPlan/@CompileCPU)[1]', 'DECIMAL(18, 0)') AS CompileCPU,
        n.value('(QueryPlan/@CompileTime)[1]', 'DECIMAL(18, 0)') AS CompileTime,
        n.value('(QueryPlan/@CompileMemory)[1]', 'DECIMAL(18, 0)') AS CompileMemory,
        n.value('(QueryPlan/@DegreeOfParallelism)[1]', 'DECIMAL(18, 0)') AS DegreeOfParallelism,
        n.query('.').exist('//MissingIndexes') AS HasMissingIndexes,
        n.query('.') AS ParallelSubTreeXML,
        ecp.usecounts,
        ecp.size_in_bytes
    INTO #X
    FROM sys.dm_exec_cached_plans AS ecp
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS eqp
    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS qn(n)
    WHERE n.query('.').exist('//RelOp[@PhysicalOp="Parallelism"]') = 1

    SELECT * FROM #X

    SELECT SUM(usecounts * DegreeOfParallelism) / SUM(usecounts) AS Weighted_Avg_DOP,
           SUM(usecounts * StatementSubTreeCost) / SUM(usecounts) AS Weighted_Avg_SubTreeCost
    FROM #X

  • Maybe the replies by Grant on another thread can get you started:

    http://qa.sqlservercentral.com/Forums/FindPost1123444.aspx

    I did Grant wrong here by originally telling you to read just one reply of his :blush:

    I'm sorry Grant :crying:

    I should have said "replies" ... I corrected it.

    Priceless info, Grant :smooooth: Thank you once more.

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • ALZDBA (7/26/2011)


    Maybe the reply by Grant on another thread can get you started:

    http://qa.sqlservercentral.com/Forums/FindPost1123444.aspx

    Yup, I got the inspiration for my script from that thread. Very nice info.

  • Your second query is still running, but here is the output of your first query.

    counter                                  final cost
    occurrence                               3189815
    AvgValue                                 2.77
    name                                     cost threshold for parallelism
    config_value                             5
    run_value                                5
    Days Server UPTIME & Last Stats Reset    167.50
    Stats Warning                            NULL
    Cost Threshold Warning                   NULL
    Restart Warning                          NULL

    And here are the top 10 wait type stats.

    WAIT TYPE PERCENTAGE

    CXPACKET 57.72

    LATCH_EX 11.88

    CMEMTHREAD 4.74

    SOS_SCHEDULER_YIELD 4.71

    ASYNC_NETWORK_IO 1172668.81 3.67 82.72

    MSQL_XP 916031.63 2.86

    PREEMPTIVE_OS_GETPROCADDRESS 2.86

    OLEDB 2.02

    TRACEWRITE 2.01

    BACKUPBUFFER 1.19

    -Roy

  • See what happens when you raise the cost threshold for parallelism, and be surprised like I was.

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Roy Ernest (7/26/2011)


    Your second query is still running. But your first query here is the output.

    counter final cost

    occurrence 3189815

    AvgValue 2.77

    name cost threshold for parallelism

    config_value 5

    run_value 5

    Days Server UPTIME & Last Stats Reset 167.50

    Stats Warning NULL

    Cost Threshold Warning NULL

    Restart Warning NULL

    And here is the top 10 Wait type stats.

    WAIT TYPE PERCENTAGE

    CXPACKET 57.72

    LATCH_EX 11.88

    CMEMTHREAD 4.74

    SOS_SCHEDULER_YIELD 4.71

    ASYNC_NETWORK_IO 1172668.81 3.67 82.72

    MSQL_XP 916031.63 2.86

    PREEMPTIVE_OS_GETPROCADDRESS 2.86

    OLEDB 2.02

    TRACEWRITE 2.01

    BACKUPBUFFER 1.19

    What's the avg wait time in ms for CXPACKET?

    BTW, how come you got 57% for CXPACKET and 82.72% for ASYNC_NETWORK_IO???? That doesn't seem to add up really well!

    Here's the modified version I use to include the averages:

    WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR')
    )
    SELECT
        W1.wait_type AS WaitType,
        CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
        CAST (W1.WaitS * 1000 / W1.WaitCount AS DECIMAL(14, 2)) AS AvgWait_MS,
        CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
        CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
        W1.WaitCount AS WaitCount,
        CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
    FROM Waits AS W1
    INNER JOIN Waits AS W2
        ON W2.RowNum <= W1.RowNum
    GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
    HAVING SUM (W2.Percentage) - W1.Percentage < 97 -- percentage threshold
        AND W1.WaitCount > 0;
    GO

  • ALZDBA (7/26/2011)


    See what happens when you raise the cost threshold for parallelism, and be surprised like I was.

    Surprised in a good way or a bad way? What did you trace, and how did you trace the changes?

  • I thought I edited it properly. But I guess not.

    wait_type                      wait_time_s    pct
    CXPACKET                       18468243.63    57.72
    LATCH_EX                        3802397.93    11.88
    CMEMTHREAD                      1516464.84     4.74
    SOS_SCHEDULER_YIELD             1508456.11     4.71
    ASYNC_NETWORK_IO                1172668.81     3.67
    MSQL_XP                          916031.63     2.86
    PREEMPTIVE_OS_GETPROCADDRESS     915633.18     2.86
    OLEDB                            646687.82     2.02
    TRACEWRITE                       641578.03     2.01
    BACKUPBUFFER                     381806.38     1.19

    Your second query ran in 19 min.

    Weighted_Avg_DOP    Weighted_Avg_SubTreeCost
    NULL                17.060217

    Total of 117 rows. Two out of the 117 have timeout as the reason for StatementOptmEarlyAbortReason. There are quite a few that have missing indexes. The DegreeOfParallelism column is NULL. I still have to look at the SPs with missing indexes and see why they got missed at the time of review.

    EDIT (re: how come you got 57% for CXPACKET and 82.72% for ASYNC_NETWORK_IO): the 82.72% was actually the Running_pct column. The query I used to get the waits was written by Glenn Berry - one of his 5 Super DMV queries.

    -Roy

  • Hey Roy,

    I don't think there's much to choose here between DOP 1 and query hints on the very few queries that would benefit, versus DOP 4 (say) and a higher cost threshold. The effect is likely to be very similar, from what you've told us. If I could see your system and queries for a few hours, I might sway one way or the other, but as it is I say 'meh'. If you have a number of queries that would meet a much higher cost threshold, then by all means go the DOP + cost route. If you have a stable system now where almost all queries are simple and run acceptably fast, I might not take the (small) risk. Only you can really make this decision, though as I say, I don't think there's too much in it either way.

    I do want to respond to the comments about CXPACKET though. Very few people understand what CXPACKET truly means (and I do not number storage engine experts among them). For almost all intents and purposes, CXPACKET should be ignored. It is a consequence of running a parallel query, nothing more. People do tend to see CXPACKET 'waits' and assume they mean something similar to PAGEIOLATCH_* or whatever - but they really really don't. Yes, if you run fewer parallel queries, and/or fewer parallel regions, you will see CXPACKET numbers decline. No surprises there. It does not mean your server has become more efficient, or that response time has improved, or much of anything else.
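
    One practical way to act on that is simply to look at the wait picture with CXPACKET excluded and see what is left. A minimal sketch along those lines (the exclusion list is only an illustrative subset, not the full benign-wait list from the script posted earlier):

    SELECT TOP (10)
        wait_type,
        wait_time_ms / 1000.0 AS wait_time_s,
        CAST(100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS DECIMAL(5, 2)) AS pct
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN ('CXPACKET', 'LAZYWRITER_SLEEP', 'SLEEP_TASK',
                            'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'CHECKPOINT_QUEUE')
    ORDER BY wait_time_ms DESC;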

  • Edit: Saw Paul's answer after I posted. I'd defer to his experience rather than my very limited knowledge with this.

    I'm sure you're on top of this, but stale stats will increase CXPACKET waits as well.

    From what you've sent so far I probably wouldn't go to 30-50 for the cost threshold; I'd actually try something much lower, like 10-15, so as NOT to penalize the queries that are using parallelism at the moment.

    You seem to have either very very low cost queries or relatively cheap "big" queries.

    That's why I wanted you to see the real average and then the average of the bigger queries. Of course, if you use recompile hints you're not going to see those in the cache. At 13K queries/sec, saving the actual plan with stats is obviously out of the question :w00t:.

    The other question is also how long the plans stay in the cache... that would be another good piece of info to have in those queries.
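
    A minimal sketch of one way to check that, assuming the plans are still cached; it just reads creation_time from sys.dm_exec_query_stats, so recompiles and cache evictions will reset the clock:

    SELECT TOP (50)
        qs.creation_time,
        DATEDIFF(hour, qs.creation_time, GETDATE()) AS hours_in_cache,
        qs.execution_count,
        st.text AS statement_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.creation_time ASC;   -- oldest plans first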

    It bugs me that you don't see anything for DegreeOfParallelism either. Looks like there's something I don't understand in there (either the caching or the XML query). :unsure:

  • SQLkiwi (7/26/2011)


    I do want to respond to the comments about CXPACKET though. Very few people understand what CXPACKET truly means (and I do not number storage engine experts among them). For almost all intents and purposes, CXPACKET should be ignored. It is a consequence of running a parallel query, nothing more. People do tend to see CXPACKET 'waits' and assume they mean something similar to PAGEIOLATCH_* or whatever - but they really really don't. Yes, if you run fewer parallel queries, and/or fewer parallel regions, you will see CXPACKET numbers decline. No surprises there. It does not mean your server has become more efficient, or that response time has improved, or much of anything else.

    I'm glad you clarified this Paul. You'd made the comment before on another thread and I had become concerned I'd missed something hugely important. I'd been meaning to do some research but you just eased my mind.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA
