CXPACKET/Parallelism question

  • Paul,

    Please do tell me where I misspoke. I would hate to have misinformation up to confuse people for years to come. I'd like to edit and correct anything wrong.

    Thanks,

    Jim

    Jim Murphy
    http://www.sqlwatchmen.com
    @SQLMurph

  • Hi,

    I think I sounded bit negative on parallelism in my post, it was not in my intention to say that you should always turn it off as a general fix to all your performance issues, my apologies on not being clearer on that. The reason I've personally often gone with MAXDOP 1 is that wait stats don't lie (or atleast I think they dont), but that's not really SQL Servers fault as much as it's the people who designed the databases and queries that I work with 🙂

    Mika Sutinen, Senior DBA
    @SQLFinn on Twitter
    Blogging at SQLStarters[/url]

  • [Jim].[dba].[Murphy] (2/28/2011)


    Please do tell me where I misspoke. I would hate to have misinformation up to confuse people for years to come. I'd like to edit and correct anything wrong.

    Hi Jim,

    Turns out the parallelism comments here aren't the ones I was thinking of (a different thread, perhaps). There are a couple points I should respond to though, so I'll do that shortly. Sorry for mixing things up and casting unwarranted aspersions.

    Paul

  • Paul,

    No sweat man. You post kind of a lot so it is easy to have all of the threads blend together.

    I'm sure you will address the CXPacket always being parallelism and nothing else. I bet you are going to note that this blanket statement ignores the underlying cause and by looking at the active waits for a given query, we can discern the one or two threads which are waiting on something else - revealing the underlying cause rather than treating the symptoms (since parallelism is rarely the actual and only cause of cxpacket). My blanket statement was focused ONLY on the symptom at hand and I was clarifying the actual definition of this specific wait in an effort to begin to discern the underlying issue - not highlighting the underlying issue itself.

    At any rate, I'd love to hear your feedback and adjust as necessary. Accuracy is quite important to me.

    Thanks for taking the time to share,

    Jim

    Jim Murphy
    http://www.sqlwatchmen.com
    @SQLMurph

  • Ok. First thing is about memory: SQL Server Standard x86 is not limited to 2GB memory (though it is limited by default to 2GB address space aka VAS). Starting the server with the /3GB switch will give SQL Server 3GB of VAS (can be fine tuned between 2GB and 3GB using /USERVA). Running it on a 64-bit OS expands that further to 4GB. All that aside, the primary mechanism to access more memory on 32-bit is AWE, where the AWE-mapped memory can only be used for allocations from the buffer pool. Even on SQL Server Standard, AWE allows access to all the memory the host OS can see - for example up to 64GB (assuming no /3GB) on Windows Server 2003/2008 Enterprise. See http://msdn.microsoft.com/en-us/library/aa366778.aspx and http://blogs.msdn.com/b/psssql/archive/2008/04/05/sql-server-2005-memory-limits-and-related-questions.aspx.

    CXPACKET is only seen in parallel execution plans, but it is almost never possible to draw any useful conclusions by looking at CXPACKET 'waits' alone. A serial query batch executes entirely on a single worker thread. That worker is dedicated to that task until it completes. Even if the task waits for a lock for an hour, the worker thread is never used for any other queries during that time. In a parallel plan, that same worker runs the portion of the plan above the final Gather Streams exchange operator. That worker thread, known as the coordinator, runs that part of the plan inside execution context zero.

    Assuming that the remainder of the plan is a single parallel zone, executing with a runtime degree of parallelism (DOP) of four, there will be four other execution contexts, each with a single worker thread running a serial version of the parallel zone's plan. In plans with more than one parallel zone (where zones are separated by exchange operators - Distribute, Repartition, and Gather Streams) there can be as many worker threads as DOP * zones - but there may be less if the scheduling properties of the plan mean that 'finished' worker threads can be reused. SQL Server can reuse worker threads in a parallel query in much the same way as it can reuse memory grant.

    If the coordinator thread does nothing except fetch rows from the final Gather Streams and return them to the client, it will spend almost all of its time idle (passing rows back is an asynchronous activity). Consider a query than runs in parallel, where each of the four parallel workers take 100ms each to do the work required. All the data is in memory, and there are no waits for locks, disk accesses or anything else - in short, execution is perfect. All four workers complete at exactly the same time, so the whole parallel section takes 100ms (4 lots of 100ms simultaneously). The co-ordinator thread still has to wait 100ms to return the results to the client - and while it waits for packets of rows from the consumer side of the Gather Streams exchange operator, it's wait type is CXPACKET - Consumer eXchange PACKET. Nothing can be done to speed this perfect query, but you will still see close to 100ms of CXPACKET waits for a query that runs in 100ms! There is an example of this common behaviour at http://www.mssqltips.com/tip.asp?tip=2027.

    If you look at sys.dm_exec_requests while that query is executing, you will see that it is waiting on CXPACKET. You would need to look deeper - at sys.dm_os_tasks or sys.dm_os_workers - to see that the four parallel tasks were RUNNING with no wait type, and only the coordinator task running in execution context zero was waiting on CXPACKET. If the final serial portion of the plan (the part run by the coordinator) contains high-cost query plan operators, the coordinator will not be waiting - it will be RUNNING or RUNNABLE, processing the rows through the operators in its part of the plan. As a consequence, CXPACKET waits will be very low, but does that mean the query is any more efficient than the previous example? Of course not.

    A couple of posts referred to 'CPU deadlocks'. This is more commonly known as intra-query parallel thread deadlocks. They come in two forms: detected and undetected. The detected form is picked up by the normal deadlock resolver and the query will eventually complete, though detection and resolution can be a slow process (deadlock check runs by default once every five seconds, more frequently once the first deadlock is discovered). Undetected deadlocks result in all parallel threads waiting on CXPACKET forever. The query will never make progress, and the only option is to kill it. SQL Server 7 and 2000 were especially prone to intra-query deadlocks, but the problem was all but eliminated in SQL Server 2005 SP3 - at least as far as undetected deadlocks are concerned. See http://blogs.msdn.com/b/bartd/archive/2008/09/24/today-s-annoyingly-unwieldy-term-intra-query-parallel-thread-deadlocks.aspx.

    The other common case is where one or more parallel worker threads are waiting on a resource like a lock or a read from disk. Each parallel task will show an appropriate wait type in sys.dm_os_tasks, but the coordinator will still be waiting on CXPACKET, as it waits for rows from the Gather Streams exchange (sys.dm_exec_requests will again show CXPACKET). There are a huge number of other scenarios - but I hope that gives you a feeling for some of the reasons that CXPACKET is often completely normal and expected.

    Hyper-threading. There were indeed problems running SQL Server on pre-Nehalem architecture processors. Mostly, these were associated with the shared-cache architecture of those older processors. Consider two tasks running on two hyper-threads on a single core, with shared L2 cache. The first task is a regular SQL batch, and the second is a backup, Lazy Writer or Checkpoint - all of which process a large number of memory pages. The backup/LW/Chkpt process floods the L2 cache, resulting in the normal SQL batch finding very little data in L2, and having to access much slower main memory instead. The Nehalem architecture solves these problems and HT typically results in 15-30% performance increase. See http://rolfed.com/nehalem/nehalemPaper.pdf for details.

    MAXDOP and OLTP. For a genuine OLTP workload, the optimizer will almost never choose a parallel plan. Most queries will use either trivial plan, or the early stages of QO which do not consider parallelism. It is rare to find a pure-OLTP environment - there will be at least some queries that process a large number of rows, and might benefit from parallelism. It is valid to set server MAXDOP to 1, and override on a per-procedure basis with a MAXDOP hint, but it is not the only workable configuration by any means.

    One of the comments stated that "SQL Server won't schedule another task on a CPU which is waiting to sync with the other CPUs. Meanwhile, that CPU is at 0%. Nice waste of CPU!". This is not the case at all. A worker thread that needs to wait for something signals another worker from the scheduler's runnable queue before yielding. This is one of the essential features of a cooperative environment like SQLOS. There's a great description of how this works at http://msdn.microsoft.com/en-us/library/aa175393(v=sql.80).aspx and http://blogs.msdn.com/b/slavao/archive/2005/07/20/441058.aspx (scroll down to Schedulers & Tasks).

    There's so much to say about parallelism - merging exchanges, SLEEP_TASK and IO_COMPLETION waits, NUMA... but a forum post is not the place. This post is already far too long 🙂

    Paul

    edit: added link for intra-query parallelism deadlock

  • Awesome info, Paul - thanks for that.

    Time to write an article (or ten), maybe...? 🙂

    Well, I managed to reproduce the error on our test system with MAXDOP = 1.

    Looks like I'll be sending out a "Please fix your design" email to the software vendor...

    ------------------------------------------------------------------------
    Bite-sized fiction (with added teeth) [/url]

  • Paul,

    You rock. Thanks for the treaure chest of info!

    Jim

    Jim Murphy
    http://www.sqlwatchmen.com
    @SQLMurph

  • mazzz (3/1/2011)


    Awesome info, Paul - thanks for that.

    Time to write an article (or ten), maybe...? 🙂

    Cheers. Coincidentally, I just now received the editor's comments on my first article for Simple Talk - "Understanding and Using Parallelism". If I can get the minor revisions done tomorrow, it should go out on Monday. I'll post a link here.

    Well, I managed to reproduce the error on our test system with MAXDOP = 1. Looks like I'll be sending out a "Please fix your design" email to the software vendor...

    They'll just ask you to switch to SIMPLE recovery mode or add a few files to your tempdb first :w00t:

    Good luck with your new venture.

    Paul

  • SQLkiwi (3/1/2011)


    CXPACKET is only seen in parallel execution plans, but it is almost never possible to draw any useful conclusions by looking at CXPACKET 'waits' alone.

    Hi!

    Thank you for the parallelism information, good stuff, hopefully we get to read your article soon. I am curious though, is there a percentage of CXPACKET waits that'd immediately tells you that parallelism just isn't working for you? Im asking because we recently started moving our databases from SQL 2000 to SQL 2005 and I was feeling adventurous with parallelism settings. I say adventurous, because our experiences from SQL 2000 were somewhat disappointing with poor performing queries, blocking, etc.

    In our environment I set up parallelism equal to CPU cores per NUMA node (6) and after going live, we had parallelism waits totaling 20-24% of all waits in the server, which to me that seemed like a lot. Dropping parallelism value to 4 had basically no effect whatsoever and going down to 2 left us still with 10-12% of waits being CXPACKETs.

    I'm bit of a paranoid when it comes to performance, so I ended up disabling parallelism at this particular case.

    Best regards,

    Mika

    Mika Sutinen, Senior DBA
    @SQLFinn on Twitter
    Blogging at SQLStarters[/url]

  • mika.sutinen (3/1/2011)


    Thank you for the parallelism information, good stuff, hopefully we get to read your article soon. I am curious though, is there a percentage of CXPACKET waits that'd immediately tells you that parallelism just isn't working for you? Im asking because we recently started moving our databases from SQL 2000 to SQL 2005 and I was feeling adventurous with parallelism settings. I say adventurous, because our experiences from SQL 2000 were somewhat disappointing with poor performing queries, blocking, etc.

    In our environment I set up parallelism equal to CPU cores per NUMA node (6) and after going live, we had parallelism waits totaling 20-24% of all waits in the server, which to me that seemed like a lot. Dropping parallelism value to 4 had basically no effect whatsoever and going down to 2 left us still with 10-12% of waits being CXPACKETs.

    I'm bit of a paranoid when it comes to performance, so I ended up disabling parallelism at this particular case.

    Mika,

    On our server 91.64% of the waits are CXPACKET, and I don't see that as being a problem. In some cases a parallel plan runs a very large query in ~100 minutes, while a serial plan takes ~500. (And that is with just 4 threads.)

    Rather than concentrate on the percentage of waits you should do some actual performance comparisons. (How long are the queries taking when MAXDOP is set of 6, 4, 2, and 1?)

  • Ok, so the first part of my parallelism article went out a couple of days earlier than I expected :blush:

    Link: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/

  • SQLkiwi (3/4/2011)


    Ok, so the first part of my parallelism article went out a couple of days earlier than I expected :blush:

    Link: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/

    That is one fantastic article. When's the book coming out? 😉

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • WayneS (3/4/2011)


    SQLkiwi (3/4/2011)


    Ok, so the first part of my parallelism article went out a couple of days earlier than I expected :blush:

    Link: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/

    That is one fantastic article. When's the book coming out? 😉

    Which one? 😉

  • SQLkiwi (3/4/2011)


    Ok, so the first part of my parallelism article went out a couple of days earlier than I expected :blush:

    Link: http://www.simple-talk.com/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/

    What a fantastic article!!! GREAT job Paul! Thank you very much for sharing it!!



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • A belated congrats from me for your MVP status, Paul - very well deserved!

    I read half your article a few weeks ago, and very much enjoyed what I read so far. haven't had the chance to get back to it since - but I have a four hour train journey to SQLBits this evening, so it's a perfect opportunity 🙂

    ------------------------------------------------------------------------
    Bite-sized fiction (with added teeth) [/url]

Viewing 15 posts - 31 through 44 (of 44 total)

You must be logged in to reply to this topic. Login to reply