Random excessive CPU consumption

  • Hi all. I'm in a bit of a quandary with one of our Analysis Services servers, and I'm hoping you may have some suggestions on how to solve our issue. Basically we have 7 Cube servers, all with identical system specs. Startup log:

    (8/6/2020 11:23:20 AM) Message: Service stopped. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210001)
    (8/6/2020 11:26:22 AM) Message: The Query thread pool now has 1 minimum threads, 32 maximum threads, and a concurrency of 32. Its thread pool affinity mask is 0x000000000000ffff. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121000A)
    (8/6/2020 11:26:22 AM) Message: The ParsingShort thread pool now has 4 minimum threads, 32 maximum threads, and a concurrency of 32. Its thread pool affinity mask is 0x000000000000ffff. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121000A)
    (8/6/2020 11:26:22 AM) Message: The ParsingLong thread pool now has 4 minimum threads, 32 maximum threads, and a concurrency of 32. Its thread pool affinity mask is 0x000000000000ffff. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121000A)
    (8/6/2020 11:26:22 AM) Message: The Processing thread pool now has 1 minimum threads, 64 maximum threads, and a concurrency of 32. Its thread pool affinity mask is 0x000000000000ffff. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121000A)
    (8/6/2020 11:26:22 AM) Message: The IOProcessing thread subpool with affinity 0x000000000000ffff now has 1 minimum threads, 160 maximum threads, and a concurrency of 32. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121000B)
    (8/6/2020 11:26:31 AM) Message: The flight recorder was started. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210005)
    (8/6/2020 11:26:31 AM) Message: The query log was started. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210003)
    (8/6/2020 11:26:31 AM) Message: Started listening on TCPIP: Status = ListenToPortSuccess, ComputerName = PRODCUBE03, PhysicalNetBIOSName = PRODCUBE03, ListenOnIP = [::]:2383, Port = 2383, ProtocolType = IP6 (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210040)
    (8/6/2020 11:26:31 AM) Message: Started listening on TCPIP: Status = ListenToPortSuccess, ComputerName = PRODCUBE03, PhysicalNetBIOSName = PRODCUBE03, ListenOnIP = 0.0.0.0:2383, Port = 2383, ProtocolType = IP4 (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210040)
    (8/6/2020 11:26:31 AM) Message: Processor Information: NumberOfNumaNodes = 1, NumberOfAffinityGroups = 1, NumberOfCPUs = 1, NumberOfPhysicalCores = 16, NumberOfLogicalCores = 16, PhysicalMemoryAvailable = 63 GB (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210041)
    (8/6/2020 11:26:31 AM) Message: Service started. Microsoft SQL Server Analysis Services 64 Bit Enterprise Core (x64) SP2 12.0.5687.1. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x41210000)
    (8/6/2020 11:26:31 AM) Message: Software usage metrics are disabled. (Source: \\?\R:\Analysis\Log\msmdsrv.log, Type: 1, Category: 289, Event ID: 0x4121002E)

    6 of the cube servers are customer facing and synchronize from a main cube server, which gets fully processed every 12 hours. Each customer-facing cube server is issued identical queries from our client portal. About 10 days ago, one of the customer-facing servers suddenly started consuming 100% CPU (normally 5-20%) when queries are run against the cubes. All of the consumption is by msmdsrv.exe. Because it hits 100% and holds there, query threads stack up and execution durations go through the roof. We verified that the queries are still identical to those hitting the other servers, and monitored the network ports to make sure nobody who shouldn't be is accessing the cubes. We have tried restarting the OLAP service, rebooting the server, performing a full process directly on that server, and applying all the recent cumulative updates, but the CPU is still completely consumed by msmdsrv.exe. CPU is normal during synchronization, but once the server is put back into the rotation and customer queries hit it, things go south immediately.

    Again, this all started out of the blue, with no rhyme or reason as to why. Any suggestions would be very much appreciated!


  • My first thought on this is that the one SSAS instance is different from the others in some way, which seems apparent to me since it alone uses all the CPU.  Now, the code behind it may be identical, but is the data?  If the data is different, it might be that for that one server you are paging to disk, or that queries are being parallelized differently.

    If you have, for example, a 4 core system, 20% of the CPU being used might be 20% per core for a total of 20% OR it could be 80% on one core.
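    The arithmetic behind that can be sketched in a few lines of plain Python (the numbers are made up to illustrate the point):

    ```python
    # Toy helper: given per-core busy percentages, show why an aggregate
    # "20% CPU" figure can hide one saturated core.

    def summarize_cpu(per_core_busy):
        """Return (aggregate_percent, hottest_core_index, hottest_percent)."""
        aggregate = sum(per_core_busy) / len(per_core_busy)
        hottest = max(range(len(per_core_busy)), key=lambda i: per_core_busy[i])
        return aggregate, hottest, per_core_busy[hottest]

    # Four cores, one pegged at 80%: the overall figure is still only 20%.
    agg, core, pct = summarize_cpu([80.0, 0.0, 0.0, 0.0])
    print(agg, core, pct)  # 20.0 0 80.0
    ```

    So an "everything looks normal at 20%" reading can mean a very different thing per core, which is why per-core counters are worth a look.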

    If the data normally fits in memory, and 10 days ago that one server crossed the threshold where it no longer does, the resulting paging to disk would hurt performance badly.

    I'd be looking at how much memory each of the 6 cube servers uses and checking for any abnormal values on the one pegged at 100% CPU.  If that isn't the case, I'd be checking other things like disk I/O and network I/O.  Check the resources.  It might not hurt to do some investigatory work on the server itself.  Is the antivirus going crazy?  Do the Windows logs indicate anything else going goofy at that time?  Is there a server backup happening at the same time? etc.

    Task Manager indicating it is msmdsrv.exe likely means it IS SSAS using all the CPU, but it could be due to something else, like paging to disk or that one server dealing with a much larger data set than the others.


    That being said, I've not seen this particular problem before, so I may be leading you down some rabbit holes... but hopefully one of them leads to the rabbit!

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks for the reply, Brian! Great ideas, and I did look at all the areas you mentioned before I made my initial post. Memory is consistent across all 6 cube servers; it is just this one server having the CPU issue. One thing I should have mentioned: on the CPU side, '% Privileged Time' is the counter showing all the consumption. DPC, interrupt, and user time all look normal. We are running 16 cores, all of which are at 99-100% consumption. I believe all the data should be the same, as all 6 customer-facing cube servers synchronize from the same source cube on a different server every 12 hours, but I will take a closer look to see if I can find additional differences.

  • How is the page file utilization while the CPU is spiking like that?  If page file utilization starts to climb during the spike, it might indicate that something is asking the OS for more memory than fits in RAM, forcing the OS to page to disk.

    As another thought, if you use a different server to process the data on that one server, do you notice that spike?  Pain to get that set up I'm sure, but if you can replicate the problem from that one server on another box somewhere, troubleshooting can be nicer as you can do repeated runs of that refresh without impacting anyone.

    Is there a chance that someone installed some other tool on that box that isn't on the other 5?  Just wondering if something like a malware scan or something is being triggered.  Or different service packs/CU/patch level on both the SQL and the Windows side?

    And just as a thought, in the startup logs on the other 5 servers, do the thread pool sizes look to be configured the same?  Just wondering if MAYBE the one is making a lot more threads than the other 5 servers.
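    The thread pool lines in the startup log excerpt in the first post have a regular format, so one quick way to compare servers is to extract and diff those settings with a small script. A sketch (the second log line below is invented to show what a mismatch would look like):

    ```python
    import re

    # Extract thread pool settings from msmdsrv.log startup messages.
    # The message format is taken from the log excerpt in the first post.
    POOL_RE = re.compile(
        r"The (\w+) thread pool now has (\d+) minimum threads, "
        r"(\d+) maximum threads, and a concurrency of (\d+)"
    )

    def pool_settings(log_text):
        """Map pool name -> (min_threads, max_threads, concurrency)."""
        return {
            m.group(1): tuple(int(g) for g in m.groups()[1:])
            for m in POOL_RE.finditer(log_text)
        }

    # Hypothetical excerpts from two servers' logs:
    server_a = "Message: The Query thread pool now has 1 minimum threads, 32 maximum threads, and a concurrency of 32."
    server_b = "Message: The Query thread pool now has 1 minimum threads, 64 maximum threads, and a concurrency of 32."

    a, b = pool_settings(server_a), pool_settings(server_b)
    diffs = {name: (a.get(name), b.get(name))
             for name in set(a) | set(b) if a.get(name) != b.get(name)}
    print(diffs)  # {'Query': ((1, 32, 32), (1, 64, 32))}
    ```

    Running that over the full startup log from each of the 6 servers would surface any pool configured differently on the troubled one.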


    Now, everything I indicated above might also be rabbit holes leading nowhere.  As a dumb question, could it be the server just needs a reboot?  Windows does like its reboots 🙂


  • What happens if you submit one of the problem queries directly to the server? And what happens when you run it under user-level security?

    Have you redeployed the cube to the troubled machine?

    Also verify that the configuration file on this server instance matches the other servers.
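    One way to replay a suspect query directly, outside the portal: SSAS speaks XMLA, so if the instance is exposed over HTTP (via msmdpump.dll), an Execute envelope can be POSTed straight to it. This sketch only builds the envelope and does not send it; the URL setup is assumed, and the MDX statement and catalog name are hypothetical.

    ```python
    from xml.sax.saxutils import escape

    def xmla_execute(mdx, catalog):
        """Build an XMLA Execute SOAP envelope for an MDX statement."""
        return f"""<Envelope xmlns="http://schemas.xmlsoap.org/soap/envelope/">
      <Body>
        <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
          <Command><Statement>{escape(mdx)}</Statement></Command>
          <Properties>
            <PropertyList>
              <Catalog>{escape(catalog)}</Catalog>
              <Format>Multidimensional</Format>
            </PropertyList>
          </PropertyList>
          </Properties>
        </Execute>
      </Body>
    </Envelope>"""

    # Hypothetical cube and catalog names for illustration only.
    envelope = xmla_execute("SELECT {} ON 0 FROM [SalesCube]", "SalesCatalog")
    ```

    Timing the same statement against the troubled server and a healthy one, with the portal out of the loop, would narrow down whether the slowdown is in the engine itself.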
