Distrbution cleanup job causing high CPU usage

  • I am running SQL Server 2005 sp1 on a Windows Server 2003 sp1 machine with 2 3.6 GHz processors and 4 GBs of RAM.

    We have about 12 publications and about 24 push subscribers on the box.

    We upgraded from SQL Server 2000 back in September. Since that time, there has been a general feeling that the server has been underperforming.

    As the SQL Server enthusiast in a firm of developers, I've taken it upon myself to investigate. I am quite certain the problem traces down to the distribution cleanup job. It was orginally scheduled to run every 10 minutes but I noticed it never finished and appeared from the Activity Monitor to accummulate a lot of CPU time and disk I/O. Looking at the Performance tab in Task Manager also confirmed that this job was causing a high level of CPU (i.e. when we killed the process, the CPU usage would drop off dramatically).

    I have experimented with a few things such as scheduling the cleanup job to run once a week (Sunday mornings at 5:00 am) but nothing has helped. Every time the job runs, I get the same results (i.e. it doesn't seem to end and causes high CPU usage).

    When looking at the process in Activity Monitor, I see that the process is running dbo.sp_MSdistribution_cleanup @min-2=0, @max-2=72. The Command is DELETE.

    Does anybody have any ideas as to what the problem is and how I can get it resolved? At the moment, I am having to kill the process everytime the job runs.

    Many thanks.

    - Mike

     

  • This was removed by the editor as SPAM

  • I didn't hear back from anybody on this but in the end, I was able to resolve the problem myself. It was a rather obscure issue so I'm not sure that anybody is interested to hear the solution. However, it did remind me that it is always useful to 'look under the hood' sometimes to see what is going on in system stored procedures.

  • I'd like to hear. Only time I've seen problems with this is when there was a ton of data piled up and it was just performing poorly in general.

  • Sure thing Andy. I will put together the details of my trouble shooting experiences and post it. Would this be too obscure to warrant an article?

  • Here is how I troubleshooted this replication problem.

    As mentioned already in this thread, we had moved the distribution cleanup job to run once a week, Sunday mornings at 5:00 am but it didn't help.

    I next decided take a look ‘under the hood’ to see what was really going on.

    I could tell from the Activity Monitor that the job was taking a long time to run the sp_MSdistribution_delete stored procedure, which is a system stored procedure in the Distribution database. Looking at the code in this stored procedure, one could see that it runs through the MSrepl_transactions table looking for distinct databases and for each, it calls sp_MSdelete_publisherdb_trans stored procedure to get rid of published transactions (in the MSrepl_transactions table) that are older than the default retention period of 72 hours.

     

    Looking at the data in MSrepl_transactions, I noticed that we had 5 databases with transactions going back to September, indicating that the cleanup job had yet to do anything! 3 of these databases I could easily recognize but I had no idea what the other 2 were (more on these later).

    After carefully going through the logic of the stored procedures involved, I decided to run them manually to delete the transactions that should not have been there had the cleanup job worked properly. NOTE: I do not recommend doing this unless you are clear on what is involved. Doing this reduced the number of rows in MSrepl_transactions by 90%!

    I then theorized the next run of the distribution cleanup job would take a few minutes. Unfortunately, it took 9 hours to complete the next time I ran it (but at least is completed successfully). The next time took about the same time too.

    I was about to give up but thought I would take a look at the two databases in MSrepl_transactions that I could not recognize earlier. There was only a single row for each of these. I decided to delete them and run the job again. This time it worked in a few minutes!

    I know have the job running once a day and all seems well.

    Some of the lessons learned from this are (a) don't be afraid to look into the system stored procedures to understand what is happening and (b) you should be able to account for all the databases represented in MSrepl_transactions.

    Also, I've started changing some of the snapshot replications from push to pull, to take a load off the publisher.

     

  • Thanks for the follow up. If you ever run into it again, I'd be curious what the query plan looked like, determine why those spurious entries were causing so much pain!

  • we've been having similar issues and we solved it in two ways

    first we noticed there are now indexes in the distribution db. we set up a job to maintain those indexes because in our case they were extremely fragmented. we run it every few hours and it runs for a few minutes in most cases.

    second if you have any errors in any publications set up the skiperrors parameter and resync once a week or so. this way you have less commands in the queue and it speeds things up.

  • I ran into this issue. For me, it was caused by conflicts. Conflicts were blocking the cleanup job. I resolved the conflicts and the cleanup job finished within a few minutes and CPU usage returned to normal.


    My blog: SQL Soldier[/url]
    SQL Server Best Practices:
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring[/url]
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management[/url]

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply