SQL Server 2000: The agent is suspect. No response within 10 minutes.

  • Hi,

    I have two servers, one production server (PS) and one backup server (BS) which have transactional replication with a pull subscription.

    When I configure replication, it works fine during our test weekends, when we test with production load. After the tests, replication looks fine for a random number of days. Then, all of a sudden, an error message is displayed on one of the agents: "The agent is suspect, no response within 10 minutes." This has happened a number of times. If I remove replication and configure it again, it always works. Sometimes just updating one of the tables is enough and the error message disappears. The last time (today) that did not work: the update did not replicate and the error message remained.

    Has anyone experienced this same problem and found a good solution? One thing the occurrences have in common is that the error message appears after long periods of inactivity on the servers, or perhaps after a restart, but I am not sure about that.

    Question 1: How can I prevent this error message?

    Question 2: Are there any special things to think about when I need to restart the servers while replication is configured, e.g. after installing updates from Windows Update?

    I would be very grateful for any answers regarding this.

    Best,

    /M

  • Did you ever get a solution to this?

  • This problem tends to happen when the replica is disconnected from the primary for a long time, usually due to time-outs, network outages, heavy locking, etc. It will stop working if it takes you more than 72 hrs to restart the agent.

    We monitor the data replication with separate scripts, and if the agent message shows "retrying" or "Error Time-out expired" we re-run the agent with sp_start_job.

    M$ said that there is more resiliency built into the 2005 agents. I know that the 2000 agents suck when you have such problems.

    Cheers,


    * Noel
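
The monitoring idea described above could be sketched roughly as follows. This is only an illustration, not the poster's actual script: the thread does not say how the agent message is read, so the sketch takes the messages as plain strings, and the function names (`needs_restart`, `start_job_statement`, `plan_restarts`) are hypothetical. The only real object it refers to is `msdb.dbo.sp_start_job`, and it merely builds the statement text rather than running it.

```python
# Status fragments quoted in the post above; matching is case-insensitive.
RESTART_MARKERS = ("retrying", "error time-out expired")


def needs_restart(agent_message):
    """Return True if the agent's last message contains one of the markers."""
    text = agent_message.lower()
    return any(marker in text for marker in RESTART_MARKERS)


def start_job_statement(job_name):
    """Build the T-SQL text that calls msdb.dbo.sp_start_job for one job."""
    escaped = job_name.replace("'", "''")  # double quotes inside a T-SQL literal
    return "EXEC msdb.dbo.sp_start_job @job_name = N'" + escaped + "'"


def plan_restarts(agents):
    """Map {job name: last message} to the statements that should be run."""
    return [
        start_job_statement(name)
        for name, message in agents.items()
        if needs_restart(message)
    ]
```

A scheduler would call `plan_restarts` with whatever the monitoring query returned and pass each resulting statement to its SQL client. Keeping the decision logic separate from the database call makes it easy to test without a server.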

  • Hi, we had similar errors, and the cause was that the agent user's password had changed, so the agent could not be started. Also make sure that your agent user has sufficient access (SQL admin, and the user should exist on the machine as at least a Power User).

    ,l0n3i200n

  • It's a disheartening message to see! Try stopping your agent, then modify the job by adding -outputverboselevel 3 -output c:\repllog.txt so that you can see in real time what is going on. You can also just run the agent from the cmd line instead of the job so that it writes to the console, but I've found the text file method easier to use. Just restart the agent, let it run a minute or two, then check the text file to see what is going on.
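
If several agent jobs need the same change, appending the switches could be scripted. The sketch below is a minimal, hypothetical helper (`add_verbose_logging` is not part of any SQL Server tooling) that works on the job step's command text as a plain string. It assumes the switches are matched case-insensitively and uses the level and path from the post above as defaults.

```python
def add_verbose_logging(step_command, log_path=r"c:\repllog.txt", level=3):
    """Append the logging switches to a job step command, skipping any already present."""
    tokens = [token.lower() for token in step_command.split()]
    extra = []
    if "-outputverboselevel" not in tokens:
        extra += ["-OutputVerboseLevel", str(level)]
    if "-output" not in tokens:
        extra += ["-Output", log_path]
    base = step_command.strip()
    parts = [base] if base else []
    # The original command text is kept as-is; only the missing switches are added.
    return " ".join(parts + extra)
```

Running the helper twice on the same command gives the same result, so it is safe to apply to jobs that may already have been modified by hand.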

  • Just check what the distribution expiry period is. The default is 72 hrs, as said before; you can try increasing it.

    If you are not sure when replication happens, try adding a dummy article and doing an insert and a delete on it periodically to keep the replication active.
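
To decide how often such a dummy insert and delete should run, the arithmetic can be sketched as below. It treats the 72 hours quoted in the thread as the window and assumes the time of the last replicated change is known; the function names and the `margin_hours` safety margin are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION_HOURS = 72  # default quoted in the thread


def hours_until_expiry(last_sync, now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Hours left before the idle period exceeds the window (negative once past it)."""
    deadline = last_sync + timedelta(hours=retention_hours)
    return (deadline - now).total_seconds() / 3600


def needs_heartbeat(last_sync, now, margin_hours=12,
                    retention_hours=DEFAULT_RETENTION_HOURS):
    """True when the remaining time has dropped to the safety margin or below."""
    return hours_until_expiry(last_sync, now, retention_hours) <= margin_hours
```

For example, with a last change at midnight on day 1 and a check at midnight on day 3, 48 hours have passed and 24 remain, so no dummy change is needed yet with a 12-hour margin.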

  • Also, you can increase the retry attempts by modifying the relevant replication job.

    But before that, add the logging as Andy said. That log will give you the exact cause of the problem.




    My Blog: http://dineshasanka.spaces.live.com/

  • To activate the replication again, drop and recreate the subscription (without dropping the articles). If you drop and recreate the articles instead, repopulating the records will take a long time on a production server.

    Balaji L

     
