High Latency on Publisher to Distributor but Low Latency to Subscriber

  • Setup:

Two SQL Server boxes set up identically: drive size, file location, memory, CPU.

-The primary/transactional server hosts the transactional database as well as the distributor.

-We publish about 729 articles, of which about 100 actually change frequently.

-One table that is published contains close to 100M rows; the next biggest has only about 1.5M rows. The 100M-row table sees very frequent inserts, but that's about it.

-A bad practice, but not by choice: the distributor MDF and transactional DB MDF are on the same drive.

-IO contention does not seem high on the disk with the MDFs.

I've never seen the latency for the distributor this high in this environment. In the past it was 2-3 seconds at most, and the subscriber about the same.

The last tracer token I inserted took 1 hour 52 minutes from publisher to distributor, while taking only 1 minute 34 seconds from distributor to subscriber. The second number is also a lot higher than normal, but I need to tackle the higher one first.

Anything I should be looking for in particular to help determine the root cause? If you need more info about the environment let me know.
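In case it helps, here is roughly how I'm posting and checking the tracer tokens from T-SQL rather than Replication Monitor (a sketch; the database and publication names are placeholders, and the latency columns only fill in once the token has moved through):

```sql
-- Run at the publisher, in the published database.
USE MyPublishedDB;
GO
DECLARE @tokenId int;

-- Post a tracer token into the publication's transaction log.
EXEC sys.sp_posttracertoken
    @publication     = N'MyPublication',
    @tracer_token_id = @tokenId OUTPUT;

-- Later, check publisher->distributor and distributor->subscriber
-- latency for that token (token ids can also be listed with
-- sp_helptracertokens if you no longer have the variable).
EXEC sys.sp_helptracertokenhistory
    @publication = N'MyPublication',
    @tracer_id   = @tokenId;
```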

It's possible there is a single large transaction being copied between servers. Alternatively, is there any contention showing on the subscriber? Can you see commands being applied at the subscriber using Profiler?

    Sometimes it is helpful to restart the distribution agent on the distributor.

    Can you run this SQL and give us some information regarding the undistributed commands?

```sql
USE distribution;
GO

SELECT ag.name
     , ag.publication
     , a.article
     , s.UndelivCmdsInDistDB
     , s.DelivCmdsInDistDB
FROM dbo.MSdistribution_status s
INNER JOIN dbo.MSdistribution_agents ag
    ON ag.id = s.agent_id
INNER JOIN dbo.MSarticles a
    ON a.publisher_id = ag.publisher_id
   AND a.article_id = s.article_id
ORDER BY s.UndelivCmdsInDistDB DESC
       , ag.publication;
```

• Interesting, so there is one specific article with a very high number of undelivered commands, but it does seem to fluctuate up and down rather than constantly grow. The number never seems to drop to 0, though. At most 11 articles have shown up in this query over the few runs of it.

These are the numbers from the top two article results from this query:

Undelivered  Delivered
218,161      421,636
179          685

In 4 out of 5 runs, the second table listed had 0 undelivered commands.

  • But the commands appear to be delivered rapidly?

If so, there would appear to be a bottleneck with your log reader, as this is the agent that reads the tracer tokens (along with everything else) out of the transaction log and writes them to the distribution database. This agent tends to be CPU constrained. Are you seeing higher than usual CPU usage? It may be worth stopping and restarting the agent to see if that makes a positive difference. If not, we need to look at the database transaction log for a potential issue.
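Two quick checks can show where the log reader stands (a sketch; sp_replcounters reports cumulative figures since the log reader last started):

```sql
-- Run at the publisher, in the published database.

-- Per-database log reader throughput and current latency (seconds),
-- plus the count of replicated transactions waiting in the log:
EXEC sys.sp_replcounters;

-- Oldest active transaction, and the oldest replicated transaction
-- not yet delivered to the distribution database:
DBCC OPENTRAN;
```

If DBCC OPENTRAN shows a very old non-distributed replicated transaction, that points at the log reader being stuck rather than merely slow.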

CPU usage is fairly low, so I don't think the bottleneck is there.

So if the log reader agent is having issues, it wouldn't pick up transactions to insert into the distribution database for replication to the subscriber? That makes sense in terms of what we are seeing; I am curious what may cause this issue. I could restart the agent tonight and see if the issue is resolved during peak hours tomorrow.

Your understanding is correct. It is also possible to tune the agent by adjusting profile parameters, which is especially useful on high-transaction-volume databases. The key ones are batch size and polling interval.

    I have my polling interval set to 1 second and the batch size at the default of 500. You may want to test/tweak as appropriate.
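You can see which profiles exist for the log reader and what their parameters are from the distributor (a sketch; agent type 2 is the Log Reader Agent, and the profile_id value below is just an example taken from the first query's output):

```sql
-- Run at the distributor.

-- List the available Log Reader Agent profiles:
EXEC sys.sp_help_agent_profile @agent_type = 2;

-- Show the parameters for one profile (use a profile_id
-- returned by the query above):
EXEC sys.sp_help_agent_parameter @profile_id = 2;
```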

I appreciate your help, Jimbo. I am relatively new to replication. I get the basics, of transactional at least, but I am trying to learn how to tweak and optimize replication as well as troubleshoot some of these stranger issues.

Here are the current Log Reader Agent settings; these were set up before me and I believe are the defaults.

    -HistoryVerboseLevel: 1

    -LoginTimeout: 15

    -LogScanThreshold: 500000

    -PollingInterval: 5

    -QueryTimeout: 1800

    -ReadBatchSize: 500

    You suggest the polling interval be set to 1. Do any of these other settings need to be tweaked?

    I am using this page as reference for the settings:

    http://msdn.microsoft.com/en-us/library/ms146878%28v=SQL.90%29.aspx

    Do you recommend any other pages for reference on this?

Those look like the default parameters. Do you encounter a number of larger transactions which affect a lot of rows? For example, an update statement which updates 100,000 rows in a single transaction? This would be written into the distribution database as 100,000 commands but would be treated as 1 transaction with the default batch size.

It may be worth setting up a custom agent profile with a polling interval of 1 second and a -ReadBatchThreshold of between 500 and 10,000. This is the threshold at which the agent writes commands into the distribution database, breaking the example given above down into several chunks. It will want tweaking depending on your environment.
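Something like this would create the custom profile (a sketch, run at the distributor; the profile name and description are made up, and you would then assign the new profile to the Log Reader Agent from Replication Monitor or the agent's properties):

```sql
-- Run at the distributor. Creates a custom Log Reader Agent profile
-- (agent_type 2) and sets the two parameters discussed above.
DECLARE @profile_id int;

EXEC sys.sp_add_agent_profile
    @profile_id   = @profile_id OUTPUT,
    @profile_name = N'LogReader - low latency',   -- placeholder name
    @agent_type   = 2,                            -- Log Reader Agent
    @description  = N'1s polling, raised read batch threshold';

EXEC sys.sp_add_agent_parameter
    @profile_id      = @profile_id,
    @parameter_name  = N'-PollingInterval',
    @parameter_value = N'1';

EXEC sys.sp_add_agent_parameter
    @profile_id      = @profile_id,
    @parameter_name  = N'-ReadBatchThreshold',
    @parameter_value = N'10000';
```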

    Can you run this on your published db please? It will give us some info on the logfile structure.

    DBCC LOGINFO
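If it's easier to report just a count, DBCC LOGINFO returns one row per virtual log file, so the row count is the VLF count (a sketch; the column list below matches SQL 2005/2008, and later versions add a leading RecoveryUnitId column):

```sql
-- Run in the published database. One row per VLF.
CREATE TABLE #loginfo (
    FileId      int,
    FileSize    bigint,
    StartOffset bigint,
    FSeqNo      int,
    Status      int,
    Parity      tinyint,
    CreateLSN   numeric(25, 0)
);

INSERT INTO #loginfo
EXEC ('DBCC LOGINFO');

SELECT COUNT(*) AS vlf_count FROM #loginfo;

DROP TABLE #loginfo;
```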

  • We use transactional replication to sync up our web server.

    The publication was scheduled to sync publisher to distributor every 5 minutes.

    Some time ago, we found that the latency from publisher to distributor was very high, to the point where it wouldn't receive any updates, but the latency from distributor to subscriber was nearly non-existent.

    We were also receiving an error message similar to "log reader agent could not execute sp_replcmds"

The solution was to increase the -QueryTimeout property of the agent profile from 1800 to 7200.

    This helped some but we would still notice the queue getting backed up after a larger amount of transactions.

    We found that setting the schedule to run continuously instead of every 5 minutes solved the issue.

    I'm wondering if this would help your situation as well.

Could your queue just be backed up to the point where it can't process any more?

    What do you have your sync schedule set to?

    HTH,

    Ben

• DBCC LOGINFO gives me 14,740 rows; that was from running it this morning, which is a non-peak time when there are probably very few users in the application. Anything in particular you would want to see from that? We have one table for which I noticed the highest number of undelivered commands in the original SQL you asked me to run. This table on average would update 50,000 to 150,000 rows for any given update statement. Based on that, it sounds like you are recommending changing the ReadBatchThreshold to maybe 10,000?

• There are two things I'd probably recommend initially. First, reduce the number of virtual log files (VLFs) you have. 14,740 is a LOT, and it will hurt log file performance for a number of things. There are several articles on how to do this (a couple on SSC).

The second thing I would do is change the ReadBatchThreshold to quite a high number. In this instance 150,000 rows updated in a single transaction is a very high number, much higher than I expected, so 10,000 may be too low. Definitely reduce the PollingInterval to 1 second.
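The usual way to reduce the VLF count is to shrink the log right down, then regrow it in a few large increments so it is rebuilt from a small number of large VLFs (a sketch; database name, logical log file name, and sizes are all placeholders, and the log needs to have been backed up first so the space can actually be released):

```sql
-- Run in the published database, after a log backup.
-- Names and sizes below are placeholders for illustration.
DBCC SHRINKFILE (MyDB_log, 1024);   -- shrink the log as far as it will go

-- Regrow in one large step so the new log is built from few, large VLFs:
ALTER DATABASE MyDB
MODIFY FILE (NAME = MyDB_log, SIZE = 8192MB);

-- Also fix autogrowth so future growth doesn't create thousands of VLFs:
ALTER DATABASE MyDB
MODIFY FILE (NAME = MyDB_log, FILEGROWTH = 1024MB);
```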

• OK, I'll take a look for articles on that. I've changed the polling interval and set the ReadBatchThreshold to 10,000 for now. We are running a load test on the application now, and if I notice the latency increasing I'll try tweaking that again.

• Jimbo, your sending me down the route of looking into VLFs led me to a post by Linchi Shea about performance with a high vs. low number of VLFs. I decided to search around the blog briefly and came across this:

    http://sqlblog.com/blogs/linchi_shea/archive/2009/06/05/distribution-latency-in-transactional-replication-is-a-volume-surge-the-culprit.aspx

I'm dissecting the Perl script to pull out the queries as parameterized T-SQL.

Related to the VLFs: it is something I am going to put into motion but can't really do myself. The client's DBA team uses a backup tool, and if we shrink the transaction log or back it up without using this tool, the tool loses its place in the backup chain and fails as a result. Once they shrink it, they will need to change the initial size and the growth settings to ensure we don't end up with 15k+ VLFs again when/if the log grows.

• I hope you manage to get it sorted, but it looks like you're on the right track to get it resolved.

    One other thought, was there a new software release/update around the time you saw the replication latency for the first time? Might be worth a look as well.

  • Update:

I think we have slowly started to resolve the issue. Reducing the number of VLFs has helped a lot! We then noticed the sheer number of commands was high. We have a few stored procedures that do 50,000-row updates on average, and I found out these procedures were replicating schema only. We have changed these articles to replicate the execution, which has reduced the number of commands significantly. We also changed the number of subscription streams, which seems to have created another issue that I think we will solve by setting the value back to 1: it has caused a few blocking situations that make the agent fail, and it also blocks everything else running on the subscriber database.
