Transaction log filling up - transactional replication issue

  • Hi All,

    I have an issue whereby once in a while (it happened about a month ago and it's happening again today) the transaction log on my publication database starts to grow at an alarmingly fast rate, expanding the physical file every 5 minutes or so. I have tracked it down to a problem with the Log Reader Agent: if I look in its history there are multiple messages where it is scanning the transaction log for commands to be replicated, and it keeps re-running the same scan, filling the log.

    A snapshot of the agent history is below:

    2008-11-12 14:00:56.886 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 1 commands, 1 xacts. Last xact: 0x0005dc4100004de10001, '0x0005dc4500017a580088'.'.

    2008-11-12 14:02:01.247 Status: 4, code: 22043, text: 'The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 1, 499955 of which were marked for replication, elapsed time 219 (ms).'.

    2008-11-12 14:02:01.247 Status: 4, code: 22043, text: 'The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 499931 of which were marked for replication, elapsed time 1183 (ms).'.

    2008-11-12 14:02:01.263 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 7 commands, 6 xacts. Last xact: 0x0005dc4500017a9a0001, '0x0005dc4a000132280093'.'.

    2008-11-12 14:02:06.279 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 1 commands, 1 xacts. Last xact: 0x0005dc4a000132700002, '0x0005dc4a000132700005'.'.

    2008-11-12 14:02:11.482 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 6 commands, 5 xacts. Last xact: 0x0005dc4a000132b80001, '0x0005dc4a000132b8000d'.'.

    2008-11-12 14:02:58.249 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 6 commands, 5 xacts. Last xact: 0x0005dc4d000006280091, '0x0005dc4d000006280096'.'.

    2008-11-12 14:03:06.546 Status: 4, code: 22043, text: 'The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 499274 of which were marked for replication, elapsed time 90 (ms).'.

    2008-11-12 14:03:06.562 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 1 commands, 1 xacts. Last xact: 0x0005dc4a000132c00007, '0x0005dc4f000018e8008e'.'.

    2008-11-12 14:03:11.562 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 6 commands, 5 xacts. Last xact: 0x0005dc4f000019720001, '0x0005dc4f00001972000d'.'.

    2008-11-12 14:04:10.048 Status: 4, code: 22043, text: 'The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 499284 of which were marked for replication, elapsed time 300 (ms).'.

    2008-11-12 14:04:10.048 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 1 commands, 1 xacts. Last xact: 0x0005dc4f000019790007, '0x0005dc530000da08005a'.'.

    2008-11-12 14:04:10.064 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 2 commands, 1 xacts. Last xact: 0x0005dc530000da340001, '0x0005dc530000da34000a'.'.

    2008-11-12 14:05:48.004 Status: 4, code: 22043, text: 'The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 499715 of which were marked for replication, elapsed time 40603 (ms).'.

    2008-11-12 14:05:48.051 Status: 4, code: 22021, text: 'Batch committed. Batch consisted of 1 commands, 1 xacts. Last xact: 0x0005dc530000da3d0001, '0x0005dc570000bed800e4'.'.

    The system has a fairly low transactional throughput; replicated commands probably number around 10 a minute, so the figure in the snapshot above of ca. 500,000 records marked for replication can't be right. It's almost as if the log reader is not marking the commands as replicated in the log, so they are never purged 🙁
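    For anyone hitting this, a quick sanity check that replication really is what's pinning the log is to look at log_reuse_wait_desc (the database name below is a placeholder):

    ```sql
    -- If log_reuse_wait_desc = 'REPLICATION', the log cannot truncate because
    -- the log reader has not yet processed (or marked as processed) those records.
    SELECT name, log_reuse_wait_desc
    FROM   sys.databases
    WHERE  name = N'MyPubDB';   -- substitute your publication database
    ```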

    The only way I could sort this last time was dropping my publication and setting it up again, but I don't really want to be doing that all the time.

    Any advice would be very much appreciated.

    John

  • Is the transaction log growing or is the error log growing?

    Did that happen straight after an index rebuild, by any chance? Or an alter table statement?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • There is one known issue out there on the latest patches that references transactional replication.

    What service pack are you on? We are on CU6 and have not had these problems... I have had the log reader suddenly stop and not restart for 10 minutes or so, but eventually it restarted.

    You could restart the log reader via its job and see if that helps - but I should imagine the counts are now out on your replicated database, and the only way to get these back in sync would be to do another snapshot.

    Check the undistributed command counts - it's in one of the windows in Replication Monitor (I haven't got SQL up right now to check exactly where).

    Try restarting the log reader and see if that fixes it... otherwise I think you'll need to do another snapshot.

    There are tools to get the data over to the subscriber, but they only do a table at a time... I never got this working myself.
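    If Replication Monitor isn't handy, the undistributed counts mentioned above can also be seen from T-SQL on the publisher:

    ```sql
    -- Run on the publisher: returns one row per published database,
    -- including the replicated transaction backlog, rate and latency figures.
    EXEC sp_replcounters;
    ```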

    Thanks for the quick reply Gail, it's the transaction log on the publication database. I have just found out one of the developers has indeed run an ALTER TABLE command on one of the published tables. Am I quicker dropping the table from the publication and re-adding it? It's only 10,000 rows so it will only take a few seconds to republish.

    Can I be cheeky and ask how you possibly came up with those suggestions from the info I gave? 😛

    John

  • TRACEY (11/12/2008)


    There is one known issue out there on the latest patches that references transactional replication.

    What service pack are you on? We are on CU6 and have not had these problems... I have had the log reader suddenly stop and not restart for 10 minutes or so, but eventually it restarted.

    You could restart the log reader via its job and see if that helps - but I should imagine the counts are now out on your replicated database, and the only way to get these back in sync would be to do another snapshot.

    Check the undistributed command counts - it's in one of the windows in Replication Monitor (I haven't got SQL up right now to check exactly where).

    Try restarting the log reader and see if that fixes it... otherwise I think you'll need to do another snapshot.

    There are tools to get the data over to the subscriber, but they only do a table at a time... I never got this working myself.

    The publication database is SQL 2005 Standard SP2, no CU applied.

    Distribution and subscriber are SQL 2008 Workgroup, no CU applied.

    I have restarted the log reader agent - no luck; I even restarted the server - no luck. The undistributed command count is around 30 (the distribution job runs every 15 mins).

    It sounds like Gail is onto something, so I'm not going to drop the article just yet; there is still a fair bit of disk space to play with before I do that.

    I would have just dropped and re-added it straight away, but it's handy to know why these kinds of things happen, for when you are in a situation where dropping and re-adding is a last resort.

  • Animal Magic (11/12/2008)


    Thanks for the quick reply Gail, it's the transaction log on the publication database. I have just found out one of the developers has indeed run an ALTER TABLE command on one of the published tables. Am I quicker dropping the table from the publication and re-adding it? It's only 10,000 rows so it will only take a few seconds to republish.

    How big's the table? Replication should catch up, though how long it takes may be another story. It may be an idea to temporarily increase the number of records scanned per pass. You'll find that in the log reader agent's profile (somewhere in Replication Monitor).

    Can I be cheeky and ask how you possibly came up with those suggestions from the info I gave? 😛

    It's not cheeky to ask.

    I worked on a replicated system a while back and the replication was always dead on a Monday morning. After investigating for a while (I was a developer, not a DBA) I found out that all the indexes were being rebuilt on Sunday evenings. Those are logged operations, and the log reader has to read through them.

    Besides, the only things likely to generate 500,000 or more log records from a single command are operations that affect an entire table (like an index rebuild or certain ALTER TABLE statements).
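    A quick way to see whether the log reader is stuck behind one huge transaction is DBCC OPENTRAN, which on a published database also reports the oldest replicated transaction not yet processed (the database name is a placeholder):

    ```sql
    -- Shows the oldest active transaction and, on a published database,
    -- the oldest replicated transaction the log reader has not yet processed.
    DBCC OPENTRAN (N'MyPubDB');
    ```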

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Good point Gila... in the default profile there is also a way to increase the max BCP threads, I believe, and to increase the batch sizes from their defaults, in case you're being hit with too much traffic in and out of SQL... due to some nightly jobs still running etc.

    In Replication Monitor, right-click on the SERVERNAME (not the publications) and go to

    Agent Properties... click the two little dots... and change Max BCP Threads from 1 to 15.

    Change the commit batch size to 1000, and increase the query timeout a bit too...

    I have found that sometimes the replicated data does not get over in time, so changing the above (especially the query timeout) works every time - it gives the records more time to get there.

    Save the profile under a new name, e.g. mynewdefault, and it will be used next time... but you'll have to stop the log reader job in SQL Server Agent - look for the LogReader category when you open the job list.

    You may have to stop the job and then restart it first.

    (There are three jobs, so be careful which one you stop and start - just edit the job first to check.)

    Max BCP Threads is a snapshot agent setting and controls how many threads are used to generate the snapshot. It has nothing to do with the log reader or with applying replicated transactions.

    If the log reader is falling behind, the usual recommendation is to change the log reader's profile and increase the Read Batch Size from its default of 500.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • OK, a quick update: I amended the log scan threshold from 500,000 to 5 million. I left it an hour and still no luck; the tran log was still growing and disk space was running short (<20%), so I ended up dropping the publication. When the log still continued to grow, I decided to drop the other two publications on the same publication database as well (they don't include the affected table, but the log reader still has to read the same log). This stopped the log growing; I backed it up and shrank the log to its normal size.
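    For reference, the backup-then-shrink step looks something like this (database, file and path names are placeholders):

    ```sql
    -- Once replication is no longer holding the log, back it up so the
    -- inactive portion can be reused, then shrink the physical file
    -- back to a sensible size (target size in MB).
    BACKUP LOG MyPubDB TO DISK = N'D:\Backups\MyPubDB_log.trn';
    DBCC SHRINKFILE (N'MyPubDB_log', 2048);
    ```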

    One thing I did notice is that someone has changed the recovery model of the publication db from bulk-logged to full, which probably didn't help the log reader. Unfortunately security here is not something that is easy to get on top of, but that's a story for another day.
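    Checking the current recovery model, and putting it back, is straightforward (the database name is a placeholder):

    ```sql
    -- See what the recovery model currently is...
    SELECT name, recovery_model_desc
    FROM   sys.databases
    WHERE  name = N'MyPubDB';

    -- ...and switch it back to bulk-logged if that is your standard.
    ALTER DATABASE MyPubDB SET RECOVERY BULK_LOGGED;
    ```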

  • GilaMonster (11/12/2008)


    indexes were being rebuilt on Sunday evenings. Those are logged operations, and the log reader has to read through them.

    They're only minimally logged under the simple or bulk-logged recovery models 😉

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Perry Whittle (11/12/2008)


    GilaMonster (11/12/2008)


    indexes were being rebuilt on Sunday evenings. Those are logged operations, and the log reader has to read through them.

    They're only minimally logged under the simple or bulk-logged recovery models 😉

    I have a feeling this is what my problem was. The amended table has 6 indexes, including the clustered PK, so the ALTER TABLE would have rebuilt those indexes. With the database in the full recovery model, the log reader got too far behind; bulk-logged, as I normally have it, probably wouldn't have caused the same issues.

  • Oh, someone changed it to FULL recovery mode - that may well do it if you're not backing up the logs every x minutes.

  • I'd go and look at the trace files now to see who changed the mode... I'm sure you can get that out of the trace files... or right-click on the server and get the schema changes / system changes reports, then go and ask why.
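    On 2005 the default trace is usually running, and it's behind those standard reports. A sketch of a query against it (the Object:Altered event class covers ALTER DATABASE among other things):

    ```sql
    -- Locate the default trace file and list recent Object:Altered events,
    -- including who ran them, from where, and when.
    DECLARE @path nvarchar(260);
    SELECT @path = path FROM sys.traces WHERE is_default = 1;

    SELECT te.name AS EventName,
           t.DatabaseName, t.ObjectName, t.LoginName, t.HostName, t.StartTime
    FROM   sys.fn_trace_gettable(@path, DEFAULT) AS t
    JOIN   sys.trace_events AS te
      ON   t.EventClass = te.trace_event_id
    WHERE  te.name = 'Object:Altered'
    ORDER BY t.StartTime DESC;
    ```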

  • TRACEY (11/12/2008)


    Oh, someone changed it to FULL recovery mode - that may well do it if you're not backing up the logs every x minutes.

    Why did someone do it :unsure:? And if they did, why were the logs not backed up? Would be interesting to know...

    Thanks!!

    The_SQL_DBA
    MCTS

    "Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives."

  • I've sussed it - the recovery model (tran logs are backed up hourly) and the ALTER TABLE were not the culprits after all. I had started to replicate a table last night (500,000 rows), and today one of the apps stopped, so I restarted it; it was 30 mins after that that I noticed things starting to go wrong. What is happening (and I should have clicked earlier) is that one of the statements this app runs is an UPDATE with no WHERE clause, every 3 seconds. Half a million updates every 3 seconds is why the log reader couldn't keep up (not surprising really!!)
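    For anyone hitting something similar: a set-based UPDATE with no WHERE clause logs one record per row, so a statement like this (table and column names are made up) generates one log record per row in the table every time it runs:

    ```sql
    -- Hypothetical example of the offending pattern: no WHERE clause,
    -- so every row in the published table is updated (and logged) on every run.
    UPDATE dbo.AppStatus
    SET    LastPolled = GETDATE();   -- ~500,000 rows touched every 3 seconds
    ```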

    I have asked the 3rd-party devs on numerous occasions to sort this out, but it still hasn't been done; until it is, I will use snapshot replication once a day, overnight.

    Thanks all for your help today though! Even if most of it was a wild goose chase.
