Transactional Replication - not delivering all commands

  • I have a transactional replication publication that is replicating a single table to another server. Occasionally, about once every 1-2 days, there are some commands not delivered to the subscriber. I check for this by doing a full join on the two tables across servers looking for rows in either table but not the other. When rows are found to be missing from either side I have looked at the results of sp_browsereplcmds and have been able to find the corresponding command to either insert or delete the rows. I recently started checking msdistribution_history and did not find the corresponding xact_seqno there.

    An average day contains approximately 10,000-15,000 replicated commands with only 5-10 not being replicated when this problem occurs. Sometimes the 5-10 rows comprise their own transaction (xact_seqno), sometimes they are a portion of the commands for a larger transaction with all the rest being delivered.

    What would cause the replication to skip over a transaction/command and continue on without error? Are there any other system tables or stored procedures where I can find more information about whether the command was replicated? Any thoughts are appreciated.

    Thanks in advance.

  • hi

    Answer to your question, yes!

    I have seen replication skip command on a default profile where it's not mean to skip, however this is not due to SQL's fault, it's always something we miss.

    I suggest the following, first of all you don't mention whether you have full sync subcription. If you don't then it's possible that these rows where never delivered by the snapshot in the setup phase. If you have a full sync then you need to check your replication agent profile. You can do this by going into replication monitor, go onto you distribution agent and then agent profile, there are a couple of profiles, the default one has no "skip" errors. This means that it should deliver all commands and throw errors when there is a problem.

    Now remeber that unless your subscriber is hidden to all developers it could always be a case where the data could have been deleted, in order to find out if this is the case we need to look if the data that is missing has made it's way into the distributor, you can do this by using sp_browsereplcmds on the distribution database and use the xact_seqno you have. This should return the command. Once you have the command you can also run this individually on the subscriber to ensure that the command gets applied.

    You might find there is some sort of exception that occurs which replication is not meant to or doesn't handle.

    But you definately need to verify that it made it's way into the distribution database.

    If you browse the command in the distributor you can also check the column , is_partial_command or something like that then the statement would be broken into 2 or more rows in the distribution db, the specific table is msrepl_commands.

    So, to summarize,

    1 - make sure it made it's way into distributor

    2 - if it didn't the log reader might not have pick it up, alternatively someone could perform an sp_changearticle or modify the article on the publisher which may also cause the existing data in the distributor to be scrapped, remeber to quiesce replication (flush all pending commands) before doing anything to schemas.

    3 - Once you get the command from distributor run the command manually on subscriber and check for errors.

    4 - depending on the article properties it might skip the row due to FK violations, I have seen these not being handled and end up with missing data.

    5 - verify you replication agent profile and ensure that you are not skipping any errors

    6 - veryify security and make sure nobody deleted the data, remeber this could be a process not always a person.

    7 - check the system log folders for replication output, any errors with distrib.exe will be dumped, if there is any memory dumps that would be a problem.

    8 - make sure it's replication related, it could be that the data is just poked, DBCC checkdb.

    9 - Another thing, if you ran SQL 2005, get SP3 if you don't have it yet. This fixed a whole bunch of replication goodies.

    Cheers

    Jannie

  • Thank you Jannie for the detailed response.

    First, I don't have any filters on the subscription. All rows would have been delivered with the snapshot. The default agent profile is selected. I know that it stops on errors because it has gotten locked up on PK violations before. For example, a delete command was not delivered and later when the same PK was reinserted at the publisher the insert was delivered causing the violation. The agent kept retrying that command and couldn't move on.

    The developers who use that subscriber database only read from the replicated table. I'm the only one who has made updates when I found a non-delivered command. Also, there used to be FK references to the replicated table but they have been removed and the problem still continued. I'm running SQL Server 2008 SP2, maybe I should upgrade to R2?

    I'll have to wait to catch another undelivered command to check the other steps you mentioned.

    Thanks again for the help thus far.

  • no easy fix eh :/

    Well I'd sit and wait for it to happen, then grab the xact_seqno and browse it.

    I bet ya it will give a clue.

    if object_id('pr_check') is not null

    drop proc pr_check

    go

    create proc pr_Check(@table sysname,@Check int=1)

    As

    Begin

    Set nocount on

    --Declare @Table nvarchar(250)

    Declare @Publisher nvarchar(250)

    Declare @Publisherdb nvarchar(250)

    Declare @IndexCount int

    --Declare @Check int

    --Set @Check = 1

    /* 1 - Check missing */

    /* 2 - fix missing */

    /* 3 - Check mismatch */

    /* 4 - fix mismatch */

    Declare @SelectList nvarchar(4000)

    Declare @SelectListB nvarchar(4000)

    Declare @UpdateList nvarchar(4000)

    Declare @InsertList nvarchar(4000)

    Declare @KeyList nvarchar(4000)

    --Set @table = 'tb_useraccount'

    Set @SelectList = ''

    Set @SelectListB = ''

    Set @InsertList = ''

    Set @KeyList = ''

    Set @UpdateList = ''

    if object_id('msreplication_objects') is not null

    Begin

    select top 1 @publisher = left(publisher,charindex('\',publisher,0)-1),

    @publisherdb = publisher_db from msreplication_objects where article = @table

    End

    If object_id('sysarticles') is not null

    Begin

    select top 1 @publisher=@@servername,@publisherdb=DB_NAME() from syssubscriptions a

    inner join sysarticles b

    on a.artid = b.artid

    where b.name = @table

    End

    if object_id('tempdb.dbo.#ColumnList') is not null drop table #ColumnList

    if object_id('tempdb.dbo.#UpdateList') is not null drop table #UpdateList

    if object_id('tempdb.dbo.#KeyList') is not null drop table #KeyList

    /* Get column list*/

    Select name as name into #ColumnList from sys.columns

    where object_name(object_id) = @table

    And name not in ('jtimestamp','helpercolumn')

    /* Get PK*/

    Create table #KeyList (table_qualifier sysname,table_owner sysname,table_name sysname,column_name sysname,key_seq int,PK_Name sysname)

    Insert into #KeyList

    exec sp_pkeys @table

    /* Update list*/

    Select name as name into #UpdateList from sys.columns a

    left join #KeyList b

    on a.[name] = b.[column_name]

    where object_name(a.object_id) = @table

    And a.name not in ('jtimestamp','helpercolumn')

    and b.[column_name] is null

    while exists(select * from #ColumnList)

    Begin

    if (select count(*) from #ColumnList)> 1

    Begin

    Select top 1 @InsertList = @InsertList + '[' + name + '],' + char(10) ,

    @SelectList = @SelectList + 'a.[' + name + '],' + char(10),

    @SelectListB = @SelectListB + 'b.[' + name + '],' + char(10) from #ColumnList

    End

    if (select count(*) from #ColumnList) = 1

    Begin

    Select top 1 @InsertList = @InsertList + '[' + name + ']' + char(10) ,

    @SelectList = @SelectList + 'a.[' + name + ']' + char(10),

    @SelectListB = @SelectListB + 'b.[' + name + ']' + char(10) from #ColumnList

    End

    Delete top (1) from #ColumnList

    End

    while exists(select * from #UpdateList)

    Begin

    if (select count(*) from #UpdateList)> 1

    Begin

    Select top 1 @UpdateList = @UpdateList + 'b.[' + name + '] = a.[' + name + '],' + char(10) from #UpdateList

    End

    if (select count(*) from #UpdateList) = 1

    Begin

    Select top 1 @UpdateList = @UpdateList + 'b.[' + name + '] = a.[' + name + ']' + char(10) from #UpdateList

    End

    Delete top (1) from #UpdateList

    End

    --Print @InsertList

    --Print @SelectList

    Set @IndexcoUNT = 1

    while exists(select * from #KeyList)

    Begin

    if @IndexCount = 1

    Begin

    Select top 1 @KeyList = @KeyList + ' On a.[' + column_name + '] = b.[' + column_name + ']' + char(10) from #KeyList

    End

    ELse

    Begin

    Select top 1 @KeyList = @KeyList + ' And a.[' + column_name + '] = b.[' + column_name + ']' + char(10) from #KeyList

    End

    Delete top (1) from #KeyList

    sET @IndexCount = @IndexCount + 1

    End

    --Print @KeyList

    Insert into #KeyList

    exec sp_pkeys @table

    Declare @String nvarchar(max)

    Select top 1

    @String =

    Case when (OBJECTPROPERTY (object_id(@table),'TableHasIdentity') = 1 And @Check =2 )

    then ('SET IDENTITY_INSERT ' + @table + ' ON') else ('') end + char(10) +

    Case when (@Check = 2) then ('Insert into [' + @table + '](' + @InsertList + ')') else ('') end + char(10) +

    Case when (@Check in (2)) then ('Select ' + @SelectList )

    when (@Check in (1,3)) then ('Select ''' + @table + ''',count(*) ') else ('') end + char(10) +

    Case when (@check = 4) then ('Update b set ' + char(10) + @UpdateList) else ('') end +

    'From '+ @publisher + '.' + @publisherdb + '.dbo.[' + @Table +'] a with (nolock)'+ char(10) +

    'Left Join [' + @table + '] b ' + char(10) +

    @KeyList + char(10) +

    Case

    when (@Check in (3,4)) then ('Where binary_checksum(' + @SelectList + ') <> binary_checksum('+ @SelectListB+ ')')

    When (@Check in(1,2)) then ('Where b.[' + column_name + '] is null') else ('') End+ char(10) +

    Case when (OBJECTPROPERTY (object_id(@table),'TableHasIdentity') = 1 And @Check =2)

    then ('SET IDENTITY_INSERT ' + @table + ' OFF ') else ('') end

    From #KeyList

    Select @string = case when (@check = 2)

    then ('SET ROWCOUNT 1

    SELECT 1

    WHILE @@rowcount > 0

    BEGIN ' + char(10) + @string + char(10) + ' END') else (@string) end

    print @string

    exec sp_executesql @string

    End

    You can use the code above to highlight discrepancies. Or alternatively use tablediff, it also generates the insert \ update \ delete.

    Cheers

    Jannie

  • I caught some more data discrepancies with rows deleted from the publisher but not the subscriber. There was a corresponding command in sp_browsereplcmds for all 7 rows that weren't deleted. Manually running the commands at the subscriber was successful. I've created some triggers to track whether anybody or anything else is updating the replicated table. The search continues.

  • did you check your location of the distrib.exe for any logs.

    You can also increase the verbosity of your agent to give more error output.

    You can also the the msrepl_errors or somethinglike that in the distribution DB.

    Does the problem seem to be related to specific, table, time or some pattern?

    What is the security that the distributor agent & SQL server agent is running under?

    Have you seen any errors like "application failed to initialize 0X0000000F" .

    Have you check correlating windows logs?

    Cheers

    Jannie

  • After increasing the output verbosity as suggested, looking at the replication system tables and adding a few triggers to track what was and wasn't happening to the replicated table on each side I finally found which transactions were not making it to the subscriber.

    The current set up:

    On Server A is Table1 which is replicated to Server B. B.Table1 has a trigger which updates Table2. B.Table2 is replicated back to Server A. Commands performed on B.Table2 that were a result a change on A.Table1 were never sent back to Server A. However, all other updates on B.Table2 from different processes were sent to A.Table2. The publication of B.Table2 had another subscriber at a test server which was receiving all commands but SQL didn't like the "round-trip" replication.

    In short:

    A.Table1 ---(repl)---> B.Table1 ---(trigger)---> B.Table2 ---(repl)---> A.Table2 (last step simply doesn't happen)

    There's nothing in msrepl_errors or in msrepl_distributionhistory.ErrorID. In the comments for msrepl_distributionhistory for that agent it just said zero commands to be delivered. The same thing was reflected in the logging output file with no commands to be delivered to Server A.

    My first thought is to change the subscription for A.Table2 from push to pull with the thinking that it separates the chain of events into two processes. I don't want to make this change until off-hours so I'll have to wait until Monday to see it in production.

    Jannie, thank you for all the suggestions. This already has been quite a learning experience and I still have work to do.

  • I found the answer. I had to create the subscriptions on both sides using sp_addsubscription setting @loopback_detection = 'false'. Thank you for all the advice and suggestions.

  • Hi BT3,

    I had a similar problem.

    The distributor did not show any undistributed commands and the log reader as well as the subscription had no errors and was showing "Running". But there was data inconsistency between the publication database and the subscription database. The counts on the articles that was replicated was not the same. The subscriber was behind with about 450k rows.

    At the end I dropped the subscriber and renamed the original table it was replicating to and recreated the subscriber and ran the snapshot and the replication worked. When the replication broke the first time data was inserted into the table manually. I think the data that was inserted manually was the reason why the replication did not transfer the records.

    This worked for me 😀

    Hopefully it can help you or maybe in future perhaps :w00t:

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply