[389-users] Multimaster replication out of sync

mitja.mihelic at arnes.si (Mitja Mihelič) · Wed, 16 Dec 2009 14:01:28 +0100

On 12/12/2009 12:06 AM, Rich Megginson wrote:
> Mitja Mihelič wrote:
>>
>>
>> On 12/07/2009 05:18 PM, Rich Megginson wrote:
>>> Mitja Mihelic wrote:
>>>> Hi!
>>>>
>>>> We have two instances of the DS in a multimaster replication setup.
>>>> We had to restore the database of one of the servers from backup.
>>>> While the second master was down, the first was receiving updates.
>>>> After we fired up the restored master it started receiving updates as
>>>> soon as a change occurred on the first master (i.e. after 15 minutes).
>>>> After the sync finished, we noticed they weren't identical.
>>>> Clicking "Send updates now" from the replication agreement does not 
>>>> help.
>>>>
>>>> Is there a way to get them synced up again, other than reinitializing
>>>> the second/restored master?
>>> How long was the server down?  How old was the backup it was 
>>> restored from?
>> The server was not down long, but the backup was about 10 hours old.
>> This was a backup at filesystem level made by ufsdump. It was not a 
>> "regular" DS backup.
>> When we restored the database file from the dump the server booted OK.
>>
>> Then we ran a little test:
>> - made another ufsdump of the second master
>> - shut down the server
>> - let the primary master update for an hour
>> - restored the second master's database from the dump
>> - started the second master
>> - let them do their replication magic
>> - isolated both servers (i.e. no updates)
>> - compared the LDIF dumps
>> Again, they were not the same.
>>
>> We probably should have used the built-in backup functionality, right?
> Yes, although I'm not sure what would be causing the problems you see.
>
> In general, when the database state changes, you have to reinitialize 
> replication.
We tried the built-in backup:
/usr/lib/dirsrv/serverReplica/db2bak /var/lib/dirsrv/serverReplica/bak/`date +%Y_%m_%d_%H_%M_%S`

We then executed the same test procedure as described above.

There are still entries on the primary server that do not get replayed 
on the secondary.
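
For anyone repeating the comparison, here is a minimal sketch of how the entries missing on the secondary could be listed from two LDIF dumps. It only compares DNs (lowercased), not attribute values, and the function names and sample entries are made up for illustration; it is not part of the DS tooling.

```python
import base64


def unfold(lines):
    """Join LDIF continuation lines (a line starting with one space
    continues the previous line)."""
    logical = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(" ") and logical:
            logical[-1] += line[1:]
        else:
            logical.append(line)
    return logical


def entry_dns(lines):
    """Return the set of DNs found in an LDIF dump, lowercased.

    Assumption: lowercasing is enough normalization for a rough comparison.
    """
    dns = set()
    for line in unfold(lines):
        if line.startswith("dn::"):
            # "dn::" marks a base64-encoded DN
            dn = base64.b64decode(line[4:].strip()).decode("utf-8")
        elif line.startswith("dn:"):
            dn = line[3:].strip()
        else:
            continue
        dns.add(dn.lower())
    return dns


def missing_on_consumer(supplier_lines, consumer_lines):
    """DNs present in the supplier dump but absent from the consumer dump."""
    return sorted(entry_dns(supplier_lines) - entry_dns(consumer_lines))


# Hypothetical sample data standing in for the two real dumps.
primary = [
    "dn: uid=alice,ou=People,dc=example,dc=com\n",
    "uid: alice\n",
    "\n",
    "dn: uid=bob,ou=People,dc=example,dc=com\n",
    "uid: bob\n",
]
secondary = [
    "dn: uid=alice,ou=People,dc=example,dc=com\n",
    "uid: alice\n",
]
print(missing_on_consumer(primary, secondary))
```

With real dumps, the two lists would come from open file objects, e.g. `open("primary.ldif", encoding="utf-8")` (file name hypothetical).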

An error message (repeated every 5 minutes) appears on the primary master 
SERVER1 whenever a record that is missing on the secondary gets updated 
on the primary:
[16/Dec/2009:10:26:02 +0100] NSMMReplicationPlugin - agmt="cn=MM to SERVER2" (SERVER2:389): Consumer failed to replay change (uniqueid 25ab6e01-1dd211b2-bdbbda0a-92130000, CSN 4b28a7ac0000000b0000): No such object. Skipping.
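
To see which change is being skipped, the uniqueid and CSN can be pulled out of such a log line. The sketch below assumes the usual CSN layout of 20 hex digits (8 for a Unix timestamp, 4 for a sequence number, 4 for the replica ID, 4 for a sub-sequence number); the regular expression and function names are my own, not anything shipped with the server.

```python
import re
from datetime import datetime, timezone

# The error line quoted above, joined onto one line.
LOG_LINE = (
    '[16/Dec/2009:10:26:02 +0100] NSMMReplicationPlugin - agmt="cn=MM to SERVER2" '
    "(SERVER2:389): Consumer failed to replay change (uniqueid "
    "25ab6e01-1dd211b2-bdbbda0a-92130000, CSN 4b28a7ac0000000b0000): "
    "No such object. Skipping."
)

# Hypothetical pattern matching the "(uniqueid ..., CSN ...)" part of the line.
CHANGE_RE = re.compile(r"uniqueid ([0-9a-fA-F-]+), CSN ([0-9a-fA-F]{20})\)")


def decode_csn(csn):
    """Split a 20-hex-digit CSN into its fields (assumed layout, see above)."""
    if len(csn) != 20:
        raise ValueError("expected a 20-character CSN, got %r" % csn)
    return {
        "time": datetime.fromtimestamp(int(csn[0:8], 16), tz=timezone.utc),
        "seq": int(csn[8:12], 16),
        "replica_id": int(csn[12:16], 16),
        "subseq": int(csn[16:20], 16),
    }


def skipped_change(line):
    """Return (uniqueid, decoded CSN) for a skipped-change line, else None."""
    match = CHANGE_RE.search(line)
    if match is None:
        return None
    return match.group(1), decode_csn(match.group(2))


uniqueid, csn = skipped_change(LOG_LINE)
print(uniqueid, csn["replica_id"], csn["time"].isoformat())
```

Under that assumed layout, the replica ID field shows which master originated the change, and the uniqueid can be matched against the nsUniqueId attribute of the entry on the primary to find the record that the secondary is missing.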

My reasoning would be: if the entry does not exist on the consumer, 
create it. But I guess that is not how the mechanism works.
I'm still scratching my head about this one...

Regards,
Mitja