Re: Determining max CSN of running server

Thierry Bordaz <tbordaz@xxxxxxxxxx> · Fri, 1 Mar 2024 11:12:22 +0100

On 2/29/24 21:31, William Faulk wrote:
Thanks, Pierre and Thierry.

After quite some time of poring over these debug logs, I've found some anomalies and they seem like they're matching up with the idea that the affected replica isn't updating its own RUV correctly.

The logs show a change being made, and it lists the CSN of the change. The first anomalies are here, but they probably aren't terribly significant. The CSN includes a timestamp, and the timestamp on this CSN is 11 hours into the future from when the change was made and logged. Also, the next part of the CSN is supposed to be a serial number for when there are changes made during the same second of the timestamp. In the case I was looking at, that serial was 0xb231. I'm certain that this replica didn't record another 45000 changes in that second.

Hi William,

Are you running DS on a VM, in a container, or on hardware?
A CSN timestamp that is some time in the future is not frequent, but 
it can happen. Generated CSNs should always be increasing, so CSN 
generation adjusts its timestamp to the CSNs it receives.
What looks weird is how high the serial number is. Do you have a full 
error log sample where we can see the sequence number climbing to such 
a high value (0xb231)?
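
For reference, here is a minimal sketch of splitting a CSN string into its fields so the timestamp and sequence number can be checked against the log time. It assumes the usual 20-hex-digit layout (8 digits of timestamp in seconds since the epoch, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number). The names parse_csn and skew_seconds are made up for this example and are not part of any DS tool.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the Unix epoch
    seqnum: int      # counter for changes within the same second
    replica_id: int  # ID of the replica that generated the CSN
    subseqnum: int   # sub-sequence number

    def when(self) -> datetime:
        """Return the CSN timestamp as a timezone-aware UTC datetime."""
        return datetime.fromtimestamp(self.timestamp, tz=timezone.utc)


def parse_csn(csn: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields (assumed layout)."""
    csn = csn.strip()
    if len(csn) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(csn)}")
    return CSN(
        timestamp=int(csn[0:8], 16),
        seqnum=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseqnum=int(csn[16:20], 16),
    )


def skew_seconds(csn: CSN, logged_at: datetime) -> float:
    """Seconds by which the CSN timestamp is ahead of the time it was logged."""
    return (csn.when() - logged_at).total_seconds()
```

Calling parse_csn on a CSN copied from the error log and comparing when() with the log line's own timestamp should show the offset directly. A seqnum of 0xb231 decodes to 45617.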

Then it shows the server committing the change to the changelog. It shows it "processing data" for over 16000 other CSNs, and it takes about 25 seconds to complete.

It then starts a replication session with the peer and prints out the peer's (consumer's) RUV and then its own (supplier's) RUV. The RUV it prints out for itself shows the maxCSN for itself with a timestamp from almost 4 months ago. It is greater than the maxCSN for itself in the consumer's RUV, though, by a little. (The replicagenerations are equal, though.)
IIUC the consumer is currently catching up. Is the RUV on the consumer 
evolving?
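
To check whether the consumer's RUV is moving, one option is to dump the nsds50ruv values from both sides a few minutes apart and compare the maxCSN per replica ID. Below is a rough sketch of that comparison. It assumes the RUV elements look like "{replica <id> <url>} <minCSN> <maxCSN>" and that CSNs are fixed-width hex strings, so plain string comparison orders them. The function names are made up for this example.

```python
import re

# Assumed element format: "{replica <id> <url>} <minCSN> <maxCSN>".
# The "{replicageneration} ..." element does not match and is skipped.
RUV_ELEMENT = re.compile(
    r"^\{replica (\d+)(?: ([^}]*))?\}"
    r"(?: ([0-9a-fA-F]{20}) ([0-9a-fA-F]{20}))?"
)


def max_csns(ruv_values):
    """Map replica ID -> maxCSN for every RUV element that carries CSNs."""
    result = {}
    for value in ruv_values:
        match = RUV_ELEMENT.match(value.strip())
        if match and match.group(4):
            result[int(match.group(1))] = match.group(4).lower()
    return result


def lagging_replicas(supplier_ruv, consumer_ruv):
    """Return {rid: (supplier_max, consumer_max)} where the consumer is behind.

    consumer_max is None when the consumer has no maxCSN for that replica ID.
    """
    supplier = max_csns(supplier_ruv)
    consumer = max_csns(consumer_ruv)
    behind = {}
    for rid, sup_csn in supplier.items():
        con_csn = consumer.get(rid)
        # Fixed-width lowercase hex strings sort in the same order as the CSNs.
        if con_csn is None or con_csn < sup_csn:
            behind[rid] = (sup_csn, con_csn)
    return behind
```

If the same replica ID shows up as lagging in two runs with an unchanged consumer maxCSN, the consumer is not catching up for that replica.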

It then claims to send 7 changes, all of which are skipped because "empty". It then claims that there are "No more updates to send" and releases the consumer and eventually closes the connection.
Do you have fractional replication configured? (With it, some 
attributes are skipped from replication.)

I like the idea that there's a list of pending operations that's blocking RUV updates. Is there any way for me to examine this list? That said, I do think it updated its own maxCSN in its own RUV by a few hours. The peer I'm looking at does seem to reflect the increased maxCSN for the bad replica in the RUV I can see in the "mapping tree". I've tried to reproduce this small update, but haven't been able to yet.
Difficult to say. The pending list likely has a different meaning in my 
understanding.

I also have another replica that seems to be experiencing the same problem, and I've restarted it with no improvement in symptoms. It might be different, though. It doesn't look like it discarded its changelog.

I definitely don't relish reinitializing from this bad replica, though. I'd have to perform a rolling reinitialization throughout our whole environment, and it takes ages and a lot of effort.

--
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx