Re: Determining max CSN of running server

Thierry Bordaz <tbordaz@xxxxxxxxxx> · Thu, 29 Feb 2024 10:48:16 +0100

On 2/29/24 05:12, William Faulk wrote:
Might be worth re-reading
Well, I still don't really know the details of the replication process.

I have deduced that changes originated on a replica seem to prompt that replica to start a replication process with its peers, but I don't really know what happens then.
Replication is driven by a replication agreement (RA), which is woken up
when a new update is written to the changelog. The new update may come
directly from an LDAP client or from replication itself.
There's a comparison of the RUVs of the two replicas, but does the initiating system send its RUV to the receiver, or does it go the other way, or do both happen?
IIRC only the remote replica sends its RUV. The RA that receives the
RUV then compares it with its own RUV to find the oldest update that
the remote replica is missing.
Does the comparison prompt the comparing system to send the changes it thinks the other system needs, or does it cause the comparing system to request new changes from the other?
Yes, the RUV contains the latest update received from each replica.
Maybe none of this really makes much difference, but the lack of technical detail around this makes me just question everything.
It makes perfect sense, and it shows you already have a deep
understanding of the replication process.

It doesn't send a single CSN, the replication compares the RUVs and determines the
range of CSNs that are missing from the consumer.
Sure, but notionally any changes that originated on that replica would be reflected in the max CSN for itself in the RUV that is used to compare. And at least one side is sending its RUV to the other during the replication process.
Yes, the remote replica (called the consumer, IIRC) sends back its RUV
in response to the request sent by the RA.

It's also not immediate. When the server accepts a change (add, mod, etc.), the
change is assigned a CSN. But there may then be a delay before the two nodes actually
communicate and exchange data.
Sure, but the changes originated on this replica haven't made it to other replicas in weeks. This isn't a mere delay in replication.
Usually replication occurs within a few seconds. If changes have not
replicated for weeks, then replication is broken, and you need to find
the reason for the breakage in the replication debug logs on both sides
(supplier and consumer).

Generally you'd need replication logging (errorloglevel 8192). But it's very noisy
and can be hard to read. What you need to see is the ranges that they agree to send.
Okay. I've done that and haven't had a chance to pore through them yet.
They are quite difficult to read, especially if multiple RAs are active
at once. You may want to read the code in parallel to understand the
purpose of those messages.

Also remember CSNs are a monotonic Lamport clock. This means they only ever advance
and can never step backwards. So they have some different properties to what you may
expect. If they ever go backwards I think the replication handler throws a pretty nasty
error.
I don't think it's going backwards. What I'm trying to rule out is that the replica is failing to advance its max CSN in the RUV being used to compare.
Compare the RUVs. You need to dump the RUV on both servers (consumer
and supplier), then compare the maxcsn PER replica. Replication will
start from the smallest of the maxcsn values, so a maxcsn may not move
until all the others are in sync.
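
A minimal sketch of that per-replica comparison, in Python. It assumes
the nsds50ruv values have already been read from each server and look
like "{replicageneration} <csn>" and "{replica <id> <url>} <mincsn>
<maxcsn>", and that CSNs are fixed-width 20-character hex strings, so
plain string comparison orders them. The function names and the sample
hosts and CSNs are made up for illustration; this is not code from the
server.

```python
import re

# "{replicageneration} <csn>"
_GEN_RE = re.compile(r"^\{replicageneration\}\s+([0-9a-fA-F]+)")
# "{replica <id> <url>} <mincsn> <maxcsn>" (the CSN pair may be absent)
_REPLICA_RE = re.compile(
    r"^\{replica\s+(\d+)(?:\s+[^}]*)?\}"
    r"(?:\s+([0-9a-fA-F]{20})\s+([0-9a-fA-F]{20}))?"
)


def parse_ruv(values):
    """Return (replicageneration, {replica_id: maxcsn}) from RUV values."""
    generation = None
    maxcsns = {}
    for value in values:
        value = value.strip()
        m = _GEN_RE.match(value)
        if m:
            generation = m.group(1).lower()
            continue
        m = _REPLICA_RE.match(value)
        if m and m.group(3):
            maxcsns[int(m.group(1))] = m.group(3).lower()
    return generation, maxcsns


def compare_ruvs(supplier_values, consumer_values):
    """Return {replica_id: (consumer_maxcsn, supplier_maxcsn)} for every
    replica id where the consumer is behind the supplier."""
    sup_gen, sup = parse_ruv(supplier_values)
    con_gen, con = parse_ruv(consumer_values)
    if sup_gen != con_gen:
        raise ValueError("replicageneration differs between the servers")
    behind = {}
    for rid, sup_max in sup.items():
        con_max = con.get(rid)
        if con_max is None or con_max < sup_max:
            behind[rid] = (con_max, sup_max)
    return behind


# Hypothetical values, as they might be copied from each server's RUV.
supplier = [
    "{replicageneration} 65a0f3e2000000010000",
    "{replica 1 ldap://a.example.com:389} 65a0f3e2000000010000 65dfb1c0000000010000",
    "{replica 2 ldap://b.example.com:389} 65a0f400000000020000 65df0000000000020000",
]
consumer = [
    "{replicageneration} 65a0f3e2000000010000",
    "{replica 1 ldap://a.example.com:389} 65a0f3e2000000010000 65b00000000000010000",
    "{replica 2 ldap://b.example.com:389} 65a0f400000000020000 65df0000000000020000",
]
print(compare_ruvs(supplier, consumer))
```

An empty result means the two RUVs agree for every replica id the
supplier knows about. It does not by itself show that the RUVs are
advancing.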

I *think* so. It's been a while since I had to look. The nsds50ruv shows the ruv of
the server, and I think the other replica entries are "what the peers ruv was last
time".
Well, it's at least nice to hear that my guess at least isn't asinine. :)

replication monitoring code in newer versions does this for you, so I'd probably
advise you attempt to upgrade your environment. 1.3 is really old at this point
I've been trying to get the current environment stable enough that I feel comfortable going through the relatively lengthy upgrade process. I think I'm going to have to adjust my comfort level.

(I'm not sure if even RH or SUSE still support that version anymore).
RedHat does, as it's what's in RHEL7.9, which is supported for another, uh, 4 months. They're working on this with me. I'm still just trying to understand the system better so that I can try to be productive while I'm waiting on them to come up with ideas.

The problem here is that to read the RUV's and then compare them, you need to read
each RUV from each server and then check if they are advancing (not that they are equal).
The problem is that the changes in my environment are few enough that all the replicas' RUVs _are_ equal the majority of the time. I'm not in front of that system as I respond right now, so my details might be wrong, but I'm asking about all of this because every RUV I see in all of the replicas is the same, and it shows a max CSN for this one replica that's much older than the CSNs I see it reference in the logs about changes originating on the replica. The CSNs I see in the logs when a new change is made are referencing the current time in them, while the max CSN I see in the RUVs is from 4 months ago.

Maybe it *did* go backwards somehow and that's why it's not working. Not that that would really help me understand what actually went wrong any better than I do now.
Something important in the RUV is the 'replicageneration': it should be
identical on both sides.
For the problematic server, does the RUV evolve or not?
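
To see whether a maxcsn is stale, it can help to decode the time
embedded in the CSN and compare it with the CSNs that show up in the
logs. The Python sketch below assumes the usual 20-character hex layout
(8 digits of Unix time, then 4 each for sequence number, replica id and
sub-sequence number). The names decode_csn and csn_age_days and the
sample CSN are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime   # time part of the CSN, as UTC
    seqnum: int           # sequence number within that second
    replica_id: int       # id of the replica that originated the change
    subseqnum: int        # sub-sequence number


def decode_csn(csn: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields."""
    if len(csn) != 20:
        raise ValueError("expected a 20-character CSN, got %r" % csn)
    seconds = int(csn[0:8], 16)
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseqnum=int(csn[16:20], 16),
    )


def csn_age_days(csn: str, now: datetime) -> float:
    """Age of the CSN's time part relative to 'now', in days."""
    return (now - decode_csn(csn).timestamp).total_seconds() / 86400


# Hypothetical maxcsn taken from a RUV.
maxcsn = "65e05290000000010000"
fields = decode_csn(maxcsn)
print(fields.replica_id, fields.timestamp.isoformat())
print(round(csn_age_days(maxcsn, datetime.now(timezone.utc)), 1), "days old")
```

Running this against the maxcsn read at two different times shows
directly whether the RUV for that replica id is evolving.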

If you want to assert that "Some change I made at CSN X is on all servers" then
you would need to read and parse the ruv and ensure that all of them are at or past that
CSN for that replica id.
Well, you'd think so. I've got that problem, too, where some CSNs just seem to get missed, but the max CSN in the RUV is well past that. But that's a different problem and not the one I'm working on now.
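
The check described in the quoted text could be sketched like this in
Python, assuming the maxcsn per replica id has already been collected
from each server into a dict (the server names and CSNs here are
invented). As the reply above points out, a maxcsn past X did not
always mean X had arrived in that environment, so this is a necessary
check and not proof.

```python
def servers_missing_csn(target_csn, maxcsn_by_server):
    """Return the servers whose RUV is not yet at or past target_csn.

    maxcsn_by_server maps a server name to {replica_id: maxcsn}.
    The replica id is taken from hex digits 12..15 of the target CSN.
    """
    target = target_csn.lower()
    rid = int(target[12:16], 16)
    missing = []
    for server, maxcsns in maxcsn_by_server.items():
        seen = maxcsns.get(rid)
        # Fixed-width hex strings compare in CSN order.
        if seen is None or seen.lower() < target:
            missing.append(server)
    return sorted(missing)


# Hypothetical data: replica id 1 originated the change.
ruvs = {
    "a.example.com": {1: "65dfb1c0000000010000", 2: "65df0000000000020000"},
    "b.example.com": {1: "65b00000000000010000", 2: "65df0000000000020000"},
    "c.example.com": {2: "65df0000000000020000"},
}
print(servers_missing_csn("65c00000000000010000", ruvs))
```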

Thanks for the input.
