Hi,
I agree that it is a complex task to master such a FreeIPA deployment. FreeIPA enables many components, 389ds is just one of them, and several of them could contribute when a problem occurs. My main concern here is that you express a need to monitor how well the FreeIPA deployment works, rather than pointing to a clear misbehavior of the topology that we could focus on.
Replication is an important functionality of 389ds/FreeIPA and
monitoring replication is a common demand. One frequent request is to
monitor the replication lag: how much time it takes for the topology
to converge, i.e. for an update to be replicated to all replicas.
There are several ways to monitor that, but I think an easy way is to
rely on the dirsrv access logs. Each replicated update is identified
uniquely by its CSN, and you will find values like
'csn=57eb7dbc000000600000' in the logs. Grepping that value across
the access logs of all instances gives you an indication of the lag.
The lag is typically in the range of seconds to 1-2 minutes; if it
spikes to many minutes, or the update never hits some replica, then
you can start investigating why it is slow.
A difficulty with that procedure is that some updates are not
replicated (fractional replication).
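As a quick illustration of that procedure only (the log paths and the
argument handling are just examples to adapt to your instances), a
small script could grep one CSN in access log copies gathered from
each replica and print the spread between the first and last time the
CSN was seen:

#!/usr/bin/env python3
# Rough replication-lag check: look for one CSN in access logs copied
# from every replica and report when each server logged it.

import re
import sys
from datetime import datetime

# Timestamp as it appears at the start of ns-slapd access log lines, e.g.
# [21/Nov/2023:09:30:45.123456789 +0100] conn=5 op=2 ... csn=57eb7dbc000000600000
# (the fractional part is absent in older log formats, hence optional)
TS_RE = re.compile(
    r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})(?:\.(\d+))? ([+-]\d{4})\]")


def first_occurrence(path, csn):
    # Return the timestamp of the first line in this log mentioning the CSN.
    with open(path, errors="replace") as fh:
        for line in fh:
            if csn in line:
                m = TS_RE.match(line)
                if m:
                    base, frac, tz = m.groups()
                    # strptime only understands microseconds, truncate nanoseconds
                    stamp = "%s.%s %s" % (base, (frac or "0")[:6], tz)
                    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S.%f %z")
    return None


def main():
    if len(sys.argv) < 3:
        sys.exit("usage: csn_lag.py CSN ACCESS_LOG [ACCESS_LOG ...]")
    csn, logs = sys.argv[1], sys.argv[2:]
    seen = {}
    for path in logs:
        ts = first_occurrence(path, csn)
        if ts is None:
            print("%s: csn not found (still in flight, or fractional?)" % path)
        else:
            seen[path] = ts
            print("%s: %s" % (path, ts.isoformat()))
    if len(seen) > 1:
        print("spread between first and last replica: %s"
              % (max(seen.values()) - min(seen.values())))


if __name__ == "__main__":
    main()

Run it against copies of /var/log/dirsrv/slapd-<INSTANCE>/access taken
from each replica; if the spread regularly exceeds a couple of
minutes, that is where the investigation should start.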
Investigations into replication issues are quite complex and
difficult to explain in general terms. This forum is a good place to
get answers to specific questions.
best regards
Thierry
Yes, I'll try to explain my needs more clearly. As happens a lot, I recently inherited a FreeIPA installation and am now responsible for managing the service. As someone who was not previously familiar with FreeIPA, I am in the process of building my expertise in managing it.

When I started, the monitoring setup consisted of node_exporter and process_exporter for the host and 389ds_exporter (https://github.com/terrycain/389ds_exporter) for the LDAP data. However, as the FreeIPA installation grew in size, we started encountering issues and realized that we lacked critical information to pinpoint the root causes of these problems. To address this, I have taken steps to improve the monitoring setup: I have started monitoring FreeIPA's BIND service with a separate exporter and exporting DNS queries to OpenSearch. Additionally, I have rewritten the 389ds_exporter to include cn=monitor metrics to provide more visibility into the 389 Directory Server. I recently realized that I could also include 'cn=ldbm database' metrics, which are low-level but could be useful in troubleshooting the issues we are facing.

The problems we are encountering are related to disk IO, and having these metrics could provide valuable insights into the following:

1) Excessive paging out and increased swap usage without spikes in load. For example, after restarting a replica the swap usage increases to 30% (of 3GB swap space) over 1-2 days while there are at least 4GB of available RAM present on the host, and the main swap consumer is the ns-slapd service. So far I have only tried setting the swappiness parameter to zero, which did not help, so I guess there are other factors involved.

2) Spikes in IO latency observed during modify and add operations, which were not present when the cluster was smaller (up to 10 replicas). I need to determine whether the issue lies with service tuning or with the cloud provider and its SAN, as we recently migrated to SSD disks without improvement. Regarding "replication lag": these problems started appearing more often as new replicas were added, but for now we mostly observe them as outages of services that rely on LDAP. The "waves" refers to the way the problem appears: different clients' VDCs have problems one after the other, which looks like replication propagation.

3) Master-master replication just seems to me like a big "black cloud" over which I have no control or knowledge. When you have a couple of hosts it may be fine to rely on the documented way of looking up the replication status attribute, but when you have a couple of dozen, I guess things get not so straightforward; at least intuition suggests so. When I talk about replication observability, what I mean and what I'd like to see is the following, a graph representation...

- ...of the time it took to replay a change (or, I guess, the time of a full replication session)
- ...of the number of simultaneous connections that suppliers try to establish with a consumer
- ...of the time spent waiting to acquire replica access

I just listed a few off the top of my head. I don't know for sure (and the first post was about this) whether it is really worth trying to get those kinds of metrics, or whether I just don't know what I'm talking about and it would be a waste of time and hard to implement. I mentioned bpf because I see it as the only way I could get them; the other option is to parse logs in DEBUG mode, which is not an option.
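For reference, this is roughly the kind of poller I have in mind, as a sketch only (the host, bind DN and password are placeholders, and the ldbm attribute names are the ones I see with the BDB backend):

#!/usr/bin/env python3
# Rough poller for 389ds monitoring entries over LDAP.
# Host, bind DN and password below are placeholders.

import ldap  # python-ldap

URI = "ldap://replica-01.example.com"
BIND_DN = "cn=Directory Manager"
BIND_PW = "secret"

# Entries that expose counters without any debug logging:
#  - cn=monitor                      : connections, ops, read waiters, ...
#  - cn=monitor of the ldbm backend  : db cache hits/tries, page in/out
MONITOR_ENTRIES = [
    ("cn=monitor",
     ["currentconnections", "opsinitiated", "opscompleted", "readwaiters"]),
    ("cn=monitor,cn=ldbm database,cn=plugins,cn=config",
     ["dbcachehits", "dbcachetries", "dbcachepagein", "dbcachepageout"]),
]

# Status attributes present on every replication agreement entry.
AGREEMENT_ATTRS = [
    "nsDS5ReplicaHost",
    "nsds5replicaLastUpdateStatus",
    "nsds5replicaLastUpdateStart",
    "nsds5replicaLastUpdateEnd",
    "nsds5replicaUpdateInProgress",
]


def dump(results):
    # Print each returned entry and its attribute values.
    for dn, entry in results:
        print(dn)
        for name, values in entry.items():
            print("  %s: %s" % (name, values[0].decode()))


def main():
    conn = ldap.initialize(URI)
    conn.simple_bind_s(BIND_DN, BIND_PW)

    for base, attrs in MONITOR_ENTRIES:
        dump(conn.search_s(base, ldap.SCOPE_BASE, "(objectClass=*)", attrs))

    # One entry per outbound agreement; dozens of replicas means dozens of
    # rows, which is exactly what I would like to graph instead of reading
    # by hand.
    dump(conn.search_s("cn=config", ldap.SCOPE_SUBTREE,
                       "(objectClass=nsds5replicationagreement)",
                       AGREEMENT_ATTRS))

    conn.unbind_s()


if __name__ == "__main__":
    main()

Wiring something like this into the exporter would at least cover the cache page-in/page-out side of problem 1 and the per-agreement status from point 3; the timing of individual replication sessions is what I still don't know how to get without DEBUG logging or bpf.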
With replication metrics, besides the ability to see their impact on the problems above, I'm also trying to solve a more administrative task: I need to convince the architecture department to change the model we use for adding new replicas. Right now we basically add two replicas for every new client:

+------------------------------+
| client#1                     |
| VDC                          |
|                              |
|  +--------------+            |     +---------------------+       +---------------------+
|  |              +------------+---->+                     +------>+                     |  ...
|  |  replica-01  |            |     |  common-replica-01  |       |  common-replica-02  |
|  |              +<-----------+-----+                     +<------+                     |
|  +--------------+            |     +---------------------+       +---------------------+
|      ^    |                  |         ^         |                   ^         |
|      |    v                  |         |         |                   |         |
|  +--------------+            |         |         |                   |         |
|  |              |            |         |         |                   |         |
|  |  replica-02  |            |         |         |                   |         |
|  |              |            |         |         |                   |         |
|  +--------------+            |         |         |                   |         |
|                              |         |         v                   |         v
+------------------------------+     +---------------------+       +---------------------+
                                      |                     +------>+                     |
                                      |  common-replica-03  |       |  common-replica-04  |
                                      |                     +<------+                     |
                                      +---------------------+       +---------------------+

This is not ideal at all (and, as I said, we have started to face problems). Their answer is that they are following the documented restrictions of no more than 4 replication agreements per replica and no more than 60 replicas in a master-master replication topology. For now these limits are indeed respected, so I need to come up with a deeper analysis, or find out that the problem lies in fine-tuning the service.

So it's kind of a mishmash of everything at the same time; I hope I answered your question.

best regards,
v.zh