[MDS] mds stuck in laggy state, CephFS unusable

Hi,

we're running into a peculiar issue that we discovered during HA/DR testing of our Ceph cluster.

Basic info about the cluster:
Version: Quincy (17.2.6)
5 nodes configured as a stretch cluster (2 DCs plus one arbiter node, which is also the admin node for the cluster)
Every node besides the admin node runs OSD and MON services, and we have 3 MGR instances in the cluster.
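For completeness, stretch mode was enabled following the standard procedure; a rough sketch with placeholder host/rule names (ours differ):

    # assign each mon a datacenter location; the arbiter mon acts as tiebreaker
    ceph mon set_location node1 datacenter=dc1
    ceph mon set_location node2 datacenter=dc1
    ceph mon set_location node3 datacenter=dc2
    ceph mon set_location node4 datacenter=dc2
    # "arbiter" and "stretch_rule" are placeholders for our tiebreaker mon
    # and the CRUSH rule created beforehand
    ceph mon enable_stretch_mode arbiter stretch_rule datacenter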

The specific thing we wanted to test is multiple CephFS filesystems, each with multiple MDS daemons (with HA in mind).
We deployed an MDS on every node, increased max_mds to 2 for every CephFS, and the other two MDS daemons run in standby-replay mode (they are automatically configured during CephFS creation to follow a specific CephFS - join_fscid).
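Roughly, the per-filesystem setup looks like this (illustrative commands; "fs1" and the placement count are placeholders for our actual names):

    # create the filesystem and deploy an MDS daemon on each host
    ceph fs volume create fs1
    ceph orch apply mds fs1 --placement="5"
    # two active ranks per filesystem, with standby-replay followers
    ceph fs set fs1 max_mds 2
    ceph fs set fs1 allow_standby_replay true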

We ran multiple tests, and when we have only one CephFS it behaves as expected (two MDS daemons stay in the up:active state, and clients can connect to and interact with the CephFS as if nothing had happened).

When we test with multiple CephFS filesystems (two, for example) and shut down two nodes, one of the MDS daemons gets stuck in the up:active (laggy) state. When this happens, the affected CephFS is unusable: clients hang, and it stays stuck until we power the other DC back on. This happens even when there are no clients connected to that specific CephFS.
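This is how we observe the stuck rank (placeholder filesystem name; output elided):

    # the affected rank never fails over to a standby, it just stays active and laggy
    ceph fs status fs2
    ceph health detail
    ceph fs dump        # the rank shows state up:active with laggy_since set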

We can provide additional logs and run any tests necessary. We already checked the usual culprits, and our nodes don't show any excessive CPU or memory usage.
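If it helps, we can raise the MDS debug levels before reproducing and attach the resulting logs, e.g.:

    # increase MDS verbosity for the next test run
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1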

We would appreciate any help.