Hi,

we have a stretched cluster (Reef 18.2.1) with 5 nodes (2 nodes on each side + a witness). You can see our daemon placement below.

[admin]
ceph-admin01 labels="['_admin', 'mon', 'mgr']"

[nodes]
[DC1]
ceph-node01 labels="['mon', 'mgr', 'mds', 'osd']"
ceph-node02 labels="['mon', 'rgw', 'mds', 'osd']"

[DC2]
ceph-node03 labels="['mon', 'mgr', 'mds', 'osd']"
ceph-node04 labels="['mon', 'rgw', 'mds', 'osd']"

We have been testing CephFS HA (we run two active MDS daemons at all times) and noticed a problem when an active MDS and the active MGR (the MGR runs either on the admin node or in one of the DCs) end up in the same DC and we shut that site down: the metadata of one of the MDS daemons can no longer be retrieved, and the logs show:

"mgr finish mon failed to return metadata for mds"

After we bring that site back up the problem persists, and the metadata of the MDS in question still can't be retrieved with "ceph mds metadata". Only after I manually fail the affected MDS daemon with "ceph mds fail" is the problem resolved and I can retrieve the MDS metadata again (command sketch at the end of this message).

My questions:

1. Could this be related to the following bug (https://tracker.ceph.com/issues/63166)? The tracker shows it as backported to 18.2.1, but I can't find it in the Reef release notes.
2. Should this work at all in the current configuration, given that the MDS and the MGR are disconnected from the rest of the cluster at the same moment?
3. What would the solution be here, and is there any loss of data when this happens?

Any help is appreciated.
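
For reference, this is roughly the sequence I use to confirm the state and apply the workaround. The MDS name below is only a placeholder for illustration; substitute the daemon whose metadata the mgr fails to return.

    # Placeholder daemon name; pick the affected MDS from "ceph fs status"
    MDS_NAME=mds.cephfs.ceph-node01.xxxxxx

    # Check which MDS daemons are active and where they are running
    ceph fs status
    ceph orch ps --daemon-type mds

    # Try to read the metadata the mgr failed to return
    ceph mds metadata "$MDS_NAME"

    # Workaround: fail the affected MDS so a standby takes over and the
    # mgr re-registers its metadata
    ceph mds fail "$MDS_NAME"

    # Verify the metadata is retrievable again
    ceph mds metadata "$MDS_NAME"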