Dear all,

I'm having a hard time troubleshooting a file-system failure on my 3-node cluster (deployed with cephadm + docker). After moving some files between folders, the cluster became laggy and the metadata servers started failing and getting stuck in the rejoin state. Of course I have already tried restarting the cluster multiple times. The MDS units are now in a failed state because of too many restarts, the file system is degraded, and it cannot be mounted because no MDS is up.

I think the data pool is fine, because I can still retrieve files with rados. I can trigger the standby MDS to become the active one with "ceph orch daemon rm mds.<id-of-mds-in-error>", or deploy a new one, but the new active MDS then goes into the error state again. I don't find the MDS logs very helpful, but I have attached one for someone more expert than me. I am hesitant to follow the disaster-recovery guide (https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/) because of its warnings, and because cephfs-journal-tool is poorly documented.

The following might be useful:

seppia:~ # ceph fs status
starfs - 0 clients
======
RANK      STATE                 MDS             ACTIVITY   DNS  INOS  DIRS  CAPS
 0    rejoin(laggy)  starfs.polposition.njarir             539    25    17     0
        POOL            TYPE     USED  AVAIL
cephfs.starfs.meta   metadata   9900M  1027G
cephfs.starfs.data     data     12.1T  1027G
MDS version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)

seppia:~ # ceph health detail
HEALTH_WARN 2 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 7 pgs not deep-scrubbed in time
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon mds.starfs.polposition.njarir on polposition.starfleet.sns.it is in error state
    daemon mds.starfs.seppia.wdwrho on seppia.starfleet.sns.it is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs starfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] PG_NOT_DEEP_SCRUBBED: 7 pgs not deep-scrubbed in time
    pg 3.a8 not deep-scrubbed since 2021-04-20T20:07:48.346677+0000
    pg 3.a2 not deep-scrubbed since 2021-04-21T08:10:55.220263+0000
    pg 3.7 not deep-scrubbed since 2021-04-21T07:24:20.073569+0000
    pg 2.0 not deep-scrubbed since 2021-04-21T05:01:18.439456+0000
    pg 9.1a not deep-scrubbed since 2021-04-21T05:18:20.171151+0000
    pg 3.1cb not deep-scrubbed since 2021-04-20T21:54:38.251349+0000
    pg 3.1ef not deep-scrubbed since 2021-04-21T07:07:18.842132+0000

Thanks for any suggestions,
Alessandro Piazza
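
P.S. For completeness, this is roughly how I checked the data pool and cycled the failed MDS daemons (the object name below is just an example, not one from my cluster):

    # spot-check that data pool objects are still listable and readable
    rados -p cephfs.starfs.data ls | head
    rados -p cephfs.starfs.data get 10000000000.00000000 /tmp/testobj

    # remove the failed daemon so the standby takes over;
    # --force may be needed since the daemon is managed by a service spec,
    # and cephadm should then redeploy it per that spec
    ceph orch daemon rm mds.starfs.polposition.njarir --force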
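
P.P.S. From my reading of the disaster-recovery guide, the first steps should be non-destructive, so if nobody has a better idea I would start with something like the following before touching anything; please correct me if I've misunderstood the docs:

    # back up and inspect the rank 0 journal
    # (export and inspect are read-only, as far as I understand)
    cephfs-journal-tool --rank=starfs:0 journal export backup.bin
    cephfs-journal-tool --rank=starfs:0 journal inspect

    # only as later, destructive steps per the guide (not run yet):
    # cephfs-journal-tool --rank=starfs:0 event recover_dentries summary
    # cephfs-journal-tool --rank=starfs:0 journal reset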