Troubleshoot MDS failure

Dear all,

I'm having a hard time troubleshooting a file-system failure on my 3-node cluster (deployed with cephadm + docker). After I moved some files between folders, the cluster became laggy and the Metadata Servers started failing, getting stuck in the rejoin state. I have, of course, already tried restarting the cluster multiple times.
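
In case it helps, this is roughly how I have been checking the daemons, and how I plan to raise the MDS log level before the next restart attempt to see why rejoin fails (the debug values are just my guess at something verbose enough, to be reverted afterwards):

# list the MDS daemons cephadm manages and their current state
ceph orch ps --daemon-type mds
# temporarily raise MDS logging before the next restart attempt
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1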

The mds units are now in a failed state because of too many restarts, the file system is degraded, and it cannot be mounted because no mds is up. I think the data pool is fine, because I can still retrieve files with rados.
I can trigger the standby mds to become the active one with ceph orch daemon rm mds.<mds-in-error-id>, or deploy a new one, but the new active mds goes back into the error state.
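
For completeness, this is more or less what I mean by checking the data pool with rados and by removing the failed daemon (the object name is just a placeholder; the daemon name is one of the two listed in error below):

# sanity check on the data pool: list some objects and fetch one back
rados -p cephfs.starfs.data ls | head
rados -p cephfs.starfs.data get <object-name> /tmp/object.bin
# remove the failed daemon so the standby (or a freshly deployed mds) takes over
ceph orch daemon rm mds.starfs.polposition.njarir --force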

I don't find the mds logs particularly helpful, but I have attached one for someone with more expertise than me.
I am hesitant to follow the guide https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ because of its warnings and because cephfs-journal-tool is poorly documented.
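
For reference, if I do end up following that guide, my understanding is that the first, non-destructive steps would be a journal backup and inspection for rank 0 of starfs, roughly like this (please correct me if I have misread the page):

# back up the rank 0 journal before touching anything
cephfs-journal-tool --rank=starfs:0 journal export backup.bin
# read-only check of the journal for damage
cephfs-journal-tool --rank=starfs:0 journal inspect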

The following output might be useful:

seppia:~# ceph fs status
starfs - 0 clients
======
RANK      STATE                 MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0    rejoin(laggy)  starfs.polposition.njarir             539     25     17      0
       POOL           TYPE     USED  AVAIL
cephfs.starfs.meta  metadata  9900M  1027G
cephfs.starfs.data    data    12.1T  1027G
MDS version: ceph version 16.2.0 (0c2054e95bcd9b30fdd908a79ac1d8bbc3394442) pacific (stable)

seppia:~ # ceph health detail
HEALTH_WARN 2 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available; 7 pgs not deep-scrubbed in time
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon mds.starfs.polposition.njarir on polposition.starfleet.sns.it is in error state
    daemon mds.starfs.seppia.wdwrho on seppia.starfleet.sns.it is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs starfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[WRN] PG_NOT_DEEP_SCRUBBED: 7 pgs not deep-scrubbed in time
    pg 3.a8 not deep-scrubbed since 2021-04-20T20:07:48.346677+0000
    pg 3.a2 not deep-scrubbed since 2021-04-21T08:10:55.220263+0000
    pg 3.7 not deep-scrubbed since 2021-04-21T07:24:20.073569+0000
    pg 2.0 not deep-scrubbed since 2021-04-21T05:01:18.439456+0000
    pg 9.1a not deep-scrubbed since 2021-04-21T05:18:20.171151+0000
    pg 3.1cb not deep-scrubbed since 2021-04-20T21:54:38.251349+0000
    pg 3.1ef not deep-scrubbed since 2021-04-21T07:07:18.842132+0000

Thanks for any suggestions,
Alessandro Piazza