Hi Noe,

If the MDS has failed and you're sure there are no pending tasks or sessions
associated with the failed MDS, you can try `ceph mds rmfailed`. Be careful,
though: only use it if that MDS really is doing nothing and isn't linked to
any file system, otherwise things can go wrong and you can end up with an
inaccessible file system. More information on the command can be found at [0]
and [1]; a rough sketch of the invocation is at the bottom of this mail.

[0] https://docs.ceph.com/en/quincy/man/8/ceph/
[1] https://docs.ceph.com/en/latest/cephfs/administration/#advanced

--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc. <https://www.redhat.com/>

On Wed, May 29, 2024 at 4:24 PM Noe P. <ml@am-rand.berlin> wrote:

> Hi,
>
> after our disaster yesterday, it seems that we got our MONs back.
> One of the filesystems, however, seems to be in a strange state:
>
> % ceph fs status
>
> ....
> fs_cluster - 782 clients
> ==========
> RANK  STATE    MDS       ACTIVITY     DNS    INOS   DIRS   CAPS
>  0    active   cephmd6a  Reqs: 5 /s   13.2M  13.2M  1425k  51.4k
>  1    failed
>       POOL             TYPE      USED   AVAIL
> fs_cluster_meta        metadata  3594G  53.5T
> fs_cluster_data        data      421T   53.5T
> ....
> STANDBY MDS
>   cephmd6b
>   cephmd4b
> MDS version: ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>
>
> % ceph fs dump
> ....
> Filesystem 'fs_cluster' (3)
> fs_name fs_cluster
> epoch 3068261
> flags 12 joinable allow_snaps allow_multimds_snaps
> created 2022-08-26T15:55:07.186477+0200
> modified 2024-05-29T12:43:30.606431+0200
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 4398046511104
> required_client_features {}
> last_failure 0
> last_failure_osd_epoch 1777109
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 2
> in 0,1
> up {0=911794623}
> failed
> damaged
> stopped 2,3
> data_pools [32]
> metadata_pool 33
> inline_data disabled
> balancer
> standby_count_wanted 1
> [mds.cephmd6a{0:911794623} state up:active seq 44701 addr [v2:
> 10.13.5.6:6800/189084355,v1:10.13.5.6:6801/189084355] compat
> {c=[1],r=[1],i=[7ff]}]
>
>
> We would like to get rid of the failed rank 1 (without crashing the MONs)
> and have a 2nd MDS from the standbys step in.
>
> Anyone have an idea how to do this?
> I'm a bit reluctant to try 'ceph mds rmfailed', as this seems to have
> triggered the MONs to crash.
>
> Regards,
> Noe
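
For reference, this is roughly what the invocation would look like on your
cluster (taking the file system name fs_cluster and the failed rank 1 from
your output; I'm writing this from memory of the quincy CLI, so please
double-check the syntax against [0] before running anything):

    # only once you are certain nothing is associated with the failed rank;
    # the command refuses to run without the confirmation flag
    % ceph mds rmfailed fs_cluster:1 --yes-i-really-mean-it

    # then check whether one of the standbys gets promoted to rank 1
    % ceph fs status fs_cluster

Treat this as a sketch rather than a recipe; if anything still looks
associated with that rank, stop and investigate first.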