Re: How to recover from an MDS rank in state 'failed'

On Wed, 29 May 2024, Eugen Block wrote:

> I'm not really sure either, what about this?
>
> ceph mds repaired <rank>

I think that only works for ranks marked 'damaged', not for a rank in state 'failed'.
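
If the rank really were marked damaged, I believe the call would look
roughly like this (our fs_cluster and rank 1 used purely as an
illustration, untested here):

  % ceph fs dump | grep damaged       # 'repaired' only clears ranks listed here
  % ceph mds repaired fs_cluster:1    # clear the flag so a standby can take the rank

But the dump below shows 'damaged' as empty, so I doubt it applies in our case.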

N.

> The docs state:
>
> >Mark the file system rank as repaired. Unlike the name suggests, this command
> >does not change a MDS; it manipulates the file system rank which has been
> >marked damaged.
>
> Maybe that could bring it back up? Did you set max_mds to 1 at some point? If
> you do it now (and you currently have only one active MDS), maybe that would
> clean up the failed rank as well?
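
If we go that route, I'd expect it to be something along these lines
(untested on our cluster, and assuming it is tolerable to run with a
single active MDS for a moment):

  % ceph fs set fs_cluster max_mds 1   # shrink to one rank; hopefully this retires the stuck rank 1
  % ceph fs status                     # check that rank 1 is gone and rank 0 is still up:active
  % ceph fs set fs_cluster max_mds 2   # grow again so a standby (cephmd6b or cephmd4b) takes rank 1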
>
>
> Quoting "Noe P." <ml@am-rand.berlin>:
>
> >Hi,
> >
> >after our disaster yesterday, it seems that we got our MONs back.
> >One of the filesystems, however, seems to be in a strange state:
> >
> >  % ceph fs status
> >
> >  ....
> >  fs_cluster - 782 clients
> >  ==========
> >  RANK  STATE     MDS        ACTIVITY     DNS    INOS   DIRS   CAPS
> >   0    active  cephmd6a  Reqs:    5 /s  13.2M  13.2M  1425k  51.4k
> >   1    failed
> >        POOL         TYPE     USED  AVAIL
> >  fs_cluster_meta  metadata  3594G  53.5T
> >  fs_cluster_data    data     421T  53.5T
> >  ....
> >  STANDBY MDS
> >    cephmd6b
> >    cephmd4b
> >  MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2)
> >  quincy (stable)
> >
> >
> >  % ceph fs dump
> >  ....
> >  Filesystem 'fs_cluster' (3)
> >  fs_name fs_cluster
> >  epoch   3068261
> >  flags   12 joinable allow_snaps allow_multimds_snaps
> >  created 2022-08-26T15:55:07.186477+0200
> >  modified        2024-05-29T12:43:30.606431+0200
> >  tableserver     0
> >  root    0
> >  session_timeout 60
> >  session_autoclose       300
> >  max_file_size   4398046511104
> >  required_client_features        {}
> >  last_failure    0
> >  last_failure_osd_epoch  1777109
> >  compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> >  ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> >  uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> >  data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >  max_mds 2
> >  in      0,1
> >  up      {0=911794623}
> >  failed
> >  damaged
> >  stopped 2,3
> >  data_pools      [32]
> >  metadata_pool   33
> >  inline_data     disabled
> >  balancer
> >  standby_count_wanted    1
> >  [mds.cephmd6a{0:911794623} state up:active seq 44701 addr
> >  [v2:10.13.5.6:6800/189084355,v1:10.13.5.6:6801/189084355] compat
> >  {c=[1],r=[1],i=[7ff]}]
> >
> >
> >We would like to get rid of the failed rank 1 (without crashing the MONs)
> >and have a 2nd MDS from the standbys step in.
> >
> >Does anyone have an idea how to do this?
> >I'm a bit reluctant to try 'ceph mds rmfailed', as this seems to be what
> >triggered the MON crashes.
> >
> >Regards,
> >  Noe
>
>
>
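
P.S. Regarding 'ceph mds rmfailed' from my original mail quoted above:
if we do end up trying it, I believe the invocation would be roughly the
following (with the override flag it asks for), but since that command
seems to be what crashed the MONs last time, I'd rather exhaust the
max_mds route first:

  % ceph mds rmfailed fs_cluster:1 --yes-i-really-mean-it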
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


