Re: mds crashes after up:replay state

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Fri, 5 Jan 2024 13:52:13 -0500

Hi Lars,

On Fri, Jan 5, 2024 at 9:53 AM Lars Köppel <lars.koeppel@xxxxxxxxxx> wrote:
>
> Hello everyone,
>
> we are running a small cluster with 3 nodes and 25 OSDs per node, on Ceph
> version 17.2.6.
> Recently the active MDS crashed, and since then every newly started MDS has
> been stuck in the up:replay state. The output of
> 'ceph tell mds.cephfs:0 status' shows that the journal is read in
> completely. As soon as that finishes, the MDS crashes and the next one
> starts reading the journal.
>
> At the moment I have the journal inspection running ('cephfs-journal-tool
> --rank=cephfs:0 journal inspect').
>
> Does anyone have any further suggestions on how I can get the cluster
> running again as quickly as possible?

Please review:

https://docs.ceph.com/en/reef/cephfs/troubleshooting/#stuck-during-recovery

Note: your MDS is probably not failing in up:replay but shortly after
reaching one of the later states (e.g. up:reconnect or up:rejoin). Check
the mon logs to see what the FSMap changes were.
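
If the mon logs are awkward to get at, the FSMap transitions can also be
watched live. Below is a minimal sketch, assuming Python 3.9+ on a node with
the ceph CLI and an admin keyring. It also assumes that
`ceph fs dump --format=json` has the layout described in the docstring, so
please check that against your release. The names rank_holders,
furthest_state and watch are hypothetical helpers, not Ceph tooling.

```python
"""Watch which MDS holds a CephFS rank and in which state.

Hypothetical helper, not part of Ceph. Assumes `ceph fs dump --format=json`
returns a "filesystems" list whose entries hold an "mdsmap" with "fs_name"
and an "info" dict of MDS daemons carrying "name", "rank" and "state".
"""
import json
import subprocess
import time
from typing import Optional

# Normal order of recovery states for a rank; up:clientreplay is only
# entered when there are client requests to replay.
RECOVERY_ORDER = ["up:replay", "up:reconnect", "up:rejoin",
                  "up:clientreplay", "up:active"]


def rank_holders(fs_dump: dict, fs_name: str, rank: int) -> list[tuple[str, str]]:
    """Return (daemon name, state) pairs for daemons holding `rank` of `fs_name`."""
    holders = []
    for fs in fs_dump.get("filesystems", []):
        mdsmap = fs.get("mdsmap", {})
        if mdsmap.get("fs_name") != fs_name:
            continue
        for info in mdsmap.get("info", {}).values():
            if info.get("rank") == rank:
                holders.append((info.get("name"), info.get("state")))
    return holders


def furthest_state(states: list[str]) -> Optional[str]:
    """Return the state furthest along RECOVERY_ORDER, ignoring unknown ones."""
    known = [s for s in states if s in RECOVERY_ORDER]
    return max(known, key=RECOVERY_ORDER.index) if known else None


def watch(fs_name: str = "cephfs", rank: int = 0, interval: float = 1.0) -> None:
    """Poll the FSMap forever and print each change for the given rank."""
    last = None
    seen: list[str] = []
    while True:
        out = subprocess.run(["ceph", "fs", "dump", "--format=json"],
                             capture_output=True, text=True, check=True).stdout
        current = rank_holders(json.loads(out), fs_name, rank)
        if current != last:
            seen.extend(state for _, state in current)
            print(time.strftime("%H:%M:%S"), current,
                  "furthest so far:", furthest_state(seen))
            last = current
        time.sleep(interval)
```

Call watch("cephfs", 0) while the MDS restarts. If the furthest state
printed is past up:replay, the crash happens later in recovery. Polling once
per second can miss a short-lived state, so the mon log remains the
authoritative record of FSMap changes.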

Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx