Hi Lars,

On Fri, Jan 5, 2024 at 9:53 AM Lars Köppel <lars.koeppel@xxxxxxxxxx> wrote:
>
> Hello everyone,
>
> we are running a small cluster with 3 nodes and 25 OSDs per node, on Ceph
> version 17.2.6.
> Recently the active MDS crashed, and since then each newly started MDS has
> always been in the up:replay state. In the output of the command 'ceph tell
> mds.cephfs:0 status' you can see that the journal is completely read in. As
> soon as it's finished, the MDS crashes and the next one starts reading the
> journal.
>
> At the moment I have the journal inspection running ('cephfs-journal-tool
> --rank=cephfs:0 journal inspect').
>
> Does anyone have any further suggestions on how I can get the cluster
> running again as quickly as possible?

Please review:

https://docs.ceph.com/en/reef/cephfs/troubleshooting/#stuck-during-recovery

Note: your MDS is probably not failing in up:replay but shortly after
reaching one of the later states. Check the mon logs to see which FSMap
changes were recorded for the rank.

Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
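
As a rough illustration of the note about later states: once the state changes for rank 0 have been copied out of the mon log by hand, a small helper can show how far each startup attempt got before the MDS was replaced. This is only a sketch. The function name `furthest_states` and the sample input are hypothetical and not part of Ceph, and it assumes the states are listed in the order they were logged.

```python
# Usual startup order of an MDS rank. up:resolve is only expected when
# more than one MDS rank is active.
STATE_ORDER = [
    "up:replay",
    "up:resolve",
    "up:reconnect",
    "up:rejoin",
    "up:clientreplay",
    "up:active",
]


def furthest_states(observed):
    """Return the furthest state reached in each startup attempt.

    `observed` is a list of state strings in log order (hypothetical input,
    collected by hand from the mon log). A new attempt is assumed to begin
    whenever the rank shows up in up:replay again. Consecutive duplicates
    are skipped, so two attempts that both never left up:replay would be
    counted as one.
    """
    attempts = []
    previous = None
    for state in observed:
        if state not in STATE_ORDER or state == previous:
            continue  # ignore unrelated states and repeated log lines
        previous = state
        if state == "up:replay" or not attempts:
            attempts.append(state)
        elif STATE_ORDER.index(state) > STATE_ORDER.index(attempts[-1]):
            attempts[-1] = state
    return attempts


if __name__ == "__main__":
    # Made-up sequence for illustration only, not taken from a real log.
    sample = [
        "up:replay", "up:reconnect", "up:rejoin",
        "up:replay", "up:reconnect", "up:rejoin",
    ]
    print(furthest_states(sample))
```

If every attempt ends at the same state, that state is a reasonable place to start reading the MDS log around the crash.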