Excellent! For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)

Cheers, Dan

On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
>
> Thank you very much! This solution helped:
>
> Stop all MDS, then:
> # rados -p cephfs_metadata_pool rm mds0_openfiles.0
> then start one MDS.
>
> We are back online. Amazing!!! :)
>
>
> On 04.12.2020 12:20, Dan van der Ster wrote:
> > Please also make sure the mds_beacon_grace is high on the mons too.
> >
> > It doesn't matter which MDS you select to be the running one.
> >
> > Is the process getting killed and restarted?
> > If you're confident that the MDS is getting OOM-killed during the rejoin
> > step, then you might find this useful:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html
> >
> > Stop all MDS, then:
> > # rados -p cephfs_metadata_pool rm mds0_openfiles.0
> > then start one MDS.
> >
> > -- Dan
> >
> > On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
> >> Yes, the MDS eats all memory+swap, stays like that for a moment and then
> >> frees the memory.
> >>
> >> mds_beacon_grace was already set to 1800.
> >>
> >> Also, on the other one we see this message: Map has assigned me to become a
> >> standby.
> >>
> >> Does it matter which MDS we stop and which we leave running?
> >>
> >> Anton
> >>
> >>
> >> On 04.12.2020 11:53, Dan van der Ster wrote:
> >>> How many active MDS's did you have? (max_mds == 1, right?)
> >>>
> >>> Stop the other two MDS's so you can focus on getting exactly one running.
> >>> Tail the log file and see what it is reporting.
> >>> Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
> >>> while it is rejoining.
> >>>
> >>> Is that single MDS running out of memory during the rejoin phase?
> >>>
> >>> -- dan
> >>>
> >>> On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
> >>>> Hello community,
> >>>>
> >>>> we are on ceph 13.2.8 - today something happened with one MDS, and ceph
> >>>> status reports that the filesystem is degraded. It won't mount either. I
> >>>> have taken the server with the broken MDS down. There are 2 more MDS
> >>>> servers, but they stay in the "rejoin" state. Also, only 1 is shown in
> >>>> "services", even though there are 2.
> >>>>
> >>>> Both running MDS servers have these lines in their logs:
> >>>>
> >>>> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> >>>> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
> >>>> 28.8979s ago); MDS internal heartbeat is not healthy!
> >>>>
> >>>> On one of the MDS nodes I enabled more detailed debug, so I am also seeing:
> >>>>
> >>>> mds.beacon.mds3 Sending beacon up:standby seq 178
> >>>> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
> >>>>
> >>>> It makes no sense, and there is too much stress in my head... Could anyone help, please?
> >>>>
> >>>> Anton.
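
For anyone following the same recovery, the full sequence discussed in the thread looks roughly like this. It is only a sketch under a few assumptions: a single filesystem whose metadata pool is named cephfs_metadata_pool, max_mds == 1 (so the open file table object for rank 0 is mds0_openfiles.0), systemd-managed daemons whose unit instance names (mds1, mds2, mds3) are placeholders for your own MDS IDs, and a Mimic-or-newer cluster where the centralized config database ("ceph config set") is available; on older releases set mds_beacon_grace in ceph.conf or via injectargs instead.

    # Stop every MDS daemon (run on each MDS host; instance names are examples)
    systemctl stop ceph-mds@mds1
    systemctl stop ceph-mds@mds2
    systemctl stop ceph-mds@mds3

    # Raise the beacon grace so the mons don't fail the MDS while it is in rejoin
    ceph config set global mds_beacon_grace 600

    # Remove the rank-0 open file table object from the metadata pool
    rados -p cephfs_metadata_pool rm mds0_openfiles.0

    # Start exactly one MDS and watch it move through replay/rejoin to active
    systemctl start ceph-mds@mds1
    ceph -s
    ceph fs status

The openfiles object only holds hints that the MDS uses to prewarm its cache during rejoin, which is why removing it is considered safe in this situation (it is recreated automatically); the 2018 thread linked above describes the same procedure. Once the filesystem is active again, revert mds_beacon_grace to its previous value and start the remaining standby daemons.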