Re: MDS lost, Filesystem degraded and wont mount

Dan van der Ster <dan@xxxxxxxxxxxxxx> · Fri, 4 Dec 2020 11:20:48 +0100

Please also make sure that mds_beacon_grace is set high on the mons too.
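
A minimal sketch of one way to raise and then check the value, assuming
the centralized config store (`ceph config set`) is in use. The value
1800 is the one mentioned further down in this thread, and MON_ID is a
hypothetical placeholder for the name of one of your mons. The script
only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Sketch only: prints commands by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
GRACE=1800                    # value mentioned in this thread
MON_ID="${MON_ID:-mon1}"      # hypothetical mon name, replace with your own

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

raise_beacon_grace() {
    # Store the option for all daemons so that mons and MDSs both see it.
    run ceph config set global mds_beacon_grace "$GRACE"
    # Ask one mon over its admin socket which value it is actually using
    # (must be run on the host where that mon lives).
    run ceph daemon "mon.$MON_ID" config get mds_beacon_grace
}

raise_beacon_grace
```

Checking the running mon matters because the mon, not the MDS, decides
when a silent MDS gets marked as failed.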

It doesn't matter which MDS you select to be the running one.

Is the process getting killed and restarted?
If you're confident that the MDS is getting OOM killed during the rejoin
step, then you might find this useful:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html

Stop all MDS, then:
# rados -p cephfs_metadata_pool rm mds0_openfiles.0
then start one MDS.
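
The steps above can be sketched as a small dry-run script. The pool and
object names are the ones from the command above; the systemd unit names
(ceph-mds.target, ceph-mds@<id>) and the MDS id "mds2" are assumptions
about a typical package install, and the `rados get` backup is an extra
precaution that is not part of the original advice. It only prints the
commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Sketch only: prints commands by default; set DRY_RUN=0 to really run them.
DRY_RUN="${DRY_RUN:-1}"
POOL="cephfs_metadata_pool"   # metadata pool name as used in this thread
OBJ="mds0_openfiles.0"        # open file table object for rank 0
MDS_ID="${MDS_ID:-mds2}"      # hypothetical: the one MDS to start again

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_openfiles() {
    # 1. Stop the MDS daemons (repeat this on every MDS host).
    run systemctl stop ceph-mds.target
    # 2. Keep a copy of the object before deleting it (extra precaution).
    run rados -p "$POOL" get "$OBJ" "./$OBJ.bak"
    # 3. Remove the open file table object.
    run rados -p "$POOL" rm "$OBJ"
    # 4. Start exactly one MDS and watch its log.
    run systemctl start "ceph-mds@$MDS_ID"
}

reset_openfiles
```

`rados -p cephfs_metadata_pool ls | grep openfiles` shows which openfiles
objects actually exist in the pool before anything is removed.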

-- Dan

On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
>
> Yes, MDS eats all memory+swap, stays like this for a moment and then
> frees memory.
>
> mds_beacon_grace was already set to 1800
>
> Also, on the other MDS this message is seen: Map has assigned me to become a
> standby.
>
> Does it matter which MDS we stop and which we leave running?
>
> Anton
>
>
> On 04.12.2020 11:53, Dan van der Ster wrote:
> > How many active MDSs did you have? (max_mds == 1, right?)
> >
> > Stop the other two MDSs so you can focus on getting exactly one running.
> > Tail the log file and see what it is reporting.
> > Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
> > while it is rejoining.
> >
> > Is that single MDS running out of memory during the rejoin phase?
> >
> > -- dan
> >
> > On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
> >> Hello community,
> >>
> >> we are on ceph 13.2.8 - today something happened with one MDS and ceph
> >> status reports that the filesystem is degraded. It won't mount either. I have
> >> taken down the server with the MDS that was not working. There are 2 more MDS
> >> servers, but they stay in "rejoin" state. Also only 1 is shown in
> >> "services", even though there are 2.
> >>
> >> Both running MDS servers have these lines in their logs:
> >>
> >> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> >> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
> >> 28.8979s ago); MDS internal heartbeat is not healthy!
> >>
> >> On one of the MDS nodes I enabled more detailed debug logging, so there I am
> >> also getting:
> >>
> >> mds.beacon.mds3 Sending beacon up:standby seq 178
> >> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
> >>
> >> This makes no sense to me and it is very stressful... Could anyone help, please?
> >>
> >> Anton.
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx