Excellent! For the record, this PR is the plan to fix this:
https://github.com/ceph/ceph/pull/36089
(nautilus, octopus PRs here: https://github.com/ceph/ceph/pull/37382
https://github.com/ceph/ceph/pull/37383)

Cheers, Dan

On Fri, Dec 4, 2020 at 11:35 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
>
> Thank you very much! This solution helped:
>
> Stop all MDS, then:
> # rados -p cephfs_metadata_pool rm mds0_openfiles.0
> then start one MDS.
>
> We are back online. Amazing!!! :)
>
>
> On 04.12.2020 12:20, Dan van der Ster wrote:
> > Please also make sure the mds_beacon_grace is high on the mons too.
> >
> > It doesn't matter which MDS you select to be the running one.
> >
> > Is the process getting killed and restarted?
> > If you're confident that the MDS is getting OOM-killed during the rejoin
> > step, then you might find this useful:
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028964.html
> >
> > Stop all MDS, then:
> > # rados -p cephfs_metadata_pool rm mds0_openfiles.0
> > then start one MDS.
> >
> > -- Dan
> >
> > On Fri, Dec 4, 2020 at 11:05 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
> >> Yes, the MDS eats all memory+swap, stays like that for a moment and then
> >> frees the memory.
> >>
> >> mds_beacon_grace was already set to 1800.
> >>
> >> Also, on the other one we see this message: Map has assigned me to become a
> >> standby.
> >>
> >> Does it matter which MDS we stop and which we leave running?
> >>
> >> Anton
> >>
> >>
> >> On 04.12.2020 11:53, Dan van der Ster wrote:
> >>> How many active MDS's did you have? (max_mds == 1, right?)
> >>>
> >>> Stop the other two MDS's so you can focus on getting exactly one running.
> >>> Tail the log file and see what it is reporting.
> >>> Increase mds_beacon_grace to 600 so that the mon doesn't fail this MDS
> >>> while it is rejoining.
> >>>
> >>> Is that single MDS running out of memory during the rejoin phase?
> >>>
> >>> -- dan
> >>>
> >>> On Fri, Dec 4, 2020 at 10:49 AM Anton Aleksandrov <anton@xxxxxxxxxxxxxx> wrote:
> >>>> Hello community,
> >>>>
> >>>> we are on ceph 13.2.8 - today something happened with one MDS, and ceph
> >>>> status reports that the filesystem is degraded. It won't mount either. I
> >>>> have taken the server with the broken MDS down. There are 2 more MDS
> >>>> servers, but they stay in the "rejoin" state. Also, only 1 is shown in
> >>>> "services", even though there are 2.
> >>>>
> >>>> Both running MDS servers have these lines in their logs:
> >>>>
> >>>> heartbeat_map is_healthy 'MDSRank' had timed out after 15
> >>>> mds.beacon.mds2 Skipping beacon heartbeat to monitors (last acked
> >>>> 28.8979s ago); MDS internal heartbeat is not healthy!
> >>>>
> >>>> On one of the MDS nodes I enabled more detailed debug, so I am also seeing:
> >>>>
> >>>> mds.beacon.mds3 Sending beacon up:standby seq 178
> >>>> mds.beacon.mds3 received beacon reply up:standby seq 178 rtt 0.000999968
> >>>>
> >>>> It makes no sense, and there is too much stress in my head... Could anyone help, please?
> >>>>
> >>>> Anton.
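
For anyone following the same recovery, the full sequence discussed in the thread looks roughly like this. It is only a sketch under a few assumptions: a single filesystem whose metadata pool is named cephfs_metadata_pool, max_mds == 1 (so the open file table object for rank 0 is mds0_openfiles.0), systemd-managed daemons whose unit instance names (mds1, mds2, mds3) are placeholders for your own MDS IDs, and a Mimic-or-newer cluster where the centralized config database ("ceph config set") is available; on older releases set mds_beacon_grace in ceph.conf or via injectargs instead.

    # Stop every MDS daemon (run on each MDS host; instance names are examples)
    systemctl stop ceph-mds@mds1
    systemctl stop ceph-mds@mds2
    systemctl stop ceph-mds@mds3

    # Raise the beacon grace so the mons don't fail the MDS while it is in rejoin
    ceph config set global mds_beacon_grace 600

    # Remove the rank-0 open file table object from the metadata pool
    rados -p cephfs_metadata_pool rm mds0_openfiles.0

    # Start exactly one MDS and watch it move through replay/rejoin to active
    systemctl start ceph-mds@mds1
    ceph -s
    ceph fs status

The openfiles object only holds hints that the MDS uses to prewarm its cache during rejoin, which is why removing it is considered safe in this situation (it is recreated automatically); the 2018 thread linked above describes the same procedure. Once the filesystem is active again, revert mds_beacon_grace to its previous value and start the remaining standby daemons.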